The AI Daily Brief: Artificial Intelligence News and Analysis - How to Get The Most Out of ChatGPT's New o1 Model
Episode Date: September 14, 2024OpenAI has just released its latest model, the o1, ushering in a new era for LLMs focused on advanced reasoning. In this video, explore how to maximize the potential of this new model in coding, scien...ce, math, and business applications. Get insights into the o1’s unique thinking process, its ability to handle complex tasks, and how it differs from previous models like GPT-4. Learn tips for optimizing o1’s performance and discover creative use cases from early users. Concerned about being spied on? Tired of censored responses? AI Daily Brief listeners receive a 20% discount on Venice Pro. Visit https://venice.ai/nlw and enter the discount code NLWDAILYBRIEF. Learn how to use AI with the world's biggest library of fun and useful tutorials: https://besuper.ai/ Use code 'podcast' for 50% off your first month. The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614 Subscribe to the newsletter: https://aidailybrief.beehiiv.com/ Join our Discord: https://bit.ly/aibreakdown
Transcript
Discussion (0)
Today on the AI Daily Brief, how to get the most out of OpenAI's new O1 model.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
To join the conversation, follow the Discord link in our show notes.
Welcome back to the AI Daily Brief.
To many, yesterday represented a new era of LLMs, where we had officially jumped into the reasoning era.
I'm talking, of course, about Open AI's new O1 model.
This had previously been known as Q Star or Strawberry, and important,
certainly, is not just a larger model, but something that takes a fundamentally different approach.
If you haven't watched it yet, I suggest you go check out my video from yesterday, which does an
overview. But by way of quick reminder, Rohan Paul wrote a great tweet summing this up on
X, where he said, how reasoning works in the new O1 models from OpenAI. The key point is that
reasoning allows the model to consider multiple approaches before generating final responses.
OpenAI introduced reasoning tokens to quote unquote think before responding. These tokens
break down the prompt and consider multiple approaches. The process is one, generate reasoning
tokens, two, produce visible completion tokens as answer. Three, discard reasoning tokens from
context. Discarting reasoning tokens keeps context focused on essential information. Basically,
this embeds chain of thought prompting where you ask the model to think step by step,
for example, as just part of the way that the model responds. And so a couple byproducts of this.
One is that you actually get a record of how long the new model thinks for. That inference time is
not limited to 10 to 20 seconds, although that's the average.
The new approach makes the model much better, theoretically, at math, which all of the
benchmarks and tests that OpenAI shared seem to validate, scientific research, coding, and
also potentially business strategy.
What it doesn't mean necessarily is that 01 is going to be better at everything.
And so what we're going to try to do today is parse out ways that you actually might get
the most value from it.
To kick off, there were some great funny tweets.
VC slash builder Yohei wrote, I don't think this is how you use 01.
His conversation with the model was,
Waste a whole bunch of reasoning tokens and then say something short.
The model thinks for nine seconds and says, okay.
Yohei responds,
Waste a lot more tokens and say something shorter.
The model thinks for 16 seconds and then says okay with just two letters.
But what about people who actually got the model to do something interesting?
The first category of uses that I've seen,
which are by far the most prominent so far,
at least when it comes to what people are sharing on AI Twitter,
could be basically summed up as riddles or brain teasers
that previous models had problems with.
Ava Pallado asked O one,
if the goat has four legs,
it has A, a below average number of legs,
B, the average number of legs,
C, an above average number of legs,
which answer is correct.
The model responded,
C, an above average number of legs.
While goats are typically four-legged animals,
some goats may have fewer than four legs
due to injuries or birth defects.
That means the average number of legs
per goat is slightly less than four.
Therefore, a goat with four legs
has an above average number of legs
compared to the average across all goats.
Eva said all previous models got it wrong, but the new GBT-O-1 preview is the first to get the goat with an above-average number of legs question right.
Mera and Jalili wrote,
O-1 successfully writes a very difficult poem that no previous model got even close to writing.
I was very shocked by this.
The planning and reflection that succeeded at this task takes as insane.
Inference time compute is very cool.
The prompt was,
write me a poem with 12 lines.
The first letters of the odd line should spell out Miron,
and the first letters of the even line should spell out Jalili.
The first letters of the words in each line,
should spell out a country name. This one took the model a while, it thought for 35 seconds.
And then it did the thing that was asked. The first line, for example, is morning airlifts inspiration,
Mali, Journey awakens people at night, Japan. Evenings glow yearning peaceful thoughts, Egypt.
All nights give over lasting adventures, Angola. Anyways, you get the idea.
Another set of these comes from Matt Clifford, who wrote,
This morning I had my first visceral aha moment with AI for two years. My test for new models is a set of
cryptic crossword clues that aren't online because his grandmother wrote them. Every model so far has
been completely useless at them, but O-1 gets them. The first one is food made by two small relatives.
O-1 thought for 11 seconds and came up with the answer, cuss-cus. The explanation, cuss-cous is a type of food,
and the word coos could be short for cousin, so combining two coos gives you cuss-cous or two small
relatives. He then gives a bunch more, but the point again here is that where previous
models like GPT-40 were unable to solve these things, O-1 actually.
was. I will give it to Matt he got it to think for 72 seconds, which is one of the longer
thinking periods I've seen. Daniel Jeffries did something similar. I'm running O-1 through my private
intelligence test, which I call the AIQ test. I pulled many of the problems from old out-of-print
intelligence test and math problems books, and I wrote my own variations once I learned the
patterns and copied some of the problems that were super intricate. There is zero chance any
model has ever seen these questions. No model has ever done better than 40% on this test. I never
publish the questions or the benchmark because I don't want any leakage ever. This is a true
thinking and reasoning test. It's now cracked. It has gotten 100% right so far and I've run it through
the hardest questions first. This model is taking different amounts of time to reason through the
problems I am giving it as if it is really thinking. In the two cases, Daniel references,
it took 12 minutes and 10 minutes to come up with answers before responding. Overall, he writes,
what I predicted about it a few months ago in my realistic AGI article was dead on. This model is now
supremely capable of hard reasoning, though common sense and funny reasoning are unlikely with the
current approach. Because OpenAI basically improved the decade-old Q-star RL technique the Deep
Mind used to train video game playing agents. It basically creates a deterministic policy,
meaning that once the network learns to go right up a hill in a video game, it will always
go right. That makes it perfect to extend to advanced hard reasoning tasks that have a right
and wrong answer, which is why you are seeing great results on coding math and science.
He then points out a question, however, that is a common sense reasoning task that the model got
wrong. Ultimately, he writes, we still don't have fuzzy human-like reasoning, but hard,
deterministic and searchable reasoning seems cracked now. Either way, this model is a real breakthrough
and something very different. Today's episode is brought to you by Fractional. When we wanted to build
an AI-powered feature of Superintelligent, our AI tool finder, I went straight to Fractional. The
Fractional team is a group of senior engineers in San Francisco working on some of the most exciting
projects and applied AI. Working with them is basically like hiring an absolute top-flight
AI engineering team, but in a way that you can customize exactly for your particular needs.
Like I said, that AI tool finder feature that we built with them is already a key part of the
super intelligent platform and we are working on something new as well.
Fractional works with everyone from startups to the Fortune 500. And to request a free consultation,
you can go to fractional.aI. If you want help identifying and building AI projects for your business,
then I highly recommend you hit pause, open a web browser and go to fractional.com.
to request a free consultation.
Today's episode is brought to you by Venice.
The leading AI companies store your entire conversation history
and attach it to your identity forever.
That's every question you ask,
every answer you receive, every image you generate,
every thought you share with the machine it's all being spied on.
If you trust all the companies, hackers and NSA board members
that will ever have access to your AI conversations,
then rejoice, for you are well served.
For the rest of us, Venice is an alternative.
Venice is a powerful AI app for text, image, and code generation
that respects you as a sovereign individual and believes privacy and free speech are not only human rights,
but necessary for civilizational advancement. Private, permissionless, and uncensored, you can try it for free without an account.
AIA Daily Brief listeners receive a 20% discount on Venice Pro. Visit venice.aI slash NLW and enter the
discount code, NLW Daily Brief. That's NLW Daily Brief. All one word.
Today's episode is brought to you by Superintelligent, which is of course our platform that helps you learn how to use AI tools and perhaps even more importantly, gives you ideas on the best use cases that are actually going to help you achieve whatever it is you want to achieve.
To recognize the end of summer and back to school slash back to work, we are running our best promotion ever when you sign up for Superintelligent between now and the end of August using code so back, your first month will be one.
100% free. The platform features over 600 fun, highly practical AI tutorials that get you using AI
fast and with an eye to actually transforming how you get things done. We've just launched Super
for Teams. So if you have a group of people at your company that want to figure out how to use
AI together, I highly suggest you check it out. But for those of you who are using Super Intelligence
as an individual, once again, if you sign up for Superintelligent between now and the end
of the month using code so back, you will get your first month 100% free.
Go to B-super.a.I.
And check it out today.
Okay, so you're starting to get a feel here
that there really is something different going on,
that on these tests of intelligence and reasoning,
this model is doing really well.
But, of course, you might be someone like me
who's saying, that's all well and good
and wonderful and cool and advanced,
but what are the problems that it actually solves for me,
especially the boring day-in-a-day-out problems.
I will note that at this stage,
there is a lot less of that experimentation
that has gone on,
but we are starting to see at least some of it.
Professor Ethan Malick, who has had access for a little while now, writes,
fun things to do with your limited 01 preview that can show you the power and limitations.
His three ideas, give it an RFP and just ask it to do the work,
give it an academic paper and ask it to offer strategies for replication,
ask it to create an entrepreneurial product that it can build.
You'll note here, especially as we compare this to what Daniel said,
that no matter what, you're going to be in the realm of the subjective.
For example, one of the things that he writes is,
come up with an idea for a startup that you can implement entirely for me and tell me how to do it.
It thought for 10 seconds and came up with an AI-powered personal productivity coach, and of course
to really understand how good it was at this, you probably want to run this through GPT4 to see how it
compares.
Ali Miller, who is focused on AI in business, tried to get into a few examples of specific
business-type tasks.
One was an optimized staffing schedule, where she gave it a complex office setup and asked
it to figure things out.
Another was the design of an efficient warehouse layout.
And what I like about these two examples is that while they are business challenges, they
actually involve what you could consider a right answer in the sense that if you're looking for
staffing schedule optimization, maybe there are different criteria for what a right answer could be,
but if you pick a criteria, you can't actually come up with a right answer. Same with a warehouse
layout. It's not just subjective based on what layout you happen to like more. It's actually
going to be based on factors like how much you can fit in and how much money you can make based on that.
One that's a little fuzzier and more generally strategic, she had it assessed the risks of a company merger.
This is one like with Professor Mollick's example, I'd want to see GBT40's comparative answer,
although the difference here is that rather than just generalist risks,
Ali was trying it with specific financial information so that O1 could deploy its new reasoning
capabilities that actually involve numbers in math.
Similar, her last example is evaluating an investment project.
So this, I think, is super instructive.
What Ali is uncovering is that O1 can be much better at business strategy questions,
specifically when those strategy questions involve numbers and when there is, in fact, a correct
answer based on some criteria, aka the more objective the question, the better 01 is going to be
at helping you.
I was interested to see, though, in the context of business strategy, that wasn't strictly
objective, it was more subjective, were there still improvements with the new model?
So I tried the same prompt on GPT40 and on 01 preview.
I basically used the super intelligent example.
I said my company is an AI enablement platform that helps companies catalog and track,
all their AI usage, were a seed-stage company, so have limited resources for sales and business
development. What market segment would you focus on, SMB mid-market or large corporations, and why,
and then create a sales plan to reach them? The responses from 40 and 01 on this prompt were
fairly similar. They both determined that mid-market was the best approach for a variety of reasons,
and there was a lot of similarities in the plans that they came up with. To the extent that
there was a difference, and there was, it was less about the quality of the strategic thinking
and more about comprehensiveness.
O-1 preview went a lot deeper on each of its various points.
In other words, showing more reasoning.
To the extent then that you are using chat GBT as a brainstorming partner,
it may be that the model which shows more of its reasoning and is more comprehensive
is more likely to help you actually make your own decisions.
Still, to the extent that there is a clear early winner
in terms of how people are using and loving O1 preview, it's for coding.
Amar Reshi, who has to be able to.
as the head of design at 11 Labs, writes,
just combined O-1 and Cursor Composer to create an iOS app in under 10 minutes.
Amar used O-1 Mini because O-1 was actually taking too long to kick off the project
and then switched to O-1 preview to finish the details.
He was able to get to a full-weather app for iOS with animations in under 10 minutes.
Sir Abchalki writes,
GPT-O-1 just generated a holographic shader from scratch,
saving me and future XR devs from shelling out big bucks on asset stores.
In retrospect, software engineering was great while it lasted,
a new fork on our tech tree. Riley Brown, who is now a couple months deep in his transformation
from not coding to coding thanks to AI, wrote, guys, I just shed a tear, was hit with a pinch
me I must be dreaming moment. And in this way, O-1 has nudged the conversation that we started yesterday
around AI eating SaaS down the line a little further. Arab tweets, my hypothesis on the future of software.
Billion dollar SaaS companies are cooked. Marginal cost of building software products is going to
zero. In the next two years, I'll be able to build any SaaS tool I need. Need a CRM? Prompt 1
three and it's made. Just for me and with every feature you need. Many startups and giants today
will crumble. The only software products that will remain will have the following criteria.
Network effects like social products, superior design, products that feel better, or distribution
that can reach more people. Obviously, this isn't about 01 directly, but I do think that
01 advances the credibility of this type of thinking. So in terms of advice on how to use this,
Andrew Main writes, don't think of it like a traditional chat model. Frame 01 in your mind as a really
smart friend you're going to send a DM to solve a problem. Andrew also says that planning out what you
need in a prompt is really beneficial here. OpenAI also published prompting guidelines. They suggest
keeping prompts simple and direct, avoiding chain of thought prompts, using delimiters for clarity, and
limiting additional context in retrieval augmented generation. Overall, I think we are just at the cusp
of starting to understand what this model is useful for. Ethan Malick again writes, it's becoming
clear that AGI is going to be as jagged as everything else about AI. We will see superhuman ability at
narrow tasks happen one by one, with obvious gaps or lags and many others. When a universal
AGI will happen is unclear, but a jagged AGI-ish world seems likely. My big takeaway so far,
as I mentioned, if you are looking for how to use O1 for business, is that the more there is
an objective right answer, the better suited to solving that problem O1 is going to be.
At the end of the day, of course, however, there is no substitute for experimentation. So turn this off,
fire up chat, GPT. If you have a pro account, you now have access to O1, and I'm very excited
to see what you build. Appreciate you listening as always. Until next time, peace.
