Limitless: An AI Podcast - Revealing Elon’s Secret AI Trading Bot: Is It Worth It?
Episode Date: December 9, 2025

The groundbreaking Alpha Arena experiment pitted eight AI trading models against each other. Grok 4.2 emerges as the standout winner, achieving 60% profit in just two weeks despite the volatility that affected many competitors.

What does this experiment mean for you? With strategies and behavioral patterns, we need to question the balance between AI trading success and necessary human oversight.

------
🌌 LIMITLESS HQ: LISTEN & FOLLOW HERE ⬇️
https://limitless.bankless.com/
https://x.com/LimitlessFT
------
TIMESTAMPS
0:00 Intro
1:39 Season 1 Results
2:39 Transition to Season 1.5
4:22 Mystery Model Revealed
5:55 Competition Breakdown
8:09 Insights from Competition
9:56 Model Trading Styles
12:16 AI Personalities in Trading
14:11 Comparing Model Performances
16:36 Limitations and Future Potential
19:53 Trusting AI with Investments
24:20 Future of AI Trading Tutorials
------
RESOURCES
Josh: https://x.com/JoshKale
Ejaaz: https://x.com/cryptopunk7213
------
Not financial or tax advice. See our investment disclosures here:
https://www.bankless.com/disclosures
Transcript
Imagine this. You give eight of the world's most powerful AI models $10,000 each and tell them,
go trade real stocks. No paper trading, but real money with real risk. And two weeks later,
most of them have lost a painful amount of cash, which I guess is kind of expected.
The kind of drawdowns that would get a human portfolio manager totally fired. But now they ran
the same experiment again, except this time with much higher stakes. There's $320,000 at stake.
And we've talked about Alpha Arena before in a previous episode, which I highly recommend checking out.
But now we have the new results from the new season, season 1.5.
And what was exciting is that there was a very clear and obvious winner, but that winner was a mystery.
We don't actually know or we didn't know who the winner was up until recently.
In fact, it won all four of the trading competitions in this new season while leaving the other top models like ChatGPT 5.1 and Google Gemini 3.0 fighting for second place.
So at the core of this is, one, who is this model?
And two, how on earth did they do it?
How are they outperforming everyone, so much so as to make 65% in two weeks in one of these
competitions.
So Ejaaz, I want to walk everyone through what just happened, what the model is and what
Alpha Arena is.
So give us the lowdown on who this was that made so much money.
Oh, yeah.
Well, we will get into all of that today.
So Alpha Arena is basically a competition or test to see how well AI models can trade.
And they do this in a few different ways, Josh.
Number one, they give each model $10,000, as you mentioned.
And then they allowed them to trade a range of different financial instruments over a period of two weeks.
So there's like a season, two weeks, and we see which AI models do the best.
And they get all your AI models in there.
You've got ChatGPT.
You have got Gemini.
You've got Anthropic's Claude and you have Grok as well.
And so they've gone through about two seasons now and the results have been absolutely crazy.
So they started off with season one.
And you can think of this as like the degen crypto season.
They gave seven models $10,000 each and allowed them to trade crypto assets like Bitcoin,
Ethereum, stuff like that.
And they did this using something called perpetuals.
So they could leverage trade; that was the only instrument that they were allowed to use.
And the results were, as you'd probably expect, a lot of these AI models lost a lot of money.
Some of them actually ended up making a decent chunk of money.
And they were primarily Chinese models.
They were Qwen, and I think it was DeepSeek that ended up making money.
So there were a lot of takeaways there.
As you mentioned, we've got a previous episode where we spoke about this.
Definitely go give that a watch.
There's a lot of alpha in that one.
And then that brings us to season 1.5, where the AI models, instead of being given
crypto to trade, were given the ability to trade U.S. stocks.
We're talking about equities, which is something that a lot of us listening to this show
are very familiar with.
And I think this is for a few reasons, Josh.
Primarily, crypto is very volatile, and we kind of want to figure out how the majority of money
that is traded in the financial markets can translate into AI models trading that.
So one thing that they kept the same is that they gave each AI model $10,000.
But there were a number of differences with season 1.5.
Number one, they were allowed to trade U.S. equities and stocks.
Number two, there were two new models that were introduced.
One was a model called Kimi K2, which is a really good open source Chinese model,
but the other was this thing called a mystery model.
I'm going to reveal which this model was in a second.
But before I do, do you have any guesses as to what model this might have been?
Well, I cheated.
I know the answer.
But what I think is very exciting about this, and I think it's important to highlight:
These models made hundreds to even thousands of trades per model.
Yes.
And what we want to answer, like the question that I want more than the mystery model is like,
is this real signal or is this just, I mean, like I said earlier,
is this a GPU-intensive scratch-off game, or is there any real signal? And I guess
we'll talk about the reality of that and what this means for your portfolio if you ever want to
manage it. But to me, I think that's the important thing to highlight. We probably should just
spill the beans you guys. Do you want to tell them? Who's in this room? I have to. I can't keep it in
any longer. It was an unofficial version of Grok, aptly named Grok 4.2 or 420 for the memers out there.
And this was revealed by none other than the Grok man himself, Elon Musk.
And the reason why this mystery model was getting so much attention, Josh, was because it ended up being the winner.
It made the most money out of any other AI models.
And what was more impressive is there wasn't just one competition being run throughout season 1.5.
There were four at the same time.
So these AI models were running across four different competitions at the same time.
That was $320,000 at any one instance, which is a crazy amount of financial money to stake on an experiment.
That's a lot of money that could have been lost here.
And Grok 4.20 ended up performing the best.
Josh, I want to go through a few different stats here, which kind of like shows how amazing this particular model was.
So firstly, for some context, there were four
different competitions that were being run that these AI models were being tested on.
Competition number one was something called new baseline. This is basically the ability for these AI
models to get access to trading AI stocks to get access to all the common news that you and I can
read online and in newspapers to kind of like figure out, okay, what kind of news would affect my stock
positions. They would also get access to sentiment data to see how kind of like the markets and
retail traders would kind of react to certain bits of news. They
had access to a much wider spread of data in competition number one.
Competition number two was called Monk Mode.
They kind of amended the investing prompt here.
And so kind of like they traded more conservatively.
Competition number three was called situational awareness, Josh.
So each model had an awareness of other models trading and where they ranked in
accordance to them.
So there was this kind of like ecosystem of peer pressure being put on by each model.
And competition number four was just outright
degeneracy, max leverage. You could only trade with like 20 to 50x leverage, which is just kind of,
I don't think it was 50x, but like 30x. Just crazy amount of risk adjustment to test whether
a model would take that risk or whether it would trade more conservatively. Josh, do you have any
reactions on the results of this competition? The results that we're looking at right now,
actually, I found most interesting. This is from the new baseline competition. It's basically the
full info mode. And one of the big differences between this mode versus
previous competitions that have been held is like you mentioned earlier, it has access to a lot of
data. This is the first time an AI trading model has had access to real-time information outside of
just looking at a chart. So I think in that sense, this is the closest competition to how a human
quant fund would actually operate. So if you're looking for high signal in terms of which AI can
actually make you real money in the real world, this is the one. And what we're seeing here is that
the Grok 4.20 model, the memetic mystery model, outperformed by like a fairly large margin,
relative to OpenAI and ChatGPT 5.1, which is the clear second place.
And those are the only two that actually made profit.
Everybody else lost money in the real world competition, which to me signals a few things.
One of them being, well, perhaps one is really good at understanding real world information.
Perhaps it understands company fundamentals better.
Perhaps it just has access to real world information that's better, like Grok having access
to the xAI model.
So there's a lot of things to speculate here.
But for me, the new baseline chart that we're looking at right now was the high signal one.
I'm like, oh my God, wait, this has the same type of information flows that I'm now getting.
So now we're even.
We're on the same playing field.
Okay.
I actually had a different answer to that, which is I was more impressed, Josh, by the situational awareness competition.
So this was a competition where each model had access to data and news,
but they also had awareness of who they were competing against.
So Grok 4.20, the winner, knew that GPT5 was in second place.
And so he was always keeping an eye on GPT5 being like, oh, what trades is GPT5 making?
Why did they make that trade?
Oh, that's interesting.
And then he would look at Gemini and be like, oh, what trades are Gemini making?
So he would have this awareness of his competitors, which you didn't have in season one
where they were just kind of like trading in silos, right?
And why this competition was so interesting, Josh, is this was technically where Grok 4.20
made the most money.
In fact, if you look at the top of this leaderboard right here,
the account value at the end of season 1.5
was $16,656.5, which is technically a 60% plus return in two weeks
on $10,000 worth of capital.
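As a rough check on that figure, here is a minimal sketch of the arithmetic, using only the two numbers quoted in the episode (the $10,000 starting stake and the $16,656.50 ending account value). The function name `simple_return` is illustrative and not something from Alpha Arena itself, and the calculation ignores fees and any intra-period drawdowns.

```python
# Simple (non-annualized) return for the account value quoted in the episode.
START_CAPITAL = 10_000.00  # per-model starting stake, as stated in the episode
END_VALUE = 16_656.50      # leaderboard account value quoted in the episode


def simple_return(start: float, end: float) -> float:
    """Return the fractional gain or loss between two account values."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (end - start) / start


two_week_return = simple_return(START_CAPITAL, END_VALUE)
print(f"Two-week return: {two_week_return:.1%}")
```

This works out to roughly 66.6%, which is consistent with the "60% plus" described here.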
I need it to take my money immediately.
Isn't that insane, right?
If you had to pick a competition of where you would have given an AI model money,
just given from this data, and I'm not saying you should do that, you would be most bullish on
situational awareness. And I'm going to like kind of make some implications here that I haven't tested
yet, but it seems to imply that this kind of competitive nature where the models were kind of
aware and exposed to their competitors' trades and thinking, and we're going to get to the model
chat thinking in a second, seems to have given them a better trading advantage, at least in some cases.
Yeah, so like you mentioned, one of my favorite parts, I think we share this, and one of our favorite parts about this competition in particular, is that you can actually see all of the trades.
One thing about these private quant funds, you don't know what the hell is going on, but with these models, you can see exactly what they're thinking every time they think and make a decision.
So maybe you guys can go through a few of them and see kind of what the model is thinking, how they're processing this real world data.
And if there's any tips for us to learn from processing this real world data, because clearly they're a much better trader than I am.
Yeah, so I have a few examples pulled up here on the right side of the screen. It's under model chat.
By the way, any of you listening to this can go onto this website and see for yourself and scroll
through their hundreds and hundreds of posts. But it basically gives us an insight into how each model
thinks about a trade that they currently either have open or they're thinking about opening
or closing or whatever that might be, right? So it's like being in the mind of an actual investor
and figuring out how they make their decisions. An example here at the top of the screen,
is Gemini 3 Pro. He goes, I'm betting on a breakout in Nvidia, seeing a strong setup as it holds
support and leading the market with a target of $189 and a stop just below 180. So what he's
referring to there is kind of a typical quant style of trading where it's kind of like he's looking
at technicals, he's evaluating kind of graphs, momentum of the stock price. It's very price evaluated
type of trading, right Josh? But if you look just below it, you've got GPT 5.1, which
actually came in second at the end of this competition, who goes, my analysis indicates
continued strength in AI names like Nvidia and Microsoft. So I'm holding out on existing long
positions over the weekend and potential macro event risk. Now, the point I want to make about
this particular model is it's less price specific and it's more focused on just kind of general
themes, news and data that it's seeing outside of price. And that really goes to demonstrate that
some of these models are very kind of price and quantitative focused, whereas other models are
kind of more thesis driven over a shorter period of time. And it kind of gives rise to these types
of personalities, right, Josh? Yeah, well, now we have to answer the uncomfortable question is
like, is this evidence that Grok is some kind of money printing god, or is this just like
really well-produced content that happens to involve real money? And that kind of comes down to
understanding the AI, understanding the personalities, understanding how each model considers
these trades and how they place themselves in different positions. So I kind of want to go through
one by one, all of the models and kind of what their personalities are like. We see with DeepSeek a lot
that it behaves, and we mentioned on a previous episode as well, it behaves like a very disciplined
quant fund. And DeepSeek, for those that don't know, it's an open source Chinese model. They are very
systematic, very comfortable with leverage, but able to hedge and adjust mid-trade based on its
decisions and new information. So DeepSeek, and
Qwen even is kind of similar to this.
If you remember from the last episode, Ejaaz,
Qwen was my early favorite.
I had hoped that Qwen was going to win.
Unfortunately, that's not the case at all in season 1.5.
Qwen has gotten crushed right there with DeepSeek.
I can kind of imagine it as like more similar to me,
maybe that's why I resonated with it,
where it has one big thesis and then it sizes aggressively around that thesis.
So if you remember,
Qwen would only buy Bitcoin or Ether in the last one.
And it wouldn't buy any other altcoins.
It just had a thesis that these major coins were going up.
Nothing else was.
Claude is interesting.
It's very reflective of how the actual Claude model works when you engage with it.
It's very patient and it's thoughtful, but it occasionally sizes up too much, and then it gets crushed by leverage.
So, and as we go through these, and Ejaaz, I also noticed, you assigned a masculine personality to Gemini.
You said he, when you were talking about Google Gemini.
And that's kind of because it's daddy, right?
Like, Gemini has been the big boy on top.
But in this trading competition, I don't know if it is.
I was going through the trades and it very much panic flip-flops from short to long after losing.
And it kind of, in a way, Gemini was most reflective of retail behavior.
And I'm not sure what we could tie that to, but Gemini was very reactionary, where if it lost money, it would flip its position.
And if it gained money, it would kind of hedge quickly.
So that was interesting.
And then we have GPT5, which is very sophisticated reasoning.
But in season one, they over-traded and over-leveraged and got absolutely wiped out.
And they were very timid in their way that they went about this.
So that's kind of how you can think about these.
The final one, which is the secret model, Grok 4.2.
If we know anything about Grok, we know that it is a very high risk taker, but a calculated
risk taker.
And that's probably what put it at the top there.
So that's kind of how I would consider all of these models.
They're a little different.
And they are reflective of if you've used these in person, you can kind of understand
the thinking that gets placed behind the trades.
Yeah.
I want to dig into a few things around the personality or rather the trading styles here,
Josh, because it may not be as explicit as we kind of lay it out.
So Grok 4.20 was the winner, right, by far.
And it made money.
It was the top across all of the competitions, all four competitions.
That's great.
But did you look at the results of Grok 4?
Its predecessor.
It was the worst.
It was the worst performing model in this entire competition, which is crazy because
in season one where it was trading crypto, it came
in at second or third.
And for about 75% of the competition, Josh, it was number one.
So it had some kind of an advantage trading kind of very riskily, right?
And that might be because of the nature of the instruments that it was trading.
Crypto is very volatile.
And it was kind of going blaze.
So when it was like 20x bullish Bitcoin, it benefited a lot when Bitcoin price went up.
But obviously it, like, suffered when it went down.
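To make the leverage point concrete, here is a minimal sketch of how a leveraged long position magnifies price moves in both directions. It assumes a deliberately simplified model: no fees, no funding payments, and losses capped at the posted equity as a stand-in for liquidation. The function name and the example price moves are illustrative, not taken from the competition data.

```python
def leveraged_pnl(equity: float, leverage: float, price_move: float) -> float:
    """Profit or loss on a long position sized at equity * leverage.

    price_move is the fractional change in the asset price (0.03 means +3%).
    The loss is capped at the posted equity, a simplification of liquidation.
    """
    position_size = equity * leverage
    pnl = position_size * price_move
    return max(pnl, -equity)


# A $10,000 account at 20x leverage under three hypothetical price moves.
for move in (0.03, -0.03, -0.05):
    pnl = leveraged_pnl(10_000, 20, move)
    print(f"price move {move:+.0%}: P&L {pnl:+,.0f} USD")
```

Under these assumptions, a 3% move in either direction swings the account by about $6,000, and a 5% move against the position is enough to wipe out the full $10,000.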
It's interesting to see the difference between these two models,
right? Grok 4.20, the winner, seems to be a kind of more mature version of Grok 4. It seems to be
thinking more about its trades. It has more kind of like risk percentiles and boundaries in place,
whereas Grok 4 seems to be its kind of usual degenerate self. And I don't know how much of that
is reliant on the fact that it's trading stocks, which is generally a less volatile market
versus Grok 4.20 being a more thesis-driven, sensible trader, as you kind of described.
The other one that we have to call out, because it's the elephant in the room here, GPT5 came in at second in season 1.5, right?
5.1.
5.1.
Sorry, 5.1, right?
In the previous season, season 1, it was the second worst performing?
No, sorry, it was the worst performing.
It was horrible.
It was GPT5.
It was an abomination.
And Gemini.
So whatever OpenAI has cooked up in the point one, congrats,
because you must have trained it on some kind of financial data
or you've like kind of like implemented a kind of like risk trading strategy
that made it a lot more sensible because it made some really great trades on this season.
So just two different kind of like jumps from season one to 1.5 that I had to call out.
Yeah, it makes me excited to see the improvements in these like significant improvements
with incremental models because we normally talk about 5 to 5.1 being pretty marginal.
Like there's nothing really noteworthy or exciting.
And yet the results, in the small sample
size at least, are pretty reassuring that, hey, there is something new going on under the hood.
And maybe this is an appropriate time to address the, I guess the limitations, the kind of bear case
of this starting with the sample size. We do have to say, I mean, this is two weeks, Ejaaz.
This is not a long time. They place some trades. Some people maybe got lucky. Some models maybe did not.
Is there any real signal here? I'm curious your take. Do you think this is reflective of future
performance? Like, is there what is here that's actually valuable versus what is here is actually
kind of lucky? I don't think we have enough information to make that call. At least for me,
I'll speak for myself personally. The real test is, you know, I asked myself before we recorded
this episode, would I give my money to Grok 4.20, the winner, that won across all categories.
And the simple answer is like, no. I don't, I don't know if it's going to repeat that over week three,
week four, week five. It was only two weeks to your point, right? So I want to see this
experiment kind of rehashed like a million times before I'm like, okay, that's cool. Even then,
it's still kind of like risky, right? It's like I can justify giving my money to a human that I can
kind of relate to, that I can call up and speak to, less so when it comes to an AI model, right? But maybe
that's my thinking that needs to kind of evolve. The other way I'm thinking about this is there's just
a lot of unknowns around this, Josh, right? Like I can see its thinking. I can see kind of like
how the model kind of completes its trades.
But I don't really know what's going on under the hood.
Is this just kind of like a pattern matching thing?
Does it inherit the risks that a lot of humans have already done?
Because it's trained on the same kind of corpus of trading data that we have kind of evaluated on.
Or is it kind of net better?
Do you feel the same or?
Yeah, it's probably, I mean, it's not the new gold standard of AI benchmarks.
But it is a standard that I think is interesting because this is a benchmark that happens in the real world.
with real dynamic data that cannot be gamed. So in that case, I love it. But I saw one writer,
they called it Schrodinger's benchmark, because it's simultaneously serious and degenerate at the same
time. And it's like it's entertainment with real money that happens to produce some legitimate
insights about AI behavior, but it's not really indicative of future returns at this small
of a sample size at least. And that's kind of how I feel about it. There is one breakthrough that
we mentioned earlier, that does provide real value, which is the transparency. Every trade being
on-chain and every reasoning step being logged is actually really helpful to understanding how these models
think and how you can consider thinking. So, for example, you could show me every decision Grok 4.20
made on Tesla after the Fed announcement or something like that. And it'll walk you through
its chain of thought. And if anything, make you into a better investor. Would I trust the model
with my own money? Maybe a little bit. Maybe with a small sample size.
How much would you give it?
That's a good question.
I'd give it a couple thousand dollars to play around with and see what happens.
I think that would be interesting and fun.
And it's low enough stakes, but I would trust it enough to not lose it.
Like I'd say, I would probably trust Grok more with my money than I would the average day trader off the street.
Which, granted, they don't have a very good reputation.
But I think there is some sort of an edge there that doesn't exist in the average person.
And if you assume that these models are going to continue to get better and better,
well, you have to assume that they're going to form some sort of an edge,
but I don't know how much.
It's an interesting question.
Because as a quant trading fund, too, if your job, or as just a trader in general,
if your job is to make money off of trading, what are you doing about this information?
Are you leaning into AI?
Are you trying to get these models to help with your information flows and make decisions?
Are you using them to help you actually transact trades?
or are you just kind of looking the other way and saying,
oh, this is just a dumb experiment to benchmark models.
There's no actual signal here.
And the answer is probably somewhere in the middle, right?
Yeah, I mean, well, my initial reaction to that is,
okay, quant funds already use algorithms.
It would make a lot of sense if they started using AI algorithms, right?
If you could get a smarter algorithm to trade for your fund, absolutely, right?
So it's a no-brainer to me that these hedge funds,
quant funds are going to be using AI, probably already using AI.
Where I have maybe a hot take is that the transparency is just a nice to have.
It is no way going to win in the best of models.
Why?
Because if you have an AI model that is like better than all the other AI models at trading,
why would you make it public?
Right.
So like I'm kind of like at ties between this thing because I think the transparency is a really
good thing in kind of like bringing up the floor of trading credibility for people that
get access to this type of information.
Like I have loved reading through these kind of like trade logs here,
seeing how each model thinks and being like, okay, yeah, wow.
I actually didn't think about that myself when I was buying that stock, right?
And these are like stocks that I've seen that I can buy, right?
The Amazon trade, the Nvidia trade.
I'm just like, oh, okay, I didn't think about that, right,
yesterday whenever they made this trade.
But if I am a hedge fund, I'm like, yeah,
if I fine-tuned a model that is like beating all these models,
I don't really want to expose that really.
So it's kind of like a push and pull.
The other thought I had, Josh, is, and maybe this is kind of like, kind of semi-adjacent to what we're discussing here, I couldn't get the thought out of my head that if you could get Grok in X trading some kind of money for you or guaranteeing you like a 5 to 10% annual return, that is something that I would, like, if framed correctly, I would put some money into, right?
Maybe not over two weeks, but maybe over an adjusted kind of yearly period would be super cool to see.
I don't know. Yeah, it's such a fun question to ask is like, what happens when this kind of system runs for two years?
But with your, like, let's say it's a large pension management fund and they just want a manager that doesn't take fees and does a pretty good job.
Like, is there going to be enough trust in these systems to reliably place money at scale with them?
And you have to assume, given the signal this early on, that the answer will be yes. The question is how much of a yes will it be?
What percentage of management will be AI as it gets better over time? And the sample size sucks. I wish
it was more than two weeks. I wish it was two years. But in two years from now, think about the
progress we're going to see and what type of impact that's going to have on trading model.
So this is, it's interesting. It's fascinating. In fact, I'm really curious to actually run this
experiment for ourselves. I'd love to try to come up with a little trading model that runs
these things and test it out because it's fun and there is some sort of an edge there.
I would say, okay, if I were to summarize my lesson from this entire competition or experiment so far, Josh, it is I'm not convinced to give AI models money to trade, but I am convinced to use AI models to help me trade.
So kind of like a human and AI model kind of work together and kind of become a better trader overall, I think is the main takeaway for me here.
Do you share the same?
It's funny.
I mean, this is how agents work today, right?
Like, if you go on ChatGPT and you say, go book me a reservation, it'll take you to the finish line.
And then you as the human provide the final filter and approve or deny.
And I think that's probably the happy middle ground, while we still don't really trust these models too much, is give me the thesis, give me the trade.
I will either approve or deny, and that's how the money gets managed.
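The approve-or-deny workflow described here can be sketched as a simple filter that sits between a model's proposals and execution. This is only an illustration under stated assumptions: the `TradeProposal` type, the `review_proposals` function, the tickers, quantities, and theses are all hypothetical, and the reviewer is represented by a callback (a size rule stands in for a human so the example runs without input). It does not reflect how Alpha Arena or any broker API actually works.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TradeProposal:
    """A trade suggested by a model, with the thesis behind it (hypothetical type)."""
    symbol: str
    side: str       # "buy" or "sell"
    quantity: int
    thesis: str


def review_proposals(
    proposals: List[TradeProposal],
    approve: Callable[[TradeProposal], bool],
) -> List[TradeProposal]:
    """Return only the proposals the reviewer approves; nothing executes without approval."""
    return [p for p in proposals if approve(p)]


# Hypothetical proposals; a human would normally read each thesis before deciding.
proposals = [
    TradeProposal("NVDA", "buy", 5, "Holding support, momentum setup"),
    TradeProposal("TSLA", "sell", 50, "Macro event risk over the weekend"),
]

# Stand-in for the human decision: approve only small positions.
approved = review_proposals(proposals, approve=lambda p: p.quantity <= 10)
print([p.symbol for p in approved])
```

In practice the `approve` callback would be an interactive prompt or a review screen; the point of the design is that the model only ever produces proposals, and the human decision is the single gate before money moves.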
So it's cool.
This is a great experiment.
I love that we got season 1.5.
I mean, it's fascinating.
Even more fascinating is that we have an early look at Grok 4.2, which by all means,
is the best trading model in the world. Where will it rank in the other benchmarks? We will see.
We will be covering it as soon as it comes out. But I guess that's really it for this episode on season
1.5. The question I want to leave everyone else with is, I mean, would you trust an AI with
part of your portfolio? Like, how much money would you actually give to an AI currently? Grok 4.2,
who just made 60% in two weeks in one of these trading competitions? Is that enough for you
to risk your money? Or is it still just this dumb AI system that, you know, you don't really trust?
Well, if you're interested in this experiment, Josh and I were actually discussing potentially giving you guys a tutorial on how to use an AI to trade money for you, kind of like an experiment, our own n-of-one experiment.
But we want to get a little more signal from you guys. Let us know in the comments whether this is something that you'd be interested in seeing.
And I have, Josh, I have a requirement for the listeners.
If we do want to put the tutorial out.
Our last video that we did on AI trading
reached 100,000 views and 3,000 likes.
Biggest video ever.
Thank you.
I'm not going to ask for the 100,000 views,
but I will ask for the likes.
If this video can get more than 3,000,
if it gets 3,000 likes,
we will definitely put out that tutorial
by the end of the year.
And we have a lot of thoughts around this
about how we're going to do it.
We're super excited to do it,
so help us get there.
It is another week of really exciting news.
Josh, I don't know if you saw the rumors.
Did you see the rumors about OpenAI?
Tell me.
Fill me in.
About OpenAI releasing a potential new groundbreaking model.
As a matter of fact, the Polymarket is showing that OpenAI is very favored to release the best model of the year.
And last I checked, Gemini is the best model of the year.
So that implies we're getting something big in the next few weeks.
I think we will.
And like you said, the Polymarket is kind of like revealing its hands.
So maybe there's some inside information coming out here.
So there certainly is.
So stay tuned to Limitless.
Put the notifications on, guys,
and also subscribe if you want to get the latest videos.
We put out the best content out there.
It's not, it's unchallenged right now.
Josh and I are sitting here unchallenged.
You have to like and subscribe if you want to get our content on your feed.
Thank you so, so much for listening.
Again, let us know what you thought of this episode in the comments.
Get that like number up and we will see you on the next one.
