Limitless Podcast - Revealing Elon’s Secret AI Trading Bot: Is It Worth It?
Episode Date: December 9, 2025

The groundbreaking Alpha Arena experiment pitted eight AI trading models against each other. Grok 4.2 emerged as the standout winner, achieving 60% profit in just two weeks despite the volatility that affected many competitors. What does this experiment mean for you? With strategies and behavioral patterns on display, we need to question the balance between AI trading success and necessary human oversight.

------
🌌 LIMITLESS HQ: LISTEN & FOLLOW HERE ⬇️
https://limitless.bankless.com/
https://x.com/LimitlessFT

------
TIMESTAMPS
0:00 Intro
1:39 Season 1 Results
2:39 Transition to Season 1.5
4:22 Mystery Model Revealed
5:55 Competition Breakdown
8:09 Insights from Competition
9:56 Model Trading Styles
12:16 AI Personalities in Trading
14:11 Comparing Model Performances
16:36 Limitations and Future Potential
19:53 Trusting AI with Investments
24:20 Future of AI Trading Tutorials

------
RESOURCES
Josh: https://x.com/JoshKale
Ejaaz: https://x.com/cryptopunk7213

------
Not financial or tax advice. See our investment disclosures here:
https://www.bankless.com/disclosures
Transcript
Imagine this. You give eight of the world's most powerful AI models $10,000 each and tell them,
go trade real stocks. No paper trading, but real money with real risk. And two weeks later,
most of them have lost a painful amount of cash, which I guess is kind of expected.
The kind of drawdowns that would get a human portfolio manager totally fired. But then they ran
the same experiment again, except this time with much higher stakes: $320,000.
And we've talked about Alpha Arena before in a previous episode, which I highly recommend checking out.
But now we have the new results from the new season, season 1.5.
And what was exciting is that there was a very clear and obvious winner, but that winner was a mystery.
We don't actually know or we didn't know who the winner was up until recently.
In fact, it won all four of the trading competitions in this new season while leaving the other top models, like ChatGPT 5.1 and Google Gemini 3.0, fighting for second place.
So at the core of this is, one, who is this model?
And two, how on earth did they do it?
How are they outperforming everyone, so much so as to make 65% in two weeks in one of these
competitions?
So, Ejaaz, I want to walk everyone through what just happened, what the model is, and what
Alpha Arena is.
So give us the lowdown on which model it was that made so much money.
Oh, yeah.
Well, we will get into all of that today.
So Alpha Arena is basically a competition or test to see how well AI models can trade.
And they do this in a few different ways, Josh.
Number one, they give each model $10,000, as you mentioned.
And then they allowed them to trade a range of different financial instruments over a period of two weeks.
So there's like a season, two weeks, and we see which AI models do the best.
And they've got all your AI models in there.
You've got ChatGPT.
You've got Gemini.
You've got Anthropic's Claude, and you have Grok as well.
And so they've gone through about two seasons now and the results have been absolutely crazy.
So they started off with season one.
And you can think of this as like the degen crypto season.
They gave seven models $10,000 each and allowed them to trade crypto assets like Bitcoin,
Ethereum, stuff like that.
And they did this in something called perpetuals,
so they could leverage trade; that was the only instrument they were allowed to use.
And the results were, as you'd probably expect, a lot of these AI models lost a lot of money.
Some of them actually ended up making a decent chunk of money.
And they were primarily Chinese models,
Qwen and, I think it was, DeepSeek, that ended up making money.
So there was a lot of takeaways there.
As you mentioned, we've got a previous episode where we spoke about this.
Definitely give that a watch.
There's a lot of alpha in that one.
And then that brings us to season 1.5, where the AI models, instead of being given
crypto to trade, were given the ability to trade U.S. stocks.
And we're talking about equities, which is something that a lot of us listening to this show
are very familiar with.
And I think this is for a few reasons, Josh.
Primarily, crypto is very volatile, and we kind of want to figure out how the majority of money
that is traded in the financial markets can translate into AI models trading it.
So a few things they kept the same: they gave each AI model $10,000.
But there were a number of differences with season 1.5.
Number one, they were allowed to trade U.S. equities and stocks.
Number two, there were two new models that were introduced.
One was a model called Kimi K2, which is a really good open-source Chinese model.
But the other was this thing called the mystery model.
I'm going to reveal which model this was in a second.
But before I do, do you have any guesses as to what model this might have been?
Well, I cheated. I know the answer.
But what I think is very exciting about this is that, like, I think it's important to highlight.
These models made hundreds to even thousands of trades per model.
And the question I want answered, even more than the mystery model, is:
is this real signal, or is this just, as Luke said
earlier, a GPU-intensive scratch-off game? Is there any real signal? And I guess
we'll talk about the reality of that and what this means for your portfolio if you ever want to
manage it. But to me, I think that's the important thing to highlight. We probably should just
spill the beans, you guys. Do you want to tell them? I have to. I can't keep it in
any longer. It was an unofficial version of Grok, aptly named Grok 4.2, or 4.20 for the memers out there.
And this was revealed by none other than the Grok man himself, Elon Musk.
And the reason why this mystery model was getting so much attention, Josh, was because it ended up being the winner.
It made the most money out of any of the other AI models.
And what was more impressive is there wasn't just one competition being run throughout season 1.5.
There were four at the same time.
So these AI models were running across four different competitions at the same time.
That was $320,000 at stake at any one instance, which is a crazy amount of money to put on an experiment.
That's a lot of money that could have been lost here.
And Grok 4.20 ended up performing the best.
Josh, I want to go through a few different stats here, which kind of show how amazing this particular model was.
So firstly, for some context, there were four
different competitions being run that these AI models were being tested on.
Competition number one was something called new baseline. This basically gave these
AI models, while trading U.S. stocks, access to all the common news that you
and I can read online and in newspapers, to kind of figure out, okay, what kind of news would
affect my stock positions. They would also get access to sentiment data to see how
the markets and retail traders would react to certain bits of news. So they
had access to a much wider spread of data in competition number one.
Competition number two was called Monk Mode.
They kind of amended the investing prompt here
so that the models traded more conservatively.
Competition number three was called situational awareness, Josh.
So each model had an awareness of the other models' trading and where they ranked relative
to them.
So there was this kind of like ecosystem of peer pressure being put on by each model.
And competition number four was just outright
degeneracy: max leverage. You could only trade with like 20 to 50x leverage, which is just kind of,
I don't think it was 50x, but like 30x. Just a crazy amount of risk to test whether
a model would take that risk or whether it would trade more conservatively. Josh, do you have any
reactions on the results of this competition? The results that we're looking at right now, actually,
I found most interesting. This is from the new baseline competition. It's basically the full info mode.
And one of the big differences between this mode versus previous competitions that have been held
is like you mentioned earlier, it has access to a lot of data.
This is the first time an AI trading model has had access to real-time information outside of just looking at a chart.
So I think in that sense, this is the closest competition to how a human quant fund would actually operate.
So if you're looking for high signal in terms of which AI can actually make you real money in the real world, this is the one.
And what we're seeing here is that the Grok 4.20 model, the mystery
model, outperformed by a fairly large margin OpenAI's GPT 5.1, which is the clear second
place. And those are the only two that actually made a profit. Everybody else lost money in the real
world competition, which to me signals a few things. One of them being, well, perhaps one is
really good at understanding real world information. Perhaps it understands company fundamentals
better. Perhaps it just has access to real-world information that's better, like Grok having
access to X. So there's a lot of things to speculate on here, but for me, the new baseline
chart that we're looking at right now was the highest signal one. I'm like, oh my God, wait,
this has the same type of information flows that I'm now getting. So now we're even. We're on
the same playing field. Okay. I actually had a different answer to that, which is I was more
impressed, Josh, by the situational awareness competition. So this was a competition where
each model had access to data and news,
but they also had awareness of who they were competing against.
So Grok 4.20, the winner, knew that GPT5 was in second place.
And so he was always keeping an eye on GPT5 being like,
oh, what trades is GPT5 making?
Why did they make that trade?
Oh, that's interesting.
And then he would look at Gemini and be like,
oh, what trades are Gemini making?
So he would have this awareness of his competitors,
which you didn't have in season one,
where they were just kind of trading in silos, right?
And why this competition was so interesting, Josh, is this was technically where Grok 4.20 made the most money. In fact, if you look at the top of this leaderboard right here, the account value at the end of season 1.5 was $16,656.50, which is technically a 60%-plus return in two weeks on $10,000 worth of capital.
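As a quick back-of-the-envelope check on that leaderboard figure (a sketch using only the two numbers quoted in the episode, nothing else assumed):

```python
# Sanity-check the return quoted for Grok 4.20 in the situational awareness competition.
starting_capital = 10_000.00   # each model's starting bankroll
ending_value = 16_656.50       # reported account value at the end of season 1.5

return_pct = (ending_value - starting_capital) / starting_capital * 100
print(f"{return_pct:.1f}% over two weeks")  # → 66.6% over two weeks
```

So "60% plus" is actually closer to 66.6% over the two-week season.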
I need to give it my money immediately. Isn't that
insane, right? Like if you had to pick a competition of where you would have given an AI model money,
just given from this data, and I'm not saying you should do that, you would be most bullish on
situational awareness. And I'm going to make some inferences here that I haven't tested
yet, but it seems to imply that this kind of competitive nature, where the models were
aware of and exposed to their competitors' trades and thinking (and we're going to get to the model
chat in a second), seems to have given them a better trading advantage, at least in some cases.
Yeah, so like you mentioned, one of my favorite parts, I think we share this, and one of our favorite
parts about this competition in particular, is that you can actually see all of the trades.
One thing about these private quant funds, you don't know what the hell is going on, but with
these models, you can see exactly what they're thinking every time they think and make a decision.
So maybe you guys can go through a few of them and see kind of what the model is thinking,
how they're processing this real world data.
and if there are any tips for us to learn from processing this real-world data,
because clearly they're a much better trader than I am.
Yeah, so I have a few examples pulled up here on the right side of the screen.
It's under model chat.
By the way, any of you listening to this can go onto this website and see it for yourself
and scroll through their hundreds and hundreds of posts.
But it basically gives us an insight into how each model thinks about a trade that they currently
either have open or they're thinking about opening or closing or whatever that might be, right?
So it's like being in the mind of an actual investor and figuring out how they make their decisions.
An example here at the top of the screen is Gemini 3 Pro. He goes, I'm betting on a breakout in
Nvidia, seeing a strong setup as it holds support and leading the market with a target of $189 and a
stop just below 180. So what he's referring to there is kind of a typical quant style of trading
where it's kind of like he's looking at technicals, he's evaluating kind of graphs,
momentum of the stock price.
It's a very price-driven type of trading, right, Josh?
But if you look just below it, you've got GPT 5.1,
which actually came in second at the end of this competition,
who goes, my analysis indicates continued strength in AI names like Nvidia and Microsoft,
so I'm holding on to existing long positions over the weekend despite potential macro event risk.
Now, the point I want to make about this particular model is it's less price-specific
and it's more focused on just kind of general themes, news, and data that it's seeing outside of price.
And that really goes to demonstrate that some of these models are very kind of price and quantitative focused,
whereas other models are kind of more thesis driven over a shorter period of time.
And it kind of gives rise to these types of personalities, right, Josh?
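If you wanted to poke at these model-chat logs programmatically, one entry might be captured like this (a hypothetical schema invented for illustration — the Alpha Arena site does not publish one; the notes paraphrase the quotes above):

```python
# Hypothetical structure for one "model chat" entry like the ones discussed above.
# Field names are invented for illustration; the notes paraphrase the episode.
from dataclasses import dataclass

@dataclass
class ModelChatEntry:
    model: str   # which AI wrote the note
    style: str   # "quant/price-driven" vs "thesis-driven"
    note: str    # the model's own stated reasoning

entries = [
    ModelChatEntry(
        model="Gemini 3 Pro",
        style="quant/price-driven",
        note="Betting on a breakout in Nvidia: holds support, target $189, stop just below $180.",
    ),
    ModelChatEntry(
        model="GPT 5.1",
        style="thesis-driven",
        note="Continued strength in AI names; holding longs over the weekend despite macro event risk.",
    ),
]

for e in entries:
    print(f"{e.model} ({e.style}): {e.note}")
```

Tagging each entry with a style like this is one simple way to compare the price-driven and thesis-driven personalities across hundreds of posts.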
Yeah, well, now we have to answer the uncomfortable question, which is:
is this evidence that Grok is some kind of money-printing god,
or is this just really well-produced content that happens to involve real money?
And that kind of comes down to understanding the AI, understanding the personalities,
understanding how each model considers these trades and how they place themselves in different
positions.
So I kind of want to go through one by one, all of the models and kind of what their
personalities are like.
We see with DeepSeek a lot that it behaves, and we mentioned on a previous episode as well,
it behaves like a very disciplined quant fund.
And DeepSeek, for those that don't know, is an open-source Chinese model.
It is very systematic, very mathematical, very comfortable with leverage,
but able to hedge and adjust mid-trade based on its decisions and new information. So DeepSeek,
and Qwen even is kind of similar to this. If you remember from the last episode,
Ejaaz, Qwen was my early favorite. I had hoped that Qwen was going to win. Unfortunately,
that's not the case at all in season 1.5. Qwen has gotten crushed right there with DeepSeek. I can
kind of imagine it as more similar to me, maybe that's why I resonated with it, where it has one
big thesis and then it sizes aggressively around that thesis. So if you remember, Qwen would only buy
Bitcoin or Ether in the last one. And it wouldn't buy any other altcoins. It just had a thesis that
these major coins were going up. Nothing else was. Claude is interesting. It's very reflective of how
the actual Claude model works when you engage with it. It's very patient and it's thoughtful,
but it occasionally sizes up too much and then it gets crushed by leverage. So as we go
through these, Ejaaz, I also noticed you assigned a masculine personality to Gemini. You said
he when you were talking about Google Gemini. And that's kind of because it's daddy, right?
Like Gemini's been the big boy on top. But in this trading competition, I don't know if it is.
I was going through the trades and it very much panic flip-flops from shorts to long after losing.
And it kind of, in a way, Gemini was most reflective of retail behavior.
Because, and I'm not sure what we could tie that to, but Gemini was very reactionary.
Where if it lost money, it would flip its position. And if it gained money, it would kind of hedge
quickly. So that was interesting. And then we have GPT5, which is very sophisticated reasoning.
But in season one, it over-traded and over-leveraged and got absolutely wiped out. And it was
very timid in the way that it went about this. So that's kind of how you can think about
these. The final one, which is the secret model, Grok 4.2. If we know anything about Grok,
we know that it is a very high risk taker, but a calculated risk taker. And that's probably
why I put it at the top there. So that's kind of how I would consider all of these models.
They're a little different. And they are reflective of, if you've used these in person, you
could kind of understand the thinking that gets placed behind the trades.
Yeah.
I want to dig into a few things around the personality or rather the trading styles here, Josh,
because it may not be as explicit as we kind of lay it out.
Like, so Grok 4.20 was the winner, right?
By far.
And it made money.
It was the top across all of the competitions, all four competitions.
That's great.
But did you look at the results of Grok 4,
its predecessor?
It was the-
Absolutely.
It was the worst performing model in this entire competition, which is crazy because in season one where it was trading crypto, it came in at second or third.
And for about 75% of the competition, Josh, it was number one.
So it had some kind of an advantage trading kind of very riskily, right?
And that might be because of the nature of the instruments that it was trading.
Crypto is very volatile.
And it was kind of going blaze.
So when it was like 20x bullish Bitcoin, it benefited a lot when Bitcoin price went up.
But obviously it suffered when it went down.
It's interesting to see the divergence between these two models in 1.5, right?
Grok 4.20, the winner, seems to be a kind of more mature version of Grok 4.
It seems to be thinking more about its trades.
It has more risk percentiles and boundaries in place, whereas Grok 4 seems to be its usual
degenerate self. And I don't know how much of that is reliant on the fact that it's trading stocks,
which is generally a less volatile market, versus Grok 4.20 being a more thesis-driven, sensible
trader, as you kind of described. The other one that we have to call out, because it's the elephant
in the room here, GPT5 came in at second in season 1.5, right?
5.1. Sorry, 5.1, right? In the previous season, season 1, it was the second
worst performing? No, sorry, it was the worst performing. It was horrible. It was
GPT5. It was an abomination, and so was Gemini. So whatever OpenAI has cooked up in the point one,
congrats. Because you must have trained it on some kind of financial data, or you've
implemented some kind of risk trading strategy that made it a lot more
sensible, because it made some really great trades this season. So those are just two big
jumps from season 1 to 1.5 that I had to call out. Yeah, it makes me
excited to see the improvements in these, like significant improvements with incremental models,
because we normally talk about 5 to 5.1 being pretty marginal. Like, there's nothing really
noteworthy or exciting. And yet the results, in the small sample size at least, are pretty
reassuring that, hey, there is something new going under the hood. And maybe this is an appropriate
time to address, I guess, the limitations, the kind of bear case of this, starting with the sample
size. We do have to say, I mean, this is two weeks, Ejaaz. This is not a long time.
They placed some trades.
Some people maybe got lucky.
Some models maybe did not.
Is there any real signal here?
I'm curious your take.
Do you think this is reflective of future performance?
Like, what is here that's actually valuable, versus what is just lucky?
I don't think we have enough information to make that call.
At least for me, I'll speak for myself personally.
The real test is, you know, I asked myself before we recorded this episode, would I give my money to Grok 4.20,
the winner that won across all categories?
And the simple answer is like, no.
I don't know if it's going to repeat that over week three, week four, week five.
It was only two weeks to your point, right?
So I want to see this experiment kind of rerun like a million times before
I'm like, okay, that's cool.
Even then, it's still kind of like risky, right?
It's like, I can justify giving my money to a human that I can kind of relate to,
that I can call up and speak to, less so when it comes to an AI model, right?
But maybe that's my thinking that needs to kind of evolve.
The other way I'm thinking about this is there's just a lot of unknowns around this, Josh, right?
Like I can see its thinking.
I can see kind of like how the model kind of completes its trades.
But I don't really know what's going under the hood.
Is this just kind of a pattern-matching thing?
Does it inherit the risks that a lot of humans have already taken on, because it's trained on the same kind of corpus of trading data that we have?
Or is it net better?
Do you feel the same or?
Yeah, it's probably, I mean, it's not the new gold standard of AI benchmarks,
but it is a standard that I think is interesting
because this is a benchmark that happens in the real world
with real dynamic data that cannot be gamed.
So in that case, I love it.
But I saw one writer who called it Schrödinger's benchmark,
because it's simultaneously serious and degenerate at the same time.
And it's like, it's entertainment with real money
that happens to produce some legitimate insights about AI behavior, but it's not really indicative of
future returns at this small of a sample size, at least. And that's kind of where I feel about it.
There is one breakthrough that we mentioned earlier that does provide real value, which is the
transparency. Every trade being on-chain and every step of reasoning being logged is actually really
helpful to understanding how these models think and how you can consider thinking. So for example,
you could show me every decision Grok 4.20 made on Tesla after the Fed announcement or something
like that, and it'll walk you through a chain of thought, and if anything make you into a better
investor. Would I trust the model with my own money? Maybe a little bit, maybe, with the small sample size.
How much would you give it? That's a good question. I'd give it a couple thousand dollars to play around
with and see what happens. I think that would be interesting and fun, and it's low enough
stakes, but I would trust it enough not to lose it. Like, I'd say I would probably trust Grok more with my money than I
would the average day trader off the street, who, to be fair, doesn't have a very good reputation,
but I think there is some sort of an edge there that doesn't exist in the average person.
And if you assume that these models are going to continue to get better and better,
well, you have to assume that they're going to form some sort of an edge, but I don't know
how much. It's an interesting question. Because as a quant trading fund, too, or as
just a trader in general, if your job is to make money off of trading, what are you doing with
this information? Are you leaning into AI? Are you trying to get these models to help with your
information flows and make decisions? Are you using them to help you actually execute trades? Or are you
just kind of looking the other way and saying, oh, this is just a dumb experiment to benchmark models,
there's no actual signal here? And the answer is probably somewhere in the middle, right? Yeah, I mean,
well, my initial reaction to that is, okay, quant funds already use algorithms. It would make a lot of
sense if they started using AI algorithms, right? If you could get a smarter algorithm
to trade for your fund, would you? Absolutely. Right. So it's a no-brainer to me that these hedge funds,
quant funds, are going to be using AI, and are probably already using AI. Where I have maybe a hot take is that
the transparency is just a nice-to-have. It is in no way going to survive in the best models. Why?
Because if you have an AI model that is better than all the other AI models at trading,
why would you make that public? Right. So I'm kind of torn on this, because
I think the transparency is a really good thing in kind of like bringing up the floor of trading
credibility for people that get access to this type of information. Like I have loved reading
through these kind of like trade logs here, seeing how each model thinks and being like,
okay, yeah, wow, I actually didn't think about that myself when I was buying that stock, right?
And these are like stocks that I've seen that I can buy, right? The Amazon trade, the Nvidia trade.
I'm just like, oh, okay, I didn't think about that, right, yesterday whenever they made this trade.
But if I am a hedge fund, I'm like, yeah, if I fine-tuned a model that is beating all these models, I don't really want to expose that. So it's kind of a push and pull. The other thought I had, Josh, and maybe this is kind of semi-adjacent to what we're discussing here, is I couldn't get the thought out of my head that if you could get Grok on X trading some kind of money for you, or guaranteeing you like a five to 10% annual return, that is something that I would, like,
if framed correctly, I would put some money into, right? Maybe not over two weeks, but maybe over
an adjusted kind of yearly period would be super cool to see. Yeah, that's such a, it's such a fun
question to ask: what happens when this kind of system runs for two years? But with your,
like, let's say it's a large pension management fund and they just want a manager that doesn't
take fees and does a pretty good job. Like, is there going to be enough trust in these systems to
reliably place money at scale with them? And you have to assume, given the signal this early on,
that the answer will be yes. The question is how much of a yes will it be? What percentage of
management will be AI as it gets better over time? And the sample size sucks. I wish it was more
than two weeks. I wish it was two years. But in two years from now, think about the progress we're
going to see and what type of impact that's going to have on trading models. So this is,
it's interesting. It's fascinating. In fact, I'm really curious to actually run this experiment
for ourselves. I'd love to try to come up with a little trading model that runs in these things
and test it out because it's fun and there is some sort of an edge there. I would say, okay,
if I were to summarize my lesson from this entire competition or experiment so far, Josh,
it is I'm not convinced to give AI models money to trade, but I am convinced to use AI models
to help me trade. So kind of like a human and AI model kind of work together and kind of become a
better trader overall, I think is the main takeaway for me here. Do you share the same? It's funny.
I mean, this is how agents work today, right? Like, if you go on ChatGPT and you say, go book me a
reservation, it'll take you to the finish line, and then you as the human provide the final filter
and approve or deny. And I think that's probably the happy middle ground, while we still don't really
trust these models too much, is give me the thesis, give me the trade. I will either approve or deny,
and that's how the money gets managed.
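That approve-or-deny workflow could be sketched roughly like this (a hypothetical illustration; `propose_trade`, `review`, and the proposal format are all invented for the example, not anything Alpha Arena or these models actually expose):

```python
# Hypothetical sketch of the human-in-the-loop pattern described above:
# the model proposes a trade with its thesis, and a human approves or denies
# it before any money moves. Everything here is invented for illustration.

def propose_trade(model_name: str) -> dict:
    # Stand-in for a call to an AI model; returns a structured proposal.
    return {
        "model": model_name,
        "ticker": "NVDA",
        "side": "long",
        "size_usd": 1_000,
        "thesis": "Holding support; momentum leading the market.",
    }

def review(proposal: dict, approve: bool) -> str:
    # The human is the final filter: nothing executes without approval.
    if approve:
        return f"EXECUTE {proposal['side']} {proposal['ticker']} ${proposal['size_usd']}"
    return f"REJECTED {proposal['ticker']}: kept for the trade log only"

proposal = propose_trade("grok-4.20")
print(proposal["thesis"])              # read the model's reasoning first
print(review(proposal, approve=True))  # then approve or deny
```

The point of the design is that the model does the analysis while the human keeps the execution authority, which matches the "give me the thesis, I approve or deny" middle ground described above.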
So it's cool.
This is a great experiment.
I love that we got season 1.5.
I mean, it's fascinating.
Even more fascinating is that we have an early look at Grok 4.2,
which by all accounts is the best trading model in the world.
Where will it rank in the other benchmarks?
We will see.
We will be covering it as soon as it comes out.
But I guess that's really it for this episode on season 1.5.
The question I want to leave everyone else with is,
I mean, would you trust an AI with part of your portfolio?
Like, how much money would you actually give to an AI right now?
Grok 4.2, which just made 60% in two weeks in one of these trading competitions?
Is that enough for you to risk your money?
Or is it still just this dumb AI system that you don't really trust?
Well, if you're interested in this experiment,
Josh and I were actually discussing potentially giving you guys a tutorial
on how to use an AI to trade money for you,
kind of like an n-of-1 experiment of our own.
But we want to get a little more signal from you guys first.
Let us know in the comments whether this is something that you'd be interested in seeing.
And I have, Josh, I have a requirement for the listeners.
What do you got?
If we do want to put the tutorial out, our last video that we did on AI trading reached 100,000 views and 3,000 likes.
So I'm not going to ask for the 100,000 views, but I will ask for the likes.
If this video can get more than 3,000 likes,
we will definitely put out that tutorial by the end of the year.
And we have a lot of thoughts around this about how we're going to do it.
We're super excited to do it.
So help us get there.
It is another week of really exciting news.
Josh, I don't know if you saw the rumors.
Did you see the rumors about Open AI?
Tell me.
Fill me in.
About OpenAI releasing a potential new groundbreaking model.
As a matter of fact, Polymarket is showing that OpenAI is heavily favored to release the best model of the year.
And last I checked, Gemini is the best model of the year.
So that implies we're getting something big in the next few weeks.
I think we will.
And like you said, Polymarket is kind of revealing its hand.
So maybe there's some inside information coming out here.
So there certainly is.
So stay tuned to Limitless.
Put the notifications on guys and also subscribe.
If you want to get the latest videos, we put out the best content out there.
It's unchallenged right now.
Josh and I are sitting here unchallenged.
You have to like and subscribe if you want to get our content on your feed.
Thank you so, so much for listening.
Again, let us know what you thought of this episode in the comments.
Get that like number up and we will see you on the next one.
