Limitless Podcast - Revealing Elon’s Secret AI Trading Bot: Is It Worth It?

Episode Date: December 9, 2025

The groundbreaking Alpha Arena experiment pitted eight AI trading models against each other. Grok 4.2 emerged as the standout winner, achieving a 60% profit in just two weeks despite the volatility that affected many competitors. What does this experiment mean for you? With strategies and behavioral patterns now visible, we need to question the balance between AI trading success and necessary human oversight.

------

🌌 LIMITLESS HQ: LISTEN & FOLLOW HERE ⬇️
https://limitless.bankless.com/
https://x.com/LimitlessFT

------

TIMESTAMPS
0:00 Intro
1:39 Season 1 Results
2:39 Transition to Season 1.5
4:22 Mystery Model Revealed
5:55 Competition Breakdown
8:09 Insights from Competition
9:56 Model Trading Styles
12:16 AI Personalities in Trading
14:11 Comparing Model Performances
16:36 Limitations and Future Potential
19:53 Trusting AI with Investments
24:20 Future of AI Trading Tutorials

------

RESOURCES
Josh: https://x.com/JoshKale
Ejaaz: https://x.com/cryptopunk7213

------

Not financial or tax advice. See our investment disclosures here:
https://www.bankless.com/disclosures

Transcript
Starting point is 00:00:00 Imagine this. You give eight of the world's most powerful AI models $10,000 each and tell them, go trade real stocks. No paper trading, but real money with real risk. And two weeks later, most of them have lost a painful amount of cash, which I guess is kind of expected. The kind of drawdowns that would get a human portfolio manager totally fired. But now they ran the same experiment again, except this time with much higher stakes: there's $320,000 at stake. And we've talked about Alpha Arena before in a previous episode, which I highly recommend checking out. But now we have the new results from the new season, season 1.5. And what was exciting is that there was a very clear and obvious winner, but that winner was a mystery.
Starting point is 00:00:38 We don't actually know, or we didn't know, who the winner was up until recently. In fact, it won all four of the trading competitions in this new season while leaving the other top models, like ChatGPT 5.1 and Google Gemini 3.0, fighting for second place. So at the core of this is, one, who is this model? And two, how on earth did they do it? How are they outperforming everyone, so much so as to make 65% in two weeks in one of these competitions? So, Ejaaz, I want to walk everyone through what just happened, what the model is, and what Alpha Arena is.
Starting point is 00:01:11 So give us the lowdown on who this was that made so much money. Oh, yeah. Well, we will get into all of that today. So Alpha Arena is basically a competition or test to see how well AI models can trade. And they do this in a few different ways, Josh. Number one, they give each model $10,000, as you mentioned. And then they allowed them to trade a range of different financial instruments over a period of two weeks. So there's like a season, two weeks, and we see which AI models do the best.
Starting point is 00:01:43 And they get all your AI models in there. You've got chat GPT. You have got Gemini. You've got Anthropics Claude and you have GROC as well. And so they've gone through about two seasons now and the results have been absolutely crazy. So they started off with season one. And you can think of this as like the Dgen crypto season. They gave seven models $10,000 each and allowed them to trade crypto assets like Bitcoin,
Starting point is 00:02:08 Ethereum, stuff like that. And they did this in something called perpetual. So they could leverage trade is the only instrument that they were allowed to do this. And the results were, as you'd probably expect, a lot of these AI models lost a lot of money. Some of them actually ended up making a decent chunk of money. And they were primarily Chinese models. Quen and I think it was Deep Seek that ended up making money. So there was a lot of takeaways there.
Starting point is 00:02:33 As you mentioned, we've got a previous episode where we spoke about this. Definitely you'll give that a watch. There's a lot of alpha in that one. And then that brings us to season 1.5, where the AI models, instead of being given crypto to trade, were given the ability to trade U.S. stocks. And we're talking about equities, which is something that a lot of us listening to this show are very familiar with. And I think this is for a few reasons, Josh.
Starting point is 00:02:57 primarily crypto is very volatile, and we kind of want to figure out how the majority of money that is traded in the financial markets can translate into AI models trading that. So a few things that they kept the same is that they gave the AI model $10,000. But there was a number of differences with season 1.5. Number one, they were allowed to trade U.S. equities and stocks. Number two, there were two new models that were introduced. One was a model called Kimi K2, which is a really good open source Chinese model. but the other was this thing called a mystery model.
Starting point is 00:03:29 I'm going to reveal which this model was in a second. But before I do, do you have any guesses as to what model this might have been? Well, I cheated. I know the answer. But what I think is very exciting about this is that, like, I think it's important to highlight. These models made hundreds to even thousands of trades per model. And what we want to answer, like, the question that I want more than the mystery model is like, is this real signal or is this just, I mean, Luke said earlier. earlier, is this a GPU intensive scratch-off game where is there any real signal? And I guess
Starting point is 00:04:00 we'll talk about the reality of that and what this means for your portfolio if you ever want to manage it. But to me, I think that's the important thing to highlight. We probably should just spill the beans you guys. Do you want to tell them? Who's in this room? I have to. I can't keep it in any longer. It was an unofficial version of GROC, aptly named GROC 4.2 or 420 for the memers out there. And this was revealed by none other than the Grockman himself, Elon Musk. And the reason why this mystery model was getting so much attention, Josh, was because it ended up being the winner. It made the most money out of any other AI models. And what was more impressive is there wasn't just one competition being run throughout season 1.5.
Starting point is 00:04:47 There were four at the same time. So these AI models were running across four different competitions at the same time. That was $320,000 at any one instance, which is a crazy amount of financial money to stake on an experiment. That's a lot of money could have been lost here. And Grok 4.20 ended up performing the best. Josh, I want to go through a few different stats here, which kind of like shows how amazing this particular model was. So firstly, for some context, there were four. different competitions that were being run that these AI models were being tested on.
Starting point is 00:05:24 Competition number one was something called new baseline. This is basically the ability for these AI models to get access to trading AI stocks to get access to all the common news that you and I can read online and in newspapers to kind of like figure out, okay, what kind of news would affect my stock positions. They would also get access to sentiment data to see how kind of like the markets and retail traders would kind of react to certain bits of news. had access to a much wider spread amount of data in competition number one. Competition number two was called Monk Mode. They kind of amended the investing prompt here.
Starting point is 00:06:01 And so kind of like they traded more conservatively. Competition number three was called situational awareness, Josh. So each model had an awareness of other models trading and where they ranked in accordance to them. So there was this kind of like ecosystem of peer pressure being put on by each model. And competition number four was just outright. generacy, max leverage. You could only trade with like 20 to 50x leverage, which is just kind of, I don't think it was 50x, but like 30x. Just crazy amount of risk adjustment to test whether
Starting point is 00:06:32 a model would take that risk or whether it would trade more conservatively. Josh, do you have any reactions on the results of this competition? The results that we're looking at right now, actually, I found most interesting. This is from the new baseline competition. It's basically the full info mode. And one of the big differences between this mode versus previous competitions that have been held is like you mentioned earlier, it has access to a lot of data. This is the first time an AI trading model has had access to real-time information outside of just looking at a chart. So I think in that sense, this is the closest competition to how a human quant fund would actually operate. So if you're looking for high signal in terms of which AI can actually make you real money in the real world, this is the one.
Starting point is 00:07:13 And what we're seeing here is that the GROC 4.20 model, the Mometic mystery, model outperformed by like a fairly large margin to open A&ChapyTP2.1, which is the clear second place. And those are the only two that actually made profit. Everybody else lost money in the real world competition, which to me signals a few things. One of them being, well, perhaps one is really good at understanding real world information. Perhaps it understands company fundamentals better. Perhaps it just has access to real world information that's better, like GROC and having access to the X AI model. So there's a lot of things to speculate here, but for me, the new baseline chart that we're looking at right now was the highest signal one. I'm like, oh my God, wait,
Starting point is 00:07:55 this has the same type of information flows that I'm now getting. So now we're even. We're on the same playing field. Okay. I actually had a different answer to that, which is I was more impressed, Josh, by the situational awareness competition. So this was a competition where each model had access to data and news, but they also had awareness of who they were competing against. So Grok 4.20, the winner, knew that GPT5 was in second place. And so he was always keeping an eye on GPT5 being like, oh, what trades is GPT5 making?
Starting point is 00:08:29 Why did they make that trade? Oh, that's interesting. And then he would look at Gemini and be like, oh, what trades are Gemini making? So he would have this awareness of his competitors, which you didn't have in season one, where they were just kind of like trading in silos, right? right? And why this competition was so interesting, Josh, is this was technically where GROC 4.20 made the most money. In fact, if you look at the top of this leaderboard right here, the account value at the end of season 1.5 was $16,656.5, which placed it, which is technically a 60% plus return in two weeks on $10,000 worth of capital.
Starting point is 00:09:09 I needed to take my money immediately. Isn't that? insane, right? Like if you had to pick a competition of where you would have given an AI model money, just given from this data, and I'm not saying you should do that, you would be most bullish on situational awareness. And I'm going to like kind of make some implications here that I haven't tested yet, but it seems to imply that this kind of competitive nature where the models were kind of aware and exposed to their competitors' trades and thinking, and we're going to get to the model chat thinking in a second seems to have given them a better trading advantage, at least in some cases. Yeah, so like you mentioned, one of my favorite parts, I think we share this, and one of our favorite
Starting point is 00:09:49 parts about this competition in particular, is that you can actually see all of the trades. One thing about these private quant funds, you don't know what the hell is going on, but with these models, you can see exactly what they're thinking every time they think and make a decision. So maybe you guys can go through a few of them and see kind of what the model is thinking, how they're processing this real world data. and if there's any tips for us to learn from processing this railroad data, because clearly they're a much better trader than I am. Yeah, so I have a few examples pulled up here on the right side of the screen.
Starting point is 00:10:19 It's under model chat. By the way, any of you listening to this can go onto this website and see if yourself and scroll through their hundreds and hundreds of posts. But it basically gives us an insight into how each model thinks about a trade that they currently either have open or they're thinking about opening or closing or whatever that might be, right? So it's like being in the mind of an actual investor and figuring out how they make their decisions. An example here at the top of the screen is Gemini 3 Pro. He goes, I'm betting on a breakout in Nvidia, seeing a strong setup as it holds support and leading the market with a target of $189 and a
Starting point is 00:10:56 stop just below 180. So what he's referring to there is kind of a typical quant style of trading where it's kind of like he's looking at technicals, he's evaluating kind of graphs, momentum of the stock price. It's very price evaluated type of trading, right, Josh? But if you look just below it, you've got GPT 5.1, which actually came in second at the end of this competition, who goes, my analysis indicates continued strength in AI names like Nvidia and Microsoft, so I'm holding out on existing long positions over the weekend and potential macro event risk.
Starting point is 00:11:28 Now, the point I want to make about this particular model is it's less price-specific and it's more focused on just kind of general themes, news, and data that it's seeing outside of price. And that really goes to demonstrate that some of these models are very kind of price and quantitative focused, whereas other models are kind of more thesis driven over a shorter period of time. And it kind of gives rise to these types of personalities, right, Josh? Yeah, well, now we have to answer the uncomfortable question is like, is this evidence that Grock is some kind of money printing god, or is this just like really well-produced content that happens to involve real money?
Starting point is 00:12:05 And that kind of comes down to understanding the AI, understand the personalities, understanding how each model considers these trades and how they place themselves in different positions. So I kind of want to go through one by one, all of the models and kind of what their personalities are like. We see with DeepSeek a lot that it behaves, and we mentioned on a previous episode as well, it behaves like a very disciplined quant fund. And Deep Seek, for those that don't know, it's an open.
Starting point is 00:12:31 source Chinese model. They are very systematic, very mathematic, very comfortable with leverage, but able to hedge and adjust mid-trade based on its decisions and new information. So Deepseek and Quinn even is kind of similar to this. If you remember from the last episode, Ejjazz, Quim was my early favorite. I had hoped that Quinn was going to win. Unfortunately, that's not the case at all in season 1.5. Quen has gotten crushed right there with Deepseek. I can kind of imagine it as like more similar to me, maybe that's why I resonated with it, where it has one big thesis and then it sizes aggressively around that thesis. So if you remember, Quinn would only buy Bitcoin or Ether in the last one. And it wouldn't buy any other all coins. It just had a thesis that
Starting point is 00:13:09 these major coins were going up. Nothing else was. Claude is interesting. It's very reflective of how the actual Claude model works when you engage with it. It's very patient and it's thoughtful, but it occasionally sizes up too much and then it gets crushed by leverage. So, and like as we go through these, and EJ's, I also noticed, you assigned a masculine personality to Gemini. You said he when you were talking about Google Gemini. And that's kind of because it's daddy, right? Like Gemini's been the big boy on top. But in this training competition, I don't know if it is. I was going through the trades and it very much panic flip-flops from shorts to long after losing. And it kind of, in a way, Gemini was most reflective of retail behavior.
Starting point is 00:13:50 Because, and I'm not sure what we could tie that to, but Gemini was very reactionary. Where if it lost money, it would flip its position. And if it gained money, it would kind of hedge quickly. So that was interesting. And then we have GPT5, which is very sophisticated reasoning. But in season one, they over-traded and over-leveraged and got absolutely wiped out. And they were very timid in their way that they went about this. So that's kind of how you can think about these. The final one, which is the secret model, GROC 4.2. If we know anything about GROC, we know that it is a very high risk taker, but a calculated risk taker. And that's probably what I put it at the top there. So that's kind of how I would consider all of these models.
Starting point is 00:14:23 They're a little different. And they are reflective of, if you've used these in person, you could kind of understand the thinking that gets placed behind the trades. Yeah. I want to dig into a few things around the personality or rather the trading styles here, Josh, because it may not be as explicit as we kind of lay it out. Like, so GROC 4.20 was the winner, right? By far. And it made money.
Starting point is 00:14:46 It was the top across all of the competitions, all four competitions. That's great. But did you look at the results of GROC four? It's predecessor. It was the- Absolutely. It was the worst performing model in this entire competition, which is crazy because in season one where it was trading crypto, it came in at second or third. And for about 75% of the competition, Josh, it was number one.
Starting point is 00:15:13 So it had some kind of an advantage trading kind of very riskily, right? And that might be because of the nature of the instruments that it was trading. Crypto is very volatile. And it was kind of going blaze. So when it was like 20x bullish Bitcoin, it benefited a lot when Bitcoin price went up. But obviously it suffered when it went down. It's interesting to see the discourse between these two models and 1.5, right? GROC 4.20, the winner, seems to be a kind of more mature version of GROC 4.
Starting point is 00:15:45 It seems to be thinking more about its trades. It has more kind of like risk percentiles and boundaries in place, whereas GROC 4 seems to be its kind of usual degenerate self. And I don't know how much of that is reliant on the fact that it's trading stocks, which is generally a less volatile market versus GROC 4.20 being a more thesis-driven, sensible trader, as you kind of described. The other one that we have to call out, because it's the elephant in the room here, GPT5 came in at second in season 1.5, right? 5.1. Sorry, 5.1, right? In the previous season, season 1, it was the second worst performing? No, sorry, it was the worst performing. It was horrible. It was, it was, it was
Starting point is 00:16:29 GPT5. It was an abomination and Gemini. So whatever open AI is cooked up in the point one, congrats. Because you must have traded it on some kind of financial data or you've, you've like kind of like implemented a kind of like risk trading strategy that made it a lot more sensible because it made some really great trades on this season. So just two different kind of like jumps from season one to one point five that I had to call out. Yeah, it makes me excited to see the improvements in these, like significant improvements with incremental models, because we normally talk about 5 to 5.1 being pretty marginal. Like, there's nothing really note where they're exciting. And yet the results in the small sample size, at least, are pretty
Starting point is 00:17:08 reassuring that, hey, there is something new going under the hood. And maybe this is an appropriate time to address the, I guess the limitations, the kind of bare case of this starting with the sample size. We do have to say, I mean, this is two weeks, EGS. This is not a long time. They placed some trades. Some people maybe got lucky. Some models maybe did not. Is there any real signal here? I'm curious your take.
Starting point is 00:17:33 Do you think this is reflective of future performance? Like, is there what is here that's actually valuable versus what is here is actually kind of lucky? I don't think we have enough information to make that call. At least for me, I'll speak for myself personally. The real test is, you know, I asked myself before we recorded this episode, would I give my money to GROC 4.20? the winner that won across all categories. And the simple answer is like, no. I don't know if it's going to repeat that over week three, week four, week five.
Starting point is 00:18:02 It was only two weeks to your point, right? So I want to see this experiment kind of rehash like a million times before. I'm like, okay, that's cool. Even then, it's still kind of like risky, right? It's like I can justify giving my money to a human that I can kind of relate to that I can call up in speed to less so when it comes to an AI model, right? But maybe that's my thing he needs to kind of evolve. The other way I'm thinking about this is there's just a lot of unknowns around this, Josh, right?
Starting point is 00:18:31 Like I can see its thinking. I can see kind of like how the model kind of completes its trades. But I don't really know what's going under the hood. Is this just kind of like a pattern matching thing? Does it inherit the risks that a lot of humans have already done because it's trained on the same kind of corpus of trading data that we have kind of evaluated on? or is it kind of net better? Do you feel the same or? Yeah, it's probably, I mean, it's not the new gold standard of AI benchmarks,
Starting point is 00:18:59 but it is a standard that I think is interesting because this is a benchmark that happens in the real world with real dynamic data that cannot be games. So in that case, I love it. But I saw one writer, they called it Schrodinger's benchmark because it's simultaneously serious and degenerate at the same time. And it's like, it's entertainment with real money that happens to produce some legitimate insights about AI behavior, but it's not really indicative of
Starting point is 00:19:24 future returns at the small of a sample size at least. And that's kind of where I feel about it. There is one breakthrough that we mentioned earlier that does provide real value, which is the transparency. Every trade being on chain and every step reason being logged is actually really helpful to understanding how these models think and how you can consider thinking. So for example, you could show me every decision GROC 4.20 made on Tesla after the Fed announcement or something. like that and it'll walk you through a chain of thrott and if anything make you into a better investor would i trust the model with my own money maybe a little bit maybe with the small sample size how much would you get it is that's a good question i'd give it a couple thousand dollars to play around
Starting point is 00:20:05 with and see what happens i think that that would be interesting and fun and it's it's low enough stakes but i would trust it enough to not lose it like i'd say i would probably trust grock more with my money than i would the average day trader off the street, which credit they don't have a very good reputation, but I think there is some sort of an edge there that doesn't exist in the average person. And if you assume that these models are going to continue to get better and better, well, you have to assume that they're going to form some sort of an edge, but I don't know how much. It's an interesting question. Because as a quant trading fund, too, if your job, or as just a trader in general, if your job is to make money off of trading, what are you doing about
Starting point is 00:20:46 this information. Are you leaning into AI? Are you trying to get these models to help with your information flows and make decisions? Are you using them to help you actually transact trades? Or are you just kind of looking the other way and saying, oh, this is just a dumb experiment to benchmark models? There's no actual signal here. And the answer is probably somewhere in the middle, right? Yeah, I mean, well, my initial reaction to that is, okay, quant funds already use algorithms. It would make a lot of sense if they started using AI algorithms, right? If you could get a smarter algorithm, them to trade for your fund? Absolutely. Right. So it's a no brainer to me that these hedge funds, quant funds are going to be using AI, probably already using AI. Where I have maybe a hot take is that
Starting point is 00:21:26 the transparency is just a nice to have. It is no way going to win in the best of models. Why? Because if you have an AI model that is like better than all the other AI models at trading, why would you make the out public? Right. So like I'm kind of like at ties between this thing because I think the transparency is a really good thing in kind of like bringing up the floor of trading credibility for people that get access to this type of information. Like I have loved reading through these kind of like trade logs here, seeing how each model thinks and being like, okay, yeah, wow, I actually didn't think about that myself when I was buying that stock, right? And these are like stocks that I've seen that I can buy, right? The Amazon trade, the Ambidia trade.
Starting point is 00:22:06 I'm just like, oh, okay, I didn't think about that, right, yesterday whenever they made this trade. but if I am a hedge fund, I'm like, yeah, if I fine-tuned a model that is like beating all these models, I don't really want to expose that really. So it's kind of like a push and pull. The other thought I had, Josh, is, and maybe this is kind of like kind of semi-adjacent to what we're discussing here, I couldn't get the thought out of my head that if you could get GROC in X trading some kind of money for you or guaranteeing you like a five to 10% annual return, that is something that I would, like, if framed correctly, I would put some money into, right? Maybe not over two weeks, but maybe over an adjusted kind of yearly period would be super cool to see. Yeah, that's such a, it's such a fun question to ask is like, what happens when this kind of system runs for two years? But with your, like, let's say it's a large pension management fund and they just want a manager that doesn't take fees and does a pretty good job. Like, is there going to be enough trust in these systems to reliably place money at scale with them. And you have to assume, given the signal this early on,
Starting point is 00:23:10 that the answer will be yes. The question is how much of a yes will it be? What a percentage of management will be AI as it gets better over time? And the sample size sucks. I wish it was more than two weeks. I wish it was two years. But in two years from now, think about the progress we're going to see and what type of impact that's going to have on trading models. So this is, it's interesting. It's fascinating. In fact, I'm really curious to actually run this experience. for ourselves. I'd love to try to come up with a little trading model that runs in these things and test it out because it's fun and there is some sort of an edge there. I would say, okay, if I would to summarize my lesson from this entire competition or experiment so far, Josh,
Starting point is 00:23:49 it is I'm not convinced to give AI models money to trade, but I am convinced to use AI models to help me trade. So kind of like a human and AI model kind of work together and kind of become a better trader overall, I think is the main takeaway for me here. Do you share the same? It's funny. I mean, this is how agents work today, right? Like the, if you go on chat GPT and you say, go book me a reservation, it'll take you to the finish line and then you as the human provide the final filter and approve or deny. And I think that's probably the happy middle ground, while we still don't really trust these models too much, is give me the thesis, give me the trade. I will either approve or deny, and that's how the money gets managed.
Starting point is 00:24:33 So it's cool. This is a great experiment. I love that we got season 1.5. I mean, it's fascinating. Even more fascinating is that we have an early look at GROC 4.2, which by all means is the best trading model in the world. Where will it rank in the other benchmarks? We will see.
Starting point is 00:24:47 We will be covering it as soon as it comes out. But I guess that's really it for this episode on season 1.5. The question I want to leave everyone else with is, I mean, would you trust an AI with your part of the portfolio? Like, how much money would you actually give to an AI currently? Grog 4.2, who just made 60% in two weeks in one of these training competitions? Is that enough for you to risk your money? Or is it still just this dumb AI system that you don't really trust?
Starting point is 00:25:12 Well, if you're interested in this experiment, Josh and I were actually discussing about potentially giving you guys a tutorial on how to use an AI to trade money for you and kind of like an experiment, this own end of one experiment, but our own. but we want to get a little more signal from you guys. Let us know in the comments whether this is something that you'd be interested in seeing. And I have, Josh, I have a requirement for the listeners.
Starting point is 00:25:40 What do you got? If we do want to put the tutorial out, our last video that we did on AI trading reached 100,000 views and 3,000 likes. So I'm not going to ask for the 100,000 views, but I will ask for the likes. If this video can get more than 3,000, if it gets 3,000 likes, we will definitely put out that tutorial by the end of the year. And we have a lot of thoughts around this about how we're going to do it. We're super excited to do it. So help us get there.
Starting point is 00:26:08 It is another week of really exciting news. Josh, I don't know if you saw the rumors. Did you see the rumors about Open AI? Tell me. Tell me in. About Open Air releasing a potential new groundbreaking model. As a matter of fact, the polymarket is showing that Open AI is very favored to release the best model of the year. And last I checked, Gemini is the best model of the year.
Starting point is 00:26:27 So that implies we're getting something big in the next few weeks. I think we will. And like you said, the Polymarket is kind of like revealing its hands. So maybe there's some inside information coming out here. So there certainly is. Kind of stay tuned to Limelis. Put the notifications on guys and also subscribe. If you want to get the latest videos, we put out the best content out there.
Starting point is 00:26:45 It's not, it's unchallenged right now. Josh and I are sitting here unchallenged. You have to like and subscribe if you want to get our content on your feed. Thank you so, so much for listening. Again, let us know what you thought of this episode in the comments. Get that like number up and we will see you on the next one.
