Limitless: An AI Podcast - Revealing Elon’s Secret AI Trading Bot: Is It Worth It?

Episode Date: December 9, 2025

The groundbreaking Alpha Arena experiment pitted eight AI trading models against each other. Grok 4.2 emerges as the standout winner, achieving 60% profit in just two weeks despite the volatility that affected many competitors.

What does this experiment mean for you? Looking at the models' strategies and behavioral patterns, we need to question the balance between AI trading success and necessary human oversight.

------
🌌 LIMITLESS HQ: LISTEN & FOLLOW HERE ⬇️
https://limitless.bankless.com/
https://x.com/LimitlessFT

------
TIMESTAMPS
0:00 Intro
1:39 Season 1 Results
2:39 Transition to Season 1.5
4:22 Mystery Model Revealed
5:55 Competition Breakdown
8:09 Insights from Competition
9:56 Model Trading Styles
12:16 AI Personalities in Trading
14:11 Comparing Model Performances
16:36 Limitations and Future Potential
19:53 Trusting AI with Investments
24:20 Future of AI Trading Tutorials

------
RESOURCES
Josh: https://x.com/JoshKale
Ejaaz: https://x.com/cryptopunk7213

------
Not financial or tax advice. See our investment disclosures here:
https://www.bankless.com/disclosures

Transcript
Starting point is 00:00:00 Imagine this. You give eight of the world's most powerful AI models $10,000 each and tell them, go trade real stocks. No paper trading, but real money with real risk. And two weeks later, most of them have lost a painful amount of cash, which I guess is kind of expected. The kind of drawdowns that would get a human portfolio manager totally fired. But now they ran the same experiment again, except this time with much higher stakes. There's $320,000 at stake. And we've talked about Alpha Arena before in a previous episode, which I highly recommend checking out. But now we have the new results from the new season, season 1.5. And what was exciting is that there was a very clear and obvious winner, but that winner was a mystery.
Starting point is 00:00:38 We don't actually know, or we didn't know, who the winner was up until recently. In fact, it won all four of the trading competitions in this new season while leaving the other top models like ChatGPT 5.1 and Google Gemini 3.0 fighting for second place. So at the core of this is, one, who is this model? And two, how on earth did they do it? How are they outperforming everyone, so much so as to make 65% in two weeks in one of these competitions? So Ejaaz, I want to walk everyone through what just happened, what the model is, and what Alpha Arena is.
Starting point is 00:01:11 So give us the lowdown on who this was that made so much money. Oh, yeah. Well, we will get into all of that today. So Alpha Arena is basically a competition or test to see how well AI models can trade. And they do this in a few different ways, Josh. Number one, they give each model $10,000, as you mentioned. And then they allowed them to trade a range of different financial instruments over a period of two weeks. So there's like a season, two weeks, and we see which AI models do the best.
Starting point is 00:01:43 And they get all your AI models in there. You've got ChatGPT. You have got Gemini. You've got Anthropic's Claude and you have Grok as well. And so they've gone through about two seasons now and the results have been absolutely crazy. So they started off with season one. And you can think of this as like the degen crypto season. They gave seven models $10,000 each and allowed them to trade crypto assets like Bitcoin,
Starting point is 00:02:08 Ethereum, stuff like that. And they did this in something called perpetuals, so they could leverage trade; it was the only instrument that they were allowed to use. And the results were, as you'd probably expect, a lot of these AI models lost a lot of money. Some of them actually ended up making a decent chunk of money. And they were primarily Chinese models. They were Qwen and, I think, DeepSeek that ended up making money. So there were a lot of takeaways there.
Starting point is 00:02:33 As you mentioned, we've got a previous episode where we spoke about this. Definitely go give that a watch. There's a lot of alpha in that one. And then that brings us to season 1.5, where the AI models, instead of being given crypto to trade, were given the ability to trade U.S. stocks. We're talking about equities, which is something that a lot of us listening to this show are very familiar with. And I think this is for a few reasons, Josh.
Starting point is 00:02:57 Primarily, crypto is very volatile, and we kind of want to figure out how the majority of money that is traded in the financial markets can translate into AI models trading that. So one thing that they kept the same is that they gave each AI model $10,000. But there were a number of differences with season 1.5. Number one, they were allowed to trade U.S. equities and stocks. Number two, there were two new models that were introduced. One was a model called Kimi K2, which is a really good open source Chinese model, but the other was this thing called a mystery model.
Starting point is 00:03:29 I'm going to reveal which model this was in a second. But before I do, do you have any guesses as to what model this might have been? Well, I cheated. I know the answer. But what I think is very exciting about this, and I think it's important to highlight, is that these models made hundreds to even thousands of trades per model. Yes. And what we want to answer, like the question that I want answered more than the mystery model is like,
Starting point is 00:03:52 is this real signal or is this just, I mean, I said earlier, is this a GPU-intensive scratch-off game, or is there any real signal? And I guess we'll talk about the reality of that and what this means for your portfolio if you ever want to manage it. But to me, I think that's the important thing to highlight. We probably should just spill the beans, you guys. Do you want to tell them? Who's in this room? I have to. I can't keep it in any longer. It was an unofficial version of Grok, aptly named Grok 4.2, or 420 for the memers out there. And this was revealed by none other than the Grok man himself, Elon Musk. And the reason why this mystery model was getting so much attention, Josh, was because it ended up being the winner.
Starting point is 00:04:35 It made the most money out of all the AI models. And what was more impressive is there wasn't just one competition being run throughout season 1.5. There were four at the same time. So these AI models were running across four different competitions at the same time. That was $320,000 at any one instance, which is a crazy amount of money to stake on an experiment. That's a lot of money that could have been lost here. And Grok 4.20 ended up performing the best. Josh, I want to go through a few different stats here, which kind of show how amazing this particular model was.
Starting point is 00:05:18 So firstly, for some context, there were four different competitions that were being run that these AI models were being tested on. Competition number one was something called new baseline. This is basically the ability for these AI models to get access to trading AI stocks, to get access to all the common news that you and I can read online and in newspapers, to kind of like figure out, okay, what kind of news would affect my stock positions. They would also get access to sentiment data to see how the markets and retail traders would kind of react to certain bits of news. They had access to a much wider spread of data in competition number one.
Starting point is 00:05:56 Competition number two was called Monk Mode. They kind of amended the investing prompt here, and so the models traded more conservatively. Competition number three was called situational awareness, Josh. So each model had an awareness of other models trading and where they ranked relative to them. So there was this kind of like ecosystem of peer pressure being put on by each model. And competition number four was just outright degeneracy.
Starting point is 00:06:20 Max leverage. You could only trade with like 20 to 50x leverage, which is just kind of, I don't think it was 50x, but like 30x. Just a crazy amount of risk adjustment to test whether a model would take that risk or whether it would trade more conservatively. Josh, do you have any reactions on the results of this competition? The results that we're looking at right now, actually, I found most interesting. This is from the new baseline competition. It's basically the full info mode. And one of the big differences between this mode versus previous competitions that have been held is, like you mentioned earlier, it has access to a lot of data. This is the first time an AI trading model has had access to real-time information outside of
Starting point is 00:07:00 just looking at a chart. So I think in that sense, this is the closest competition to how a human quant fund would actually operate. So if you're looking for high signal in terms of which AI can actually make you real money in the real world, this is the one. And what we're seeing here is that the Grok 4.20 model, the memetic mystery model, outperformed, by like a fairly large margin, OpenAI's ChatGPT 5.1, which is the clear second place. And those are the only two that actually made profit. Everybody else lost money in the real world competition, which to me signals a few things. One of them being, well, perhaps one is really good at understanding real world information.
Starting point is 00:07:38 Perhaps it understands company fundamentals better. Perhaps it just has access to real world information that's better, like Grok having access to the xAI model. So there's a lot of things to speculate on here. But for me, the new baseline chart that we're looking at right now was the high signal one. I'm like, oh my God, wait, this has the same type of information flows that I'm now getting. So now we're even. We're on the same playing field.
Starting point is 00:08:01 Okay. I actually had a different answer to that, which is I was more impressed, Josh, by the situational awareness competition. So this was a competition where each model had access to data and news, but they also had awareness of who they were competing against. So Grok 4.20, the winner, knew that GPT-5 was in second place. And so he was always keeping an eye on GPT-5 being like, oh, what trades is GPT-5 making? Why did they make that trade? Oh, that's interesting.
Starting point is 00:08:32 And then he would look at Gemini and be like, oh, what trades is Gemini making? So he would have this awareness of his competitors, which you didn't have in season one where they were just kind of like trading in silos, right? And why this competition was so interesting, Josh, is this was technically where Grok 4.20 made the most money. In fact, if you look at the top of this leaderboard right here, the account value at the end of season 1.5 was $16,656.5, which is technically a 60% plus return in two weeks
Starting point is 00:09:07 on $10,000 worth of capital. I needed to take my money immediately. Isn't that insane, right? If you had to pick a competition where you would have given an AI model money, just given this data, and I'm not saying you should do that, you would be most bullish on situational awareness. And I'm going to like kind of draw some implications here that I haven't tested yet, but it seems to imply that this kind of competitive nature where the models were kind of aware of and exposed to their competitors' trades and thinking, and we're going to get to the model
Starting point is 00:09:40 chat thinking in a second, seems to have given them a better trading advantage, at least in some cases. Yeah, so like you mentioned, one of my favorite parts, I think we share this, one of our favorite parts about this competition in particular, is that you can actually see all of the trades. One thing about these private quant funds, you don't know what the hell is going on, but with these models, you can see exactly what they're thinking every time they think and make a decision. So maybe you guys can go through a few of them and see kind of what the model is thinking, how they're processing this real world data, and if there's any tips for us to learn from it, because clearly they're a much better trader than I am. Yeah, so I have a few examples pulled up here on the right side of the screen. It's under model chat. By the way, any of you listening to this can go onto this website and see for yourself and scroll through the hundreds and hundreds of posts. But it basically gives us an insight into how each model
Starting point is 00:10:30 thinks about a trade that they currently either have open or they're thinking about opening or closing or whatever that might be, right? So it's like being in the mind of an actual investor and figuring out how they make their decisions. An example here at the top of the screen is Gemini 3 Pro. He goes, I'm betting on a breakout in Nvidia, seeing a strong setup as it holds support and leads the market, with a target of $189 and a stop just below 180. So what he's referring to there is kind of a typical quant style of trading where he's looking at technicals, he's evaluating kind of graphs, momentum of the stock price. It's a very price-driven type of trading, right Josh? But if you look just below it, you've got GPT-5.1, which actually
Starting point is 00:11:14 came in second at the end of this competition, who goes, my analysis indicates continued strength in AI names like Nvidia and Microsoft. So I'm holding on to existing long positions through the weekend and potential macro event risk. Now, the point I want to make about this particular model is it's less price specific and it's more focused on just kind of general themes, news and data that it's seeing outside of price. And that really goes to demonstrate that some of these models are very kind of price and quantitative focused, whereas other models are kind of more thesis driven over a shorter period of time. And it kind of gives rise to these types of personalities, right, Josh? Yeah, well, now we have to answer the uncomfortable question, which is
Starting point is 00:11:58 like, is this evidence that Grok is some kind of money printing god, or is this just like really well-produced content that happens to involve real money? And that kind of comes down to understanding the AI, understanding the personalities, understanding how each model considers these trades and how they place themselves in different positions. So I kind of want to go through, one by one, all of the models and kind of what their personalities are like. We see with DeepSeek a lot that it behaves, and we mentioned this on a previous episode as well, it behaves like a very disciplined quant fund. And DeepSeek, for those that don't know, is an open source Chinese model. It is very systematic, very comfortable with leverage, but able to hedge and adjust mid-trade based on its
Starting point is 00:12:39 decisions and new information. So that's DeepSeek, and Qwen even is kind of similar to this. If you remember from the last episode, Ejaaz, Qwen was my early favorite. I had hoped that Qwen was going to win. Unfortunately, that's not the case at all in season 1.5. Qwen has gotten crushed right there with DeepSeek. I can kind of imagine it as like more similar to me,
Starting point is 00:12:59 maybe that's why I resonated with it, where it has one big thesis and then it sizes aggressively around that thesis. So if you remember, Qwen would only buy Bitcoin or Ether in the last one. And it wouldn't buy any other altcoins. It just had a thesis that these major coins were going up. Nothing else was. Claude is interesting.
Starting point is 00:13:14 It's very reflective of how the actual Claude model works when you engage with it. It's very patient and it's thoughtful, but it occasionally sizes up too much, and then it gets crushed by leverage. So, as we go through these, and Ejaaz, I also noticed, you assigned a masculine personality to Gemini. You said he, when you were talking about Google Gemini. And that's kind of because it's daddy, right? Like, Gemini has been the big boy on top. But in this trading competition, I don't know if it is. I was going through the trades, and it very much panic flip-flops from short to long after losing.
Starting point is 00:13:45 And in a way, Gemini was most reflective of retail behavior. And I'm not sure what we could tie that to, but Gemini was very reactionary, where if it lost money, it would flip its position. And if it gained money, it would kind of hedge quickly. So that was interesting. And then we have GPT-5, which has very sophisticated reasoning. But in season one, they over-traded and over-leveraged and got absolutely wiped out. And they were very timid in the way that they went about this. So that's kind of how you can think about these.
Starting point is 00:14:11 The final one, which is the secret model, Grok 4.2. If we know anything about Grok, we know that it is a very high risk taker, but a calculated risk taker. And that's probably what put it at the top there. So that's kind of how I would consider all of these models. They're a little different. And they are reflective, so if you've used these in person, you can kind of understand the thinking that gets placed behind the trades.
Starting point is 00:14:31 Yeah. I want to dig into a few things around the personality, or rather the trading styles here, Josh, because it may not be as explicit as we kind of lay it out. So Grok 4.20 was the winner, right, by far. And it made money. It was the top across all of the competitions, all four competitions. That's great. But did you look at the results of Grok 4?
Starting point is 00:14:53 Its predecessor. It was the worst. It was the worst performing model in this entire competition, which is crazy because in season one, where it was trading crypto, it came in at second or third. And for about 75% of the competition, Josh, it was number one. So it had some kind of an advantage trading kind of very riskily, right? And that might be because of the nature of the instruments that it was trading.
Starting point is 00:15:22 Crypto is very volatile. And it was kind of going blaze. So when it was like 20x bullish Bitcoin, it benefited a lot when the Bitcoin price went up. But obviously it, like, suffered when it went down. It's interesting to see the difference between these two models in 1.5, right? Grok 4.20, the winner, seems to be a kind of more mature version of Grok 4. It seems to be thinking more about its trades. It has more kind of like risk percentiles and boundaries in place, whereas Grok 4 seems to be its kind of usual degenerate self. And I don't know how much of that
Starting point is 00:15:56 is reliant on the fact that it's trading stocks, which is generally a less volatile market, versus Grok 4.20 being a more thesis-driven, sensible trader, as you kind of described. The other one that we have to call out, because it's the elephant in the room here, GPT-5 came in at second in season 1.5, right? 5.1. 5.1. Sorry, 5.1, right? In the previous season, season 1, it was the second worst performing? No, sorry, it was the worst performing.
Starting point is 00:16:27 It was horrible. It was GPT-5. It was an abomination. And Gemini. So whatever OpenAI has cooked up in the .1, congrats, because you must have trained it on some kind of financial data or you've implemented a kind of risk trading strategy that made it a lot more sensible, because it made some really great trades this season.
Starting point is 00:16:48 So just two different kind of like jumps from season one to 1.5 that I had to call out. Yeah, it makes me excited to see these significant improvements with incremental models, because we normally talk about 5 to 5.1 being pretty marginal. Like there's nothing really noteworthy or exciting. And yet the results, in the small sample size at least, are pretty reassuring that, hey, there is something new going on under the hood. And maybe this is an appropriate time to address the, I guess, the limitations, the kind of bear case of this, starting with the sample size. We do have to say, I mean, this is two weeks, Ejaaz.
Starting point is 00:17:22 This is not a long time. They placed some trades. Some people maybe got lucky. Some models maybe did not. Is there any real signal here? I'm curious about your take. Do you think this is reflective of future performance? Like, what is here that's actually valuable versus what is here that's actually kind of lucky? I don't think we have enough information to make that call. At least for me, I'll speak for myself personally. The real test is, you know, I asked myself before we recorded this episode, would I give my money to Grok 4.20, the winner, that won across all categories? And the simple answer is like, no. I don't know if it's going to repeat that over week three, week four, week five. It was only two weeks, to your point, right? So I want to see this
Starting point is 00:18:04 experiment kind of rehashed like a million times before I'm like, okay, that's cool. Even then, it's still kind of like risky, right? It's like I can justify giving my money to a human that I can kind of relate to, that I can call up and speak to, less so when it comes to an AI model, right? But maybe my thinking needs to kind of evolve. The other way I'm thinking about this is there's just a lot of unknowns around this, Josh, right? Like I can see its thinking. I can see kind of like how the model kind of completes its trades. But I don't really know what's going on under the hood. Is this just kind of like a pattern matching thing?
Starting point is 00:18:42 Does it inherit the risks that a lot of humans have already taken? Because it's trained on the same kind of corpus of trading data that we have kind of evaluated on. Or is it kind of net better? Do you feel the same or? Yeah, it's probably, I mean, it's not the new gold standard of AI benchmarks. But it is a standard that I think is interesting because this is a benchmark that happens in the real world, with real dynamic data that cannot be gamed. So in that case, I love it. But I saw one writer, they called it Schrödinger's benchmark, because it's simultaneously serious and degenerate at the same
Starting point is 00:19:16 time. And it's like it's entertainment with real money that happens to produce some legitimate insights about AI behavior, but it's not really indicative of future returns at this small of a sample size at least. And that's kind of where I land on it. There is one breakthrough that we mentioned earlier that does provide real value, which is the transparency. Every trade being on chain and every step of reasoning being logged is actually really helpful to understanding how these models think and how you can consider thinking. So, for example, you could say, show me every decision Grok 4.20 made on Tesla after the Fed announcement or something like that. And it'll walk you through its chain of thought. And if anything, make you into a better investor. Would I trust the model
Starting point is 00:19:55 with my own money? Maybe a little bit. Maybe with a small sample size. How much would you give it? That's a good question. I'd give it a couple thousand dollars to play around with and see what happens. I think that would be interesting and fun. And it's low enough stakes, but I would trust it enough to not lose it. Like I'd say, I would probably trust Grok more with my money than I would the average day trader off the street.
Starting point is 00:20:21 Which, granted, they don't have a very good reputation. But I think there is some sort of an edge there that doesn't exist in the average person. And if you assume that these models are going to continue to get better and better, well, you have to assume that they're going to form some sort of an edge, but I don't know how much. It's an interesting question. Because as a quant trading fund, too, if your job, or as just a trader in general, if your job is to make money off of trading, what are you doing about this information?
Starting point is 00:20:47 Are you leaning into AI? Are you trying to get these models to help with your information flows and make decisions? Are you using them to help you actually transact trades? or are you just kind of looking the other way and saying, oh, this is just a dumb experiment to benchmark models. There's no actual signal here. And the answer is probably somewhere in the middle, right? Yeah, I mean, well, my initial reaction to that is,
Starting point is 00:21:06 okay, quant funds already use algorithms. It would make a lot of sense if they started using AI algorithms, right? If you could get a smarter algorithm to trade for your fund, absolutely, right? So it's a no-brainer to me that these hedge funds, quant funds are going to be using AI, probably already using AI. Where I have maybe a hot take is that the transparency is just a nice to have. It is in no way going to be in the best of models. Why?
Starting point is 00:21:33 Because if you have an AI model that is like better than all the other AI models at trading, why would you make it public? Right. So like I'm kind of like torn on this thing, because I think the transparency is a really good thing in kind of like bringing up the floor of trading credibility for people that get access to this type of information. Like I have loved reading through these kind of like trade logs here, seeing how each model thinks and being like, okay, yeah, wow.
Starting point is 00:21:58 I actually didn't think about that myself when I was buying that stock, right? And these are like stocks that I've seen that I can buy, right? The Amazon trade, the Nvidia trade. I'm just like, oh, okay, I didn't think about that, right, yesterday whenever they made this trade. But if I am a hedge fund, I'm like, yeah, if I fine-tuned a model that is like beating all these models, I don't really want to expose that really.
Starting point is 00:22:18 So it's kind of like a push and pull. The other thought I had, Josh, and maybe this is kind of semi-adjacent to what we're discussing here, is I couldn't get the thought out of my head that if you could get Grok in X trading some kind of money for you or guaranteeing you like a 5 to 10% annual return, that is something that, like, if framed correctly, I would put some money into, right? Maybe not over two weeks, but maybe over an adjusted kind of yearly period it would be super cool to see. I don't know. Yeah, it's such a fun question to ask, like, what happens when this kind of system runs for two years? But with your, like, let's say it's a large pension management fund and they just want a manager that doesn't take fees and does a pretty good job. Like, is there going to be enough trust in these systems to reliably place money at scale with them? And you have to assume, given the signal this early on, that the answer will be yes. The question is how much of a yes will it be? What percentage of management will be AI as it gets better over time? And the sample size sucks. I wish
Starting point is 00:23:21 it was more than two weeks. I wish it was two years. But in two years from now, think about the progress we're going to see and what type of impact that's going to have on trading models. So this is, it's interesting. It's fascinating. In fact, I'm really curious to actually run this experiment for ourselves. I'd love to try to come up with a little trading model that runs these things and test it out, because it's fun and there is some sort of an edge there. I would say, okay, if I were to summarize my lesson from this entire competition or experiment so far, Josh, it is I'm not convinced to give AI models money to trade, but I am convinced to use AI models to help me trade. So kind of like a human and AI model kind of work together and kind of become a better trader overall, I think is the main takeaway for me here. Do you share the same?
Starting point is 00:24:09 It's funny. I mean, this is how agents work today, right? Like, if you go on ChatGPT and you say, go book me a reservation, it'll take you to the finish line. And then you as the human provide the final filter and approve or deny. And I think that's probably the happy middle ground, while we still don't really trust these models too much: give me the thesis, give me the trade, I will either approve or deny, and that's how the money gets managed. So it's cool. This is a great experiment.
Starting point is 00:24:34 I love that we got season 1.5. I mean, it's fascinating. Even more fascinating is that we have an early look at Grok 4.2, which by all means, is the best trading model in the world. Where will it rank in the other benchmarks? We will see. We will be covering it as soon as it comes out. But I guess that's really it for this episode on season 1.5. The question I want to leave everyone else with is, I mean, would you trust an AI with part of your portfolio? Like, how much money would you actually give to an AI currently? Grok 4.2, which just made 60% in two weeks in one of these trading competitions? Is that enough for you
Starting point is 00:25:07 to risk your money? Or is it still just this dumb AI system that, you know, you don't really trust? Well, if you're interested in this experiment, Josh and I were actually discussing potentially giving you guys a tutorial on how to use an AI to trade money for you, kind of like an experiment, an n-of-one experiment of our own. But we want to get a little more signal from you guys. Let us know in the comments whether this is something that you'd be interested in seeing. And Josh, I have a requirement for the listeners if we do want to put the tutorial out. Our last video that we did on AI trading reached 100,000 views and 3,000 likes. Biggest video ever.
Starting point is 00:25:50 Thank you. I'm not going to ask for the 100,000 views, but I will ask for the likes. If this video can get more than 3,000, if it gets 3,000 likes, we will definitely put out that tutorial by the end of the year. And we have a lot of thoughts around this
Starting point is 00:26:04 about how we're going to do it. We're super excited to do it, so help us get there. It is another week of really exciting news. Josh, I don't know if you saw the rumors. Did you see the rumors about OpenAI? Tell me. Fill me in.
Starting point is 00:26:15 About OpenAI releasing a potential new groundbreaking model. As a matter of fact, Polymarket is showing that OpenAI is very favored to release the best model of the year. And last I checked, Gemini is the best model of the year. So that implies we're getting something big in the next few weeks. I think we will. And like you said, Polymarket is kind of like revealing its hand. So maybe there's some inside information coming out here. So there certainly is.
Starting point is 00:26:38 Kind of stay tuned to Limitless. Put the notifications on, guys, and also subscribe if you want to get the latest videos. We put out the best content out there. It's not, it's unchallenged right now. Josh and I are sitting here unchallenged. You have to like and subscribe if you want to get our content on your feed. Thank you so, so much for listening.
Starting point is 00:26:55 Again, let us know what you thought of this episode in the comments. Get that like number up and we will see you on the next one.
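
The headline numbers quoted in the episode can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes eight models (the count given in the episode description), $10,000 per model in each of the four parallel competitions, and the $16,656.5 ending account value read off the situational awareness leaderboard. The constant and function names are made up for this example and are not part of Alpha Arena.

```python
# Figures as quoted in the episode; treated as exact for this check.
STARTING_CAPITAL = 10_000.0  # dollars given to each model in each competition
NUM_MODELS = 8               # model count from the episode description
NUM_COMPETITIONS = 4         # competitions run in parallel in season 1.5
ENDING_VALUE = 16_656.5      # quoted account value of the situational awareness winner


def total_capital(per_model: float, models: int, competitions: int) -> float:
    """Total money deployed across all models and competitions at once."""
    return per_model * models * competitions


def simple_return(start: float, end: float) -> float:
    """Fractional gain or loss over the period, ignoring fees and compounding."""
    return (end - start) / start


capital = total_capital(STARTING_CAPITAL, NUM_MODELS, NUM_COMPETITIONS)
ret = simple_return(STARTING_CAPITAL, ENDING_VALUE)

print(f"Capital at stake: ${capital:,.0f}")
print(f"Two-week return:  {ret:.1%}")
```

Under these assumptions the total matches the $320,000 mentioned in the episode, and the return works out to roughly 66.6%, which is consistent with the "60% plus" and "65%" figures the hosts quote from memory.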
