Y Combinator Startup Podcast - GPT-OSS vs. Qwen vs. Deepseek: Comparing Open Source LLM Architectures

Starting point is 00:00:00 OpenAI recently dropped GPT-OSS, its first open-weights model since GPT-2 in 2019. It's one of the highest-profile open source model launches since DeepSeek R1 made waves back in January. But how does GPT-OSS compare to the other top open source models out there architecturally? Let's find out. GPT-OSS is one of OpenAI's most anticipated recent launches, a large, fully open-weights model from one of the leading American AI labs. Let's take a closer look at the paper to find out how it was actually engineered and trained. GPT-OSS is a mixture-of-experts model available in two sizes, 120 billion parameters and 20 billion parameters.

Starting point is 00:00:42 Each token activates the top four experts, meaning only a portion of the total parameters are used at any given time. This allows for efficient inference without sacrificing the benefits of a larger model. Trained as a decoder-only transformer, GPT-OSS incorporates plenty of features typical of modern LLMs. This includes grouped query attention, a modified attention mechanism that lets multiple query heads share the same key-value pairs to reduce memory use and speed up inference. It also includes SwiGLU activations in the feed-forward network layers, which allow for more nuanced transformations than simpler activations like ReLU, as well as rotary positional embeddings, or RoPE, which encode token position directly into the attention mechanism to support longer contexts. Finally, the model also makes use of RMSNorm with pre-normalization,

Starting point is 00:01:24 a normalization method that scales inputs by their root mean square for more stable training. One standout capability of the model is its 131,000-token context window, which it achieves by applying YaRN scaling during pre-training rather than as an inference-time adjustment. We'll touch on what this means a little bit later in the video. For GPT-OSS, OpenAI makes use of their open source o200k_harmony tokenizer. This byte pair encoding tokenizer has over 200,000 tokens and builds on the o200k tokenizer used in models like GPT-4o. As for the dataset GPT-OSS was trained on, OpenAI has only disclosed the broad strokes. The model was trained on a text-only corpus in the trillions of tokens, with a focus on STEM, coding, and general knowledge.
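
To make the "top four experts" idea from earlier in this segment concrete, here is a minimal sketch of top-k expert routing. It is an illustration, not OpenAI's implementation: the toy experts, the router logits, and the choice to renormalize with a softmax over only the selected experts are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=4):
    """Pick the k highest-scoring experts and renormalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_logits, k=4):
    """Weighted sum of the outputs of only the selected experts."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy setup: 8 "experts", each just scales its input.
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5]

chosen = route_top_k(logits, k=4)
output = moe_layer([1.0, 2.0], experts, logits, k=4)
```

Only four of the eight toy experts run for this token, which is where the inference savings of a mixture-of-experts model come from.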

Starting point is 00:02:04 Harmful content was filtered out for safety, but beyond that, there's little else known publicly. Once training was complete, the model was released in a quantized format by default, making it lightweight enough for deployment on modest hardware. This allows it to be run on consumer-grade GPUs, laptops, or other resource-limited hardware. However, there's no unquantized version available. GPT-OSS also underwent substantial post-training for safety and alignment, shaping its default behavior for more controlled outputs. It's worth noting that some in the open source community are experimenting with reducing or removing these limits

Starting point is 00:02:34 in order to explore the raw model's capabilities. In the broader landscape of open source AI, GPT-OSS arrives as a fully equipped, long-context model ready for immediate use. As impressive as it is, however, it's just one of several models in a rapidly expanding field of open source LLMs. Qwen 3, the newest family of models developed by Alibaba Cloud,

Starting point is 00:02:54 dropped this past April to considerable hype, with benchmark scores that rivaled those of leading open source base models like DeepSeek V3 or Llama 4. The Qwen 3 family includes both dense models, which activate all of their parameters for each query, and mixture-of-experts models, which only activate a small subset of their parameters for each query. The dense models come in six different size classes,

Starting point is 00:03:13 including a 0.6-billion-parameter model, one of the smallest current-generation open-weight models around, while the MoE models come in two different size classes. Architecturally, Qwen 3 dense models are very similar to the Qwen 2.5 models, Alibaba's previous releases. Like Qwen 2.5 and GPT-OSS, Qwen 3 incorporates features like grouped query attention, SwiGLU, RoPE, and RMSNorm.
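
Of the shared building blocks just listed, RoPE is the one that comes up again later in the long-context discussion, so a small sketch may help. This is a minimal pure-Python illustration under assumed values (a 4-dimensional vector, base 10,000, adjacent dimensions paired); real implementations work on tensors and sometimes pair dimensions differently.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle that grows
    with the token position; later pairs rotate more slowly."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3 tokens apart) at two different absolute positions.
score_near = dot(rope(q, 10), rope(k, 7))
score_far = dot(rope(q, 103), rope(k, 100))
```

The two scores agree up to floating-point error, because the rotated dot product depends only on the distance between the two tokens. That is what it means for position to be encoded directly into the attention mechanism.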

Starting point is 00:03:33 Qwen 3's sparse models share the same fundamental architecture as its dense models, but add a mixture-of-experts layer with 128 total experts, of which eight are activated per token. All Qwen 3 models also use the same tokenizer used in previous Qwen models, which implements byte-level byte pair encoding that allows it to handle any text or symbol without special pre-processing, unlike word- or character-based tokenizers. One of the main things that sets Qwen 3 apart from previous Qwen models is the way it controls the scale of the key, query, and value projections to keep attention scores stable at scale. It replaces QKV bias, a static offset that shifts the QKV projections in previous models, with QK-Norm,

Starting point is 00:04:10 a normalization step that dynamically rescales the query and key vectors to maintain constant magnitudes. Dataset-wise, Qwen 3 was trained on 36 trillion pre-training tokens, twice as many as the Qwen 2.5 models. In addition to pulling data from multilingual texts, STEM and coding sources, and reasoning tasks, Qwen 3 also uses Qwen 2.5 models to generate trillions of tokens of synthetic data in different formats like textbooks, instructions, and code snippets. Qwen 3's pre-training occurred in three stages. In stage one, the general stage, models were trained on over 30 trillion tokens covering 119 languages at a sequence length of 4,096 tokens.
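
The QK-Norm idea described at the start of this segment can be sketched in a few lines. This is a simplified illustration, assuming RMS normalization of a single query and key vector with the learned gain left at 1; the vectors are made up, and details such as where the norm sits relative to RoPE are not covered here.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Rescale a vector so its root mean square is approximately 1."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product score for one query/key pair."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Hypothetical vectors; big_q simulates a query whose scale has blown up.
q = [0.5, -1.0, 2.0, 0.1]
k = [1.5, 0.3, -0.7, 0.9]
big_q = [100 * v for v in q]

normed = attention_logit(q, k)
normed_big = attention_logit(big_q, k)
raw = attention_logit(q, k, qk_norm=False)
raw_big = attention_logit(big_q, k, qk_norm=False)
```

Without normalization the score grows 100-fold along with the query. With QK-Norm it stays essentially unchanged and bounded, which is the stability property the segment describes.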

Starting point is 00:04:46 In stage two, the reasoning stage, models were trained on an additional 5 trillion higher-quality tokens featuring more STEM, reasoning, and coding problems. And in stage three, which the Qwen team calls the long-context stage, context length was extended to over 32,000 tokens using a bunch of clever algorithmic optimizations, including ABF, a technique to adjust RoPE so positional signals remain accurate over much longer sequences,

Starting point is 00:05:07 YaRN to further scale for longer inputs, and dual chunk attention to process sequences efficiently. Together, all of these optimizations allow the models to reason over much longer inputs at inference. Finally, Qwen uses a four-step post-training pipeline with two goals, giving users more control over how much reasoning to use for a given query, and letting them efficiently distill

Starting point is 00:05:26 larger model capabilities into smaller models. The first step in the post-training pipeline is a long chain-of-thought cold start stage, which involves feeding a model a curated dataset of challenging reasoning problems from math, logic, and STEM with verifiable reference answers, and then filtering outputs to ensure quality. This is followed by a reasoning RL stage using GRPO, an RL algorithm originally developed by DeepSeek researchers, on roughly 4,000 query-verifier pairs to strengthen complex problem solving. Personally, I think it's fascinating that it only takes 4,000 pairs to get great results.
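
As a rough illustration of how a query-verifier pair feeds GRPO, the sketch below scores a group of sampled answers and turns the rewards into group-relative advantages. The exact-match verifier, the sample answers, and the use of the population standard deviation are assumptions for the example; the policy-gradient update itself is omitted.

```python
import statistics

def verify(answer, reference):
    """Hypothetical verifier: reward 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to the other samples for the same query:
    subtract the group mean and divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One query, one reference answer, four sampled model answers (made up).
reference = "42"
sampled = ["42", "41", "42 ", "7"]

rewards = [verify(a, reference) for a in sampled]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to provide a baseline.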

Starting point is 00:05:58 The third step in the post-training pipeline, Thinking Mode Fusion, is a key Qwen 3 innovation that integrates reasoning and non-reasoning into a single model, letting users switch modes without changing models. Essentially, what developers did in this step was fine-tune the model on a mix of thinking data, which includes intermediate reasoning steps, and non-thinking data, which omits them, and then build a chat interface to let users toggle modes. Though this was unique to Qwen when the model first launched, GPT-5 now features a similar toggle. The final step, General RL, broadens capabilities in instruction following, formatting, preference

Starting point is 00:06:28 alignment, tool use, and specialized scenarios. Qwen's developers then use strong-to-weak distillation, which allows for the training of smaller models from larger ones. All in all, Qwen 3's performance is very impressive, especially given its relatively small size. But just months earlier, a different model had already raised the stakes in open source. Released in December of last year, DeepSeek's V3 model was one of the most ambitious open source LLMs to come out of a major lab in recent years. chatbot developed in China. It's called DeepSeek.

Starting point is 00:06:57 DeepSeek is such a fundamental change to the economics of what's going on. The most downloaded free app in the U.S. This is an update in what people think is possible. At 671 billion parameters, it's a massive general-purpose base model, designed for efficiency as much as capability, laying the groundwork for the reasoning-focused R1 model that would follow. We're not going to get into a ton of detail about V3's architecture or training pipeline here because we put out a comprehensive deep dive into it back in February. But at a high level, the thing to know about V3 is that it's a mixture-of-experts model with several hardware and algorithmic optimizations,

Starting point is 00:07:30 including training V3 natively in 8-bit rather than 16- or 32-bit, a huge unlock for cutting training costs. And just recently, DeepSeek pushed V3 even further with an updated version. The newly released V3.1 builds directly on the original V3 base checkpoint, extending it with a two-phase long-context training approach and adding a hybrid thinking mode that lets the same model switch between reasoning-heavy and lightweight inference. It also improves tool use and agent performance thanks to more advanced post-training. In practice, this means V3.1 keeps the same core architecture as V3, but delivers stronger reasoning, smarter tool use, and greater performance.

Starting point is 00:08:06 One thing that sets V3 apart is that it uses a different attention mechanism than GPT-OSS and Qwen 3. In modern LLMs, a lot of the compute and memory is tied up in the KV cache, and so V3 makes use of MLA, which compresses keys and values into a smaller latent space before caching them, then decompresses them during inference. Although MLA is a bit more complex to implement, the previous DeepSeek V2 paper found it delivers greater memory savings and better modeling performance than GQA,

Starting point is 00:08:32 especially in huge long-context models like this one. And that's just one of several areas where DeepSeek V3 takes a different path. With all that in mind, let's take a step back. From V3 to Qwen to GPT-OSS, how should we think about, at a high level, the differences between these models? One big difference is size.
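
Before moving on to size, here is a toy sketch of the MLA caching idea described above: cache one small latent vector per token, then rebuild keys and values from it when attention is computed. All dimensions and the random weight matrices are made up for illustration. It also leaves out parts of the real design, such as the separate positional component of the keys, so treat it as a picture of the memory trade-off rather than DeepSeek's implementation.

```python
import random

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_latent, n_heads, d_head = 16, 4, 4, 8  # hypothetical sizes

w_down = random_matrix(d_latent, d_model, rng)           # compress hidden state
w_up_k = random_matrix(n_heads * d_head, d_latent, rng)  # rebuild keys
w_up_v = random_matrix(n_heads * d_head, d_latent, rng)  # rebuild values

kv_cache = []  # holds only the small latent vector for each token

def append_token(hidden):
    """Compress the token's hidden state and cache the latent."""
    kv_cache.append(matvec(w_down, hidden))

def keys_and_values():
    """Decompress every cached latent back into full keys and values."""
    keys = [matvec(w_up_k, latent) for latent in kv_cache]
    values = [matvec(w_up_v, latent) for latent in kv_cache]
    return keys, values

for _ in range(3):
    append_token([rng.uniform(-1, 1) for _ in range(d_model)])

keys, values = keys_and_values()
floats_cached_latent = len(kv_cache) * d_latent
floats_cached_full = len(kv_cache) * 2 * n_heads * d_head
```

With these toy sizes the latent cache holds 12 numbers for three tokens, where caching full keys and values for every head would take 192.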

Starting point is 00:08:49 The Qwen 3 model family is the only one of the three to offer both dense and mixture-of-experts variants, with dense models from 0.6 billion to 32 billion parameters and a mixture-of-experts lineup that includes a 30-billion-parameter model and a 235-billion-parameter model. Notably, Qwen's mixture-of-experts base models match the dense models' performance with only a fifth as many active parameters. On the other hand, DeepSeek V3 only comes in a mixture-of-experts architecture with 671 billion parameters, of which 37 billion are activated for a given token prediction, so considerably

Starting point is 00:09:18 larger than even the biggest Qwen 3 model. GPT-OSS sits in the middle. It offers two MoE models, one with 117 billion parameters, of which 5.1 billion are activated for a given token, and a smaller one with 21 billion parameters, of which 3.6 billion are activated for a given token. One of the most interesting technical differences lies in how each model extends its context length. YaRN, short for yet another RoPE extension, is a technique for stretching the model's rotary positional embeddings so that it can handle far longer sequences than it was originally trained on. Normally, RoPE starts to break down when you feed it more tokens than its base frequency was set for,

Starting point is 00:09:50 but YaRN tweaks that frequency so the same embedding space covers much more ground. What's interesting is how the three models here use it differently. GPT-OSS applies YaRN right from pre-training, so its weights have learned to work natively with a 131,000-token context. DeepSeek takes a staged approach, fine-tuning after pre-training to first reach 32,000 tokens, then further training to achieve 128,000. Qwen also fine-tunes to 32,000, but skips that additional retraining step.

Starting point is 00:10:18 Instead, at inference time, they apply YaRN scaling again, increasing the RoPE base frequency by a factor of four to reach 128,000 tokens without extra retraining. In other words, GPT-OSS is born with long-context ability, DeepSeek is trained into it step by step, and Qwen pushes the limits of what a model trained at 32,000 tokens can do without more long-context training. Personally, I think one of the most interesting things about these papers and the state of the art in deep learning more generally is that a lot of these read as empirical findings. Each lab describes the combination of tools that works well for them,
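
The factor-of-four stretch from 32,000 to 128,000 tokens can be sketched as a per-frequency rescaling of RoPE. The snippet below follows the general shape of YaRN's frequency interpolation as a rough approximation: slowly rotating dimensions are stretched by the scale factor and fast ones are left alone. The head dimension, base, and the `alpha` and `beta` ramp thresholds are assumed values, and YaRN's attention-temperature adjustment is omitted.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """One rotation frequency per pair of dimensions."""
    return [base ** (-i / dim) for i in range(0, dim, 2)]

def yarn_frequencies(dim, original_context, scale,
                     base=10000.0, alpha=1.0, beta=32.0):
    """Interpolate only the frequencies that complete few rotations
    within the original context window."""
    scaled = []
    for freq in rope_frequencies(dim, base):
        wavelength = 2 * math.pi / freq
        rotations = original_context / wavelength
        # ramp = 0: divide the frequency by scale; ramp = 1: leave it untouched
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        scaled.append((1 - ramp) * freq / scale + ramp * freq)
    return scaled

original_context = 32_768  # context length the model was trained to
scale = 4                  # stretch factor applied at inference

base_freqs = rope_frequencies(128)
long_freqs = yarn_frequencies(128, original_context, scale)
extended_context = original_context * scale
```

The fastest-rotating dimensions keep their original frequencies, preserving fine-grained local position information, while the slowest ones are slowed by the scale factor so that positions out to the extended length map back into the range seen during training.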

Starting point is 00:10:48 but almost no one gives a first-principles justification of why one tool is better than the other. For instance, why MLA is better than GQA, full stop. This is much different from domains like math or theoretical physics, which are all about providing first-principles explanations that derive results from axioms or laws. Also, it's interesting that even though most of these models have similar top-line benchmark statistics and use broadly the same tools, like attention mechanisms, activation functions, positional embeddings, and so on, they achieve these similar results using often very different techniques. This is quite surprising.

Starting point is 00:11:18 You'd expect that very different training methods would lead to very different results. Also, all of the major models heavily use reinforcement learning as part of the post-training and reasoning portions of their model training efforts. And it's fascinating and pretty surprising how some of these RL efforts require very small amounts of data, just 4,000 data pairs in the case of Qwen. Another point here is that it's very opaque what the differences in datasets are between the labs. It's clear from the papers that there's an enormous amount of work happening behind the scenes in dataset engineering. This work is probably a significant aspect of the moat that makes these companies comfortable releasing their models.

Starting point is 00:11:48 It's very difficult to replicate what they're releasing. So the big takeaway when reading these papers is you shouldn't focus too much on just the benchmark performance or top-line stats like context size. Instead, look at the specific methods that these labs are using to achieve those results. There are tons of high-performing open source models that we didn't discuss in this video,

Starting point is 00:12:06 like Kimi K2 or Google Gemma 3. But when you peek under the hood of many of these, you'll find nuanced differences that I find really interesting. I hope this gives you a framework for how to understand the latest open source releases and gives you a toolkit to start tinkering with them yourself. Thanks for watching. See you in the next episode.
