The Good Tech Companies - Fine-Tuning LLMs: A Comprehensive Tutorial
Episode Date: February 2, 2026. This story was originally published on HackerNoon at: https://hackernoon.com/fine-tuning-llms-a-comprehensive-tutorial. A hands-on guide to fine-tuning large language models, covering SFT, DPO, RLHF, and a full Python training pipeline. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #llm-fine-tuning-tutorial, #supervised-fine-tuning-sft, #qwen-llm-fine-tuning, #llm-training-pipeline, #hugging-face-transformers, #fine-tuning-lora, #preference-optimization-dpo, #good-company, and more. This story was written by: @oxylabs. Learn more about this writer by checking @oxylabs's about page, and for more stories, please visit hackernoon.com. Training an LLM from scratch is expensive and usually unnecessary. This hands-on tutorial shows how to fine-tune pre-trained models using SFT, DPO, and RLHF, with a full Python pipeline built on Hugging Face Transformers. Learn how to prepare data, tune hyperparameters, avoid overfitting, and turn base models into production-ready specialists.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Fine-tuning LLMs: a comprehensive tutorial, by Oxylabs.
It costs millions of dollars and months of computing time to train a large language model from the ground up.
You most likely never need to do it.
Fine tuning lets you adapt pre-trained language models to your needs in hours or days, not months,
with a fraction of the resources.
This tutorial takes you from theory to practice:
you'll learn the four core fine-tuning techniques, code a complete training pipeline in Python,
and pick up the practices that separate production-ready models from expensive experiments.
What is LLM fine-tuning?
Fine-tuning trains an existing language model on your data to enhance its performance on specific tasks.
Pre-trained models are powerful generalists, but exposing them to focused examples can transform
them into specialists for your use case.
Instead of building a model from scratch, which requires massive compute and
data, you're giving an already capable model a crash course in what matters to you, whether
that's medical diagnosis, customer support automation, sentiment analysis, or any other particular
task. How does LLM fine-tuning work? Fine-tuning continues the training process on pre-trained
language models using your specific dataset. The model processes your provided examples, compares
its own outputs to the expected results, and updates internal weights to adapt and minimize loss.
This approach can vary based on your goals, available data, and computational resources.
Some projects require full fine-tuning, where you update all model parameters,
while others work better with parameter-efficient methods like LoRA that modify only a small subset.
LLM fine-tuning methods.
Supervised fine-tuning (SFT) teaches the model to learn the patterns of correct question-answer pairs
and adjusts model weights to match those answers exactly.
You need a dataset of question-answer pairs.
Use this when you want consistent outputs, like making the model always respond in JSON format,
following your customer service script or writing emails in your company's tone.
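As a rough illustration (not taken from the article), an SFT dataset can be as simple as a list of question-answer records; the field names below are assumptions, so match whatever your training script expects.

```python
# Hypothetical SFT examples: plain question-answer pairs that always answer in JSON.
# The "question"/"answer" field names are assumptions, not the article's schema.
sft_pairs = [
    {
        "question": "Classify this ticket: 'My order #1234 arrived damaged.'",
        "answer": '{"intent": "damaged_item", "order_id": "1234"}',
    },
    {
        "question": "Classify this ticket: 'I was charged twice for one order.'",
        "answer": '{"intent": "duplicate_charge", "order_id": null}',
    },
]
```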
Unsupervised fine-tuning feeds the model tons of raw text, no questions or labeled data needed,
so it learns the vocabulary and patterns of a particular domain.
While this is technically a pre-training process known as continued pre-training (CPT),
it's usually done after the initial pre-training phase. Use this first when your model needs to
understand specialized content it wasn't originally trained on, like medical terminology, legal contracts,
or a new language. Direct preference optimization (DPO) teaches the model to prefer
better responses by showing examples of good versus bad answers to the same question and adjusting
it to favor the good ones. It needs triplets: a prompt, a preferred answer, and a rejected answer. Use DPO after basic training to fix annoying
behaviors like stopping the model from making things up, being too wordy, or giving unsafe answers.
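For reference, a DPO triplet bundles a prompt with a preferred and a rejected answer. The field names below follow the common prompt/chosen/rejected convention and are an assumption, not the article's exact schema.

```python
# Hypothetical DPO triplet: one prompt, a preferred answer, and a rejected one.
dpo_example = {
    "prompt": "What is the capital of Australia?",
    "chosen": "The capital of Australia is Canberra.",
    "rejected": "The capital of Australia is Sydney, its largest city.",
}
```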
Reinforcement fine-tuning. In RLHF, you first train a reward model on prompts with multiple
responses ranked by humans, teaching it to predict which responses people prefer. Then, you use
reinforcement learning to optimize and fine-tune a model that generates responses, which the reward model judges.
This helps the model learn over time to produce higher-scoring outputs. This process requires
datasets with a prompt and several human-ranked responses, as sketched below. It's best for tasks where judging quality is easier than creating perfect
examples, like medical diagnoses, legal research, and other complex domain-specific reasoning.
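As a sketch of what reward-model training data can look like (the exact schema is an assumption, so adapt it to your reward-model trainer), each record pairs one prompt with several human-ranked responses.

```python
# Hypothetical RLHF reward-model example: one prompt with several responses
# ranked by human annotators (1 = best). The schema here is an assumption.
rlhf_example = {
    "prompt": "Explain the most likely cause of these lab results.",
    "responses": [
        {"text": "A thorough, well-reasoned explanation.", "rank": 1},
        {"text": "A plausible but incomplete explanation.", "rank": 2},
        {"text": "A confident but incorrect explanation.", "rank": 3},
    ],
}
```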
Step-by-step LLM fine-tuning tutorial. We'll walk you through every step of fine-tuning a small
pre-trained model to solve word-based math problems, something it struggles with out of the box.
We'll use the Qwen 2.5 base model with 0.5B parameters, which already has natural language processing
capabilities. The approach works for virtually any use case of fine-tuning LLMs, teaching a model
specialized terminology, improving the model's performance on specific tasks, or adapting it to your domain.
Prerequisites. Install a few Python packages that we'll use throughout this tutorial. In a new project
folder, create and activate a Python virtual environment, and then install these libraries using pip
or your preferred package manager. 1. Get and load the dataset. The fine-tuning process starts with
choosing the dataset, which is arguably the most important decision.
The dataset should directly reflect the task you want your model to perform.
Simple tasks like sentiment analysis need basic input-output pairs.
Complex tasks like instruction following or question answering require richer
datasets with context, examples, and varied formats.
Fine-tuning data quality and size directly impact training time and your model's performance.
The easiest starting point is the Hugging Face datasets library, which hosts
thousands of open-source datasets for different domains and tasks. Need something specific and high
quality? Purchase specialized datasets or build your own by scraping publicly available data. For example,
if you want to build a sentiment analysis model for Amazon product reviews, you may want to collect
data from real reviews using a web scraping tool. Here's a simple example that uses Oxylabs'
Web Scraper API.
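The snippet below is a rough sketch of such a request rather than the article's exact code; the endpoint, the "source" value, and the payload fields are assumptions based on Oxylabs' public documentation, so check the current docs and use your own credentials.

```python
import requests

# Sketch of an Oxylabs Web Scraper API call; the endpoint, source name, and
# payload fields are assumptions, not the article's exact code.
payload = {
    "source": "amazon_reviews",   # assumed source name for Amazon review pages
    "query": "B07FZ8S74R",        # hypothetical product ASIN
    "parse": True,
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("YOUR_USERNAME", "YOUR_PASSWORD"),
    json=payload,
    timeout=180,
)
print(response.json())
```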
2. Tokenize the data for processing. Models don't understand text directly; they work with numbers.
Tokenization converts your text into tokens, numerical representations,
that the model can process. Every model has its own tokenizer trained alongside it,
so use the one that matches your base model. How we tokenize our data shapes what the model learns.
For math problems, we want to fine-tune the model to learn how to answer questions,
not generate them. Here's the trick: tokenize questions and answers separately,
then use a masking technique, setting the labels of the question tokens to -100 to tell
the training process to ignore them when calculating loss. The model only learns from the answers,
making training more focused and efficient.
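Here is a minimal sketch of such a function, assuming a Hugging Face tokenizer and a dataset with "question" and "answer" fields; the checkpoint id and field names are assumptions rather than the article's exact code.

```python
from transformers import AutoTokenizer

# Assumed checkpoint id for the small Qwen 2.5 base model used in this tutorial.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

def tokenize_example(example):
    # Tokenize the question and the answer separately so the question can be masked.
    question_ids = tokenizer(example["question"] + "\n").input_ids
    answer_ids = tokenizer(example["answer"] + tokenizer.eos_token).input_ids

    input_ids = question_ids + answer_ids
    # Labels of -100 are ignored by Hugging Face loss computation,
    # so the model only learns to produce the answer tokens.
    labels = [-100] * len(question_ids) + answer_ids

    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```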
Apply this tokenization function to both training and testing datasets. We filter out examples longer than 512 tokens to keep memory usage manageable and ensure the
model processes complete information without truncation. Shuffling the training data helps the model
learn more effectively. Optional: want to test the entire pipeline quickly before committing to a full
training run? You can train the model on a subset of the dataset. So, instead of using the full 8.5K
dataset, you can reduce it to 3K in total, making the process much faster (see the sketch below). Keep in mind,
smaller datasets increase overfitting risk, where the model memorizes training data rather than
learning general patterns. For production, aim for at least 5K+ training samples and carefully
tune your hyperparameters.
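A one-line sketch of that optional subsetting step, assuming tokenized_train is the Hugging Face Dataset produced by the tokenization step:

```python
# Optional quick run: shuffle, then keep a 3K subset of the tokenized training split.
small_train = tokenized_train.shuffle(seed=42).select(range(3000))
```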
3. Initialize the base model. Next, load the pre-trained base model to fine-tune it by improving its math problem-solving abilities.
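A minimal sketch of this step; the checkpoint id is an assumption matching the small Qwen 2.5 base model described above.

```python
from transformers import AutoModelForCausalLM

# Load the pre-trained base model (assumed checkpoint id).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
```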
4. Fine-tune using the Trainer method. This is where the magic happens. Training arguments control how your model learns;
think of them as the recipe determining the quality of your final results. These settings and hyperparameters
can make or break your fine-tuning, so experiment with different values to find what works for your use
case. Key parameters explained:
• Epochs. More epochs equal more learning opportunities, but too many cause overfitting.
• Batch size affects memory usage and training speed. Adjust it based on your hardware.
• Learning rate controls how quickly the model adjusts. Too high and it might miss the optimal solution; too low and training takes forever.
• Weight decay can help to prevent overfitting by deterring the model from leaning too much on any single pattern. If weight decay is too large, it can lead to underfitting by preventing the model from learning the necessary patterns.
The optimal configuration below is specialized for CPU training. Remove use_cpu=True if you have a GPU.
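The exact configuration wasn't captured here, so the sketch below uses placeholder hyperparameters (the batch sizes and gradient accumulation match the values reported later in this tutorial); it assumes the model, tokenizer, and tokenized splits from the previous steps.

```python
from transformers import Trainer, TrainingArguments, DataCollatorForSeq2Seq

# Placeholder hyperparameters; tune them for your hardware and dataset.
training_args = TrainingArguments(
    output_dir="qwen-math-sft",
    num_train_epochs=3,              # more epochs = more learning, but higher overfitting risk
    per_device_train_batch_size=7,
    per_device_eval_batch_size=7,
    gradient_accumulation_steps=5,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=50,
    use_cpu=True,                    # remove this line if you have a GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    # Pads input_ids and labels (with -100) so the question mask is preserved.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```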
5. Evaluate the model. After fine-tuning, measure how well your model performs using two common metrics:
• Loss measures how far off the model's predictions are from the target outputs, where lower values indicate better performance.
• Perplexity, the exponential of loss, shows the same information on a more intuitive scale, where lower values mean the model is more confident in its predictions.
For production environments, consider adding metrics like BLEU or ROUGE to measure how closely
generated responses match reference answers. You can also include other metrics like F1,
which measures how good your model is at catching what matters while staying accurate.
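A short sketch of computing both metrics with the trainer from the previous step; perplexity is simply the exponential of the evaluation loss.

```python
import math

# Evaluate on the held-out split, then convert loss to perplexity (lower is better).
metrics = trainer.evaluate(eval_dataset=tokenized_test)
eval_loss = metrics["eval_loss"]
perplexity = math.exp(eval_loss)

print(f"eval loss: {eval_loss:.3f}, perplexity: {perplexity:.2f}")
```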
This Hugging Face lecture is a good starting point to learn the essentials of
using the Transformers library.
Complete fine-tuning code example. After these five steps,
you should have the following code combined into a single Python file.
Before executing, take a moment to adjust your trainer configuration and hyperparameters based on what your machine can actually handle.
To give you a real-world reference, here's what worked smoothly for us on a MacBook Air with the M4 chip and 16 gigabytes RAM.
With this setup, it took around 6.5 hours to complete fine-tuning:
• batch size for training: 7
• batch size for eval: 7
• gradient accumulation steps: 5
As your model trains, keep an eye on the evaluation loss.
If it increases while training loss drops, the model is overfitting.
In that case, adjust epochs, lower the learning rate, modify weight decay, and tune other hyperparameters.
In the example below, we see healthy results, with eval loss decreasing from
0.496 to 0.469 and a final perplexity of about 1.60.
Test the fine-tuned model. Now for the moment of truth: was our fine-tuning actually successful?
You can manually test the fine-tuned model by prompting it with this Python code.
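A sketch along those lines; the word problem below is a made-up example rather than the article's test question, and greedy decoding (do_sample=False) makes the model always pick its highest-probability answer.

```python
import torch

# Prompt the fine-tuned model with a hypothetical word problem (answer: 10).
question = "Anna has 3 boxes with 4 apples each. She gives away 2 apples. How many apples are left?\n"

inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```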
In this side-by-side comparison, you can see how the before-and-after models respond to the same
question. The correct answer is 10. With sampling enabled, both models occasionally get it right
or wrong due to randomness. But disabling sampling reveals their true confidence: the model always
picks its highest-probability answer. The base model confidently outputs the wrong answer, while the fine-tuned
model confidently outputs the correct one. That's fine-tuning at work. Fine-tuning best practices.
Model selection.
• Choose the right base model. Domain-specific models and appropriate context windows save you from fighting against the model's existing knowledge.
• Understand the model architecture. Encoder-only models like BERT excel at classification tasks, decoder-only models like GPT at text generation, and encoder-decoder models like T5 at transformation tasks like translation or summarization.
• Match your model's input format. If your base model was trained with specific prompt templates, use the same format in fine-tuning. Mismatched formats confuse the model and tank performance.
Data preparation.
• Prioritize data quality over quantity. Clean and accurate examples beat massive and noisy datasets every time.
• Split training and evaluation samples. Never let your model see evaluation data during training. This lets you catch overfitting before it ruins your model.
• Establish a golden set for evaluation. Automated metrics like perplexity don't tell you if the model actually follows instructions or just predicts words statistically.
Training strategy.
• Start with a lower learning rate. You're making minor adjustments, not teaching it from scratch, so aggressive rates may erase what it learned during pre-training.
• Use parameter-efficient fine-tuning (LoRA, PEFT). Train only 1% of parameters to get 90%+ performance while using far less memory and time (see the sketch after this list).
• Target all linear layers in LoRA. Targeting all linear layers, not just a few, yields models that reason significantly better, not just mimic style.
• Use NEFTune (noisy embedding fine-tuning). Random noise in embeddings acts as regularization, which can prevent memorization and boost conversational quality by 35+ percentage points.
• After SFT, run DPO. Don't just stop after SFT: SFT teaches how to talk; DPO teaches what is good by learning from preference pairs.
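Here is a minimal LoRA sketch for the two PEFT-related tips above; it assumes the peft package is installed (recent versions support the "all-linear" shortcut), and the rank, alpha, and dropout values are illustrative.

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; r, alpha, and dropout are placeholders to tune.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # target every linear layer, not just attention
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically around 1% of all parameters
```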
What are the limitations of LLM fine-tuning?
• Catastrophic forgetting. Fine-tuning overrides existing neural patterns, which can erase valuable general knowledge the model learned during pre-training. Multitask learning, where you train on your specialized task alongside general examples, can help preserve broader capabilities.
• Overfitting on small datasets. The model may memorize your training examples instead of learning patterns, causing it to fail on slightly different inputs.
• High computational cost. Fine-tuning billions of parameters requires expensive GPUs, significant memory, and hours to days or weeks of training time.
• Bias amplification. Pre-trained models already carry biases from their training data, and fine-tuning can intensify these biases if your dataset isn't carefully curated.
• Manual knowledge updates. New and external knowledge may require retraining the entire model or implementing retrieval-augmented generation (RAG), while repeated fine-tuning often degrades performance.
Conclusion. Fine-tuning works, but only if your data is clean and your hyperparameters are
dialed in.
Combine it with prompt engineering for the best results, where fine tuning handles the task
specialization while prompt engineering guides the model's behavior at inference time.
Continue by grabbing a model from Hugging Face that fits your use case for domain-specific
fine-tuning, scrape or build a quality dataset for your task, and run your first fine-tuning session
on a small subset. Once you see promising results, scale up and experiment with LoRA, DPO, or
NEFTune to squeeze out better performance. The gap between reading this tutorial and having a
working specialized model is smaller than you think. Thank you for listening to this Hackernoon story,
read by artificial intelligence. Visit hackernoon.com to read, write, learn and publish.
