
Fine-tuning is a technique for training existing LLMs (these are all part of the Transformer family), which have already been pre-trained by various means, on specific “downstream” applications, refining the model’s capabilities for a particular task. Examples include customer support behaviour, coding assistants, medical assistance, and so on.
By “re-training” a model on curated datasets (supervised or semi-supervised), we directly change the model’s weights so that it performs much better than a generic model on the target task(s). These datasets do not have to be as large as the ones used for pre-training, which involves trillions of tokens (e.g., the CommonCrawl dataset combined with other datasets, much of it unstructured) to equip a model with the broad language understanding needed to perform well across a variety of NLP and multimodal applications, depending on the model’s architecture.
Naturally, fine-tuning is still expensive: it requires large amounts of GPU memory to store the model’s weights, gradients and activations during backpropagation, and you will need access to the necessary infrastructure for training. If you are limited by hardware, the following techniques reduce the requirements by significant factors and can make the training process fit on consumer hardware (PS: we will not be covering these):
Parameter Efficient Fine Tuning
LoRA
QLoRA
Quantization
Gradient Accumulation (compensates for not being able to use a large batch size)
Flash Attention (deals with memory bottleneck of attention mechanism)
Mixed-Precision training
You might ask why we should go to such lengths when there are methods like few-shot prompting and Retrieval-Augmented Generation (RAG), which are quick and easy fixes (well, not quite so easy for RAG) for steering a model toward the kind of responses you want. Let’s try to clear up this misunderstanding.
Few-shot prompting works by providing a model with a handful of examples directly in the input. This allows it to generalize from those examples without modifying its weights. It’s useful for quick experimentation but falls apart when consistency matters.
Perks
No need to train a model, just attach a few examples in the prompt.
Works for simple applications.
Simple and cost-effective.
Non-Perks
Inconsistency: the model can generate different responses for the same input. It is also not sustainable for models with short context windows, since the examples stay in the context, increasing usage costs and limiting the number of turns in conversations with a large number of tokens.
Vulnerable to prompt injection attacks.
The model’s abilities have not “increased” at all. It is still constrained to the examples provided in the prompt window and has not acquired much in the way of domain-specific knowledge.
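To make this concrete, here is a minimal sketch of few-shot prompting using the common chat-message format. The task, examples and wording are illustrative assumptions; the same structure works with any chat-completion API or local chat model.

# Few-shot prompting: the "training" lives entirely in the prompt.
# Hypothetical sentiment-labelling task; nothing about the model's weights changes.
few_shot_messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    {"role": "user", "content": "Review: The battery died within a week."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: Absolutely love the build quality!"},
    {"role": "assistant", "content": "positive"},
    # The actual query comes last; the model generalizes from the examples above.
    {"role": "user", "content": "Review: Shipping was slow but the product is great."},
]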
With RAG, the LLM’s capabilities are expanded by providing a database of vectors over which similarity-search queries can be performed. This helps pull in documents relevant to the input query, and can also give the model access to a knowledge base that is updated frequently with relevant information (a bare-bones sketch follows the list below).
Perks
Domain specific knowledge can be stored in a vector store, so there is no need for fine-tuning
The knowledge base can be updated regularly, which is not practical with fine-tuning, since updating the model’s knowledge requires re-training it.
Non-perks
Slower inference speed due to information retrieval.
Model’s performance is tied to the retrieval method’s performance.
Completions may be less fluid and coherent.
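For contrast, here is a bare-bones sketch of the RAG idea using the sentence-transformers library for embeddings. The embedding model name, the documents and the brute-force similarity search are illustrative assumptions; real systems use a proper vector store.

# Minimal RAG sketch: embed documents, retrieve the most similar one,
# and prepend it to the prompt. No model weights are modified.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

query = "How long do I have to return an item?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Similarity search: pick the document closest to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_doc = documents[int(scores.argmax())]

# The retrieved text is stuffed into the prompt that goes to the LLM.
prompt = f"Context: {best_doc}\n\nQuestion: {query}\nAnswer:"
print(prompt)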
Fine-tuning, on the other hand, offers consistent outputs, better inference latency, and customization to specific guidelines, ethics and alignment, on top of boasting the best benchmark performance across various domains. A viable approach that combines the best of both worlds is function calling, where the model uses “tools” to interact with the outside environment; any model can be fine-tuned to use such tools, rapidly boosting its performance across different domains.
We will discuss this in three stages: we will begin with dataset preparation, then move on to fine-tuning the model using supervised learning, before finally covering the Reinforcement Learning approach.
A strong model’s foundation depends on the quality of the data it has been trained on. Depending on our application, the data used for fine-tuning can be anything from summaries of religious scriptures to pictures of kittens labelled with their breed. The dataset, while far smaller than the data used to pre-train a model from scratch, can still be substantial depending on the use case, ranging from hundreds to many thousands of examples. Costs and time consumption can rack up quickly if you are not prepared.
The data can be in the form of code, images, text files, videos or documents, though files containing textual content will have to be parsed into strings regardless of the file format. All of it eventually has to be formatted into a prompt during the fine-tuning stage, though we are not dealing with that right now. In practice this means that the supervised (labelled) data has to be arranged into prompt-completion pairs. For example,
User: Teach me how to bake a chocolate cake
Assistant: Sure! To begin, you will need 5 cups of flour, 3 cups sugar, 3 cups of cream, 10 ounces of dark chocolate and…
Beyond this, we also have to format the prompt-response pairs with the system prompt into common templates.
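As a concrete illustration, prompt-completion pairs with a system prompt are often stored one conversation per line in a JSON Lines file. The file name and schema below are illustrative assumptions; the exact fields expected depend on the training library you use.

import json

# One training example: system prompt plus a prompt-completion pair.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful baking assistant."},
        {"role": "user", "content": "Teach me how to bake a chocolate cake"},
        {"role": "assistant", "content": "Sure! To begin, you will need 5 cups of flour, 3 cups of sugar..."},
    ]
}

# Append each example as a single line of a .jsonl file.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")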
Before going into deeper details about templating, we have to verify whether the data we require is even accessible through public repositories such as Hugging Face, Kaggle, Common Crawl, Gutenberg, etc. If not, then it’s time to do the job that everyone surely loves to hate: data scraping! Use tools and libraries such as Puppeteer, Selenium, Bs4 or Scrapy to scrape data from accessible sites containing what you need, avoid becoming a criminal in the process, and get it done. This will be challenging, but you get more control over the quality of the data you train the model on, which can be a benefit in the long term.
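If you do end up scraping, a tiny sketch with requests and BeautifulSoup (bs4) looks roughly like this. The URL and the choice of elements are placeholders, and you should check a site’s terms of service and robots.txt before scraping it.

import requests
from bs4 import BeautifulSoup

# Placeholder URL: replace with a page you are allowed to scrape.
url = "https://example.com/articles"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
# Grab paragraph text to build a raw text corpus (selector is an assumption).
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(paragraphs))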
Templating can be described as the process of converting raw or unstructured data into a serialized format for fine-tuning a model on a given task. Most transformer models are auto-regressive; hence, the model expects a flow of conversation that is sequential in nature. Providing raw data without any formatting into a template will degrade the model’s performance and produce nonsensical outputs at times.
A chat template helps to avoid this by specifying a conversion schema for input data to a single tokenized string in any particular format suitable for the LLM (ideally, it is the template that the model has been pre-trained with).
For example, here is what a conversation formatted with the chat template for Mistral-7B looks like:
chat = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
{"role": "user", "content": "I'd like to show off how chat templating works!"},
]
This conversation becomes the following string just before tokenization:
[INST] Hello, how are you? [/INST]
I'm doing great. How can I help you today?
[INST] I'd like to show off how chat templating works! [/INST]
Tags like [INST] above provide context for each input statement; using them gives the model a clearer action space, which leads to completions that better align with the user’s needs during multi-turn conversations.
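With the Hugging Face transformers library, you usually do not write this template by hand; the tokenizer carries it. A minimal sketch, assuming you can download the tokenizer for mistralai/Mistral-7B-Instruct-v0.1 (one published variant of the model):

from transformers import AutoTokenizer

# The tokenizer ships with the chat template the model was trained with.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# tokenize=False returns the formatted string (with [INST] tags) instead of token ids.
formatted = tokenizer.apply_chat_template(chat, tokenize=False)
print(formatted)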
If you still find yourself struggling to find the right template, ChatML is a good place to start for conversational models. A list of various datasets can be found in Hugging Face’s datasets directory.
This pre-processing step is usually performed before templating, to filter out bad samples: removing duplicate records, cleaning out data that fails certain tests (toxicity checks for language, or syntax checks for code), and augmenting the dataset by generating variants of the same records (e.g., “sorting” can also be phrased as “ordering” or “rearranging in a given order”).
This is a process that you have to be concerned about mostly while creating your own dataset. The publicly available datasets are nicely curated and usually come pre-processed already.
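Here is a rough sketch of such a cleaning pass with the Hugging Face datasets library. The file name, the "text" column, the length threshold and the naive exact-duplicate check are all illustrative assumptions.

from datasets import load_dataset

# Placeholder file: swap in your own raw data with a "text" column.
ds = load_dataset("json", data_files="raw_data.jsonl", split="train")

# Filter out records that fail simple checks (here: too short to be useful).
ds = ds.filter(lambda ex: len(ex["text"]) > 50)

# Naive exact-duplicate removal: keep only the first occurrence of each text.
seen = set()
def is_new(example):
    key = example["text"]
    if key in seen:
        return False
    seen.add(key)
    return True

ds = ds.filter(is_new)
print(ds)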
Once all of this has been done, we can pad the sequences and convert the query-response pairs into tokens with the help of a tokenizer. Here is where our actual work begins. We have two options: supervised fine-tuning, the standard method that delivers solid results, or a Reinforcement Learning approach based on policy gradients, where Group Relative Policy Optimization (GRPO) will be our helping hand.
A good starting point would be to pick a pre-trained model among one of these families of open-source models:
Llama
Mistral
DeepSeek
Qwen
Phi
Gemma
And so on!
The listed models all have their weight parameters openly accessible (besides performing well on existing benchmarks as of today), which is the primary requirement as we will be modifying these very same weights in further steps.
More of these models can be looked up on Evaluation Benchmark leaderboards. Your model’s post-fine-tuning performance is heavily reliant on both the data and the model you choose, so pick the benchmark that suits your application the best- or create your own, and see which models perform the best.
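Once you have picked a base model, loading its open weights is usually a couple of lines with transformers. The model id below is just one example; any of the families above works the same way.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # example open-weights model; pick what suits your task
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# These are the weights that fine-tuning will modify.
print(sum(p.numel() for p in model.parameters()), "parameters")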
While we have covered some details about this process, we still have not discussed how exactly this method works. To begin with, what exactly happens during retraining anyway?
This section might appear odd to you at first; don’t be put off by the mathematical jargon. Just recall whatever linear algebra and calculus you studied in high school and college, and we will make it to the end.
To train a model, you need its completions for an input prompt. These are produced by what we call a forward pass, which is nothing but an entire pass from the transformer’s input layer up to the output layer, with each layer computing a representation that gets fed into the succeeding layer. Mathematically, this can be written as follows.
For the initial pass, the input layer’s output h_0 is the product of the tokenized input X and the embedding matrix W_e:
h_0 = X · W_e
Each subsequent layer transforms the previous layer’s output:
h_l = Layer_l(h_{l-1}), for l = 1, …, L
The final layer’s vector representation of the output tokens is then given by projecting with the output weight matrix W_o:
o = h_L · W_o
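As a rough PyTorch sketch of those equations (a toy model rather than a real LLM; the dimensions and the single encoder layer standing in for Layer(·) are arbitrary assumptions):

import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64  # arbitrary toy dimensions

W_e = nn.Embedding(vocab_size, d_model)  # embedding matrix W_e
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)  # stand-in for Layer(.)
W_o = nn.Linear(d_model, vocab_size, bias=False)  # output weight matrix W_o

x = torch.randint(0, vocab_size, (1, 5))  # one sequence of 5 token ids

h = W_e(x)       # h_0 = X · W_e
h = layer(h)     # h_l = Layer_l(h_{l-1}), repeated for every layer in a real model
logits = W_o(h)  # final-layer representation projected to vocabulary size
probs = torch.softmax(logits, dim=-1)  # probability distribution over the vocabulary
print(probs.shape)  # (1, 5, 1000): one distribution per input position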
The training process involves taking the current model’s predictions and adjusting its weights accordingly with an optimizer algorithm, Adam in our case. Training means going through the training dataset over and over again (resulting in multiple “epochs”), though not one sample at a time: we pass a batch of random samples each time, and the gradient update happens once for that batch (a single gradient update is a single “step”). There are a lot of training parameters, or hyperparameters as we call them, that we can modify to adjust our training process. Some of them are:
Batch size: The batch size that will be used in each gradient update. Usually powers of 2, like 8, 16 or 64.
Gradient Accumulation: Helps in simulating larger batch sizes by accumulating gradients over multiple batches (smaller), then updating the weights all at once.
Learning Rate: A very crucial parameter which adjusts the rate at which the model learns, by adjusting the step size during updates.
Total Training Epochs / Steps: Controls the overall number of epochs/steps that will take place during training.
Gradient Checkpointing: Discards activations during forward pass and recomputes them during backward pass to calculate gradients.
Mixed-precision Training: You can declare types like bf16 and fp16 to store the model parameters to speed up training.
Evaluation Strategy: Determines whether performance evaluation (with the loss function on a validation set) is run every few steps or once per epoch.
This is performed with the help of Python/C++ libraries like PyTorch, JAX, HuggingFace Transformers, TRL and so on. Just pick whichever suits you the best and pave your way.
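As one possible sketch using Hugging Face Transformers, the hyperparameters above map almost one-to-one onto TrainingArguments. The model id, the tiny toy dataset and the hyperparameter values are illustrative assumptions, and a few argument names vary slightly between library versions.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen2.5-0.5B"  # example open-weights base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding token needed for batching
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tiny toy dataset standing in for your formatted prompt-completion pairs.
texts = ["User: Teach me how to bake a chocolate cake\nAssistant: Sure! You will need flour, sugar..."]
ds = Dataset.from_dict({"text": texts})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), remove_columns=["text"])

args = TrainingArguments(
    output_dir="finetune-out",
    per_device_train_batch_size=8,   # batch size used for each gradient update
    gradient_accumulation_steps=4,   # simulates an effective batch size of 32
    learning_rate=2e-5,              # step size of weight updates
    num_train_epochs=3,              # full passes over the training data
    gradient_checkpointing=True,     # recompute activations in the backward pass to save memory
    bf16=True,                       # mixed-precision training (requires supporting hardware)
    # eval_strategy="epoch",         # evaluation per step or per epoch (name varies by version; needs eval_dataset)
    logging_steps=10,
    report_to="tensorboard",         # or "wandb" for Weights & Biases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM labels
)
trainer.train()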
You can start the fine-tuning with any set of hyperparameter values. While training is underway, platforms like Weights & Biases or TensorBoard can be used to monitor its progress as graphs, which helps visualize the loss and accuracy curves for both the training and validation sets. This lets us stop training midway if it is not going well, saving cost and time. Training will typically run for hours, and the time grows further as the parameter count of the base model increases, so it requires close monitoring, and any changes made midway should be carefully logged.
Well, this was it for Supervised Fine Tuning. So far, we have been discussing training in the context of general knowledge-based datasets with no actual domain in mind. In the second part, we’ll discuss bringing Reinforcement Learning into our Solidity model and how we approach evaluations.
Fine-tuning - The process of training a pre-trained language model on specific downstream tasks using additional labeled datasets to refine its capabilities.
Pre-training - The initial phase of training a language model using a large corpus of text data, allowing it to develop a broad understanding of language.
Supervised Fine-tuning - A fine-tuning process where the model is trained on labeled data with explicit input-output pairs.
Hyperparameters - Configurable variables (e.g., learning rate, batch size, number of epochs) that influence the training process.
Batch Size - The number of training examples processed together before updating the model’s weights.
Gradient Accumulation - A technique that allows smaller batch sizes to accumulate gradients over multiple steps before performing a weight update.
Mixed-precision Training - A method that uses lower-precision (e.g., FP16, BF16) computations to reduce memory usage and speed up training.
Retrieval-Augmented Generation (RAG) - A technique that enhances an LLM’s output by retrieving relevant documents from a knowledge base.
Chat Templating - The process of formatting raw data into a structured format that aligns with an LLM’s expected input structure.
Tokenization - The process of converting raw text into tokens that can be processed by a neural network.
Forward Pass - A step in training where input data passes through the model to generate predictions.
Backpropagation - The process of propagating the loss backward through the network to compute the gradients of the loss with respect to each weight.
To see where the forward pass, loss and backpropagation fit together, let’s walk through a single training step in a little more detail. Each transformer layer applies a Layer function whose internal details we do not need to worry ourselves with as of now. The softmax activation function is responsible for converting the raw token representation of the final layer into a probability distribution over tokens; don’t think too much about it, it’s just a normalized exponential function applied to the product of the final layer’s representation h_L and the output weight matrix W_o.
That output (a probability distribution over all the tokens in the LLM’s vocabulary) is then used to compare how the model’s output stacks up against the ground truth by computing a loss function over the two. A loss function measures your model’s performance against the actual dataset, so the lower the loss, the closer your model is to achieving the ideal performance you seek. A common choice of loss function is cross-entropy loss, which is defined as
L = -Σ_i y_i · log(ŷ_i)
where y_i is 1 for the true token (and 0 otherwise) and ŷ_i is the predicted probability for that token.
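In PyTorch this is the standard cross_entropy function; a tiny sketch with made-up numbers:

import torch
import torch.nn.functional as F

# Toy logits over a 5-token vocabulary for a single position.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.3]])
target = torch.tensor([0])  # index of the true next token

# cross_entropy applies softmax internally, then takes -log(p_true).
loss = F.cross_entropy(logits, target)
print(loss.item())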
Once we have computed the loss, we can compute the gradients of the loss function w.r.t. the model weights (a gradient is a vector of partial derivatives which, in case you forgot, describe the rate of change of a multivariable function). This is called a backward pass, or backpropagation. For a weight matrix such as W_e, we compute
∂L/∂W_e
The gradient tells us how much each weight contributed to the loss value.
Following this, an optimizer algorithm is used to update the model’s weights in a direction that minimizes the loss. Adam is a prominently used optimizer, combining momentum and RMSProp to control the size of gradient updates. How Adam works internally is not something we will discuss right now; it’s sufficient to know that it keeps updates steady around stable regions of the loss curve and speeds them up on steep slopes. In its simplest form, a weight update looks like
W ← W - η · ∇L
where:
η is the learning rate
∇L is the gradient of the loss with respect to the weights.
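Put together, one training step in PyTorch looks roughly like this (a toy linear model rather than an LLM, just to show the loop):

import torch
import torch.nn as nn

model = nn.Linear(10, 5)  # toy stand-in for an LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # Adam with weight decay

x = torch.randn(8, 10)               # a batch of 8 samples
targets = torch.randint(0, 5, (8,))  # ground-truth class per sample

logits = model(x)                                      # forward pass
loss = nn.functional.cross_entropy(logits, targets)    # compare predictions to ground truth

optimizer.zero_grad()  # clear gradients from the previous step
loss.backward()        # backward pass: compute gradients of the loss
optimizer.step()       # update weights in the direction that lowers the loss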
This is just the set of calculations that happens for each sample in a dataset; now imagine it happening over batches of many samples, hundreds or thousands of times. Why are we doing all of this? It is the critical process that updates the model’s weights to decrease the loss as much as possible. As an end product, after many iterations (epochs) of this same process, we get a model that performs well on the kind of applications your training dataset represents. And that’s roughly all the background you need to start fine-tuning LLMs.
Cross-Entropy Loss - A common loss function used to measure how different the predicted probability distribution is from the actual distribution.