LLMs — Fine-tuning and Model Evaluation

Ritik Jain
7 min read · Aug 21, 2023


Photo by Andrea De Santis on Unsplash

Original Post: https://data-amplifier.beehiiv.com/p/llms-fine-tuning-and-model-evaluation

Introduction

In my previous articles, I explored the basics of LLMs in “LLMs — A Brief Introduction”, delved into their different architectural types in “LLMs — Model Architectures and Pre-training Objectives”, and examined the intricacies of prompt engineering and model configuration in “LLMs — Mastering LLM Responses through Advanced Prompt Engineering Strategies”. In this article, I dive into the essential topic of fine-tuning and parameter optimization techniques for tailoring LLMs to specific tasks. I’ll cover single-task and multi-task optimization, scaling instruction models, and benchmarking. By the end, you’ll have a comprehensive understanding of how to harness fine-tuning to improve LLM responses for specific tasks.

Fine-tuning is a supervised learning procedure that refines a pre-trained LLM using instruction prompts. At a high level, a relatively small dataset of texts and corresponding labels is supplied to train the LLM for a specific task or set of tasks.

Prompt Dataset

A prompt dataset is created to extend the capabilities and flexibility of a pre-trained language model. It instructs the model on particular tasks through sets of instructional prompts: each entry in the dataset pairs a prompt with its desired completion.

Dataset creation for different tasks
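
To make this concrete, here is a minimal sketch of such a dataset in JSONL form, one prompt-completion pair per line. The file name and the example pairs are illustrative assumptions, not a prescribed schema:

    import json

    # Hypothetical instruction-style records: each pair maps a task prompt
    # to the completion we want the fine-tuned model to produce.
    records = [
        {"prompt": "Summarize the following text:\nThe quick brown fox jumps over the lazy dog.",
         "completion": "A fox jumps over a dog."},
        {"prompt": "Classify the sentiment of this review: 'Great product!'",
         "completion": "positive"},
    ]

    # Write one JSON object per line (JSONL), a common format for prompt datasets.
    with open("prompt_dataset.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

Each line becomes one training example: the prompt is the model’s input and the completion is the target output.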

Training Process

The training process of a Large Language Model (LLM) closely resembles the conventional training approach and encompasses several key stages:

1. Dataset Preparation and Splitting: The initial step involves preparing datasets and dividing them into distinct subsets: the training set, used to train the model; the validation set, used to fine-tune hyperparameters and prevent overfitting; and the test set, employed to evaluate the model’s generalization.

2. Specification of Optimization, Loss Functions, and Evaluation Metrics: Next, optimization techniques, loss functions, and evaluation metrics are defined. Optimization algorithms, like stochastic gradient descent (SGD) or Adam, are chosen to minimize the loss function. The loss function quantifies the discrepancy between the predicted outputs and the actual targets. In addition, evaluation metrics are chosen to gauge the model’s performance, such as accuracy, precision, recall, F1-score, or specialized metrics for language tasks like BLEU or ROUGE.

3. Training Iterations to Minimize Loss: The core of the training process revolves around iteratively minimizing the loss value. During each iteration or epoch, the model receives batches of training data and calculates predictions. These predictions are then compared to the actual targets using the chosen loss function. The optimization algorithm adjusts the model’s parameters in a way that reduces the loss value, enhancing the model’s predictive accuracy. A minimal code sketch of this loop follows the figure below.

Classical Training Process
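
Here is a hedged sketch of that classical loop in PyTorch. The model, data loaders, and hyperparameters are generic placeholders (assuming a classification-style model that returns logits), not a prescribed recipe:

    import torch
    from torch import nn

    def train(model, train_loader, val_loader, epochs=3, lr=5e-5):
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimization (step 2)
        loss_fn = nn.CrossEntropyLoss()                           # loss function (step 2)

        for epoch in range(epochs):                               # iterations (step 3)
            model.train()
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                logits = model(inputs)            # forward pass: batch predictions
                loss = loss_fn(logits, targets)   # compare predictions to targets
                loss.backward()                   # gradients of the loss
                optimizer.step()                  # parameter update that reduces the loss

            # Validation pass on the held-out split (step 1) to watch for overfitting.
            model.eval()
            with torch.no_grad():
                val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
            print(f"epoch {epoch}: val_loss={val_loss / len(val_loader):.4f}")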

Fine-tuning

Fine-tuning is the process of training a pre-trained LLM on a narrower dataset to specialize it for a specific task. This process involves adjusting the model’s parameters to optimize its performance for the target domain. Fine-tuning can dramatically enhance the model’s capabilities and adaptability.

LLM fine-tuning can be categorized into two main types:

Type 1: Single-Task Optimization

When focusing on a single task, fine-tuning involves selecting a relevant dataset, designing task-specific prompts, and adjusting hyperparameters. The goal is to strike a balance between preserving the model’s pre-trained knowledge and honing its skills for the particular task. For example, fine-tuning an LLM to summarize text by understanding its context.
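
As a hedged illustration of single-task fine-tuning, the sketch below trains a small sequence-to-sequence model on the JSONL prompt dataset from earlier using Hugging Face Transformers. The model name, file name, and hyperparameters are assumptions chosen for illustration:

    # Minimal single-task fine-tuning sketch with Hugging Face Transformers.
    from datasets import load_dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Trainer, TrainingArguments)

    model_name = "t5-small"  # assumed small seq2seq model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Load the hypothetical prompt-completion pairs created earlier.
    raw = load_dataset("json", data_files="prompt_dataset.jsonl", split="train")

    def preprocess(batch):
        # Prompts become inputs; completions become the training targets.
        inputs = tokenizer(batch["prompt"], truncation=True, max_length=512)
        labels = tokenizer(text_target=batch["completion"],
                           truncation=True, max_length=64)
        inputs["labels"] = labels["input_ids"]
        return inputs

    tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ft-summarizer",
                               per_device_train_batch_size=8,
                               num_train_epochs=1),
        train_dataset=tokenized,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()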

Challenge:

Fine-tuning on a single task often leads to catastrophic forgetting, where the model loses previously learned information as it acquires new information. Fine-tuning can significantly improve performance on the target task, but it can also degrade the model’s ability on other tasks.

This process is especially problematic in sequential learning scenarios where the model is trained on multiple tasks over time.

Catastrophic forgetting is a common problem in machine learning, especially in deep learning models.

Example: a sentiment classification task. We fine-tune the model to return sentiment labels instead of full sentences, and it works for that task.

Before and After Fine-tuning

To prevent catastrophic forgetting, several strategies can be employed:

  1. Fine-tune on multiple tasks at the same time
  2. Parameter-Efficient Fine-Tuning (PEFT), sketched below
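
As a brief sketch of option 2, the snippet below adds LoRA adapters to a base model with Hugging Face’s peft library, so only a small number of added parameters is trained while the original weights stay frozen, which helps mitigate forgetting. The base model and LoRA hyperparameters are assumptions for a T5-style model:

    # Sketch: LoRA-based PEFT with the Hugging Face `peft` library.
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForSeq2SeqLM

    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # assumed base model

    lora_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=32,              # scaling factor for the update
        lora_dropout=0.05,
        target_modules=["q", "v"],  # attention projections in T5 (assumed)
    )

    peft_model = get_peft_model(model, lora_config)
    peft_model.print_trainable_parameters()  # tiny fraction of total parameters
    # `peft_model` can then be passed to the same Trainer loop as before.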

Type 2: Multi-Task Optimization

In scenarios where multiple related tasks are at hand, multi-task optimization comes into play. By training the LLM on a diverse set of tasks, it learns shared representations that improve performance on each individual task.

This approach requires careful dataset curation and hyperparameter tuning to ensure optimal outcomes.

Model Evaluation

Model evaluation is a critical process that involves assessing the quality, accuracy, and capabilities of these models across various language-related tasks.

Recall: The fraction of relevant information accurately captured by the LLM’s generated output.

Precision: The accuracy of the LLM’s generated output in providing relevant information.

F1-Score: A combined measure of recall and precision, offering a balanced assessment of the LLM’s performance in language tasks.
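
In classification terms, these metrics reduce to simple ratios of true positives (TP), false positives (FP), and false negatives (FN); a small helper makes the definitions concrete:

    # Sketch: precision, recall, and F1 from raw counts.
    def precision_recall_f1(tp: int, fp: int, fn: int):
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return precision, recall, f1

    print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)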

ROUGE

ROUGE, which stands for “Recall-Oriented Understudy for Gisting Evaluation,” is a suite of metrics used to assess the quality of automatically generated text, particularly in tasks such as text summarization. Higher ROUGE scores indicate better content similarity and overlap between the generated and reference texts. ROUGE comprises several evaluation metrics, each focusing on different aspects:

ROUGE-N: This metric calculates the overlap of n-grams (contiguous sequences of n words) between the generated text and the reference text. ROUGE-1, ROUGE-2, ROUGE-3, and so on, correspond to unigrams, bigrams, trigrams, and higher-order n-grams, respectively.

ROUGE-2 (bigram) recall is computed as:

    ROUGE-2 recall = (number of matching bigrams) / (total bigrams in the reference)

Here, the numerator counts the bigrams that appear in both the generated text and the reference text, and the denominator counts all bigrams in the reference.
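
A small, self-contained sketch of ROUGE-N recall (with clipped n-gram counts), using illustrative sentences:

    # Sketch: ROUGE-N recall via n-gram overlap.
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def rouge_n_recall(candidate: str, reference: str, n: int = 2) -> float:
        cand = Counter(ngrams(candidate.split(), n))
        ref = Counter(ngrams(reference.split(), n))
        overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())  # clipped matches
        total = sum(ref.values())  # total n-grams in the reference
        return overlap / total if total else 0.0

    print(rouge_n_recall("the cat sat on the mat",
                         "the cat is on the mat", n=2))  # 3/5 = 0.6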

ROUGE-L: ROUGE-L measures the longest common subsequence (LCS) between the generated and reference texts. Dividing the LCS length by the number of words in the reference gives an LCS-based recall; dividing by the number of words in the candidate gives precision, and the two combine into an F-measure.
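
A corresponding sketch of ROUGE-L recall, using a classic dynamic-programming LCS:

    # Sketch: ROUGE-L recall from the longest common subsequence (LCS).
    def lcs_length(a, b):
        # Classic dynamic-programming LCS over token sequences.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
        return dp[-1][-1]

    def rouge_l_recall(candidate: str, reference: str) -> float:
        cand, ref = candidate.split(), reference.split()
        return lcs_length(cand, ref) / len(ref) if ref else 0.0

    print(rouge_l_recall("the cat sat on the mat",
                         "the cat is on the mat"))  # 5/6 ≈ 0.833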

ROUGE-W: ROUGE-W is a weighted variant of ROUGE-L that favors consecutive matches, so candidates whose overlapping words appear in contiguous runs score higher than those with scattered matches.

ROUGE-S: ROUGE-S focuses on skip-bigrams: pairs of words that appear in the same order in the text but may be separated by other words, optionally up to a maximum gap. This metric is designed to capture coherence and continuity, since it rewards word pairs that are not adjacent yet still maintain a meaningful relationship.

BLEU Score

The BLEU (Bilingual Evaluation Understudy) score is a metric used to assess the quality of machine-generated translations in natural language processing tasks, particularly in machine translation. It aims to measure the similarity between a machine-generated translation and one or more human reference translations.

BLEU is sensitive to exact n-gram matches and may not fully capture the fluency, coherence, and semantic nuances of translations. It captures only local word order through n-grams and does not recognize synonyms, so a translation that uses synonymous words may receive a lower score. As a surface-overlap metric, it does not always align with human judgments.
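
For a quick, hedged illustration, NLTK ships a sentence-level BLEU implementation; smoothing is advisable for short texts:

    # Sketch: sentence-level BLEU with NLTK (smoothing for short sentences).
    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

    reference = ["the cat is on the mat".split()]   # one or more references
    candidate = "the cat sat on the mat".split()

    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")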

In general:

  • ROUGE focuses on recall: how many of the words (and/or n-grams) in the human references appear in the candidate model outputs.
  • BLEU focuses on precision: how many of the words (and/or n-grams) in the candidate model outputs appear in the human references.

Conclusion

As I wrap up this article, you now have a comprehensive understanding of fine-tuning and parameter optimization techniques for specific tasks using Large Language Models. From single-task and multi-task optimization to scaling instruction models, model evaluation, and benchmarking, each step plays a pivotal role in creating highly effective and specialized models. With this knowledge in hand, you are equipped to harness the power of LLMs to their fullest potential, bringing your projects to new heights of success.

Remember, mastery in this field comes with practice and continuous exploration. Stay curious, experiment, and stay tuned for our upcoming articles that delve even deeper into the fascinating world of Large Language Models.

References

  1. LLMs — A Brief Introduction
  2. LLMs — Model Architectures and Pre-training Objectives
  3. LLMs — Mastering LLM Responses through Advanced Prompt Engineering Strategies
  4. Scaling Instruction-Finetuned Language Models
  5. Introducing FLAN: More Generalizable Language Models with Instruction Fine-Tuning
  6. ROUGE: A Package for Automatic Evaluation of Summaries


Ritik Jain

Fallen for data and for understanding the problems it can resolve. Passionate about ML and MLOps.