Fine-Tuning and Evaluating Large Language Models (LLMs)

Tarapong Sreenuch
Jul 9, 2023


Introduction

This article is about making Large Language Models (LLMs) work better for specific tasks. We look at different ways to adapt these models, such as teaching with a few examples in the prompt or fine-tuning for a specific job, and introduce Dolly, an open-source model that is good at following instructions. Because it is important to check how well a model is actually doing, we also discuss evaluation, including the idea of "alignment" for keeping generated content safe and relevant, and touch on choosing the right model for your needs, considering things like speed, accuracy, and size.

Fine-Tuning LLMs

A powerful way to improve the applicability of LLMs to specific tasks is fine-tuning, in which model parameters are adjusted to suit a particular application. Two approaches dominate in practice: few-shot learning, which steers the model through examples placed in the prompt alone, and instruction-following fine-tuning, which updates the model's weights.

Few-shot learning exploits a model's ability to generalize from a handful of examples provided directly in the prompt. It enables quick development at low cost, since no training is involved, but it typically needs a larger model to perform well given so few examples. The examples themselves must be high quality and cover the full spectrum of the task at hand.
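
To make this concrete, here is a minimal sketch of a few-shot prompt using the Hugging Face transformers library. The base model (EleutherAI's pythia-1.4b) and the sentiment task are illustrative choices, not something prescribed here:

```python
from transformers import pipeline

# Few-shot: the task is demonstrated with labelled examples inside the
# prompt itself; the model's parameters are never updated.
generator = pipeline("text-generation", model="EleutherAI/pythia-1.4b")

prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The food was wonderful and the staff were friendly.
Sentiment: Positive

Review: Terrible service and the room was dirty.
Sentiment: Negative

Review: I loved every minute of the show.
Sentiment:"""

print(generator(prompt, max_new_tokens=3)[0]["generated_text"])
```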

The instruction-following approach does not rely on pre-made examples. In the extreme case where no examples are provided at all, this becomes zero-shot learning: the task is simply described in an instruction and the model carries it out directly, such as producing a summary. Notably, the quality of the results depends heavily on how well the instruction-following model was trained.
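
As a sketch of the zero-shot case, the same pipeline API can be pointed at an instruction-tuned model; google/flan-t5-base is used here purely as a stand-in for any instruction-following model:

```python
from transformers import pipeline

# Zero-shot: only the instruction describes the task; no worked
# examples are included in the prompt.
summarizer = pipeline("text2text-generation", model="google/flan-t5-base")

prompt = (
    "Summarize in one sentence: Large language models can be adapted to "
    "new tasks by fine-tuning their parameters or by conditioning them "
    "on instructions and examples at inference time."
)

print(summarizer(prompt, max_new_tokens=40)[0]["generated_text"])
```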

Dolly Use Case

The Dolly model from Databricks serves as an excellent example of a fine-tuned, instruction-following LLM. Dolly is a 12-billion-parameter model based on EleutherAI's Pythia, whose base weights were trained on "The Pile" dataset. Dolly's fine-tuning dataset, databricks-dolly-15k, comprises high-quality pairs of instructions and responses for intellectual tasks, which enabled the model to perform the specific tasks it was trained on effectively.
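
Based on the instructions in Dolly's Hugging Face model card, loading it looks roughly like this; the 12B variant needs a sizeable GPU, and smaller dolly-v2-3b and dolly-v2-7b checkpoints also exist on the Hub:

```python
import torch
from transformers import pipeline

# Dolly v2 ships a custom instruction-following pipeline, hence
# trust_remote_code=True. device_map="auto" spreads the weights
# across the available GPUs.
generate = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

print(generate("Explain what instruction fine-tuning does to a language model."))
```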

Dolly exemplifies how a commercially viable product can be created by combining an open-source model with a high-quality, open-source dataset. This concept, initially introduced by the Stanford Alpaca project, has inspired the shift from pursuing larger language models to developing fine-tuned, bespoke models for different tasks.

LLMs as a Service

Another approach is to use a proprietary Large Language Model (LLM) as a Service, which assumes no pre-made examples or training on your side. This offers quick application build-up and higher performance, since computation is handled server-side. However, the per-token cost of everything sent and received, data-privacy concerns, and the risk of vendor lock-in are potential drawbacks.
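
As an illustrative sketch of this pattern, here is a call through the OpenAI Python client as it looked around the time of writing (the pre-1.0 ChatCompletion API); the model choice and prompt are placeholders:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # never hard-code keys

# Every token in the messages and in the reply is billed, so prompt
# length translates directly into cost.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user",
         "content": "Summarize the trade-offs of using an LLM as a service."}
    ],
)

print(response.choices[0].message.content)
```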

Evaluating LLMs

Evaluation is a critical facet of the fine-tuning process. LLMs pose a unique challenge because their performance isn't just a matter of accuracy; it's about the value of the generated text. Traditional metrics such as training loss or validation scores aren't particularly insightful here, and even perplexity and next-token accuracy fall short of a full picture: high confidence and correctness in predicting the next word don't guarantee contextually appropriate or high-quality results.
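
For intuition, perplexity is just the exponential of the average next-token negative log-likelihood. A minimal PyTorch sketch, using toy random scores rather than a real model:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (seq_len, vocab_size) raw scores; targets: (seq_len,) token ids.
    # cross_entropy averages the negative log-likelihood over positions.
    nll = F.cross_entropy(logits, targets)
    return torch.exp(nll).item()

# Toy usage: random scores over a 10-token vocabulary.
logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))
print(perplexity(logits, targets))  # a low value need not mean good text
```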

Hence, task-specific evaluation metrics have emerged, such as the Bilingual Evaluation Understudy (BLEU) for translation tasks and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for summarization tasks. Both BLEU and ROUGE score a model by comparing its output against reference translations or summaries for the task at hand.
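
Both metrics are easy to try with Hugging Face's evaluate library, assuming it and its metric dependencies are installed; the sentences below are toy data:

```python
import evaluate  # pip install evaluate (plus rouge_score for ROUGE)

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

# ROUGE (recall-oriented) for summarization-style comparison.
print(rouge.compute(predictions=predictions, references=references))

# BLEU (precision-oriented) accepts multiple references per prediction.
print(bleu.compute(predictions=predictions, references=[references]))
```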

Recently, alignment has become a crucial aspect of evaluation. The concept of alignment involves guiding the model to produce appropriate, non-offensive content, which effectively acts as a form of content moderation.

Conclusion

In conclusion, the landscape of Large Language Models is evolving rapidly, with fine-tuning and evaluation methods improving and diversifying. As open-source models like Dolly are developed and fine-tuned, we move closer to the goal of more targeted, task-efficient language models.

With this increased understanding, we can anticipate and shape the future of LLMs, ensuring their development aligns with our technological needs and ethical guidelines. These advanced models hold immense potential in numerous applications, marking an exciting trajectory for the future of this field.

#dolly #largelanguagemodel #gpt #generativeai #nlp #finetuning
