Fine-Tuning Large Language Model with Hugging Face & PyTorch

Using GPT-2 to generate Cooking recipes

Tuan Tran
15 min read · Mar 9, 2024
Source: Generated by Bing. Prompted by Author

Introduction

Artificial Intelligence (AI) has witnessed groundbreaking advancements in recent years, and one of the standout technologies is ChatGPT. Curiosity about its intelligence made me want to explore it. In this article, we delve into text generation using the GPT-2 model, which was trained on a diverse dataset of 8 million web pages. I then continue training the GPT-2 model on a dataset of cooking recipes, enabling it to generate recipes based on input ingredients. I use a pretrained GPT-2 model from Hugging Face and PyTorch to develop the code for this article.

  1. Pre-Trained Models
  2. Fine-Tune a Large Language Model
    2.1 What is fine-tuning and Why?
    2.2 Setup
    2.3 Quick Test with GPT2
    2.4 Prepare a Dataset
    2.5 GPT2 Tokenizer
    2.6 Tokenize Datasets & Dataloaders
    2.7 Finetune GPT2 Language Model
    2.8 Saving & Loading Fine-Tuned Model
    2.9 Generate Text
    2.10 Performance Evaluation
  3. What’s Next?

1. Pre-Trained Models

A pre-trained model is essentially a saved network that has undergone prior training on a substantial dataset, e.g. on a large-scale text generation task. The utility of a pretrained model lies in its adaptability — you can use it as is or employ transfer learning to fine-tune the model for a specific task.

Hugging Face Transformers provides thousands of pretrained models to perform tasks on text, vision, and audio. The underlying principle is simple yet powerful — we can save a trained model’s parameters for future use, then we can share it with anyone, whether for inference or as a basis for subsequent fine-tuning.
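For instance, a pretrained checkpoint can be used as-is in a couple of lines. Below is a minimal sketch using the Transformers pipeline API; the prompt is just an example.

from transformers import pipeline

# Load the pretrained GPT-2 checkpoint from the Hub and generate text with it as-is
generator = pipeline("text-generation", model="gpt2")
print(generator("The quick brown fox", max_length=30)[0]["generated_text"])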

GPT-2 pretrained models are available in four different sizes. In this article, I use gpt2-medium to generate text and fine-tune it with a new dataset.
- gpt2: 124M parameters
- gpt2-medium: 345M parameters
- gpt2-large: 774M parameters
- gpt2-xl: 1558M parameters

Source: modeldifferently.com

2. Fine-Tuning a Large Language Model

2.1 What is Fine-Tuning and Why?

Pre-trained large language models (LLMs) are trained on a general, diverse corpus; they offer many capabilities but aren’t universal. You don’t want the model to respond in the most general way, based on how particular words are typically used on the Internet. Fine-tuning starts from an existing pre-trained model and continues training on a specialized corpus, shifting the parameters to achieve a better loss on a specific task. In this article, I fine-tune the GPT-2 model, which already understands the English language, on a corpus of cooking recipes, enabling the model to generate recipes based on input ingredients. This application of transfer learning enhances the model’s capability in domain-specific tasks without starting the training process from scratch, which can be extremely time-consuming and computationally expensive.

There are 3 ways to fine-tune a pre-trained model.

  • Unsupervised fine-tuning is a technique where you train the LLM on a dataset that does not contain any labels. The task could be text generation, such as legal document or cooking recipe generation as in this article.
  • Supervised fine-tuning is a technique where you train the LLM on a labeled dataset. This task could be sentiment analysis, text summarization, translation, etc.
  • Reinforcement Learning from Human Feedback (RLHF) is a technique where you use human feedback to fine-tune the LLM. The basic idea is that you give the LLM a prompt and it generates an output; then you ask a human to rate the output.

There are 3 options for parameter training: retraining all parameters, transfer learning, and parameter-efficient fine-tuning (PEFT). In this article, I use transfer learning, which selectively updates only a subset of the model’s parameters, typically leaving most of the pretrained parameters frozen; a sketch of what that looks like follows.
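Below is a sketch of the transfer-learning option: freeze most of GPT-2’s parameters and train only the last few transformer blocks. The choice of two blocks is an arbitrary example, and model is assumed to be a GPT2LMHeadModel as loaded in Section 2.7.

# Freeze everything, then unfreeze only the last 2 transformer blocks
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True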

2.2 Setup

First, let’s install the transformers package from Hugging Face, which gives us a PyTorch interface for working with the GPT-2 pre-trained model. We also need a few libraries for data processing and model optimization with PyTorch. I have opted for the gpt2-medium variant with 345 million parameters so that it runs on my computer equipped with an RTX 2080. I recommend executing the code on a computer or Colab notebook with a CUDA-capable GPU for optimal performance.

The resulting model will be saved in the “model” folder and used to generate recipes in the final steps.
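A minimal environment setup, assuming Python and a CUDA-enabled PyTorch build are already in place (package versions are not pinned here):

pip install transformers torch pandas nltk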

from transformers import GPT2LMHeadModel, GPT2TokenizerFast, GPT2Config
from transformers import get_linear_schedule_with_warmup

import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from torch.utils.data import random_split, RandomSampler, SequentialSampler

import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"
# model_name: ['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl']
model_name = "gpt2-medium"
model_save_path = './model'

2.3 Quick Test with GPT2

Let’s use the pre-trained gpt2-medium model to generate text with the prompt “beef, salt, pepper”. I expect the output to be something like a beef steak recipe, but let’s see. max_length=120 caps the output length, and num_return_sequences=3 asks for 3 outputs. We will discuss the parameters do_sample , top_k , top_p later in Section 2.9.

configuration = GPT2Config.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, config=configuration)

tokenizer = GPT2TokenizerFast.from_pretrained(model_name)

input_sequence = "beef, salt, pepper"
input_ids = tokenizer.encode(input_sequence, return_tensors='pt')

model = model.to(device)
# combine both sampling techniques
sample_outputs = model.generate(input_ids.to(device),
                                do_sample=True, max_length=120,
                                top_k=50, top_p=0.85,
                                num_return_sequences=3)
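model.generate returns token IDs rather than text, so the samples need to be decoded before we can read them. A small loop like the following (not shown in the snippet above) prints each sample:

for i, sample_output in enumerate(sample_outputs):
    # Decode each generated sequence back to text, dropping special tokens
    print(f"{i + 1}: {tokenizer.decode(sample_output, skip_special_tokens=True)}\n")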

Here are 3 results.

1: beef, salt, pepper, and vinegar, then remove the fillet to a plate and refrigerate until ready to serve. Cooking with a Lightly Spiced Pork Rind. Place 1 inch cubes of pork tenderloin on a lightly oiled nonstick or silicone baking sheet and drizzle with some cooking spray. Place a large pot of water on the stovetop and bring to a boil. As the pork tenderloin cooks, begin by gently stirring the oil and salt in the pan. You will get the most tender and crisp meat when the oil is…

Overall, the generated text looks like a cooking recipe. However, there are some inconsistencies, such as “remove the fillet” in the context of “beef”, “refrigerate until ready to serve” mentioned right at the beginning of the cooking process, and the sudden appearance of “pork tenderloin”, among others.

2: beef, salt, pepper, red onion, and mustard on one side, then sliced mushrooms, tomatoes, and red peppers on the other side. I then topped it with some chicken breasts. I have to give credit where credit is due: I made this at home with an electric stovetop smoker. It’s one of the few things I have that I’m comfortable doing without a commercial oven (which, in fact, was what I had on hand), which makes cooking at home very easy. My husband has an electric grill, so I could have baked this and used a…

The topic abruptly shifts from food to smoker, oven, and grill.

3: beef, salt, pepper, salt and pepper shakers (no spices), ketchup, mustard, mayonnaise, pickles, condiments, lettuce, and/or lettuce leaves, mayonnaise, mustard, mayonnaise, onion powder, mayonnaise, onion powder, mayonnaise, pickles, lettuce, onions, lettuce leaves, and/or lettuce leaves, lettuce leaves, lettuce leaves, lettuce leaves, and/or lettuce leaves, lettuce leaves, lettuce leaves, lettuce leaves, lettuce leaves, lettuce leaves, lettuce leaves, lettuce leaves, lettuce leaves, lettuce leaves…

Even using top_k and top_p, the results remain repetitive and fail to complete the ingredients section.

Let’s fine-tune GPT2-medium with a cooking recipes dataset so that the model can generate recipes with some ingredients as the prompt.

2.4 Prepare a Dataset

I get the recipes dataset from Food.com — Recipes and Reviews. There are over 50,000 recipes, but I just use 1,000 recipes to fine-tune the GPT2 model. I combine the ingredients and instructions of each recipe into one text and wrap it between two distinct marker strings that signal the beginning and end of the text to the model.

df_recipes = pd.read_csv('recipes_1000.csv')
df_recipes.reset_index(drop=True, inplace=True)

def form_string(ingredient, instruction):
    s = f"<|startoftext|>Ingredients: {ingredient.strip()}. " \
        f"Instructions: {instruction.strip()}<|endoftext|>"
    return s

data = df_recipes.apply(lambda x: form_string(
    x['ingredients'], x['instructions']), axis=1).to_list()
data[0]

The data format of each recipe is like: <|startoftext|>Ingredients: blueberries, granulated sugar, vanilla yogurt, lemon juice. Instructions: Toss 2 cups berries with sugar. Let stand for 45 minutes, stirring occasionally. Transfer berry-sugar mixture to food processor. … Pour into plastic mold and freeze overnight. Let soften slightly to serve.<|endoftext|>

2.5 GPT2 Tokenizer

GPT-2 uses a byte pair encoding (BPE) tokenizer. BPE is a data compression technique that is widely used in NLP for tokenization. The GPT-2 tokenizer is available through Hugging Face’s open-source Transformers library.

tokenizer = GPT2TokenizerFast.from_pretrained(model_name,
                                              bos_token='<|startoftext|>',
                                              eos_token='<|endoftext|>',
                                              unk_token='<|unknown|>',
                                              pad_token='<|pad|>'
                                              )

GPT-2’s original vocabulary has 50,257 tokens, the last of which (id 50256) is the special token <|endoftext|>. After adding <|startoftext|>, <|unknown|> and <|pad|>, the vocabulary grows to 50,260 tokens, with the 4 last indices reserved for these special tokens. <|unknown|> stands in for words not in the GPT-2 vocabulary, and <|pad|> is appended to short sequences so that all sequences in a batch have the same length. We can take a look at tokens and their indices with tokenizer.vocab.items(). “Ġplayers”, “Ġplayer”, “player”, “ĠPlayer”, “ĠPlayers”, “Player”, “Players”, “players” are all different tokens in the GPT-2 vocabulary, where “Ġ” marks a leading space.
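As a quick check, here is a small sketch using the tokenizer created above; the exact IDs of the newly added tokens depend on the order in which they were registered.

print(len(tokenizer))                                       # 50260 with the 3 added tokens
print(tokenizer.convert_tokens_to_ids('<|endoftext|>'))     # 50256
print(tokenizer.convert_tokens_to_ids('<|startoftext|>'))   # 50257 here
print([t for t, _ in tokenizer.vocab.items() if 'player' in t.lower()][:8])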

2.6 Tokenize DataSets & DataLoaders

As GPT2 is a large model, I got an out-of-memory error with a batch_size above 2, so in each iteration I process 2 sequences together. The maximum sequence length is 180 tokens, well below gpt2-medium’s limit of 1024, but enough to capture the meaning of a generated recipe.

Now I create a RecipeDataset class to handle the recipe data using the PyTorch Dataset API. I use tokenizer.encode_plus to tokenize each recipe, add the special tokens at the start/end of the sequence, map tokens to their integer IDs, and create attention masks that distinguish real tokens from padding tokens. With truncation=True, padding='max_length', max_length=180, every input has the same length: long texts are truncated to 180 tokens, while short texts get extra padding tokens to reach 180 tokens.

batch_size = 2
max_length = 180

# standard PyTorch approach of loading data in using a Dataset class.
class RecipeDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.input_ids = []
        self.attn_masks = []

        for recipe in data:
            encodings = tokenizer.encode_plus(recipe,
                                              truncation=True,
                                              padding='max_length',
                                              max_length=max_length,
                                              # return a PyTorch tensor
                                              return_tensors='pt'
                                              )
            self.input_ids.append(torch.squeeze(encodings['input_ids'], 0))
            self.attn_masks.append(torch.squeeze(encodings['attention_mask'], 0))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

dataset = RecipeDataset(data, tokenizer)
print(f"input_ids: {dataset[0][0]} attn_masks: {dataset[0][1]}")

If you are curious about the data in the RecipeDataset, you can print out the first item of the encoded dataset and get something like the output below. The first token id of the sequence is 50257, which is the bos_token. 50256 is the eos_token, followed by a run of 50259s, which are the padding tokens that bring the sequence to 180 (max_length) tokens. The second tensor holds the attention masks, where 1 marks a real token and 0 marks a padding token.

dataset[0]: 
input_ids: tensor([50257, 41222, 25, 4171, 20853, 11, 19468, 4817,
16858, 32132, 11, 18873, 13135, 13, 27759, 25, 309, 793,
4691, 13, 50256, 50259, 50259, 50259, 50259, 50259, 50259, 50259,
50259, 50259, 50259, 50259, 50259, 50259, 50259, 50259, 50259])
attn_masks: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

I split the dataset into 90% for training and 10% for validation and create an iterator over the recipe dataset using the PyTorch DataLoader. This helps save memory during training because, with an iterator, the entire dataset does not need to be loaded into memory at once.

Our data points are individual recipes that are independent of each other, so their order does not matter; using RandomSampler or SequentialSampler to sample the data makes no practical difference in this project.

# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Create the DataLoaders for our training and validation datasets.
# Get training samples in random order.
train_dataloader = DataLoader(
    train_dataset,
    sampler=RandomSampler(train_dataset),
    batch_size=batch_size  # Trains with this batch size.
)

# Get validation samples sequentially.
validation_dataloader = DataLoader(
    val_dataset,
    sampler=SequentialSampler(val_dataset),
    batch_size=batch_size  # Evaluate with this batch size.
)

2.7 Finetune GPT2 Language Model

GPT2LMHeadModel is Hugging Face’s implementation of the GPT-2 architecture with a language-modeling head. Its primary use case is text generation: given an initial prompt or seed text, the model autoregressively generates additional text by predicting the next token based on the preceding tokens and sampling from the predicted distribution.

The original embedding matrix of gpt2-medium has shape [50257, 1024], where the last token (id 50256) is the eos_token <|endoftext|>. Since we added 3 more special tokens (bos_token, unk_token, pad_token) to the vocabulary, we need to resize the model embeddings to the new vocabulary size with resize_token_embeddings(len(tokenizer)); a quick check of this is sketched below.
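A small sanity check of what resize_token_embeddings does, assuming the tokenizer with the 3 added special tokens from Section 2.5:

# Inspect the token embedding matrix before and after resizing
print(model.transformer.wte.weight.shape)   # torch.Size([50257, 1024]) for gpt2-medium
model.resize_token_embeddings(len(tokenizer))
print(model.transformer.wte.weight.shape)   # torch.Size([50260, 1024])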

I choose PyTorch’s AdamW as the optimizer, which updates the model parameters via gradient descent. I also create a schedule in which the learning rate increases linearly from 0 to the initial lr set in the optimizer during a warmup period and then decreases linearly back to 0.

configuration = GPT2Config.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, config=configuration)
model = model.to(device)
model.resize_token_embeddings(len(tokenizer))

epochs = 3
learning_rate = 2e-5
warmup_steps = 1e2
# to prevent any division by zero in the implementation
epsilon = 1e-8
optim = AdamW(model.parameters(), lr = learning_rate, eps = epsilon)

total_steps = len(train_dataloader) * epochs # [no batches] x [no epochs]

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optim,
                                            num_warmup_steps=warmup_steps,
                                            num_training_steps=total_steps)

Training loop for fine-tuning.

  • The outer loop iterates over each epoch (training cycle).
  • The inner loop iterates over each batch in the training dataset.
  • model.train() switches the model to training mode (as opposed to eval(), evaluation mode).
  • Inside the batch loop, the input data (tokenized sentences), labels, and attention masks are extracted from the batch. These tensors are moved to the appropriate device (e.g., GPU).
  • model.zero_grad() clears the gradients of the model parameters before performing the forward pass with a new batch.
  • In the forward pass, the model is called with the input data to obtain the model’s predictions. The labels parameter is provided to calculate the loss during training. The model outputs include the loss and potentially other values (e.g., logits).
  • In the backward pass, the loss is used to compute the gradients of the model parameters with respect to the loss via backpropagation.
  • In the optimization step, the optimizer (optim) updates the model parameters based on the computed gradients.
for epoch_i in range(0, epochs):
    total_train_loss = 0
    model.train()

    for step, batch in enumerate(train_dataloader):
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        model.zero_grad()
        outputs = model(input_ids=b_input_ids, labels=b_labels,
                        attention_mask=b_masks, token_type_ids=None)

        loss = outputs[0]
        total_train_loss += loss.item()

        # Generate a sample every 100 batches (infer() is defined in Section 2.9).
        if step % 100 == 0 and not step == 0:
            model.eval()
            print(infer("eggs, flour, butter, sugar"))
            model.train()

        loss.backward()
        optim.step()
        scheduler.step()

At the same level as the training loop over train_dataloader above, I run validation iterations over validation_dataloader.

  • model.eval() sets the model to evaluation mode, which disables dropout and batch normalization layers, if any. During evaluation, we don’t want the model to update its parameters or learn anything new; we only want to evaluate its performance on unseen data.
  • torch.no_grad() temporarily disables gradient computation; during evaluation we don’t need gradients since we are not updating the model’s parameters.
model.eval()
total_eval_loss = 0

# Evaluate data for one epoch
for batch in validation_dataloader:
    b_input_ids = batch[0].to(device)
    b_labels = batch[0].to(device)
    b_masks = batch[1].to(device)

    with torch.no_grad():
        outputs = model(input_ids=b_input_ids, labels=b_labels,
                        attention_mask=b_masks)
        loss = outputs[0]
        total_eval_loss += loss.item()
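To get a per-epoch summary, the accumulated losses can be averaged and printed at the end of each epoch. This is a sketch that assumes the training and validation loops above run inside the same epoch loop.

# Average the losses over all batches in the epoch
avg_train_loss = total_train_loss / len(train_dataloader)
avg_val_loss = total_eval_loss / len(validation_dataloader)
print(f"Epoch {epoch_i + 1}: train loss {avg_train_loss:.3f}, val loss {avg_val_loss:.3f}")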

Below are the training statistics after 3 epochs. The training time and training loss will vary with different model sizes (small, large, extra-large), a different maximum sequence length (180 in my code), a bigger vocabulary (~50k for GPT-2 and GPT-3; ~100k for GPT-3.5 and GPT-4), and the amount of data (1,000 recipes in my code).

2.8 Saving & Loading Fine-Tuned Model

I save the trained model, its configuration, and the tokenizer for future use. The resulting pytorch_model.bin file has a size of 1.3 GB.

model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)
model = GPT2LMHeadModel.from_pretrained(model_save_path)
tokenizer = GPT2TokenizerFast.from_pretrained(model_save_path)
model.to(device)

2.9 Generate Text

GPT-2 is a decoder-only model: it reads the tokenized prompt and generates a continuation autoregressively, producing one token at a time conditioned on everything generated so far. There are various decoding strategies a model can use to pick each next token.

Greedy
At each step, the model predicts the token with the highest probability and selects it as the next token in the sequence. Greedy search tends to produce fluent and grammatically correct but repetitive outputs.

Beam Search
Expands on greedy search by maintaining a set of candidate sequences, called the beam. Beam search typically produces more coherent text than greedy search, but because it still favors the most probable sequences, it may suffer from repetitive or generic outputs.

Random Sampling with Temperature
Random sampling by itself can occasionally pick a very unlikely word by chance. Temperature is used to increase the probability of likely tokens while reducing the probability of unlikely ones. Usually, the range is 0 < temp ≤ 1.

Top-K Sampling
Top-K sampling ensures that the least probable words get no chance at all: only the top K most probable tokens are considered for generation.

Top-P Sampling (Nucleus Sampling)
Nucleus sampling is similar to Top-K sampling, but instead of keeping a fixed number of top tokens, it keeps the smallest set of tokens V(p) whose cumulative probability is ≥ p. The probabilities of tokens outside V(p) are set to 0, and the rest are re-scaled so that they sum to 1. The intuition is that when the model is very certain, the set of candidate tokens is small; when it is uncertain, more candidate tokens are needed for the cumulative probability to exceed p.
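To make the differences concrete, here is a sketch that tries these strategies side by side with model.generate, using the fine-tuned model and tokenizer from above; the parameter values are arbitrary examples, not tuned settings.

prompt = tokenizer("<|startoftext|>Ingredients: beef, salt, pepper",
                   return_tensors="pt").to(device)

# Greedy: always pick the most probable next token
greedy = model.generate(**prompt, max_new_tokens=60, do_sample=False)
# Beam search: keep 5 candidate sequences and return the best one
beam = model.generate(**prompt, max_new_tokens=60, num_beams=5,
                      early_stopping=True, do_sample=False)
# Sampling with temperature, top-k and top-p (nucleus) filtering
sampled = model.generate(**prompt, max_new_tokens=60, do_sample=True,
                         temperature=0.8, top_k=50, top_p=0.85)

for name, output in [("greedy", greedy), ("beam", beam), ("sampling", sampled)]:
    print(f"{name}: {tokenizer.decode(output[0], skip_special_tokens=True)}\n")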

This is the function I used to generate text from a model.

  • Takes a prompt as input and prepends the starting text to it. The input is then tokenized into its numerical representation.
  • Generates new text based on the provided input.
  • The generated output is a sequence of token IDs; all special tokens, such as padding and end-of-sequence tokens, are removed during decoding, and the decoded output is returned as a string of generated text.
def infer(prompt):
    input = f"<|startoftext|>Ingredients: {prompt.strip()}"
    input = tokenizer(input, return_tensors="pt")
    input_ids = input["input_ids"]
    attention_mask = input["attention_mask"]

    output = model.generate(input_ids.to(device),
                            attention_mask=attention_mask.to(device),
                            max_new_tokens=max_length,
                            do_sample=True, top_k=50, top_p=0.85)
    output = tokenizer.decode(output[0], skip_special_tokens=True)
    return output

So now, we are ready to generate recipes with our fine-tuned model. Let’s take 2 examples and see if the generated text is more coherent than the text generated by the original GPT2.

infer("eggs, mushroom, butter, sugar")

Ingredients: eggs, mushroom, butter, sugar, milk, cream, cornstarch, salt, brown sugar, flour, baking powder, baking soda, baking soda.
Instructions: In a large saucepan, bring eggs to boil. Reduce heat and simmer until thickened. Remove from heat and stir in mushroom, butter, sugar, milk, cornstarch, salt, and brown sugar. Add the flour, baking powder and baking soda; stir until well blended. Add the baking soda and stir until completely dissolved. Add milk and cornstarch mixture to the mushroom mixture. Stir well to combine. Divide mixture into 8 equal parts and bake at 350 degrees for 15 minutes. Let stand 5 minutes before serving. Makes 4–6 servings.

infer("beef, salt, pepper")

Ingredients: beef, salt, pepper, onions, garlic.
Instructions: Cut the beef into 1-inch cubes and discard the fat. Melt the butter in a small saucepan, and add the onion. Stir-fry for 2 minutes. Add the garlic and saute until golden brown. Add the beef cubes and cook for 5 minutes. Pour the remaining butter in a heavy bottomed pan and bring to a boil. Stir occasionally, simmer for 10 minutes. Sprinkle the chopped parsley over the beef cubes, and cook for 3 minutes more, stirring occasionally. Return the beef to the saucepan and stir. Add the salt, pepper, onion, garlic and remaining butter to the beef cubes. Cook for 2 minutes more. Season with salt and pepper. Spoon the beef into a lasagna pan, and spread the sauce evenly over. Top with fresh parsley and tomato sauce. Refrigerate for 5 minutes. Cut into

Observation

  • “baking soda” appears twice in the Ingredients of the first example. Some ingredients are used in the second recipe but not listed in the Ingredients section.
  • The recipes appear to have coherent steps.
  • Each time, completely different recipes are generated.
  • Obviously, we should not use these to cook.

2.10 Performance Evaluation with BLEU

I tried to use the BLEU score to evaluate the quality of a generated recipe by comparing it with the original recipe. However, there are no 2-gram overlaps between them. I’m posting the code here for your reference and may come back to it later.

import statistics
from nltk.translate.bleu_score import sentence_bleu

scores = []

for i in range(10):
    # val_dataset is a Subset of RecipeDataset, which only stores token ids,
    # so map back to the original dataframe row for the raw text.
    row = df_recipes.iloc[val_dataset.indices[i]]
    ingredients = row['ingredients']            # original ingredients
    reference = row['instructions'].split()     # original instructions, tokenized
    candidate = infer(ingredients).split()      # generated recipe, tokenized
    scores.append(sentence_bleu([reference], candidate))

print(statistics.mean(scores))

# UserWarning: The hypothesis contains 0 counts of 2-gram overlaps.
# Therefore the BLEU score evaluates to 0

This project is a work in progress and I will continually update it as I learn more about text generation.

Thanks a lot for reading, I hope I could help!

3. What’s Next?

As the article shows, fine-tuning GPT-2 on a specific dataset makes it relatively easy to generate relevant text. However, there are several improvement ideas that we can consider implementing:

  • Using a bigger GPT-2 model, such as gpt2-large or gpt2-xl, along with more training data, could generate higher-quality recipes.
  • Exploring the potential of GPT-4? This is not just an opportunity to generate better recipes but also a chance to work with Azure OpenAI services.
