LLMs with HuggingFace: A practical introduction

Large Language Models: are they just hype or an "iPhone moment"?
There's no better way to answer this question than getting our hands dirty and learning how to implement LLM apps, so I thought of compressing useful resources and practical knowledge into this post, where I'll be using HuggingFace libraries to outline the basics.
I would recommend some knowledge of Natural Language Processing to follow at full speed (at least knowing what terms like "vocabulary", "tokenization" and "word embedding" mean in an NLP context), but it is not strictly necessary.
By the end of this post you will get:

  • a short overview of what LLMs are
  • a tour of the main HuggingFace libraries (datasets, transformers, evaluate)
  • some tips for selecting the right model
  • working code for common LLM applications: sentiment analysis, translation, zero-shot classification and few-shot learning

LLMs in short

LLMs are Language Models trained on a humongous amount of data to perform classification-based tasks (e.g. they attempt to uncover a masked word in a text) or generative tasks (the model "guesses" the next word after the sequence it has been shown).
The latter are the most famous now, and we can define them as statistical models that try to predict the most likely next word in a text, by finding a distribution over a vocabulary (the set of tokens given to the model during training).
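
To make "finding a distribution over a vocabulary" concrete, here's a minimal sketch (it assumes the small GPT-2 checkpoint and PyTorch, neither of which is required elsewhere in this post) that prints the five most likely next tokens and their probabilities:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocabulary_size)

# a softmax over the last position gives the distribution over the whole vocabulary
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")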

Since BERT, this type of model has improved a lot, and with the latest advancements they keep getting better as they are fed more data; as a quick comparison, a human reads on average 700 books in their life, while some LLMs are fed an amount of text corresponding to 10 million books.

The types of applications that can be developed with less code using LLMs are increasing every day, so the potential is huge; on the other hand, there's no silver bullet, since model performance is highly task-dependent, so it is necessary to do some homework and get to know the models better.

HuggingFace main libraries

These are the 3 main HuggingFace libraries to start working with LLM apps (I would also advise setting up anaconda and a conda environment):

  • datasets, a 1-line API for loading and sharing datasets, which can be installed by typing pip install datasets in your terminal.
    To load a dataset you just need to import the function (from datasets import load_dataset) and to call it, e.g. xsum_dataset = load_dataset("xsum", version="1.2.0")
  • transformers, which includes pipelines, tokenizers, models etc. and can be installed in your environment just by typing conda install -c huggingface transformers
  • evaluate, which does what it promises 🙂 (there's a short sketch right after this list showing the three libraries working together)
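
Here's that sketch: a minimal example of how the three libraries fit together (the texts are made up, the default summarisation model is used, and the rouge metric may additionally require pip install rouge_score):

from datasets import load_dataset
from transformers import pipeline
import evaluate

# 1-line dataset loading
xsum_dataset = load_dataset("xsum", version="1.2.0")

# default summarisation pipeline on a short, made-up text
text = ("The tower is 324 metres tall, about the same height as an 81-storey building, "
        "and it is the tallest structure in Paris.")
prediction = pipeline("summarization")(text, min_length=5, max_length=20)[0]["summary_text"]

# score the prediction against a made-up reference summary
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=[prediction],
                    references=["The tallest structure in Paris is 324 metres tall."]))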

Pipeline coding example

Let's start from transformers.pipeline, looking at a summarisation example:

from transformers import pipeline

text "some text, for example an article"

summarizer = pipeline("summarization")
summarizer(text)

These few lines of code are enough for a default summarisation pipeline that can be used on our text.
In reality, this is what's happening inside:

---
title: Summarizer pipeline under the hood
---
classDiagram
    someTextData --|> promptConstruction
    promptConstruction --|> inputText
    inputText --|> encodingTokenizer
    encodingTokenizer --|> encodedInput
    encodedInput --|> LLMmodel
    LLMmodel --|> encodedOutput
    encodedOutput --|> decodingTokenizer
    decodingTokenizer --|> summarisedText

    class promptConstruction{ }
    note for promptConstruction "This is an optional step in which the instruction\nfor the model are merged with the text data\nthat we feed the model with"
    class someTextData{
        "Something huge happened in Penang in the last couple of weeks..."
    }
    class inputText{
        Summarize: "Something huge happened in Penang in the last couple of weeks..."
    }
    class encodingTokenizer{ }
    note for encodingTokenizer "The tokenizer is first used to transform text\ninto numbers based on the vocabulary,\nso that the model can be fed with it"
    class encodedInput{
        Encoded input: [20389, 498223, ...]
    }
    class LLMmodel{ }
    class encodedOutput{
        Encoded output: [4567, 219, ...]
    }
    class decodingTokenizer{ }
    note for decodingTokenizer "The tokenizer is finally used to transform numbers into text,\nso that the user can read it"
    class summarisedText{ }

The text is passed to a tokenizer for the encoding, then a model takes the encoded input and returns an output. This output needs to be decoded by the tokenizer to finally obtain the summary.
Let's see what the code looks like when every step is made explicit.

If we look more closely at the tokenizers and write a short encoding script, we can see that HuggingFace has the class transformers.AutoTokenizer that automatically loads the correct tokenizer for the model we choose (specified as an argument to its from_pretrained method).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some_model_name")

inputs = tokenizer(text, #the input text
                   max_length=1024, #upper bound on the tokenized length
                   truncation=True, #truncate longer texts so max_length takes effect
                   padding=True, #pad shorter texts in a batch to a uniform length
                   return_tensors="pt" #this specifies that we are using pytorch
                   )

Here's a link if you are not familiar with terms like padding and masking.
After being passed to the tokenizer, the input text will look like this:

{
 "input_ids": tensor([[1203, 43567, ...]]), #the token-encoded input text
 "attention_mask" : tensor([[1, ...]]) #the meta-data required by the model
                                       #to handle variable length inputs
 }
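
To see why the attention mask is needed, here's a tiny sketch (using bert-base-uncased as a stand-in checkpoint) that tokenizes two texts of different lengths in one batch: the shorter one is padded, and its padding positions are marked with 0 in the mask.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["a short text", "a slightly longer input text than the first one"],
                  padding=True, return_tensors="pt")
print(batch["input_ids"])       # the shorter row ends with padding token ids (0 for BERT)
print(batch["attention_mask"])  # 1 = real token, 0 = padding to be ignored by the model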

This tokenized data is then passed to the LLM, and again we can use another "Auto" class, this time for sequence-to-sequence modelling: transformers.AutoModelForSeq2SeqLM

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("some_model_name")

summary_ids = model.generate(
                            inputs.input_ids,
                            attention_mask=inputs.attention_mask,
                            num_beams=10, #beam search settings
                            min_length=5, #minimum output length
                            max_length=40 #maximum output length
                            )
decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

This code will return the summaries.
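
Putting the explicit steps together, here's a minimal end-to-end sketch; it assumes the t5-small checkpoint (any seq2seq model fine-tuned for summarisation would do), which also makes the "prompt construction" step from the diagram visible, since T5 expects a task prefix such as "summarize:":

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

text = "Something huge happened in Penang in the last couple of weeks..."

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# prompt construction + encoding
inputs = tokenizer("summarize: " + text,
                   max_length=1024, truncation=True, padding=True, return_tensors="pt")

# generation + decoding
summary_ids = model.generate(inputs.input_ids,
                             attention_mask=inputs.attention_mask,
                             num_beams=10, min_length=5, max_length=40)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))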

Selecting the right model

Let's keep the summarisation example rolling: do we want extractive summarization (where we select representative pieces of text from the original input) or abstractive summarization (where we generate new text that "compresses" the input)?
This is a fundamental detail to consider when choosing the right model.

The HuggingFace website UI can be filtered and sorted in many ways to find what we are looking for; here are a few tips to do that in the most efficient way (see also the code sketch after this list for searching the Hub programmatically):

  • Filter by task, licence, language, and other hard constraints
  • Sort by size (another important aspect, which depends on the resources available and the type of app.
    Size matters for LLMs, as larger models tend to be more powerful, but there are cases in which the real difference is made by the training technique: the GPT family, for example, used Reinforcement Learning from Human Feedback)
  • Sort by popularity and updates (old models might not load properly with more recent libraries)
  • Take into account model variants (usually it is better to prototype using smaller variants, or variants that have been fine-tuned for a specific task).
    Fine-tuned models will be smaller and perform better when the task matches the one they were fine-tuned on
  • Take into account the available examples to better understand usability
  • Take into account the datasets that have been used for training or fine tuning the model
  • Recognize "famous good models": there are models families that are important to recognize (e.g. Pythia is a family of base models commonly used for fine-tuning, and Dolly is a Pythia model fine tuned for instruction following)

Coding common LLM applications

Sentiment Analysis

Sentiment Analysis is a text classification task that estimates whether a piece of text is positive, negative, or another "sentiment" label (the set of labels can vary across applications).
The first step is always to import pipeline and load_dataset (plus pandas, which we will use later to display the results):

from datasets import load_dataset
from transformers import pipeline
import pandas as pd

We will use the poem_sentiment dataset (which contains texts labelled as negative (0), positive (1), no_impact (2) and mixed (3)) and a fine-tuned version of BERT (BERT is an encoder-only model from Google, usable for 11+ tasks such as sentiment analysis and entity recognition).

Let's create a cache directory so that downloaded data and models can be reused:

mkdir cache

We can now load the dataset, which has 3 columns:

  • id: the poem id
  • verse_text: the text to classify
  • label: either 0, 1, 2 or 3

We also take a small sample of it, which is what we will classify below:

poem_dataset = load_dataset(
    "poem_sentiment", version="1.0.0", cache_dir="../working/cache/"
)

# a small, arbitrary sample of the training split to classify
poem_sample = poem_dataset["train"].select(range(10))

We load the pipeline using the task text-classification, specifying the model and the cache directory:

sentiment_classifier = pipeline(
    task="text-classification",
    model="nickwong64/bert-base-uncased-poems-sentiment",
    model_kwargs={"cache_dir": "../working/cache/"},
)

To get the predicted sentiments:

results = sentiment_classifier(poem_sample["verse_text"])

The following lines display the predicted sentiment with the ground-truth label, the original text and the model's confidence:

# Join predictions with ground-truth data
joined_data = (
    pd.DataFrame.from_dict(results)
    .rename({"label": "predicted_label"}, axis=1)
    .join(pd.DataFrame.from_dict(poem_sample).rename({"label": "true_label"}, axis=1))
)

# Change label indices to text labels
sentiment_labels = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}
joined_data = joined_data.replace({"true_label": sentiment_labels})

display(joined_data[["predicted_label", "true_label", "score", "verse_text"]])
predicted_label  true_label  score     verse_text
positive         positive    0.996594  with pale blue berries. in these peaceful shad...
no_impact        no_impact   0.998741  it flows so long as falls the rain,
negative         negative    0.995966  and that is why, the lonesome day,
mixed            mixed       0.99654   when i peruse the conquered fame of heroes, an...
...              ...         ...       ...
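
Since the ground truth is available, this is also a natural place to bring in the evaluate library mentioned earlier; a minimal sketch that reuses joined_data from above, mapping the text labels back to the dataset's integer ids and computing accuracy:

import evaluate

label_ids = {"negative": 0, "positive": 1, "no_impact": 2, "mixed": 3}

accuracy = evaluate.load("accuracy")
print(accuracy.compute(
    predictions=[label_ids[label] for label in joined_data["predicted_label"]],
    references=[label_ids[label] for label in joined_data["true_label"]],
))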

Translation

We are going to use the Helsinki-NLP/opus-mt-en-es and t5-small models for this translation task.
The latter is an encoder-decoder model created by Google that supports several tasks (summarisation, translation, Q&A and text classification), while the former is based on Marian-NMT and is used specifically for English-to-Spanish translation (it also requires the additional sacremoses library, so we need to install it).

pip install sacremoses==0.0.53

There are many translation datasets, but here I am just using a few hard-coded examples like the following:

#load translation pipeline
en_to_es_translation_pipeline = pipeline(
    task="translation",
    model="Helsinki-NLP/opus-mt-en-es",
    model_kwargs={"cache_dir": "../working/cache/"},
)

#execute on a text
en_to_es_translation_pipeline(
    "Existing, open-source (and proprietary) models can be used out-of-the-box for many applications."
)

The output looks like

[{'translation_text': 'Los modelos existentes, de código abierto (y propietario) se pueden utilizar fuera de la caja para muchas aplicaciones.'}]

When using other types of models, like T5, we need to include the task instruction in the prompt, as these models can handle multiple languages and tasks:

t5_small_pipeline = pipeline(
    task="text2text-generation",
    model="t5-small",
    max_length=50,
    model_kwargs={"cache_dir": "../working/cache/"}
)

#the prompt includes instructions for the languages
t5_small_pipeline(
    "translate English to French: Existing, open-source (and proprietary) models can be used out-of-the-box for many applications."
)
[{'generated_text': 'Les modèles existants, libres (et propriétaires) peuvent être utilisés hors de la boîte de commande pour de nombreuses applications.'}]

Zero-shot classification

Zero-shot classification consists of classifying a piece of text into one or more categories, without having trained the model to predict those categories explicitly.

We will use the xsum dataset mentioned at the top of this post; this dataset is a collection of BBC articles, and we will classify them against a set of labels of our choice.

xsum_dataset = load_dataset(
    "xsum", version="1.2.0", cache_dir="../working/cache/"
)

As a model, we will use a fine-tuned version of DeBERTa: nli-deberta-v3-small.
This model was NOT fine-tuned on the labels that we are going to use for this task, but it "knows their meaning" thanks to its more general training.

#here we define the pipeline
zero_shot_pipeline = pipeline(
    task="zero-shot-classification",
    model="cross-encoder/nli-deberta-v3-small",
    model_kwargs={"cache_dir": "../working/cache/"},
)

#these are the labels we want to use
labels = [
            "politics",
            "finance",
            "sports",
            "science and technology",
            "pop culture",
            "breaking news",
        ]

#we use a helper function to pass the labels and return predictions with confidence scores
from typing import List

def categorize_article(article: str, labels: List[str] = labels) -> None:
    """
    This helper function passes the candidate labels to the zero-shot pipeline and prints out the predicted labels alongside their confidence scores.
    """
    results = zero_shot_pipeline(
        article,
        candidate_labels=labels,
    )
    # Print the results nicely
    del results["sequence"]
    display(pd.DataFrame(results))

Let's see what the output will be with this long article:

categorize_article(
    """
Simone Favaro got the crucial try with the last move of the game, following earlier touchdowns by Chris Fusaro, Zander Fagerson and Junior Bulumakau.
Rynard Landman and Ashton Hewitt got a try in either half for the Dragons.
Glasgow showed far superior strength in depth as they took control of a messy match in the second period.
Home coach Gregor Townsend gave a debut to powerhouse Fijian-born Wallaby wing Taqele Naiyaravoro, and centre Alex Dunbar returned from long-term injury, while the Dragons gave first starts of the season to wing Aled Brew and hooker Elliot Dee.
Glasgow lost hooker Pat McArthur to an early shoulder injury but took advantage of their first pressure when Rory Clegg slotted over a penalty on 12 minutes.
It took 24 minutes for a disjointed game to produce a try as Sarel Pretorius sniped from close range and Landman forced his way over for Jason Tovey to convert - although it was the lock's last contribution as he departed with a chest injury shortly afterwards.
Glasgow struck back when Fusaro drove over from a rolling maul on 35 minutes for Clegg to convert.
But the Dragons levelled at 10-10 before half-time when Naiyaravoro was yellow-carded for an aerial tackle on Brew and Tovey slotted the easy goal.
The visitors could not make the most of their one-man advantage after the break as their error count cost them dearly.
It was Glasgow's bench experience that showed when Mike Blair's break led to a short-range score from teenage prop Fagerson, converted by Clegg.
Debutant Favaro was the second home player to be sin-binned, on 63 minutes, but again the Warriors made light of it as replacement wing Bulumakau, a recruit from the Army, pounced to deftly hack through a bouncing ball for an opportunist try.
The Dragons got back within striking range with some excellent combined handling putting Hewitt over unopposed after 72 minutes.
However, Favaro became sinner-turned-saint as he got on the end of another effective rolling maul to earn his side the extra point with the last move of the game, Clegg converting.
Dragons director of rugby Lyn Jones said: "We're disappointed to have lost but our performance was a lot better [than against Leinster] and the game could have gone either way.
"Unfortunately too many errors behind the scrum cost us a great deal, though from where we were a fortnight ago in Dublin our workrate and desire was excellent.
"It was simply error count from individuals behind the scrum that cost us field position, it's not rocket science - they were correct in how they played and we had a few errors, that was the difference."
Glasgow Warriors: Rory Hughes, Taqele Naiyaravoro, Alex Dunbar, Fraser Lyle, Lee Jones, Rory Clegg, Grayson Hart; Alex Allan, Pat MacArthur, Zander Fagerson, Rob Harley (capt), Scott Cummings, Hugh Blake, Chris Fusaro, Adam Ashe.
Replacements: Fergus Scott, Jerry Yanuyanutawa, Mike Cusack, Greg Peterson, Simone Favaro, Mike Blair, Gregor Hunter, Junior Bulumakau.
Dragons: Carl Meyer, Ashton Hewitt, Ross Wardle, Adam Warren, Aled Brew, Jason Tovey, Sarel Pretorius; Boris Stankovich, Elliot Dee, Brok Harris, Nick Crosswell, Rynard Landman (capt), Lewis Evans, Nic Cudd, Ed Jackson.
Replacements: Rhys Buckley, Phil Price, Shaun Knight, Matthew Screech, Ollie Griffiths, Luc Jones, Charlie Davies, Nick Scott.
"""
)
predicted_label          score
sports                   0.469
breaking news            0.223
science and technology   0.107
pop culture              0.104
politics                 0.057
finance                  0.039

When all the confidence scores are low, the model is uncertain about all of the labels, so it might be that none of them fits the text properly.

Few-shot learning

In few-shot learning tasks we give the model an instruction with a few examples of how to follow that instruction, and we expect it to generate a coherent response when a new query is provided.
This technique is very powerful and allows models to be reused for many more applications, but requires significant prompt engineering to get good and reliable results.

We will be using this approach with sentiment analysis.
On HuggingFace we will specify task="text-generation" in the pipeline constructor, as that's how the library handles it.
As a model, we will be using GPT-Neo 1.3B, a transformer model developed by EleutherAI.
Few-shot learning generally requires larger models, as this task is much more general than the others, so if you want to get serious with it I would recommend upgrading to GPT-NeoX (or other larger and more powerful models for instruction-following or text generation).

# defining the pipeline
few_shot_pipeline = pipeline(
    task="text-generation", #general task specified
    model="EleutherAI/gpt-neo-1.3B",
    max_new_tokens=10, #for each text generation, produce max 10 tokens
    model_kwargs={"cache_dir": "../working/cache/"},
)

# Get the token ID for "###", which we will use as the EOS (end of sequence) token below.
eos_token_id = few_shot_pipeline.tokenizer.encode("###")[0]

Without any examples, the model output is inconsistent and usually incorrect.

# wrong result with a poor prompt
results = few_shot_pipeline(
    """For each tweet, describe its sentiment:

[Tweet]: "This new music video was incredible"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])
For each tweet, describe its sentiment:

[Tweet]: "This new music video was incredible"
[Sentiment]: "sad, very sad"

In the following example we provide just one type of label, so the model still doesn't get the task right. It needs varied and relevant examples, so we're not there yet.

results = few_shot_pipeline(
    """For each tweet, describe its sentiment:

[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])
For each tweet, describe its sentiment:

[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]: Neutral
###

The more examples provided, the better the performance will be:

# giving 1 example for each sentiment, the model is more likely to understand!
results = few_shot_pipeline(
    """For each tweet, describe its sentiment:

[Tweet]: "I hate it when my phone battery dies."
[Sentiment]: Negative
###
[Tweet]: "My day has been 👍"
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:""",
    eos_token_id=eos_token_id,
)

print(results[0]["generated_text"])
For each tweet, describe its sentiment:

[Tweet]: "I hate it when my phone battery dies."
[Sentiment]: Negative
###
[Tweet]: "My day has been 👍"
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]: Positive

I hope you found this post useful to start your journey in LLM app development; I am planning to post more about this topic, so let me know in the comments if there's anything specific you want to read!
