How to run Llama 2

In this post, we show how Llama 2 LLM can be run on a server with a GPU. Llama 2 is a collection of pre-trained and fine-tuned LLMs ranging in scale from 7 billion to 70 billion parameters. This model is available for free for research and commercial use with some exceptions.

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

This Python code is using the HuggingFace Transformers library to generate text recommendations based on a given prompt using the LLaMA text generation model.

It starts by importing the necessary libraries: transformers, torch

It then specifies the pretrained LLaMA model to use and loads the associated tokenizer.

The pipeline is created, specifying text-generation as the task, the LLaMA model, and some generation parameters like using float16 tensors and allowing multiple GPUs if available.

The prompt is defined, as asking for recommendations based on liking Breaking Bad and Band of Brothers.

The pipeline is called, generating text with:

  1. do_sample=True - Generates text by sampling from the model probabilities rather than greedily.
  2. top_k=10 - Filters vocabulary down to the top 10 most likely next tokens.
  3. num_return_sequences=1 - Generates 1 text sequence.
  4. eos_token_id - Stops generation when the end-of-sequence token is reached
  5. max_length=200 - Limits the max text length.

Finally, it prints out the generated text sequence.

So in summary, it is using Transformers and LLaMA to take a prompt, generate a text continuation for it, and return a recommended show based on the likes specified. The various parameters control the text generation process.

Other Variants of Llama Model

  1. Code Llama : https://arxiv.org/pdf/2308.12950.pdf
  2. Llama fine-tuned for Opthmalogy: Ophtha-LLaMA2: A Large Language Model for Ophthalmology https://arxiv.org/pdf/2308.12950.pdf
  3. Llama for Italian language: LLaMAntino: LLaMA 2 models for effective text generation in Italian language

References

  1. https://about.fb.com/news/2023/07/llama-2/
  2. https://simonwillison.net/2023/Aug/1/llama-2-mac/
  3. https://github.com/kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference