How to run open-source large language models

How to run open-source large language models such as Falcon-7B and Llama 2

Over the past year, large language models have become enormously popular, and the release of ChatGPT was the moment when that popularity really took off. While closed-source models such as ChatGPT, Bard, and Claude are easy for anyone to use, machine learning developers should also learn how to work with open-source large language models. These open-source models often require powerful GPUs to run, but platforms such as Google Colab and Kaggle make it possible to test and explore them.

Hugging Face Models on GPU

Let's try to run the Falcon-7B-Instruct model. Falcon-7B-Instruct is a 7-billion-parameter causal decoder-only model built by TII on top of the Falcon-7B base model and fine-tuned on a mixture of chat/instruct datasets. It is available under the Apache 2.0 license. First, we install the required libraries

pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
pip install -q -U einops
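
4-bit loading with bitsandbytes requires a CUDA GPU, so it is worth confirming that a GPU runtime is active before loading the model. This quick check assumes PyTorch is already installed, as it is on Colab and Kaggle

import torch

# confirm a GPU is visible before loading the model in 4-bit
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected, switch to a GPU runtime")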

Then we import the required classes from the Hugging Face Transformers library

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
import transformers

We specify the model name and instantiate the model and tokenizer, loading the weights in 4-bit to reduce GPU memory usage

model_name = "tiiuae/falcon-7b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
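
Note that newer releases of transformers deprecate passing load_in_4bit directly to from_pretrained in favour of a BitsAndBytesConfig. A roughly equivalent sketch looks like the following, where the compute dtype is an assumption and can be changed

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization settings handled by bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bfloat16 compute; float16 also works
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)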

Then we wrap the model and tokenizer in a text-generation pipeline so we can run inference

pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
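
The pipeline can then be called with a prompt. The prompt text and generation settings below are only illustrative

prompt = "Explain the difference between a falcon and an eagle in one paragraph."  # example prompt
outputs = pipe(
    prompt,
    max_new_tokens=128,   # illustrative generation settings
    do_sample=True,
    temperature=0.7,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"])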

Running Llama Models on CPU

Thanks to Georgi Gerganov's llama.cpp, LLMs can also run on CPUs. The goal of llama.cpp is to run LLM inference using 4-bit integer quantization, and it is implemented in C/C++.

First, let's clone the code

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Build the code

make

The model weights need to be downloaded separately and saved in ./models

ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
  # [Optional] for models using BPE tokenizers
  ls ./models
  65B 30B 13B 7B vocab.json

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

  # [Optional] for models using BPE tokenizers
  python convert.py models/7B/ --vocabtype bpe

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

# run the inference
./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
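
To get a completion for a specific prompt, the prompt can be passed with the -p flag; the prompt text here is just an example

./main -m ./models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 128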

Typically, the 7B model requires ~4 GB of RAM to load and run.
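
That figure lines up with a quick back-of-the-envelope estimate, a rough sketch that only counts the quantized weights and ignores the KV cache and other runtime overhead

# memory needed for the 4-bit quantized 7B weights alone
n_params = 7e9
bits_per_weight = 4
weight_gb = n_params * bits_per_weight / 8 / 1e9
print(f"quantized weights: ~{weight_gb:.1f} GB")  # ~3.5 GB, so ~4 GB in total is a reasonable ballpark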