It involves converting the weights from FP16 to INT8, effectively halving the size of the LLM. The method claims to reduce the size of LLMs of up to 175B parameters without performance degradation.
Before going into the details of the paper [1], it is important to understand that LLMs have emergent features: patterns that arise from the training data and are crucial to the model's performance. Some of these features can have large magnitudes and exert a strong influence over the model's overall output.
Steps involved:
- The LLM.int8() method starts with vector-wise quantization. This means that each vector (a row in the matrix) is quantized separately, using its own normalization constant. The relative importance of each feature is thus preserved.
- For each vector, a normalization constant is calculated and used to scale the vector so that it can be represented with 8-bit integers. Using these normalization constants, most of the features in the LLM are quantized.
- For emergent outliers (features with unusually large magnitudes), a mixed-precision decomposition scheme is used. This isolates the outlier features into a separate 16-bit matrix multiplication, ensuring they are handled accurately while still allowing more than 99.9% of the values to be multiplied in 8-bit. A minimal sketch of this idea follows the list.
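Below is a minimal PyTorch sketch of that decomposition (not the actual bitsandbytes kernels): activation columns whose magnitude exceeds a threshold are treated as outliers and multiplied in high precision, while the rest is multiplied on the INT8 grid using vector-wise absmax constants. The threshold, shapes, and function name are illustrative assumptions.
import torch

def int8_matmul_with_outliers(X, W, threshold=6.0):
    # Columns of X whose magnitude exceeds the threshold are treated as emergent
    # outliers and kept in high precision; everything else uses 8-bit values.
    outlier_cols = (X.abs() > threshold).any(dim=0)
    X_reg, W_reg = X[:, ~outlier_cols], W[~outlier_cols, :]
    # Vector-wise normalization constants: per row of X and per column of W.
    cx = X_reg.abs().amax(dim=1, keepdim=True) / 127.0
    cw = W_reg.abs().amax(dim=0, keepdim=True) / 127.0
    # Round onto the INT8 grid (kept in float here; real kernels run an INT8 GEMM).
    X_q = (X_reg / cx).round().clamp(-127, 127)
    W_q = (W_reg / cw).round().clamp(-127, 127)
    Y_int8 = (X_q @ W_q) * cx * cw                       # dequantize with the constants
    Y_fp16 = X[:, outlier_cols] @ W[outlier_cols, :]     # outlier part in high precision
    return Y_int8 + Y_fp16

X, W = torch.randn(4, 64), torch.randn(64, 16)
print(int8_matmul_with_outliers(X, W).shape)  # torch.Size([4, 16])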
Pros
LLMs can be quantized and used directly for inference without performance degradation.
Cons
The method focuses only on the INT8 data type and on models of up to 175B parameters (specifically OPT-175B / BLOOM).
Code Implementation
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
GPTQ (Oct 2022)
GPTQ was an early one-shot PTQ technique that enabled efficient deployment of large language models. It achieves this mainly through the two features proposed in the paper [4]:
- Layerwise Quantization
Quantization is performed layer by layer in the LLM. The goal is to find a simpler version of the weights that still gives good results when used for predictions. This is done so that the difference between the outputs of the original and the quantized weights is as small as possible, i.e., the lowest mean squared error.
- Optimal Brain Quantization
It is an algorithm intended to reduce the errors introduced into the model by quantization. While a weight is being quantized, the remaining weights are adjusted to compensate for the error. A sketch of the layer-wise objective is shown below.
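GPTQ solves this layer-wise objective column by column, using second-order (Hessian) information to compensate the not-yet-quantized weights. The sketch below only illustrates the objective itself, with a plain round-to-nearest quantizer standing in for the actual solver; shapes and names are assumptions.
import torch

def quantize_rtn(W, bits=4):
    # Plain round-to-nearest baseline with per-row absmax scaling,
    # standing in for GPTQ's actual column-wise solver.
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax
    return (W / scale).round().clamp(-qmax - 1, qmax) * scale

def layer_reconstruction_error(W, W_q, X):
    # GPTQ's layer-wise goal: keep the quantized layer's outputs close to the
    # original outputs on calibration inputs, i.e. minimize ||W X - W_q X||^2.
    return ((W @ X - W_q @ X) ** 2).mean()

torch.manual_seed(0)
W = torch.randn(128, 256)      # one linear layer's weights
X = torch.randn(256, 64)       # calibration activations
print(layer_reconstruction_error(W, quantize_rtn(W), X))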
Pros
GPTQ allows quantization down to 2 bits, providing a range of trade-offs between model size and performance.
Cons
Quantization with this method can introduce considerable performance degradation.
Code Implementation
Install the required libraries.
pip install auto-gptq transformers accelerate
Load the model and quantize it with the auto-gptq library.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quant_config)
QLoRA (May 2023)
Before diving into QLoRA, here is a brief introduction to LoRA. LoRA (Low-Rank Adaptation of Large Language Models) is a parameter-efficient fine-tuning technique used to specialize LLMs for particular tasks. It does so by injecting trainable rank-decomposition matrices into each transformer layer. Moreover, it minimizes the number of parameters that need to be trained for the target task, all while keeping the original pre-trained model weights unchanged. Read more about it here. A minimal sketch of the idea follows.
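As a minimal sketch of that idea (not the peft implementation), the snippet below wraps a frozen linear layer with a trainable low-rank update B·A; the class name and hyperparameters are assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen linear layer and adds a trainable low-rank update B @ A,
    # so the effective weight is W + (alpha / r) * B A with far fewer
    # trainable parameters than W itself.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters vs ~16.8M in the frozen base layer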
QLoRA is an enhanced version of LoRA. Here are the highlights of this method as described in the paper [2]:
1. 4-bit NormalFloat Quantization:
The 4-bit NormalFloat (NF4) data type works by computing the 2ᵏ + 1 quantiles (where k is the bit count) of a theoretical N(0, 1) distribution and normalizing these values into the [-1, 1] interval. With this normalization, the neural network weights can likewise be rescaled to the [-1, 1] range and then quantized (a simplified sketch of these levels appears after this list).
2. Double Quantization:
This involves quantizing the quantization constants used in the 4-bit NF quantization step, saving on average roughly 0.37 bits per parameter. This is useful because QLoRA uses block-wise k-bit quantization, which produces a large number of such constants.
3. Paged Optimizers:
QLoRA uses NVIDIA's unified memory feature to page optimizer states between GPU and CPU. This prevents GPU out-of-memory errors during memory spikes and keeps training running without interruption.
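The sketch below illustrates how such NormalFloat-style levels can be built and used, under stated simplifications: quantile values of a standard normal are normalized into [-1, 1], and each block of weights is snapped to its nearest level after absmax scaling. The exact NF4 construction in the paper additionally guarantees an exact zero code, and bitsandbytes stores only the 4-bit indices plus per-block constants; function names here are assumptions.
import torch

def normal_float_levels(k: int = 4, eps: float = 1e-2):
    # Illustrative NormalFloat-style levels: take 2^k + 1 quantiles of N(0, 1),
    # average adjacent quantiles to get 2^k representable values, and normalize
    # them into [-1, 1]. (The real NF4 construction also guarantees an exact zero.)
    probs = torch.linspace(0.0, 1.0, 2 ** k + 1)
    probs[0], probs[-1] = eps, 1 - eps                  # keep tail quantiles finite
    q = torch.distributions.Normal(0.0, 1.0).icdf(probs)
    levels = (q[:-1] + q[1:]) / 2                       # 2^k quantization levels
    return levels / levels.abs().max()                  # normalize to [-1, 1]

def quantize_block(w: torch.Tensor, levels: torch.Tensor):
    # Rescale the block into [-1, 1] with its absmax, then snap each weight
    # to the nearest NormalFloat level; only the codes and absmax need storing.
    absmax = w.abs().max()
    idx = ((w / absmax).unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return idx, absmax, levels[idx] * absmax            # codes, scale, reconstruction

levels = normal_float_levels()
w = torch.randn(64)                                     # one 64-weight block
codes, scale, w_hat = quantize_block(w, levels)
print(levels.numel(), (w - w_hat).abs().mean())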
Pros
Thanks to its lower GPU memory usage, QLoRA can support larger maximum sequence lengths and larger batch sizes.
Cons
It can be slower in terms of tuning speed. It also ranks lower in cost efficiency, though that is usually not a major concern.
Code Implementation
Install the required libraries.
pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
pip install -q datasets bitsandbytes
Load the model and tokenizer. Configure the LoRA parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
trust_remote_code=True
)
model.config.use_cache = False
from peft import LoraConfig, get_peft_model
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64
peft_config = LoraConfig(
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
r=lora_r,
task_type="CAUSAL_LM"
)
Set up the trainer using SFTTrainer from the TRL library, which provides a wrapper around the transformers Trainer to easily fine-tune models on instruction-based datasets using PEFT adapters. Of course, you will need a dataset to train on.
from transformers import TrainingArguments
output_dir = "./models"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 100
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"
training_arguments = TrainingArguments(
output_dir=output_dir,
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
optim=optim,
save_steps=save_steps,
logging_steps=logging_steps,
learning_rate=learning_rate,
fp16=True,
max_grad_norm=max_grad_norm,
max_steps=max_steps,
warmup_ratio=warmup_ratio,
group_by_length=True,
lr_scheduler_type=lr_scheduler_type,
)
from trl import SFTTrainer
max_seq_length = 512
trainer = SFTTrainer(
model=model,
train_dataset=dataset,  # your instruction dataset, loaded beforehand
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=max_seq_length,
tokenizer=tokenizer,
args=training_arguments,
)
trainer.train()
AWQ (Jun 2023)
AWQ (Activation-aware Weight Quantization) is a post-training quantization method. In this method, the activations of the model are taken into account rather than the weights alone. Let me quote directly from the paper [3]:
Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights.
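Below is a rough sketch of that per-channel scaling idea, under assumed shapes and with a fixed scale factor: input channels are ranked by average activation magnitude, the most salient weight channels are scaled up before quantization so they lose less precision, and the inverse scale is folded back afterwards. Real AWQ searches for the per-channel scale that minimizes the layer's output error and fuses the inverse scale into the preceding operation.
import torch

def quantize_rtn(W, bits=4):
    # Simple round-to-nearest weight quantizer with per-row absmax scaling.
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax
    return (W / scale).round().clamp(-qmax - 1, qmax) * scale

def awq_style_scaling(W, X, top_frac=0.01, s=2.0):
    # Rank input channels by average activation magnitude; the corresponding
    # weight channels are "salient" and get scaled up before quantization so
    # they occupy more of the quantization grid (i.e. are better protected).
    act_importance = X.abs().mean(dim=0)                 # per input channel
    k = max(1, int(top_frac * W.shape[1]))
    salient = act_importance.topk(k).indices
    scales = torch.ones(W.shape[1])
    scales[salient] = s
    W_q = quantize_rtn(W * scales, bits=4)               # quantize the scaled weights
    return W_q / scales                                  # fold the inverse scale back

W = torch.randn(512, 1024)        # (out_features, in_features)
X = torch.randn(32, 1024)         # calibration activations
W_q = awq_style_scaling(W, X)
print((W - W_q).abs().mean())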
Pros
AWQ offers better accuracy than many other methods because the weights critical to LLM performance are preserved. It is also efficient and fast, since it does not involve backpropagation or reconstruction. It performs well on edge devices.
Cons
While keeping 0.1% of the weights in FP16 can improve quantization performance without significantly increasing model size, this mixed-precision data type complicates system implementation.
Code Implementation
Install the required libraries.
!pip install autoawq transformers accelerate
Load the model and quantize it with the autoawq library.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_id = 'meta-llama/Llama-2-7b-hf'
quant_path = 'Llama2-7b-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }# Load mannequin and tokenizer
mannequin = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Quantize and save
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
QuIP# (Jul 2023)
In simple terms, QuIP (Quantization with Incoherence Processing) is based on the idea that quantization can be improved if the model's weights are evenly distributed (incoherent) and the important directions for rounding them are not aligned with the coordinate axes. It consists of two steps:
- LDLQ adaptive rounding procedure: adjust the weights of the model in a way that minimizes a certain measure of error (the 'quadratic proxy objective') [8].
- Pre- and post-processing: multiply the weight and Hessian matrices by random orthogonal matrices. This ensures that the weights and Hessians are incoherent, which is helpful for the quantization process.
QuIP# [5] advances on QuIP with several improvements in processing.
- Improved Incoherence Processing: it uses a faster and better technique called the randomized Hadamard transform (a rough sketch of the incoherence idea appears after this list).
- Vector Quantization: QuIP# uses vector quantization to exploit the ball-shaped sub-Gaussian distribution that incoherent weights follow. Specifically, it introduces a set of hardware-efficient codebooks based on the highly symmetric E8 lattice. The E8 lattice achieves the optimal 8-dimensional unit-ball packing, which means it can represent the weights more efficiently.
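The sketch below illustrates the shared incoherence-processing idea under simplified assumptions: the weight matrix and a (proxy) Hessian are conjugated with random orthogonal matrices before rounding, and the rotation is undone afterwards. QuIP# swaps the random orthogonal matrices for a faster randomized Hadamard transform and quantizes the rotated weights with E8-lattice codebooks; neither of those parts is shown here.
import torch

def random_orthogonal(n, seed):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    g = torch.Generator().manual_seed(seed)
    Q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return Q

def incoherence_process(W, H):
    # Conjugate weights and Hessian with random rotations so that large entries
    # get spread across coordinates (made "incoherent") before rounding.
    U = random_orthogonal(W.shape[0], seed=0)
    V = random_orthogonal(W.shape[1], seed=1)
    return U @ W @ V.T, V @ H @ V.T, U, V

def incoherence_unprocess(W_tilde_q, U, V):
    # After quantizing the rotated weights, rotate back to the original basis.
    return U.T @ W_tilde_q @ V

W = torch.randn(128, 128)
W[0, 0] = 50.0                      # an outlier weight
H = torch.eye(128)                  # stand-in for a calibration Hessian
W_tilde, H_tilde, U, V = incoherence_process(W, H)
print(W.abs().max().item(), W_tilde.abs().max().item())   # the outlier gets spread out
print(torch.allclose(incoherence_unprocess(W_tilde, U, V), W, atol=1e-4))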
Pros
Compared to other methods, QuIP# offers significantly higher throughput (>40%) at the same or better quantization quality. That is not bad for 2-bit quantization.
Cons
Although not many limitations are mentioned, complexity and hardware compatibility should be considered.
Code Implementation
Clone the official repo and install the required libraries.
git clone https://github.com/Cornell-RelaxML/quip-sharp.git
pip install -r requirements.txt
cd quiptools && python setup.py install && cd ../
Find the scripts for the various models in the repo. Run the script quantize_finetune_llama.py to quantize Llama models.
Also, take a look at the repo for QuIP quantization with transformers. The code for quantizing a model is shown below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from quantizer import QuipQuantizer
model_name = "meta-llama/Llama-2-70b-hf"
quant_dir = "llama-70b_2bit_quip"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
quant = QuipQuantizer(codebook="E8P12", dataset="redpajama")
quant.quantize_model(model, tokenizer, quant_dir)
GGUF (Aug 2023)
GGUF (GPT-Generated Unified Format) was a highly anticipated release by Georgi Gerganov and the llama.cpp team. Its main highlight was that LLMs could now easily be run on consumer CPUs. It was earlier called GGML and was later upgraded to GGUF.
A notable achievement of GGML was the ability to offload certain layers of the LLM to the GPU, if one is available, even while the LLM runs on the CPU. This effectively addresses the widespread problem developers face due to insufficient VRAM.
Pros
If you plan to run LLMs on a CPU or on Apple devices (the M-series chips), it is the go-to method for many LLMs like Llama and Mistral. The GGUF file format is now well supported by llama.cpp and Hugging Face. GGUF models also show lower perplexity scores compared to other formats.
Cons
GGUF is focused on CPUs and Apple M-series devices. This could be a limitation if you are working with different hardware configurations.
Code Implementation
Install the ctransformers library.
pip install ctransformers[cuda]
Models are available in TheBloke's repositories on Hugging Face.
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline
# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/zephyr-7B-beta-GGUF",
model_file="zephyr-7b-beta.Q4_K_M.gguf",
model_type="mistral", gpu_layers=50, hf=True
)
tokenizer = AutoTokenizer.from_pretrained(
"HuggingFaceH4/zephyr-7b-beta", use_fast=True
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
HQQ (Nov 2023)
According to the paper, weights can be quantized either with data-free (calibration-free) methods such as bitsandbytes or with calibration-based methods such as GPTQ and AWQ. While calibration-free methods are faster, calibration-based methods suffer from calibration-data bias and long quantization times.
HQQ (Half-Quadratic Quantization) carries out quantization on the fly using fast and robust optimization. It eliminates the need for calibration data and is versatile enough to quantize any given model, achieving the speed of calibration-free methods without the data-bias issues. It drastically reduces quantization time to just a few minutes thanks to optimization techniques such as half-quadratic splitting. For more details on the math and workings of the method, see the official website.
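To give a flavour of what half-quadratic splitting does (a toy sketch under assumed simplifications, not the hqq library's actual algorithm), the snippet below alternates, for a single weight group with a fixed scale, between an l1 soft-threshold on the quantization error (standing in for HQQ's lp-norm shrinkage) and a closed-form update of the zero-point. All names and constants are assumptions.
import torch

def soft_threshold(x, lam):
    # Proximal step for an l1-style sparsity-promoting penalty on the error.
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def hqq_style_zero_point(W, bits=4, iters=20, lam=0.01):
    # Toy half-quadratic splitting loop for one quantization group:
    #   1) W_e <- shrink(W - dequant(W_q))        (robust "error" sub-problem)
    #   2) z   <- mean(W_q - (W - W_e) / s)       (closed-form zero-point update)
    qmin, qmax = 0, 2 ** bits - 1
    s = (W.max() - W.min()) / qmax                # fixed scale for simplicity
    z = -W.min() / s                              # initial zero-point
    for _ in range(iters):
        W_q = torch.clamp(torch.round(W / s + z), qmin, qmax)
        W_r = s * (W_q - z)                       # dequantized weights
        W_e = soft_threshold(W - W_r, lam)        # sub-problem 1
        z = torch.mean(W_q - (W - W_e) / s)       # sub-problem 2
    return s, z

W = torch.randn(64) * 0.02                        # one weight group
s, z = hqq_style_zero_point(W)
print(s.item(), z.item())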
Pros
It achieves surprisingly low quantization time compared to other methods (50x faster than GPTQ!). The elimination of calibration-data requirements also makes it easier to use.
Cons
Not many limitations are mentioned elsewhere. It may still show quality degradation like other methods.
Code Implementation
Install the transformers library and use the HQQ integration straight away!
import torch
from transformers import AutoModelForCausalLM, HqqConfig
# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64, quant_zero=False, quant_scale=False, axis=1)
model_id = "meta-llama/Llama-2-7b-hf"
# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="cuda",
quantization_config=quant_config
)
AQLM (Feb 2024)
AQLM (Additive Quantization of Language Models) is a weight-only PTQ method that sets a new benchmark in the 2-bit-per-parameter range. It outperforms popular algorithms such as GPTQ as well as QuIP and QuIP#.
It applies a technique called Multi-Codebook Quantization (MCQ), which divides each vector into sub-vectors and approximates them using a finite set of codewords. Codewords are learned vectors stored in a codebook [7]. AQLM works by taking the rows of the weight matrices in a model and quantizing them. A toy sketch of the additive codebook idea is shown below.
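The snippet below is a toy sketch of the additive multi-codebook idea, not AQLM itself: each 8-weight sub-vector of a row is approximated by the sum of one codeword per codebook, chosen greedily on the residual. Real AQLM learns the codebooks and codes jointly on calibration data; the random codebooks and sizes here are purely illustrative (two codebooks of 256 codewords per 8 weights correspond to about 2 bits per weight).
import torch

def additive_codebook_quantize(w_row, codebooks, group=8):
    # Split a weight row into sub-vectors of size `group` and approximate each
    # sub-vector as a sum of one codeword per codebook, chosen greedily on the
    # residual. Only the codeword indices need to be stored per sub-vector.
    subvecs = w_row.view(-1, group)                     # (num_groups, group)
    approx = torch.zeros_like(subvecs)
    codes = []
    for cb in codebooks:                                # cb: (codebook_size, group)
        residual = subvecs - approx
        dists = torch.cdist(residual, cb)               # distance to every codeword
        idx = dists.argmin(dim=1)                       # nearest codeword per sub-vector
        codes.append(idx)
        approx = approx + cb[idx]
    return approx.view(-1), torch.stack(codes, dim=1)

torch.manual_seed(0)
w_row = torch.randn(1024)                               # one row of a weight matrix
codebooks = [torch.randn(256, 8) * 0.3 for _ in range(2)]   # 2 codebooks x 256 codewords
w_hat, codes = additive_codebook_quantize(w_row, codebooks)
print((w_row - w_hat).pow(2).mean())                    # reconstruction error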
Pros
AQLM offers fast implementations for token generation on both GPU and CPU, allowing it to surpass the speed of optimized FP16 implementations while operating with a considerably reduced memory footprint.
Cons
Only a few limitations are mentioned elsewhere. It may still show quality degradation like other methods.
Code Implementation
Instructions on how to quantize models yourself, along with the corresponding code, can be found in the official repo. To run AQLM models, load a model that has been quantized with AQLM:
from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")
Quantization methods have opened up a world of possibilities, enabling advanced language processing capabilities even in our pockets. In this article, we discussed LLM quantization and explored various techniques for quantizing LLMs in detail. We also went through the pros and cons of each approach and learned how to use them. Furthermore, we gained insights on how to pick the most suitable technique based on specific requirements and on whether you are running on a CPU or GPU.