Fine-tuning large language models (LLMs) with up to 35B parameters is relatively easy and cheap since it can be done with a single consumer GPU. Fine-tuning larger models with a single consumer GPU is, in theory, not impossible, as we can offload parts of the model to CPU memory. However, it would be extremely slow, even with high-end CPUs.
Using multiple GPUs is the only alternative to keep fine-tuning fast enough. A configuration with 2×24 GB GPUs opens a lot of possibilities: 48 GB of GPU memory is enough to fine-tune 70B models such as Llama 3 70B and Qwen2 72B.
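To see why 48 GB is plausible, here is a rough back-of-the-envelope estimate (my own illustrative sketch, assuming the base weights are quantized to 4 bits as QLoRA does; the real footprint also depends on quantization constants, LoRA adapters, activations, and optimizer states):

```python
# Rough memory estimate for a 4-bit quantized 70B model (illustrative
# assumption: 0.5 bytes per parameter, ignoring quantization overhead).
n_params = 70e9          # Llama 3 70B / Qwen2 72B class models
bytes_per_param = 0.5    # 4-bit quantization, as used by QLoRA

weights_gb = n_params * bytes_per_param / 1024**3
print(f"4-bit base weights: ~{weights_gb:.0f} GB")  # ~33 GB

# Sharded with FSDP across 2x24 GB GPUs, ~33 GB of weights fits in 48 GB,
# leaving headroom for LoRA adapters, optimizer states, and activations.
```

This is only a sketch, but it shows why the 70B scale is roughly the upper bound for a 2×24 GB setup.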
In this article, I explain how to fine-tune 70B LLMs using only two GPUs thanks to FSDP and QLoRA.
I first explain what FSDP is, and then we'll see how to modify a standard QLoRA fine-tuning code to run it on multiple GPUs. For the experiments and demonstrations, I use Llama 3.1 70B, but it would work similarly for other LLMs. For the hardware, I relied on 2 RTX 3090 GPUs provided by RunPod (referral link). Using 2 RTX 4090 GPUs would be faster but more expensive.
I also made a notebook implementing the code described in this article. It is available here: