When fine-tuning giant language fashions (LLMs) regionally, utilizing giant batch sizes is usually impractical as a result of their substantial GPU reminiscence consumption. To beat this limitation, a way known as gradient accumulation is often used to simulate bigger batch sizes. As a substitute of updating the mannequin weights after processing every batch, gradient accumulation includes summing the gradients over a number of smaller mini-batches. The mannequin weights are up to date solely after a predetermined variety of these mini-batches have been processed. This technique successfully mimics coaching with a bigger batch measurement with out the reminiscence overhead usually related to it.
As an example, setting a mini-batch measurement of 1 and accumulating gradients over 32 mini-batches needs to be equal to coaching with a full batch measurement of 32. Nevertheless, I found that gradient accumulation usually leads to considerably degraded efficiency in comparison with coaching with bigger precise batch sizes with in style deep-learning frameworks like Transformers.
After sharing this challenge on X and Reddit, Daniel Han from Unsloth AI replicated the issue. He discovered that it was affecting not solely gradient accumulation but additionally multi-GPU setups. In such…