With a LoRA adapter, we can specialize a large language model (LLM) for a task or a domain. The adapter must be loaded on top of the LLM to be used for inference. For some applications, it might be useful to serve users with multiple adapters. For instance, one adapter could perform function calling and another could perform a very different task, such as classification, translation, or other language generation tasks.
However, to use multiple adapters, a standard inference framework would first have to unload the current adapter and then load the new one. This unload/load sequence can take several seconds, which can degrade the user experience.
Fortunately, there are open source frameworks that can serve multiple adapters at the same time, without any noticeable delay when switching between two different adapters. For instance, vLLM (Apache 2.0 license), one of the most efficient open source inference frameworks, can easily run and serve multiple LoRA adapters simultaneously.
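To give a first idea of what this looks like, here is a minimal sketch of offline inference with vLLM where two different LoRA adapters are used on consecutive requests without any unload/reload step in between. The adapter names and paths are placeholders, not the actual adapters used later in this article.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model with LoRA support enabled.
# max_loras sets how many adapters can be held in memory at the same time.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    max_loras=2,
    max_lora_rank=16,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Placeholder adapter paths: replace with your own function-calling and chat adapters.
function_calling_adapter = LoRARequest("function_calling", 1, "/path/to/function_calling_adapter")
chat_adapter = LoRARequest("chat", 2, "/path/to/chat_adapter")

# Two consecutive requests, each routed to a different adapter.
out_fc = llm.generate(
    ["What is the weather in Paris?"],
    sampling_params,
    lora_request=function_calling_adapter,
)
out_chat = llm.generate(
    ["Tell me a short story about a robot."],
    sampling_params,
    lora_request=chat_adapter,
)

print(out_fc[0].outputs[0].text)
print(out_chat[0].outputs[0].text)
```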
In this article, we’ll see how to use vLLM with multiple LoRA adapters. I explain how to use LoRA adapters with offline inference and how to serve multiple adapters to users for online inference. I use Llama 3 for the examples, with adapters for function calling and chat.
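As a preview of the online case, here is a hedged sketch of querying vLLM's OpenAI-compatible server, where each request selects an adapter by passing its registered name as the model; the adapter names and paths are placeholders under my own assumptions, not the exact ones used in the rest of the article.

```python
from openai import OpenAI

# The vLLM server would be started beforehand, for example with:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
#       --enable-lora \
#       --lora-modules function_calling=/path/to/function_calling_adapter chat=/path/to/chat_adapter
# (adapter names and paths are placeholders)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Select the "chat" adapter by using its registered name as the model.
response = client.chat.completions.create(
    model="chat",
    messages=[{"role": "user", "content": "Tell me a short story about a robot."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```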