LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method introduced in 2021 by Hu and colleagues at Microsoft. Instead of updating all of a large model’s weights, LoRA freezes them and injects a pair of small, trainable low-rank matrices into each layer, whose product approximates the needed weight change.

The idea rests on the observation that weight updates during fine-tuning have a low “intrinsic rank” and can therefore be captured by far fewer parameters. For a large language model like GPT-3 (175 billion parameters), this cuts trainable parameters by up to 10,000× and GPU memory several-fold. After training, the adapter can be merged back into the base weights, so there is no added inference latency.

In 2025-2026, LoRA is the de facto standard for cheaply adapting open models such as Llama, and pairing it with quantization (QLoRA) makes fine-tuning feasible on a single consumer GPU. Small, portable adapters — often under 100 MB — make it the backbone of the model-customization ecosystem.

Sources

See also