AI Notes

FT with LoRA

If you have ever fine-tuned LLMs using LoRA and felt disappointed after merging the adapter with the base model, here's a reason why that might have happened:

👉🏼 Improper r and alpha values. These are the most important hyperparameters, and the right choice depends on the purpose behind your fine-tuning.

Before going further into the details of LoRA, here is how it works in layman's terms (feel free to skip this part).

Let's say you have a weight matrix of 1000x1000 = 1,000,000 weights.

Instead of updating all 1,000,000 weights during backpropagation, we choose a rank; let's say in this case we choose r = 5.

A = matrix[1000x5] B = matrix[5x1000]

Even though the matrix product has the same dimensions as the initial weight matrix, you will only be training (1000x5)+(5x1000) = 10,000 weights, which is 1% of the initial weights.

When merging these weights back into the base model, that's where alpha comes in.

We calculate the scaling with which we have to merge the newly learned weights. The higher the scaling, the more the change in the weights.

scaling = alpha/rank(r)

weight += (lora_B @ lora_A) * scaling
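
A toy sketch of the arithmetic above (following the PEFT convention where lora_A is r x d and lora_B is d x r; the alpha value here is arbitrary, chosen only for illustration):

import torch

d = 1000      # base weight matrix is d x d = 1,000,000 parameters
r = 5         # LoRA rank
alpha = 10    # LoRA alpha (arbitrary value for this example)

W = torch.randn(d, d)               # frozen base weight
lora_A = torch.randn(r, d) * 0.01   # trainable, r x d
lora_B = torch.zeros(d, r)          # trainable, d x r (zero init so the delta starts at 0)

trainable = lora_A.numel() + lora_B.numel()   # 5*1000 + 1000*5 = 10,000, i.e. 1% of 1,000,000

# Merging back into the base model:
scaling = alpha / r
W_merged = W + (lora_B @ lora_A) * scaling    # same d x d shape as W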

👉🏼 Now, moving on to my observations:

Let's say you are fine-tuning a base model for instruction following. A low rank (r) value would be the best bet, as you want to teach the model to respond with an answer to a question rather than continue asking questions. A low r value not only lets the model leverage most of its learning from the pretraining stage, but also keeps it less rigid in its responses.

When instruction fine-tuning a base model, a low rank works well: it teaches the model the format without delving into high-level concepts. It's like playing the "I am the user, and you are the helpful assistant!" game, emphasizing the desired output structure.

But if you are fine-tuning a model for domain adaptation or to embed some sort of knowledge into it, having a higher r value would be ideal as it's going to change a lot more weights.

By a low r value, I mean around 32, and a high r value corresponds to something around 128. The practical upper limit, from what I have observed, is 256.

Now coming to alpha, the scaling factor that determines how strongly the newly learned weights are merged into the base model. Generally, from what I have observed, people suggest setting alpha = 2*r, which makes the newly learned weights "louder", i.e. more prominent than the original model's weights. But this is not always the right choice.

I have linked a blog below where @rasbt conducted multiple experiments with different r and alpha values. They observed peak performance both at r = 256 with alpha = 512, and at r = 256 with alpha = 128, which is a scaling of 0.5. Another point to remember is that alpha is among the few parameters that can be safely lowered after training without significant downsides.

In summary, balancing r and alpha is key. Layer selection and dataset size are also major contributors to a successful fine-tune, and it all comes down to the purpose of your fine-tune.
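
As a reference point, here is a hedged sketch of how these choices might look as PEFT LoraConfig settings (module names assume a Mistral-style model; the exact numbers are illustrative, not prescriptive):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Low rank for instruction-style fine-tuning: teach the format, keep pretrained knowledge
instruct_cfg = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Higher rank for domain adaptation / knowledge injection, here with scaling = alpha/r = 0.5
domain_cfg = LoraConfig(
    r=128, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, instruct_cfg)
peft_model.print_trainable_parameters()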

RAG vs long context window

  • RAG is cheap, long context is expensive. True, but remember: compared to an LLM, BERT-small is also cheap, and an n-gram model is even cheaper, yet they are not used today, because we want the model to be smart first and then make smart models cheaper -- the history of AI tells us it is much easier to make smart models cheaper than to make cheap models smart -- when it is cheap first, it's never smart.

  • Long context can mix retrieval and reasoning throughout the whole decoding process; RAG only does the retrieval at the very beginning. Typically, given a question, RAG retrieves the paragraphs that are related to the question, then generates. Long context does the retrieval at every layer and for every token. In many cases the model needs to do on-the-fly, per-token interleaved retrieval and reasoning, and only knows what to retrieve after getting the result of the first reasoning step. Only long context can handle such cases.

  • RAG supports trillions of tokens; long context is at the 1M level. True, but there is a natural distribution of input documents, and I tend to believe most cases that require retrieval are under the million-token level. For example, imagine a lawyer working on a case whose input is the related legal documents, or a student learning machine learning whose input is three ML books -- that does not feel as long as 1B tokens, right?

  • RAG can be cached, while long context needs to re-read the whole document. This is a common misunderstanding of long context: there is something called the KV cache, and you can design sophisticated caching and memory-hierarchy ML systems around it. That is to say, you only read the input once, and all subsequent queries reuse the KV cache (see the sketch after this list). One may argue that the KV cache is large -- true, but don't worry, we LLM researchers will give you crazy KV cache compression algorithms just in time.

  • You also want to call a search engine, which is also retrieval. True, and in the short term this will continue to be true. Yet there are crazy researchers whose imagination can run wild -- for example, why not let the language model directly attend to the entire Google search index, i.e., let the model absorb the whole of Google? I mean, since you guys believe in AGI, why not?

  • Today's Gemini 1.5 1M context is slow. True, and it definitely needs to be faster. I'm optimistic about this -- it will definitely get much faster, and eventually be as fast as RAG.
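
A minimal sketch of the KV-cache point above, assuming the Hugging Face transformers API (gpt2 stands in for any causal LM; the document and query strings are placeholders): the long input is encoded once, and a later query reuses the cached keys/values instead of re-reading it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Prefill: read the long document once and keep its KV cache
doc_ids = tok("<the long document goes here>", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(doc_ids, use_cache=True)
cache = out.past_key_values

# A later query only encodes its own tokens; attention still sees the cached document
q_ids = tok(" Question: what does the document say about X?", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(q_ids, past_key_values=cache, use_cache=True)
next_token_id = out.logits[:, -1].argmax(dim=-1)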

Models/things to explore

  1. https://huggingface.co/llmware/dragon-mistral-7b-v0
  2. https://huggingface.co/Trelis/Llama-2-7b-chat-hf-function-calling-v2
  3. https://github.com/ggerganov/llama.cpp/discussions/4225
  4. https://outlines-dev.github.io/outlines/examples/chain_of_density/
  5. https://huggingface.co/deepseek-ai 33b coder
  6. https://blog.langchain.dev/extraction-benchmarking/
  7. https://chunkviz.up.railway.app
  8. https://huggingface.co/blog/mixtral
  9. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
  10. https://simonwillison.net/2023/Dec/18/mistral/
  11. https://vgel.me/posts/faster-inference/

FT Mixtral

Fine-tune Mixtral super easily with this one command:

pip uninstall -y transformers && pip uninstall -y flash-attn && pip install flash-attn && pip install git+https://github.com/huggingface/transformers && git clone https://github.com/OpenAccess-AI-Collective/axolotl && cd axolotl && pip3 install -e .[flash-attn] && pip3 install -U git+https://github.com/huggingface/peft.git && pip uninstall -y deepspeed && pip install -U deepspeed && pip install accelerate && wget YAML_LINK_HERE && NCCL_P2P_LEVEL=PIX accelerate launch -m axolotl.cli.train YAML_NAME_HERE --deepspeed deepspeed/zero2.json && wget https://gist.githubusercontent.com/mlabonne/a3542b0519708b8871d0703c938bba9f/raw/9fc2141d2653e83192a97bccf6826b201e8e47cd/merge_peft.py && python merge_peft.py --base_model=mistralai/Mixtral-8x7B-v0.1 --peft_model=./qlora-out --hub_id=REPO_FOR_TRAINED_MODEL

How to use this:

  1. Install Axolotl normally, or use Runpod's template.
  2. Create an Axolotl-compatible training yaml (here is a starting point: https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/examples/mistral/mixtral.yml), upload it as a Github gist, and paste the gist link in place of YAML_LINK_HERE.
  3. Replace YAML_NAME_HERE with the name of the gist/yaml file.
  4. Replace REPO_FOR_TRAINED_MODEL with the name of an empty Hugging Face repo where you want to store the trained model.
  5. Run the command. It'll train, merge, and upload the model all in one go!

Things to note:

  • This handles weird edge cases (see the seemingly random pip uninstalls) and uses deepspeed
  • Can also be used to fine-tune other models
  • Works with QLoRA and LoRA as-is, but removing the merging section of the command allows you to do full fine-tunes
  • There's fluff like NCCL_P2P_LEVEL in here, feel free to modify it to fit your needs
  • Make sure your HuggingFace token is set
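
For reference, a rough Python sketch of what the final merge-and-upload step amounts to (this is an assumption about what the linked merge_peft.py does, not the exact script):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "./qlora-out")   # load the trained adapter
merged = model.merge_and_unload()                        # fold the LoRA weights into the base

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
merged.push_to_hub("REPO_FOR_TRAINED_MODEL")             # requires your HF token to be set
tokenizer.push_to_hub("REPO_FOR_TRAINED_MODEL")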

PyTorch internals

  • Memory abstraction: PyTorch tensor object, storage, allocator
  • IR: intermediate representation by TorchScript (1.0) or TorchDynamo (2.0)
  • Torch Export and Exec
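
A small sketch touching the first two bullets (assumes PyTorch 2.x for torch.compile and untyped_storage): a view is a separate tensor object over the same storage, and the same function can be lowered through either the TorchScript (1.x) or TorchDynamo (2.x) path.

import torch

# Tensor vs. storage: a view is a new tensor object sharing the same underlying storage
t = torch.arange(6.0)
v = t.view(2, 3)
assert v.untyped_storage().data_ptr() == t.untyped_storage().data_ptr()

# Two IR paths for the same function
def f(x):
    return torch.sin(x) + x

scripted = torch.jit.script(f)   # TorchScript IR (1.0-era)
compiled = torch.compile(f)      # TorchDynamo graph capture + backend compile (2.0)

x = torch.randn(4)
print(scripted(x), compiled(x))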

Transformer as a computer (Andrej Karpathy)

The Transformer is a magnificent neural network architecture because it is a general-purpose differentiable computer. It is simultaneously:

  1. expressive (in the forward pass): its message-passing-like architecture is general (i.e. completeness) and powerful (i.e. efficiency), able to cover many real-world algorithms in a small number of compute steps; an empirical finding.

  2. optimizable (via backpropagation + gradient descent): because of residual connections, layer normalizations, and softmax attention, and the absence of any flat tails. Residual connections support a kind of ability to learn short algorithms (think low LOC) fast and first, then gradually extend them longer during training.

  3. efficient (high-parallelism compute graph): the compute graph is shallow and wide, mapping significantly better to our high-parallelism compute architectures (think GPUs). An earlier attempt that understood the significance of this property and optimized for it was the Neural GPU paper (https://arxiv.org/abs/1511.08228)

Its success lies in a single architecture that simultaneously satisfies all of these properties. The original Attention Is All You Need paper is a bit haphazard and undersells the magnitude of these insights, their history and motivations. But there's a lot going on :)

So I probably would have called the paper something like "Transformer: A general-purpose, efficient, optimizable computer" and presented it alongside the Neural Turing Machine, NeuralGPU and friends, then applied it to translation as an example. Something like that, but ok :)

A few people have (correctly) pointed out the hindsight here, which is fair. I don't suspect the authors could have known that 5 years later the architecture would take over most of AI ~unchanged, except for a re-shuffling of layernorms. Calls for a followup paper :)

Transformer converged (Andrej Karpathy)

The ongoing consolidation in AI is incredible. When I started ~a decade ago, vision, speech, natural language, reinforcement learning, etc. were completely separate; you couldn't read papers across areas - the approaches were completely different, often not even ML-based.

In the 2010s, all of these areas started to transition 1) to machine learning and specifically 2) to neural nets. The architectures were diverse, but at least the papers started to read more similarly, all of them utilizing large datasets and optimizing neural nets.

But as of approximately the last two years, even the neural net architectures across all areas are starting to look identical - a Transformer (definable in ~200 lines of PyTorch: https://github.com/karpathy/minGPT/blob/master/mingpt/model.py), with very minor differences. Either as a strong baseline or (often) the state of the art.

You can feed it sequences of words. Or sequences of image patches. Or sequences of speech pieces. Or sequences of (state, action, reward) in reinforcement learning. You can throw in arbitrary other tokens into the conditioning set - an extremely simple/flexible modeling framework

Even within areas (like vision), there used to be some differences in how you do classification, segmentation, detection, generation, but all of these are also being converted to the same framework. E.g. for detection take sequence of patches, output sequence of bounding boxes.

The distinguishing features now mostly include 1) the data, and 2) the Input/Output spec that maps your problem into and out of a sequence of vectors, and sometimes 3) the type of positional encoder and problem-specific structured sparsity pattern in the attention mask.

So even though I'm technically in vision, papers, people and ideas across all of AI are suddenly extremely relevant. Everyone is working with essentially the same model, so most improvements and ideas can "copy paste" rapidly across all of AI.

As many others have noticed and pointed out, the neocortex has a highly uniform architecture too across all of its input modalities. Perhaps nature has stumbled upon a very similar powerful architecture and replicated it in a similar fashion, varying only some of the details.

This consolidation in architecture will in turn focus and concentrate software, hardware, and infrastructure, further speeding up progress across AI. Maybe this should have been a blog post. Anyway, exciting times.

AI big ideas (LeCun)

  1. Self-Supervised Learning
  2. ResNets (not intellectually deep, but useful)
  3. Gating -> Attention -> Dynamic connection graphs.
  4. Differentiable memory.
  5. Permutation-equivariant modules, e.g. multihead self-attention -> Transformers.

I should say that #3 includes graph neural nets, which I see as a major conceptual advance (albeit somewhat subsumed by transformers).

GANs are still an interesting concept. But in terms of representation learning (which is what I'm primarily interested in), they have been a complete failure. In fact, I've been arguing against generative models and in favor of joint embedding architectures recently.

What is differentiable memory? What's the best reference for this?

It's the basic circuit used in Transformers.
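
A minimal sketch of that circuit: a soft key-value lookup where the read is differentiable end to end (shapes and values are arbitrary).

import torch
import torch.nn.functional as F

d = 16
memory_keys = torch.randn(8, d)      # 8 stored slots
memory_values = torch.randn(8, d)
query = torch.randn(d, requires_grad=True)

# A hard memory picks one slot; the softmax picks a weighted blend, so the read is differentiable
weights = F.softmax(memory_keys @ query / d**0.5, dim=0)
read = weights @ memory_values       # the "memory read" used inside attention
read.sum().backward()                # gradients flow back to the query (and to keys/values if trainable)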