
It's also worth mentioning that the original implementation by Meta is only 300 lines of very readable code [1].

[1]: https://github.com/meta-llama/llama3/blob/main/llama/model.p...



On line 59, there is a less-than-or-equals comparison between 0 and 1. Curious https://github.com/meta-llama/llama3/blob/main/llama/model.p...


I am a reasonably competent python coder, yet when I see stuff like this I regard it with the same suspicion as a switch in the "more magic" position.

https://www.catb.org/jargon/html/magic-story.html


What's the operator precedence in python?

Is it `assert(0 <= (1 < ndim))` or `assert((0 <= 1) < ndim)`, or something even stranger like `assert(0 <= 1) < ndim`?


Python actually does something pretty neat: it chains comparisons, so `x < y <= z` is like `x < y and y <= z` except that y is only evaluated once.

In the linked code we can be confident that `0 <= 1`, so only `1 < ndim` should matter. In fact, I'd expect the peephole optimizer to remove most of the code for `0 <= 1`.
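A minimal illustration of the chaining rule (ndim here is just a stand-in value):

    ndim = 4

    # The chained form from the linked code:
    assert 0 <= 1 < ndim            # means (0 <= 1) and (1 < ndim)
    assert (0 <= 1) and (1 < ndim)  # equivalent, middle operand written twice

    # NOT the same as comparing a boolean to ndim:
    print((0 <= 1) < ndim)          # True < 4 -> 1 < 4 -> True, but only by accident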


I did not know that python fact. Thanks for sharing!


So is it the case that the information is in the data set? Or is the code so well designed that it can be this small? As an outsider, it's surprising that such a capable model can be so "simple".


The training code is presumably quite a bit more complex than what they've open sourced, but part of the beauty of the GPT-based LLMs is their structural simplicity.

Now, that simplicity can be deceiving - there is a lot of conceptual interconnectedness within these models. They've been put together "just so", if you will.

If you look at the source code to nanoGPT and compare it to Llama3, the most remarkable thing (when you look past the superficial name changes) is just how similar they are.

If I recall correctly the primary differences are:

  - The MLP: Llama3 uses SwiGLU vs the more "traditional" x = x + proj(gelu(expand(x))) in GPT2 (see the sketch after this list)
  - The token encoders, which are arguably external to the model
  - Attention: Llama3 uses Grouped Query Attention, vs full Multi-Head Attention in GPT2
  - Normalization: Llama3 uses RMSNorm, vs LayerNorm for GPT2
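To make the first point concrete, here's a rough PyTorch sketch of the two MLP styles (residual connections and the exact hidden-dim sizing are omitted; the w1/w2/w3 names follow the Llama code, the rest are just illustrative):

    import torch.nn as nn
    import torch.nn.functional as F

    class GPT2MLP(nn.Module):
        # GPT2-style: expand, GELU, project back down
        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.expand = nn.Linear(dim, hidden_dim)
            self.proj = nn.Linear(hidden_dim, dim)

        def forward(self, x):
            return self.proj(F.gelu(self.expand(x)))

    class SwiGLUMLP(nn.Module):
        # Llama-style: a gated unit, silu(w1(x)) * w3(x), projected by w2
        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.w1 = nn.Linear(dim, hidden_dim, bias=False)
            self.w2 = nn.Linear(hidden_dim, dim, bias=False)
            self.w3 = nn.Linear(dim, hidden_dim, bias=False)

        def forward(self, x):
            return self.w2(F.silu(self.w1(x)) * self.w3(x))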
They were published more than five years apart. On the one hand progress has been breathtaking, truly astounding. On the other hand, it's almost exactly the same model.

Goes to show just how much is in the training data.


> Goes to show just how much is in the training data.

And in the scale (num_layers, embed_dim, num_heads) of the model of course ;)
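Roughly, for a sense of that scale (numbers quoted from memory of the published configs, so treat them as approximate):

    gpt2_small = dict(n_layers=12, embed_dim=768, n_heads=12)                 # ~124M params
    llama3_8b  = dict(n_layers=32, embed_dim=4096, n_heads=32, n_kv_heads=8)  # ~8B params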


> beauty of the GPT-based LLMs is their structural simplicity

The human brain's structure is also encoded in a short DNA sequence.


Forgot one: the positional encoding also changed; Llama3 uses RoPE, GPT2 uses a learned embedding.
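Roughly, RoPE rotates each (query, key) channel pair by a position-dependent angle instead of adding a learned vector to the token embedding. A simplified sketch in the complex-number form the Llama code uses (not the exact repo code; shapes and caching are simplified):

    import torch

    def precompute_freqs_cis(head_dim, seq_len, theta=10000.0):
        # One rotation frequency per pair of channels, as in the RoPE paper.
        freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
        angles = torch.outer(torch.arange(seq_len).float(), freqs)   # (seq_len, head_dim/2)
        return torch.polar(torch.ones_like(angles), angles)          # complex e^(i*m*theta_k)

    def apply_rope(x, freqs_cis):
        # x: (batch, seq_len, n_heads, head_dim); view channel pairs as complex numbers
        x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
        x_c = x_c * freqs_cis[:, None, :]                            # rotate by position
        return torch.view_as_real(x_c).flatten(-2).type_as(x)

This gets applied to q and k right before the attention scores are computed; GPT2 instead just adds a learned wpe[position] vector to the token embedding once at the bottom of the stack.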


I think with LLMs in general, the algorithms are very refined and require lots of research, despite being "simple" in terms of entropy, or an imagined Kolmogorov complexity for defining algorithms.

So "simple" is a fuzzy term here, but yes, the entropic complexity is in the data, not the algorithms.

Related to the so-called "Bitter lesson".

Edit: the sister comment pointed out what I failed to express: RLHF and training are also algorithms, and their applications and implementations are probably much more complex than the code that evaluates a given prompt.

So basically, "models" (trained NNs) are also an example for the equivalence of code and data.

Fixed data used by code (the trained model) is code in itself, even when it is not directly written by humans or in a human-readable language.

Edit edit: don't forget to count the imported maths code :) but I assume this is not relevant to the "it's just matrix multiplications" overall argument


300 lines of this code are a bit different from 300 lines of typical code where you read files, set up a backend/frontend, or parse data. In the latter case, there are a lot of tedious operations. Sure, the former has some of that too, with reshaping and asserts and whatnot.

But in a sense, the 300 lines of Llama code are essentially just lines of math. And reading through any math proof will show you that any particular line can hide large amounts of complexity.

This can be true with code with more tedious operations, but those lines are a smaller fraction of the overall code base by definition.

Even the "tedious" parts of the llama code can hide large complexity. Setting a learning rate with a schedule might require reading a paper or two for your particular architecture.
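For example, the common warmup-then-cosine-decay schedule is only a few lines once you know what you want, but knowing you want it (and what the constants should be) is the hard part. A sketch with made-up constants:

    import math

    def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
        # Linear warmup, then cosine decay down to min_lr. Constants are illustrative.
        if step < warmup_steps:
            return max_lr * (step + 1) / warmup_steps
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))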

But yes, once you parse all the math and the theory, the lines are mostly just simple matmuls and a forward pass.


Sure, knowing the basics of LLM math is necessary. But it's also _enough_ to know this math to fully grasp the code. There are only 4 concepts - attention, feed-forward net, RMS-normalization and rotary embeddings - organized into a clear structure.
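For example, RMS-normalization, the shortest of the four, is only a handful of lines in the usual formulation (a sketch, not Meta's exact code):

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        # Scale each vector by 1/rms(x), then apply a learned per-channel gain.
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x):
            rms_inv = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
            return x * rms_inv * self.weight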

Now compare it to the Hugging Face implementation [1]. In addition to the aforementioned concepts, you need to understand the hierarchy of `PreTrainedModel`s, 3 types of attention, 3 types of rotary embeddings, HF's definition of the attention mask (which is not the same as the mask you read about in transformer tutorials), several types of cache class, dozens of flags to control things like output format or serialization, etc.

It's not that Meta's implementation is good and HF's implementation is bad - they pursue different goals in their own optimal way. But if you just want to learn how the model works, Meta's code base is great.

[1]: https://github.com/huggingface/transformers/blob/main/src/tr...


The numpy code can seem more accessible and easy to understand. Torch can look scary even though it's similar to numpy.


Do you know why these are so short? What is the algorithm/magic in all of these?

I tried to make sense of it but cannot


Architecturally, LLMs are very simple compared to many software projects.

The crux of their behavior comes from their learned weights which are gigabytes and can cost millions to obtain via training.


The magic is in the billions of learned weights (~synapses). This is just the scaffolding that runs them.


The magic is the structure of the model, and the real magic is the billions of weights.


Why is max_seq_len set to 2048 [1] when the model card says the context size is 8k [2]?

[1] https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...

[2] https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...


That's just the default. You can set max_seq_len to 8k. From the readme [0]:

> All models support sequence length up to 8192 tokens, but we pre-allocate the cache according to max_seq_len and max_batch_size values. So set those according to your hardware.

[0] https://github.com/meta-llama/llama3/tree/14aab0428d3ec3a959...
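Concretely, something like this (parameter names follow the repo's example scripts; the paths are placeholders):

    from llama import Llama

    # The KV cache is pre-allocated for max_seq_len * max_batch_size, so size these to your GPU.
    generator = Llama.build(
        ckpt_dir="Meta-Llama-3-8B-Instruct/",                       # placeholder path
        tokenizer_path="Meta-Llama-3-8B-Instruct/tokenizer.model",  # placeholder path
        max_seq_len=8192,
        max_batch_size=4,
    )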


The simplicity of the transformer is quite refreshing, especially in vision, where the Vision Transformer with linear patch encodings replaces complex intertwined decisions about filter size, striding, pooling, #filters, depth, etc., with the simpler decision of how to allocate your FLOPs between dimensionality, #heads, and #layers.
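E.g. the whole "patch encoder" is just one linear map over flattened patches (a sketch; the patch size and dims are arbitrary here, and real implementations often express the same thing as a strided Conv2d):

    import torch.nn as nn

    class PatchEmbed(nn.Module):
        # Split the image into non-overlapping patches and linearly project each one.
        def __init__(self, patch=16, in_ch=3, dim=768):
            super().__init__()
            self.patch = patch
            self.proj = nn.Linear(patch * patch * in_ch, dim)

        def forward(self, x):                      # x: (B, C, H, W)
            B, C, H, W = x.shape
            p = self.patch
            x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
            x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
            return self.proj(x)                    # (B, num_patches, dim)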



