That isn't a very good example. The vector for each token is randomly initialized, with each element drawn from a normal distribution. After training, similar words will end up with noticeable cosine similarity, but almost never as much as [1,2,3,4] and [2,3,4,5], which are nearly parallel (cosine similarity ≈ 0.99).
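A quick sketch with NumPy to make the point concrete (the 300-dim size is just an arbitrary illustrative choice): the example pair scores ~0.99, while two freshly initialized random vectors in high dimensions come out nearly orthogonal.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The example pair from above: nearly parallel
print(cosine(np.array([1, 2, 3, 4]), np.array([2, 3, 4, 5])))  # ~0.9939

# Two randomly initialized embeddings, elements drawn from N(0, 1);
# in high dimensions their cosine similarity is close to 0
rng = np.random.default_rng(0)
a, b = rng.normal(size=300), rng.normal(size=300)
print(cosine(a, b))  # near 0
```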