A lot about this project was surprising. We knew it was going to be good, but didn't expect it to be this good -- especially surprising was the finetuned performance boost, and the fact that the model is decent at language tasks and reasoning (in some cases much better than much larger general-purpose models).
It feels like there is a lot more to do with this model, and I have a suspicion you can even make a half-decent chatbot (at least one focused on code) by finetuning it on conversation (and/or instruction) datasets.
Will follow up with a more comprehensive technical report and the UL2R version (fill-in-the-middle support).
First - thank you for open sourcing this! It's a real gift to the community to have a model intended for "commercial use" that's actually licensed as such.
I'd be very interested to hear about the choice/evaluation of the ALiBi approach for positional embedding (perhaps in the technical report).
My intuition suggests that while this allows for better generalizability for longer sequence lengths, it penalizes scenarios where an LLM might need to check for things like a function signature far away from where the next token is generated. My initial testing of this model tracks with this intuition but that's by no means a rigorous evaluation.
While intuitively it does seem like ALiBi would make it hard for the model to attend to things that are far away, in many scenarios we've tested with different models trained on different datasets, ALiBi always performs better than sinusoidal, rotary, and other embedding types, even when we're not using it to extrapolate to longer sequence lengths.
These findings have been confirmed by others, including by the BLOOM open source LM project.
Thanks for the link (which I've now skimmed beyond the abstract). What wasn't obvious to me from the abstract is that different attention heads have different penalty strengths, so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing. I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)
I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.
> so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing
Exactly. You have heads that focus on content nearby and ones that focus on stuff that is far away.
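To make that concrete, here's a rough sketch of the bias added to the attention scores (illustrative code, not our exact implementation); each head's slope comes from a geometric sequence, so small-slope heads can still attend far away while large-slope heads stay local:

import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Sketch of the per-head ALiBi bias added to the pre-softmax attention logits.
    # For a power-of-two number of heads, head h gets slope 2^(-8/num_heads * (h+1));
    # a zero slope would mean no distance penalty at all for that head.
    slopes = torch.tensor([2 ** (-8.0 / num_heads * (h + 1)) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)  # how far each key is behind the query
    return -slopes[:, None, None] * distance                # shape (num_heads, seq_len, seq_len)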
> I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)
Yup, this is something we tried. Making the penalty for one of the heads zero doesn't improve or degrade performance.
> I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.
Impressive model, thank you for releasing it under a business-friendly license!
Have you considered using Google's sparse "scaling transformer" architecture as the base? Even at 3B scale it can generate 3-4x more tokens per FLOP while being competitive at perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.
It is a guess informed by some familiarity with the literature and by going over the papers authored by researchers credited on OpenAI's "GPT-4 contributors" web page.
Wow! I sincerely wonder how all those folks manage to do business in the tech industry without ever touching Linux, Git, Bash, GCC, glibc, WordPress, Ansible, Grafana, MongoDB, 7-Zip, Vim, Emacs, Firefox, Thunderbird, StackOverflow, Wikipedia, most web fonts, most ad blockers, and all the rest!
What does "fine tuning" mean in this context? Does it mean you fine-tuned it on a specific code repository, or collection of code repositories and then had it do work in those repositories?
Broadly, finetuning is any training done after pretraining. Most of the time it is used in the context of fitting a narrower task. In our case, it was the same training objective as the pretraining, but on data meant to be more representative of what Replit users like to code. However, we were surprised by how well it boosted overall performance. Best guess: a) it's novel data and b) the model could take even more training!!
How feasible and effective would it be to fine-tune a model against an organization's private source code, resulting in an "internal" model that knows how to work with that org's stuff?
Could you, say, fine-tune the model every week with the latest merges? Every hour?
Finetuning is a relatively quick process. Training the base model is the expensive part (it can take weeks and huge amounts of compute), whereas finetuning usually touches only the last few layers and can be done with far fewer resources. You could definitely have a "nightly" finetuned model that is retrained every day or so.
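As a rough sketch of what that looks like (the checkpoint name is our HuggingFace one; the block names and indices are illustrative, so check model.named_parameters() for whatever model you actually load):

import torch
from transformers import AutoModelForCausalLM

# Sketch only: freeze everything except the last few transformer blocks, then keep
# training with the usual causal-LM (next-token prediction) objective on your own code.
model = AutoModelForCausalLM.from_pretrained(
    "replit/replit-code-v1-3b", trust_remote_code=True
)

LAST_BLOCKS = ("blocks.29.", "blocks.30.", "blocks.31.")  # illustrative module names
for name, param in model.named_parameters():
    param.requires_grad = any(b in name for b in LAST_BLOCKS)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
# ...then run a normal training loop over the new code.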
Interesting - how would that work for a company that wanted to run their own codex model, on-prem, trained on their own code? Perhaps also trained on their dependencies?
Finetuning a smaller model leading to better performance seems like a significant finding that'll lead to a lot of companies fine-tuning their own internal "ChatGPT"s
You seem to know your stuff some, so I'll ask you a question on this: Are there any good books on all the different approaches in this space, or is it all too new and fast moving for such a thing?
There are no books on large LMs yet, but almost any resource about neural networks covers fine-tuning. I like the FastAI courses, and these do cover language models.
When you fine-tune it, do you train just the head/last few layers or do you also unfreeze the model afterwards and retrain the whole model with a very small LR for a few epochs?
You can take a network and its weights that someone else trained, and use that pretrained network to train on your own data, which is likely to be a better starting point than random weights.
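A minimal sketch of the two regimes asked about above (train just the new head first, then unfreeze everything at a much smaller learning rate); the backbone here is a stand-in, not the Replit model:

import torch
from torch import nn

# Stand-in "pretrained" backbone plus a new task head.
backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
head = nn.Linear(512, 10)

# Stage 1: freeze the pretrained weights and train only the head.
for p in backbone.parameters():
    p.requires_grad = False
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

# ...train for a while...

# Stage 2: unfreeze everything and fine-tune end to end with a much smaller LR.
for p in backbone.parameters():
    p.requires_grad = True
opt = torch.optim.AdamW(list(backbone.parameters()) + list(head.parameters()), lr=1e-5)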
The base model checkpoint is licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.
Can't find it now, but I'm pretty sure BigCode said somewhere they explicitly looked for it and removed it. Also, the subjective measure does match up with the benchmark: our finetuned model improved +50% on HumanEval, and when using it, it felt at least that much improved.
You can view the prompts, solutions, and checks here[0]. See my sibling comment (to yours) where I quote the HumanEval paper and do some more analysis. But I think if you look at [0] you'll see that these aren't really unique problems and are likely to have large repetitions in the dataset. I should add to that comment to include the dataset[1] (too late to edit), where they mention that they just scrape all of GitHub (Jan 1 2015 - Mar 31 2022). They do exact and near de-duplication, but near de-duplication is messy.
> We implement near-deduplication in our pre-processing pipeline on top of exact deduplication. We first split the files into words/tokens based on non-alphanumeric characters and remove files with fewer than 10 tokens. Next, we compute the MinHash with 256 permutations of all documents, and use Locality Sensitive Hashing to find clusters of duplicates. We further reduce these clusters by ensuring that each file in the original cluster is similar to at least one other file in the reduced cluster. We consider two files similar when their Jaccard similarity exceeds 0.85.
Near-duplicates are still difficult to measure. So we should expect duplication, and we should expect it to be proportional to the number of samples we have (even if the variance is the same, though I'd wager the variance is higher with larger duplications).
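For anyone who hasn't seen the technique, here's roughly what that MinHash + LSH pipeline looks like with the datasketch library (my own sketch, not BigCode's actual code; the file contents are made up):

import re
from datasketch import MinHash, MinHashLSH  # third-party library

def minhash_of(text, num_perm=256):
    # Split on non-alphanumeric characters, then hash the token set.
    m = MinHash(num_perm=num_perm)
    for token in set(re.split(r"[^0-9A-Za-z]+", text)):
        if token:
            m.update(token.encode("utf-8"))
    return m

files = {
    "a.py": "def mean_absolute_deviation(numbers): ...",
    "b.py": "def mean_absolute_deviation(values): ...",
}
sketches = {name: minhash_of(code) for name, code in files.items()}

lsh = MinHashLSH(threshold=0.85, num_perm=256)  # the 0.85 Jaccard threshold from the quote
for name, mh in sketches.items():
    lsh.insert(name, mh)

print(lsh.query(sketches["a.py"]))  # candidate near-duplicates of a.py (includes itself)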
> It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.
So to answer your question: yes, the evaluation dataset is spoiled. You can find such unique, never-before-seen docstrings like
> For a given list of input numbers calculate the Mean Absolute Deviation around the mean of this dataset. Mean Absolute Deviation is the absolute difference between each element and a centerpoint (mean in this case)[1]
And here's a repo I found that is 8 years old[2]. But how about a more recent one that is even closer?[3] There are plenty more examples[4] (does anyone know how to actually limit the date to prior to 2021? `pushed:<2021` doesn't work, nor does using the `created` keyword. Date searching doesn't seem to work well).
In essence, we can still use this evaluation method to determine how good our model is at fuzzy searching. Which, mind you, is still a useful thing. But I would be careful about concluding that this means the model is good at generalizing to arbitrary descriptions of code or to novel pieces of code. That said, one may be able to argue that not many lines of code are actually that novel. Still, we need to be careful about our conclusions and understand the limitations of our metrics (something I am currently deeply troubled by).
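For reference, the kind of few-liner that docstring is asking for is roughly this (my own sketch of an obvious solution):

from typing import List

def mean_absolute_deviation(numbers: List[float]) -> float:
    # Average absolute distance of each element from the mean.
    mean = sum(numbers) / len(numbers)
    return sum(abs(x - mean) for x in numbers) / len(numbers)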
(follow-up: Figured this should be a different comment)
I wanted to demonstrate what I said above, so I came up with some examples of things I think a human would have an easy time implementing but the model might not. BUT a key part is that I expect these to be in the dataset! I just don't expect them to be in hundreds or thousands of GitHub repos, because they will be uncommon (but not rare). Also, we'll pretty much ask for few-liners to give the model the biggest advantage we can (errors compound).
Prompt:
from torch import nn
class LipSwish(nn.Module):
""""
The Swish activation function is defined by a gated linear unit,
where the gate is defined by a sigmoid function and multiplies the input with
a learnable parameter, beta. Beta is initialized as 0.5.
The Lipswish function normalizes the output by the upper bound of 1.1.
""""
def __init__(self:
super().__init__()
Result: Mostly correct but missing the division by 1.1. The forward is `return x * F.sigmoid(self.beta * x)`, which is Swish (it also assumes we had "import torch" and applied type hinting). It did properly set the parameter (this is just a 3 liner)
Discussion: The Swish function should be in the dataset and is a well-known activation function (though beta is not in the PyTorch version). Despite LipSwish being in the dataset (it was introduced in 2019 in Residual Flows[0]), it is not common. I could get the model to generate the Swish function (initializing beta and performing the gate) but could not get it to divide the output by 1.1. I would not expect a human to have difficulties with this.
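For comparison, the completion I was hoping for is only a couple of lines (my own sketch):

import torch
from torch import nn

class LipSwish(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(0.5))  # learnable gate parameter, init 0.5

    def forward(self, x):
        # Swish gated by beta, normalized by the 1.1 upper bound.
        return x * torch.sigmoid(self.beta * x) / 1.1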
Okay, so let's try something else that might be a bit more common and older. The same paper uses a concatenated activation function, and those aren't uncommon. CReLU was introduced in 2016[1] and there have been plenty of concatenated activations since then. The PyTorch documentation even uses it as an example[2]. There are far more examples of CReLU (3k Python results for "class CReLU" vs 58 for "class LipSwish"; use these numbers as weak hints because search sucks and isn't always accurate).
Prompt:
from torch import nn
from torch.nn import functional as F
class CReLU(nn.Module):
""""
Concatenated version of ReLU. The activation is applied to both the positive and
negative of our input and the result is concatenated.
Result: `return torch.cat([x.clamp(min=0), -x.clamp(min=0)], 1)`. This is correct but not the expected one-liner result.
Discussion: This was a bit surprising; it didn't use functional as we might expect (or as hinted). But interestingly it will if we change the class name to "ConcatenatedReLU". I found exact copies on GitHub with the full name (memorization), but the first page of instances for CReLU I found used functional (I did find one that was exactly the above code, when adding "clamp" to the search, but missing the minus sign; there were plenty of errors in CReLU implementations). Interesting side note: CReLU continues and defines a function CReLU6 which uses the same docstring but clamps the positive input with a max of 6, whereas ConcatenatedReLU starts to define a convolutional block (Conv + BatchNorm + ReLU) called Conv2d.
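For reference, the one-liner I had in mind was something like this (again, my own sketch):

import torch
from torch import nn
from torch.nn import functional as F

class CReLU(nn.Module):
    def forward(self, x):
        # Apply ReLU to the input and its negation, then concatenate along the channel dim.
        return torch.cat([F.relu(x), F.relu(-x)], dim=1)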
So we have kinda mixed results, and in both cases the outputs are rather odd and probably not what we wanted. We can clearly see that there are issues where a human would not have too much trouble. There's a big tension in these types of problems: the model needs to memorize a lot of information (otherwise it can't write code or know library calls), but too much memorization prevents creativity. There is a lot of gray area between the _pure_ "Stochastic Parrot"/"fancy copy machine" and a generalized intelligence (with a broad and flexible definition of intelligence). I'd still call them stochastic parrots, because to me the evidence suggests that we're closer to the memorization side than the creation side.

But that doesn't mean these frameworks aren't useful. We all know a lot of code is boilerplate (otherwise we wouldn't have the joke "copy paste from SO"), and these tools can be very useful for that. I think the utility is going to depend heavily on what you are coding for and how you code. If you're doing standard stuff, this probably has high utility for you and can save you a lot of time, the same way writing macros does, but this is FAR more powerful. It can also help novices a lot. On the other hand, if your main errors are reading mistakes (e.g. you're dyslexic) -- this is my largest problem -- then this might make things difficult, as you have a tendency to gloss over text and miss minor errors. I also don't think these tools will help much if you're a researcher or writing optimized or specialized code.

These differences are probably why we see such different reactions. But they may also be a hint about what people do and how they work, when we see who raves and who rants about these tools.
Edit: We can also check if code is in The Stack[3]. We see that [0] is indeed in the dataset, so we know there is information leakage. Interestingly, the exact copy I found in the previous comment[4] isn't! (The repo, that is, though the user is.)
1- We trained on languages that are most popular on Replit. Markdown is important because you need some amount of natural language in the data, and it will act as a sort of "natural language label" for code.
2- I like how portable it is: a single small model handling a lot of languages. Single-language code models are an approach that models like Salesforce/CodeGen took, but I believe we beat (or get very close to) their mono models on benchmarks.
Have you thought of finding or creating something like this [0]?
I created this as the basis for my origami folding descriptive language. I tried to find something similar, the requirements being that it be both well structured and English-like, but couldn't find anything, so I created it.
The origami folding app will hopefully be out in 2 weeks, so you can see how it's used.
- Repo: https://github.com/replit/ReplitLM/tree/main/replit-code-v1-...
- HuggingFace: https://huggingface.co/replit/replit-code-v1-3b
- Demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo
- Early benchmark results: https://twitter.com/amasad/status/1651019556423598081
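A minimal way to try the HuggingFace checkpoint locally (a sketch; the prompt and generation settings are just illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))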