
Some links:

- Repo: https://github.com/replit/ReplitLM/tree/main/replit-code-v1-...

- HuggingFace: https://huggingface.co/replit/replit-code-v1-3b

- Demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo

- Early benchmark results: https://twitter.com/amasad/status/1651019556423598081

A lot about this project was surprising. We knew it was going to be good, but we didn't expect it to be this good -- especially surprising were the finetuning performance boost and the fact that the model is decent at language tasks and reasoning (in some cases much better than much larger general-purpose models).

It feels like there is a lot more to do with this model, and I have a suspicion you can even make a half-decent chatbot (at least one focused on code) by finetuning it on conversation (and/or instruction) datasets.

Will follow up with a more comprehensive technical report and the UL2R version (fill-in-the-middle support).




First - thank you for open sourcing this! It's a real gift to the community to have a model intended for "commercial use" that's actually licensed as such.

I'd be very interested to hear about the choice/evaluation of the ALiBi approach for positional embedding (perhaps in the technical report).

My intuition suggests that while this allows for better generalizability for longer sequence lengths, it penalizes scenarios where an LLM might need to check for things like a function signature far away from where the next token is generated. My initial testing of this model tracks with this intuition but that's by no means a rigorous evaluation.


(I wrote ALiBi.) You can read the paper here: https://arxiv.org/abs/2108.12409

While intuitively it does seem like ALiBi would make it hard for the model to attend to things that are far away, in many scenarios we've tested with different models trained on different datasets, ALiBi always performs better than sinusoidal, rotary, and other embedding types, even when we're not using it to extrapolate to longer sequence lengths.

These findings have been confirmed by others, including by the BLOOM open source LM project.


Small world!

Thanks for the link (which I've now skimmed beyond the abstract). What wasn't obvious to me from the abstract is that different attention heads have different penalty strengths, so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing. I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.


> so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing

Exactly. You have heads that focus on content nearby and ones that focus on stuff that is far away.

> I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

Yup, this is something we tried. Setting the penalty for one of the heads to zero doesn't improve or degrade performance.
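
For anyone curious what that looks like concretely, here is a minimal sketch of the per-head linear biases from the paper. The slopes follow the geometric sequence used for power-of-two head counts; this is illustrative PyTorch, not the exact training code:

    import torch

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # Head h gets slope m_h = 2 ** (-8 * (h + 1) / n_heads), so the first heads are
        # penalized strongly for attending far away while the last heads barely are.
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
        pos = torch.arange(seq_len)
        distance = (pos[:, None] - pos[None, :]).clamp(min=0)  # D[i, j] = i - j for keys at or before the query
        return -slopes[:, None, None] * distance                # shape (n_heads, seq_len, seq_len)

    # the bias is simply added to the attention logits before the softmax, e.g.:
    # scores = q @ k.transpose(-1, -2) / math.sqrt(d_head) + alibi_bias(n_heads, seq_len)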

> I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.

Thanks so much!!


Impressive model, thank you for releasing it under a business-friendly license!

Have you considered using Google's sparse "scaling transformer" architecture as the base? Even at 3B scale it can generate 3-4x more tokens per FLOP while being competitive at perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.

Here is the paper https://arxiv.org/abs/2111.12763 and the implementation https://github.com/google/trax/blob/master/trax/models/resea... if you are interested.

Hope you get to look into this!


Thank you for releasing the weights along with the announcement, unlike the posts that make great headlines but then say "weights are on their way!"

Like, why did we even get excited about those? This, though? Great work.


> I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.

is that a guess or is there a source? im curious to read more


It is a guess informed by some familiarity with the literature and by going over the papers authored by researchers credited on OpenAI's "GPT-4 contributors" web page.

I have an expanded list of foundational research that is likely to serve as a basis for GPT-4 here on my blog: https://kir-gadjello.github.io/posts/gpt4-some-technical-hyp...

Hope it helps!


Interesting resource. I had been wondering whether anyone had tried to compile such a list.


thank you! glad i asked


I don't think it's a business friendly license?


It allows for modifications and commercial use: https://creativecommons.org/licenses/by-sa/4.0/

>You are free to:

>Share — copy and redistribute the material in any medium or format

>Adapt — remix, transform, and build upon the material

>for any purpose, even commercially.

Compare this to the latest release from StabilityAI lab DeepFloyd, "IF", which in addition to various restrictive clauses strictly prohibits commercial use: https://github.com/deep-floyd/IF/blob/develop/LICENSE-MODEL

Repl.it's release is as open as it gets these days, in my book.


It's a copyleft license, and lots of folks on HN seem to think that copyleft, while open, isn't business friendly.


Wow! I sincerely wonder how all those folks manage to do business in the tech industry without ever touching Linux, Git, Bash, GCC, glibc, WordPress, Ansible, Grafana, MongoDB, 7-Zip, Vim, Emacs, Firefox, Thunderbird, StackOverflow, Wikipedia, most web fonts, most ad blockers, and all the rest!


What does "fine tuning" mean in this context? Does it mean you fine-tuned it on a specific code repository, or collection of code repositories and then had it do work in those repositories?


Broadly, finetuning is any training done after pretraining. Most of the time it is used to fit a narrower task. In our case, it used the same training objective as the pretraining, but on data meant to be more representative of what Replit users like to code. However, we were surprised by how much it boosted overall performance. Best guess: it's a) novel data and b) the model could take even more training!!


How feasible and effective would it be to fine-tune a model against an organization's private source code, resulting in an "internal" model that knows how to work with that org's stuff?

Could you, say, fine-tune the model every week with the latest merges? Every hour?


Finetuning is a relatively quick process. Training the base model is the expensive part (it can take weeks and huge amounts of compute), whereas finetuning usually touches only the last few layers and can be done with far fewer resources. You could definitely have a "nightly" finetuned model that is retrained every day or so.
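
As a rough illustration of what such a nightly finetune setup might look like with the HuggingFace stack (a sketch, not Replit's actual pipeline; the parameter-name substrings below are illustrative and depend on the architecture):

    from transformers import AutoModelForCausalLM

    # start from the released checkpoint rather than random weights
    model = AutoModelForCausalLM.from_pretrained(
        "replit/replit-code-v1-3b", trust_remote_code=True
    )

    # freeze everything except the last couple of transformer blocks and the output head
    # (inspect model.named_parameters() for the real names in your model)
    TRAINABLE = ("blocks.30.", "blocks.31.", "lm_head")
    for name, param in model.named_parameters():
        param.requires_grad = any(marker in name for marker in TRAINABLE)

    # the partially frozen model is then trained on the latest merges with the usual
    # causal-LM objective (e.g. via transformers.Trainer) on a nightly schedule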


Interesting - how would that work for a company that wanted to run their own codex model, on-prem, trained on their own code? Perhaps also trained on their dependencies?


Finetuning a smaller model leading to better performance seems like a significant finding that'll lead to a lot of companies fine-tuning their own internal "ChatGPT"s


You seem to know your stuff some, so I'll ask you a question on this: Are there any good books on all the different approaches in this space, or is it all too new and fast moving for such a thing?


There are no books on large LMs yet, but almost any resource about neural networks covers fine-tuning. I like the FastAI courses, and these do cover language models.


You can also check out the "NLP with Transformers" book.


When you fine-tune it, do you train just the head/last few layers or do you also unfreeze the model afterwards and retrain the whole model with a very small LR for a few epochs?


You can take a network and its weights that someone else trained, and use that pretrained network to train on your own data, which is likely to be a better starting point than random weights.
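
To the unfreezing question above: one common recipe (a sketch of one option, not necessarily what Replit did) is to train only the head first, then unfreeze the whole network with a much smaller learning rate:

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "replit/replit-code-v1-3b", trust_remote_code=True
    )

    # stage 1: freeze the body and train only the output head
    # ("lm_head" is an illustrative parameter name; check model.named_parameters())
    for name, p in model.named_parameters():
        p.requires_grad = "lm_head" in name
    stage1_opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4
    )
    # ... run the training loop for a while ...

    # stage 2: unfreeze everything and continue with a much smaller learning rate
    for p in model.parameters():
        p.requires_grad = True
    stage2_opt = torch.optim.AdamW(model.parameters(), lr=1e-6)
    # ... a few more passes over the same data ...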


How is this code licensed? I didn't see a license in the repo. It looks interesting!


The README indicates:

The base model checkpoint is licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.


Doesn't the Stack contain HumanEval? So you're basically comparing numbers on the pretraining data.


Can't find it now, but I'm pretty sure BigCode said somewhere that they explicitly looked for it and removed it. Also, the subjective experience does match up with the benchmark: our finetuned model scored +50% on HumanEval, and when using it, it felt at least that much improved.


You can view the prompts, solutions, and checks here[0]. See my sibling comment (to yours) where I quote the HumanEval paper and do some more analysis. But I think if you look at [0] you'll see that these aren't really unique problems and are likely to have many repetitions in the dataset. I should have added the dataset[1] to that comment (too late to edit); they mention that they scrape all of GitHub (Jan 1 2015 - Mar 31 2022). They do exact and near de-duplication, but near de-duplication is messy.

> We implement near-deduplication in our pre-processing pipeline on top of exact deduplication. We first split the files into words/tokens based on non-alphanumeric characters and remove files with fewer than 10 tokens. Next, we compute the MinHash with 256 permutations of all documents, and use Locality Sensitive Hashing to find clusters of duplicates. We further reduce these clusters by ensuring that each file in the original cluster is similar to at least one other file in the reduced cluster. We consider two files similar when their Jaccard similarity exceeds 0.85.

Near-duplicates are still difficult to measure. So we should expect duplication, and it should be proportional to the number of samples we have (even if the same variance, but I'd wager higher variance with larger duplications).

[0] https://github.com/openai/code-align-evals-data/tree/97446d9...

[1] https://arxiv.org/abs/2211.15533
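
For a concrete sense of the near-deduplication described in that quote, here is a rough sketch using the `datasketch` library (an assumption on my part; BigCode's actual pipeline is its own code):

    import re
    from datasketch import MinHash, MinHashLSH  # assumed third-party library, not BigCode's code

    def minhash_of(source: str, num_perm: int = 256) -> MinHash:
        # split into words/tokens on non-alphanumeric characters, as in the quote
        tokens = {t for t in re.split(r"[^0-9A-Za-z]+", source) if t}
        m = MinHash(num_perm=num_perm)
        for tok in tokens:
            m.update(tok.encode("utf-8"))
        return m

    def near_duplicates(files: dict[str, str], threshold: float = 0.85) -> dict[str, list[str]]:
        # Locality Sensitive Hashing groups files whose estimated Jaccard similarity exceeds the threshold
        lsh = MinHashLSH(threshold=threshold, num_perm=256)
        sketches = {}
        for path, src in files.items():
            if len([t for t in re.split(r"[^0-9A-Za-z]+", src) if t]) < 10:
                continue  # drop files with fewer than 10 tokens
            sketches[path] = minhash_of(src)
            lsh.insert(path, sketches[path])
        return {path: lsh.query(mh) for path, mh in sketches.items()}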


My favorite line from the HumanEval paper[0]:

> It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.

So to answer your question, yes, the evaluation dataset is spoiled. You can find such unique and never-before-seen docstrings as

> For a given list of input numbers calculate the Mean Absolute Deviation around the mean of this dataset. Mean Absolute Deviation is the absolute difference between each element and a centerpoint (mean in this case)[1]

And here's a repo I found that is 8 years old[2]. But how about a more recent one that is even closer?[3] There are plenty more examples[4] (does anyone know how to actually limit the date to prior to 2021? `pushed:<2021` doesn't work, nor does using the `created` keyword; date searching doesn't seem to work well).

In essence, we can still use this evaluation method to determine how good our model is at doing fuzzy searching. Which, mind you, is still a useful thing. But I would be careful in concluding that this means the model is good at generalizing arbitrary descriptions of code or novel pieces of code. That said, one may be able to argue that not many lines of code are actually that novel. Still, we need to be careful about our conclusions and understand the limitations of our metrics (something I am currently deeply troubled by)

[0] https://arxiv.org/abs/2107.03374

[1] https://github.com/openai/code-align-evals-data/blob/97446d9...

[2] https://github.com/bertomartin/stat4701/blob/ec2b64f629cbbf6...

[3] https://github.com/danielwatson6/hate-speech-project/blob/64...

[4] https://github.com/search?q=abs%28x+-+mean%29+for+language%3...
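
For reference, the task behind that docstring fits in a couple of lines, which is exactly why near-duplicates of it are all over GitHub; something like:

    def mean_absolute_deviation(numbers: list[float]) -> float:
        # average absolute difference between each element and the mean
        mean = sum(numbers) / len(numbers)
        return sum(abs(x - mean) for x in numbers) / len(numbers)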


(follow-up: Figured this should be a different comment)

I wanted to demonstrate what I said above, so I came up with some examples of things I think a human would have an easy time implementing but a model might find hard. BUT a key part is that I expect these to be in the dataset! I just don't expect them to be in hundreds or thousands of GitHub repos, because they will be uncommon (but not rare). Also, we'll pretty much ask for few-liners to give the model the biggest advantage we can (errors compound).

Prompt:

    from torch import nn

    class LipSwish(nn.Module):
        """
        The Swish activation function is defined by a gated linear unit,
        where the gate is defined by a sigmoid function and multiplies the input with
        a learnable parameter, beta. Beta is initialized as 0.5.
        The LipSwish function normalizes the output by the upper bound of 1.1.
        """

        def __init__(self):
            super().__init__()

Result: Mostly correct but missing the division by 1.1. The forward is `return x * F.sigmoid(self.beta * x)`, which is Swish (it also assumes we had "import torch" and applied type hinting). It did properly set the beta parameter (this is just a three-liner).

Discussion: The Swish function should be in the dataset and is a well-known activation function (though beta is not in the PyTorch version). Despite LipSwish being in the dataset (introduced in 2019 by Residual Flows[0]), it is not common. I could get the model to generate the Swish function (initializing beta, and performing the gate) but could not get it to divide the output by 1.1. I would not expect a human to have difficulties with this.
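
For reference, the completion I was looking for is essentially this (based on the Residual Flows definition: Swish with a learnable beta initialized at 0.5, normalized by the 1.1 bound):

    import torch
    from torch import nn

    class LipSwish(nn.Module):
        def __init__(self):
            super().__init__()
            self.beta = nn.Parameter(torch.tensor(0.5))

        def forward(self, x):
            # Swish gated by a learnable beta, divided by the 1.1 upper bound
            return x * torch.sigmoid(self.beta * x) / 1.1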

Okay, so let's try something else that might be a bit more common and older. The same paper uses a concatenated activation function, and those aren't "uncommon". CReLU was introduced in 2016[1] and there have been plenty of concatenated activations around since then. The PyTorch documentation even uses it as an example[2]. There are far more examples of CReLU (3k Python results for "class CReLU" vs 58 for "class LipSwish"; use these numbers as weak hints because search sucks and isn't always accurate).

Prompt:

    from torch import nn
    from torch.nn import functional as F

    class CReLU(nn.Module):
        """
        Concatenated version of ReLU. The activation is applied to both the positive and
        negative of our input and the result is concatenated.
        """

        def __init__(self):
            super().__init__()

        def forward(self, x):

Result: `return torch.cat([x.clamp(min=0), -x.clamp(min=0)], 1)`. This is correct but not the expected one-liner result.

Discussion: This was a bit surprising; it didn't use functional as we might expect (or hinted). But interestingly it will if we change the class name to "ConcatenatedReLU". I found exact copies on GitHub with the full name (memorization), but the first page of instances for CReLU I found used functional (I did find one that was exactly the above code when adding "clamp" to the search, but missing the minus sign; there were plenty of errors in CReLU implementations). Interesting side note: CReLU continues and defines a function CReLU6 which uses the same docstring but clamps with a max of 6 on the positive input, whereas ConcatenatedReLU starts to define a convolutional block (Conv + BatchNorm + ReLU) called Conv2d.
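
For comparison, the functional-style one-liner I was expecting is roughly:

    import torch
    from torch.nn import functional as F

    def crelu(x: torch.Tensor, dim: int = 1) -> torch.Tensor:
        # apply ReLU to both the input and its negation, then concatenate
        return torch.cat([F.relu(x), F.relu(-x)], dim=dim)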

So we have kind of mixed results, and in both cases the outputs are rather odd and probably not what we wanted. We can clearly see that there are issues where a human would not have much trouble. There's a big tension in these types of problems: the model needs to memorize a lot of information (otherwise it can't write code or know library calls), but too much memorization prevents creativity. There is a lot of gray area between the _pure_ "Stochastic Parrot"/"fancy copy machine" and a generalized intelligence (with a broad and flexible definition of intelligence). I'd still call them stochastic parrots, because to me the evidence suggests that we're closer to the memorization side than the creation side.

But that doesn't mean these frameworks aren't useful. We all know a lot of code is boilerplate (otherwise we wouldn't have the joke "copy paste from SO"), and these tools can be very useful for that. I think the utility will depend heavily on what you are coding and how you code, though. If you're doing standard stuff, this probably has high utility for you and can save you a lot of time, the same way writing macros does, but this is FAR more powerful. It can also help novices a lot. On the other hand, if your main errors are reading mistakes (e.g. you're dyslexic) -- this is my largest problem -- then this might make things difficult, as you have a tendency to gloss over text and miss minor errors. I also don't think these tools help much if you're a researcher or writing optimized or specialized code.

These differences are probably why we see such different reactions. It may also be a hint about what people do and how they work when we see who raves and who rants about these tools.

[0] https://arxiv.org/abs/1906.02735

[1] https://arxiv.org/abs/1603.05201

[2] https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html

Edit: We can also check if code is in the Stack[3]. We see that [0] is indeed in the dataset, so we know there is information leakage. Interestingly, the exact copy I found in the previous comment[4] isn't! (The repo isn't, though the user is.)

[3] https://huggingface.co/spaces/bigcode/in-the-stack

[4] https://github.com/bertomartin/stat4701/blob/ec2b64f629cbbf6...


Hi there, I have two questions:

1 - Why did you choose Markdown? It seems an odd choice for training a model like this.

2 - Have you tried to train only one single PL and then benchmark it against this more general version?


1- We trained on languages that are most popular on Replit. Markdown is important because you need some amount of natural language in the data, and it will act as a sort of "natural language label" for code.

2- I like how portable it is: a single small model covering a lot of languages. Single-language code models are an approach that models like Salesforce/CodeGen took, but I believe we beat (or get very close to) their mono models on benchmarks.


Have you thought of finding or creating something like this [0]?

I created this as the basis for my origami-folding descriptive language. I tried to find something similar -- the requirements being both well-structured and English-like -- but couldn't find any, so I created it.

The origami folding app will hopefully be out in 2 weeks, so you can see how it's used.

[0] https://github.com/fuzzthink/mation-spec


They trained on https://huggingface.co/datasets/bigcode/the-stack-dedup which is a massive curated dataset accumulated from GitHub. Details are here: https://www.bigcode-project.org/docs/about/the-stack/

Many of the most-represented "languages" on GitHub are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.

More details from them here: https://blog.replit.com/llm-training


Did any interns help in developing this? If so are you planning on intimidating them as usual? :)

Reference: How Replit used legal threats to kill my open-source project https://intuitiveexplanations.com/tech/replit/


Wow. That's extremely poor behaviour if the account is accurate.



Very exciting, thanks for sharing all this



