They trained on https://huggingface.co/datasets/bigcode/the-stack-dedup, the deduplicated version of The Stack, a large curated dataset of source code collected from GitHub. Details are here: https://www.bigcode-project.org/docs/about/the-stack/

Many of the most-represented "languages" on GitHub are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.

More details from them here: https://blog.replit.com/llm-training



