
My language has under 1000 hours of training data, so apparently it is not well supported. How can I help add more training data? I actually have tens of hours of transcriptions of my own voice in my language, because I have been taking voice notes for almost two decades. Most of it is very personal, but I could probably set aside a good portion for this and other projects.

https://keyboard.futo.org/whisper-training-data-breakdown




If you contribute to Mozilla Common Voice, it'll probably make its way into their training data eventually. https://commonvoice.mozilla.org/en/languages

If you want to use your existing recordings, uploading them to YouTube or archive.org with a Creative Commons license might also work.


Great, thanks, I'll spend some time with that.


Hi dotancohen,

More info on how to help is here:

https://github.com/futo-org/voice-input-models


Thank you!



