High-performance speech recognition with no supervision at all (facebook.com)
212 points by panabee on May 21, 2021 | 62 comments


Facebook's focus on unsupervised machine learning is a huge plus for under-resourced languages. They had a similar article for unsupervised machine translation[1] before, and I can see how it'll open doors for many African languages.

[1] https://engineering.fb.com/2018/08/31/ai-research/unsupervis...


Not only that, it's also great for the English models. Many models are trained using the LibriSpeech dataset, with its 960 hours of labeled audio.

What if you trained with 960000 hours instead??


Labeled data is easy to come by in popular languages though. Just get an Audible subscription.


Hilarious how the Arabic "hello" in that picture is completely mangled since they rendered the right-to-left text left to right.

(basically instead of "hello" it says "o l l e h")


In the title image, the Arabic word "ahlan" (welcome) is spelled correctly but rendered left to right instead of right to left (it should look like this: أهلا)


It takes effort to screw up a single word while copy-pasting it from a Google search result page.


It seems that the discriminator tests whether the output of the generator obeys the patterns of a language. Isn't it still possible that, after training, the generator outputs the wrong text for a given input? In other words, the output is understandable but not actually what the sounds indicate?
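
A toy sketch of the adversarial setup as I understand it from the article (hypothetical shapes and names, not the paper's actual code): the generator maps audio-derived features to phoneme distributions, and the discriminator only checks whether the result looks like real phonemized text -- nothing in the loss directly ties the output to the specific audio content, which is why I'm asking.

    import torch
    import torch.nn as nn

    N_PHONEMES, FEAT_DIM, SEQ_LEN, BATCH = 40, 512, 20, 8

    generator = nn.Linear(FEAT_DIM, N_PHONEMES)        # audio features -> phoneme logits
    discriminator = nn.Sequential(                      # phoneme sequence -> "real language?" score
        nn.Flatten(), nn.Linear(SEQ_LEN * N_PHONEMES, 1))

    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    audio_feats = torch.randn(BATCH, SEQ_LEN, FEAT_DIM)          # stand-in for wav2vec-style features
    real_text = torch.randint(0, N_PHONEMES, (BATCH, SEQ_LEN))   # stand-in for phonemized real text
    real_onehot = nn.functional.one_hot(real_text, N_PHONEMES).float()

    for step in range(100):
        fake = generator(audio_feats).softmax(-1)       # soft phoneme distributions ("transcription")

        # Discriminator: score real phonemized text as real, generator output as fake.
        d_loss = (bce(discriminator(real_onehot), torch.ones(BATCH, 1))
                  + bce(discriminator(fake.detach()), torch.zeros(BATCH, 1)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator: produce phoneme sequences the discriminator accepts as real language.
        g_loss = bce(discriminator(fake), torch.ones(BATCH, 1))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()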


Context may help determine the intended word. Certainly any such system can be wrong sometimes.



Could someone explain what is being done here?

I see they are using a GAN, and doing unsupervised training. But then they appear to compare their model to supervised-trained models.

How do they do this? Do they tack a supervised-trained model onto the end of their unsupervised model? I imagine they must do supervised training at some point, else how can they convert sounds to text?


Hopefully they license it to Tesla so that my car stops misinterpreting 'call Jodie' as 'cool jodi'.


so close... you need to put your car in jabberish (jedi jibberish) command console which will then correctly understand "cool jodi" as "Cool Jedi" and enter mind control mode after which you'll only have to think of calling Jodie and the other phone will be ringing on speaker...


That's great, but it still performs 2 times worse than the best supervised model.

Also : "The discriminator itself is also a neural network. We train it by feeding it the output of the generator as well as showing it real text from various sources that were phonemized."

Is the "real text from various sources that were phonemized" a manually labelized database? If Yes, that step is supervised, which makes the whole thing actually supervised to some extent


If phonemization is converting words to phonetic symbols, it could be taken from existing human-written dictionaries automatically.
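
As a minimal sketch of what dictionary-based phonemization looks like (tiny hand-written lexicon purely for illustration; real pipelines load something like CMUdict or use an espeak-style tool):

    # Look each word up in a pronunciation lexicon and emit its phoneme sequence.
    # The two entries below are illustrative, not real CMUdict entries.
    LEXICON = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    def phonemize(sentence, lexicon=LEXICON, unk="<UNK>"):
        phonemes = []
        for word in sentence.lower().split():
            phonemes.extend(lexicon.get(word.strip(".,!?"), [unk]))
        return phonemes

    print(phonemize("Hello, world!"))   # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']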


How does Facebook use speech recognition in its own products?


Our main usages of speech rec are:

- Video understanding for content moderation, ranking, and recommendations

- Captioning

- Assistant tech in Portal, Oculus and upcoming glasses


And the not main usage?


For me the scariest part is the lack of end-to-end encryption in Messenger, which means that the US government is able to listen to my conversations on it.


Not if you use secret conversation mode, or just use WhatsApp.


A charitable guess would be: it's probably related to their AR/VR effort. They are basically betting on AR/VR (incl. Quest, Portal, etc.) being the next computing platform, and voice input (also output) is a key user interaction model. IIRC, they are building a team to replicate Google Assistant and Siri.


e.g., transcription of live streaming events?


Makes sense - transcribing speech from all uploaded video seems like it would be valuable for identifying interests and serving more personalized advertisements.

For example, in a video taken at the beach where someone says “wow that looks fun” (in any language) and there is a boat in the frame, you could serve them boat ads.
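
A toy illustration of the kind of matching I mean (entirely hypothetical rules, just to make the example concrete):

    # Pair keywords from a video's transcript with detected objects to pick an ad category.
    AD_RULES = [
        ({"fun", "wow"}, "boat", "boat ads"),
        ({"hungry", "dinner"}, "restaurant", "food delivery ads"),
    ]

    def pick_ads(transcript, detected_objects):
        words = set(transcript.lower().split())
        return [ad for keywords, obj, ad in AD_RULES
                if keywords & words and obj in detected_objects]

    print(pick_ads("wow that looks fun", {"boat", "person"}))   # ['boat ads']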


Haven't you ever wondered if some mobile apps listen to you? And then you get accurate targeted ads...


People keep downvoting me because "we need you to send proof", but sorry, I don't want to share Snowden's fate. I know for a fact that iOS/Android, most TVs, and major TV cable boxes (USA) are indeed analyzing your speech and sending _keywords_ from your conversations back home. I can imagine that's how they avoid being sued into oblivion for a clear 4A violation, as courts have held that metatags are not a violation of your privacy.

But go ahead, try it yourself! Have a conversation about having children with your loved one, next to your phones and your TV box, no online search required. Give it a few hours and turn your Sling on, or browse some Amazon/YouTube. All of a sudden you will see ads for products and companies you have never heard of, trying to sell you diapers or baby cribs. Where do you think that came from? Google, as of today, is still unable to read your mind.


Ok, for those that want proof, it's pretty simple to do.

1) we know that sending voice data to "HQ" costs power

2) we know that live transcription costs a huge wedge of power

3) we know that wakeword matching is quite power efficient. (see https://rhasspy.readthedocs.io/en/latest/wake-word/, https://github.com/MycroftAI/mycroft-precise)

So, in a quiet room, we know that to save power and data, devices won't be streaming data/listening. We can use that as a baseline for power and network usage.

Then we can start talking, measure that

then start saying watchwords and see what happens.
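
A rough sketch of that comparison in code (run it on a machine whose traffic you can actually measure -- it won't see a phone or TV box unless their traffic routes through it; psutil's counters are real, but the interval lengths and threshold are arbitrary):

    import time
    import psutil

    def bytes_sent_during(seconds):
        # Read the OS network counters before and after a fixed interval.
        start = psutil.net_io_counters().bytes_sent
        time.sleep(seconds)
        return psutil.net_io_counters().bytes_sent - start

    print("Stay quiet for 60s...")
    quiet = bytes_sent_during(60)          # baseline: silent room

    print("Now talk normally for 60s...")
    talking = bytes_sent_during(60)        # ordinary conversation

    print("Now repeat suspected watchwords for 60s...")
    watchwords = bytes_sent_during(60)     # e.g. product names, brands, "buy"

    print(f"quiet={quiet}  talking={talking}  watchwords={watchwords}")
    if max(talking, watchwords) > 2 * max(quiet, 1):
        print("Upload volume rose noticeably while speaking -- worth a closer look.")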

That aside, we know that it's really expensive to listen and transcribe 24/7. It's far easier and cheaper to monitor your web activity and surmise your intentions from that. There are quite a few searches/website visits that strongly correlate with life events. Humans are not as unique and random as you may think, especially to a machine with perfect memory.


Did you try the example I described, at home?

You're not going to distinguish what people want to BUY from a Google search as well as from a conversation. When I google "Ferrari" it may mean I am looking for a Ferrari wallpaper, Ferrari stats, Ferrari parts, or want to buy a new Ferrari. When I have a conversation with someone about buying a Ferrari, the conclusion IS that I am ready to buy a Ferrari.


They ain't doing shit. Voice assistants only listen for their built in trigger words, if you even turn that on, and they do what they say with the conversation afterwards.

4A doesn't matter anyway, what matters is GDPR, it's not worth building a system that is not GDPR compliant even if you're outside the EU.


Was anyone else surprised that they used simple k-means for clustering the phonemes? I skimmed the paper and it looks like they use a really high k (128), presumably to avoid having to do an elbow-plot like approach.

Maybe the computational benefits of such a simple algorithm outweighed the potentially bad clusters. Thoughts? I'll try to read the paper in more depth later in case they explain the choice and I missed it.
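
For reference, a minimal sketch of what that clustering step might look like (stand-in random features here; in the paper they come from wav2vec 2.0):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    features = rng.normal(size=(10_000, 512))     # (n_frames, feature_dim) stand-in

    # With k well above the number of true phonemes, each phoneme can be covered
    # by several clusters, so a single poor cluster matters less.
    kmeans = KMeans(n_clusters=128, n_init=4, random_state=0).fit(features)
    frame_cluster_ids = kmeans.labels_            # pseudo-phoneme id per frame
    print(frame_cluster_ids[:20])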


That's incredible. Since translation can be done this way too, it may truly be a short time until widespread use of babel-fish tech becomes reality.

What a world!


Babel-fish-like technology would be a dream come true for me. At this point in time, Google Translate's speech recognition works well when the speaker speaks a bit slowly and uses simple sentences. Translation is the same: it works well for simple sentences, even though the translations in my language sound weird because it uses the equivalent of Old English. There is still a way to go before speech recognition can handle fast speech with colloquialisms, and probably an even longer way to go before it can translate longer sentences with abstractions into something that is clear and concise in the target language.


> There is still a way to go before speech recognition can recognize fast speech with colloquialisms, and probably even a longer way to go before it can translate longer sentences with abstractions into something that is clear and concise in the target language.

Did you watch the video tweeted by Facebook's CTO?

https://twitter.com/schrep/status/1395766932104572928

The speed, accent, and word choices of the speaker would throw off previous state-of-the-art tech.

The amazing thing about this new unsupervised algorithm is that they can throw unlimited amounts of unlabeled audio and text at a computer and wait for it to become great. It's not AGI, but it is still amazing. Language detection is also already solved, so there really is no labeling required.


I wonder if this or similar tech can be used for animals.


Nope, like autonomous driving, the problem will be the last 20%.


Just because one overhyped tech is not working out does not mean all new tech will fail.

We already have noise cancellation, good tiny batteries, fast wireless internet, and all the tech to learn spoken languages without supervision!

Autonomous driving has a much higher bar for accuracy.


Are you sure? A translation error played a part in the dropping of the atomic bomb on Hiroshima.

https://pangeanic.com/knowledge/the-worst-translation-mistak...

Language has great ambiguity; it depends heavily on time period, context, and tone.

https://www.reddit.com/r/funny/comments/2chfge/you_need_some... That could lead to problems even without translation.


So maybe don't use the tech yet in such high-stakes situations. In my personal life, an automated translation error is very unlikely to cause a serious problem, while an automated driving error could kill me.

(Fwiw, if you read the comments at your second link, you'll find the image is a fake.)


Same for autonomous driving. Don't use it on the highway but in the city. Lower speed, lower risk of dying. Fake or not, that doesn't make my point invalid.


If it's unsupervised, why is English transcription much better than for the other languages?


It looks like many of the components used are still English-biased; for example, the off-the-shelf tool they used to generate phonemes from text was a state-of-the-art system for English and a more general but less performant version for other languages.


Don't want to repeat a comment, so I'll link it here: https://news.ycombinator.com/item?id=27236880



Whenever FB translated a comment for me, it was complete gibberish. I'm a bit sceptical about their efforts. Is it really good quality?


Will Facebook license this tech out? I can't be the only one who's noticed Google's speech recognition has gotten significantly less accurate the last couple of years, probably the result of some cost-cutting strategy.


They have an open source repo linked from the article here: https://github.com/pytorch/fairseq/tree/master/examples/wav2...


There was a time a few years back when Google's speech recognition was virtually flawless (at least for my voice), but nowadays it verges on useless. I've had similar experiences with Google Maps (specifically navigation) and Google Translate. My understanding is that the implementation of these products is now mostly ML-driven, whereas in previous iterations ML was just one aspect of it. Curious to know how much truth there is to that or if I'm just viewing things with rose-tinted glasses.


Indeed. Growing up playing with voice recognition on Windows I knew how to talk to computers. You speak clearly, enunciate your consonants, and keep a consistent pace and even tone. When I used my "computer voice" on Android I could carry on a text or IM conversation with my phone sitting in my pocket. Nowadays it hasn't gotten any better at understanding my natural drawl but the "computer voice" fills the sentences with bizarre punctuation and randomly-capitalized words that make me look like a lunatic.

I really don't know just what the hell my phone wants from me anymore.


I wonder if it's less accurate for specific people, but more accurate generally. In other words, speech recognition was more accurate in the beginning for English-speaking men, and maybe now it's better for women, kids, people with accents, other languages, etc.

They train this on examples of speaking and maybe it's more broad now.


That's possible, but it wouldn't explain bizarre behavior like capitalizing random words in the middle of sentences or the absolute refusal to type the word "o'clock".


What in the world are you talking about?


He's saying that it might not be getting worse due to a cost-cutting strategy, but rather an attempt to optimize the value for the most possible people.


Why would FB work on such a tech?


Are you looking at Facebook with glasses from 2010? Because one would expect an almost-trillion-dollar company to work on more than a single-focus product.

Why would a search engine (1997) make glasses (2013)?


>Why would a search engine (1997) make glasses (2013)?

It literally was just because Sergey Brin thought it was cool.


To get even more data to analyze, sell, and leak.


Perhaps having the freedom to work on these problems is necessary to keep talent.

Positive stories associated with the brand seem especially needed these days. Perhaps FB keeps other “rainy day” projects as well.


I was contacted by their recruiters recently and started getting warmed up on their interviewing, etc. However, this was back in December when they put out ads against Apple's choice to allow users to opt out of tracking. It kind of made it difficult to want to follow through on the interview process after that... so I didn't.


I did the same. I don't think you are alone, at all.

FB pays well and works on really interesting things but I just can't justify the moral price of working there.


Thank you for your choice of not working there!


Yes, LeCun is still head of AI research there, and he's a big draw for talent. Any advancements like this (unsupervised learning) are a huge boon for technology. Often, the most time consuming part of machine learning is creating the labeled dataset. It's amazing to see what can be achieved without one.


Question: are there any efforts to communally create / crowd source training material for neural networks?

I'm thinking of language data like this, but also labelled imagery for, say, face detection, which currently works better on white people.

Has anyone attempted to create a way for people to create and donate labelled data to a dataset?


This project is more on the academic side -- disclaimer, I was involved in it, but it's led by Justin Harris at Microsoft Research: "Sharing Updatable Models on Blockchain" https://github.com/microsoft/0xDeCA10B

The idea is that the smart contract is a learning algorithm, and people can donate data to a public repository stored on a blockchain. The learned model is publicly available for everyone. People can also receive incentives for donating data in some implementations.
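
As a toy Python stand-in for the idea (not the project's actual smart-contract code): each data donation applies a simple update rule to a shared, publicly readable model.

    import numpy as np

    class SharedPerceptron:
        def __init__(self, n_features, lr=0.1):
            self.w = np.zeros(n_features)   # publicly visible model state
            self.b = 0.0
            self.lr = lr

        def predict(self, x):
            return 1 if np.asarray(x) @ self.w + self.b > 0 else 0

        def donate(self, x, label):
            """Anyone can submit (x, label); the shared model updates in place."""
            error = label - self.predict(x)
            self.w += self.lr * error * np.asarray(x)
            self.b += self.lr * error

    model = SharedPerceptron(n_features=2)
    model.donate([1.0, 0.0], 1)
    model.donate([0.0, 1.0], 0)
    print(model.predict([1.0, 0.2]))   # -> 1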


Common Voice ( https://commonvoice.mozilla.org/ ) is one such project. I've urged the people I know who have less-common accents to contribute, but I'm not sure they have.

Good, high quality, wide coverage, labeled datasets are expensive to assemble. Most companies don't want to give them away. You can find a number from academia, though.



