Facebook's focus on unsupervised machine learning is a huge plus for under-resourced languages. They published a similar article on unsupervised machine translation[1] before, and I can see how this will open doors for many African languages.
It seems that the discriminator tests whether the generator's output follows the patterns of a language. Isn't it still possible that, after training, the generator outputs the wrong text for a given input? In other words, the output is understandable but isn't actually what the sounds indicate?
I see they are using a GAN, and doing unsupervised training. But then they appear to compare their model to supervised-trained models.
How do they do this? Do they tack a supervised-trained model onto the end of their unsupervised model? I imagine they must do supervised training at some point, else how can they convert sounds to text?
so close... you need to put your car in jabberish (jedi jibberish) command console which will then correctly understand "cool jodi" as "Cool Jedi" and enter mind control mode after which you'll only have to think of calling Jodie and the other phone will be ringing on speaker...
That's great, but its error rate is still roughly twice that of the best supervised model.
Also: "The discriminator itself is also a neural network. We train it by feeding it the output of the generator as well as showing it real text from various sources that were phonemized."
Is the "real text from various sources that were phonemized" a manually labeled database? If yes, that step is supervised, which makes the whole thing supervised to some extent.
For me, the scariest part is that Messenger doesn't do end-to-end encryption, which means the US government is able to listen in on my conversations there.
A benevolent guess would be: it's probably related to their AR/VR effort. They are basically betting on AR/VR (including Quest, Portal, etc.) to be the next computing platform, and voice input (and output) is a key user interaction model. IIRC, they are building a team to replicate Google Assistant and Siri.
Makes sense - transcribing speech from all uploaded video seems like it would be valuable for identifying interests and serving more personalized advertisements.
For example, in a video taken at the beach where someone says “wow that looks fun” (in any language) and there is a boat in the frame, you could serve them boat ads.
People keep downvoting me because "we need you to send proof", but sorry, I don't want to share Snowden's fate. I know for a fact that iOS/Android, most TVs, and the major US cable boxes are indeed analyzing your speech and sending _keywords_ from your conversations back home. I can imagine that's how they avoid being sued into oblivion for a clear 4A violation, since courts have held that metatags don't violate your privacy.
But go ahead, try it yourself! Have a conversation about having children with your loved one, next to your phones and your TV box, with no online searches. Give it a few hours and turn your Sling on, or browse some Amazon/YouTube. All of a sudden you will see ads for products and companies you have never heard of, trying to sell you diapers or baby cribs. Where do you think that came from? Google, as of today, is still unable to read your mind.
So, in a quiet room, we know that to save power and data, devices won't be streaming data or listening. We can use that as a baseline for power and network usage.
Then we can start talking and measure that,
then start saying watchwords and see what happens (rough sketch below).
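A rough sketch of that measurement, assuming you only care about this machine's own traffic (a phone or TV box would need a router-level capture instead); psutil here just reads the OS network counters:

    import time
    import psutil

    def bytes_sent_over(seconds):
        # Read the OS-level counter before and after a waiting period.
        start = psutil.net_io_counters().bytes_sent
        time.sleep(seconds)
        return psutil.net_io_counters().bytes_sent - start

    quiet = bytes_sent_over(60)        # baseline: quiet room, no interaction
    talking = bytes_sent_over(60)      # normal conversation near the device
    watchwords = bytes_sent_over(60)   # conversation salted with ad-relevant keywords

    print(f"quiet: {quiet} B, talking: {talking} B, watchwords: {watchwords} B")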
That aside, we know that it's really expensive to listen and transcribe 24/7. It's far easier and cheaper to monitor your web activity and surmise your intentions from that. There are quite a few searches/website visits that strongly correlate with life events. Humans are not as unique and random as you may think, especially to a machine with perfect memory.
You're not going to distinguish what people want to BUY from Google searches as well as you can from conversation. When I google "Ferrari", it may mean I am looking for a Ferrari wallpaper, Ferrari stats, Ferrari parts, or to buy a new Ferrari. When I have a conversation with someone about buying a Ferrari, the conclusion IS that I am ready to buy a Ferrari.
They ain't doing shit. Voice assistants only listen for their built-in trigger words, if you even turn that on, and they do what they say with the conversation afterwards.
4A doesn't matter anyway, what matters is GDPR, it's not worth building a system that is not GDPR compliant even if you're outside the EU.
Was anyone else surprised that they used simple k-means for clustering the phonemes? I skimmed the paper, and it looks like they use a really high k (128), presumably to avoid having to do an elbow-plot-style approach.
Maybe the computational benefits of such a simple algorithm outweighed the potentially bad clusters. Thoughts? I'll try to read the paper in more depth later in case they explain the choice and I missed it.
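For reference, the clustering step itself is cheap to reproduce; here is a sketch of what I understand it to be, with random arrays standing in for the learned frame-level speech representations:

    import numpy as np
    from sklearn.cluster import KMeans

    frames = np.random.randn(10_000, 512)    # stand-in for (num_frames, feature_dim) speech features

    kmeans = KMeans(n_clusters=128, n_init=10, random_state=0).fit(frames)
    unit_ids = kmeans.predict(frames)        # each frame assigned to one of 128 discrete units

    print(unit_ids[:20])                     # pseudo-phonemic unit sequence for the first frames

My guess is that with k that high you're not hunting for "true" clusters, just quantizing the feature space finely enough for the downstream components to sort out, which would explain skipping the elbow analysis.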
Babelfish-like technology would be a dream come true for me. At this point in time, Google Translate's speech recognition works well when the speaker speaks a bit slowly and uses simple sentences. Translation is much the same: it works well for simple sentences, even though the translations in my language sound weird because it uses the equivalent of Old English. There is still a way to go before speech recognition can recognize fast speech with colloquialisms, and probably an even longer way to go before it can translate longer sentences with abstractions into something clear and concise in the target language.
> There is still a way to go before speech recognition can recognize fast speech with colloquialisms, and probably even a longer way to go before it can translate longer sentences with abstractions into something that is clear and concise in the target language.
Did you watch the video tweeted by Facebook's CTO?
The speed, accent, and word choices of the speaker would throw off previous state-of-the-art tech.
The amazing thing about this new unsupervised algorithm is they can throw unlimited amounts of unlabeled audio and text at a computer and wait for it to become great. It's not AGI but it is still amazing. Language detection is also already solved so there really is no labeling required.
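On the language-detection point: for the text side at least, off-the-shelf tools make it a one-liner (the langdetect package below is just one example I'm using for illustration; pretrained spoken-language ID models exist as well):

    from langdetect import detect

    print(detect("This is an English sentence."))        # expected: 'en'
    print(detect("Ceci est une phrase en français."))    # expected: 'fr'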
So maybe don't use the tech yet in such high-stakes situations. In my personal life, an automated translation error is very unlikely to cause a serious problem, while an automated driving error could kill me.
(Fwiw, if you read the comments at your second link, you'll find the image is a fake.)
It looks like many of the components used are still English-biased. For example, the off-the-shelf tool they used to generate phonemes from text was a state-of-the-art system for English and a more general but less performant one for other languages.
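I don't know exactly which tools they used, but as an example of the kind of off-the-shelf text-to-phoneme step being discussed, the phonemizer package with the espeak-ng backend covers many languages generically (this is my illustration, not necessarily their pipeline):

    from phonemizer import phonemize

    # Generic multilingual G2P via espeak-ng; English also has dedicated, better-tuned systems.
    print(phonemize("the quick brown fox", language="en-us", backend="espeak"))
    print(phonemize("der schnelle braune Fuchs", language="de", backend="espeak"))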
Will Facebook license this tech out? I can't be the only one who's noticed Google's speech recognition has gotten significantly less accurate the last couple of years, probably the result of some cost-cutting strategy.
There was a time a few years back when Google's speech recognition was virtually flawless (at least for my voice), but nowadays it verges on useless. I've had similar experiences with Google Maps (specifically navigation) and Google Translate. My understanding is that the implementation of these products is now mostly ML-driven, whereas in previous iterations ML was just one aspect of it. Curious to know how much truth there is to that, or if I'm just viewing things through rose-tinted glasses.
Indeed. Growing up playing with voice recognition on Windows I knew how to talk to computers. You speak clearly, enunciate your consonants, and keep a consistent pace and even tone. When I used my "computer voice" on Android I could carry on a text or IM conversation with my phone sitting in my pocket. Nowadays it hasn't gotten any better at understanding my natural drawl but the "computer voice" fills the sentences with bizarre punctuation and randomly-capitalized words that make me look like a lunatic.
I really don't know just what the hell my phone wants from me anymore.
I wonder if it's less accurate for specific people but more accurate in general. In other words, speech recognition may have been more accurate in the beginning for English-speaking men, and maybe now it's better for women, kids, people with accents, other languages, etc.
They train this on examples of speech, and maybe it's broader now.
That's possible, but it wouldn't explain bizarre behavior like capitalizing random words in the middle of sentences or the absolute refusal to type the word "o'clock".
He's saying that it might not be getting worse due to a cost-cutting strategy, but rather an attempt to optimize the value for the most possible people.
Are you looking at Facebook with glasses from 2010? Because one would expect an almost-trillion-dollar company to work on more than a single-focus product.
Why would a search engine (1997) make glasses (2013)?
I was contacted by their recruiters recently and started getting warmed up on their interviewing, etc. However, this was back in December when they put out ads against Apple's choice to allow users to opt out of tracking. It kind of made it difficult to want to follow through on the interview process after that... so I didn't.
Yes, LeCun is still head of AI research there, and he's a big draw for talent. Any advancements like this (unsupervised learning) are a huge boon for technology. Often, the most time consuming part of machine learning is creating the labeled dataset. It's amazing to see what can be achieved without one.
This project is more on the academic side -- disclaimer, I was involved in it, but it's led by Justin Harris at Microsoft Research: "Sharing Updatable Models on Blockchain" https://github.com/microsoft/0xDeCA10B
The idea is that the smart contract is a learning algorithm, and people can donate data to a public repository stored on a blockchain. The learned model is publicly available for everyone. People can also receive incentives for donating data in some implementations.
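As a toy Python analogue of that idea (the actual project uses Solidity smart contracts, and the reward logic below is a placeholder rather than one of its real incentive mechanisms):

    from collections import defaultdict

    import numpy as np

    class UpdatableModelContract:
        """Toy stand-in for an on-chain updatable model with per-contributor incentives."""

        def __init__(self, dim, lr=0.1):
            self.weights = np.zeros(dim)        # public model state, readable by anyone
            self.lr = lr
            self.balances = defaultdict(float)  # incentive balance per contributor

        def predict(self, x):
            return 1 if float(np.dot(self.weights, x)) >= 0 else -1

        def add_data(self, contributor, x, y):
            """Donate one labeled example; a perceptron-style update is applied."""
            x = np.asarray(x, dtype=float)
            if self.predict(x) != y:            # perceptron rule: update only on mistakes
                self.weights += self.lr * y * x
            self.balances[contributor] += 1.0   # placeholder reward per contribution

    contract = UpdatableModelContract(dim=3)
    contract.add_data("alice", [1.0, 0.5, -0.2], +1)
    contract.add_data("bob", [-0.8, 0.1, 0.4], -1)
    print(contract.weights, dict(contract.balances))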
Common Voice ( https://commonvoice.mozilla.org/ ) is one such project. I've urged the people I know who have less-common accents to contribute, but I'm not sure they have.
Good, high-quality, wide-coverage, labeled datasets are expensive to assemble. Most companies don't want to give them away. You can find a number from academia, though.
[1] https://engineering.fb.com/2018/08/31/ai-research/unsupervis...