Yes, if they just paid people to record their voice they would not get training data for real use cases.

I cannot find the blog post now, but I recall that quite a few years ago some Google employees noticed a large number of queries for "cha cha cha cha cha..." from Android users in New York. All of the queries came from voice search, so they listened to a few of the recordings. It turned out that their speech-to-text models were interpreting the sound of the NYC metro pulling into a station as speech.

Obviously they didn't have enough training data of people trying to talk next to a train.