Text to speech is easy, right? You just take a bunch of white noise and squirt it out of a speaker with a bit of envelope shaping. A short, sharp burst is a 'tuh'. Or maybe a 'kuh'. Or a 'puh' or 'duh' or 'buh'. Something gentler with a bit of sustain is a 'luh' or 'muh' sound. I had a speech synth on a CP/M computer that did this. You might understand what was being said, if you already knew what was being said.
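Roughly what that trick looks like in code, as a minimal Python/NumPy sketch. This is nothing from a real CP/M synth: the sample rate, envelope times, and durations are made-up numbers, chosen only to show the burst-versus-sustain contrast.

    import numpy as np

    RATE = 8000  # samples per second; an arbitrary, period-appropriate choice

    def shaped_noise(duration_s, attack_s, decay_s):
        """White noise under a linear attack and exponential decay envelope."""
        n = int(duration_s * RATE)
        noise = np.random.uniform(-1.0, 1.0, n)
        t = np.arange(n) / RATE
        attack = np.clip(t / attack_s, 0.0, 1.0)  # quick ramp up
        decay = np.exp(-t / decay_s)              # trail away
        return noise * attack * decay

    # Short, sharp burst: plosive-ish ('tuh', 'kuh', 'puh').
    plosive = shaped_noise(duration_s=0.05, attack_s=0.002, decay_s=0.010)

    # Gentler onset with some sustain: closer to 'luh' or 'muh'.
    soft = shaped_noise(duration_s=0.25, attack_s=0.050, decay_s=0.200)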

People built up lists of phonemes and improved on those.

Then people experimented with different waveforms.

Here's a collection of different voices (poor sound quality, unfortunately): http://www.youtube.com/watch?v=aFQOYBNAMHg

Why did it take everyone so long to make the jump to diphones, to smoothing out the joins between individual phonemes?
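To make "smoothing the joins" concrete: instead of butting two recorded units together (and getting a click at the seam), you overlap their edges and crossfade. A hand-wavy NumPy sketch, with arrays standing in for recorded units and an arbitrary overlap length; real diphone systems also cut each unit mid-phoneme, where the waveform is stable, so the join lands in the easy part.

    import numpy as np

    def crossfade_join(a, b, overlap):
        """Concatenate units a and b, linearly crossfading over `overlap` samples."""
        fade_out = np.linspace(1.0, 0.0, overlap)
        fade_in = 1.0 - fade_out
        seam = a[-overlap:] * fade_out + b[:overlap] * fade_in
        return np.concatenate([a[:-overlap], seam, b[overlap:]])

    # Hard concatenation for contrast: any level mismatch at the
    # boundary becomes an audible click.
    def hard_join(a, b):
        return np.concatenate([a, b])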

Then you had the Japanese with their '5th generation' research, physically modelling the human mouth, tongue, and larynx and blowing air through the model. (You don't hear much about the Japanese 5th generation stuff nowadays. I'd be interested to know if there's a list anywhere of what came out of that research.)
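That line of work survives today as articulatory synthesis and digital waveguide models. For flavour, here's a toy tube model in the style of the classic Kelly-Lochbaum approach (not anything from the Japanese project): treat the tract as a chain of cylinder sections, scatter travelling pressure waves wherever the cross-section changes, and push a glottal buzz in one end. Every number here (area function, boundary reflections, pitch) is invented for illustration.

    import numpy as np

    def tube_model(areas, excitation, r_glottis=0.9, r_lips=-0.9):
        """Scatter travelling waves through tube sections with the given areas."""
        n = len(areas)
        # Reflection coefficient at each junction between adjacent sections.
        k = [(areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1])
             for i in range(n - 1)]
        fwd = np.zeros(n)   # right-going wave in each section
        bwd = np.zeros(n)   # left-going wave in each section
        out = np.zeros(len(excitation))
        for s, x in enumerate(excitation):
            nf, nb = np.zeros(n), np.zeros(n)
            nf[0] = x + r_glottis * bwd[0]           # mostly-closed glottis end
            for i in range(n - 1):                   # scattering junctions
                nf[i + 1] = (1 + k[i]) * fwd[i] - k[i] * bwd[i + 1]
                nb[i] = k[i] * fwd[i] + (1 - k[i]) * bwd[i + 1]
            out[s] = (1 + r_lips) * fwd[-1]          # radiated at the open lips
            nb[-1] = r_lips * fwd[-1]
            fwd, bwd = nf, nb
        return out

    # A crude vowel-ish area function (glottis to lips) driven by a pulse train.
    areas = [0.6, 0.4, 0.5, 1.2, 2.0, 2.6, 2.0, 1.4]
    buzz = np.zeros(8000)
    buzz[::100] = 1.0        # ~80 Hz pitch at an 8 kHz sample rate
    vowel = tube_model(areas, buzz)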

Saying "talking computers" is easy; doing it is tricky.

EDIT: (http://www.japan-101.com/business/fifth_generation_computer....)

> By any measure the project was an abject failure. At the end of the ten year period they had burned through over 50 billion yen and the program was terminated without having met its goals. The workstations had no appeal in a market where single-CPU systems could outrun them, the software systems never worked, and the entire concept was then made obsolete by the internet.
