I can't square this with the speed. A couple of layers doing STT would technically still be part of the neural network, no? Because the larger vocabulary needed to cover multimodal tokenization would make even text inference slower than 4-turbo, not twice as fast.
OpenAI gives so little information about the details of their models now that one can only speculate about how they've managed to cut inference costs.
STT throws away a lot of information that is clearly being preserved in these demos, so that's definitely not what's happening here, at least not in that sense. That said, the tokens would presumably be merged into a shared embedding space. Hard to say exactly how they're approaching it.
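Purely to make that concrete (a sketch of one plausible approach, not anything OpenAI has confirmed): discrete audio codes from a neural codec could simply be appended to the text vocabulary, so that a single embedding table serves both modalities. All the names and sizes below are made up.

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- OpenAI has published none of these numbers.
TEXT_VOCAB = 100_000    # ordinary BPE text tokens
AUDIO_VOCAB = 4_096     # discrete codes from a neural audio codec
D_MODEL = 4_096

# One embedding table covers both modalities; audio codes are simply
# offset past the text vocabulary so every token id is unique.
embedding = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D_MODEL)

def embed_mixed(text_ids: torch.Tensor, audio_codes: torch.Tensor) -> torch.Tensor:
    """Embed an interleaved sequence of text tokens and audio codec codes."""
    audio_ids = audio_codes + TEXT_VOCAB  # shift into the shared id space
    mixed = torch.cat([text_ids, audio_ids], dim=-1)
    return embedding(mixed)  # (seq_len, D_MODEL), same space for both

# Toy usage: 5 text tokens followed by 8 audio codes.
out = embed_mixed(torch.randint(0, TEXT_VOCAB, (5,)),
                  torch.randint(0, AUDIO_VOCAB, (8,)))
print(out.shape)  # torch.Size([13, 4096])
```

The appeal of a scheme like this is that everything downstream of the embedding lookup is modality-agnostic; the transformer itself needs no separate pathway per modality.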
I'd mentally change the acronym to Speech to Tokens. Parsing emotion and other non-explicit indicators in speech has been an active area of research for years now. Metadata for speaker identity, inflection, etc. could easily be added, and current LLMs already handle it just fine (a rough sketch of such annotation follows the quote below). For instance, asking Claude, with zero context, to parse the meaning of "*laughter* Yeah, I'm sure that's right." instantly yields:
----
The phrase "*laughter* Yeah, I'm sure that's right" appears to be expressing sarcasm or skepticism about whatever was previously said or suggested. Here's a breakdown of its likely meaning:
"*laughter*" - This typically indicates the speaker is laughing, which can signal amusement, but in this context suggests they find whatever was said humorous in an ironic or disbelieving way.
"Yeah," - This interjection sets up the sarcastic tone. It can mean "yes" literally, but here seems to be used facetiously.
"I'm sure that's right." - This statement directly contradicts and casts doubt on whatever was previously stated. The sarcastic laughter coupled with "I'm sure that's right" implies the speaker believes the opposite of what was said is actually true.
So in summary, by laughing and then sarcastically saying "Yeah, I'm sure that's right," the speaker is expressing skepticism, disbelief or finding humor in whatever claim or suggestion was previously made. It's a sarcastic way of implying "I highly doubt that's accurate or true."
----
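For what it's worth, here's a toy sketch of the kind of annotation pipeline I mean. The `Segment` fields and the tag format are my own invention, standing in for the outputs of whatever diarization and emotion classifiers one might run upstream:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One transcribed utterance plus classifier outputs (all hypothetical)."""
    speaker: str
    text: str
    emotion: str       # e.g. from a speech-emotion-recognition model
    event: str | None  # non-speech events: laughter, sigh, etc.

def annotate(segments: list[Segment]) -> str:
    """Serialize segments into an LLM-friendly annotated transcript."""
    lines = []
    for seg in segments:
        event = f"*{seg.event}* " if seg.event else ""
        lines.append(f"[{seg.speaker} | tone: {seg.emotion}] {event}{seg.text}")
    return "\n".join(lines)

transcript = annotate([
    Segment("Alice", "Yeah, I'm sure that's right.", "sarcastic", "laughter"),
])
print(transcript)
# [Alice | tone: sarcastic] *laughter* Yeah, I'm sure that's right.
```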
It could be added, but it still wouldn't sound as good as what we have here. Audio is audio and text is text, and no amount of metadata we can practically provide will replace the information present in the sound itself.
You can't exactly metadata your way out of this (skip to 11:50).
I'm not sure why you say that. To me it looks like obviously just swapping/weighting between a set of predefined voices. If you've ever played a game with a face generator, it's the exact same thing, except with audio (a toy sketch of what I mean follows below). I'd also observe that in the demo they explicitly avoided anything particularly creative, instead sticking to an extremely narrow domain of very basic adjectives: neutral, dramatic, singing, robotic, etc. I'm sure it also has happy, sad, angry, mad, and so on available.
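A toy version of the idea, in the spirit of a face generator's sliders. The embeddings and blending scheme here are invented for illustration, not a claim about how OpenAI's TTS actually works:

```python
import numpy as np

# Toy model of "swapping/weighting between predefined voices": each named
# voice is a speaker embedding, and a requested style is just a convex
# combination of them. Entirely illustrative.
rng = np.random.default_rng(0)
VOICES = {name: rng.standard_normal(256) for name in
          ("neutral", "dramatic", "singing", "robotic")}

def blend(weights: dict[str, float]) -> np.ndarray:
    """Return a speaker embedding interpolated from the predefined voices."""
    total = sum(weights.values())
    return sum(w / total * VOICES[name] for name, w in weights.items())

# "Mostly dramatic, a touch robotic" -- the kind of narrow control the
# demo showed, analogous to sliders in a face generator.
style = blend({"dramatic": 0.8, "robotic": 0.2})
print(style.shape)  # (256,) -- would condition a TTS decoder
```

Everything the demo showed fits inside that kind of low-dimensional control space; nothing required synthesizing a genuinely novel voice from a free-form description.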
But if the system can create a flamboyantly homosexual Captain Picard with a lisp and a slight stutter engaging in overt innuendo when saying "Number One, engage!", then I look forward to eating crow! As the instructions were all conspicuously just "swap to pretrained voice [x,y,z]", though, I suspect crow will not be on the menu any time soon.
I'm sorry, but you don't know what you're talking about, and I'm done here. Clearly you've never worked with or tried to train STT or TTS models in any real capacity, so inventing dramatic capabilities while disregarding latency and data requirements must come easily to you.
OpenAI has explicitly made this clear. You are wrong. There's nothing else left to say here.
But I’m not an expert!