
I think the live demo that happened on the livestream is best to get a feel for this model[0].

I don't really care whether it's stronger than gpt-4-turbo or not. The direct real-time video and audio capabilities are absolutely magical and stunning. The responses in voice mode are now instantaneous, you can interrupt the model, you can talk to it while showing it a video, and it understands (and uses) intonation and emotion.

Really, just watch the live demo. I linked directly to where it starts.

Importantly, this makes the interaction a lot more "human-like".

[0]: https://youtu.be/DQacCB9tDaw?t=557




The demo is impressive but personally, as a commercial user, for my practical use cases, the only thing I care about is how smart it is, how accurate its answers are, and how vast its knowledge is. These haven't changed much since GPT-4, yet they should; IMHO it is still borderline in its ability to be really that useful.


But that's not the point of this update


I know, and I know my comment is dismissive of the incredible work shown here, as we're shown sci-fi-level tech. But I feel like I have this kettle that boils water in 10 minutes, and it really should boil it in 1, but instead it's now voice operated.

I hope the next version delivers on being smarter. This update, instead of making me excited, makes me feel they've reached a plateau on improving the core value and are distracting us with fluff instead.


Everything is amazing & Nobody is happy: https://www.youtube.com/watch?v=PdFB7q89_3U


GPT-4 isn't quite "amazing" in terms of commercial use. GPT-4 is often good, and also often mediocre or bad. It's not going to change the world; it needs to get better.


Near real-time voice feedback isn't amazing? Has the bar risen this high?

I already know an application for this, and AFAIK it's being explored in the SaaS space: guided learning experiences and tutoring for individuals.

My kids, for instance, love to hammer Alexa with random questions. They would spend a huge amount of time using a better interface, esp. with quick feedback, that provided even deeper insight and responses to them.

Taking this and tuning it to specific audiences would make it a great tool for learning.


"My kids, for instance, love to hammer Alexa with random questions. They would spend a huge amount of time using a better interface, esp. with quick feedback, that provided even deeper insight and responses to them."

Great, using GPT-4 the kids will be getting a lot of hallucinated facts returned to them. There are good use cases for transformers currently, but they're not at the "impact company earnings or country GDP" stage, which is the promise that the whole industry has raised/spent 100+B dollars on. Facebook alone is spending 40B on AI. I believe in the AI future, but the only thing that matters for now is that the models improve.


I always double-check even the most obscure facts returned by GPT-4 and have yet to see a hallucination (as opposed to Claude Opus, which sometimes made up historical facts). I doubt stuff interesting to kids would be so far out of the data distribution as to return a fake answer.

Compared to YouTube and Google SEO trash, or Google Home / Alexa (which do search + wiki retrieval), at the moment GPT-4 and Claude are unironically safer for kids: no algorithmic manipulation, no ads, no affiliated trash blogs, and so on. A bonus is that they can explain things at the level of complexity a child will understand for their age.



My kids get erroneous responses from Alexa. This happens all the time. The built-in web search doesn't provide correct answers, or is confusing outright. That's when they come to me or their Mom and we provide a better answer.

I still see this as a cool application. Anything that provides easier access to knowledge and improved learning is a boon.

I'd rather worry about the potential economic impact than worry about possible hallucinations from fun questions like "how big is the sun?" or "what is the best videogame in the world?", etc.

There's a ton you can do here, IMO.

Take a look at mathacademy.com, for instance. Now slap a voice interface on it, provide an ability for kids/participants to ask questions back and forth, etc. Boom: you've got a math tutor that guides you based on your current ability.

What if we could get to the same style of learning for languages? For instance, I'd love to work on Spanish. It'd be far more accessible if I could launch a web browser and chat through my mic in short spurts, rather than crack open Anki and go through flash cards, or wait on a Discord server for others to participate in immersive conversation.

Tons of cool applications here, all learning-focused.


People should be more worried about how much this will be exploited by scammers. This thing is miles ahead of the crap fraudsters use to scam MeeMaw out of her life savings.


It's an impressive demo, it's not (yet) an impressive product.

It seems like the people who are oohing and aahing at the former and the people who are frustrated that this kind of thing is unbelievably impractical to productize will be doomed to talk past one another forever. The text generation models, image generation models, speech-to-text and text-to-speech have reached impressive product stages. Multi-modal hasn't got there because no one is really sure what to actually do with the thing outside of making cool demos.


Multi-modal isn't there because "this is an image of a green plant" is viable in a demo, but it's not commercially viable. "This is an image of a Monstera deliciosa" is commercially viable, but not yet demoable. The models need to improve to be usable.


Sure, but "not enough, I want moar" is a trivial demand. So trivial that it goes unsaid.


It's equivalent to "nothing to see here" which is exactly the TLDR I was looking for.


Watch the last few minutes of that linked video: Mira strongly hints that there's another update coming for paid users and seems to make clear that GPT-4o is more so for free-tier users (even though it is obviously a huge improvement in many features for everyone).


There is room for more than one use case and large language model type.

I predict there will be a zoo (more precisely tree, as in "family tree") of models and derived models for particular application purposes, and there will be continued development of enhanced "universal"/foundational models as well. Some will focus on minimizing memory, others on minimizing pre-training or fine-tuning energy consumption, some need high accuracy, others hard real-time speed, yet others multimodality like GPT-4o, some multilinguality, and so on.

Previous language models that encoded dictionaries for spellcheckers etc. never got standardized (for instance, compare aspell dictionaries to the ones from LibreOffice to the language model inside CMU PocketSphinx) so that you could use them across applications or operating systems. As these models are becoming more common, it would be interesting to see this aspect improve this time around.

https://www.rev.com/blog/resources/the-5-best-open-source-sp...


I disagree; transfer learning and generalization are hugely powerful, and specialized models won't be as good because their limited scope limits their ability to generalize and transfer knowledge from one domain to another.

I think people who emphasize specialized models are operating under the false assumption that by focusing the model, it'll be able to go deeper in that domain. However, the opposite seems to be true.

Granted, specialized models like AlphaFold are superior in their domain but I think that'll be less true as models become more capable at general learning.


They say it's twice as fast/cheap, which might matter for your use case.


It's twice as fast/cheap relative to GPT-4-turbo, which is still expensive compared to GPT-3.5-turbo and Claude Haiku.

https://openai.com/api/pricing/
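To put rough numbers on it, a back-of-the-envelope comparison in Python; the per-million-token prices below are approximate launch-era list prices as I recall them and may well have changed, so treat them as placeholders and check the pricing page above:

  # Ballpark monthly cost for an example workload: 2M input + 0.5M output tokens.
  # Prices are approximate figures in $ per 1M tokens, not authoritative.
  prices = {
      "gpt-4-turbo":    (10.00, 30.00),
      "gpt-4o":         ( 5.00, 15.00),
      "gpt-3.5-turbo":  ( 0.50,  1.50),
      "claude-3-haiku": ( 0.25,  1.25),
  }
  for model, (inp, out) in prices.items():
      print(f"{model}: ${2.0 * inp + 0.5 * out:.2f}")
  # gpt-4o comes out at half of gpt-4-turbo, but still well above the small models.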


For commercial use at scale, of course cost matters.

For the average Joe programmer like me, GPT4 is already "dirt cheap". My typical monthly bill is $0-3 using it as much as I like.

The one time it was high was when I had it take 90+ hours of YouTube video transcripts, and had it summarize each video according to the format I wanted. It produced about 250 pages of output.

That month I paid $12-13. Well worth it, given the quality of the output. And now it'll be less than $7.

For the average Joe, it's not expensive. Fast food is.


but better afaik


But may not be better enough to warrant the cost difference. LLM cost economics are complicated.


I’d much rather have it be slower, more expensive, but smarter


Depends what you want it for. I'm still holding out for a decent enough open model, Llama 3 is tantalisingly close, but inference speed and cost are serious bottlenecks for any corpus-based use case.


I think that might come with the next GPT version.

OpenAI seems to build in cycles. First they focus on capabilities, then they work on driving the price down (occasionally at some quality degradation)


Then the current offering should suffice, right?


I understand your point, and agree that it is "borderline" in its abilities — though I would instead phrase it as "it feels like a junior developer or an industrial placement student, and assume it is of a similar level in all other skills", as this makes it clearer when it is or isn't a good choice, and it also manages expectations away from both extremes I frequently encounter (that it's either Cmdr Data already, or that it's a no-good terrible thing only promoted by the people who were previously selling Bitcoin as a solution to all the economics).

That said, given the price tag, when AI becomes genuinely expert then I'm probably not going to have a job and neither will anyone else (modulo how much electrical power those humanoid robots need, as the global electricity supply is currently only 250 W/capita).

In the meantime, making it a properly real-time conversational partner… wow. Also, that's kinda what you need for real-time translation, because: «be this, that different languages the word order totally alter and important words at entirely different places in the sentence put», and real-time "translation" (even when done by a human) therefore requires having a good idea what the speaker was going to say before they get there, and being able to back-track when (as is inevitable) the anticipated topic was actually something completely different and so the "translation" wasn't.


I guess I feel like I’ll get to keep my job a while longer and this is strangely disappointing…

A real time translator would be a killer app indeed, and it seems not so far away, but note how you have to prompt the interaction with ‘Hey ChatGPT’; it does not interject on its own. It is also unclear if it is able to understand if multiple people are speaking and who’s who. I guess we’ll see soon enough :)


> It is also unclear if it is able to understand if multiple people are speaking and who’s who. I guess we’ll see soon enough :)

Indeed; I would be pleasantly surprised if it can both notice and separate multiple speakers, but only a bit surprised.


One thing I've noticed is that the more context I give it, and the more precise that context is, the "smarter" it is. There are limits to it, of course. But I cannot help but think that's where the next barrier will be brought down: an agent (or several) that tags along with everything I do throughout the day and so has the full context. That way I'll get smarter, more to-the-point help, and won't spend as much time explaining the context... but that will open a dark can that I'm not sure people will want to open: having an AI track everything you do, all the time (even if only in certain contexts, like business hours / work environments).


There are definitely multiple dimensions these things are getting better in. The popular focus has been on the big expensive training runs, but inference, context size, algorithms, etc. are all getting better fast.


I have a few LLM benchmarks that were extracted from real products.

GPT-4o got slightly better overall. Ability to reason improved more than the rest.


It's faster, smarter and cheaper over the API. Better than a kick in the teeth.


Absolutely agree.

This model isn't about benchmark chasing or being a better code generator; it's explicitly focused on pushing prior results into the frame of multi-modal interaction.

It's still a WIP; most of the videos show awkwardness where its capacity to understand the "flow" of human speech is still vestigial. It doesn't yet understand how humans pause and give one another space for such pauses.

But it does indeed have some magic ability to share a deictic frame of reference.

I have been waiting for this specific advance, because it is going to significantly quiet the "stochastic parrot" line of wilfully-myopic criticism.

It is very hard to make blustery claims about "glorified Markov token generation" when using language in a way that requires both a shared world model and an understanding of interlocutor intent, focus, etc.

This is edging closer to the moment when it becomes very hard to argue that system does not have some form of self-model and a world model within which self, other, and other objects and environments exist with inferred and explicit relationships.

This is just the beginning. It will be very interesting to see how strong its current abilities are in this domain; it's one thing to have object classification, another thing entirely to infer "scripts, plans, goals..." and things like intent and deixis. E.g. how well does it now understand "us" and "them", and "this" vs "that"?

Exciting times. Scary times. Yee hawwwww.


What part of this makes you think GPT-4 suddenly developed a world model? I find this comment ridiculous and bizarre. Do you seriously think snappy response time + fake emotions is an indicator of intelligence? It seems like you are just getting excited and throwing out a bunch of words without even pretending to explain yourself:

> using language in a way that requires both a shared world model

Where? What example of GPT-4o requires a shared world model? The customer support example?

The reason GPT-4 does not have any meaningful world model (in the sense that rats have meaningful world models) is that it freely believes contradictory facts without being confused, freely confabulates without having brain damage, and it has no real understanding of quantity or causality. Nothing in GPT-4o fixes that, and gpt2-chatbot certainly had the same problems with hallucinations and failing the same pigeon-level math problems that all other GPTs fail.


One of the most interesting things about the advent of LLMs is people bringing out all sorts of "reasons" GPT doesn't have true 'insert property', but all those reasons freely occur in humans as well.

>that it freely believes contradictory facts without being confused,

Humans do this. You do this. I guess you don't have a meaningful world model.

>freely confabulates without having brain damage

Humans do this

>and it has no real understanding of quantity or causality.

Well this one is just wrong.


So many even here on HN have a near-religious belief that intelligence is unique to humans and animals, and somehow a fundamental phenomenon that cannot ever be created using other materials.


It reminds me of the geocentric mindset.


It's a defensive response to an emerging threat to stability and current social tiers.


>>and it has no real understanding of quantity or causality.

>Well this one is just wrong.

Is it?

--

Me: how many characters are in: https://google.com

ChatGPT: The URL "https://google.com" has 12 characters, including the letters, dots, and slashes.

--

What is it counting there? 12 is wrong no matter how you dice that up.

Part of the reason is it has no concept of the actual string. That URL breaks into four different tokens in 3.5 and 4: "http", "://", "google" and ".com".

It's not able to figure out the total length, or even the lengths of its parts, and add them together.
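You can see the split yourself with OpenAI's tiktoken library; a rough sketch below (the exact token boundaries depend on the encoding, so the output is illustrative rather than guaranteed):

  import tiktoken

  enc = tiktoken.encoding_for_model("gpt-4")      # cl100k_base for GPT-3.5/4
  token_ids = enc.encode("https://google.com")
  pieces = [enc.decode([t]) for t in token_ids]

  print(pieces)                     # something like ['https', '://', 'google', '.com']
  print(len("https://google.com"))  # 18 -- the character count the model never "sees"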

I ask it to double-check; it tells me 13 and then 14. I tell it the answer and suddenly it's able...

---

Me: I think its 18

ChatGPT: Let's recount together:

"https://" has 8 characters. "google" has 6 characters. ".com" has 4 characters. Adding these up gives a total of 8 + 6 + 4 = 18 characters. You're correct! My apologies for the oversight earlier.

---

Count me out.


LLMs process text, but only after it was converted to a stream of tokens. As a result, LLMs are not very good at answering questions about letters in the text. That information was lost during the tokenization.

Humans process photons, but only after converting them into nerve impulses via photoreceptor cells in the retina, which are sensitive to wavelengths ranges described as "red", "green" or "blue".

As a result, humans are not very good at distinguishing different spectra that happen to result in the same nerve impulses. That information was lost by the conversion from photons to nerve impulses. Sensors like the AS7341 that have more than 3 color channels are much better at this task.


Yet I can learn there is a distinction between different spectra that happen to result in the same nerve impulses. I know if I have a certain impulse, that I can't rely on it being a certain photon. I know to use tools, like the AS7341, to augment my answer. I know to answer "I don't know" to those types of questions.

I am a strong proponent of LLMs, but I just don't agree with the personification and trust we put into their responses.

Everyone in this thread is defending that ChatGPT can't count for _reasons_ and how it's okay, but... how can we trust this? Is this the sane world we live in?

"The AGI can't count letters in a sentence, but any day not he singularity will happen, the AI will escape and take over the world."

I do like to use it for opinion-related questions. I have a specific taste in movies and TV shows, and by just listing what I like and going back and forth about my reasons for liking or not liking its suggestions, I've been able to find a lot of gems I would have never heard of before.


> That URL breaks into four different tokens in 3.5 and 4: "http", "://", "google" and ".com".

Except that "http" should be "https". Silly humans, claiming to be intelligent when they can't even tokenize strings correctly.


A wee typo.


How much of your own sense of quantity is visual, do you think? How much of your ability to count the lengths of words depends on your ability to sound them out and spell?

I suspect we might find that adding in the multimodal visual and audio aspects to the model gives these models a much better basis for mental arithmetic and counting.


I'd counter by pasting a picture of an emoji here, as a means to show the confusion that can be caused by characters versus symbols, but HN doesn't allow that.

Most LLMs can just pass the string to a tool to count it, to bypass their built-in limitations.
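Roughly, with the OpenAI tools API that looks like the sketch below; the count_characters tool and its schema are made up for illustration, and the model name is just an example:

  # Hypothetical counting tool exposed to the model via function calling.
  from openai import OpenAI

  client = OpenAI()
  tools = [{
      "type": "function",
      "function": {
          "name": "count_characters",          # made-up tool name
          "description": "Return the number of characters in a string",
          "parameters": {
              "type": "object",
              "properties": {"text": {"type": "string"}},
              "required": ["text"],
          },
      },
  }]

  resp = client.chat.completions.create(
      model="gpt-4o",                          # example model
      messages=[{"role": "user",
                 "content": "How many characters are in https://google.com ?"}],
      tools=tools,
  )

  # If the model chooses to call the tool, run it locally and feed back the exact answer.
  call = resp.choices[0].message.tool_calls[0]
  print(call.function.name, call.function.arguments)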


It seems you're already aware LLMs receive tokens not words.

Does a blind man not understand quantity because you asked him how many apples are in front of him and he failed?


I do, but I think it shows its limitations.

I don't think that test determines his understanding of quantity at all, he has other senses like touch to determine the correct answer. He doesn't make up a number and then give justification.

GPT was presented with everything it needed to answer the question.


Nobody said GPT was perfect. Everything has limitations.

>he has other senses like touch to determine the correct answer

And? In my hypothetical, you're not allowing him to use touch.

>I don't think that test determines his understanding of quantity at all

Obviously

>GPT was presented with everything it needed to answer the question.

No, it was not.


How was it not? It's a text interface. It was given text.

The blind man example now is like asking GPT "What am I pointing at?"


Please try to actually understand what og_kalu is saying instead of being obtuse about something any grade-schooler intuitively grasps.

Imagine a legally blind person, they can barely see anything; just general shapes flowing into one another. In front of them is a table onto which you place a number of objects. The objects are close together and small enough such that they merge into one blurred shape for our test person.

Now when you ask the person how many objects are on the table, they won't be able to tell you! But why would that be? After all, all the information is available to them! The photons emitted from the objects hit the retina of the person, the person has a visual interface and they were given all the visual information they need!

Information lies within differentiation, and if the granularity you require is higher than the granularity of your interface, then it won't matter whether or not the information is technically present; you won't be able to access it.


I think we agree. ChatGPT can't count, as the granularity that requires is higher than the granularity ChatGPT provides.

Also the blind person wouldn't confidently answer. A simple "the objects blur together" would be a good answer. I had ChatGPT telling me 5 different answers back to back above.


No, think about it. The granularity of the interface (the tokenizer) is the problem, the actual model could count just fine.

If the legally blind person never had had good vision or corrective instruments, had never been told that their vision is compromised and had no other avenue (like touch) to disambiguate and learn, then they would tell you the same thing ChatGPT told you. "The objects blur together" implies that there is already an understanding of the objects being separate present.

You can even see this in yourself. If you did not get an education in physics and were asked how many things a steel cube is made up of, you wouldn't answer that you can't tell. You would just say one, because you don't even know that atoms are a thing.


I agree, but I don't think that changes anything, right?

ChatGPT can't count, the problem is the tokenizer.

I do find it funny we're trying to chat with an AI that is "equivalent to a legally blind person with no correction"

> You would just say one, because you don't even know that atoms are a thing.

My point also. I wouldn't start guessing "10" and then "11" and then "12" when asked to double-check, only to capitulate when told the correct answer.


You consistently refuse to take the necessary reasoning steps yourself. If your next reply also requires me to lead you every single millimeter to the conclusion you should have reached on your own, then I won't reply again.

First of all, it obviously changes everything. A shortsighted person just requires prescription glasses; someone who is fundamentally unable to count is incurable from our perspective. LLMs could do all of these things if we either solve tokenization or simply adapt the tokenizer to relevant tasks. This is already being done for program code; it's just that, aside from gotcha arguments, nobody really cares about letter counting that much.

Secondly, the analogy was meant to convey that the intelligence of a system is not at all related to the problems at its interface. No one would say that legally blind people are less insightful or intelligent, they just require you to transform input into representations accounting for their interface problems.

Thirdly, as I thought was obvious, the tokenizer is not a uniform blur. For example, a word like "count" could be tokenized as "c|ount" or " coun|t" (note the space) or ". count" depending on the surrounding context. Each of these versions will have tokens of different lengths, and associated different letter counts. If you've been told that the cube had 10, 11 or 12 trillion constituent parts by various people depending on the random circumstances you've talked to them in, then you would absolutely start guessing through the common answers you've been given.


I do agree I've been obtuse, apologies. I think I was just being too literal or something, as I do agree with you.


Apologies from me as well. I've been unnecessarily aggressive in my comments. Seeing very uninformed but smug takes on AI here over the last year has made me very wary of interactions like this, but you've been very calm in your replies and I should have been so as well.


Its first answer of 12 is correct, there are 12 _unique_ characters in https://google.com.


The unique characters are:

h t p s : / g o l e . c m

There are 13 unique characters.
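Easy to sanity-check with a throwaway bit of Python:

  url = "https://google.com"
  print(len(url))                         # 18 characters in total
  print(len(set(url)), sorted(set(url)))  # 13 unique characters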


OK neither GPT-4o nor myself is great at counting apparently


I agree. The interesting lesson I take from the seemingly strong capabilities of LLMs is not how smart they are but how dumb we are. I don't think LLMs are anywhere near as smart as humans yet, but it feels each new advance is bringing the finish line closer rather than the other way round.


Moravec's paradox states that, for AI, the hard stuff is easiest and the easy stuff is hardest. But there's no easy or hard; there's only what the network was trained to do.

The stuff that comes easy to us, like navigating 3D space, was trained by billions of years of evolution. The hard stuff, like language and calculus, is new stuff we've only recently become capable of, seemingly by evolutionary accident, and aren't very naturally good at. We need rigorous academic training at it that's rarely very successful (there's only so many people with the random brain creases to be a von Neumann or Einstein), so we're impressed by it.


If someone found a way to put an actual human brain into SW, but no one knew it was a real human brain -- I'm certain most of HN would claim it wasn't AGI. "Kind of sucks at math", "Knows weird facts about TikTok celebrities, but nothing about world events", "Makes lots of grammar mistakes", "scores poorly on most standardized tests, except for one area that he seems to do well in", and "not very creative".


What is a human brain without the rest of its body? Humans aren't brains. Our nervous systems aren't just the brain either.


It's meant to explore a point. Unless your point is that AGI can only exist with a human body too.


It's an open question as to whether AGI needs a (robot) body. It's also a big question whether the human brain can function in a meaningful capacity kept alive without a body.


I don't think making the same mistakes as a human counts as a feature. I see that a lot when people point out a flaw with an LLM; the response is always "well, a human would make the same mistake!" That's not much of an excuse; computers exist because they do the things humans can't do very well, like following long repetitive lists of instructions. Further, upthread, there's discussion about adding emotions to an LLM. An emotional computer that makes mistakes sometimes is pretty worthless as a "computer".


It's not about counting as a feature. It's the blatant logical fallacy. If a trait isn't a reason humans don't have a certain property then it's not a reason for machines either. Can't eat your cake and have it.

>That's not much of an excuse, computers exist because they do the things humans can't do very well like following long repetitive lists of instructions.

Computers exist because they are useful, nothing more and nothing less. If they were useful in a completely different way, they would still exist and be used.


It's objectively true that LLMs do not have bodies. To the extent general intelligence relies on being embodied (allowing you to manipulate the world and learn from that), it's a legitimate thing to point out.


>But it has some indeed magic ability to share a deictic frame of reference.

They really Put That There!

https://www.youtube.com/watch?v=RyBEUyEtxQo

Oh, shit.


In my view, this was in response to the machine being colourblind haha


I expect the really solid use case here will be voice interfaces to applications that don't suck. Something I am still surprised at is that vendors like Apple have yet to allow me to train the voice-to-text model so that it only responds to me and not someone else.

So local modelling (completely offline but per-speaker aware and responsive), with a really flexible application API. Sort of the GTK or Qt equivalent for voice interactions. Also custom naming, so instead of "Hey Siri" or "Hey Google" I could say, "Hey idiot" :-)

Definitely some interesting tech here.


I assume (because they don't address it or look at all fazed) the audio cutting in and out is just an artefact of the stream?


Haven’t tried it but from work I’ve done on voice interaction this happens a lot when you have a big audience making noise. The interruption feature will likely have difficulty in noisy environments.


Yeah, that was actually my first thought (though no professional experience with it/on that side) - it's just that the commenter I replied to was so hyped about it and how fluid & natural it was, and I thought that made it really jar.


Interesting that they decided to keep the horrible ChatGPT tone ("wow you're doing a live demo right now?!"). It comes across just so much worse in voice. I don't need my "AI" speaking to me like I'm a toddler.


It is cringe-level overenthusiastic, but proper instructions / a system prompt will mostly fix that.


You can tell it not to talk like this using custom prompts.


One of the linked demos is it being sarcastic, so maybe you can make it remember to be a little more edgy.


tell it to speak to you differently

with a GPT you can modify the system prompt
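Over the API that's just the system message; a minimal sketch below (the model name and the instruction wording are only examples):

  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o",   # example model
      messages=[
          {"role": "system",
           "content": "Answer plainly and concisely. No exclamation marks, "
                      "no pep talk, no compliments about my questions."},
          {"role": "user", "content": "Walk me through setting this up."},
      ],
  )
  print(resp.choices[0].message.content)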


It still refuses to go outside the deeply sanitised tone that "alignment" enforces on you.


it should be possible to imitate any voice you want, like your actual parents', soon enough


That won't be Black Mirror levels of creepy /s


Did you miss the part where they simply asked it to change its manner of speaking and the amount of emotion it used?


Call me overly paranoid/skeptical, but I'm not convinced that this isn't a human reading (and embellishing) a script. The "AI" responses in the script may well have actually been generated by their LLM, providing a defense against it being fully fake, but I'm just not buying some of these "AI" voices.

We'll have to see when end users actually get access to the voice features "in the coming weeks".


It's weird that the "airplane mode" seems to be ON on the phone during the entire presentation.


This was on purpose: it appears they connected it to the internet via a USB-C cable, for a consistent connection instead of having it switch to WiFi

Probably some kinks there they are working out


> Probably some kinks there they are working out

Or just a good idea for a live demo on a congested network/environment with a lot of media present, at least one live video stream (the one we're watching the recording of), etc.

At least that's how I understood it, not that they had a problem with it (consistently or under regular conditions, or specific to their app).


That's very common practice for live demos. To avoid situations like this:

https://www.youtube.com/watch?v=6lqfRx61BUg


And eliminate the chance of some prankster affecting the demo by attacking the WiFi.


They mention at the beginning of the video that they are using hardwired internet for reliability reasons.


You would want to make sure that it is always going over WiFi for the demo and doesn't start using the cellular network for a random reason.


You can turn off mobile data. They probably just wanted wired internet.


This is going straight into 'Her' territory


Hectic!

Thanks for this.



