GPT-4o: The next game-changing update from OpenAI
OpenAI just unveiled their new realtime GPT model, and this is one of those tech demos you have to see for yourself to really appreciate. But I still want to recount what happened in the demo and unpack my thoughts about why this is a huge deal for AI, and a big step towards the technology in the movie Her.
OpenAI, the makers of ChatGPT, today unveiled their latest model, GPT-4o (“o” for “omni”), which brings the paid functionality previously reserved for ChatGPT Plus subscribers to all free users. This means that voice, vision and data analysis are now freely available for anyone to use at chatgpt.com. This is in keeping with OpenAI’s mission to have AI serve humanity at large by making it freely accessible to as many people as possible.
In their own words:
GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.
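On the developer side, GPT-4o is served through the same chat completions endpoint as GPT-4 Turbo, so trying it is mostly a matter of swapping the model name. Here is a minimal sketch using the official OpenAI Python SDK (v1.x); the prompt text is just an example, and it assumes an API key is set in your environment:

```python
# Minimal sketch of calling GPT-4o through the official OpenAI Python SDK (v1.x).
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # the new omni model; same chat interface as gpt-4-turbo
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "In one sentence, what does the 'o' in GPT-4o stand for?"},
    ],
)
print(response.choices[0].message.content)
```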
The demo was all about the realtime capabilities of the model. At first it seems like just the same thing as before, only faster: responses arriving within half a second instead of 2-3 seconds. Which is good and all, but on its own that's hard to appreciate. Really, this opens up huge new areas of what you can do with ChatGPT.
First, the model really does respond much faster, within a fraction of a second. It also seemed, intuitively, like it begins answering before it has fully worked out what it's going to say, and keeps processing what you said even after it starts responding.
The first part of the demo that made me make a series of involuntary vocalizations was when the presenter asked it for feedback on his “deep breathing” (read: hyperventilation), and the model said “woah, you need to slow down!” So it can hear stuff like that. It can understand emotional expression “across the board,” is what they said. And it can talk in a wide range of emotive styles.
They asked it for a bedtime story and described how they wanted it to tell the story. The presenter talked to it like a director, saying “more expressive, maximum drama!” And it was indeed very expressive. Not “Hollywood is dead” levels of acting, but a good demonstration of the voice model. And just for good measure they also had it do a robot voice, which it nailed.
Realtime Vision
The 4o model is also capable of the same kind of realtime reasoning with vision. To summarize my take on this part of the demo: it basically upgrades GPT’s vision capability from photos to video. It can see in real time.
They demonstrated this by asking it for help with a basic linear equation. The model commented on what it saw the presenter doing before he even asked a follow-up question. If I were running this demo I wouldn’t have interrupted the model at that moment, because to me, the ability to respond to a change in what it sees without a follow-up prompt is a big deal. It’s literally what Google faked in a video half a year ago. OpenAI actually has it now, or at least the demo would suggest they do.
After the livestream they updated their website with a bunch of other demo videos of example use cases, including one of my dream use cases: a realtime seeing AI that gives a running commentary on what it sees and that I can ask questions about things that happened in front of me.
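At launch the public API appears to take text and image inputs (the realtime audio/video modes aren't exposed there yet), but you can fake a crude version of that seeing-AI idea today by sampling webcam frames and sending them to GPT-4o as images. A rough sketch, not OpenAI's demo code, assuming the openai and opencv-python packages are installed and OPENAI_API_KEY is set:

```python
# Crude "running commentary" loop: grab a webcam frame every few seconds,
# send it to gpt-4o as an image, and print what it says about the scene.
import base64
import time

import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
cap = cv2.VideoCapture(0)  # default webcam

try:
    while True:
        grabbed, frame = cap.read()
        if not grabbed:
            break

        # Encode the frame as JPEG, then as a base64 data URL, which the
        # chat completions API accepts as an image input.
        encoded, jpeg = cv2.imencode(".jpg", frame)
        if not encoded:
            continue
        data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg.tobytes()).decode()

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "In one sentence, describe what is happening in this frame."},
                        {"type": "image_url", "image_url": {"url": data_url}},
                    ],
                }
            ],
        )
        print(response.choices[0].message.content)

        time.sleep(3)  # crude pacing; the real demo streams audio/video natively
finally:
    cap.release()
```

It's nowhere near the latency or fluidity of what was shown on stage, but it gives a feel for the photo-level vision that is already available.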
Rollout
The new model itself is available starting today for some users and is being rolled out gradually, so that OpenAI doesn’t get immediately swamped with support requests if something doesn’t go as planned. The new realtime voice mode, though, will become available in alpha for ChatGPT Plus subscribers “in the coming weeks.”
Without the realtime voice mode, the new model offers faster text generation that so far appears to be on par with, if not smarter than, the version it replaces.
If you want to know more, check out their official blog post detailing all the capabilities of the new model. There are a lot of videos on that page and they are worth a look.
https://openai.com/index/hello-gpt-4o/
Significance
To me this feels just as significant an update as ChatGPT itself. If they had Apple’s marketing team they would’ve called this update something like “the biggest upgrade to ChatGPT since ChatGPT.” Like, it’s the same amount of advancement in human-computer interaction that we saw when ChatGPT first became available. You were always able to use a computer to do what ChatGPT does; ChatGPT just made it easier, more accessible, faster, etc. This is another big step in that direction.
The way we interact with computers is going to keep trending in this direction. Someday before too long, you will interact with your computer much the way you interact with people: by simply talking to it. OpenAI has brought us a significant step closer to personal AI companions whose knowledge can enhance people’s lives in very real and meaningful ways.
It may be a good time to rewatch Spike Jonze’s “her.”