> The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.
> Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling(opens in a new window), which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context.
-
This sounds really interesting, and I see great use cases for it. However, I'm wondering if the API provides a text transcription of both the input and output so that I can store the data directly in a database without needing to transcribe the audio separately.
-
Edit: Apparently it does.
It sends `conversation.item.input_audio_transcription.completed` [0] events when the input transcription is done (I guess a couple of them in real-time)
and `response.done` [1] with the response text.
[0] https://platform.openai.com/docs/api-reference/realtime-serv...
[1] https://platform.openai.com/docs/api-reference/realtime-serv...
qwertox
2 days ago
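For anyone wanting to do the same kind of transcript storage, here is a minimal sketch of that event handling. The URL, model name, header, and payload shapes are assumptions based on the docs linked above; `save_to_db` is a hypothetical storage hook.

```python
# A sketch, not production code: connect to the Realtime API and log
# transcripts as the two events described above arrive.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

def save_to_db(role: str, text: str) -> None:
    print(f"{role}: {text}")  # stand-in for a real database write

async def log_transcripts():
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "conversation.item.input_audio_transcription.completed":
                save_to_db("user", event["transcript"])
            elif event["type"] == "response.done":
                # response text/transcripts are nested inside the output items
                for item in event["response"].get("output", []):
                    for part in item.get("content", []):
                        if "transcript" in part:
                            save_to_db("assistant", part["transcript"])

asyncio.run(log_transcripts())
```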
yes it transcribes inputs automatically, but not in realtime.
outputs are sent in text + audio, but you'll get the text very quickly and the audio a bit slower, and of course the audio takes time to play back. the text also doesn't currently have timing cues, so it's up to you if you want to try to play it "in sync". if the user interrupts the audio, you need to send back a truncation event so it can roll its own context back, and if the text was never presented to the user you'll need to truncate it on your side as well, to ensure your storage isn't polluted with fragments the user never heard.
bcherry
2 days ago
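The truncation flow described above might look like this, as a sketch: the `conversation.item.truncate` event is in the Realtime API reference, while `item_id` and `played_ms` have to come from your own playback bookkeeping.

```python
import json

async def handle_interruption(ws, item_id: str, played_ms: int):
    # Drop the unplayed tail of the assistant's audio (and matching text) so
    # the server's context matches what the user actually heard.
    await ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,
        "audio_end_ms": played_ms,
    }))
```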
It's incredible that people are talking about the downfall of software engineering - now, at many companies, hundreds of call center roles will be replaced by a few engineering roles. With image fine-tuning, now we can replace radiologists with software engineers, etc. etc.
pants2
2 days ago
Replacing call center roles with this is something I can see happening with the realtime api + voice output.
Radiologists, I'm not sure about - I don't think image model fine-tuning + LLMs is all we need to get there.
mrbungie
2 days ago
What's the role of the software engineer besides setting this up?
Your example makes me think it will merely move QA into essentially providing countless cases and then updating them over time to improve the AI's data.
And is it really gonna be cheaper than human support?
What's gonna happen is we'll find out (see how impossible it is to reach a human when interacting with many companies already) that this brings costs down (maybe, eventually), and revenue down too, because pissed-off customers will move elsewhere.
epolanski
2 days ago
More than a majority of a software engineer’s time is spent on bug triage, reproducing bugs, simulating constituents in a test, and debugging fixes.
Doesn’t matter what the computer becomes — AI, AGI or God-incarnate — there’s always a role between that and the end user. That role today is called software engineer. Tomorrow, it'll be called whatever. Perhaps paid the same, or less, or more. Doesn't matter.
There’s always an intermediary to deal with the shit.
Hmm, I wonder if that's the role priests & the clergy have been playing all this while. Except maybe humanity is the shit God (as an end user) has to deal with.
cafed00d
2 days ago
I'd much rather talk to ChatGPT than a human support rep, provided they have the same level of ability (tools) to help you.
pants2
a day ago
People have been trying to replace radiologists for several years now. Maybe they'll get there, but it doesn't seem to be easy.
skybrian
2 days ago
Radiologists will not be replaced. They will just have better tools.
dcl
2 days ago
the _role_ of radiologists isn't going away, but as with software engineers, better tools mean fewer are needed to serve the same patient population. So it's highly likely that there is going to be displacement within that industry as well.
karmajunkie
2 days ago
We really can't; it's a tool, not a radiologist. Medicine is a critical field that can't afford hallucinations and sloppiness.
visarga
2 days ago
A radiologist makes critical life-or-death judgements. An algorithm will not, and should not, replace them.
djhn
2 days ago
A modern insulin pump also uses algorithms to make critical life or death decisions, should we replace these with doctors?
falcor84
2 days ago
I don't believe that is comparable.
1. Modern algorithms started out as a cronjob (which already worked better than the alternative).
2. Advances in applying optimal control theory are well known and (mostly) deterministic and explainable. They are in no way comparable to the black box that is the current state of computer vision.
3. Their failure can be readily observed and compensated for, since the patient will definitely notice. The same cannot be said about imaging.
djhn
2 days ago
Does the insulin pump operate in as general a space as radiology/diagnosis, or is it constrained very precisely?
visarga
2 days ago
saw Velvet's Show HN the other day, could be useful for storing these https://news.ycombinator.com/item?id=41637550
tough
2 days ago
OpenAI just launched the equivalent of Velvet as a full fledged feature today.
But separate from that, you typically want some application-specific storage of the current "conversation" in a very different format than raw request logging.
BoorishBears
2 days ago
I've never seen a company publish consistently groundbreaking features at such speed. I really wonder how their teams work. It's unprecedented in what I've seen in 15 years of software.
siva7
2 days ago
I wonder how much they use their own products internally to speed up development and decisions.
pheeney
2 days ago
They definitely use their own products internally, perhaps to a fault: while chatting with OpenAI recruiters, I received calendar events with nonsensical DALL-E-generated calendar images, and "interview prep" guides that were clearly written by an older GPT model.
abound
2 days ago
And I wonder how much they use them externally to influence the online conversations about their own products/company.
amlib
2 days ago
They have roles on their “Leverage Engineering” team, which appears to be exactly this:
https://openai.com/careers/full-stack-software-engineer-leve...
jonchurch_
2 days ago
AFAIK a lot of these ideas are not new (the JSON thing was done with open-source models before) and OpenAI is possibly the hottest startup with the most funding this decade (maybe even the past two decades?), so I think this is actually all within expectations.
IdiocyInAction
2 days ago
They're exceptional at executing and delivering; you don't get that just through having more funding.
sk11001
2 days ago
How are they exceptional?
Their web UI was a glitchy mess for over a year. Feature rollouts are staggered and often delayed. They still can't adhere to a JSON schema accurately, even though others figured this out ages ago. There are global outages regularly. Etc…
I’m impressed by some aspects of their rapid growth, but these are financial achievements (credit due Sam) more than technical ones.
jiggawatts
2 days ago
I have a few qualms with this app:
1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.
3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?
closewith
2 days ago
Not sure why you are being downvoted. You are generally right. Most of their new product rollouts were accompanied by huge production instabilities for paying customers. Only with the most recent ones did they manage that better.
> They still can’t adhere to a JSON schema accurately
Strict mode for structured output fixes at least this though.
hobofan
2 days ago
It’s literally just a bunch of ex-Stripe employees and data scientists...
testfrequency
2 days ago
> OpenAI is possibly the hottest startup with the most funding this decade (maybe even past two decades?)
It depends on how you define startup but I don't think they will surpass Uber, ByteDance, or SpaceX until this next rumored funding round.
I'm excluding companies that have raised funding post-IPO, since that's an obvious cutoff for startups. The other cutoff being break-even, in which case Uber has raised well over $20 billion.
throwup238
2 days ago
Is it that most models are based on the transformer architecture, so performance improvements can then be used throughout their different products?
roboboffin
2 days ago
GPT 5 is writing their code
nextworddev
2 days ago
> 11:43 Fields are generated in the same order that you defined them in the schema, even though JSON is supposed to ignore key order. This ensures you can implement things like chain-of-thought by adding those keys in the correct order in your schema design.
Why not use an array of key value pairs if you want to maintain ordering without breaking traditional JSON rules?
[ {key1:value1}, {key2:value2} ]
ponty_rick
2 days ago
> even though JSON is supposed to ignore key order
Most tools preserve the order. I consider it an unofficial feature of JSON at this point. A lot of people think of it as a soft guarantee, but it's a hard guarantee in all recent JavaScript and Python versions. There are some common places where it's lost, like JSONB in Postgres, but it's good to be aware that this unofficial feature is commonly relied upon.
benatkin
2 days ago
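That hard guarantee is easy to check; a quick Python sketch:

```python
import json

# CPython 3.7+ dicts preserve insertion order, and json round-trips keep it.
d = json.loads('{"z": 1, "a": 2}')
print(list(d))        # ['z', 'a'] - key order survives parsing
print(json.dumps(d))  # {"z": 1, "a": 2}
```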
I don't think OpenAI models support this pattern. You can only have arrays of a fixed type; basically, the keys have to be the same. See [1]
[1]: https://platform.openai.com/docs/guides/structured-outputs/s...
YetAnotherNick
2 days ago
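So the supported way to exploit ordering is the one from the quote: declare the reasoning field before the answer field. A sketch using the OpenAI Python SDK's structured-output parsing (the model name is an assumption):

```python
from openai import OpenAI
from pydantic import BaseModel

class Answer(BaseModel):
    chain_of_thought: str  # generated first, because it is declared first
    final_answer: str      # generated after the reasoning above

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    response_format=Answer,
)
print(completion.choices[0].message.parsed.final_answer)
```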
The eval platform is a game changer.
It's nice to have a solution from OpenAI, given how much they use a variant of this internally. I've tried like 5 YC startups and I don't think anyone's really solved this.
There's the very real risk of vendor lock-in, but quickly scanning the docs, it seems like a pretty portable implementation.
serjester
2 days ago
It's pretty amazing that they made prompt caching automatic. It's rare that a company gives a 50% discount without the customer explicitly requesting it! Of course... they might be retaining some margin, judging by their discount being 50% vs. Anthropic's 90%.
alach11
2 days ago
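Since the caching is automatic and keyed on a repeated prompt prefix, the only thing to do client-side is keep the static content first. A sketch, assuming it works the way OpenAI describes (the ~1024-token minimum is from their prompt caching docs; `policy.txt` is a hypothetical long document):

```python
from openai import OpenAI

client = OpenAI()

# Long static content goes first so the identical prefix is cacheable across
# calls; caching reportedly kicks in above roughly 1024 prompt tokens.
STATIC_SYSTEM_PROMPT = "You are a support agent.\n" + open("policy.txt").read()

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable prefix
            {"role": "user", "content": question},                # varies per call
        ],
    )
    return resp.choices[0].message.content
```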
This was first done by deepseek. [1]
[1]: https://platform.deepseek.com/api-docs/news/news0802/
WiSaGaN
2 days ago
Haven’t tried Deepseek - how do they compare to OpenAI?
nextworddev
2 days ago
They release SOTA open-source coding models. [1] Their API is also incredibly cheap due to the novel attention and MoE arch.
[1]: https://aider.chat/docs/leaderboards/
WiSaGaN
2 days ago
Aider benchmarks them as great for coding, but token generation is super slow. Much cheaper, but once you're used to the speed... it's too slow.
voiper1
2 days ago
Blog updates:
- Introducing the Realtime API: https://openai.com/index/introducing-the-realtime-api/
- Introducing vision to the fine-tuning API: https://openai.com/index/introducing-vision-to-the-fine-tuni...
- Prompt Caching in the API: https://openai.com/index/api-prompt-caching/
- Model Distillation in the API: https://openai.com/index/api-model-distillation/
Docs updates:
- Realtime API: https://platform.openai.com/docs/guides/realtime
- Vision fine-tuning: https://platform.openai.com/docs/guides/fine-tuning/vision
- Prompt Caching: https://platform.openai.com/docs/guides/prompt-caching
- Model Distillation: https://platform.openai.com/docs/guides/distillation
- Evaluating model performance: https://platform.openai.com/docs/guides/evals
Additional updates from @OpenAIDevs: https://x.com/OpenAIDevs/status/1841175537060102396
- New prompt generator on https://playground.openai.com
- Access to the o1 model is expanded to developers on usage tier 3, and rate limits are increased (to the same limits as GPT-4o)
Additional updates from @OpenAI: https://x.com/OpenAI/status/1841179938642411582
- Advanced Voice is rolling out globally to ChatGPT Enterprise, Edu, and Team users. Free users will get a sneak peek of it (except EU).
thenameless7741
3 days ago
> Advanced Voice is rolling out globally to ChatGPT Enterprise, Edu, and Team users. Free users will get a sneak peek of it.
So regular paying users in the EU are still left out in the cold.
visarga
2 days ago
It's probably stuck in legal limbo in the EU. The recently passed EU AI Act prohibits "AI systems aiming to identify or infer emotions", and Advanced Voice does definitely infer the user's emotions.
(There is an exemption for "AI systems placed on the market strictly for medical or safety reasons, such as systems intended for therapeutical use", but Advanced Voice probably doesn't benefit from that exemption.)
AlanYx
2 days ago
Apparently this prohibition only applies to "situations related to the workplace and education", and, in this context, "That prohibition should not cover AI systems placed on the market strictly for medical or safety reasons"
So it seems to be possible to use this in a personal context.
https://artificialintelligenceact.eu/recital/44/
> Therefore, the placing on the market, the putting into service, or the use of AI systems intended to be used to detect the emotional state of individuals in situations related to the workplace and education should be prohibited. That prohibition should not cover AI systems placed on the market strictly for medical or safety reasons, such as systems intended for therapeutical use.
qwertox
2 days ago
This is true, though it may not make sense commercially for them to offer an API that can't be used for workplace (business) applications or education.
AlanYx
2 days ago
I see what you mean, but I think that "workplace" specifically refers to the context of the workplace, so that an employer cannot use AI to monitor employees, even if they have been pressured to agree to such monitoring. I think this is unrelated to "commercially offering services which can detect emotions".
But then I don't get the spirit of that limitation, as it should be just as applicable to TVs listening in on your conversations and trying to infer your emotions. Then again, I guess that for these cases there are other rules in place which prohibit doing this without the explicit consent of the user.
qwertox
2 days ago
> I think that
> I think this
> I don't get the spirit of that limitation
> I guess that
In a nutshell, this uncertainty is why firms are going to slow-roll EU rollout of AI and, for designated gatekeepers, other features. Until there is a body of litigated cases to use as reference, companies would be placing themselves on the hook for tremendous fines, not to mention the distraction of the executives.
Which, not making any value judgement here, is the point of these laws. To slow down innovation so that society, government, regulation, can digest new technologies. This is the intended effect, and the laws are working.
runako
2 days ago
Companies like OpenAI definitely have the resources to let some lawyers analyze the situation and at this point it should be clear to them if they can or can't do this. It's far more likely that they're holding back because of limitations in hardware resources.
I used those words because I've never read any of the provisions of the EU AI Act.
qwertox
2 days ago
They definitely do have the resources, but laws and regulations are frequently ambiguous. This is one reason the outcome of litigation is often unpredictable.
I would wager this -- OpenAI's lawyers have looked at the situation. They have not been able to credibly say "yes, this is okay", so management made the decision to wait. Obviously, they would prefer to compete in Europe if it were a no-brainer decision.
It may be possible that the path to get to "yes, definitely" includes some amount of discussion with the relevant EU authorities and/or product modification. These things will take time.
runako
2 days ago
Yes, but it works with a VPN, and the change in latency isn't big enough to have a noticeable impact on usability.
Version467
2 days ago
I understand the Realtime API's voice novelty and the technological achievement it is, but I don't see it from the product point of view. It looks like one of those startups finding a solution before knowing the problem.
The two examples shown in the DevDay are the things I don't really want to do in the future. I don't want to talk to anybody, and I don't want to wait for their answer in a human form. That's why I order my food through an app or Whatsapp, or why I prefer to buy my tickets online. In the rare case I call to order food, it's because I have a weird question or a weird request (can I pick it up in X minutes? Can you prepare it in a different way?)
I hope we don't start seeing apps using conversations as interfaces, because it would be really horrible (leaving aside the fact that a lot of people don't know how to express themselves, plus different accents, noisy environments, etc.), while clicking or typing works almost the same for everyone (at least it's much more normalized than talking).
101008
2 days ago
> I understand the Realtime API's voice novelty and the technological achievement it is, but I don't see it from the product point of view. It looks like one of those startups finding a solution before knowing the problem.
The market for realistic voice agents is huge, but also very fragmented. Customer service is the obvious example, large companies employ tens of thousands of customer service phone agents, and a large # of those calls can be handled, at least in part, with a sufficiently smart voice agent.
Sales is another, just calling back leads and checking in on them. Voice clone the original sales agent, give the AI enough context about previous interactions, and a lot of boring legwork can be handled by AI.
Answering simple questions is another great example. Restaurants get slammed with calls during their busiest hours (seriously, getting ahold of restaurant staff during peak hours can be literally impossible!). Having an AI that can pick up the phone and answer basic questions (what's in certain dishes, what the current wait time is, what's the largest group that can be seated together, etc.) is super useful.
A lot of small businesses with only a single employee can benefit from having a voice AI assistant picking up the phone and answering the easy everyday queries and then handing everything else off to the owner.
The key is that these voice AIs should be seamless, you ask your question, they answer, and you ideally don't even know it is an AI.
com2kid
2 days ago
Hopefully, it won't cause a plethora of nuisance phone calls. As the cost tends to zero, it will become much easier to spam people, even more so than now.
roboboffin
2 days ago
And after you're misled by a sales agent, it doesn't make you as angry, because it's just an AI.
axus
2 days ago
they're definitely going to instruct the AI agents to lie to you, deliberately waste your time, and be pushier than ever, because it costs them nothing to keep a real human on the line even longer. at least we'll have our own agents to waste their compute in turn.
93po
2 days ago
Any company that is that scummy already has sales people working for it who are that scummy and lying non-stop.
The AI isn't changing that equation at all.
com2kid
2 days ago
AI is actually better here.
1. AI instructions are legible. There is no record of John being asked to sell the customer things they don't need; there is a record if the AI is instructed to do it.
2. AI interactions are legible. If a sales guy tells you something false on a Zoom call, there is no record of it. If the AI does, there is a record.
JamesBarney
2 days ago
oh i totally agree, i'm saying it's gonna get worse
93po
2 days ago
I would love a work assistant, some sort of secretary, idk, that I can talk to while I code.
"What are today's most important tasks? Anything I forgot before I log off? Can you write John to check the blocking PR? Let's fix this bug together".
epolanski
2 days ago
One thing I'm really excited for is having this real-time voice model in video game characters. It would be really cool to be able to have conversations with NPCs, and actually have to pick their brain for information about a quest or something.
corlinp
2 days ago
"Thrall, the elements hate you" as his model bursts into tears.
whtsthmttrmn
a day ago
You're right, having a voice conversation for any reason is just so passé these days. They should stop adding microphones to phones and everything. So old-fashioned and inefficient. And who ever wants to actually talk to someone or some AI to ask for anything? I'm sure our vocal cords will evolve away soon. They are so primitive. Vestigial organs.
ilaksh
2 days ago
I love having voice conversations with friends, family, and people I care about. Not with businesses.
101008
2 days ago
You made my day
olafgeibig
2 days ago
keep in mind that this is just v1 of the realtime api. they'll add realtime vision/video down the road which can also have wide applications beyond synchronous communication.
bcherry
2 days ago
Holy crud, I figured they would guard this for a long time and I was really salivating to make some stuff with it. The doors are wide open for all sorts of stuff now, Advanced Voice is the first feature since ChatGPT initially came out that really has my jaw on the floor.
superdisk
2 days ago
Try NotebookLM, it's the ChatGPT moment for Google's DeepMind.
jacooper
2 days ago
I wish I could, but it's not available in the UK, IIRC.
world2vec
2 days ago
For anyone who’s interested, I’ve written up details of how the underlying live blog system works here: https://til.simonwillison.net/django/live-blog
simonw
a day ago
From the Realtime API blog post: https://openai.com/index/introducing-the-realtime-api/
> Audio in the Chat Completions API will be released in the coming weeks, as a new model `gpt-4o-audio-preview`. With `gpt-4o-audio-preview`, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
> The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.
As usual, OpenAI failed to emphasize the real game-changer feature at their Dev Day: audio output from the standard generation API.
This has severe implications for text-to-speech apps, particularly if the audio output style is as steerable as the gpt-4o voice demos.
minimaxir
2 days ago
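The quoted per-minute figures imply roughly 10 audio tokens per second of audio (600 tokens/min × $100 per 1M ≈ $0.06/min in). Once `gpt-4o-audio-preview` lands, a Chat Completions call might look like this sketch; the `modalities`/`audio` parameter shapes are assumptions based on the announcement:

```python
import base64
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Read this sentence in a cheerful voice."}],
)
# the audio comes back base64-encoded alongside a text transcript
with open("output.wav", "wb") as f:
    f.write(base64.b64decode(resp.choices[0].message.audio.data))
```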
> and $0.24 per minute of audio output
That is substantially more expensive than TTS (text-to-speech), which is already quite expensive.
OutOfHere
2 days ago
Fair, it wouldn't work well for on-demand generation in an app, but for ad-hoc cases like a voice-over it's not a huge expense.
If OpenAI decides to fully ignore ethics and dive deep into voice cloning, then all bets are off.
minimaxir
2 days ago
I agree. I'm wondering if it is possible to disable output streaming of audio and just get the text response event.
qwertox
2 days ago
It seems so.
The configuration of the session accepts a parameter (`modalities`) that can restrict the response to text only. See https://platform.openai.com/docs/api-reference/realtime-clie....
colaco
2 days ago
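A sketch of that `session.update` event, assuming an open Realtime WebSocket `ws` and the event shape from the linked API reference:

```python
import json

async def text_only(ws):
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"modalities": ["text"]},  # no audio in responses
    }))
```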
correct - you should also be able to save a lot by skipping their built-in VAD and doing turn detection (if you need it) locally to avoid paying for silent inputs.
bcherry
2 days ago
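A sketch of what skipping the built-in VAD might look like, assuming an open Realtime WebSocket `ws` and a local detector deciding when the turn ends (event shapes from the API reference):

```python
import json

async def manual_turns(ws):
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},  # disable server-side VAD
    }))
    # ...append only the audio chunks your local VAD classifies as speech, then:
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))  # end the turn
    await ws.send(json.dumps({"type": "response.create"}))            # request a reply
```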
I just need their API to be faster. 15-30 seconds per request using 4o-mini isn't good enough for responsive applications.
N_A_T_E
2 days ago
You should try Azure: it comes with dedicated capacity which is typically a very expensive "call our sales team" feature with OpenAI
BoorishBears
2 days ago
The new Realtime WebSocket API appears to send back responses in less than a second. It might be just what you want.
simonw
2 days ago
yes, and you can use it in text-text mode if you want. a key benefit for turn-based usage (where you have a running back-and-forth between user and assistant) is that you only need to send the incremental new input message for each generation. this is better than "prompt caching" on the chat completions API, which is basically a pricing optimization; this is an actual technical advantage that uses less upstream bandwidth.
bcherry
2 days ago
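A sketch of that turn-based text-text usage, assuming an open Realtime WebSocket `ws`: only the new user message goes up each turn, since the server holds the conversation state.

```python
import json

async def send_text_turn(ws, text: str):
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    }))
    await ws.send(json.dumps({"type": "response.create"}))  # ask for the reply
```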
That is odd. The longest I've experienced in my use of it is a few seconds.
carlgreene
2 days ago
That doesn't match my experience at all, and I use it a lot.
petesergeant
2 days ago
I didn't expect an API for advanced voice so soon. That's pretty great. Here's the thing I was really wondering about: audio is $.06/min in, $.24/min out. Can't wait to try some language learning apps built with this. It'll also be fun for controlling robots.
modeless
2 days ago
Loving these live updates, keep em coming! Thanks Simon!
sammyteee
2 days ago
> The first big announcement: a realtime API, providing the ability to use WebSockets to implement voice input and output against their models.
I guess this is using their "old" turn-based voice system?
nielsole
2 days ago
No, it's the same thing as ChatGPT advanced voice. Full speech-to-speech model.
bcherry
2 days ago
Right, see the "Handling interruptions" section here: https://platform.openai.com/docs/guides/realtime/integration
chrisshroba
2 days ago
Interesting choice of a 24kHz sample rate for PCM audio. I wonder if the model was trained on 24kHz audio, rather than the usual 8/16kHz for ML models.
jbaudanza
2 days ago
Image output for 4o in the API would be very nice, but I'm not sure if that's at all in the cards.
Audio output is in the API now, but you lose image input. Why? That's a shame.
og_kalu
2 days ago
Any word on increased weekly caps on o1 usage?
hidelooktropic
3 days ago
Weekly caps are for standard accounts (not going to be talked about at DevDay). The blog does note RPM changes for the API though:
"10:30 They started with some demos of o1 being used in applications, and announced that the rate limit for o1 doubled to 10000 RPM (from 5000 RPM) - same as GPT-4 now."
zamadatix
2 days ago
Using structured outputs for generative UI is such a cool idea. Does anyone know of some cool web demos related to this?
lysecret
2 days ago
I just had an evil thought: once AIs are fast enough, it would be possible to create a “dynamic” user interface on the fly using an AI. Instead of Java or C# code running in an event loop processing mouse clicks, in principle we could have a chat bot generate the UI elements in markup like WPF's XAML or plain HTML and process user mouse and keyboard input events!
If you squint at it, this is what chat bots do now, except with a “terminal” style text UI instead of a GUI or true Web UI.
The first incremental step had already been taken: pretty-printing of maths and code. Interactive components are a logical next step.
It would be a mere afternoon of work to write a web server where the dozens of “controllers” are replaced with a single call to an LLM API that simply sends the previous page's HTML and the HTTP request, headers and all.
“Based on the previous HTML above and the HTTP request below, output the response HTML.”
Just sprinkle on some function calling and a database schema, and the site is done!
jiggawatts
2 days ago
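As a toy, the thought experiment fits in a single Flask catch-all controller. A sketch only, with every security caveat the replies below raise; the model name is an assumption, and stuffing HTML into the session cookie would overflow it in any real use.

```python
from flask import Flask, request, session
from openai import OpenAI

app = Flask(__name__)
app.secret_key = "dev-only"
client = OpenAI()

@app.route("/", defaults={"path": ""}, methods=["GET", "POST"])
@app.route("/<path:path>", methods=["GET", "POST"])
def llm_controller(path):
    # The "previous page" is the only state; the LLM is the only controller.
    previous = session.get("html", "<html><body>Home</body></html>")
    req = f"{request.method} /{path}\n{dict(request.form)}"
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Based on the previous HTML above and the HTTP request "
                       "below, output only the response HTML.\n\n"
                       f"Previous HTML:\n{previous}\n\nRequest:\n{req}",
        }],
    )
    html = resp.choices[0].message.content
    session["html"] = html  # toy state; a real version would store server-side
    return html
```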
That actually sounds pretty entertaining. Especially if there is dynamic user input, like text box input
ghthor
2 days ago
Other than being borderline impossible to secure, it “should just work” once the AIs get smart enough.
Fine-tuning the model based on example pages and responses might be all that’s required for a sufficient level of consistency.
An immediate use-case might be prototyping in-place.
If you have an existing site, you can capture the request-response pairs and train the AI on it, annotated with the spec docs. Then tell it to implement some new functionality and it should be able to. Just route a subset of the site to the AI instead of the normal controllers.
One could “design” new components and functionality in English and try it instantly with no compilation or deployment steps!
jiggawatts
2 days ago
Karpathy talks about this in https://karpathy.medium.com/software-2-0-a64152b37c35
porridgeraisin
2 days ago
Seems mostly standard items so far.
bigcat12345678
3 days ago