At the recent OpenAI Dev Day on October 1st, 2024, OpenAI's biggest launch was the reveal of their Realtime API:
“Today, we’re introducing a public beta of the Realtime API, enabling all paid developers to build low-latency, multimodal experiences in their apps.
Similar to ChatGPT’s Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices already supported in the API.”
(source: OpenAI website)
As their announcement highlights, some of its key benefits are low latency and speech-to-speech capability. Let's see how that plays out in practice when it comes to building voice AI agents.
It also has an interruption handling feature, so the realtime stream will stop sending audio if it detects you are trying to talk over it, a useful feature for sure when building voice agents.
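Mechanically, interruption handling hinges on the Realtime API's server-side voice activity detection: when it hears the caller start speaking, it emits an `input_audio_buffer.speech_started` server event, and the client can react by cancelling the in-flight model response and flushing any audio already buffered downstream. Here is a minimal sketch of that reaction, assuming a Twilio Media Streams connection on the phone side (the `clear` message is Twilio's buffer-flush event; exact handling will vary by application):

```javascript
// Given a server event from the OpenAI Realtime API websocket, decide
// what to send where when the caller interrupts:
// - tell OpenAI to cancel the response it is currently generating
// - tell Twilio to clear (flush) audio it has buffered but not yet played
function handleInterruption(event, streamSid) {
  if (event.type === 'input_audio_buffer.speech_started') {
    return {
      toOpenAI: { type: 'response.cancel' },
      toTwilio: { event: 'clear', streamSid: streamSid }
    };
  }
  return null; // not an interruption event, nothing to do
}

const actions = handleInterruption(
  { type: 'input_audio_buffer.speech_started' },
  'MZtest'
);
console.log(actions.toOpenAI.type); // 'response.cancel'
```

In a real application the two returned messages would be `JSON.stringify`-ed and sent down the OpenAI and Twilio websockets respectively.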
In this article we will:
- Compare what a phone voice agent flow might have looked like before the Realtime API, and what it looks like now,
- Review a GitHub project from Twilio that sets up a voice agent using the new Realtime API, so we can see what the implementation looks like in practice, and get an idea of how the websockets and connections are set up for such an application,
- Quickly review the React demo project from OpenAI that uses the Realtime API,
- Compare the pricing of these various options.
Before the OpenAI Realtime API
To get a phone voice agent service working, there are some key services we require:
- Speech to Text (e.g. Deepgram),
- LLM/Agent (e.g. OpenAI),
- Text to Speech (e.g. ElevenLabs).
These services are illustrated in the diagram below.
That of course means integration with a number of services, and separate API requests for each part.
The new OpenAI Realtime API allows us to bundle all of those together into a single request, hence the term speech-to-speech.
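To make the shape of that three-service pipeline concrete, here is a hedged sketch of the chain of calls involved. The function names and bodies are stand-ins returning canned values; in a real application each would wrap the corresponding Deepgram, OpenAI, and ElevenLabs SDK call:

```javascript
// Hypothetical three-stage pipeline: each stage is a separate vendor
// API call, and each hop adds its own network latency.
async function speechToText(audio) {            // stand-in for Deepgram STT
  return 'What are your opening hours?';
}
async function runAgent(text) {                 // stand-in for a GPT-4o call
  return `You asked: "${text}" - we open at 9am.`;
}
async function textToSpeech(text) {             // stand-in for ElevenLabs TTS
  return Buffer.from(text);
}

// One caller utterance triggers three sequential API round-trips
async function handleCallerAudio(audioChunk) {
  const transcript = await speechToText(audioChunk); // 1st request
  const replyText = await runAgent(transcript);      // 2nd request
  const replyAudio = await textToSpeech(replyText);  // 3rd request
  return replyAudio;
}

handleCallerAudio(Buffer.alloc(0)).then((audio) =>
  console.log(audio.toString()) // the agent's reply, as audio bytes
);
```

The three sequential awaits are exactly where the latency of the pre-Realtime approach accumulates.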
After the OpenAI Realtime API
This is what the flow diagram would look like for a similar flow using the new OpenAI Realtime API.
Clearly this is a much simpler flow. What is happening is we are just passing the speech/audio from the phone call directly to the OpenAI Realtime API. No need for a speech-to-text intermediary service.
And on the response side, the Realtime API is again providing an audio stream as the response, which we can send right back to Twilio (i.e. to the phone call response). So again, no need for an extra text-to-speech service, as it is all taken care of by the OpenAI Realtime API.
Let's look at some code samples for this. Twilio has provided a great GitHub repository example for setting up this Twilio and OpenAI Realtime API flow. You can find it here:
Here are some excerpts from key parts of the code related to setting up:
- the websockets connection from Twilio to our application, so that we can receive audio from the caller, and send audio back,
- and the websockets connection to the OpenAI Realtime API from our application.
I have added some comments in the source code below to try to explain what is going on, especially regarding the websocket connection between Twilio and our application, and the websocket connection from our application to OpenAI. The triple dots (…) refer to sections of the source code that have been removed for brevity, since they are not critical to understanding the core features of how the flow works.
// On receiving a phone call, Twilio forwards the incoming call request to
// a webhook we specify, which is this endpoint here. This allows us to
// create programmatic voice applications, for example using an AI agent
// to handle the phone call
//
// So, here we are providing an initial response to the call, and creating
// a websocket (called a MediaStream in Twilio, more on that below) to receive
// any future audio that comes into the call
fastify.all('/incoming', async (request, reply) => {
    const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
        <Response>
            <Say>Please wait while we connect your call to the A. I. voice assistant, powered by Twilio and the Open-A.I. Realtime API</Say>
            <Pause length="1"/>
            <Say>O.K. you can start talking!</Say>
            <Connect>
                <Stream url="wss://${request.headers.host}/media-stream" />
            </Connect>
        </Response>`;

    reply.type('text/xml').send(twimlResponse);
});
fastify.register(async (fastify) => {
    // Here we are connecting our application to the websocket media stream we
    // set up above. That means all audio that comes through the phone will come
    // to this websocket connection we have set up here
    fastify.get('/media-stream', { websocket: true }, (connection, req) => {
        console.log('Client connected');

        // Now, we are creating a websocket connection to the OpenAI Realtime API
        // This is the second leg of the flow diagram above
        const openAiWs = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01', {
            headers: {
                Authorization: `Bearer ${OPENAI_API_KEY}`,
                "OpenAI-Beta": "realtime=v1"
            }
        });
...
        // Here we are setting up the listener on the OpenAI Realtime API
        // websockets connection. We are specifying how we want it to
        // handle any incoming audio streams that have come back from the
        // Realtime API.
        openAiWs.on('message', (data) => {
            try {
                const response = JSON.parse(data);
                ...
                // This response type indicates an LLM response from the Realtime API
                // So we want to forward this response back to the Twilio MediaStream
                // websockets connection, which the caller will hear as a response
                // on the phone
                if (response.type === 'response.audio.delta' && response.delta) {
                    const audioDelta = {
                        event: 'media',
                        streamSid: streamSid,
                        media: { payload: Buffer.from(response.delta, 'base64').toString('base64') }
                    };
                    // This is the actual part where we are sending it back to the Twilio
                    // MediaStream websockets connection. Notice how we are sending the
                    // response back directly. No need for text-to-speech conversion of
                    // the OpenAI response. The OpenAI Realtime API already provides the
                    // response as an audio stream (i.e. speech to speech)
                    connection.send(JSON.stringify(audioDelta));
                }
            } catch (error) {
                console.error('Error processing OpenAI message:', error, 'Raw message:', data);
            }
        });
        // This part specifies how we handle incoming messages to the Twilio
        // MediaStream websockets connection, i.e. how we handle audio that comes
        // into the phone from the caller
        connection.on('message', (message) => {
            try {
                const data = JSON.parse(message);
                switch (data.event) {
                    // This case ('media') is the state for when there is audio data
                    // available on the Twilio MediaStream from the caller
                    case 'media':
                        // we first check that the OpenAI Realtime API websockets
                        // connection is open
                        if (openAiWs.readyState === WebSocket.OPEN) {
                            const audioAppend = {
                                type: 'input_audio_buffer.append',
                                audio: data.media.payload
                            };
                            // and then forward the audio stream data to the
                            // Realtime API. Again, notice how we are sending the
                            // audio stream directly, with no speech-to-text conversion
                            // as would have been required previously
                            openAiWs.send(JSON.stringify(audioAppend));
                        }
                        break;
                    ...
                }
            } catch (error) {
                console.error('Error parsing message:', error, 'Message:', message);
            }
        });
...
fastify.listen({ port: PORT }, (err) => {
    if (err) {
        console.error(err);
        process.exit(1);
    }
    console.log(`Server is listening on port ${PORT}`);
});
So, that's how the new OpenAI Realtime API flow plays out in practice.
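One practical detail worth calling out: Twilio Media Streams carry 8kHz G.711 μ-law audio, so the Realtime session needs to be configured to accept and produce that format, which is done by sending a `session.update` message over the OpenAI websocket once it opens. A sketch of such a message, based on the Realtime API beta (the voice and instructions values here are placeholder choices, not from the Twilio repo):

```javascript
// session.update configures the Realtime session: the audio formats must
// match what Twilio Media Streams sends and expects (8kHz G.711 u-law),
// and we can also set the voice, system instructions, and server-side
// voice activity detection for turn taking.
const sessionUpdate = {
  type: 'session.update',
  session: {
    turn_detection: { type: 'server_vad' },
    input_audio_format: 'g711_ulaw',
    output_audio_format: 'g711_ulaw',
    voice: 'alloy',
    instructions: 'You are a helpful phone assistant.', // placeholder prompt
    modalities: ['text', 'audio']
  }
};

// Typically sent as soon as the OpenAI websocket connects, e.g.:
// openAiWs.on('open', () => openAiWs.send(JSON.stringify(sessionUpdate)));
console.log(sessionUpdate.session.input_audio_format);
```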
Regarding the Twilio MediaStreams, you can read more about them here. They are a way to set up a websockets connection between a call to a Twilio phone number and your application. This allows streaming of audio from the call to and from your application, letting you build programmable voice applications over the phone.
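For reference, the messages arriving on that Media Streams websocket are JSON: a `start` event carries the streamSid identifying the call, and each `media` event carries a small chunk of base64-encoded audio. A sketch of parsing them (the streamSid and audio bytes here are made-up examples):

```javascript
// Example messages in the shape Twilio Media Streams sends over the
// websocket (the streamSid and audio payload are made up).
const startMsg = JSON.stringify({
  event: 'start',
  start: { streamSid: 'MZtest0000000000000000000000000000' }
});
const mediaMsg = JSON.stringify({
  event: 'media',
  media: { payload: Buffer.from([0xff, 0x7f, 0xff, 0x7f]).toString('base64') }
});

let streamSid = null;
const audioChunks = [];
for (const raw of [startMsg, mediaMsg]) {
  const data = JSON.parse(raw);
  if (data.event === 'start') {
    // The streamSid identifies the call; we need it to address any
    // audio we send back down this connection
    streamSid = data.start.streamSid;
  } else if (data.event === 'media') {
    // Each 'media' event carries a chunk of base64-encoded caller audio
    audioChunks.push(Buffer.from(data.media.payload, 'base64'));
  }
}
console.log(streamSid, audioChunks[0].length); // 4 bytes of audio received
```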
To get the code above working, you will need to set up a Twilio number and ngrok also. You can check out my other article over here for help setting those up.
Since access to the OpenAI Realtime API has only just been rolled out, not everyone may have access just yet. I initially was not able to access it. Running the application worked, but as soon as it tried to connect to the OpenAI Realtime API I got a 403 error. So if you see the same issue, it could be related to not having access yet.
OpenAI have also provided a great demo for testing out their Realtime API in the browser using a React app. I tested this out myself, and was very impressed with the speed of response from the voice agent coming from the Realtime API. The response is instant, with no noticeable latency, and makes for a great user experience.
Sharing a link to the source code here. It has instructions in the README.md for how to get set up.
This is a picture of what the application looks like once you get it running locally.
Let's compare the cost of using the OpenAI Realtime API versus a more conventional approach using Deepgram for speech to text (STT) and text to speech (TTS) and using OpenAI GPT-4o for the LLM part.
Comparison using the prices from their websites shows that for a 1-minute conversation, with the caller speaking half the time and the AI agent speaking the other half, the cost per minute using Deepgram and GPT-4o would be $0.0117/minute, while using the OpenAI Realtime API would be $0.15/minute.
That means using the OpenAI Realtime API would be roughly 13x the price per minute, i.e. over an order of magnitude more.
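The arithmetic behind that per-minute comparison, using the Realtime API's launch pricing from OpenAI's announcement (roughly $0.06 per minute of audio input and $0.24 per minute of audio output; these figures may well have changed since):

```javascript
// For a 1-minute call where the caller and the agent each speak ~30s:
// half the minute is billed as audio input, half as audio output.
const realtimePerMinute = 0.5 * 0.06 + 0.5 * 0.24; // = $0.15/min
const classicPerMinute = 0.0117; // Deepgram STT/TTS + GPT-4o, from above

console.log(realtimePerMinute.toFixed(2));                      // 0.15
console.log((realtimePerMinute / classicPerMinute).toFixed(1)); // 12.8
```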
It does sound like a fair amount more expensive, though we should balance that against some of the benefits the OpenAI Realtime API can provide, including:
- reduced latency, crucial for a good voice experience,
- ease of setup due to fewer moving parts,
- conversation interruption handling provided out of the box.
Also, please do keep in mind that prices can change over time, so the prices you find at the time of reading this article may not be the same as those reflected above.
Hope that was helpful! What do you think of the new OpenAI Realtime API? Think you will be using it in any upcoming projects?
While we're here, are there any other tutorials or articles around voice agents and voice AI you would be interested in? I'm deep diving into that topic a bit just now, so would be happy to look into anything people find interesting.
Happy hacking!