GPT realtime API for speech and audio

Azure OpenAI’s GPT Realtime API supports interactive, low-latency ‘speech in, speech out’ conversations and is part of the GPT-4o model family. You can stream audio to the model and receive audio responses in real time over WebRTC or WebSocket. The documentation recommends WebRTC for client-side applications such as web and mobile apps because it is designed for low-latency audio streaming, and suggests WebSocket for server-to-server scenarios where extremely low latency is not required.

The article lists the supported realtime models and their versions: gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview (both version 2024-12-17), gpt-realtime (version 2025-08-28), and gpt-realtime-mini (version 2025-10-06). It notes that Realtime API support was first added in API version 2024-10-01-preview (now retired) and recommends using the generally available API version 2025-08-28 when possible. To deploy a model, follow the Azure AI Foundry portal workflow: create or select a project, open Models + endpoints under My assets, choose Deploy model > Deploy base model, select gpt-realtime, and complete the deployment wizard.
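For scripts that need to check a deployment against the versions listed above, the model list can be kept as a small lookup table. This is only an illustrative sketch: the names `REALTIME_MODEL_VERSIONS`, `RECOMMENDED_API_VERSION`, and `model_version` are hypothetical helpers, not part of any Azure SDK, and the values are copied from the summary above.

```python
# Model versions as listed in the summary above (not fetched from Azure).
REALTIME_MODEL_VERSIONS = {
    "gpt-4o-realtime-preview": "2024-12-17",
    "gpt-4o-mini-realtime-preview": "2024-12-17",
    "gpt-realtime": "2025-08-28",
    "gpt-realtime-mini": "2025-10-06",
}

# API version the summary says is recommended when possible.
RECOMMENDED_API_VERSION = "2025-08-28"


def model_version(model_name: str) -> str:
    """Return the listed version for a realtime model, or raise if unknown."""
    try:
        return REALTIME_MODEL_VERSIONS[model_name]
    except KeyError:
        raise ValueError(f"Not a listed realtime model: {model_name!r}") from None
```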

The guide covers prerequisites and authentication options. You need an Azure subscription, a deployed gpt-realtime or gpt-realtime-mini model, and a Node.js or Python environment, depending on which sample (JavaScript, TypeScript, or Python) you run. Microsoft Entra ID keyless authentication is recommended; it requires the Azure CLI and assignment of the Cognitive Services User role. The article also describes API-key authentication and cautions that keys must be stored securely. The example session configuration shows audio input and output settings: transcription with whisper-1, audio/pcm at 24000 Hz, server_vad turn detection, and the output voice ‘alloy’. Client samples for JavaScript, Python, and TypeScript demonstrate event handling for session.created, session.updated, response.output_audio.delta, response.output_audio_transcript.delta, and response.done, and include example console output of transcript deltas and audio chunk sizes to illustrate real-time interaction patterns.
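The session settings and event types described above can be sketched in Python without any network connection. This is a minimal illustration, not the article's sample code: the JSON nesting of the `session.update` payload, the `delta` field names, and the assumption that audio deltas arrive base64-encoded are not stated in this summary, and `build_session_update` and `handle_event` are hypothetical helper names. Only the setting values and event type strings come from the text.

```python
import base64
import json


def build_session_update(voice: str = "alloy", sample_rate: int = 24000) -> str:
    """Build a session.update message as JSON (field layout is an assumption)."""
    event = {
        "type": "session.update",
        "session": {
            "audio": {
                "input": {
                    "transcription": {"model": "whisper-1"},
                    "format": {"type": "audio/pcm", "rate": sample_rate},
                    "turn_detection": {"type": "server_vad"},
                },
                "output": {
                    "voice": voice,
                    "format": {"type": "audio/pcm", "rate": sample_rate},
                },
            },
        },
    }
    return json.dumps(event)


def handle_event(raw: str, state: dict) -> None:
    """Update `state` from one server event; unknown event types are ignored."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind in ("session.created", "session.updated"):
        state["session_ready"] = True
    elif kind == "response.output_audio.delta":
        # Assumed: the audio chunk is base64 text in the "delta" field.
        chunk = base64.b64decode(event["delta"])
        state["audio_bytes"] = state.get("audio_bytes", 0) + len(chunk)
    elif kind == "response.output_audio_transcript.delta":
        # Assumed: the transcript fragment is plain text in "delta".
        state["transcript"] = state.get("transcript", "") + event["delta"]
    elif kind == "response.done":
        state["done"] = True
```

In a real client, `handle_event` would be called once per message received from the WebSocket or WebRTC data channel, and the string returned by `build_session_update` would be sent once the session has been created.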

Originally reported by learn.microsoft.com