Text-to-Speech WebSocket Streaming
Overview
The Telnyx Text-to-Speech (TTS) WebSocket API provides real-time audio synthesis from text input. This streaming endpoint allows you to send text and receive synthesized audio incrementally, enabling low-latency voice generation for real-time applications.
Video Demos
Watch these demonstrations to see Telnyx Text-to-Speech in action:
- Convert text to speech in REAL TIME | Python | TTS websocket streaming
- Telnyx Text-to-Speech API Use-case Demo
- Telnyx TTS Audio Reader
WebSocket Endpoint
Connection URL
wss://api.telnyx.com/v2/text-to-speech/speech?voice={voice_id}
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| voice | string | Yes | Voice identifier (e.g., Telnyx.NaturalHD.astra) |
| inactivity_timeout | integer | No | How long, in seconds, the WebSocket stays open without receiving a message (default: 20) |
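Both parameters are passed on the query string. The snippet below is a minimal sketch of building the connection URL; the variable names are illustrative, and the inactivity_timeout value shown is only an example override of the default.

from urllib.parse import urlencode

BASE_URL = "wss://api.telnyx.com/v2/text-to-speech/speech"

params = {
    "voice": "Telnyx.NaturalHD.astra",
    "inactivity_timeout": 30,  # example override of the 20-second default
}

# e.g. wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra&inactivity_timeout=30
url = f"{BASE_URL}?{urlencode(params)}"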
Authentication
Include your Telnyx API token as an Authorization header in the connection request:
Authorization: Bearer YOUR_TELNYX_TOKEN
Example Connection
import websockets

url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
headers = {
    "Authorization": "Bearer YOUR_TELNYX_TOKEN"
}

# websockets.connect must be awaited from inside a coroutine
websocket = await websockets.connect(url, extra_headers=headers)
Available Voices
Telnyx offers high-quality text-to-speech voices across multiple models, languages, and voice types, including AWS Polly Text-to-Speech services and Azure AI Speech. Use the interactive explorer below to browse and filter Telnyx voices by model, language, and gender characteristics.
For AWS Polly voices, see the complete AWS Polly voice list. For Azure AI Speech voices, explore the Azure AI Speech voice gallery.
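If you prefer to fetch the voice catalog programmatically rather than through the explorer, the list can also be retrieved over REST. The sketch below assumes a listing endpoint at /v2/text-to-speech/voices and an unspecified response shape; verify both against the Telnyx API reference before relying on them.

import requests

TELNYX_TOKEN = "YOUR_TELNYX_TOKEN"

# Assumed listing endpoint; confirm the exact path in the API reference
resp = requests.get(
    "https://api.telnyx.com/v2/text-to-speech/voices",
    headers={"Authorization": f"Bearer {TELNYX_TOKEN}"},
)
resp.raise_for_status()

# Print the raw JSON payload describing the available voices
print(resp.json())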
Connection Flow
The TTS WebSocket follows this lifecycle:
- Connect - Establish WebSocket connection with authentication.
- Initialize - Send initialization frame with space character.
- Send Text - Send one or more text frames to synthesize.
- Receive Audio - Receive audio frames with base64-encoded mp3 data.
- Stop - Send empty text frame to signal completion.
- Close - Connection closes after processing completes.
Flow Diagram
Client                          Server
|                               |
|------- Connect -------------->|
|<------ Connected -------------|
|                               |
|------- Init Frame ----------->|
|         {"text": " "}         |
|                               |
|------- Text Frame ----------->|
|       {"text": "Hello"}       |
|                               |
|<------ Audio Frame -----------|
|    {"audio": "base64..."}     |
|<------ Audio Frame -----------|
|    {"audio": "base64..."}     |
|                               |
|------- Stop Frame ----------->|
|         {"text": ""}          |
|                               |
|<------ Close -----------------|
Frame Types
Outbound Frames (Client → Server)
All outbound frames are JSON text messages with the following structure:
1. Initialization Frame
Purpose: Initialize the TTS session
Format:
{
  "text": " "
}
Example:
import json
init_frame = {"text": " "}
await websocket.send(json.dumps(init_frame))
Notes:
- Must be sent first after connection.
- Contains a single space character.
- Required to begin the session.
2. Text Frame
Purpose: Send text content to be synthesized into speech
Format:
{
  "text": "Your text content here"
}
Example:
text_frame = {"text": "Hello, this is a test of the Telnyx TTS service."}
await websocket.send(json.dumps(text_frame))
Multiple Text Frames:
# You can send multiple text frames sequentially
frames = [
    {"text": "First sentence."},
    {"text": "Second sentence."},
    {"text": "Third sentence."}
]

for frame in frames:
    await websocket.send(json.dumps(frame))
    await asyncio.sleep(0.5)
Notes:
- Can send multiple text frames in one session.
- Each frame is processed and synthesized separately.
- Audio is returned incrementally for each text frame.
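When synthesizing longer passages, a common pattern is to split the input into sentences and stream each one as its own text frame. The helper below is a minimal sketch; the regular expression and the send_sentences name are illustrative, not part of the API.

import json
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on ., !, or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

async def send_sentences(websocket, passage: str) -> None:
    # Each sentence becomes its own text frame and is synthesized separately
    for sentence in split_sentences(passage):
        await websocket.send(json.dumps({"text": sentence}))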
3. Stop Frame
Purpose: Signal completion of text input and end the session
Format:
{
  "text": ""
}
Example:
stop_frame = {"text": ""}
await websocket.send(json.dumps(stop_frame))
Notes:
- Contains an empty string.
- Signals the server to finish processing.
- Should be sent after all text frames.
Inbound Frames (Server → Client)
The server sends JSON text messages containing synthesized audio data.
Audio Frame
Purpose: Deliver synthesized audio data
Format:
{
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAAB9AAACABAAZGF0YQAAAAA="
}
Processing Audio:
import base64
async for message in websocket:
    data = json.loads(message)
    if "audio" in data:
        # Decode base64 audio
        audio_bytes = base64.b64decode(data["audio"])
        # Save or process audio
        with open("output.mp3", "ab") as f:
            f.write(audio_bytes)
Audio Specifications:
| Property | Value |
|---|---|
| Format | mp3 |
| Sample Rate | 16 kHz |
| Bit Depth | 16-bit |
| Channels | Mono (1) |
| Encoding | Base64 |
Notes:
- Multiple audio frames may be received for a single text input.
- Each audio chunk is a complete mp3 file with headers.
- Chunks should be concatenated in the order received.
- Use append mode when saving to file to preserve all audio.
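If a downstream component needs raw PCM rather than mp3, the concatenated file can be decoded once the session ends. This is a minimal sketch assuming the pydub package (and an ffmpeg installation) is available; it is not part of the Telnyx API.

from pydub import AudioSegment

# Decode the concatenated mp3 produced by the WebSocket session
segment = AudioSegment.from_file("output.mp3", format="mp3")

# The stream is 16 kHz, 16-bit, mono, so these values should match the table above
print(segment.frame_rate, segment.sample_width * 8, segment.channels)

# Export as WAV for players or ASR pipelines that expect PCM
segment.export("output.wav", format="wav")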
Complete Example
Here's a complete example showing all frame types in sequence:
import asyncio
import json
import base64
import websockets
async def tts_example():
    # 1. Connect to WebSocket
    url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
    headers = {
        "Authorization": "Bearer YOUR_TELNYX_TOKEN"
    }

    async with websockets.connect(url, extra_headers=headers) as ws:
        print("Connected to TTS WebSocket")

        # 2. Send initialization frame
        init_frame = {"text": " "}
        await ws.send(json.dumps(init_frame))
        print("Sent: Initialization frame")

        # 3. Send text frame
        text_frame = {"text": "Hello, welcome to Telnyx Text-to-Speech streaming."}
        await ws.send(json.dumps(text_frame))
        print("Sent: Text frame")

        # 4. Send stop frame (no more text will follow)
        stop_frame = {"text": ""}
        await ws.send(json.dumps(stop_frame))
        print("Sent: Stop frame")

        # 5. Receive audio frames until the server finishes and closes the connection
        audio_count = 0
        async for message in ws:
            data = json.loads(message)
            if "audio" in data:
                audio_count += 1
                audio_bytes = base64.b64decode(data["audio"])
                # Append audio chunks to file
                with open("output.mp3", "ab") as f:
                    f.write(audio_bytes)
                print(f"Received: Audio frame #{audio_count} ({len(audio_bytes)} bytes)")

    print("Connection closed")

asyncio.run(tts_example())
Expected Output:
Connected to TTS WebSocket
Sent: Initialization frame
Sent: Text frame
Sent: Stop frame
Received: Audio frame #1 (8192 bytes)
Received: Audio frame #2 (6144 bytes)
Received: Audio frame #3 (4096 bytes)
Connection closed
Configuration Summary
Required Configuration
# WebSocket URL
ENDPOINT = "wss://api.telnyx.com/v2/text-to-speech/speech"
VOICE_ID = "Telnyx.NaturalHD.astra"
URL = f"{ENDPOINT}?voice={VOICE_ID}"
# Authentication Header
HEADERS = {
    "Authorization": f"Bearer {TELNYX_TOKEN}"
}
Message Sequence
# 1. Initialization
{"text": " "}
# 2. Text to synthesize (can send multiple)
{"text": "Your text here"}
# 3. Stop signal
{"text": ""}
Demo Project
A complete Python implementation is available in the linked demo project.
Troubleshooting
| Issue | Solution |
|---|---|
| Connection fails | Verify token format: Bearer YOUR_TOKEN |
| No audio received | Ensure initialization frame sent first |
| Audio is garbled | Check base64 decoding and file append mode |
| Empty audio file | Confirm text frame contains valid content |
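For connection failures specifically, surfacing the HTTP status returned during the handshake usually pinpoints the problem (for example, a 401 typically means a missing or malformed Authorization header). A minimal sketch, assuming the websockets legacy client used in the examples above:

import websockets
from websockets.exceptions import InvalidStatusCode

async def connect_with_diagnostics(url: str, headers: dict):
    try:
        return await websockets.connect(url, extra_headers=headers)
    except InvalidStatusCode as exc:
        # e.g. HTTP 401 indicates the token was rejected during the handshake
        print(f"Handshake rejected with HTTP {exc.status_code}")
        raise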