
Text-to-Speech WebSocket Streaming

Overview

The Telnyx Text-to-Speech (TTS) WebSocket API provides real-time audio synthesis from text input. This streaming endpoint allows you to send text and receive synthesized audio incrementally, enabling low-latency voice generation for real-time applications.

Video Demos

Watch these demonstrations to see Telnyx Text-to-Speech in action:

Convert text to speech in REAL TIME | Python | TTS websocket streaming

Telnyx Text-to-Speech API Use-case Demo

Telnyx TTS Audio Reader

WebSocket Endpoint

Connection URL

wss://api.telnyx.com/v2/text-to-speech/speech?voice={voice_id}

Query Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| voice | string | Yes | Voice identifier (e.g., Telnyx.NaturalHD.astra) |
| inactivity_timeout | integer | No | Idle time, in seconds, the WebSocket stays open without receiving a message (default: 20) |
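
Given the parameters above, the connection URL can be assembled with Python's standard library; the helper below is a sketch (the function name is illustrative, not part of the API):

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.telnyx.com/v2/text-to-speech/speech"

def build_tts_url(voice, inactivity_timeout=None):
    """Build the TTS WebSocket URL from the query parameters above."""
    params = {"voice": voice}
    if inactivity_timeout is not None:
        params["inactivity_timeout"] = inactivity_timeout
    return f"{BASE_URL}?{urlencode(params)}"

print(build_tts_url("Telnyx.NaturalHD.astra", inactivity_timeout=30))
# → wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra&inactivity_timeout=30
```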

Authentication

Include your Telnyx API token as an Authorization header in the connection request:

Authorization: Bearer YOUR_TELNYX_TOKEN

Example Connection

import websockets

url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
headers = {
    "Authorization": "Bearer YOUR_TELNYX_TOKEN"
}

# Inside an async function. Note: newer releases of the websockets
# library may expect additional_headers instead of extra_headers.
websocket = await websockets.connect(url, extra_headers=headers)

Available Voices

Telnyx offers high-quality text-to-speech voices across multiple models, languages, and voice types, including AWS Polly Text-to-Speech and Azure AI Speech. Telnyx voices can be filtered by model, language, and gender characteristics.

For AWS Polly voices, see the complete AWS Polly voice list. For Azure AI Speech voices, explore the Azure AI Speech voice gallery.

Connection Flow

The TTS WebSocket follows this lifecycle:

  1. Connect - Establish WebSocket connection with authentication.
  2. Initialize - Send initialization frame with space character.
  3. Send Text - Send one or more text frames to synthesize.
  4. Receive Audio - Receive audio frames with base64-encoded mp3 data.
  5. Stop - Send empty text frame to signal completion.
  6. Close - Connection closes after processing completes.

Flow Diagram

Client                           Server
  |                                |
  |-------- Connect -------------->|
  |<------- Connected -------------|
  |                                |
  |-------- Init Frame ----------->|
  |         {"text": " "}          |
  |                                |
  |-------- Text Frame ----------->|
  |         {"text": "Hello"}      |
  |                                |
  |<------- Audio Frame -----------|
  |         {"audio": "base64..."} |
  |<------- Audio Frame -----------|
  |         {"audio": "base64..."} |
  |                                |
  |-------- Stop Frame ----------->|
  |         {"text": ""}           |
  |                                |
  |<------- Close -----------------|

Frame Types

Outbound Frames (Client → Server)

All outbound frames are JSON text messages with the following structure:

1. Initialization Frame

Purpose: Initialize the TTS session

Format:

{
  "text": " "
}

Example:

import json

init_frame = {"text": " "}
await websocket.send(json.dumps(init_frame))

Notes:

  • Must be sent first after connection.
  • Contains a single space character.
  • Required to begin the session.

2. Text Frame

Purpose: Send text content to be synthesized into speech

Format:

{
  "text": "Your text content here"
}

Example:

text_frame = {"text": "Hello, this is a test of the Telnyx TTS service."}
await websocket.send(json.dumps(text_frame))

Multiple Text Frames:

# You can send multiple text frames sequentially
frames = [
    {"text": "First sentence."},
    {"text": "Second sentence."},
    {"text": "Third sentence."},
]

for frame in frames:
    await websocket.send(json.dumps(frame))
    await asyncio.sleep(0.5)

Notes:

  • Can send multiple text frames in one session.
  • Each frame is processed and synthesized separately.
  • Audio is returned incrementally for each text frame.

3. Stop Frame

Purpose: Signal completion of text input and end the session

Format:

{
  "text": ""
}

Example:

stop_frame = {"text": ""}
await websocket.send(json.dumps(stop_frame))

Notes:

  • Contains an empty string.
  • Signals the server to finish processing.
  • Should be sent after all text frames.

Inbound Frames (Server → Client)

The server sends JSON text messages containing synthesized audio data.

Audio Frame

Purpose: Deliver synthesized audio data

Format:

{
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAAB9AAACABAAZGF0YQAAAAA="
}

Processing Audio:

import base64
import json

async for message in websocket:
    data = json.loads(message)

    if "audio" in data:
        # Decode base64 audio
        audio_bytes = base64.b64decode(data["audio"])

        # Save or process audio
        with open("output.mp3", "ab") as f:
            f.write(audio_bytes)

Audio Specifications:

| Property | Value |
|---|---|
| Format | mp3 |
| Sample Rate | 16 kHz |
| Bit Depth | 16-bit |
| Channels | Mono (1) |
| Encoding | Base64 |

Notes:

  • Multiple audio frames may be received for a single text input.
  • Each audio chunk is a complete mp3 file with headers.
  • Chunks should be concatenated in the order received.
  • Use append mode when saving to file to preserve all audio.
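
The ordering and concatenation rules above can be captured in a small helper that accumulates decoded chunks in memory before writing them out once; the function name and sample payloads below are illustrative, not part of the API:

```python
import base64
import json

def collect_audio(messages):
    """Decode and concatenate audio frames in the order received."""
    buffer = bytearray()
    for message in messages:
        data = json.loads(message)
        if "audio" in data:
            buffer.extend(base64.b64decode(data["audio"]))
    return bytes(buffer)

# Two simulated server frames (payloads fabricated for the demo):
frames = [
    json.dumps({"audio": base64.b64encode(b"chunk-1").decode()}),
    json.dumps({"audio": base64.b64encode(b"chunk-2").decode()}),
]
audio = collect_audio(frames)
# audio == b"chunk-1chunk-2"; write it once with open("output.mp3", "wb")
```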

Complete Example

Here's a complete example showing all frame types in sequence:

import asyncio
import json
import base64
import websockets

async def tts_example():
    # 1. Connect to the WebSocket
    url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
    headers = {
        "Authorization": "Bearer YOUR_TELNYX_TOKEN"
    }

    async with websockets.connect(url, extra_headers=headers) as ws:
        print("Connected to TTS WebSocket")

        # 2. Send initialization frame
        init_frame = {"text": " "}
        await ws.send(json.dumps(init_frame))
        print("Sent: Initialization frame")

        # 3. Send text frame
        text_frame = {"text": "Hello, welcome to Telnyx Text-to-Speech streaming."}
        await ws.send(json.dumps(text_frame))
        print("Sent: Text frame")

        # 4. Receive audio frames
        audio_count = 0
        stop_sent = False
        async for message in ws:
            data = json.loads(message)

            if "audio" in data:
                audio_count += 1
                audio_bytes = base64.b64decode(data["audio"])

                # Append audio chunks to file
                with open("output.mp3", "ab") as f:
                    f.write(audio_bytes)

                print(f"Received: Audio frame #{audio_count} ({len(audio_bytes)} bytes)")

            # 5. Send the stop frame exactly once, after enough audio has arrived
            if audio_count >= 10 and not stop_sent:  # adjust the threshold to your needs
                stop_frame = {"text": ""}
                await ws.send(json.dumps(stop_frame))
                stop_sent = True
                print("Sent: Stop frame")

    print("Connection closed")

asyncio.run(tts_example())

Expected Output:

Connected to TTS WebSocket
Sent: Initialization frame
Sent: Text frame
Received: Audio frame #1 (8192 bytes)
Received: Audio frame #2 (6144 bytes)
Received: Audio frame #3 (4096 bytes)
Sent: Stop frame
Connection closed

Configuration Summary

Required Configuration

# WebSocket URL
ENDPOINT = "wss://api.telnyx.com/v2/text-to-speech/speech"
VOICE_ID = "Telnyx.NaturalHD.astra"
URL = f"{ENDPOINT}?voice={VOICE_ID}"

# Authentication Header
HEADERS = {
    "Authorization": f"Bearer {TELNYX_TOKEN}"
}

Message Sequence

# 1. Initialization
{"text": " "}

# 2. Text to synthesize (can send multiple)
{"text": "Your text here"}

# 3. Stop signal
{"text": ""}

Demo Project

A complete Python implementation is available via the linked demo project.

Troubleshooting

| Issue | Solution |
|---|---|
| Connection fails | Verify the token format: Bearer YOUR_TOKEN |
| No audio received | Ensure the initialization frame is sent first |
| Audio is garbled | Check base64 decoding and file append mode |
| Empty audio file | Confirm the text frame contains valid content |
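
When debugging the issues above, it can help to classify each outbound frame before sending it, since the init, text, and stop frames differ only in the value of the "text" field. This debugging helper is hypothetical, not part of the API:

```python
import json

def classify_frame(raw):
    """Classify an outbound frame as init, stop, or text for debugging."""
    data = json.loads(raw)
    text = data.get("text")
    if not isinstance(text, str):
        raise ValueError("frame must contain a string 'text' field")
    if text == " ":
        return "init"
    if text == "":
        return "stop"
    return "text"

classify_frame('{"text": " "}')   # → "init"
classify_frame('{"text": ""}')    # → "stop"
classify_frame('{"text": "Hi"}')  # → "text"
```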

Additional Resources