Text-to-Speech WebSocket Streaming

Overview

The Telnyx Text-to-Speech (TTS) WebSocket API provides real-time audio synthesis from text input. This streaming endpoint allows you to send text and receive synthesized audio incrementally, enabling low-latency voice generation for real-time applications.

WebSocket Endpoint

Connection URL

wss://api.telnyx.com/v2/text-to-speech/speech?voice={voice_id}

Query Parameters

Parameter           Type     Required  Description
voice               string   Yes       Voice identifier (e.g., Telnyx.NaturalHD.astra)
inactivity_timeout  integer  No        Seconds the WebSocket stays open without receiving a message (default: 20)
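
Building the URL by hand is error-prone once the optional parameter is involved. The following is a minimal sketch that encodes the two query parameters listed above with the standard library. The helper name build_tts_url is hypothetical and is not part of any Telnyx SDK.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.telnyx.com/v2/text-to-speech/speech"

def build_tts_url(voice, inactivity_timeout=None):
    """Return the connection URL with encoded query parameters."""
    params = {"voice": voice}
    if inactivity_timeout is not None:
        # Optional: leave it out to keep the server default (20 seconds).
        params["inactivity_timeout"] = int(inactivity_timeout)
    return f"{BASE_URL}?{urlencode(params)}"

print(build_tts_url("Telnyx.NaturalHD.astra", inactivity_timeout=60))
# wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra&inactivity_timeout=60
```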

Authentication

Include your Telnyx API token as an Authorization header in the connection request:

Authorization: Bearer YOUR_TELNYX_TOKEN
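
A common mistake is a missing or doubled Bearer prefix. This sketch builds the header from a token and, as an assumption for illustration only, falls back to an environment variable named TELNYX_API_KEY. Both that variable name and the auth_headers helper are hypothetical.

```python
import os

def auth_headers(token=None):
    """Build the Authorization header for the connection request."""
    # TELNYX_API_KEY is a hypothetical variable name chosen for this sketch.
    token = (token or os.environ.get("TELNYX_API_KEY", "")).strip()
    if token.lower().startswith("bearer "):
        # Avoid "Bearer Bearer ..." when the prefix was already included.
        token = token[7:].strip()
    if not token:
        raise ValueError("missing Telnyx API token")
    return {"Authorization": f"Bearer {token}"}

print(auth_headers("YOUR_TELNYX_TOKEN"))
# {'Authorization': 'Bearer YOUR_TELNYX_TOKEN'}
```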

Example Connection

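import websockets

url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
headers = {
    "Authorization": "Bearer YOUR_TELNYX_TOKEN"
}

# Run this inside an async function. In websockets 14 and later, the default
# client takes additional_headers instead of extra_headers.
websocket = await websockets.connect(url, extra_headers=headers)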

Available Voices

The list of available voices can be found by querying the following endpoint

Connection Flow

The TTS WebSocket follows this lifecycle:

  1. Connect - Establish WebSocket connection with authentication
  2. Initialize - Send initialization frame with space character
  3. Send Text - Send one or more text frames to synthesize
  4. Receive Audio - Receive audio frames with base64-encoded mp3 data
  5. Stop - Send empty text frame to signal completion
  6. Close - Connection closes after processing completes
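
The outbound half of this lifecycle (steps 2, 3, and 5) can be sketched as a function that produces the JSON messages in order. The build_frames name is hypothetical. Skipping blank inputs is a choice made here so that a text frame is never mistaken for the initialization or stop frame.

```python
import json

def build_frames(texts):
    """Return the JSON messages for one session: init, text frames, stop."""
    frames = [{"text": " "}]  # step 2: initialization frame
    # step 3: one text frame per non-blank input
    frames += [{"text": t} for t in texts if t.strip()]
    frames.append({"text": ""})  # step 5: stop frame
    return [json.dumps(frame) for frame in frames]

for message in build_frames(["Hello.", "How are you?"]):
    print(message)
# {"text": " "}
# {"text": "Hello."}
# {"text": "How are you?"}
# {"text": ""}
```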

Flow Diagram

Client                          Server
|                               |
|------- Connect -------------->|
|<------ Connected -------------|
|                               |
|------- Init Frame ----------->|
|        {"text": " "}          |
|                               |
|------- Text Frame ----------->|
|        {"text": "Hello"}      |
|                               |
|<------ Audio Frame -----------|
|        {"audio": "base64..."} |
|<------ Audio Frame -----------|
|        {"audio": "base64..."} |
|                               |
|------- Stop Frame ----------->|
|        {"text": ""}           |
|                               |
|<------ Close -----------------|

Frame Types

Outbound Frames (Client → Server)

All outbound frames are JSON text messages with a "text" field. The value of that field determines the frame type:

1. Initialization Frame

Purpose: Initialize the TTS session

Format:

{
    "text": " "
}

Example:

import json

init_frame = {"text": " "}
await websocket.send(json.dumps(init_frame))

Notes:

  • Must be sent first after connection
  • Contains a single space character
  • Required to begin the session

2. Text Frame

Purpose: Send text content to be synthesized into speech

Format:

{
    "text": "Your text content here"
}

Example:

text_frame = {"text": "Hello, this is a test of the Telnyx TTS service."}
await websocket.send(json.dumps(text_frame))

Multiple Text Frames:

import asyncio

# You can send multiple text frames sequentially
frames = [
    {"text": "First sentence."},
    {"text": "Second sentence."},
    {"text": "Third sentence."}
]

for frame in frames:
    await websocket.send(json.dumps(frame))
    await asyncio.sleep(0.5)  # Optional pause between frames

Notes:

  • Can send multiple text frames in one session
  • Each frame is processed and synthesized separately
  • Audio is returned incrementally for each text frame
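
Because each frame is synthesized separately, a long passage has to be divided before it is sent. The sketch below splits at sentence-ending punctuation, which is an assumption about a sensible boundary rather than a documented requirement. The split_into_frames helper is hypothetical.

```python
import re

def split_into_frames(text):
    """Split text after '.', '!' or '?' and wrap each sentence in a text frame."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [{"text": sentence} for sentence in sentences if sentence]

frames = split_into_frames("First sentence. Second sentence! Third?")
print(frames)
# [{'text': 'First sentence.'}, {'text': 'Second sentence!'}, {'text': 'Third?'}]
```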

3. Stop Frame

Purpose: Signal completion of text input and end the session

Format:

{
    "text": ""
}

Example:

stop_frame = {"text": ""}
await websocket.send(json.dumps(stop_frame))

Notes:

  • Contains an empty string
  • Signals the server to finish processing
  • Should be sent after all text frames
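
The three outbound frame types differ only in the value of "text", so a stray space or empty string in user input changes what a frame means. As an illustration, this sketch classifies a serialized frame according to the rules above. The classify_frame name is hypothetical.

```python
import json

def classify_frame(message):
    """Return 'init', 'stop', or 'text' for an outbound JSON message."""
    text = json.loads(message)["text"]
    if text == " ":
        return "init"  # a single space starts the session
    if text == "":
        return "stop"  # an empty string ends the session
    return "text"

print(classify_frame('{"text": " "}'))      # init
print(classify_frame('{"text": "Hello"}'))  # text
print(classify_frame('{"text": ""}'))       # stop
```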

Inbound Frames (Server → Client)

The server sends JSON text messages containing synthesized audio data.

Audio Frame

Purpose: Deliver synthesized audio data

Format:

{
    "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAAB9AAACABAAZGF0YQAAAAA="
}

Processing Audio:

import base64
import json

async for message in websocket:
    data = json.loads(message)

    if "audio" in data:
        # Decode base64 audio
        audio_bytes = base64.b64decode(data["audio"])

        # Save or process audio
        with open("output.mp3", "ab") as f:
            f.write(audio_bytes)

Audio Specifications:

Property     Value
Format       mp3
Sample Rate  16 kHz
Bit Depth    16-bit
Channels     Mono (1)
Encoding     Base64

Notes:

  • Multiple audio frames may be received for a single text input
  • Each audio chunk is a complete mp3 file with headers
  • Chunks should be concatenated in the order received
  • Use append mode when saving to file to preserve all audio
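
When the audio is needed in memory rather than on disk, the same decode-and-concatenate logic can be kept apart from the socket code. This is a sketch only: decode_audio_frame and concatenate are hypothetical helpers, and treating a message without an "audio" key as "no audio" is an assumption about how to handle other server messages.

```python
import base64
import json

def decode_audio_frame(message):
    """Return the decoded audio bytes, or None if the message carries no audio."""
    data = json.loads(message)
    audio = data.get("audio")
    if not audio:
        return None
    return base64.b64decode(audio)

def concatenate(messages):
    """Join the audio chunks in the order the messages were received."""
    buffer = bytearray()
    for message in messages:
        chunk = decode_audio_frame(message)
        if chunk is not None:
            buffer.extend(chunk)
    return bytes(buffer)

# Stand-in payloads; real frames carry base64-encoded mp3 data.
sample = [
    json.dumps({"audio": base64.b64encode(b"abc").decode()}),
    json.dumps({"status": "ignored"}),
    json.dumps({"audio": base64.b64encode(b"def").decode()}),
]
print(concatenate(sample))  # b'abcdef'
```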

Complete Example

Here's a complete example showing all frame types in sequence:

import asyncio
import json
import base64
import websockets

async def tts_example():
    # 1. Connect to WebSocket
    url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
    headers = {
        "Authorization": "Bearer YOUR_TELNYX_TOKEN"
    }

    # In websockets 14 and later, pass additional_headers instead of extra_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        print("Connected to TTS WebSocket")

        # 2. Send initialization frame
        init_frame = {"text": " "}
        await ws.send(json.dumps(init_frame))
        print("Sent: Initialization frame")

        # 3. Send text frame
        text_frame = {"text": "Hello, welcome to Telnyx Text-to-Speech streaming."}
        await ws.send(json.dumps(text_frame))
        print("Sent: Text frame")

        # 4. Receive audio frames
        audio_count = 0
        stop_sent = False
        async for message in ws:
            data = json.loads(message)

            if "audio" in data:
                audio_count += 1
                audio_bytes = base64.b64decode(data["audio"])

                # Append audio chunks to file
                with open("output.mp3", "ab") as f:
                    f.write(audio_bytes)

                print(f"Received: Audio frame #{audio_count} ({len(audio_bytes)} bytes)")

                # After receiving audio, send the stop frame exactly once
                if audio_count >= 3 and not stop_sent:  # Adjust based on your needs
                    # 5. Send stop frame
                    stop_frame = {"text": ""}
                    await ws.send(json.dumps(stop_frame))
                    stop_sent = True
                    print("Sent: Stop frame")

    print("Connection closed")

asyncio.run(tts_example())

Expected Output:

Connected to TTS WebSocket
Sent: Initialization frame
Sent: Text frame
Received: Audio frame #1 (8192 bytes)
Received: Audio frame #2 (6144 bytes)
Received: Audio frame #3 (4096 bytes)
Sent: Stop frame
Connection closed
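
For longer inputs it can help to send text and read audio at the same time instead of waiting on a frame count. The following is a sketch of that pattern with asyncio.gather. It assumes the server finishes pending audio after the stop frame and then closes, as the lifecycle describes. FakeSocket is a hypothetical stand-in that echoes text back as "audio" so the sketch runs without a network connection; it does not model real server behavior. In real use, pass the connected WebSocket as ws.

```python
import asyncio
import base64
import json

async def send_text(ws, sentences):
    """Send the init frame, the text frames, and the stop frame in order."""
    await ws.send(json.dumps({"text": " "}))
    for sentence in sentences:
        await ws.send(json.dumps({"text": sentence}))
    await ws.send(json.dumps({"text": ""}))

async def receive_audio(ws):
    """Collect decoded audio until the connection closes."""
    audio = bytearray()
    async for message in ws:
        data = json.loads(message)
        if "audio" in data:
            audio.extend(base64.b64decode(data["audio"]))
    return bytes(audio)

async def synthesize(ws, sentences):
    # Run both directions concurrently so audio is read while text is sent.
    _, audio = await asyncio.gather(send_text(ws, sentences), receive_audio(ws))
    return audio

class FakeSocket:
    """Hypothetical test double: echoes each text frame back as an audio frame."""

    def __init__(self):
        self.queue = asyncio.Queue()

    async def send(self, message):
        text = json.loads(message)["text"]
        if text == "":
            await self.queue.put(None)  # stop frame: simulate the close
        elif text != " ":
            payload = base64.b64encode(text.encode()).decode()
            await self.queue.put(json.dumps({"audio": payload}))

    def __aiter__(self):
        return self

    async def __anext__(self):
        message = await self.queue.get()
        if message is None:
            raise StopAsyncIteration
        return message

async def demo():
    # Create the fake socket inside the running event loop.
    return await synthesize(FakeSocket(), ["Hello. ", "World."])

print(asyncio.run(demo()))  # b'Hello. World.'
```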

Configuration Summary

Required Configuration

# WebSocket URL
ENDPOINT = "wss://api.telnyx.com/v2/text-to-speech/speech"
VOICE_ID = "Telnyx.NaturalHD.astra"
URL = f"{ENDPOINT}?voice={VOICE_ID}"

# Authentication Header
TELNYX_TOKEN = "YOUR_TELNYX_TOKEN"  # Replace with your API token
HEADERS = {
    "Authorization": f"Bearer {TELNYX_TOKEN}"
}

Message Sequence

# 1. Initialization
{"text": " "}

# 2. Text to synthesize (can send multiple)
{"text": "Your text here"}

# 3. Stop signal
{"text": ""}

Demo Project

A complete Python implementation is available under the link.

Troubleshooting

Issue              Solution
Connection fails   Verify token format: Bearer YOUR_TOKEN
No audio received  Ensure the initialization frame is sent first
Audio is garbled   Check base64 decoding and file append mode
Empty audio file   Confirm the text frame contains valid content
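
Several of these issues can be caught before connecting. This sketch checks a header dictionary and a planned list of frames against the first, second, and fourth rows of the table. The preflight function and its messages are hypothetical and only mirror the rules stated in this document.

```python
def preflight(headers, frames):
    """Return a list of likely problems; an empty list means none were found."""
    problems = []
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer ") or not auth[7:].strip():
        problems.append("Authorization header must be 'Bearer YOUR_TOKEN'")
    if not frames or frames[0].get("text") != " ":
        problems.append("first frame must be the initialization frame")
    # Any frame after the first with non-blank text counts as content.
    if not any(f.get("text", "").strip() for f in frames[1:]):
        problems.append("no text frame with content")
    return problems

good = [{"text": " "}, {"text": "Hello"}, {"text": ""}]
print(preflight({"Authorization": "Bearer YOUR_TOKEN"}, good))  # []
print(len(preflight({"Authorization": "YOUR_TOKEN"}, [{"text": ""}])))  # 3
```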

Additional Resources