Inworld AI

A full-stack voice runtime for building human-sounding AI agents.

Categories: Voice, Support, Companion
Pricing: FREEMIUM
Source: Proprietary
Hosting: Cloud
Platforms: API
Models: Multi-model
Verified: Jun 14, 2026

A developer platform for real-time voice AI — an integrated STT + LLM + TTS pipeline exposed through REST and WebSocket APIs (OpenAI Realtime-compatible) for companions, character chat, support, and phone agents. Beyond the voice stack it offers a model Router, inference, and compute, with cloud and enterprise on-prem deployment.
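As a rough illustration of what an integrated STT + LLM + TTS pipeline means, here is a minimal sketch of one conversational turn passing through the three stages. It does not call Inworld's API. `VoicePipeline`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins invented for this example, and a real runtime would stream audio over a WebSocket rather than pass whole byte strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical three-stage voice loop: speech in, speech out."""
    stt: Callable[[bytes], str]   # audio -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages (assumptions): "audio" is just UTF-8 text here so the
# example runs without any speech models or network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.respond(b"hello")
```

Keeping each stage behind a plain callable mirrors the "pluggable" idea: any stage can be swapped without touching the other two.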

Capabilities 6

What it actually does — grouped by capability family.

Voice agent (primary capability)

LLM gateway / routing (secondary capability)
Model inference / serving (secondary capability)

Speech synthesis (TTS) (primary capability)
Transcription (STT) (secondary capability)
Voice cloning (secondary capability)

Pros & cons

Pros:
Integrated full-stack voice pipeline
OpenAI Realtime-compatible API
Aggressive usage-based pricing at scale
Free on-demand tier for prototyping

Cons:
Developer API, not an end-user app
Pivoted from its original character-engine focus
Voice quality varies by model tier

Further reading

Vapi
Category: Voice
Pricing: FREEMIUM
Voice agent infrastructure. Build a phone-agent in a weekend.
Production voice-agent platform — telephony, STT, LLM, TTS, and interrupt handling stitched together so you call an endpoint and get a working phone agent. Pluggable models at every layer.
Pro: Telephony and interrupts handled
Con: Per-minute costs stack across layers
- voice-agents
- telephony
- phone
- real-time
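
The "per-minute costs stack across layers" caveat is easier to see with numbers. The sketch below adds up per-minute rates for each layer of a phone agent. Every rate in `HYPOTHETICAL_RATES_CENTS` is a made-up placeholder, not Vapi's or any provider's actual pricing, and the function names are invented for this example.

```python
from typing import Mapping

# Assumed, illustrative rates in whole cents per minute (NOT real prices).
HYPOTHETICAL_RATES_CENTS = {
    "platform": 5,
    "telephony": 1,
    "stt": 1,
    "llm": 2,
    "tts": 3,
}


def stacked_rate_cents(layer_rates: Mapping[str, int]) -> int:
    """Total per-minute rate once every layer's fee is added together."""
    return sum(layer_rates.values())


def call_cost_cents(layer_rates: Mapping[str, int], minutes: int) -> int:
    """Cost of one call of the given length, in cents."""
    return stacked_rate_cents(layer_rates) * minutes


per_minute = stacked_rate_cents(HYPOTHETICAL_RATES_CENTS)
ten_minute_call = call_cost_cents(HYPOTHETICAL_RATES_CENTS, 10)
```

With these placeholder numbers the platform fee alone is 5 cents per minute, but the stacked rate is 12, which is why comparing headline platform prices can understate the real cost of a call.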

Retell AI
Category: Voice
Pricing: FREEMIUM
Build, test, and deploy AI voice agents for phone calls.
A no-code platform for humanlike voice agents that handle inbound and outbound phone calls — receptionists, IVR, and outbound campaigns. It bundles telephony (SIP / Twilio), a proprietary turn-taking model for low-latency conversations, prompts, tools, and call analytics. Pay-as-you-go pricing with free starter credits.
Pro: Inbound and outbound call handling
Con: Per-minute costs stack with LLM/TTS
- voice-agents
- telephony
- call-automation
- no-code

Cartesia
Category: Voice
Pricing: FREEMIUM
Low-latency streaming text-to-speech for real-time voice.
Streaming-first speech synthesis built around the Sonic family of state-space models. Aims at real-time agent voices where latency between turns is the product. Strong choice for sub-200ms voice loops.
Pro: Streaming over WebSocket for fast first audio
Con: Long-form expressive texture trails ElevenLabs
- tts
- streaming
- low-latency
- real-time

ElevenLabs
Category: Voice
Pricing: FREEMIUM
Text-to-speech, voice cloning, and multilingual dubbing.
Hosted speech synthesis at near-human quality — TTS, voice cloning, multilingual dubbing, and conversational voice agents. Default choice when you need a voice that sounds like a person, not a robot.
Pro: Best-in-class voice realism
Con: Pricier than commodity TTS at scale
- tts
- voice-cloning
- dubbing
- multilingual
