DeepInfra

Pay-as-you-go API access to open and proprietary AI models.

Category: Inference
Pricing: PAID
Source: Proprietary
Hosting: Cloud
Platforms: APIWeb
Models: Multi-model
Verified: Jun 15, 2026

DeepInfra is a cloud inference platform that lets developers run open and proprietary models through a simple, OpenAI-compatible API without managing hardware. It serves text generation, embeddings, image/audio/video, and speech models with token-based, pay-as-you-go pricing, and offers DeepCluster dedicated NVIDIA GPU capacity for heavier workloads. It is SOC 2 and ISO 27001 certified with a zero data-retention policy.

Capabilities 6

What it actually does — grouped by capability family.

Model inference / serving (primary capability)
Multi-model access (primary capability)

Embeddings (secondary capability)

Transcription (STT) (secondary capability)
Speech synthesis (TTS) (secondary capability)

Text-to-image (secondary capability)

Pros & cons

100+ models behind one OpenAI-compatible API
Dedicated GPU clusters (DeepCluster) available
SOC 2 / ISO 27001, zero data retention
No hardware to manage

Pay-as-you-go only, no free tier
Skews toward open models
Not a fine-tuning-first platform

View Together AI details
InferenceFREEMIUM
Together AI
Together
Hosted inference and fine-tuning for open-weights models.
Hosted inference and fine-tuning across hundreds of open-weights models (Llama, Mistral, DeepSeek, Qwen, etc.). Strong pricing for inference-at-scale; LoRA + full fine-tuning supported.
LoRA and full fine-tuning
Open models only, no frontier closed models
- inference
- fine-tuning
- open-weights
- lora
Open
View Fireworks AI details
InferenceFREEMIUM
Fireworks AI
Fireworks AI
Fast inference + fine-tuning. Production deployments at scale.
Optimized inference platform for open-weights models with strong latency numbers and serverless + dedicated deployment options. Fine-tuning supported; vision and audio models alongside text.
Custom FireAttention inference stack
Usage pricing scales with traffic
- inference
- fine-tuning
- low-latency
- production
Open
View Groq details
InferenceFREEMIUM
Groq
Groq
Low-latency inference for open-weights models on custom LPU chips.
GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.
Hundreds of tokens/sec on open models
Curated open-weight models only
- inference
- low-latency
- lpu
- open-weights
Open
View Replicate details
InferenceFREEMIUM
Replicate
Replicate
Run, fine-tune, and deploy thousands of open models via one API.
A platform to run open-source models with one API call — image, video, audio, and language — plus fine-tuning and custom deploys with pay-per-second billing. No infra to manage.
Image, video, audio, and language models
Cold starts on less-popular models
- model-hosting
- fine-tuning
- api
- open-source
Open

Open DeepInfra

DeepInfra

Capabilities 6

Pros & cons

Tags

Further reading

Together AI

Fireworks AI

Groq

Replicate