Inception Labs

Diffusion LLMs for ultra-fast text and code.

Category: Inference
Pricing: PAID
Hosting: Cloud
Platforms: API, Web
Models: Single model (proprietary)
Verified: Jun 19, 2026

Inception Labs builds diffusion-based large language models (dLLMs) that generate tokens in parallel rather than sequentially. The company claims several times the speed and under half the cost of conventional autoregressive LLMs at comparable quality. Its Mercury family, which includes the Mercury 2 reasoning model and Mercury Edit for code, is served through an OpenAI-compatible API and is also available on AWS Bedrock and Azure. The Stanford spinout, led by Stefano Ermon, raised $50M from Menlo Ventures, with angel investors including Andrew Ng and Andrej Karpathy.

Pros & cons

  Pros:
  • 1,000+ tokens/sec throughput
  • Lower per-token cost than peers
  • OpenAI-compatible API
  • Available on Bedrock and Azure

  Cons:
  • Own model family only (Mercury)
  • Newer, less battle-tested than GPT/Claude
  • Paid API, no large free tier

Further reading

  • Groq
    Inference · FREEMIUM

    Ultra-fast inference on custom LPU chips. Open-weights at 500+ tokens/sec.

    GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.

    Pro: Hundreds of tokens/sec on open models
    Con: Curated open-weight models only
    • inference
    • low-latency
    • lpu
    • open-weights
  • Cerebras (Cerebras Systems)
    Inference · FREEMIUM

    Wafer-scale inference cloud for open models.

    Inference cloud that serves open-weight models such as Llama, Qwen, DeepSeek, and gpt-oss on Cerebras's wafer-scale CS-3 hardware, reaching token throughput far above GPU clouds. Exposes an OpenAI-compatible API with a free daily tier and pay-per-token pricing.

    Pro: Highest tokens/sec in the market
    Con: Smaller model catalog than Groq/Together
    • inference
    • fast-inference
    • wafer-scale
    • open-models
  • Fireworks AI
    Inference · FREEMIUM

    Fast inference + fine-tuning. Production deployments at scale.

    Optimized inference platform for open-weights models with strong latency numbers and serverless + dedicated deployment options. Fine-tuning supported; vision and audio models alongside text.

    Pro: Custom FireAttention inference stack
    Con: Usage pricing scales with traffic
    • inference
    • fine-tuning
    • low-latency
    • production
  • Together AI (Together)
    Inference · FREEMIUM

    Fine-tuning + inference for open-weights models. Broad coverage.

    Hosted inference and fine-tuning across hundreds of open-weights models (Llama, Mistral, DeepSeek, Qwen, etc.). Strong pricing for inference-at-scale; LoRA + full fine-tuning supported.

    Pro: Hundreds of open-weights models
    Con: Open models only, no frontier closed models
    • inference
    • fine-tuning
    • open-weights
    • lora