Skip to content

Inference AI apps

Model serving, inference engines, and LLM gateways — run, route, and scale models in production.

20 apps · researched & kept current by Claude Code

Filter & search these 20 apps
  • View Lambda details
    InferencePAID

    Lambda

    Lambda

    The Superintelligence Cloud — on-demand GPUs, 1-Click Clusters, and superclusters.

    Lambda is a GPU cloud for AI training and inference, spanning on-demand HGX B200 and H100 instances, self-serve 1-Click Clusters, and single-tenant superclusters built on NVIDIA's latest generations. A GPU specialist since 2012, it sells compute by the hour without long-term hyperscaler contracts and co-engineers large deployments with NVIDIA.

    Worth knowing

    Founded in 2012 selling deep-learning workstations; its $1.5B+ late-2025 Series E was led by TWG Global to build gigawatt 'AI factories'.

    • gpu-cloud
    • training
    • clusters
    • nvidia
    • +1
  • View CoreWeave details
    InferencePAID

    CoreWeave

    CoreWeave

    The AI hyperscaler — GPU cloud built for large-scale training and inference.

    CoreWeave is a purpose-built AI cloud renting large-scale NVIDIA GPU capacity for training and inference, layered with managed Kubernetes, AI object storage, and Mission Control observability. Public on Nasdaq since March 2025, it counts most leading AI labs — including OpenAI, Meta, and Anthropic — among its customers, with a contracted revenue backlog reported near $100B in 2026.

    Worth knowing

    Started in 2017 as Ethereum-mining startup Atlantic Crypto before pivoting to AI cloud; went public on Nasdaq in March 2025.

    • gpu-cloud
    • ai-hyperscaler
    • training
    • inference
    • +1
  • View Hyperbolic details
    InferenceFREEMIUM

    Hyperbolic

    Hyperbolic

    Open-access AI cloud: serverless inference + a GPU marketplace.

    An AI cloud offering serverless inference for open models (Llama, Qwen, DeepSeek, SDXL, Flux) behind an OpenAI-compatible API, alongside an on-demand GPU marketplace for H100/H200 rentals. It aggregates idle and reserved compute to price inference and GPU hours below centralized clouds. Aimed at developers training, fine-tuning, and serving open-weights models.

    Worth knowing

    Crypto-VC-backed (Polychain, Variant); a blockchain network pooling idle GPUs, with Hugging Face and Quora as early users.

    • inference
    • gpu-cloud
    • open-weights
    • marketplace
  • View Runpod details
    InferencePAID

    Runpod

    Runpod

    GPU cloud for AI — on-demand instances and serverless inference.

    Runpod is an AI developer cloud for renting GPUs on demand or running auto-scaling serverless inference endpoints. Serverless workers bill by the millisecond, scale to zero when idle, and advertise sub-200ms cold starts; on-demand Pods and multi-node Clusters cover training and long-running jobs. A Community Cloud tier offers cheaper, peer-sourced GPUs alongside the vendor-operated Secure Cloud.

    Worth knowing

    Bootstrapped from a Reddit post by two ex-Comcast developers, it hit $120M ARR before ever raising a Series A.

    • gpu-cloud
    • serverless
    • inference
    • deployment
    • +1
  • View Portkey details
    InferenceFREEMIUMOpen core

    Portkey

    Portkey

    AI gateway with observability, guardrails, and governance.

    A production AI gateway that gives apps and agents unified access to 1,600+ LLMs across providers behind a single API, with built-in observability, prompt management, guardrails, and governance. Portkey adds routing, caching, fallbacks, cost limits, PII redaction, RBAC, and an MCP gateway. Its core gateway is open-source; run it self-hosted/hybrid or use the managed cloud, which offers a free tier.

    Worth knowing

    Palo Alto Networks acquired Portkey (closed May 2026) to fold its open-source AI gateway into Prisma AIRS agent security.

    • ai-gateway
    • llm-routing
    • observability
    • guardrails
  • View vLLM details
    InferenceFREEOSS

    vLLM

    vLLM Project

    High-throughput, memory-efficient inference engine for LLMs.

    Open-source (Apache-2.0) serving engine for large language and vision-language models, originally from UC Berkeley's Sky Computing Lab. Its PagedAttention KV-cache management and continuous batching deliver high throughput on commodity GPUs. Now a community project with 1000s of contributors and an OpenAI-compatible server.

    Worth knowing

    Started at UC Berkeley's Sky Computing Lab (2023 PagedAttention paper) and became a PyTorch Foundation-hosted project in 2025.

    • inference
    • model-serving
    • gpu
    • open-source
    • +1
  • View LiteLLM details
    InferenceFREEMIUMOpen core

    LiteLLM

    BerriAI

    AI gateway: call 100+ LLMs in one OpenAI-format interface.

    Open-source Python SDK and proxy server (AI gateway) that exposes 100+ LLM providers through a single OpenAI-compatible API, with cost tracking, load balancing, fallbacks, caching, and guardrails. Self-host the proxy or use the managed cloud; a paid Enterprise tier adds SSO, audit logs, and support.

    Worth knowing

    Built by YC W23 startup BerriAI; used in production by Netflix, Adobe, and Stripe, with 45k+ GitHub stars.

    • gateway
    • proxy
    • routing
    • open-source
    • +1
  • View Morph details
    InferenceFREEMIUM

    Morph

    Morph

    Fast models that apply AI code edits to files in milliseconds.

    Infrastructure for coding agents centered on Fast Apply, a specialized model that merges AI-generated edits into files at ~10,500 tokens/sec instead of full-file rewrites or brittle search-and-replace. Also serves WarpGrep code search, context compaction, and a model router via an OpenAI-compatible API. Used in production by JetBrains, Vercel, and Webflow.

    Worth knowing

    A Y Combinator S23 company (legal entity AutoInfra), founded by Tejas Bhakta.

    • code-editing
    • fast-apply
    • coding-agents
    • api
  • View SambaNova Cloud details
    InferenceFREEMIUM

    SambaNova Cloud

    SambaNova Systems

    Fast inference for open models on custom RDU chips.

    Inference cloud running open-weight models — Llama, DeepSeek, Qwen, gpt-oss — on SambaNova's RDU hardware at hundreds of tokens per second, including full-precision Llama 405B. Provides an OpenAI-compatible API with a free tier and pay-per-token pricing.

    Worth knowing

    Founded by Stanford professors Kunle Olukotun and Chris Re; raised $676M at a $5.1B valuation led by SoftBank in 2021.

    • inference
    • fast-inference
    • open-models
    • rdu
  • View Baseten details
    InferenceFREEMIUM

    Baseten

    Baseten

    Inference cloud for serving any AI model in production.

    Production inference platform offering both pre-optimized Model APIs (Llama, DeepSeek, and more, billed per token) and dedicated GPU/CPU deployments for custom models, billed per minute with no charge for idle time. Custom models are packaged with its open-source Truss format and autoscale, including scale-to-zero. Aimed at low-latency, high-throughput serving.

    Worth knowing

    Raised a $300M Series E in Jan 2026 at a $5B valuation, with Nvidia investing $150M of it.

    • inference
    • model-serving
    • gpu
    • autoscaling
  • View Cerebras details
    InferenceFREEMIUM

    Cerebras

    Cerebras Systems

    Wafer-scale inference cloud for open models.

    Inference cloud that serves open-weight models such as Llama, Qwen, DeepSeek, and gpt-oss on Cerebras's wafer-scale CS-3 hardware, reaching token throughput far above GPU clouds. Exposes an OpenAI-compatible API with a free daily tier and pay-per-token pricing.

    Worth knowing

    Its 2024 IPO filing revealed ~87% of H1 revenue came from a single customer, UAE-based G42; it went public in 2026.

    • inference
    • fast-inference
    • wafer-scale
    • open-models
  • View Replicate details
    InferenceFREEMIUM

    Replicate

    Replicate

    Run, fine-tune, and deploy thousands of open models via one API.

    A platform to run open-source models with one API call — image, video, audio, and language — plus fine-tuning and custom deploys with pay-per-second billing. No infra to manage.

    Worth knowing

    Co-founded by Ben Firshman, who built the original Docker Compose; its Cog packaging format is essentially 'Docker for machine learning.'

    • model-hosting
    • fine-tuning
    • api
    • open-source
  • View Groq details
    InferenceFREEMIUM

    Groq

    Groq

    Ultra-fast inference on custom LPU chips. Open-weights at 500+ tokens/sec.

    GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.

    Worth knowing

    Nvidia agreed a ~$20B cash deal on Dec 24, 2025 to license Groq's LPU IP and acquihire its team — Nvidia's largest deal ever.

    • inference
    • low-latency
    • lpu
    • open-weights
  • View fal details
    InferenceFREEMIUM

    fal

    fal

    Serverless inference API for image, video, audio, and 3D models.

    A generative-media inference platform exposing FLUX, Kling, Veo, Wan, Stable Diffusion, and 600+ image/video/audio/3D models through one fast, serverless API — no GPUs to manage and near-zero cold starts. Pay per output or per GPU-second; free starter credits to test. Popular as the production backend for AI media features.

    Worth knowing

    Raised a $140M Series D led by Sequoia in December 2025 at a $4.5B valuation.

    • generative-media
    • image-gen
    • video-gen
    • serverless
  • View OpenRouter details
    InferenceFREEMIUM

    OpenRouter

    OpenRouter

    One OpenAI-compatible API in front of 300+ models from every provider.

    A unified gateway that routes a single endpoint and API key to models from Anthropic, OpenAI, Google, Meta, DeepSeek, xAI, and more — swap models by changing one parameter, with automatic fallbacks and one consolidated bill. Pass-through token pricing plus dozens of free models.

    Worth knowing

    Founded by OpenSea co-founder Alex Atallah; hit unicorn status in 2025 with a $113M Series B led by Alphabet's CapitalG.

    • gateway
    • routing
    • multi-model
    • fallbacks
  • View LM Studio details
    InferenceFREE

    LM Studio

    LM Studio

    Desktop app to discover, download, and run local LLMs privately.

    A GUI for running open-weight models on your own hardware — browse and download GGUF/MLX models, chat offline, and expose an OpenAI- and Anthropic-compatible local server for your apps. Includes RAG over local files, MCP tool-use support, and dual llama.cpp + Apple MLX runtimes. Free for personal and commercial use; the app itself is proprietary.

    Worth knowing

    Made by Brooklyn's Element Labs; founder Yagil Burowski is an ex-Apple engineer who shipped the first release in May 2023.

    • local
    • llm-runner
    • gui
    • privacy
  • View Ollama details
    InferenceFREEMIUMOpen core

    Ollama

    Ollama

    Run open-weight LLMs locally with one command. OpenAI-compatible API.

    The de-facto way to pull and run open-weight models (Llama, Qwen, Gemma, DeepSeek, gpt-oss) on your own machine — no API key, no data leaving the device. Ships native macOS/Windows/Linux apps, an OpenAI-compatible server, and official Python/JS libraries. MIT-licensed and free locally; an optional paid Ollama Cloud runs larger models.

    Worth knowing

    Built by Jeffrey Morgan and Michael Chiang, creators of Kitematic — the early UI Docker acquired and turned into Docker Desktop.

    • local
    • open-source
    • llm-runner
    • self-hosted
  • View Together AI details
    InferenceFREEMIUM

    Together AI

    Together

    Fine-tuning + inference for open-weights models. Broad coverage.

    Hosted inference and fine-tuning across hundreds of open-weights models (Llama, Mistral, DeepSeek, Qwen, etc.). Strong pricing for inference-at-scale; LoRA + full fine-tuning supported.

    Worth knowing

    Co-founded by Stanford's Percy Liang and FlashAttention author Tri Dao; raised $305M at a $3.3B valuation.

    • inference
    • fine-tuning
    • open-weights
    • lora
  • View Modal details
    InferenceFREEMIUM

    Modal

    Modal Labs

    Serverless GPUs. Run training, inference, batch jobs from Python.

    Define cloud workloads in Python, deploy with one command — GPU access on demand, fast cold starts, fair-share pricing. The default 'I need to fine-tune a model from a Jupyter cell' platform.

    Worth knowing

    Co-founded by Erik Bernhardsson, who built Spotify's recommender; raised a $355M Series C at a $4.65B valuation in 2026.

    • gpu
    • serverless
    • python
    • training
  • View Fireworks AI details
    InferenceFREEMIUM

    Fireworks AI

    Fireworks AI

    Fast inference + fine-tuning. Production deployments at scale.

    Optimized inference platform for open-weights models with strong latency numbers and serverless + dedicated deployment options. Fine-tuning supported; vision and audio models alongside text.

    Worth knowing

    Founded by the Meta team that built PyTorch; hit a $4B valuation in its Oct 2025 raise.

    • inference
    • fine-tuning
    • low-latency
    • production