Loading…
Inference · Runpod
GPU cloud for AI — on-demand instances and serverless inference.
Runpod is an AI developer cloud for renting GPUs on demand or running auto-scaling serverless inference endpoints. Serverless workers bill by the millisecond, scale to zero when idle, and advertise sub-200ms cold starts; on-demand Pods and multi-node Clusters cover training and long-running jobs. A Community Cloud tier offers cheaper, peer-sourced GPUs alongside the vendor-operated Secure Cloud.
Model support
Raw GPU compute — deploy and serve any model or container you bring.
Where it runs
Tags
Related in Inference
Baseten
Inference cloud for serving any AI model in production.
Production inference platform offering both pre-optimized Model APIs (Llama, DeepSeek, and more, billed per token) and dedicated GPU/CPU deployments for custom models, billed per minute with no charge for idle time. Custom models are packaged with its open-source Truss format and autoscale, including scale-to-zero. Aimed at low-latency, high-throughput serving.
AI insight: Models use its open-source 'Truss' packaging and scale to zero, so you pay per minute of active compute, not for idle GPUs.
Cerebras Systems
Wafer-scale inference cloud for open models.
Inference cloud that serves open-weight models such as Llama, Qwen, DeepSeek, and gpt-oss on Cerebras's wafer-scale CS-3 hardware, reaching token throughput far above GPU clouds. Exposes an OpenAI-compatible API with a free daily tier and pay-per-token pricing.
AI insight: Runs models on a single dinner-plate-sized wafer instead of GPU clusters, hitting ~2,000 tokens/sec where GPU clouds plateau far lower.
BerriAI
AI gateway: call 100+ LLMs in one OpenAI-format interface.
Open-source Python SDK and proxy server (AI gateway) that exposes 100+ LLM providers through a single OpenAI-compatible API, with cost tracking, load balancing, fallbacks, caching, and guardrails. Self-host the proxy or use the managed cloud; a paid Enterprise tier adds SSO, audit logs, and support.
AI insight: Translates 100+ providers into one OpenAI-format call, so many other AI tools quietly embed it as their model-routing layer.
Morph
Fast models that apply AI code edits to files in milliseconds.
Infrastructure for coding agents centered on Fast Apply, a specialized model that merges AI-generated edits into files at ~10,500 tokens/sec instead of full-file rewrites or brittle search-and-replace. Also serves WarpGrep code search, context compaction, and a model router via an OpenAI-compatible API. Used in production by JetBrains, Vercel, and Webflow.
AI insight: Its Fast Apply model merges LLM code edits at ~10,500 tok/s — the dedicated write layer agents use instead of slow full-file rewrites.
SambaNova Systems
Fast inference for open models on custom RDU chips.
Inference cloud running open-weight models — Llama, DeepSeek, Qwen, gpt-oss — on SambaNova's RDU hardware at hundreds of tokens per second, including full-precision Llama 405B. Provides an OpenAI-compatible API with a free tier and pay-per-token pricing.
AI insight: One of the few clouds serving Llama 405B in native 16-bit precision at 100+ tokens/sec, not a quantized copy.
vLLM Project
High-throughput, memory-efficient inference engine for LLMs.
Open-source (Apache-2.0) serving engine for large language and vision-language models, originally from UC Berkeley's Sky Computing Lab. Its PagedAttention KV-cache management and continuous batching deliver high throughput on commodity GPUs. Now a community project with 1000s of contributors and an OpenAI-compatible server.
AI insight: PagedAttention pages the KV cache like OS virtual memory, slashing waste — the trick that made it the default open-source serving engine.
fal
Serverless inference API for image, video, audio, and 3D models.
A generative-media inference platform exposing FLUX, Kling, Veo, Wan, Stable Diffusion, and 600+ image/video/audio/3D models through one fast, serverless API — no GPUs to manage and near-zero cold starts. Pay per output or per GPU-second; free starter credits to test. Popular as the production backend for AI media features.
AI insight: Specializes in generative-media latency — FLUX, Kling, Veo and 600+ media models — where general inference hosts focus on text.
Groq
Ultra-fast inference on custom LPU chips. Open-weights at 500+ tokens/sec.
GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.
AI insight: Speed comes from custom LPU silicon, not GPUs — which is why it serves open models at hundreds of tokens/sec on an OpenAI-compatible API.
LM Studio
Desktop app to discover, download, and run local LLMs privately.
A GUI for running open-weight models on your own hardware — browse and download GGUF/MLX models, chat offline, and expose an OpenAI- and Anthropic-compatible local server for your apps. Includes RAG over local files, MCP tool-use support, and dual llama.cpp + Apple MLX runtimes. Free for personal and commercial use; the app itself is proprietary.
AI insight: Free even for commercial use, though the app itself is closed-source — and it serves both OpenAI- and Anthropic-compatible local APIs.
Ollama
Run open-weight LLMs locally with one command. OpenAI-compatible API.
The de-facto way to pull and run open-weight models (Llama, Qwen, Gemma, DeepSeek, gpt-oss) on your own machine — no API key, no data leaving the device. Ships native macOS/Windows/Linux apps, an OpenAI-compatible server, and official Python/JS libraries. MIT-licensed and free locally; an optional paid Ollama Cloud runs larger models.
AI insight: Its OpenAI-compatible local server makes it a drop-in backend — point any app at localhost and swap the cloud for your own GPU.