Inference AI apps

Model serving, inference engines, and LLM gateways — run, route, and scale models in production.

42 apps · researched & kept current by Claude Code

Filter & search these 42 apps

View Relace details
InferenceFREEMIUM
Relace
Relace
Purpose-built AI models and infrastructure for coding agents.
Relace builds specialized models and infrastructure that slot into AI code-generation products. Its Instant Apply model merges partial diffs from frontier models into full files at thousands of tokens per second, and its two-stage code retrieval (embedding search plus a code reranker) finds the right context fast. Relace also offers managed repository hosting with automatic per-commit indexing, so coding agents get cheaper, faster, more reliable edits and search.
Two-stage code retrieval + reranker
No public pricing detail
- coding-agents
- instant-apply
- code-retrieval
- codegen
Open
View Liquid AI details
InferenceFREEMIUM
Liquid AI
Liquid AI
On-device foundation models (LFMs) plus LEAP, an edge platform to ship them to any device.
Liquid AI builds Liquid Foundation Models (LFMs) — compact, fast models designed to run on phones, laptops, and edge hardware rather than the cloud. Its LEAP platform lets developers discover, fine-tune, bundle, and deploy these models on-device through an Edge SDK, taking a model 'from concept to on-device in minutes.' The LFM2/LFM2.5 family spans 350M–8.3B parameters with a hybrid architecture tuned for low-latency local inference.
Runs on phones, laptops, edge hardware
Small models trail frontier cloud LLMs
- on-device
- edge-ai
- small-models
- open-weight
- +1
Open
View Tinfoil details
InferenceFREEMIUMOpen core
Tinfoil
Tinfoil
Verifiably private AI inference inside secure hardware enclaves.
Tinfoil runs open-weight LLMs inside confidential-computing GPU enclaves so that neither Tinfoil nor the cloud provider can see your prompts or data — and the setup is remotely attestable rather than a policy promise. It offers a private chat, an OpenAI-compatible inference API, and Tinfoil Containers for arbitrary Docker workloads, serving models like GPT-OSS, Llama 3.3, and Kimi K2. The software stack is open source and verifiable.
Hardware-enforced, verifiable privacy
Open-weight models only, no GPT-4/Claude
- confidential-computing
- privacy
- inference
- secure-enclave
- +1
Open
View Inception Labs details
InferencePAID
Inception Labs
Inception Labs
Diffusion LLMs for ultra-fast text and code.
Inception Labs builds diffusion-based large language models (dLLMs) that generate tokens in parallel rather than sequentially, claiming several times the speed and under half the cost of conventional autoregressive LLMs at comparable quality. Its Mercury family — including the Mercury 2 reasoning model and Mercury Edit for code — is served through an OpenAI-compatible API and also via AWS Bedrock and Azure. The Stanford spinout, led by Stefano Ermon, raised $50M from Menlo Ventures with angels including Andrew Ng and Andrej Karpathy.
1,000+ tokens/sec throughput
Own model family only (Mercury)
- inference
- diffusion-llm
- low-latency
- code
Open
View Novita AI details
InferencePAID
Novita AI
Novita AI
One API for many AI models, plus agent sandboxes and GPU cloud.
Novita AI is an AI and agent cloud for developers that combines serverless model APIs with on-demand compute. A single API serves 120+ text, image, audio, video, and vision models, while Agent Sandbox provides isolated runtimes for tool-using agents and the GPU cloud offers dedicated instances, serverless GPUs, and bare-metal clusters. It advertises sub-50ms time-to-first-token and startup-friendly, usage-based pricing.
120+ models behind one API
Usage-based, no standing free tier
- inference
- agent-sandbox
- gpu-cloud
- llm-api
Open
View DeepInfra details
InferencePAID
DeepInfra
DeepInfra
Pay-as-you-go API access to open and proprietary AI models.
DeepInfra is a cloud inference platform that lets developers run open and proprietary models through a simple, OpenAI-compatible API without managing hardware. It serves text generation, embeddings, image/audio/video, and speech models with token-based, pay-as-you-go pricing, and offers DeepCluster dedicated NVIDIA GPU capacity for heavier workloads. It is SOC 2 and ISO 27001 certified with a zero data-retention policy.
100+ models behind one OpenAI-compatible API
Pay-as-you-go only, no free tier
- inference
- open-models
- gpu-cloud
- llm-api
Open
View Lamini details
Fine-tuningPAID
Lamini
Lamini
Enterprise platform to tune and run open LLMs in your own environment.
Lamini is an enterprise LLM platform for fine-tuning open models and serving them, designed to run on-prem, in a VPC, or on Lamini's cloud — including on AMD GPUs. It pairs tuning (LoRA/PEFT and memory tuning to reduce hallucinations) with an inference stack and agentic pipelines, accessed via a Python client, REST API, or web UI. Built for teams that need to keep models and data in-house.
Keeps models and data fully in-house
Enterprise-focused pricing
- fine-tuning
- llm
- enterprise
- on-prem
Open
View TrueFoundry details
InfraPAID
TrueFoundry
TrueFoundry
Enterprise AI gateway and deployment platform that runs in your own cloud.
A unified platform for deploying, scaling, and governing LLM and agentic AI systems. It pairs an AI gateway that routes and orchestrates calls across providers with infrastructure for hosting models (vLLM, TGI, Triton), fine-tuning, and full-stack observability — deployed inside your own VPC, on-prem, or air-gapped environment with enterprise RBAC and audit logging.
Runs in your own cloud, on-prem, or air-gapped
Enterprise-oriented; no public free tier
- ai-gateway
- model-deployment
- mlops
- enterprise
- +1
Open
View MiniMax details
InferenceFREEMIUM
MiniMax
MiniMax
Multimodal foundation models and developer API for text, code, video, speech, and music.
MiniMax is a Shanghai foundation-model lab whose platform serves its own model family through a developer API and agent app: the M-series LLMs (M2/M3) built for coding and agentic workflows with up to a 1M-token context, the Hailuo video models, and MiniMax Speech and Music. Developers get chat completions, text-to-speech, and text-to-video on token-based pricing, with a free agent tier for getting started.
Coding- and agent-tuned M-series models
China-based; data-residency considerations
- foundation-models
- llm
- api
- video
- +1
Open
View Runware details
InferenceFREEMIUM
Runware
Runware
One pay-as-you-go API for multi-modal AI inference.
Runware is a unified AI inference platform that exposes 400K+ models — image, video, audio, text, and 3D — behind a single pay-as-you-go API. It runs a proprietary Sonic Inference Engine on custom GPU hardware, claiming sub-second cold starts and up to 10x lower cost per generation than typical hosted inference. A REST and WebSocket API plus a web playground let developers swap models without per-provider integrations.
400K+ models via one API
Proprietary, cloud-only
- inference
- api
- image-generation
- video-generation
- +1
Open
View Respan details
ObservabilityFREEMIUM
Respan
Keywords AI, Inc.
Self-driving observability, evals, and an AI gateway for LLM agents.
Respan is a unified LLM-engineering platform that routes, observes, and evaluates every model call from one control plane. Its AI gateway fronts 500+ models with fallbacks, load balancing, and spend caps; trace trees, dashboards, and alerts cover observability; and rule checks, AI judges, and human review run as one evaluation workflow. An automated eval agent surfaces issues from production traffic so teams can fix what breaks without stitching tools together.
Trace trees, dashboards, and alerts
Proprietary, no self-host option
- observability
- ai-gateway
- evals
- tracing
- +1
Open
View Vast.ai details
InferencePAID
Vast.ai
Vast.ai
GPU cloud marketplace for renting AI compute.
Vast.ai is a marketplace for renting GPUs, connecting people who need AI compute with hosts who list spare hardware — from solo owners to Tier-4 data centers. Prices are set by supply and demand, billed per second, and queryable through code. It supports on-demand, interruptible (spot), and reserved instances across tens of thousands of GPUs and dozens of GPU types.
Often the cheapest GPUs via marketplace
Host quality and reliability vary
- gpu
- gpu-marketplace
- cloud-compute
- spot-instances
Open
View Cohere details
InferenceFREEMIUM
Cohere
Cohere Inc.
Enterprise-grade LLMs, embeddings, and retrieval built for private deployment.
Cohere builds large language models for the enterprise rather than the consumer. Its Command models cover agentic, multimodal, and multilingual generation; Embed and Rerank power high-quality search and retrieval; Aya is a multilingual research family spanning 70+ languages; and North is a workplace AI platform built on top. Cohere's emphasis is data control — models can run in a private VPC, on-premises, or via a managed Model Vault.
Strong Rerank/Embed retrieval models
No consumer chat product to speak of
- llm
- embeddings
- rerank
- enterprise
- +2
Open
View Clarifai details
VisionFREEMIUM
Clarifai
Clarifai
Full-stack AI platform for computer vision and LLMs.
Clarifai is a full-stack AI platform for building with unstructured image, video, text, and audio data. It pairs production computer-vision models — classification, detection, visual search — with a model hub for LLMs, plus data labeling, training of custom models, and inference, all behind one API and console. A free Community tier lets you discover and run models before moving to paid usage plans.
Mature, end-to-end vision stack
Broad platform has a learning curve
- computer-vision
- model-hub
- data-labeling
- inference
- +1
Open
View Crusoe details
InferencePAID
Crusoe
Crusoe
Energy-first AI cloud and gigawatt-scale AI data centers.
Vertically integrated AI infrastructure company: Crusoe Cloud offers NVIDIA and AMD GPU clusters with managed Kubernetes and managed inference, while its data-center arm develops and powers gigawatt-scale 'AI factories' — including OpenAI's 1.2 GW Stargate campus in Abilene, Texas, which Crusoe built and co-owns.
Owns power generation and data centers end to end
Pricing is sales-led rather than self-serve
- gpu-cloud
- data-centers
- energy
- training
- +1
Open
View Nebius details
InferencePAID
Nebius
Nebius Group
Full-stack AI cloud for training and inference at scale.
AI-native cloud built around large NVIDIA GPU clusters — bare-metal and virtualized H100 through Blackwell hardware with InfiniBand networking, managed Slurm and Kubernetes, high-speed storage, and MLOps tooling. Its Token Factory layer adds managed per-token inference for open-weight models. Microsoft signed a five-year capacity deal with Nebius worth up to $19.4B.
Latest NVIDIA silicon, H100 through Blackwell
AI-only cloud — few general-purpose services
- gpu-cloud
- training
- inference
- ai-hyperscaler
Open
View Lambda details
InferencePAID
Lambda
Lambda
GPU cloud for AI training — on-demand GPUs, 1-Click Clusters, and superclusters.
Lambda is a GPU cloud for AI training and inference, spanning on-demand HGX B200 and H100 instances, self-serve 1-Click Clusters, and single-tenant superclusters built on NVIDIA's latest generations. A GPU specialist since 2012, it sells compute by the hour without long-term hyperscaler contracts and co-engineers large deployments with NVIDIA.
Single GPUs up to superclusters
No free tier
- gpu-cloud
- training
- clusters
- nvidia
- +1
Open
View CoreWeave details
InferencePAID
CoreWeave
CoreWeave
The AI hyperscaler — GPU cloud built for large-scale training and inference.
CoreWeave is a purpose-built AI cloud renting large-scale NVIDIA GPU capacity for training and inference, layered with managed Kubernetes, AI object storage, and Mission Control observability. Public on Nasdaq since March 2025, it counts most leading AI labs — including OpenAI, Meta, and Anthropic — among its customers, with a contracted revenue backlog reported near $100B in 2026.
Frontier-scale GPU capacity
Enterprise-oriented; no free tier
- gpu-cloud
- ai-hyperscaler
- training
- inference
- +1
Open
View Z.ai details
AssistantFREEMIUM
Z.ai
Z.ai (Zhipu AI)
Zhipu's GLM assistant — frontier open-weight chat, free to use.
Z.ai is the international assistant of Zhipu AI, built on the GLM model family. The free chat runs the flagship GLM-5 and GLM-5.1 models with reasoning and agentic modes, and the same models are served to developers over a low-cost, OpenAI-compatible API. GLM-5's open weights ship under the MIT license.
Free GLM-5 chat with reasoning modes
China-hosted data concerns
- chat
- assistant
- open-weights
- reasoning
- +1
Open
View Predibase details
Fine-tuningPAID
Predibase
Predibase (Rubrik)
Fine-tune open-source LLMs and serve them in production.
Predibase is an enterprise platform for fine-tuning open-source models and serving them in production. It pairs a post-training stack — supervised fine-tuning plus an end-to-end reinforcement fine-tuning (RFT) flow — with an optimized inference engine, and its open-source LoRAX framework serves many fine-tuned LoRA adapters from a single GPU. Runs as managed SaaS or inside your own VPC.
Fine-tune + serve in one place
Enterprise-priced
- fine-tuning
- lora
- rft
- inference
- +1
Open
View Hyperbolic details
InferenceFREEMIUM
Hyperbolic
Hyperbolic
Open-access AI cloud: serverless inference + a GPU marketplace.
An AI cloud offering serverless inference for open models (Llama, Qwen, DeepSeek, SDXL, Flux) behind an OpenAI-compatible API, alongside an on-demand GPU marketplace for H100/H200 rentals. It aggregates idle and reserved compute to price inference and GPU hours below centralized clouds. Aimed at developers training, fine-tuning, and serving open-weights models.
Serverless inference + GPU marketplace
Marketplace supply reliability varies
- inference
- gpu-cloud
- open-weights
- marketplace
Open
View Runpod details
InferencePAID
Runpod
Runpod
GPU cloud for AI — on-demand instances and serverless inference.
Runpod is an AI developer cloud for renting GPUs on demand or running auto-scaling serverless inference endpoints. Serverless workers bill by the millisecond, scale to zero when idle, and advertise sub-200ms cold starts; on-demand Pods and multi-node Clusters cover training and long-running jobs. A Community Cloud tier offers cheaper, peer-sourced GPUs alongside the vendor-operated Secure Cloud.
Serverless auto-scaling inference
Community Cloud less reliable/secure
- gpu-cloud
- serverless
- inference
- deployment
- +1
Open
View Glama details
MCPFREEMIUM
Glama
Glama
MCP server registry, inspector, and gateway.
A discovery and hosting hub for the Model Context Protocol ecosystem: browse a large indexed catalog of MCP servers, test them in an in-browser Inspector, and route them through a managed Gateway that handles credentials, logging, and analytics. Browsing and installing open-source servers locally is free; hosting servers on Glama's infrastructure and the Gateway's managed features are paid. Also ships an AI playground chat client over the connected tools.
Indexes 10000+ MCP servers, security-scored
Hosting and Gateway features are paid
- mcp
- registry
- gateway
- tool-use
- +1
Open
View Portkey details
InferenceFREEMIUMOpen core
Portkey
Portkey
AI gateway with observability, guardrails, and governance.
A production AI gateway that gives apps and agents unified access to 1,600+ LLMs across providers behind a single API, with built-in observability, prompt management, guardrails, and governance. Portkey adds routing, caching, fallbacks, cost limits, PII redaction, RBAC, and an MCP gateway. Its core gateway is open-source; run it self-hosted/hybrid or use the managed cloud, which offers a free tier.
One API across many providers
Acquired by Palo Alto Networks (closed 2025)
- ai-gateway
- llm-routing
- observability
- guardrails
Open
View Beam details
InfraFREEMIUM
Beam
Beam
On-demand serverless GPU compute for AI, from Python.
A serverless cloud for deploying AI inference endpoints, agent sandboxes, task queues, and containerized GPU workloads with a few lines of Python. It handles fast cold starts, autoscaling, and Docker-in-Docker execution across multiple cloud backends, and supports bring-your-own-compute. The Developer tier is free with recurring monthly credit; paid tiers add team features and scale, billed pay-as-you-go by GPU usage.
Define GPU workloads in pure Python
Smaller ecosystem than hyperscalers
- gpu
- serverless
- python
- inference
- +1
Open
View vLLM details
InferenceFREEOSS
vLLM
vLLM Project
High-throughput, memory-efficient inference engine for LLMs.
A serving engine for large language and vision-language models, originally from UC Berkeley's Sky Computing Lab. Its PagedAttention KV-cache management and continuous batching deliver high throughput on commodity GPUs. Now a community project with 1000s of contributors and an OpenAI-compatible server.
Serves most Hugging Face transformer models
You manage the GPU infrastructure
- inference
- model-serving
- gpu
- open-source
- +1
Open
View LiteLLM details
InferenceFREEMIUMOpen core
LiteLLM
BerriAI
AI gateway: call many LLMs through one OpenAI-format interface.
Open-source Python SDK and proxy server (AI gateway) that exposes 100+ LLM providers through a single OpenAI-compatible API, with cost tracking, load balancing, fallbacks, caching, and guardrails. Self-host the proxy or use the managed cloud; a paid Enterprise tier adds SSO, audit logs, and support.
Load balancing and guardrails built in
Proxy adds an extra hop
- gateway
- proxy
- routing
- open-source
- +1
Open
View Morph details
InferenceFREEMIUM
Morph
Morph
Fast models that apply AI code edits to files in milliseconds.
Infrastructure for coding agents centered on Fast Apply, a specialized model that merges AI-generated edits into files at ~10,500 tokens/sec instead of full-file rewrites or brittle search-and-replace. Also serves WarpGrep code search, context compaction, and a model router via an OpenAI-compatible API. Used in production by JetBrains, Vercel, and Webflow.
Merges edits without full-file rewrites
Narrow, infra-layer use case
- code-editing
- fast-apply
- coding-agents
- api
Open
View SambaNova Cloud details
InferenceFREEMIUM
SambaNova Cloud
SambaNova Systems
Fast inference for open models on custom RDU chips.
Inference cloud running open-weight models — Llama, DeepSeek, Qwen, gpt-oss — on SambaNova's RDU hardware at hundreds of tokens per second, including full-precision Llama 405B. Provides an OpenAI-compatible API with a free tier and pay-per-token pricing.
Serves Llama, DeepSeek, Qwen, gpt-oss
Open-weight catalog only
- inference
- fast-inference
- open-models
- rdu
Open
View Baseten details
InferenceFREEMIUM
Baseten
Baseten
Inference cloud for serving any AI model in production.
Production inference platform offering both pre-optimized Model APIs (Llama, DeepSeek, and more, billed per token) and dedicated GPU/CPU deployments for custom models, billed per minute with no charge for idle time. Custom models are packaged with its open-source Truss format and autoscale, including scale-to-zero. Aimed at low-latency, high-throughput serving.
Prebuilt Model APIs for Llama, DeepSeek
Dedicated GPU rates run pricier than Modal
- inference
- model-serving
- gpu
- autoscaling
Open
View Cerebras details
InferenceFREEMIUM
Cerebras
Cerebras Systems
Wafer-scale inference cloud for open models.
Inference cloud that serves open-weight models such as Llama, Qwen, DeepSeek, and gpt-oss on Cerebras's wafer-scale CS-3 hardware, reaching token throughput far above GPU clouds. Exposes an OpenAI-compatible API with a free daily tier and pay-per-token pricing.
Highest tokens/sec in the market
Smaller model catalog than Groq/Together
- inference
- fast-inference
- wafer-scale
- open-models
Open
View Replicate details
InferenceFREEMIUM
Replicate
Replicate
Run, fine-tune, and deploy thousands of open models via one API.
A platform to run open-source models with one API call — image, video, audio, and language — plus fine-tuning and custom deploys with pay-per-second billing. No infra to manage.
Image, video, audio, and language models
Cold starts on less-popular models
- model-hosting
- fine-tuning
- api
- open-source
Open
View Groq details
InferenceFREEMIUM
Groq
Groq
Low-latency inference for open-weights models on custom LPU chips.
GroqCloud serves open-weights models (Llama, DeepSeek, Qwen, Kimi) on Groq's purpose-built LPU hardware, hitting hundreds of tokens per second where GPUs manage tens. OpenAI-compatible API with a free tier; the default when token latency is the product.
Hundreds of tokens/sec on open models
Curated open-weight models only
- inference
- low-latency
- lpu
- open-weights
Open
View fal details
InferenceFREEMIUM
fal
fal
Serverless inference API for image, video, audio, and 3D models.
A generative-media inference platform exposing FLUX, Kling, Veo, Wan, Stable Diffusion, and 600+ image/video/audio/3D models through one fast, serverless API — no GPUs to manage and near-zero cold starts. Pay per output or per GPU-second; free starter credits to test. Popular as the production backend for AI media features.
600+ generative-media models
Media-focused, not a general LLM host
- generative-media
- image-gen
- video-gen
- serverless
Open
View OpenRouter details
InferenceFREEMIUM
OpenRouter
OpenRouter
One OpenAI-compatible API in front of models from every provider.
A unified gateway that routes a single endpoint and API key to models from Anthropic, OpenAI, Google, Meta, DeepSeek, xAI, and more — swap models by changing one parameter, with automatic fallbacks and one consolidated bill. Pass-through token pricing plus dozens of free models.
Swap models by changing one parameter
Adds a routing hop vs direct provider
- gateway
- routing
- multi-model
- fallbacks
Open
View LM Studio details
InferenceFREE
LM Studio
LM Studio
Desktop app to discover, download, and run local LLMs privately.
A GUI for running open-weight models on your own hardware — browse and download GGUF/MLX models, chat offline, and expose an OpenAI- and Anthropic-compatible local server for your apps. Includes RAG over local files, MCP tool-use support, and dual llama.cpp + Apple MLX runtimes. Free for personal and commercial use; the app itself is proprietary.
Polished desktop GUI
App itself is closed source
- local
- llm-runner
- gui
- privacy
Open
View Ollama details
InferenceFREEMIUMOpen core
Ollama
Ollama
Run open-weight LLMs locally with one command. OpenAI-compatible API.
The de-facto way to pull and run open-weight models (Llama, Qwen, Gemma, DeepSeek, gpt-oss) on your own machine — no API key, no data leaving the device. Ships native macOS/Windows/Linux apps, an OpenAI-compatible server, and official Python/JS libraries. MIT-licensed and free locally; an optional paid Ollama Cloud runs larger models.
One-command pull-and-run
Local performance bound by your hardware
- local
- open-source
- llm-runner
- self-hosted
Open
View OpenPipe details
Fine-tuningPAID
OpenPipe
OpenPipe
Replace frontier-model spend with a fine-tuned small model.
Captures your production OpenAI / Anthropic calls, builds a dataset, fine-tunes a small open-weights model on your traffic, then serves the swap behind your existing SDK. The pitch: 10x cost reduction at parity.
Uses your production logs as training data
Needs enough quality traffic to distill
- fine-tuning
- cost-reduction
- drop-in
- open-weights
Open
View Together AI details
InferenceFREEMIUM
Together AI
Together
Hosted inference and fine-tuning for open-weights models.
Hosted inference and fine-tuning across hundreds of open-weights models (Llama, Mistral, DeepSeek, Qwen, etc.). Strong pricing for inference-at-scale; LoRA + full fine-tuning supported.
LoRA and full fine-tuning
Open models only, no frontier closed models
- inference
- fine-tuning
- open-weights
- lora
Open
View Modal details
InferenceFREEMIUM
Modal
Modal Labs
Serverless GPUs. Run training, inference, batch jobs from Python.
Define cloud workloads in Python, deploy with one command — GPU access on demand, fast cold starts, fair-share pricing. The default 'I need to fine-tune a model from a Jupyter cell' platform.
Python-decorator infra, no YAML/Dockerfiles
SDK lock-in; migrating means rewriting
- gpu
- serverless
- python
- training
Open
View Fireworks AI details
InferenceFREEMIUM
Fireworks AI
Fireworks AI
Fast inference + fine-tuning. Production deployments at scale.
Optimized inference platform for open-weights models with strong latency numbers and serverless + dedicated deployment options. Fine-tuning supported; vision and audio models alongside text.
Custom FireAttention inference stack
Usage pricing scales with traffic
- inference
- fine-tuning
- low-latency
- production
Open
View Helicone details
ObservabilityFREEMIUMOpen core
Helicone
Helicone
Drop-in LLM proxy with logging, caching, and cost tracking.
One-line integration — change your OpenAI/Anthropic base URL and get a dashboard with every prompt, response, latency, and dollar tracked. Adds caching and rate-limit handling without code changes.
No SDK or code changes to integrate
Request/response focused, not span-based
- proxy
- logging
- caching
- cost-tracking
Open

Inference AI apps

Relace

Liquid AI

Tinfoil

Inception Labs

Novita AI

DeepInfra

Lamini

TrueFoundry

MiniMax

Runware

Respan

Vast.ai

Cohere

Clarifai

Crusoe

Nebius

Lambda

CoreWeave

Z.ai

Predibase

Hyperbolic

Runpod

Glama

Portkey

Beam

vLLM

LiteLLM

Morph

SambaNova Cloud

Baseten

Cerebras

Replicate

Groq

fal

OpenRouter

LM Studio

Ollama

OpenPipe

Together AI

Modal

Fireworks AI

Helicone