
LMArena

Crowdsourced LLM leaderboard where humans vote on anonymous model battles.

Category: Eval
Pricing: FREE
Source: Proprietary
Hosting: Cloud
Platforms: Web
Models: Multi-model
Verified: Jun 15, 2026

An open evaluation platform where a user enters a prompt, two anonymous models answer side by side, and the user picks the better response. Those millions of blind pairwise votes are aggregated into Elo-style rankings across text, vision, coding, and other arenas. It works directly with major AI labs and has become the most-cited public reference for comparing frontier model quality.
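
As a rough illustration of how blind pairwise votes can be turned into an Elo-style ranking, here is a minimal sketch. It is a textbook online Elo update, not necessarily the statistical fit the platform runs in production. The model names, the K-factor of 32, and the starting rating of 1000 are all assumptions chosen for the example.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(votes, k=32.0, base=1000.0):
    """Aggregate pairwise votes into ratings.

    votes: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}.
    k and base are illustrative defaults, not values taken from LMArena.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in votes:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        delta = k * (s_a - e_a)  # positive when A did better than expected
        ratings[model_a] += delta
        ratings[model_b] -= delta  # zero-sum: B loses what A gains
    return dict(ratings)


# Hypothetical votes from three anonymous battles.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]
leaderboard = sorted(elo_ratings(votes).items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each update depends on the ratings at the time of the vote, this online form is sensitive to vote order; a batch fit over all votes is a common alternative when order should not matter.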

Capabilities (2)

What it actually does — grouped by capability family.

Multi-model access (secondary capability)

LLM evaluation (primary capability)

Pros & cons

Pros:
Real human preference votes
Covers all major frontier models
Free and open to use
Widely cited industry standard

Cons:
No public API yet
Crowd votes can be noisy or gamed
A leaderboard, not a dev test harness

Future AGI
Category: Eval
Pricing: FREEMIUM
Source: Open core
Evaluation, observability, and optimization platform for AI agents and LLM apps.
Future AGI is an end-to-end platform for testing, evaluating, observing, and improving generative-AI applications. It spans simulations, evaluation suites, real-time tracing and dashboards, runtime guardrails, and a model gateway, with multimodal evaluation across text, image, and audio. The core stack is open-source under Apache 2.0 and can be self-hosted or used as a managed cloud.
Pro: Open-source, Apache-2.0 licensed
Con: Newer, smaller community
- open-source
- evals
- observability
- guardrails
Judgment Labs
Category: Eval
Pricing: FREEMIUM
Source: Open core
The continuous-improvement stack for AI agents.
An evaluation and monitoring platform for AI agents, built around the open-source judgeval framework. Judgment traces an agent's full trajectory — tool calls, memory, search queries, and long reasoning chains — then uses trajectory-level judges to surface failure modes, validate fixes before deploy, and catch behavioral regressions in production. The captured environment data and evals feed back into agent post-training (RL and SFT), not just pass/fail scoring. judgeval is Apache-2.0 and free; the hosted platform adds the dashboard, AutoRubrics, and enterprise features.
Pro: Open-source judgeval framework (Apache-2.0)
Con: Hosted platform pricing not public
- agent-evals
- llm-as-judge
- tracing
- post-training
Iris
Category: Eval
Pricing: FREEMIUM
Source: Open core
MCP-native eval and observability server for AI agents.
Iris scores AI agent output, catches safety failures, and enforces cost budgets — exposed as an MCP server rather than an SDK. Any MCP-compatible agent discovers its tools (trace logging, output evaluation, rule management, LLM-as-judge) and uses them automatically, with no code changes, so every output flowing through the protocol gets evaluated. It detects PII leaks, prompt injection, hallucinations, and budget anomalies. The core is MIT-licensed and free to self-host; a managed cloud adds dashboards and alerting.
Pro: No SDK or instrumentation to add
Con: Newer, niche MCP-focused tool
- agent-eval
- observability
- mcp
- open-source
Okareo
Category: Eval
Pricing: FREEMIUM
Simulate real users to ship reliable voice and text agents.
Okareo is an evaluation and testing platform for AI agents. It drives an agent with synthetic users ("Drivers") that hold personality-rich, multi-turn conversations across voice, text, and headless channels, surfacing edge cases before release. Teams gate releases on conversation quality in CI/CD, turn production failures into automated test scenarios, and evaluate models, RAG pipelines, and agents from one workspace.
Pro: Personality-rich synthetic-user Drivers
Con: Younger than general eval platforms
- agent-evaluation
- simulation
- synthetic-users
- voice-agents
Hamming
Category: Eval
Pricing: PAID
Automated testing and monitoring for voice and chat agents.
Hamming is an enterprise platform for testing and monitoring conversational AI agents. It auto-generates test scenarios from an agent's prompt, load-tests with tens of thousands of concurrent calls, replays production calls for regression testing, and scores 50+ audio-native metrics like latency, hallucinations, sentiment, and compliance. It integrates natively with Vapi, Retell, ElevenLabs, LiveKit, and Pipecat.
Pro: Audio-native scoring of voice agents
Con: No public pricing or free tier
- voice-agents
- testing
- evaluation
- monitoring
Cekura
Category: Eval
Pricing: PAID
Test, monitor, and self-improve voice and chat AI agents.
Cekura is an automated QA and observability platform for conversational AI agents. Before launch it runs simulations across thousands of diverse personas and edge cases, with red-teaming for bias, toxicity, and jailbreaks; in production it monitors live conversations for instruction-following, hallucinations, and voice-specific quality regressions with real-time alerting. It targets regulated sectors like healthcare and finance where reliability and compliance matter.
Pro: Simulates thousands of persona conversations
Con: No public pricing; demo/trial required
- voice-agents
- agent-testing
- qa
- red-teaming
Coval
Category: Eval
Pricing: PAID
Simulation and evaluation platform for voice and chat AI agents.
Coval is an evaluation and monitoring platform for conversational AI agents, applying the simulation-driven testing rigor developed in self-driving to voice and chat. From a handful of test cases it generates thousands of realistic scenarios, runs them against an agent over text or live phone calls, and scores the results on built-in or custom metrics. In production it monitors and scores real calls so teams can catch regressions across millions of conversations.
Pro: Generates realistic scenarios from few cases
Con: No free tier; 7-day trial only
- agent-eval
- voice-agents
- simulation
- monitoring
Atla
Category: Eval
Pricing: FREEMIUM
Evaluation layer that finds and fixes AI agent failures.
Atla is an evaluation platform that automatically discovers, clusters, and ranks failures in AI agents, then suggests fixes. Rather than prompting a general model to grade outputs, it runs on Atla's own Selene LLM-judge models, purpose-trained to score and critique generative-AI responses. It offers Python and TypeScript SDKs and integrates with stacks like OpenAI and LangChain.
Pro: Auto-discovers failures and suggests fixes
Con: Younger platform, small team
- evaluation
- llm-as-judge
- agents
- observability
Confident AI
Category: Eval
Pricing: FREEMIUM
The AI quality platform from the team behind DeepEval.
Confident AI is the hosted platform built on top of DeepEval, the open-source LLM evaluation framework. It adds dataset and test management, research-backed metrics, production tracing and monitoring, adversarial red teaming, and governance dashboards so teams can benchmark, observe, and safeguard LLM apps across the dev-to-prod loop. Python and TypeScript SDKs plug into CI and OpenTelemetry, with managed cloud and enterprise self-hosting.
Pro: Built on the DeepEval framework
Con: Platform itself is proprietary
- eval
- observability
- red-teaming
- llm-as-judge
PromptLayer
Category: Eval
Pricing: FREEMIUM
Prompt CMS, evals, and observability for LLM teams.
PromptLayer is a prompt-engineering platform that treats prompts as a content-managed asset: version, edit, and deploy them without touching application code. It pairs that registry with an evaluation harness (datasets, scoring) and an observability stack that logs every request and tracks cost and latency. The collaborative model lets non-technical domain experts iterate on prompts alongside engineers.
Pro: Prompt CMS (edit/version without code)
Con: Cloud-hosted (no self-host on lower tiers)
- prompt-management
- evaluation
- observability
- llmops
