Eval AI apps

Evaluation suites and testing harnesses for measuring LLM and agent quality before and after shipping.

40 apps · researched & kept current by Claude Code

Data Ops · FREEMIUM
Kili Technology
Kili Technology
Data labeling and quality platform for training and evaluating AI models.
Kili Technology is a data-centric platform for turning raw data into high-quality training and evaluation datasets. It supports annotation across image, video, text, OCR, and geospatial data, with review and quality workflows, plus LLM evaluation and RLHF using human-in-the-loop and LLM-as-a-judge. It is used by enterprises including Airbus and SAP, and offers cloud, private-cloud, and on-premise deployment.
Multi-modal annotation
Enterprise-oriented complexity
- data-labeling
- annotation
- rlhf
- evals
Eval · FREEMIUM · Open core
Future AGI
Future AGI
Evaluation, observability, and optimization platform for AI agents and LLM apps.
Future AGI is an end-to-end platform for testing, evaluating, observing, and improving generative-AI applications. It spans simulations, evaluation suites, real-time tracing and dashboards, runtime guardrails, and a model gateway, with multimodal evaluation across text, image, and audio. The core stack is open-source under Apache 2.0 and can be self-hosted or used as a managed cloud.
Open-source, Apache-2.0 licensed
Newer, smaller community
- open-source
- evals
- observability
- guardrails
Eval · FREEMIUM · Open core
Judgment Labs
Judgment Labs
The continuous-improvement stack for AI agents.
An evaluation and monitoring platform for AI agents, built around the open-source judgeval framework. Judgment traces an agent's full trajectory — tool calls, memory, search queries, and long reasoning chains — then uses trajectory-level judges to surface failure modes, validate fixes before deploy, and catch behavioral regressions in production. The captured environment data and evals feed back into agent post-training (RL and SFT), not just pass/fail scoring. judgeval is Apache-2.0 and free; the hosted platform adds the dashboard, AutoRubrics, and enterprise features.
Open-source judgeval framework (Apache-2.0)
Hosted platform pricing not public
- agent-evals
- llm-as-judge
- tracing
- post-training
Eval · FREEMIUM · Open core
Iris
Iris
MCP-native eval and observability server for AI agents.
Iris scores AI agent output, catches safety failures, and enforces cost budgets — exposed as an MCP server rather than an SDK. Any MCP-compatible agent discovers its tools (trace logging, output evaluation, rule management, LLM-as-judge) and uses them automatically, with no code changes, so every output flowing through the protocol gets evaluated. It detects PII leaks, prompt injection, hallucinations, and budget anomalies. The core is MIT-licensed and free to self-host; a managed cloud adds dashboards and alerting.
No SDK or instrumentation to add
Newer, niche MCP-focused tool
- agent-eval
- observability
- mcp
- open-source
Eval · FREEMIUM
Okareo
Okareo
Simulate real users to ship reliable voice and text agents.
Okareo is an evaluation and testing platform for AI agents. It drives an agent with synthetic users ("Drivers") that hold personality-rich, multi-turn conversations across voice, text, and headless channels, surfacing edge cases before release. Teams gate releases on conversation quality in CI/CD, turn production failures into automated test scenarios, and evaluate models, RAG pipelines, and agents from one workspace.
Personality-rich synthetic-user Drivers
Younger than general eval platforms
- agent-evaluation
- simulation
- synthetic-users
- voice-agents
Eval · PAID
Hamming
Hamming
Automated testing and monitoring for voice and chat agents.
Hamming is an enterprise platform for testing and monitoring conversational AI agents. It auto-generates test scenarios from an agent's prompt, load-tests with tens of thousands of concurrent calls, replays production calls for regression testing, and scores 50+ audio-native metrics like latency, hallucinations, sentiment, and compliance. It integrates natively with Vapi, Retell, ElevenLabs, LiveKit, and Pipecat.
Audio-native scoring of voice agents
No public pricing or free tier
- voice-agents
- testing
- evaluation
- monitoring
Eval · PAID
Cekura
Cekura
Test, monitor, and self-improve voice and chat AI agents.
Cekura is an automated QA and observability platform for conversational AI agents. Before launch it runs simulations across thousands of diverse personas and edge cases, with red-teaming for bias, toxicity, and jailbreaks; in production it monitors live conversations for instruction-following, hallucinations, and voice-specific quality regressions with real-time alerting. It targets regulated sectors like healthcare and finance where reliability and compliance matter.
Simulates thousands of persona conversations
No public pricing; demo/trial required
- voice-agents
- agent-testing
- qa
- red-teaming
Eval · PAID
Coval
Coval
Simulation and evaluation platform for voice and chat AI agents.
Coval is an evaluation and monitoring platform for conversational AI agents, applying the simulation-driven testing rigor developed in self-driving to voice and chat. From a handful of test cases it generates thousands of realistic scenarios, runs them against an agent over text or live phone calls, and scores the results on built-in or custom metrics. In production it monitors and scores real calls so teams can catch regressions across millions of conversations.
Generates realistic scenarios from few cases
No free tier — 7-day trial only
- agent-eval
- voice-agents
- simulation
- monitoring
Eval · FREEMIUM
Atla
Atla
Evaluation layer that finds and fixes AI agent failures.
Atla is an evaluation platform that automatically discovers, clusters, and ranks failures in AI agents, then suggests fixes. Rather than prompting a general model to grade outputs, it runs on Atla's own Selene LLM-judge models, purpose-trained to score and critique generative-AI responses. It offers Python and TypeScript SDKs and integrates with stacks like OpenAI and LangChain.
Auto-discovers and suggests fixes
Younger platform, small team
- evaluation
- llm-as-judge
- agents
- observability
Eval · FREEMIUM
Confident AI
Confident AI
The AI quality platform from the team behind DeepEval.
Confident AI is the hosted platform built on top of DeepEval, the open-source LLM evaluation framework. It adds dataset and test management, research-backed metrics, production tracing and monitoring, adversarial red teaming, and governance dashboards so teams can benchmark, observe, and safeguard LLM apps across the dev-to-prod loop. Python and TypeScript SDKs plug into CI and OpenTelemetry, with managed cloud and enterprise self-hosting.
Built on the DeepEval framework
Platform itself is proprietary
- eval
- observability
- red-teaming
- llm-as-judge
Eval · FREE
LMArena
LMArena
Crowdsourced LLM leaderboard where humans vote on anonymous model battles.
An open evaluation platform where a user enters a prompt, two anonymous models answer side by side, and the user picks the better response. Millions of these blind pairwise votes are aggregated into Elo-style rankings across text, vision, coding, and other arenas. It works directly with major AI labs and has become one of the most-cited public references for comparing frontier model quality.
Real human preference votes
No public API yet
- leaderboard
- benchmark
- model-evaluation
- llm
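The step from pairwise votes to an Elo-style ranking can be illustrated with the textbook Elo update. This is a minimal sketch and not necessarily how LMArena computes its leaderboard; the K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions.

```python
from collections import defaultdict

K = 32          # assumed update step size (not taken from LMArena)
START = 1000.0  # assumed initial rating for every model


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rate(votes):
    """Fold (winner, loser) votes into ratings, one vote at a time."""
    ratings = defaultdict(lambda: START)
    for winner, loser in votes:
        e_w = expected_score(ratings[winner], ratings[loser])
        delta = K * (1.0 - e_w)  # an upset win moves ratings more
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical blind battles: each tuple is (winner, loser).
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
ratings = rate(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because every update adds to the winner exactly what it subtracts from the loser, the total rating mass stays constant, which makes the scores comparable across models that never met directly.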
Infra · FREEMIUM
Runloop
Runloop AI, Inc.
Cloud Devboxes and benchmarks for running AI coding agents at scale.
Runloop provides isolated, cloud-based development environments — "Devboxes" — where AI coding agents safely execute code with full filesystem, build-tool, and compiler access. Teams spin up thousands of Devboxes on demand for large-scale agent tasks and tear them down when done. A benchmarking suite lets you evaluate and compare agents on standardized tests, and enterprise deployments add compliance and VPC options.
Isolated cloud sandboxes for agent code
Usage-based pricing can be hard to forecast
- devbox
- code-sandbox
- coding-agents
- agent-infra
Observability · FREEMIUM · Open core
Evidently AI
Evidently AI
Evaluation and observability for ML and LLM systems.
Evidently is an open-source Python framework for evaluating, testing, and monitoring AI systems — from tabular ML models to LLM apps, RAG pipelines, and agents. It ships 100+ built-in metrics covering data drift, quality, hallucinations, PII leaks, and jailbreaks, plus LLM-as-judge scoring and monitoring dashboards. Evidently Cloud adds a hosted service with no-code evals, alerting, and team management on top of the open library.
Open source (Apache-2.0), self-hostable
Python-library learning curve
- evaluation
- observability
- ml-monitoring
- llm-testing
Observability · FREEMIUM · Open core
Latitude
Latitude Data S.L.
Observability and evals for AI agents in production.
Latitude is an open-source monitoring platform for AI agents in production. It captures agent-native traces across multi-turn sessions, tool calls, and full execution paths, then clusters failures into tracked issues and alerts via Slack, email, or webhooks. It is OpenTelemetry-native, supports semantic search over sessions, and auto-builds evals from your team's judgments — run it as managed cloud or self-host the whole stack.
Fully self-hostable stack
Younger than LangSmith/Langfuse
- open-source
- observability
- evals
- agents
Observability · FREEMIUM
Respan
Keywords AI, Inc.
Self-driving observability, evals, and an AI gateway for LLM agents.
Respan is a unified LLM-engineering platform that routes, observes, and evaluates every model call from one control plane. Its AI gateway fronts 500+ models with fallbacks, load balancing, and spend caps; trace trees, dashboards, and alerts cover observability; and rule checks, AI judges, and human review run as one evaluation workflow. An automated eval agent surfaces issues from production traffic so teams can fix what breaks without stitching tools together.
Trace trees, dashboards, and alerts
Proprietary, no self-host option
- observability
- ai-gateway
- evals
- tracing
Eval · FREEMIUM
PromptLayer
PromptLayer
Prompt CMS, evals, and observability for LLM teams.
PromptLayer is a prompt-engineering platform that treats prompts as a content-managed asset: version, edit, and deploy them without touching application code. It pairs that registry with an evaluation harness (datasets, scoring) and an observability stack that logs every request and tracks cost and latency. The collaborative model lets non-technical domain experts iterate on prompts alongside engineers.
Prompt CMS — edit/version without code
Cloud-hosted (no self-host on lower tiers)
- prompt-management
- evaluation
- observability
- llmops
Observability · FREE · OSS
MLflow
Linux Foundation
Open-source platform for the ML and GenAI lifecycle.
MLflow is an open-source platform for managing the full machine-learning and GenAI lifecycle — experiment tracking, model registry, deployment, and, more recently, LLM/agent observability. Its GenAI stack adds OpenTelemetry-based tracing, systematic evaluation with built-in metrics and LLM judges, and prompt versioning. Framework- and provider-agnostic, it runs on your own infrastructure with no vendor lock-in.
Fully open source, no lock-in
Self-hosting adds operational overhead
- llmops
- tracing
- evaluation
- mlops
Data Ops · PAID
SuperAnnotate
SuperAnnotate AI
Platform for building multimodal AI datasets and evaluation pipelines.
SuperAnnotate is an enterprise data platform for creating, managing, and evaluating high-quality datasets for AI. It spans annotation across images, video, text, audio, and LiDAR, with AI-assisted labeling, customizable workflows, and an optional managed annotation workforce. Teams use it to build human-in-the-loop data and evaluation pipelines for agentic, multimodal, and frontier AI.
Multimodal: image, video, text, audio, LiDAR
No free tier; sales-led pricing
- data-labeling
- annotation
- multimodal
- rlhf
Eval · PAID
Freeplay
Freeplay
Eval and observability ops platform for AI product teams.
Freeplay is an LLM evaluation and observability platform that unifies prompt management, batch evals, experiments, and production monitoring in one workflow. Built for cross-functional teams, it lets engineers, PMs, and domain experts review the same traces, run model-graded and code-based evals, and align auto-evaluators with human labels before shipping.
Unifies prompt mgmt, evals, and monitoring
Paid plans start around $500/mo
- llm-eval
- observability
- prompt-management
- experiments
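Aligning an auto-evaluator with human labels usually starts with measuring how often the two agree. The listing does not say how Freeplay does this, so the following is only a sketch of one common approach, Cohen's kappa, which corrects raw agreement for chance. The function name and the sample labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(human, auto):
    """Chance-corrected agreement between two label sequences."""
    if len(human) != len(auto) or not human:
        raise ValueError("label lists must be non-empty and equal length")
    n = len(human)
    # Fraction of items where both raters gave the same label.
    observed = sum(h == a for h, a in zip(human, auto)) / n
    # Agreement expected if both raters labeled independently at their own rates.
    h_counts, a_counts = Counter(human), Counter(auto)
    expected = sum(h_counts[c] * a_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical review of six traces.
human = ["pass", "pass", "fail", "fail", "pass", "fail"]
auto = ["pass", "pass", "fail", "pass", "pass", "fail"]
kappa = cohens_kappa(human, auto)
print(round(kappa, 3))
```

A low kappa suggests the evaluator prompt or rubric needs revision before its scores are trusted as a release gate.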
Data Ops · FREEMIUM
Labelbox
Labelbox
Data factory for AI teams — labeling, evals, and human data for training.
Labelbox is a platform for generating and managing training data for AI models, combining annotation tools (Annotate), data curation (Catalog), and model-assisted labeling and evaluation (Model Foundry). It now spans reinforcement-learning data, custom evals, robotics datasets, and an on-demand network of expert human labelers, metered by a usage-based Labelbox Unit (LBU).
Mature, full-featured labeling UI
Usage-based LBU pricing hard to forecast
- data-labeling
- training-data
- annotation
- evals
Eval · FREE · OSS
TruLens
Snowflake
Open-source evaluation and tracing for LLM and agent apps.
TruLens is an open-source Python library for evaluating and tracing LLM, RAG, and agent applications. You wrap your app with feedback functions that score outputs on metrics like groundedness, context relevance, and answer relevance, then trace runs and compare versions on a metrics leaderboard. It integrates OpenTelemetry tracing and runs locally with a built-in dashboard.
OpenTelemetry tracing, runs locally
Python library, no hosted SaaS
- eval
- tracing
- rag
- llm-as-judge
Security · PAID
Mindgard
Mindgard
Automated AI red teaming and security testing for models and agents.
Mindgard is an automated AI red-teaming and security-testing platform that runs attacker-aligned tests — prompt injection, jailbreaks, model extraction, agent misuse — against LLM applications, agents, and multimodal models. It discovers AI assets, tests continuously through CI/CD and Burp Suite integrations, and adds runtime guardrails informed by findings. The Lancaster University spinout is SOC 2 Type 2 certified and operates from London and Boston.
Research-grade attack library
No public pricing
- security
- red-teaming
- llm-security
- pentesting
Eval · FREEMIUM · Open core
Deepchecks
Deepchecks
Testing-first evaluation and monitoring for LLM and ML systems.
Deepchecks brings a testing-first approach to AI quality. Its open-source Python library validates ML models and data from research through production, and the LLM Evaluation product extends that into continuous validation of LLM applications — measuring quality, performance, and pitfalls across experimentation and production with CI/CD hooks. Enterprise deployment runs in VPC, on-prem, or bare metal for teams that can't use cloud-hosted eval.
Open-source core (AGPL-3.0)
AGPL-3.0 may not suit all teams
- eval
- llm-testing
- ml-validation
- observability
Data Ops · PAID
Scale AI
Scale AI
Training data, evaluations, and enterprise GenAI from the data-labeling giant.
Scale supplies human-annotated training data to many frontier AI labs through its Data Engine, spanning labeling, RLHF, and expert red-teaming. On top of the data business it runs evaluation leaderboards, an enterprise GenAI platform, and Donovan, its platform for the US public sector.
Frontier-scale human data ops
Enterprise sales, no public pricing
- data-labeling
- rlhf
- evals
- training-data
Eval · FREEMIUM · Open core
Agenta
Agenta
Open-source LLMOps: prompt management, evaluation, and observability.
An open-source platform for building and improving LLM apps. Agenta combines a prompt playground, prompt versioning, evaluation (human and LLM-as-judge), and tracing/observability in one tool. Available as managed cloud or self-hosted, so teams can keep the whole eval-and-trace loop on their own infra.
Self-hostable on your own infra
Smaller ecosystem than incumbents
- llmops
- evaluation
- prompt-management
- observability
Eval · FREEMIUM
Athina AI
Athina AI
Build, test, and monitor LLM apps with evals and observability.
Athina AI is a collaborative platform for building, evaluating, and monitoring LLM features. It bundles prompt management, datasets, experiments, production tracing, and a library of 50+ preset and custom evaluations, with human annotation tools on top. The platform pairs with an open-source eval SDK and works with OpenAI, Azure, Bedrock, Vertex, and custom models hosted anywhere.
50+ preset + custom evals
Monitoring platform is closed
- eval
- observability
- llm-monitoring
- prompt-management
Eval · FREEMIUM
HoneyHive
HoneyHive
The observability and evaluation layer for production AI agents.
A platform that unifies monitoring and testing for LLM apps and agents into one improvement loop: distributed tracing, online evaluations and alerts, offline experiments, annotation queues for expert feedback, and CI/CD-integrated regression testing. Built OpenTelemetry-native with support for 100+ models and agent frameworks. The free Developer tier covers small teams; Enterprise adds scale, self-host, and compliance.
Unifies tracing and evaluation
SaaS-only (self-host = Enterprise)
- eval
- observability
- tracing
- agents
Eval · FREE · OSS
Inspect AI
UK AI Security Institute
Open-source Python framework for large language model evaluations.
A framework for building and running reproducible LLM and agent evaluations, structured around datasets, solvers, and scorers. Ships sandboxed tool execution, multi-turn agent workflows, and a log viewer, plus a companion library of 200+ prebuilt evals. Run any eval against any model via the inspect CLI or the Python API.
Adopted across major safety labs
Python/code framework, not a UI product
- llm-eval
- open-source
- agents
- ai-safety
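The dataset, solver, and scorer split can be pictured in plain Python. This sketch does not use the Inspect AI package or mirror its API; every name in it (Sample, run_eval, toy_solver, exact_match) is hypothetical, and the "model" is a stub that adds two integers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    input: str   # what the solver is asked
    target: str  # the reference answer the scorer compares against


def run_eval(dataset: List[Sample],
             solver: Callable[[str], str],
             scorer: Callable[[str, str], float]) -> float:
    """Run the solver on every sample and average the scorer's results."""
    scores = [scorer(solver(s.input), s.target) for s in dataset]
    return sum(scores) / len(scores)


def toy_solver(prompt: str) -> str:
    """Stand-in for a model call: answers prompts of the form 'a+b'."""
    a, b = prompt.split("+")
    return str(int(a) + int(b))


def exact_match(output: str, target: str) -> float:
    """Score 1.0 when the output equals the target, else 0.0."""
    return 1.0 if output.strip() == target.strip() else 0.0


# The third target is deliberately wrong so the eval shows a failure.
dataset = [Sample("1+1", "2"), Sample("2+3", "5"), Sample("10+5", "16")]
accuracy = run_eval(dataset, toy_solver, exact_match)
print(round(accuracy, 3))
```

Keeping the three pieces separate is what allows the same dataset and scorer to be reused against a different solver or model.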
Observability · FREEMIUM · Open core
W&B Weave
Weights & Biases
Tracing and evaluation for LLM apps, from Weights & Biases.
An observability and evaluation toolkit for generative-AI applications. A single @weave.op decorator traces every model call — capturing inputs, outputs, latency, token cost, and errors — and the same SDK builds rigorous evaluations using LLM-as-judge and custom scorers. Traces and experiments are organized in the Weights & Biases web platform for side-by-side comparison across prompts and models.
Single decorator traces every call
Traces land in W&B hosted platform
- llm-observability
- tracing
- eval
- open-source
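Decorator-based tracing of this kind can be sketched with the standard library alone. The code below is not the Weave SDK: `traced` and `TRACE_LOG` are hypothetical stand-ins that show how a wrapper can capture inputs, output, latency, and errors for each call, and token cost is left out.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory sink; a real tool would send records to a backend


def traced(fn):
    """Record inputs, output, latency, and any error for each call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"op": fn.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        start = time.perf_counter()
        try:
            record["output"] = fn(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise  # tracing must not swallow the caller's exception
        finally:
            record["latency_s"] = time.perf_counter() - start
            TRACE_LOG.append(record)
    return wrapper


@traced
def fake_model_call(prompt: str) -> str:
    """Stand-in for an LLM call."""
    return prompt.upper()


result = fake_model_call("hello")
print(TRACE_LOG[0]["op"], TRACE_LOG[0]["output"])
```

Writing the record in `finally` means failed calls are logged too, which matters because errors are often the traces worth inspecting.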
Eval · FREEMIUM
Vellum
Vellum
Build, evaluate, and deploy production LLM apps and agents.
An end-to-end development platform for building, testing, and shipping LLM applications and agents. Vellum pairs a visual drag-and-drop workflow builder with a Python SDK, and bundles prompt versioning, RAG, evaluation, and production monitoring in one place so technical and non-technical teammates can collaborate. Built-in eval and test suites let teams measure quality before and after deploy. A free tier is available; paid Pro and Enterprise plans add seats and scale.
Visual builder plus Python SDK
Cloud-only platform
- llmops
- evaluation
- prompt-engineering
- workflows
Observability · FREEMIUM
Galileo
Galileo
Evaluation and observability for GenAI apps and agents, with inline guardrails.
A platform for testing, monitoring, and guardrailing LLM and agent applications. It ships 20+ out-of-the-box evals for RAG, agents, and safety, lets teams author custom evaluators, and turns those offline evals into real-time production guardrails powered by its own Luna eval models.
20+ out-of-the-box evals for RAG and agents
Pricing tiers gate the production guardrails
- evaluation
- observability
- guardrails
- agents
Eval · FREEMIUM · Open core
Giskard
Giskard
Open-source evaluation and red-teaming for LLM agents and RAG apps.
Giskard is an open-source (Apache-2.0) Python library for testing LLMs, RAG pipelines, and ML models — its Scan automatically surfaces hallucinations, prompt injection, bias, and other vulnerabilities, while red-teaming agents run multi-turn adversarial attacks across dozens of probes. The paid Giskard Hub adds team collaboration, continuous testing, and scheduled scans. The team also publishes the open Phare LLM safety benchmark.
Automatic vulnerability scan
Python-library learning curve
- llm-eval
- red-teaming
- testing
- rag
Eval · FREEMIUM
Maxim AI
Maxim AI
Simulate, evaluate, and observe AI agents end-to-end.
An end-to-end platform for testing and monitoring AI agents across their lifecycle. It combines a prompt experimentation IDE, agent simulation across scenarios and personas, offline and online evaluations with custom metrics, and production observability with tracing and alerts. Aimed at teams shipping reliable agentic and RAG systems.
Agent simulation across personas/scenarios
Newer, smaller community than rivals
- eval
- agent-simulation
- observability
- tracing
Eval · FREEMIUM · Open core
DeepEval
Confident AI
pytest-style framework for evaluating LLM apps in CI.
Open-source (Apache-2.0) framework for evaluating LLM apps the way pytest tests code: assertions backed by 50+ ready-made metrics spanning LLM-as-judge, RAG, agents, conversation, and safety. Plugs into LangChain, CrewAI, OpenAI Agents, and more. Confident AI is the paid cloud platform that adds test management, dashboards, and observability on top.
Assertions run in your CI pipeline
LLM-as-judge adds cost
- eval
- open-source
- llm-as-judge
- rag
Eval · FREE · OSS
Ragas
Exploding Gradients
Evaluation toolkit for RAG and LLM applications.
Open-source (Apache-2.0) Python framework for evaluating retrieval-augmented generation and LLM apps. Provides reference-free metrics — faithfulness, answer relevancy, context precision/recall — plus knowledge-graph-based synthetic test generation. Integrates with LangChain, LlamaIndex, and CI pipelines.
Faithfulness & relevancy metrics
LLM-judge scores add cost/variance
- eval
- rag
- llm-as-judge
- open-source
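Context precision and recall can be illustrated with simple set arithmetic. This is a simplified sketch and not Ragas's implementation: it assumes the set of relevant chunk IDs is already known, ignores ranking, and uses function names and document IDs invented for the example.

```python
def context_precision(retrieved, relevant):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Share of relevant chunks that the retriever managed to return."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


# Hypothetical retrieval result for one question.
retrieved = ["doc1", "doc4", "doc2", "doc9"]
relevant = {"doc1", "doc2", "doc3"}

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(precision, round(recall, 3))
```

Low precision points to noisy retrieval that pads the prompt, while low recall points to missing evidence, so the two numbers suggest different fixes.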
Observability · FREEMIUM · Open core
Opik
Comet
Open-source LLM evaluation, tracing, and monitoring.
Open-source platform from Comet for debugging and evaluating LLM and agent apps: full tracing of calls, tools, and agent steps, LLM-as-a-judge and heuristic evals, prompt management, and production dashboards. Self-host via Docker or Kubernetes, or use Comet's hosted cloud.
Self-host via Docker/Kubernetes
Younger than some rivals
- observability
- evaluation
- tracing
- open-source
Eval · FREEMIUM
Patronus AI
Patronus AI
Automated evaluation, guardrails, and monitoring for AI systems.
Platform for evaluating, guarding, and monitoring LLM and agent applications across the deployment lifecycle. Anchored by research-backed evaluator models — Lynx (hallucination detection), GLIDER (LLM judge), and Percival (agent-trace debugger). Offers a self-serve API with free credits, usage-based pricing, and enterprise plans.
Research-backed Lynx, GLIDER, and Percival models
Cloud-only; no self-host
- eval
- guardrails
- monitoring
- hallucination
Data Ops · FREEMIUM · Open core
Label Studio
HumanSignal
Multi-type data labeling and AI evaluation across every modality.
Widely used open-source tool for labeling and annotating data across images, text, audio, video, and time-series, with a standardized export format for training and fine-tuning. ML backends can pre-label data to speed up human review, and it increasingly doubles as a human-in-the-loop AI evaluation surface. Maintained by HumanSignal, which offers a hosted Starter tier and Label Studio Enterprise.
Covers all data modalities in one tool
Self-host setup needs DevOps maturity
- data-labeling
- open-source
- annotation
- human-in-the-loop
Eval · FREEMIUM
Braintrust
Braintrust
Hosted eval + tracing platform for LLM apps.
Production-grade eval orchestration with a dashboard, dataset versioning, and OpenTelemetry tracing. Useful once eval volume outgrows a CI YAML file.
Eval workflow as the primary interface
Closed-source SaaS
- eval
- tracing
- datasets
- production
Eval · FREE · OSS
Promptfoo
Promptfoo
LLM eval CLI with rubric scoring and golden sets.
YAML-driven eval harness. Pair a prompt with a golden set, define rubrics, and run across multiple models in CI. Strong for catching prompt regressions before they hit production.
YAML-driven, version-controllable evals
CLI-first, less of a hosted UI
- eval
- ci
- rubric
- open-source
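The prompt, golden set, and rubric loop can be sketched in plain Python. This is not Promptfoo's configuration format or CLI; the prompt template, the `must_contain` rubric, and both stub models are hypothetical and stand in for real provider calls.

```python
from typing import Callable, Dict

PROMPT = "Which country is {city} in?"  # hypothetical prompt template

# Hypothetical golden set: template variables plus a rubric per case.
GOLDEN_SET = [
    {"vars": {"city": "Paris"}, "must_contain": ["France"]},
    {"vars": {"city": "Tokyo"}, "must_contain": ["Japan"]},
]


def rubric_pass(output: str, must_contain) -> bool:
    """Rubric: every required term appears in the output, case-insensitively."""
    return all(term.lower() in output.lower() for term in must_contain)


def run_suite(models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Return the golden-set pass rate for each model."""
    results = {}
    for name, call in models.items():
        passed = sum(
            rubric_pass(call(PROMPT.format(**case["vars"])), case["must_contain"])
            for case in GOLDEN_SET
        )
        results[name] = passed / len(GOLDEN_SET)
    return results


def good_model(prompt: str) -> str:
    """Stub model that answers both golden cases correctly."""
    return "Paris is in France." if "Paris" in prompt else "Tokyo is in Japan."


def regressed_model(prompt: str) -> str:
    """Stub model simulating a prompt regression."""
    return "I am not sure."


results = run_suite({"good": good_model, "regressed": regressed_model})
failing = [name for name, rate in results.items() if rate < 1.0]
print(results, failing)
```

In CI, a non-empty `failing` list would be the signal to fail the build before the regressed prompt or model reaches production.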