Eval · Braintrust
Hosted eval + tracing platform for LLM apps.
Production-grade eval orchestration with a dashboard, dataset versioning, and OpenTelemetry tracing. Useful once eval volume outgrows a CI YAML file.
Related in Eval
HoneyHive
The observability and evaluation layer for production AI agents.
A platform that unifies monitoring and testing for LLM apps and agents into one improvement loop: distributed tracing, online evaluations and alerts, offline experiments, annotation queues for expert feedback, and CI/CD-integrated regression testing. It is OpenTelemetry-native and supports 100+ models and agent frameworks. The free Developer tier covers small teams; Enterprise adds scale, self-hosting, and compliance.
AI insight: OpenTelemetry-native, so tracing rides standard OTel spans instead of a proprietary SDK, unifying observability and eval in one loop.
UK AI Security Institute
Open-source Python framework for large language model evaluations.
A framework for building and running reproducible LLM and agent evaluations, structured around datasets, solvers, and scorers. Ships sandboxed tool execution, multi-turn agent workflows, and a log viewer, plus a companion library of 200+ prebuilt evals. Run any eval against any model via the inspect CLI or the Python API.
AI insight: Built by the UK's AI Security Institute and adopted by Anthropic and Google DeepMind as a shared eval framework; MIT-licensed.
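The dataset, solver, and scorer split described above can be sketched in plain Python. This is a minimal illustration of the structure only, not the framework's actual API: `Sample`, `run_eval`, `upper_solver`, and `exact_match` are hypothetical names, and the solver is a stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One dataset row: a prompt and the expected target (hypothetical type)."""
    input: str
    target: str


Solver = Callable[[str], str]          # turns a prompt into an output
Scorer = Callable[[str, str], float]   # compares an output with the target


def run_eval(dataset: List[Sample], solver: Solver, scorer: Scorer) -> float:
    """Run the solver on every sample and return the mean score."""
    scores = [scorer(solver(sample.input), sample.target) for sample in dataset]
    return sum(scores) / len(scores)


def upper_solver(prompt: str) -> str:
    # Stand-in for a model call: simply upper-cases the prompt.
    return prompt.upper()


def exact_match(output: str, target: str) -> float:
    # 1.0 when output and target agree after trimming whitespace, else 0.0.
    return 1.0 if output.strip() == target.strip() else 0.0


dataset = [Sample("abc", "ABC"), Sample("hi", "HI"), Sample("no", "yes")]
accuracy = run_eval(dataset, upper_solver, exact_match)
```

Keeping the three pieces separate means the same dataset and scorer can be reused while only the solver (the model or agent under test) changes.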
Maxim AI
Simulate, evaluate, and observe AI agents end-to-end.
An end-to-end platform for testing and monitoring AI agents across their lifecycle. It combines a prompt experimentation IDE, agent simulation across scenarios and personas, offline and online evaluations with custom metrics, and production observability with tracing and alerts. Aimed at teams shipping reliable agentic and RAG systems.
AI insight: Simulates multi-turn agent conversations across personas and scenarios before release, not just single-prompt scoring.
Confident AI
Pytest-style LLM evaluation framework. Open source.
DeepEval is an open-source (Apache-2.0) framework for evaluating LLM apps the way Pytest tests code: assertions backed by 50+ ready-made metrics spanning LLM-as-judge, RAG, agents, conversation, and safety. It plugs into LangChain, CrewAI, OpenAI Agents, and more. Confident AI is the paid cloud platform that adds test management, dashboards, and observability on top.
AI insight: Modeled on Pytest — you write LLM evals as unit-test assertions and run them in CI, with 50+ metrics spanning RAG, agents, and safety.
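To make the "eval as a unit-test assertion" idea concrete, here is a rough sketch in plain Python. It does not use the framework's own classes; `keyword_coverage` and `fake_llm` are hypothetical helpers, and the keyword metric is a deliberately simple stand-in for an LLM-as-judge metric.

```python
from typing import List


def keyword_coverage(answer: str, required: List[str]) -> float:
    """Fraction of required keywords found in the answer (case-insensitive)."""
    if not required:
        return 1.0
    text = answer.lower()
    hits = sum(1 for keyword in required if keyword.lower() in text)
    return hits / len(required)


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for the application under test.
    return "You can request a refund within 30 days of purchase."


def test_refund_answer_mentions_policy():
    # Pytest collects functions named test_*; a failed assert fails the run.
    answer = fake_llm("What is the refund policy?")
    score = keyword_coverage(answer, ["refund", "30 days"])
    assert score >= 0.8, f"coverage {score:.2f} is below the 0.8 threshold"
```

Saved in a file such as `test_refund.py` (an assumed name), this would run under `pytest` alongside ordinary unit tests, so a quality drop shows up as a failing test.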
Patronus AI
Automated evaluation, guardrails, and monitoring for AI systems.
Platform for evaluating, guarding, and monitoring LLM and agent applications across the deployment lifecycle. Anchored by research-backed evaluator models — Lynx (hallucination detection), GLIDER (LLM judge), and Percival (agent-trace debugger). Offers a self-serve API with free credits, usage-based pricing, and enterprise plans.
AI insight: Ships proprietary evaluators — Lynx for hallucination, GLIDER as judge, Percival for agent-trace debugging — beyond prompt-based scoring.
Exploding Gradients
Open-source evaluation toolkit for RAG and LLM applications.
Open-source (Apache-2.0) Python framework for evaluating retrieval-augmented generation and LLM apps. Provides reference-free metrics — faithfulness, answer relevancy, context precision/recall — plus knowledge-graph-based synthetic test generation. Integrates with LangChain, LlamaIndex, and CI pipelines.
AI insight: Popularized reference-free RAG metrics — faithfulness, context precision — scored by an LLM judge, so you evaluate without gold answers.
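The sketch below illustrates what "reference-free" means for a faithfulness-style score: the answer is checked against the retrieved context only, with no gold answer involved. It is a toy under stated assumptions. Word overlap is swapped in where the real metric would use an LLM judge, and `toy_faithfulness` and `min_overlap` are hypothetical names, not part of any library.

```python
import re
from typing import Set


def _words(text: str) -> Set[str]:
    """Lower-cased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_faithfulness(answer: str, context: str, min_overlap: float = 0.7) -> float:
    """Share of answer sentences whose words mostly appear in the context.

    A sentence counts as supported when at least `min_overlap` of its
    distinct words occur in the context. No reference answer is used.
    """
    context_words = _words(context)
    claims = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        claim_words = _words(claim)
        if claim_words and len(claim_words & context_words) / len(claim_words) >= min_overlap:
            supported += 1
    return supported / len(claims)


context = "The Eiffel Tower is in Paris. It was completed in 1889."
answer = "The Eiffel Tower is in Paris. It was designed by aliens."
score = toy_faithfulness(answer, context)  # first sentence supported, second not
```

Lexical overlap misses paraphrases and negations, which is why production metrics of this kind delegate the "is this claim supported?" check to a model.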
Promptfoo
Open-source LLM eval CLI. Rubric scoring + golden sets.
YAML-driven eval harness. Pair a prompt with a golden set, define rubrics, and run the suite across multiple models in CI. Strong for catching prompt regressions before they hit production.
AI insight: Define evals in plain YAML and run one golden set across models in CI, so a prompt regression fails the build like any other test.
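The regression-gate idea can be sketched in a few lines of Python. This is an illustration of the logic only and does not reproduce the tool's YAML format or CLI: `pass_rate`, `regression_gate`, the substring check, and the two lambda "models" are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

GoldenSet = List[Tuple[str, str]]   # (prompt, text the output must contain)
Model = Callable[[str], str]


def pass_rate(model: Model, golden_set: GoldenSet) -> float:
    """Fraction of golden-set cases whose expected text appears in the output."""
    passed = sum(
        1 for prompt, expected in golden_set
        if expected.lower() in model(prompt).lower()
    )
    return passed / len(golden_set)


def regression_gate(models: Dict[str, Model], golden_set: GoldenSet,
                    threshold: float) -> List[str]:
    """Return the names of models whose pass rate falls below the threshold."""
    return [name for name, model in models.items()
            if pass_rate(model, golden_set) < threshold]


golden_set = [("What is the capital of France?", "Paris"), ("What is 2 + 2?", "4")]
models = {
    # Stand-ins for real model calls.
    "model_a": lambda prompt: "Paris" if "France" in prompt else "4",
    "model_b": lambda prompt: "I am not sure.",
}
failures = regression_gate(models, golden_set, threshold=0.9)
# In a CI job, a non-empty `failures` list would translate to a non-zero exit code.
```

Running one golden set against every model keeps the comparison fair: the only variable is the model (or the prompt revision), so a drop in pass rate points at the change under test.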