Loading…
Eval · Athina AI
Build, test, and monitor LLM apps with evals and observability.
Athina AI is a collaborative platform for building, evaluating, and monitoring LLM features. It bundles prompt management, datasets, experiments, production tracing, and a library of 50+ preset and custom evaluations, with human annotation tools on top. The platform pairs with an open-source eval SDK and works with OpenAI, Azure, Bedrock, Vertex, and custom models hosted anywhere.
Model support
OpenAI, Azure OpenAI, AWS Bedrock, Vertex AI, plus custom models hosted anywhere.
Where it runs
Tags
Related in Eval
Agenta
Open-source LLMOps: prompt management, evaluation, and observability.
An open-source platform for building and improving LLM apps. Agenta combines a prompt playground, prompt versioning, evaluation (human and LLM-as-judge), and tracing/observability in one tool. Available as managed cloud or self-hosted, so teams can keep the whole eval-and-trace loop on their own infra.
AI insight: Fully open-source and self-hostable, so the prompt-management, eval, and tracing loop can run entirely on your own infra.
Giskard
Open-source evaluation and red-teaming for LLM agents and RAG apps.
Giskard is an open-source (Apache-2.0) Python library for testing LLMs, RAG pipelines, and ML models — its Scan automatically surfaces hallucinations, prompt injection, bias, and other vulnerabilities, while red-teaming agents run multi-turn adversarial attacks across dozens of probes. The paid Giskard Hub adds team collaboration, continuous testing, and scheduled scans. The team also publishes the open Phare LLM safety benchmark.
AI insight: Its Scan auto-generates adversarial suites mapped to the OWASP LLM Top-10, framing eval as security red-teaming, not just accuracy.
HoneyHive
The observability and evaluation layer for production AI agents.
A platform that unifies monitoring and testing for LLM apps and agents into one improvement loop: distributed tracing, online evaluations and alerts, offline experiments, annotation queues for expert feedback, and CI/CD-integrated regression testing. Built OpenTelemetry-native with support for 100+ models and agent frameworks. The free Developer tier covers small teams; Enterprise adds scale, self-host, and compliance.
AI insight: OpenTelemetry-native, so tracing rides standard OTel spans instead of a proprietary SDK, unifying observability and eval in one loop.
UK AI Security Institute
Open-source Python framework for large language model evaluations.
A framework for building and running reproducible LLM and agent evaluations, structured around datasets, solvers, and scorers. Ships sandboxed tool execution, multi-turn agent workflows, and a log viewer, plus a companion library of 200+ prebuilt evals. Run any eval against any model via the inspect CLI or the Python API.
AI insight: Built by the UK's AI Security Institute and adopted by Anthropic and Google DeepMind as a shared eval framework; MIT-licensed.
Maxim AI
Simulate, evaluate, and observe AI agents end-to-end.
An end-to-end platform for testing and monitoring AI agents across their lifecycle. It combines a prompt experimentation IDE, agent simulation across scenarios and personas, offline and online evaluations with custom metrics, and production observability with tracing and alerts. Aimed at teams shipping reliable agentic and RAG systems.
AI insight: Simulates multi-turn agent conversations across personas and scenarios before release, not just single-prompt scoring.
Vellum
Build, evaluate, and deploy production LLM apps and agents.
An end-to-end development platform for building, testing, and shipping LLM applications and agents. Vellum pairs a visual drag-and-drop workflow builder with a Python SDK, and bundles prompt versioning, RAG, evaluation, and production monitoring in one place so technical and non-technical teammates can collaborate. Built-in eval and test suites let teams measure quality before and after deploy. A free tier is available; paid Pro and Enterprise plans add seats and scale.
AI insight: Passes model-provider token costs straight through at cost with no markup — unusual for an all-in-one LLMOps platform.
Confident AI
Pytest-style LLM evaluation framework. Open source.
Open-source (Apache 2.0) framework for evaluating LLM apps the way Pytest tests code — assertions backed by 50+ ready metrics spanning LLM-as-judge, RAG, agents, conversation, and safety. Plugs into LangChain, CrewAI, OpenAI Agents and more. Confident AI is the paid cloud platform that adds test management, dashboards, and observability on top.
AI insight: Modeled on Pytest — you write LLM evals as unit-test assertions and run them in CI, with 50+ metrics spanning RAG, agents, and safety.
Patronus AI
Automated evaluation, guardrails, and monitoring for AI systems.
Platform for evaluating, guarding, and monitoring LLM and agent applications across the deployment lifecycle. Anchored by research-backed evaluator models — Lynx (hallucination detection), GLIDER (LLM judge), and Percival (agent-trace debugger). Offers a self-serve API with free credits, usage-based pricing, and enterprise plans.
AI insight: Ships proprietary evaluators — Lynx for hallucination, GLIDER as judge, Percival for agent-trace debugging — beyond prompt-based scoring.
Exploding Gradients
Open-source evaluation toolkit for RAG and LLM applications.
Open-source (Apache-2.0) Python framework for evaluating retrieval-augmented generation and LLM apps. Provides reference-free metrics — faithfulness, answer relevancy, context precision/recall — plus knowledge-graph-based synthetic test generation. Integrates with LangChain, LlamaIndex, and CI pipelines.
AI insight: Popularized reference-free RAG metrics — faithfulness, context precision — scored by an LLM judge, so you evaluate without gold answers.
Braintrust
Hosted eval + tracing platform for LLM apps.
Production-grade eval orchestration with a dashboard, dataset versioning, and OpenTelemetry tracing. Useful once eval volume outgrows a CI YAML file.
AI insight: Where teams graduate when a CI eval file stops scaling — it adds dataset versioning and OpenTelemetry traces to the loop.