Eval AI apps

Evaluation suites and testing harnesses for measuring LLM and agent quality before and after shipping.

12 apps · researched & kept current by Claude Code

  • Agenta
    Eval · FREEMIUM · Open core

    Agenta

    Open-source LLMOps: prompt management, evaluation, and observability.

    An open-source platform for building and improving LLM apps. Agenta combines a prompt playground, prompt versioning, evaluation (human and LLM-as-judge), and tracing/observability in one tool. Available as managed cloud or self-hosted, so teams can keep the whole eval-and-trace loop on their own infrastructure.

    Worth knowing

    Open-sourced its full core under MIT in Nov 2025; only enterprise extras (SSO, RBAC, audit logs) stay proprietary.

    • llmops
    • evaluation
    • prompt-management
    • observability
  • Athina AI
    Eval · FREEMIUM

    Athina AI

    Build, test, and monitor LLM apps with evals and observability.

    Athina AI is a collaborative platform for building, evaluating, and monitoring LLM features. It bundles prompt management, datasets, experiments, production tracing, and a library of 50+ preset and custom evaluations, with human annotation tools on top. The platform pairs with an open-source eval SDK and works with OpenAI, Azure, Bedrock, Vertex, and custom models hosted anywhere.

    Worth knowing

    A Y Combinator W23 startup founded by Shiv Sakhuja and Himanshu Bamoria.

    • eval
    • observability
    • llm-monitoring
    • prompt-management
  • HoneyHive
    Eval · FREEMIUM

    HoneyHive

    The observability and evaluation layer for production AI agents.

    A platform that unifies monitoring and testing for LLM apps and agents into one improvement loop: distributed tracing, online evaluations and alerts, offline experiments, annotation queues for expert feedback, and CI/CD-integrated regression testing. Built OpenTelemetry-native with support for 100+ models and agent frameworks. The free Developer tier covers small teams; Enterprise adds scale, self-hosting, and compliance.

    Worth knowing

    Co-founded by two Columbia roommates, one of them previously on Microsoft's OpenAI Innovation Team; raised $7.4M in a round led by Insight Partners.

    • eval
    • observability
    • tracing
    • agents
  • Inspect AI
    Eval · FREE · OSS

    UK AI Security Institute

    Open-source Python framework for large language model evaluations.

    A framework for building and running reproducible LLM and agent evaluations, structured around datasets, solvers, and scorers. Ships sandboxed tool execution, multi-turn agent workflows, and a log viewer, plus a companion library of 200+ prebuilt evals. Run any eval against any model via the inspect CLI or the Python API.

    Worth knowing

    Open-sourced May 2024 by the UK AI Safety Institute (renamed AI Security Institute in 2025) and used in its frontier-model evals.

    • llm-eval
    • open-source
    • agents
    • ai-safety
  • Vellum
    Eval · FREEMIUM

    Vellum

    Build, evaluate, and deploy production LLM apps and agents.

    An end-to-end development platform for building, testing, and shipping LLM applications and agents. Vellum pairs a visual drag-and-drop workflow builder with a Python SDK, and bundles prompt versioning, RAG, evaluation, and production monitoring in one place so technical and non-technical teammates can collaborate. Built-in eval and test suites let teams measure quality before and after deploy. A free tier is available; paid Pro and Enterprise plans add seats and scale.

    Worth knowing

    A Y Combinator W23 company whose three founders had been building on GPT-3 since March 2020, well before the LLMOps category existed.

    • llmops
    • evaluation
    • prompt-engineering
    • workflows
  • Giskard
    Eval · FREEMIUM · Open core

    Giskard

    Open-source evaluation and red-teaming for LLM agents and RAG apps.

    Giskard is an open-source (Apache-2.0) Python library for testing LLMs, RAG pipelines, and ML models. Its Scan feature automatically surfaces hallucinations, prompt injection, bias, and other vulnerabilities, while red-teaming agents run multi-turn adversarial attacks across dozens of probes. The paid Giskard Hub adds team collaboration, continuous testing, and scheduled scans. The team also publishes the open Phare LLM safety benchmark.

    Worth knowing

    Paris-based and YC-backed; joined a France 2030 R&D consortium with Mistral to build LLM-evaluation methods.

    • llm-eval
    • red-teaming
    • testing
    • rag
  • Maxim AI
    Eval · FREEMIUM

    Maxim AI

    Simulate, evaluate, and observe AI agents end-to-end.

    An end-to-end platform for testing and monitoring AI agents across their lifecycle. It combines a prompt experimentation IDE, agent simulation across scenarios and personas, offline and online evaluations with custom metrics, and production observability with tracing and alerts. Aimed at teams shipping reliable agentic and RAG systems.

    Worth knowing

    Founded 2023 by ex-Google and Postman engineers; raised a $3M seed led by Elevation Capital in 2024.

    • eval
    • agent-simulation
    • observability
    • tracing
  • DeepEval
    Eval · FREEMIUM · Open core

    Confident AI

    Pytest-style LLM evaluation framework. Open source.

    Open-source (Apache-2.0) framework for evaluating LLM apps the way Pytest tests code: assertions backed by 50+ ready-made metrics spanning LLM-as-judge, RAG, agents, conversation, and safety. Plugs into LangChain, CrewAI, OpenAI Agents, and more. Confident AI is the paid cloud platform that adds test management, dashboards, and observability on top.

    Worth knowing

    Built by Confident AI (YC W25, founded 2024); its open-source framework runs ~2M evaluations a day.

    • eval
    • open-source
    • llm-as-judge
    • rag
  • Ragas
    Eval · FREE · OSS

    Exploding Gradients

    Open-source evaluation toolkit for RAG and LLM applications.

    Open-source (Apache-2.0) Python framework for evaluating retrieval-augmented generation and LLM apps. Provides metrics such as faithfulness, answer relevancy, and context precision/recall, several of which need no reference answer, plus knowledge-graph-based synthetic test generation. Integrates with LangChain, LlamaIndex, and CI pipelines.

    Worth knowing

    Began as a 2023 research paper (EACL 2024) and a Y Combinator W24 startup before becoming the default open-source RAG eval standard.

    • eval
    • rag
    • llm-as-judge
    • open-source
  • Patronus AI
    Eval · FREEMIUM

    Patronus AI

    Automated evaluation, guardrails, and monitoring for AI systems.

    Platform for evaluating, guarding, and monitoring LLM and agent applications across the deployment lifecycle. Anchored by research-backed evaluator models — Lynx (hallucination detection), GLIDER (LLM judge), and Percival (agent-trace debugger). Offers a self-serve API with free credits, usage-based pricing, and enterprise plans.

    Worth knowing

    Founded by two ex-Meta AI researchers who led responsible-NLP and ML-interpretability work before spinning out in 2023.

    • eval
    • guardrails
    • monitoring
    • hallucination
  • Braintrust
    Eval · FREEMIUM

    Braintrust

    Hosted eval + tracing platform for LLM apps.

    Production-grade eval orchestration with a dashboard, dataset versioning, and OpenTelemetry tracing. Useful once eval volume outgrows a CI YAML file.

    Worth knowing

    Raised a $36M Series A led by a16z at a $150M valuation in Oct 2024; angels include Greg Brockman and Guillermo Rauch.

    • eval
    • tracing
    • datasets
    • production
  • Promptfoo
    Eval · FREE · OSS

    Promptfoo

    Open-source LLM eval CLI. Rubric scoring + golden sets.

    YAML-driven eval harness. Pair a prompt with a golden set, define rubrics, and run across multiple models in CI. Strong for catching prompt regressions before they hit production.

    Worth knowing

    Acquired by OpenAI in March 2026; stays open-source and MIT-licensed, and is used internally by both OpenAI and Anthropic.

    • eval
    • ci
    • rubric
    • open-source
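
The pattern named in the Promptfoo entry (pair a prompt with a golden set, score replies against a rubric, compare several models, fail CI on a regression) can be sketched in plain Python. This is an illustrative sketch only, not Promptfoo's config format or API: `GoldenCase`, `run_suite`, `find_regressions`, the substring rubric, and the two stub models are all hypothetical names and simplifications introduced here. A real harness would call actual model endpoints and would likely use richer rubrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenCase:
    """One golden-set row (hypothetical shape)."""
    prompt_input: str   # value substituted into the prompt template
    must_contain: str   # simplified rubric: reply must mention this string


def run_suite(
    template: str,
    cases: List[GoldenCase],
    models: Dict[str, Callable[[str], str]],
) -> Dict[str, float]:
    """Return the pass rate per model over the golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    scores: Dict[str, float] = {}
    for name, model in models.items():
        passed = 0
        for case in cases:
            reply = model(template.format(input=case.prompt_input))
            # Case-insensitive substring check stands in for a rubric grader.
            if case.must_contain.lower() in reply.lower():
                passed += 1
        scores[name] = passed / len(cases)
    return scores


def find_regressions(scores: Dict[str, float], threshold: float) -> List[str]:
    """Names of models whose pass rate is below the threshold, sorted."""
    return sorted(name for name, rate in scores.items() if rate < threshold)


# Stub models so the sketch runs offline; swap in real API calls in practice.
def good_model(prompt: str) -> str:
    if "France" in prompt:
        return "The capital of France is Paris."
    return "The capital of Japan is Tokyo."


def weak_model(prompt: str) -> str:
    return "I am not sure."


TEMPLATE = "What is the capital of {input}?"
GOLDEN_SET = [GoldenCase("France", "Paris"), GoldenCase("Japan", "Tokyo")]

if __name__ == "__main__":
    results = run_suite(
        TEMPLATE, GOLDEN_SET, {"good": good_model, "weak": weak_model}
    )
    failing = find_regressions(results, threshold=0.9)
    print(results, failing)
    # A CI job would exit non-zero here when `failing` is not empty.
```

In CI, the same idea applies at the job level: run the suite on every prompt change and fail the build when any model's pass rate drops below the agreed threshold.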