LMArena

Crowdsourced LLM leaderboard where humans vote on anonymous model battles.

Category: Eval
Pricing: FREE
Hosting: Cloud
Platforms: Web
Models: Multi-model
Verified: Jun 15, 2026

An open evaluation platform where a user enters a prompt, two anonymous models answer side by side, and the user picks the better response. Those millions of blind pairwise votes are aggregated into Elo-style rankings across text, vision, coding, and other arenas. It works directly with major AI labs and has become the most-cited public reference for comparing frontier model quality.
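To make the aggregation step concrete, here is a minimal sketch of how pairwise votes can be turned into Elo-style ratings. It is a textbook online Elo update, not a description of LMArena's production method, which may differ. The model names, the starting rating of 1000, and the step size `K = 32` are assumptions chosen for illustration.

```python
from collections import defaultdict

K = 32          # assumed update step size (illustrative)
BASE = 1000.0   # assumed starting rating for every model (illustrative)


def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(votes, k=K, base=BASE):
    """Apply one Elo update per vote.

    Each vote is (model_a, model_b, outcome), where outcome is
    1.0 if A was preferred, 0.0 if B was preferred, and 0.5 for a tie.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, outcome in votes:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        delta = k * (outcome - e_a)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


# Hypothetical blind votes between three placeholder models.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 1.0),
    ("model-x", "model-z", 1.0),
    ("model-x", "model-y", 0.5),
]

ratings = update_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
for name in leaderboard:
    print(f"{name}: {ratings[name]:.1f}")
```

Because each update depends on the order in which votes arrive, a sequential scheme like this is sensitive to vote ordering and to noisy or coordinated votes, which is one reason crowd-sourced rankings are usually read alongside confidence intervals rather than as exact scores.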

Pros & cons

  Pros
  • Real human preference votes
  • Covers all major frontier models
  • Free and open to use
  • Widely cited industry standard

  Cons
  • No public API yet
  • Crowd votes can be noisy or gamed
  • A leaderboard, not a dev test harness

Further reading

  • PromptLayer
    Eval · FREEMIUM

    Prompt CMS, evals, and observability for LLM teams.

    PromptLayer is a prompt-engineering platform that treats prompts as a content-managed asset: version, edit, and deploy them without touching application code. It pairs that registry with an evaluation harness (datasets, scoring) and an observability stack that logs every request and tracks cost and latency. The collaborative model lets non-technical domain experts iterate on prompts alongside engineers.

    • prompt-management
    • evaluation
    • observability
    • llmops
  • Freeplay
    Eval · PAID

    Eval and observability ops platform for AI product teams.

    Freeplay is an LLM evaluation and observability platform that unifies prompt management, batch evals, experiments, and production monitoring in one workflow. Built for cross-functional teams, it lets engineers, PMs, and domain experts review the same traces, run model-graded and code-based evals, and align auto-evaluators with human labels before shipping.

    Worth knowing

    Founded in 2022 by two former Twitter developer-platform leaders; raised a $5.6M round led by Renegade Partners in June 2025.

    • llm-eval
    • observability
    • prompt-management
    • experiments
  • TruLens
    Eval · FREE · OSS

    Snowflake

    Open-source evaluation and tracing for LLM and agent apps.

    TruLens is an open-source Python library for evaluating and tracing LLM, RAG, and agent applications. You wrap your app with feedback functions that score outputs on metrics like groundedness, context relevance, and answer relevance, then trace runs and compare versions on a metrics leaderboard. It integrates OpenTelemetry tracing and runs locally with a built-in dashboard.

    Worth knowing

    Built by TruEra, whose AI-observability platform Snowflake acquired in 2024; Snowflake kept TruLens open source and now stewards it.

    • eval
    • tracing
    • rag
    • llm-as-judge
  • Deepchecks
    Eval · FREEMIUM · Open core

    Testing-first evaluation and monitoring for LLM and ML systems.

    Deepchecks brings a testing-first approach to AI quality. Its open-source Python library validates ML models and data from research through production, and the LLM Evaluation product extends that into continuous validation of LLM applications — measuring quality, performance, and pitfalls across experimentation and production with CI/CD hooks. Enterprise deployment runs in VPC, on-prem, or bare metal for teams that can't use cloud-hosted eval.

    Worth knowing

    Began as an open-source library for continuous validation of ML models and data, then extended into LLM evaluation.

    • eval
    • llm-testing
    • ml-validation
    • observability
  • Agenta
    Eval · FREEMIUM · Open core

    Open-source LLMOps: prompt management, evaluation, and observability.

    An open-source platform for building and improving LLM apps. Agenta combines a prompt playground, prompt versioning, evaluation (human and LLM-as-judge), and tracing/observability in one tool. Available as managed cloud or self-hosted, so teams can keep the whole eval-and-trace loop on their own infra.

    Worth knowing

    Open-sourced its full core under MIT in Nov 2025; only enterprise extras (SSO, RBAC, audit logs) stay proprietary.

    • llmops
    • evaluation
    • prompt-management
    • observability
  • Athina AI
    Eval · FREEMIUM

    Build, test, and monitor LLM apps with evals and observability.

    Athina AI is a collaborative platform for building, evaluating, and monitoring LLM features. It bundles prompt management, datasets, experiments, production tracing, and a library of 50+ preset and custom evaluations, with human annotation tools on top. The platform pairs with an open-source eval SDK and works with OpenAI, Azure, Bedrock, Vertex, and custom models hosted anywhere.

    Worth knowing

    A Y Combinator W23 startup founded by Shiv Sakhuja and Himanshu Bamoria.

    • eval
    • observability
    • llm-monitoring
    • prompt-management
  • HoneyHive
    Eval · FREEMIUM

    The observability and evaluation layer for production AI agents.

    A platform that unifies monitoring and testing for LLM apps and agents into one improvement loop: distributed tracing, online evaluations and alerts, offline experiments, annotation queues for expert feedback, and CI/CD-integrated regression testing. Built OpenTelemetry-native with support for 100+ models and agent frameworks. The free Developer tier covers small teams; Enterprise adds scale, self-host, and compliance.

    Worth knowing

    Co-founded by two Columbia roommates, one of whom came from Microsoft's OpenAI Innovation Team; raised $7.4M in a round led by Insight Partners.

    • eval
    • observability
    • tracing
    • agents
  • Inspect AI
    Eval · FREE · OSS

    UK AI Security Institute

    Open-source Python framework for large language model evaluations.

    A framework for building and running reproducible LLM and agent evaluations, structured around datasets, solvers, and scorers. Ships sandboxed tool execution, multi-turn agent workflows, and a log viewer, plus a companion library of 200+ prebuilt evals. Run any eval against any model via the inspect CLI or the Python API.

    Worth knowing

    Open-sourced May 2024 by the UK AI Safety Institute (renamed AI Security Institute in 2025) and used in its frontier-model evals.

    • llm-eval
    • open-source
    • agents
    • ai-safety
  • Vellum
    Eval · FREEMIUM

    Build, evaluate, and deploy production LLM apps and agents.

    An end-to-end development platform for building, testing, and shipping LLM applications and agents. Vellum pairs a visual drag-and-drop workflow builder with a Python SDK, and bundles prompt versioning, RAG, evaluation, and production monitoring in one place so technical and non-technical teammates can collaborate. Built-in eval and test suites let teams measure quality before and after deploy. A free tier is available; paid Pro and Enterprise plans add seats and scale.

    Worth knowing

    A Y Combinator W23 company whose three founders had been building on GPT-3 since March 2020, well before the LLMOps category existed.

    • llmops
    • evaluation
    • prompt-engineering
    • workflows
  • Giskard
    Eval · FREEMIUM · Open core

    Open-source evaluation and red-teaming for LLM agents and RAG apps.

    Giskard is an open-source (Apache-2.0) Python library for testing LLMs, RAG pipelines, and ML models. Its Scan automatically surfaces hallucinations, prompt injection, bias, and other vulnerabilities, while red-teaming agents run multi-turn adversarial attacks across dozens of probes. The paid Giskard Hub adds team collaboration, continuous testing, and scheduled scans. The team also publishes the open Phare LLM safety benchmark.

    Worth knowing

    Paris-based and YC-backed; joined a France 2030 R&D consortium with Mistral to build LLM-evaluation methods.

    • llm-eval
    • red-teaming
    • testing
    • rag