Evidently AI

Evaluation and observability for ML and LLM systems.

Categories: Observability, Eval
Pricing: FREEMIUM
Source: Open core
Hosting: Hybrid
Platforms: Web, API
Models: Model-agnostic
Verified: Jun 15, 2026

Evidently is an open-source Python framework for evaluating, testing, and monitoring AI systems — from tabular ML models to LLM apps, RAG pipelines, and agents. It ships 100+ built-in metrics covering data drift, quality, hallucinations, PII leaks, and jailbreaks, plus LLM-as-judge scoring and monitoring dashboards. Evidently Cloud adds a hosted service with no-code evals, alerting, and team management on top of the open library.
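To make "data drift" concrete, here is a minimal standard-library sketch of one common drift score, the Population Stability Index (PSI). It is not Evidently's API or its implementation: the function name, the equal-width binning, and the `eps` smoothing are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Hypothetical helper: PSI between a reference and a current sample.

    Bins are equal-width over the reference range; current values that
    fall outside that range are clamped into the first or last bin.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / bins

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_shares = shares(reference)
    cur_shares = shares(current)
    return sum(
        (c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares)
    )


if __name__ == "__main__":
    baseline = [i / 100 for i in range(1000)]
    shifted = [x + 5 for x in baseline]
    print(population_stability_index(baseline, baseline))
    print(population_stability_index(baseline, shifted))
```

A sample compared with itself scores zero, and the score grows as the current distribution moves away from the reference. A commonly cited rule of thumb treats values above roughly 0.25 as significant drift, but thresholds depend on the data and the binning.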

Capabilities 3

What it actually does — grouped by capability family.

LLM evaluation (primary capability)
LLM observability (primary capability)
Red-teaming (secondary capability)

Pros & cons

Pros
Open source (Apache-2.0), self-hostable
Covers both ML and LLM evaluation
Built-in metrics and presets
LLM-as-judge plus drift detection
Optional hosted cloud with free tier

Cons
Python-library learning curve
Less agent-trace-centric than rivals
Cloud features gated to paid tiers
Reports can get heavy at scale

Tags


Langfuse
Langfuse
Categories: Observability
Pricing: FREEMIUM
Source: Open core
Open-source LLM observability. Self-hostable, OpenTelemetry-native.
Tracing, evals, prompt management, and dataset tooling for LLM apps — self-host on your own infra or use Langfuse Cloud. The open-source default when you want full ownership of your observability stack.
Pro: Own your observability data
Con: Self-host infra cost at scale
- open-source
- tracing
- evals
- self-hosted

Arize Phoenix
Arize AI
Categories: Observability
Pricing: FREEMIUM
LLM tracing and evaluation with retrieval debugging.
Phoenix is Arize's observability platform — run locally in a notebook or as a hosted service. Especially strong for inspecting RAG pipelines, finding bad chunks, and tracking retrieval quality over time.
Pro: Source-available, runs locally
Con: Less polished than hosted SaaS evals
- tracing
- rag
- retrieval-debugging

DeepEval
Confident AI
Categories: Eval
Pricing: FREEMIUM
Source: Open core
Pytest-style framework for evaluating LLM apps in CI.
Open-source (Apache 2.0) framework for evaluating LLM apps the way Pytest tests code — assertions backed by 50+ ready metrics spanning LLM-as-judge, RAG, agents, conversation, and safety. Plugs into LangChain, CrewAI, OpenAI Agents and more. Confident AI is the paid cloud platform that adds test management, dashboards, and observability on top.
Pro: Assertions run in your CI pipeline
Con: LLM-as-judge adds cost
- eval
- open-source
- llm-as-judge
- rag
- +1

Giskard
Giskard
Categories: Eval
Pricing: FREEMIUM
Source: Open core
Open-source evaluation and red-teaming for LLM agents and RAG apps.
Giskard is an open-source (Apache-2.0) Python library for testing LLMs, RAG pipelines, and ML models — its Scan automatically surfaces hallucinations, prompt injection, bias, and other vulnerabilities, while red-teaming agents run multi-turn adversarial attacks across dozens of probes. The paid Giskard Hub adds team collaboration, continuous testing, and scheduled scans. The team also publishes the open Phare LLM safety benchmark.
Pro: Automatic vulnerability scan
Con: Python-library learning curve
- llm-eval
- red-teaming
- testing
- rag
- +1