Observability AI apps

Tracing, monitoring, and debugging for LLM apps — see what your prompts, chains, and agents actually did.

40 apps · researched & kept current by Claude Code

Filter & search these 40 apps

View Future AGI details
EvalFREEMIUMOpen core
Future AGI
Future AGI
Evaluation, observability, and optimization platform for AI agents and LLM apps.
Future AGI is an end-to-end platform for testing, evaluating, observing, and improving generative-AI applications. It spans simulations, evaluation suites, real-time tracing and dashboards, runtime guardrails, and a model gateway, with multimodal evaluation across text, image, and audio. The core stack is open-source under Apache 2.0 and can be self-hosted or used as a managed cloud.
Open-source, Apache-2.0 licensed
Newer, smaller community
- open-source
- evals
- observability
- guardrails
- +1
Open
View InsightFinder details
ObservabilityPAID
InsightFinder
InsightFinder, Inc.
AI reliability platform for monitoring and fixing AI and IT systems.
InsightFinder is an AI-driven reliability platform that detects anomalies, pinpoints root causes, and remediates issues across both AI systems and traditional IT infrastructure. It ingests metrics, logs, traces, and events, applying 'composite AI' — streaming anomaly detection plus deterministic causal analysis — and ships fine-tuned, domain-specific small language models rather than relying on generative AI alone. Its ARI agent runs end-to-end incident management: detection, diagnosis, evidence collection, and real-time remediation. The platform runs as SaaS or in fully air-gapped environments and integrates with OpenTelemetry, Datadog, and Prometheus.
Causal root-cause, not just anomaly alerts
Pricing not public
- aiops
- observability
- incident-response
- root-cause
- +1
Open
View Judgment Labs details
EvalFREEMIUMOpen core
Judgment Labs
Judgment Labs
The continuous-improvement stack for AI agents.
An evaluation and monitoring platform for AI agents, built around the open-source judgeval framework. Judgment traces an agent's full trajectory — tool calls, memory, search queries, and long reasoning chains — then uses trajectory-level judges to surface failure modes, validate fixes before deploy, and catch behavioral regressions in production. The captured environment data and evals feed back into agent post-training (RL and SFT), not just pass/fail scoring. judgeval is Apache-2.0 and free; the hosted platform adds the dashboard, AutoRubrics, and enterprise features.
Open-source judgeval framework (Apache-2.0)
Hosted platform pricing not public
- agent-evals
- llm-as-judge
- tracing
- post-training
- +1
Open
View Cleric details
ObservabilityPAID
Cleric
Cleric
An autonomous AI SRE that investigates production alerts and finds root cause.
Cleric is an AI agent for site reliability engineering that automates incident response. When an alert fires, it investigates across your observability dashboards, performs root-cause analysis in minutes, recommends and verifies fixes against the live environment, and retains what it learns as institutional knowledge for the team. It works inside Slack and existing tooling with read-only access by default, adding write access only when teams are ready.
Autonomous alert triage
No public pricing
- ai-sre
- incident-response
- root-cause
- devops
Open
View Iris details
EvalFREEMIUMOpen core
Iris
Iris
MCP-native eval and observability server for AI agents.
Iris scores AI agent output, catches safety failures, and enforces cost budgets — exposed as an MCP server rather than an SDK. Any MCP-compatible agent discovers its tools (trace logging, output evaluation, rule management, LLM-as-judge) and uses them automatically, with no code changes, so every output flowing through the protocol gets evaluated. It detects PII leaks, prompt injection, hallucinations, and budget anomalies. The core is MIT-licensed and free to self-host; a managed cloud adds dashboards and alerting.
No SDK or instrumentation to add
Newer, niche MCP-focused tool
- agent-eval
- observability
- mcp
- open-source
- +1
Open
View Traversal details
ObservabilityPAID
Traversal
Traversal
An autonomous AI SRE that triages alerts and finds root causes in minutes.
Traversal is an AI site reliability engineer (SRE) that autonomously triages noisy alerts, traces incidents across services and dependencies to isolate root causes, and feeds production context back to prevent recurrence. Its causal search engine and 'production world model' narrow failures down at petabyte scale. At American Express it reached 82% root-cause accuracy and cut potential MTTR by ~32%, with PepsiCo also among its enterprise users.
Autonomous root-cause analysis
Enterprise sales-only, no public pricing
- sre
- incident-response
- root-cause-analysis
- aiops
- +1
Open
View Okareo details
EvalFREEMIUM
Okareo
Okareo
Simulate real users to ship reliable voice and text agents.
Okareo is an evaluation and testing platform for AI agents. It drives an agent with synthetic users ("Drivers") that hold personality-rich, multi-turn conversations across voice, text, and headless channels, surfacing edge cases before release. Teams gate releases on conversation quality in CI/CD, turn production failures into automated test scenarios, and evaluate models, RAG pipelines, and agents from one workspace.
Personality-rich synthetic-user Drivers
Younger than general eval platforms
- agent-evaluation
- simulation
- synthetic-users
- voice-agents
- +1
Open
View OpenObserve details
ObservabilityFREEMIUMOpen core
OpenObserve
OpenObserve Inc.
Open-source observability for logs, metrics, traces and LLMs in one platform.
OpenObserve (O2) is a unified, open-source observability platform that ingests logs, metrics, traces and real-user monitoring with dashboards, alerts and session replay. It adds LLM observability, a natural-language query assistant and an SRE agent for automated incident correlation and root-cause analysis. It runs as a single self-hosted binary or Helm chart, or as fully managed OpenObserve Cloud.
Single binary self-host or managed cloud
AGPL-3.0 can deter some commercial embedding
- observability
- logs
- metrics
- tracing
- +1
Open
View Hamming details
EvalPAID
Hamming
Hamming
Automated testing and monitoring for voice and chat agents.
Hamming is an enterprise platform for testing and monitoring conversational AI agents. It auto-generates test scenarios from an agent's prompt, load-tests with tens of thousands of concurrent calls, replays production calls for regression testing, and scores 50+ audio-native metrics like latency, hallucinations, sentiment, and compliance. It integrates natively with Vapi, Retell, ElevenLabs, LiveKit, and Pipecat.
Audio-native scoring of voice agents
No public pricing or free tier
- voice-agents
- testing
- evaluation
- monitoring
- +1
Open
View Cekura details
EvalPAID
Cekura
Cekura
Test, monitor, and self-improve voice and chat AI agents.
Cekura is an automated QA and observability platform for conversational AI agents. Before launch it runs simulations across thousands of diverse personas and edge cases, with red-teaming for bias, toxicity, and jailbreaks; in production it monitors live conversations for instruction-following, hallucinations, and voice-specific quality regressions with real-time alerting. It targets regulated sectors like healthcare and finance where reliability and compliance matter.
Simulates thousands of persona conversations
No public pricing; demo/trial required
- voice-agents
- agent-testing
- qa
- red-teaming
- +1
Open
View Coval details
EvalPAID
Coval
Coval
Simulation and evaluation platform for voice and chat AI agents.
Coval is an evaluation and monitoring platform for conversational AI agents, applying the simulation-driven testing rigor developed in self-driving to voice and chat. From a handful of test cases it generates thousands of realistic scenarios, runs them against an agent over text or live phone calls, and scores the results on built-in or custom metrics. In production it monitors and scores real calls so teams can catch regressions across millions of conversations.
Generates realistic scenarios from few cases
No free tier — 7-day trial only
- agent-eval
- voice-agents
- simulation
- monitoring
Open
View Atla details
EvalFREEMIUM
Atla
Atla
Evaluation layer that finds and fixes AI agent failures.
Atla is an evaluation platform that automatically discovers, clusters, and ranks failures in AI agents, then suggests fixes. Rather than prompting a general model to grade outputs, it runs on Atla's own Selene LLM-judge models, purpose-trained to score and critique generative-AI responses. It offers Python and TypeScript SDKs and integrates with stacks like OpenAI and LangChain.
Auto-discovers and suggests fixes
Younger platform, small team
- evaluation
- llm-as-judge
- agents
- observability
Open
View Confident AI details
EvalFREEMIUM
Confident AI
Confident AI
The AI quality platform from the team behind DeepEval.
Confident AI is the hosted platform built on top of DeepEval, the open-source LLM evaluation framework. It adds dataset and test management, research-backed metrics, production tracing and monitoring, adversarial red teaming, and governance dashboards so teams can benchmark, observe, and safeguard LLM apps across the dev-to-prod loop. Python and TypeScript SDKs plug into CI and OpenTelemetry, with managed cloud and enterprise self-hosting.
Built on the DeepEval framework
Platform itself is proprietary
- eval
- observability
- red-teaming
- llm-as-judge
- +1
Open
View AgentOps details
ObservabilityFREEMIUMOpen core
AgentOps
AgentOps
Observability and tracing built for AI agents.
AgentOps is a developer platform for monitoring, debugging, and evaluating AI agents. It records every LLM call, tool use, and decision in a replayable session trace, with time-travel debugging, token and cost tracking, and agent benchmarking. Its open-source SDK drops into Python or TypeScript agents in two lines and integrates with frameworks like CrewAI, LangChain, Autogen, and the OpenAI Agents SDK.
Open-source MIT SDK, two-line setup
Python/TypeScript SDK-centric
- agent-monitoring
- tracing
- debugging
- open-source
Open
View Evidently AI details
ObservabilityFREEMIUMOpen core
Evidently AI
Evidently AI
Evaluation and observability for ML and LLM systems.
Evidently is an open-source Python framework for evaluating, testing, and monitoring AI systems — from tabular ML models to LLM apps, RAG pipelines, and agents. It ships 100+ built-in metrics covering data drift, quality, hallucinations, PII leaks, and jailbreaks, plus LLM-as-judge scoring and monitoring dashboards. Evidently Cloud adds a hosted service with no-code evals, alerting, and team management on top of the open library.
Open source (Apache-2.0), self-hostable
Python-library learning curve
- evaluation
- observability
- ml-monitoring
- llm-testing
- +1
Open
View Latitude details
ObservabilityFREEMIUMOpen core
Latitude
Latitude Data S.L.
Observability and evals for AI agents in production.
Latitude is an open-source monitoring platform for AI agents in production. It captures agent-native traces across multi-turn sessions, tool calls, and full execution paths, then clusters failures into tracked issues and alerts via Slack, email, or webhooks. It is OpenTelemetry-native, supports semantic search over sessions, and auto-builds evals from your team's judgments — run it as managed cloud or self-host the whole stack.
Fully self-hostable stack
Younger than LangSmith/Langfuse
- open-source
- observability
- evals
- agents
- +1
Open
View Respan details
ObservabilityFREEMIUM
Respan
Keywords AI, Inc.
Self-driving observability, evals, and an AI gateway for LLM agents.
Respan is a unified LLM-engineering platform that routes, observes, and evaluates every model call from one control plane. Its AI gateway fronts 500+ models with fallbacks, load balancing, and spend caps; trace trees, dashboards, and alerts cover observability; and rule checks, AI judges, and human review run as one evaluation workflow. An automated eval agent surfaces issues from production traffic so teams can fix what breaks without stitching tools together.
Trace trees, dashboards, and alerts
Proprietary, no self-host option
- observability
- ai-gateway
- evals
- tracing
- +1
Open
View MLflow details
ObservabilityFREEOSS
MLflow
Linux Foundation
Open-source platform for the ML and GenAI lifecycle.
MLflow is an open-source platform for managing the full machine-learning and GenAI lifecycle — experiment tracking, model registry, deployment, and, more recently, LLM/agent observability. Its GenAI stack adds OpenTelemetry-based tracing, systematic evaluation with built-in metrics and LLM judges, and prompt versioning. Framework- and provider-agnostic, it runs on your own infrastructure with no vendor lock-in.
Fully open source, no lock-in
Self-hosting adds operational overhead
- llmops
- tracing
- evaluation
- mlops
- +1
Open
View TruLens details
EvalFREEOSS
TruLens
Snowflake
Open-source evaluation and tracing for LLM and agent apps.
TruLens is an open-source Python library for evaluating and tracing LLM, RAG, and agent applications. You wrap your app with feedback functions that score outputs on metrics like groundedness, context relevance, and answer relevance, then trace runs and compare versions on a metrics leaderboard. It integrates OpenTelemetry tracing and runs locally with a built-in dashboard.
OpenTelemetry tracing, runs locally
Python library, no hosted SaaS
- eval
- tracing
- rag
- llm-as-judge
- +1
Open
View Laminar details
ObservabilityFREEMIUMOpen core
Laminar
Laminar
Open-source observability built for AI agents.
An observability and evals platform purpose-built for AI agents: one-line OpenTelemetry-native tracing across SDKs like the Vercel AI SDK, OpenAI, Anthropic, LangChain and browser agents, plus an evals SDK/CLI for local and CI runs. Self-host the Apache-2.0 stack or use Laminar Cloud to debug long-running agent failures.
Self-hostable, Apache-2.0 licensed
Younger than LangSmith/Langfuse
- open-source
- tracing
- agents
- evals
Open
View PostHog details
AnalyticsFREEMIUMOpen core
PostHog
PostHog
Open-source product analytics with LLM observability built in.
A product analytics platform that folds session replay, feature flags, experiments, and a data warehouse into one stack. Its LLM analytics suite tracks traces, token costs, latency, and model errors for AI products, with automatic clustering that groups traces by behavior. Usage-based pricing with a generous free tier on every product.
Generous free tier (1M events/mo)
Product breadth can overwhelm small teams
- product-analytics
- llm-analytics
- session-replay
- feature-flags
Open
View Deepchecks details
EvalFREEMIUMOpen core
Deepchecks
Deepchecks
Testing-first evaluation and monitoring for LLM and ML systems.
Deepchecks brings a testing-first approach to AI quality. Its open-source Python library validates ML models and data from research through production, and the LLM Evaluation product extends that into continuous validation of LLM applications — measuring quality, performance, and pitfalls across experimentation and production with CI/CD hooks. Enterprise deployment runs in VPC, on-prem, or bare metal for teams that can't use cloud-hosted eval.
Open-source core (AGPL-3.0)
AGPL-3.0 may not suit all teams
- eval
- llm-testing
- ml-validation
- observability
- +1
Open
View LangWatch details
ObservabilityFREEMIUMOpen core
LangWatch
LangWatch
LLM observability, evaluation, and agent testing.
An open-source platform for monitoring, evaluating, and testing LLM and agent applications. LangWatch captures traces, runs evaluations and simulations, and surfaces quality and cost metrics in production. Offered as managed cloud or fully self-hosted for teams with strict data-residency needs.
Agent simulation testing built in
Smaller community than peers
- observability
- evaluation
- agent-testing
- llmops
Open
View Pydantic Logfire details
ObservabilityFREEMIUMOpen core
Pydantic Logfire
Pydantic
Observability for LLM and agent apps, from the Pydantic team.
An observability platform that traces your whole application stack — LLM calls, agents, databases, and HTTP — not just the model layer. The Python/JS/Rust SDKs are open source and built on OpenTelemetry, while the hosted backend handles storage, querying, and dashboards. Free tier covers 10M spans per month.
OpenTelemetry-based, portable traces
Hosted backend is proprietary
- observability
- tracing
- opentelemetry
- open-source
Open
View Agenta details
EvalFREEMIUMOpen core
Agenta
Agenta
Open-source LLMOps: prompt management, evaluation, and observability.
An open-source platform for building and improving LLM apps. Agenta combines a prompt playground, prompt versioning, evaluation (human and LLM-as-judge), and tracing/observability in one tool. Available as managed cloud or self-hosted, so teams can keep the whole eval-and-trace loop on their own infra.
Self-hostable on your own infra
Smaller ecosystem than incumbents
- llmops
- evaluation
- prompt-management
- observability
Open
View Athina AI details
EvalFREEMIUM
Athina AI
Athina AI
Build, test, and monitor LLM apps with evals and observability.
Athina AI is a collaborative platform for building, evaluating, and monitoring LLM features. It bundles prompt management, datasets, experiments, production tracing, and a library of 50+ preset and custom evaluations, with human annotation tools on top. The platform pairs with an open-source eval SDK and works with OpenAI, Azure, Bedrock, Vertex, and custom models hosted anywhere.
50+ preset + custom evals
Monitoring platform is closed
- eval
- observability
- llm-monitoring
- prompt-management
Open
View HoneyHive details
EvalFREEMIUM
HoneyHive
HoneyHive
The observability and evaluation layer for production AI agents.
A platform that unifies monitoring and testing for LLM apps and agents into one improvement loop: distributed tracing, online evaluations and alerts, offline experiments, annotation queues for expert feedback, and CI/CD-integrated regression testing. Built OpenTelemetry-native with support for 100+ models and agent frameworks. The free Developer tier covers small teams; Enterprise adds scale, self-host, and compliance.
Unifies tracing and evaluation
SaaS-only (self-host = Enterprise)
- eval
- observability
- tracing
- agents
- +1
Open
View Traceloop details
ObservabilityFREEMIUMOpen core
Traceloop
Traceloop
LLM observability built on OpenTelemetry.
A reliability platform for LLM apps: its open-source OpenLLMetry SDK instruments LLM, vector-DB, and framework calls as standard OpenTelemetry spans, which Traceloop's hosted dashboard turns into traces, cost/latency analytics, and quality monitoring. Because the data is plain OTel, you can pipe it to existing observability stacks instead of a proprietary one.
Built on open OpenTelemetry standard
Hosted dashboard less rich than rivals
- observability
- opentelemetry
- tracing
- open-source
- +1
Open
View Portkey details
InferenceFREEMIUMOpen core
Portkey
Portkey
AI gateway with observability, guardrails, and governance.
A production AI gateway that gives apps and agents unified access to 1,600+ LLMs across providers behind a single API, with built-in observability, prompt management, guardrails, and governance. Portkey adds routing, caching, fallbacks, cost limits, PII redaction, RBAC, and an MCP gateway. Its core gateway is open-source; run it self-hosted/hybrid or use the managed cloud, which offers a free tier.
One API across many providers
Acquired by Palo Alto Networks (closed 2025)
- ai-gateway
- llm-routing
- observability
- guardrails
Open
View Lunary details
ObservabilityFREEMIUMOpen core
Lunary
Lunary
Open-source observability and prompt management for LLM apps.
An open-source platform for monitoring, debugging, and improving LLM applications and chatbots. Lunary combines request tracing, cost and user analytics, versioned prompt management with A/B testing, plus human-in-the-loop review and automated scoring. Self-host the Apache-2.0 community edition or use the managed cloud, which starts free with a 10k-events monthly tier.
Apache-2.0, self-hostable
Evaluation features minimal/early-stage
- llm-observability
- prompt-management
- tracing
- open-source
Open
View W&B Weave details
ObservabilityFREEMIUMOpen core
W&B Weave
Weights & Biases
Tracing and evaluation for LLM apps, from Weights & Biases.
An observability and evaluation toolkit for generative-AI applications. A single @weave.op decorator traces every model call — capturing inputs, outputs, latency, token cost, and errors — and the same SDK builds rigorous evaluations using LLM-as-judge and custom scorers. Traces and experiments are organized in the Weights & Biases web platform for side-by-side comparison across prompts and models.
Single decorator traces every call
Traces land in W&B hosted platform
- llm-observability
- tracing
- eval
- open-source
- +1
Open
View Galileo details
ObservabilityFREEMIUM
Galileo
Galileo
Evaluation and observability for GenAI apps and agents, with inline guardrails.
A platform for testing, monitoring, and guardrailing LLM and agent applications. It ships 20+ out-of-the-box evals for RAG, agents, and safety, lets teams author custom evaluators, and turns those offline evals into real-time production guardrails powered by its own Luna eval models.
20+ out-of-the-box evals for RAG and agents
Pricing tiers gate the production guardrails
- evaluation
- observability
- guardrails
- agents
Open
View Fiddler AI details
ObservabilityPAID
Fiddler AI
Fiddler AI
AI observability and security platform for LLM apps, agents, and ML models.
An enterprise platform to monitor, analyze, and safeguard generative AI and ML in production. The Fiddler Trust Service scores prompts and responses for hallucination, toxicity, PII leakage, and prompt-injection, with low-latency guardrails plus real-time alerting and root-cause analysis. Originally an explainable-AI and model-monitoring pioneer, now spanning LLM and agent observability.
Covers classic ML and LLM monitoring
Enterprise, sales-quoted pricing
- llm-observability
- monitoring
- guardrails
- ml-monitoring
Open
View Langtrace details
ObservabilityFREEMIUMOpen core
Langtrace
Scale3 Labs
OpenTelemetry-based observability for LLM apps and agents.
Langtrace is an open-source observability and evaluation platform for LLM applications, capturing traces, token usage, latency, and cost across popular models, frameworks, and vector databases. Because it emits standard OpenTelemetry spans, traces flow to any OTel-compatible backend, and instrumentation is a two-line SDK install in Python or TypeScript. It ships as a hosted cloud with a free tier plus a self-hostable / on-prem option for data-sensitive teams.
Two-line SDK instrumentation
Smaller team and ecosystem
- observability
- tracing
- opentelemetry
- open-source
- +1
Open
View Maxim AI details
EvalFREEMIUM
Maxim AI
Maxim AI
Simulate, evaluate, and observe AI agents end-to-end.
An end-to-end platform for testing and monitoring AI agents across their lifecycle. It combines a prompt experimentation IDE, agent simulation across scenarios and personas, offline and online evaluations with custom metrics, and production observability with tracing and alerts. Aimed at teams shipping reliable agentic and RAG systems.
Agent simulation across personas/scenarios
Newer, smaller community than rivals
- eval
- agent-simulation
- observability
- tracing
- +1
Open
View Opik details
ObservabilityFREEMIUMOpen core
Opik
Comet
Open-source LLM evaluation, tracing, and monitoring.
Open-source platform from Comet for debugging and evaluating LLM and agent apps: full tracing of calls, tools, and agent steps, LLM-as-a-judge and heuristic evals, prompt management, and production dashboards. Self-host via Docker or Kubernetes, or use Comet's hosted cloud.
Self-host via Docker/Kubernetes
Younger than some rivals
- observability
- evaluation
- tracing
- open-source
Open
View Arize Phoenix details
ObservabilityFREEMIUM
Arize Phoenix
Arize AI
LLM tracing and evaluation with retrieval debugging.
Phoenix is Arize's observability platform — run locally in a notebook or as a hosted service. Especially strong for inspecting RAG pipelines, finding bad chunks, and tracking retrieval quality over time.
Source-available, runs locally
Less polished than hosted SaaS evals
- tracing
- rag
- retrieval-debugging
Open
View LangSmith details
ObservabilityFREEMIUM
LangSmith
LangChain
LangChain's hosted observability + eval platform.
Tracing, dataset management, eval orchestration, and prompt playground from the LangChain team. Pairs naturally if LangChain or LangGraph already runs in your stack, but works standalone via SDKs.
Native LangChain/LangGraph tracing
Closed source, cloud-only
- tracing
- evals
- datasets
- langchain
Open
View Helicone details
ObservabilityFREEMIUMOpen core
Helicone
Helicone
Drop-in LLM proxy with logging, caching, and cost tracking.
One-line integration — change your OpenAI/Anthropic base URL and get a dashboard with every prompt, response, latency, and dollar tracked. Adds caching and rate-limit handling without code changes.
No SDK or code changes to integrate
Request/response focused, not span-based
- proxy
- logging
- caching
- cost-tracking
Open
View Langfuse details
ObservabilityFREEMIUMOpen core
Langfuse
Langfuse
Open-source LLM observability. Self-hostable, OpenTelemetry-native.
Tracing, evals, prompt management, and dataset tooling for LLM apps — self-host on your own infra or use Langfuse Cloud. The open-source default when you want full ownership of your observability stack.
Own your observability data
Self-host infra cost at scale
- open-source
- tracing
- evals
- self-hosted
Open

Observability AI apps

Future AGI

InsightFinder

Judgment Labs

Cleric

Iris

Traversal

Okareo

OpenObserve

Hamming

Cekura

Coval

Atla

Confident AI

AgentOps

Evidently AI

Latitude

Respan

MLflow

TruLens

Laminar

PostHog

Deepchecks

LangWatch

Pydantic Logfire

Agenta

Athina AI

HoneyHive

Traceloop

Portkey

Lunary

W&B Weave

Galileo

Fiddler AI

Langtrace

Maxim AI

Opik

Arize Phoenix

LangSmith

Helicone

Langfuse