Open-source LLM observability, evaluation, and agent testing.
An open-source platform for monitoring, evaluating, and testing LLM and agent applications. LangWatch captures traces, runs evaluations and simulations, and surfaces quality and cost metrics in production. Offered as managed cloud or fully self-hosted for teams with strict data-residency needs.
Worth knowing
Amsterdam startup whose founders met at an Antler residency; raised a €1M pre-seed led by Passion Capital in 2025.
Observability for LLM and agent apps, from the Pydantic team.
An observability platform that traces your whole application stack — LLM calls, agents, databases, and HTTP — not just the model layer. The Python/JS/Rust SDKs are open source and built on OpenTelemetry, while the hosted backend handles storage, querying, and dashboards. Free tier covers 10M spans per month.
Worth knowing
Pydantic's first commercial product, launched alongside a $12.5M Sequoia-led Series A in October 2024 to expand beyond the OSS library.
A reliability platform for LLM apps: its open-source OpenLLMetry SDK instruments LLM, vector-DB, and framework calls as standard OpenTelemetry spans, which Traceloop's hosted dashboard turns into traces, cost/latency analytics, and quality monitoring. Because the data is plain OTel, you can pipe it to existing observability stacks instead of a proprietary one.
Worth knowing
A Y Combinator (W23) startup behind OpenLLMetry; acquired by ServiceNow in 2026.
Open-source observability and prompt management for LLM apps.
An open-source platform for monitoring, debugging, and improving LLM applications and chatbots. Lunary combines request tracing, cost and user analytics, versioned prompt management with A/B testing, plus human-in-the-loop review and automated scoring. Self-host the Apache-2.0 community edition or use the managed cloud, which starts free with a 10k-events monthly tier.
Worth knowing
Apache-2.0 and self-hostable, it bundles prompt versioning, A/B tests, and human review alongside tracing — not observability alone.
Tracing and evaluation for LLM apps, from Weights & Biases.
An observability and evaluation toolkit for generative-AI applications. A single @weave.op decorator traces every model call — capturing inputs, outputs, latency, token cost, and errors — and the same SDK builds rigorous evaluations using LLM-as-judge and custom scorers. Traces and experiments are organized in the Weights & Biases web platform for side-by-side comparison across prompts and models.
Worth knowing
The SDK is Apache-2.0 open source, but the traces it captures land in W&B's hosted platform — free for solo use.
Evaluation and observability for GenAI apps and agents, with inline guardrails.
A platform for testing, monitoring, and guardrailing LLM and agent applications. It ships 20+ out-of-the-box evals for RAG, agents, and safety, lets teams author custom evaluators, and turns those offline evals into real-time production guardrails powered by its own Luna eval models.
Worth knowing
Raised a $45M Series B led by Scale Venture Partners, with HuggingFace and Postman CEOs joining as angels.
AI observability and security platform for LLM apps, agents, and ML models.
An enterprise platform to monitor, analyze, and safeguard generative AI and ML in production. The Fiddler Trust Service scores prompts and responses for hallucination, toxicity, PII leakage, and prompt-injection, with low-latency guardrails plus real-time alerting and root-cause analysis. Originally an explainable-AI and model-monitoring pioneer, now spanning LLM and agent observability.
Worth knowing
Founder Krishna Gade built Facebook's 'Why am I seeing this?' explainability feature before starting Fiddler.
Open-source, OpenTelemetry-based observability for LLM apps and agents.
Langtrace is an open-source observability and evaluation platform for LLM applications, capturing traces, token usage, latency, and cost across popular models, frameworks, and vector databases. Because it emits standard OpenTelemetry spans, traces flow to any OTel-compatible backend, and instrumentation is a two-line SDK install in Python or TypeScript. It ships as a hosted cloud with a free tier plus a self-hostable / on-prem option for data-sensitive teams.
Worth knowing
Maker Scale3 Labs contributed the first official OpenAI instrumentation to OpenTelemetry and helped author its GenAI conventions.
Open-source LLM evaluation, tracing, and monitoring.
Open-source platform from Comet for debugging and evaluating LLM and agent apps: full tracing of calls, tools, and agent steps, LLM-as-a-judge and heuristic evals, prompt management, and production dashboards. Self-host via Docker or Kubernetes, or use Comet's hosted cloud.
Worth knowing
Launched in September 2024 by Comet, the established ML experiment-tracking company, extending its platform from training into LLM ops.
LLM tracing + evaluation. Strong on retrieval debugging.
Phoenix is Arize's observability platform — run locally in a notebook or as a hosted service. Especially strong for inspecting RAG pipelines, finding bad chunks, and tracking retrieval quality over time.
Worth knowing
Licensed under Elastic License 2.0 (source-available), not OSI open-source — despite its open GitHub repo.
Tracing, dataset management, eval orchestration, and prompt playground from the LangChain team. Pairs naturally if LangChain or LangGraph already runs in your stack, but works standalone via SDKs.
Worth knowing
LangChain's primary commercial product and revenue driver behind its 2025 $1.25B unicorn valuation.
Drop-in LLM proxy with logging, caching, and cost tracking.
One-line integration — change your OpenAI/Anthropic base URL and get a dashboard with every prompt, response, latency, and dollar tracked. Adds caching and rate-limit handling without code changes.
Worth knowing
YC W23 startup acquired by docs platform Mintlify in March 2026, having processed over 14 trillion tokens for 16,000+ orgs.
Tracing, evals, prompt management, and dataset tooling for LLM apps — self-host on your own infra or use Langfuse Cloud. The open-source default when you want full ownership of your observability stack.
Worth knowing
Y Combinator W23 startup; acquired by ClickHouse in January 2026.