Deepchecks

Testing-first evaluation and monitoring for LLM and ML systems.

Category: Eval
Pricing: FREEMIUM
Source: Open core
Hosting: Hybrid
Platforms: Web, API, CLI
Models: Model-agnostic
Verified: Jun 12, 2026

Deepchecks brings a testing-first approach to AI quality. Its open-source Python library validates ML models and data from research through production, and the LLM Evaluation product extends that into continuous validation of LLM applications — measuring quality, performance, and pitfalls across experimentation and production with CI/CD hooks. Enterprise deployment runs in VPC, on-prem, or bare metal for teams that can't use cloud-hosted eval.

Pros & cons

Pros
  • Open-source core (AGPL-3.0)
  • Testing-first, CI/CD-friendly evals
  • VPC, on-prem, bare-metal options
  • Covers both ML and LLM validation

Cons
  • AGPL-3.0 may not suit all teams
  • Hosted platform pricing is steep
  • Breadth adds setup overhead

View all Eval
  • Eval, FREEMIUM, Open core

    Giskard

    Giskard

    Open-source evaluation and red-teaming for LLM agents and RAG apps.

    Giskard is an open-source (Apache-2.0) Python library for testing LLMs, RAG pipelines, and ML models — its Scan automatically surfaces hallucinations, prompt injection, bias, and other vulnerabilities, while red-teaming agents run multi-turn adversarial attacks across dozens of probes. The paid Giskard Hub adds team collaboration, continuous testing, and scheduled scans. The team also publishes the open Phare LLM safety benchmark.

    Worth knowing

    Paris-based and YC-backed; joined a France 2030 R&D consortium with Mistral to build LLM-evaluation methods.

    • llm-eval
    • red-teaming
    • testing
    • rag
  • Eval, FREEMIUM, Open core

    DeepEval

    Confident AI

    Pytest-style LLM evaluation framework. Open source.

    Open-source (Apache 2.0) framework for evaluating LLM apps the way Pytest tests code — assertions backed by 50+ ready metrics spanning LLM-as-judge, RAG, agents, conversation, and safety. Plugs into LangChain, CrewAI, OpenAI Agents and more. Confident AI is the paid cloud platform that adds test management, dashboards, and observability on top.

    Worth knowing

    Built by Confident AI (YC W25, founded 2024); its open-source framework runs ~2M evaluations a day.

    • eval
    • open-source
    • llm-as-judge
    • rag
  • Eval, FREE, OSS

    Ragas

    Exploding Gradients

    Open-source evaluation toolkit for RAG and LLM applications.

    Open-source (Apache-2.0) Python framework for evaluating retrieval-augmented generation and LLM apps. Provides reference-free metrics — faithfulness, answer relevancy, context precision/recall — plus knowledge-graph-based synthetic test generation. Integrates with LangChain, LlamaIndex, and CI pipelines.

    Worth knowing

    Began as a 2023 research paper (published at EACL 2024) and a Y Combinator W24 startup before becoming a widely used open-source toolkit for RAG evaluation.

    • eval
    • rag
    • llm-as-judge
    • open-source
  • Eval, FREE, OSS

    Promptfoo

    Promptfoo

    Open-source LLM eval CLI. Rubric scoring + golden sets.

    YAML-driven eval harness. Pair a prompt with a golden set, define rubrics, and run it across multiple models in CI. Strong for catching prompt regressions before they hit production.

    Worth knowing

    Acquired by OpenAI in March 2026; stays open-source and MIT-licensed, and is used internally by both OpenAI and Anthropic.

    • eval
    • ci
    • rubric
    • open-source