Data Ops AI apps

Data labeling, curation, and pipeline tooling that prepares training and evaluation data for AI.

11 apps · researched & kept current by Claude Code

  • View Nanonets details
    Data Ops · FREEMIUM

    Nanonets

    Nanonets

    AI agents for document processing and enterprise data extraction.

    Nanonets automates document-heavy workflows (invoices, orders, contracts, and claims) with AI agents that read, extract, and route structured data across ERPs, email, and approval chains. It runs on its own OCR-3 extraction model and can fold in LLMs for agentic pipelines. It is offered as a managed cloud service, with VPC, single-tenant, and on-premises deployment options and regional data residency.

    Worth knowing

    A Y Combinator alum founded in 2017; raised a $29M Series B led by Accel in 2024.

    • document-ai
    • idp
    • ocr
    • extraction
  • View Apify details
    Data Ops · FREEMIUM

    Apify

    Apify

    Full-stack web scraping and browser automation platform for AI data.

    A cloud platform for web scraping, data extraction, and browser automation built around 'Actors': serverless programs that crawl sites and return structured data. Its store offers tens of thousands of ready-made Actors, and results come back as clean Markdown or JSON that feeds LLMs, vector databases, and RAG pipelines via LangChain and LlamaIndex. The company also maintains the open-source Crawlee crawling library for local development.

    Worth knowing

    Founded in Prague in 2016 as "Apifier," rebranded to Apify in 2017; it still maintains the open-source Crawlee library.

    • web-scraping
    • crawling
    • automation
    • rag
  • View V7 Go details
    Data Ops · PAID

    V7 Go

    V7 Labs

    Agentic AI that automates document-heavy knowledge work and data extraction.

    An operational AI platform from V7 Labs that builds and runs agents over complex documents — extracting financial, legal, and commercial terms, completing DDQs, and generating memos with source traceability. It chains foundation models from OpenAI, Anthropic, and Google into multi-step, auditable workflows aimed at finance, insurance, legal, and real-estate teams.

    Worth knowing

    Made by V7 Labs, which began as San Francisco vision startup Aipoly (2015), moved to London as V7 (2018), and raised a $33M Series A in 2022.

    • document-ai
    • data-extraction
    • agents
    • knowledge-work
  • View Docling details
    Data Ops · FREE · OSS

    Docling

    Docling Project

    Open-source toolkit that turns documents into AI-ready Markdown and JSON.

    A document-processing toolkit that converts PDF, DOCX, PPTX, XLSX, HTML, images, and audio into clean Markdown or JSON for LLM and RAG pipelines. It performs advanced PDF understanding (page layout, reading order, table structure, and OCR for scans) and ships a hybrid chunker plus native LangChain and LlamaIndex integrations. Small enough to run on a laptop via a Python API or CLI; MIT-licensed and community-governed.

    Worth knowing

    Built at IBM Research Zurich and donated to the LF AI & Data Foundation in April 2025.

    • document-parsing
    • rag
    • open-source
    • pdf
  • View Crawl4AI details
    Data Ops · FREE · OSS

    Crawl4AI

    Crawl4AI

    Open-source crawler that turns the web into clean, LLM-ready Markdown.

    Crawl4AI is an open-source (Apache 2.0) web crawler and scraper that converts pages into clean Markdown or structured data for RAG, agents, and data pipelines. The core runs locally with no API key, handles JS rendering, and supports optional LLM-based extraction with any provider. It installs as a Python library/CLI or deploys as a Dockerized FastAPI server; a hosted Cloud API is in closed beta.

    Worth knowing

    Created in 2023 by 'unclecode' (Hossein Tohidi), Kidocode's founder; once hit #1 on GitHub's Python trending.

    • web-scraping
    • crawling
    • open-source
    • markdown
  • View Reducto details
    Data Ops · FREEMIUM

    Reducto

    Reducto

    Agentic document parsing and extraction for AI teams, via one API.

    A document-intelligence API that parses, splits, extracts, and edits PDFs, images, spreadsheets, and slides into clean, structured output for RAG and AI pipelines. It blends custom in-house models with frontier ones and bills via usage credits, automatically discounting pages it can parse without the heavier pipeline.

    Worth knowing

    Founded in 2023 by MIT alumni; raised a $24.5M Series A led by Benchmark in 2025, with customers including Harvey, Scale AI and Vanta.

    • document-parsing
    • ocr
    • extraction
    • rag
  • View Chunkr details
    Data Ops · FREEMIUM · Open core

    Chunkr

    Lumina AI

    Open-source document intelligence API for RAG-ready data.

    A document parsing and intelligence API that turns complex PDFs, slides, Word docs, and images into clean, LLM/RAG-ready chunks. Chunkr runs layout analysis, OCR, reading-order detection, semantic chunking, and schema-based extraction, emitting HTML, Markdown, or JSON. Self-host the open-source pipeline or call the managed cloud API, which includes a free tier of 200 pages with no card required.

    Worth knowing

    Built by Lumina AI (YC W24), the team behind a scientific-literature search engine; its parser is written in Rust.

    • document-parsing
    • ocr
    • rag
    • open-source
  • View ScrapeGraphAI details
    Data Ops · FREEMIUM · Open core

    ScrapeGraphAI

    ScrapeGraphAI

    Turn any webpage into structured data with one prompt-driven API call.

    ScrapeGraphAI is an AI web-scraping tool that extracts structured data from pages and documents using natural-language prompts instead of CSS selectors or XPath, orchestrating LLMs in graph-style pipelines (single-page, multi-page, search, crawl). The core library is open-source under the MIT license with Python and Node SDKs; a hosted API adds a credit-based free tier and paid plans, plus integrations with LangChain, LlamaIndex, n8n, and an MCP server.

    Worth knowing

    Built by Italian founders Marco Vinciguerra and Lorenzo Padoan; the open-source library has passed 20,000 GitHub stars.

    • web-scraping
    • extraction
    • open-source
    • rag
  • View Label Studio details
    Data Ops · FREEMIUM · Open core

    Label Studio

    HumanSignal

    Open-source multi-type data labeling and AI evaluation.

    Widely used open-source tool for labeling and annotating data across images, text, audio, video, and time-series, with a standardized export format for training and fine-tuning. ML backends can pre-label data to speed up human review, and it increasingly doubles as a human-in-the-loop AI evaluation surface. Maintained by HumanSignal, which offers a hosted Starter tier and Label Studio Enterprise.

    Worth knowing

    Maker Heartex rebranded to HumanSignal in June 2023; Label Studio has labeled 200M+ data points.

    • data-labeling
    • open-source
    • annotation
    • human-in-the-loop
  • View Unstructured details
    Data Ops · FREEMIUM · Open core

    Unstructured

    Unstructured

    ETL for LLMs — turn PDFs, decks, and emails into clean, structured data.

    Ingests 64+ file types, then partitions, chunks, enriches, and embeds them into LLM-ready output, handling OCR, tables, and document hierarchy. An open-source library plus a low-code platform and API; a staple preprocessing layer for production RAG.

    Worth knowing

    Raised a $40M Series B in March 2024 led by Menlo Ventures, with Databricks Ventures, IBM Ventures and NVIDIA's NVentures all participating.

    • document-etl
    • preprocessing
    • rag
    • open-source
  • View Firecrawl details
    Data Ops · FREEMIUM · Open core

    Firecrawl

    Firecrawl

    Turn any website into clean, LLM-ready data — scrape, crawl, search.

    A web data API for AI: scrape, crawl, map, and search pages into clean Markdown or structured JSON, with proxies, anti-bot measures, and JS rendering handled for you. Open-source core (AGPL) plus a hosted service; a default web-ingestion layer for agents and RAG pipelines.

    Worth knowing

    Pivoted out of Mendable (AI doc-chat used by Snapchat, MongoDB); a YC S22 company that raised $14.5M in 2025.

    • web-scraping
    • crawling
    • rag
    • open-source