Data Ops AI apps

Data labeling, curation, and pipeline tooling that prepares training and evaluation data for AI.

50 apps · researched & kept current by Claude Code

Data Ops · FREEMIUM
Kili Technology
Kili Technology
Data labeling and quality platform for training and evaluating AI models.
Kili Technology is a data-centric platform for turning raw data into high-quality training and evaluation datasets. It supports annotation across image, video, text, OCR, and geospatial data, with review and quality workflows, plus LLM evaluation and RLHF using human-in-the-loop and LLM-as-a-judge. It is used by enterprises including Airbus and SAP, and offers cloud, private-cloud, and on-premise deployment.
Multi-modal annotation
Enterprise-oriented complexity
- data-labeling
- annotation
- rlhf
- evals
Analytics · FREEMIUM
Zerve
Zerve AI
Agentic AI data platform for research and analytics.
Zerve is a data platform that brings AI agents, notebooks, automatic data discovery, conversational reports, and deployment into a single canvas for data scientists and analysts. Agentic notebooks assist with exploration and analysis, the platform maps schemas and surfaces patterns from connected data, and finished work can ship directly as an API, app, or dashboard. It offers a free tier, with self-hosted and air-gapped deployments for enterprise.
Agentic notebooks with auto data discovery
Not open source
- data-science
- notebooks
- agentic
- analytics
Data Ops · FREEMIUM
Pulse
Pulse AI
Production-grade extraction for complex documents.
A document-extraction platform that converts messy, real-world documents — financial statements, medical records, contracts, spreadsheets — into clean, LLM-ready structured data. Pulse runs its own OCR, layout, and vision models (including its Ultra extraction model) rather than wrapping a general-purpose LLM, and exposes the pipeline through an API that drops into existing data workflows. It offers a free sandbox to try, with enterprise tiers for scale; the company says it has processed over a billion document pages.
Purpose-built models for hard layouts
Cloud-only (no self-host)
- document-extraction
- ocr
- unstructured-data
- data-ingestion
Data Ops · FREEMIUM · Open core
Unstract
Zipstack
Turn unstructured documents into structured data.
An agentic document-processing platform that extracts clean, structured JSON from PDFs, scans, and other complex documents using LLMs. Its Prompt Studio gives a no-code IDE to author and test extraction prompts per field, which you then deploy as APIs or ETL pipelines into your warehouse. Built by Zipstack, Unstract is open source under AGPL-3.0 and self-hostable via Docker Compose, with a managed cloud that adds SSO, human-in-the-loop review, and compliance certifications (SOC 2, HIPAA, ISO 27001, GDPR).
Open-source (AGPL-3.0), self-hostable
AGPL-3.0 may deter some commercial use
- document-extraction
- unstructured-data
- etl
- rag
MCP · FREE · OSS
DBHub
Bytebase
Zero-dependency, token-efficient database MCP server.
An open-source Model Context Protocol server that lets AI coding assistants connect to and query databases through one unified gateway. DBHub deliberately exposes only two tools — execute_sql (with transaction support and safety controls) and search_objects (schema/table/column exploration) — to keep the agent's context window lean. It supports Postgres, MySQL, SQL Server, MariaDB, and SQLite, works with clients like Claude Desktop, Claude Code, Cursor, VS Code, and Copilot CLI, and includes a built-in web interface for running queries and viewing request traces.
Free and MIT-licensed
SQL databases only (no NoSQL)
- mcp
- database
- sql
- postgres
Data Ops · FREE · OSS
DVC
lakeFS
Git extension for versioning data, models, and ML experiments.
DVC (Data Version Control) brings software-engineering practices to machine learning: it versions datasets, models, and pipelines alongside code in any Git repository, storing large files in your own remote storage while keeping lightweight pointers in Git. It enables reproducible experiments, data/model lineage, and pipeline orchestration from the command line.
Free and open source
CLI-centric learning curve
- data-versioning
- mlops
- reproducibility
- ml-pipelines
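The pointer mechanism in the DVC entry above can be sketched in a few lines. This is an illustration only, not DVC's implementation or file format: the helper names `add_to_cache` and `restore`, the `.ptr` suffix, and the JSON fields are all hypothetical. It shows how a large file can live in separate storage while a small hash-bearing pointer is what gets committed to Git.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file.

    Hypothetical helper: the pointer is the only thing that would be committed to Git.
    """
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache_dir / digest)  # cached copy is named by its hash
    pointer = data_file.with_name(data_file.name + ".ptr")
    pointer.write_text(
        json.dumps({"sha256": digest, "size": data_file.stat().st_size})
    )
    return pointer


def restore(pointer: Path, cache_dir: Path, target: Path) -> None:
    """Recreate the data file from the cache using only the pointer's hash."""
    meta = json.loads(pointer.read_text())
    shutil.copyfile(cache_dir / meta["sha256"], target)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    data = root / "train.csv"
    data.write_bytes(b"a,b\n1,2\n" * 1000)

    ptr = add_to_cache(data, root / "cache")
    data.unlink()                      # simulate a fresh checkout without the data
    restore(ptr, root / "cache", data)
    print(ptr.stat().st_size, "byte pointer for", data.stat().st_size, "byte file")
```

In this sketch the cache directory stands in for remote storage; the point is only that the pointer stays small however large the data file grows.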
Data Ops · FREEMIUM · Open core
lakeFS
Treeverse
Git-like version control for data lakes over your existing object storage.
Open-source data version control that turns object storage (S3, GCS, Azure Blob, MinIO) into Git-like repositories. Teams branch, commit, merge, and roll back petabyte-scale data lakes for isolated experimentation, reproducible ML pipelines, data-quality gates, and compliance lineage — without copying data. Integrates with Spark, Trino, Databricks, Delta Lake, and Iceberg.
Open source (Apache 2.0)
Operational overhead to self-host
- data-versioning
- data-lake
- mlops
- reproducibility
Vision · FREEMIUM
T-Rex Label
Visincept (IDEA Research)
Zero-shot AI image annotation that batch-labels with visual prompts.
T-Rex Label is a browser-based image annotation tool built on the T-Rex2 open-set detection model. Point it at one example and its visual-prompt, zero-shot detection finds and labels matching objects across an entire dataset—no training or fine-tuning required—handling dense, occluded, and varied-lighting scenes. It exports to COCO and YOLO formats and integrates with tools like Roboflow and Labelbox.
No per-class training needed
Browser-only, no offline mode
- image-annotation
- object-detection
- zero-shot
- dataset-labeling
Analytics · FREEMIUM
Nomic Atlas
Nomic AI
Explore, structure, and analyze millions of unstructured records as interactive embedding maps.
Nomic Atlas is a data-intelligence platform that embeds large collections of text, image, and audio data and renders them as interactive 2-D maps you can browse, search, cluster, and label in the browser. Powered by Nomic's own embedding and topic-modeling models, it scales from hundreds to tens of millions of points and exposes the same pipeline through a developer API for embeddings and retrieval.
Visual maps of huge unstructured datasets
Free tier makes maps public
- data-visualization
- embeddings
- unstructured-data
- topic-modeling
Data Ops · FREEMIUM · Open core
Datalab
Datalab
High-accuracy document parsing — PDFs and images to markdown, JSON, and HTML.
Datalab turns PDFs, images, and office documents into clean markdown, JSON, and HTML with layout, table, math, and code preservation. It is the commercial, hosted layer over the open-source Marker converter and Surya OCR toolkit, offered as a pay-as-you-go API with a free monthly allowance, while the underlying models stay free to self-host for research and small startups.
Pay-as-you-go API with free allowance
Hosted API metered per page
- document-parsing
- ocr
- pdf-to-markdown
- rag
Search · FREEMIUM
Ragie
Ragie, Corp
Managed RAG-as-a-service — the context engine for AI agents and apps.
Ragie is a fully managed retrieval-augmented-generation platform. It ingests data through native connectors like Google Drive and Notion, parses multimodal content (PDFs, images, audio, video), and serves hybrid vector + keyword + summary retrieval over an API and MCP server. Developers add accurate, grounded context to LLM apps without building their own ingestion and retrieval pipeline.
Fully managed, fast to integrate
Production tier starts at $500/month
- rag
- retrieval
- context
- mcp
Data Ops · PAID
DataRobot
DataRobot
Enterprise AI platform for building, deploying, and governing ML and agentic AI.
DataRobot is an enterprise AI platform spanning the full lifecycle: predictive AI with automated machine learning (AutoML), generative and agentic AI, plus observability and governance. Its AutoML builds and compares many models at once so teams ship production-ready AI faster, and newer agent kits make enterprise AI agents practical to deploy. It runs in the cloud or on-premises for large organizations.
AutoML builds many models fast
Enterprise pricing, often six figures
- automl
- mlops
- enterprise-ai
- predictive-ai
Data Ops · PAID
Maia
Matillion
An agentic 'AI data team' that turns requests into production-ready data pipelines.
Maia is an AI data automation platform that uses agentic AI mapped to real data-team roles to build, govern, and manage data pipelines from natural-language requests. It covers pipeline design, data quality, integration, DataOps monitoring, cost (FinOps) optimization, and legacy ETL migration. It targets enterprise data teams, with customers including DocuSign, Autodesk, Siemens Healthineers, and Cisco.
Agentic pipeline building from prompts
No public pricing; sales-gated
- data-engineering
- data-pipelines
- etl
- agentic-ai
Infra · FREEMIUM · Open core
Tabstack
Mozilla
Browsing infrastructure for AI agents — extract, research, automate.
A managed web API that lets agents browse the live web without running headless Chrome. /extract turns pages into Markdown or structured JSON, /generate transforms content on the fly, and /automate clicks, scrolls, and fills forms. Lightweight fetches escalate to full browser automation only when a site needs JS execution.
Single API for extract, generate, and automate
Hosted service is proprietary (only the engine is OSS)
- browser-automation
- web-extraction
- agents
- open-core
Data Ops · FREEMIUM
Extend
Extend AI
Full-stack document processing platform for AI agents and pipelines.
Extend is an LLM-powered document processing platform that parses, extracts, classifies, splits, and edits complex documents — handwriting, tables, and mixed formats — into reliable structured data via API or its web Studio. It combines multiple frontier models with proprietary context engineering for messy real-world files. Used by teams at Brex, Square, Checkr, and Flatiron Health.
Handles handwriting, tables, mixed formats
Newer than incumbent IDP vendors
- document-processing
- ocr
- extraction
- vlm
Vision · FREE · OSS
olmOCR
Allen Institute for AI
Open-source OCR that converts PDFs and scans into clean, structured text.
olmOCR is an open-source toolkit from the Allen Institute for AI that turns PDFs and document images into clean, reading-order plain text, preserving tables, equations, and handwriting. It runs a fine-tuned 7B vision-language model with a document-anchoring prompting technique, and is built for cheap, dataset-scale conversion for LLM training and retrieval. Released with model weights, training data, and inference code; runs on your own GPUs or via third-party inference providers.
Ships weights, training data, and code
Requires a capable GPU to self-host
- ocr
- open-source
- pdf
- document-parsing
Vision · PAID
Hyperscience
Hyperscience
Enterprise document processing that turns messy paperwork into structured data.
Hyperscience is an enterprise intelligent document processing (IDP) platform that reads, classifies, and extracts data from forms, invoices, and handwritten paperwork at high accuracy. It trains custom machine-learning models per document type and routes low-confidence cases to humans, targeting straight-through automation for high-volume back-office workflows. Sold to large enterprises and government agencies, with cloud, private-cloud, and air-gapped on-prem deployment options.
Routes low-confidence cases to humans
Enterprise sales, no public pricing
- document-processing
- idp
- ocr
- enterprise
Data Ops · FREE · OSS
CocoIndex
CocoIndex
Incremental data framework for fresh AI context.
CocoIndex is an open-source data transformation framework that keeps AI agents and LLM apps supplied with continuously fresh, structured context. It turns sources like codebases, PDFs, databases, and Slack into vector or graph stores, and reprocesses only what changed (delta-only) with parallel execution by default. A Rust core drives reliability while pipelines are defined declaratively in Python, with end-to-end lineage and an observability UI called CocoInsight.
Parallel execution by default
Younger, smaller ecosystem
- data-pipeline
- etl
- rag
- open-source
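The delta-only reprocessing mentioned in the CocoIndex entry above can be sketched as follows. This is not CocoIndex's API: `incremental_update` and its arguments are hypothetical, and content hashing is assumed here as the change-detection method, which the entry does not specify.

```python
import hashlib
from typing import Callable, Dict, List


def incremental_update(
    sources: Dict[str, str],
    seen_hashes: Dict[str, str],
    outputs: Dict[str, str],
    transform: Callable[[str], str],
) -> List[str]:
    """Re-run `transform` only for new or changed items.

    `seen_hashes` and `outputs` are updated in place and carry state between
    runs. Returns the keys that were actually reprocessed.
    """
    processed = []
    for key, content in sources.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if seen_hashes.get(key) != digest:   # new or changed since last run
            outputs[key] = transform(content)
            seen_hashes[key] = digest
            processed.append(key)

    # Drop state for items that no longer exist in the source.
    for key in list(seen_hashes):
        if key not in sources:
            del seen_hashes[key]
            outputs.pop(key, None)
    return processed


if __name__ == "__main__":
    seen, out = {}, {}
    print(incremental_update({"a.md": "x", "b.md": "y"}, seen, out, str.upper))
    print(incremental_update({"a.md": "x", "b.md": "z"}, seen, out, str.upper))
```

On the second call only the changed item is transformed again; unchanged items keep their earlier output.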
Data Ops · FREEMIUM
Diffbot
Diffbot
Web-scale data extraction and a knowledge graph that grounds AI in facts.
Diffbot reads the public web like a person and turns it into structured data: an Extract API for articles and products, a Crawl service, a Natural Language API, and a Knowledge Graph of billions of entities and over a trillion facts. The graph is refreshed continuously, so AI systems can ground answers in current, verifiable data rather than model memory. Diffbot also ships its own factually grounded GraphRAG language model.
Continuously refreshed knowledge graph
Enterprise pricing for serious volume
- knowledge-graph
- web-data
- graphrag
- extraction-api
Data Ops · PAID
Rossum
Rossum (Coupa)
AI-first intelligent document processing for end-to-end transaction automation.
Rossum reads transactional documents like invoices and purchase orders, then captures, validates, and transforms the data and pushes it into downstream ERP and approval workflows. It is built on a proprietary transactional large language model trained on tens of millions of documents that learns continuously from each customer's feedback, supports 276 languages plus handwriting, and is cloud-native. The platform targets accounts-payable and complex invoicing automation for enterprises.
Captures and validates invoice data
Enterprise pricing, no public tiers
- intelligent-document-processing
- invoice-automation
- ocr
- accounts-payable
Data Ops · PAID
Snorkel AI
Snorkel AI
Data development platform for programmatically labeling AI training data.
Enterprise platform for building and curating AI training and evaluation data with programmatic labeling instead of hand-annotating examples one by one. Teams encode domain knowledge as labeling functions that Snorkel Flow applies and refines at scale, then use the resulting datasets to fine-tune and evaluate models.
Programmatic labeling scales past manual
Enterprise pricing, no self-serve tier
- data-labeling
- training-data
- data-centric-ai
- enterprise
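The labeling-function idea in the Snorkel AI entry above can be sketched as follows. This is not Snorkel Flow's API: the three rule functions and `majority_label` are hypothetical, and a plain majority vote is swapped in as the simplest way to combine votes, where a production system would typically use something more refined.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a labeling function returns this when it has no opinion
SPAM, HAM = 1, 0

LabelingFunction = Callable[[str], Optional[int]]


def lf_contains_link(text: str) -> Optional[int]:
    return SPAM if "http" in text else ABSTAIN


def lf_free_offer(text: str) -> Optional[int]:
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> Optional[int]:
    return HAM if len(text.split()) <= 3 else ABSTAIN


def majority_label(text: str, lfs: List[LabelingFunction]) -> Optional[int]:
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


LFS = [lf_contains_link, lf_free_offer, lf_short_message]

if __name__ == "__main__":
    for message in ["Claim your FREE prize at http://example.com", "hi there"]:
        print(repr(message), "->", majority_label(message, LFS))
```

Each rule encodes one piece of domain knowledge and is allowed to abstain, so many noisy rules can label a large dataset without anyone annotating examples one by one.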
HR · PAID
micro1
micro1 Inc.
AI interviewer 'Zara' that sources and vets experts at scale.
micro1 is a human-data platform that uses its AI recruiter agent 'Zara' to source, interview, and vet domain experts at high velocity, then channels them into dataset creation for AI labs. The same Zara interviewer also powers a vetted-engineer and talent offering. It is positioned as a Scale AI competitor.
Sources and interviews experts fast
Focused on AI-data, not general hiring
- recruiting
- ai-interview
- ai-data
- hr
HR · PAID
Mercor
Mercor
AI talent marketplace matching vetted experts to AI labs and companies.
Mercor runs an AI-powered talent marketplace that matches vetted experts — engineers, scientists, doctors, and lawyers — with AI labs and companies, largely to generate and review training and reinforcement data. AI conducts the initial screening interview, then humans are placed on hourly paid work. It began as an AI recruiting tool and became a leading supplier of 'human data' for frontier-model training.
AI-screens experts at scale
Not a self-serve hiring tool
- recruiting
- talent-marketplace
- ai-data
- hr
Finance · FREEMIUM
Daloopa
Daloopa
Source-linked financial data extracted from filings, feeding analyst models, AI agents, and LLMs.
Daloopa automatically extracts and standardizes fundamental financial data — from SEC filings, investor presentations, and earnings materials — for 5,500+ global tickers with up to 14 years of history, linking every number back to its source document for auditability. Analysts use it to build and update models faster, cutting time spent during earnings season. The same data is exposed to AI tools via API, an Excel add-in, and Model Context Protocol connectors for ChatGPT, Claude, Perplexity, and Rogo.
Auditable: every number traces to filings
Paid tier pricing not public
- financial-data
- fundamental-data
- equity-research
- mcp
Data Ops · PAID
Surge AI
Surge AI
Premium human data and RLHF for frontier AI labs.
Surge AI provides high-quality human-generated training data and reinforcement learning from human feedback (RLHF) for AI developers. It pairs a large network of expert annotators with a labeling platform and API to produce complex, specialized data — code, math, safety, and domain reasoning — used to train and align frontier models. Reported customers include OpenAI, Anthropic, Google, and Meta.
Expert annotators, high-quality data
Premium pricing (sales-led)
- data-labeling
- rlhf
- training-data
- human-feedback
Vision · PAID
Lightly
Lightly
Computer-vision data curation, labeling, and model pretraining.
Lightly is a computer-vision data platform that helps teams curate the most informative samples from large image and video datasets using embeddings, active learning, and near-duplicate detection. Its suite spans LightlyStudio (curation and labeling), LightlyTrain (self-supervised pretraining and fine-tuning of vision models), and LightlyEdge (smart data selection on devices). The aim is to cut labeling cost by training on the data that actually improves models.
Embedding-based data curation and dedup
No public pricing; sales-led
- computer-vision
- data-curation
- active-learning
- self-supervised
Data Ops · PAID
SuperAnnotate
SuperAnnotate AI
Platform for building multimodal AI datasets and evaluation pipelines.
SuperAnnotate is an enterprise data platform for creating, managing, and evaluating high-quality datasets for AI. It spans annotation across images, video, text, audio, and LiDAR, with AI-assisted labeling, customizable workflows, and an optional managed annotation workforce. Teams use it to build human-in-the-loop data and evaluation pipelines for agentic, multimodal, and frontier AI.
Multimodal: image, video, text, audio, LiDAR
No free tier; sales-led pricing
- data-labeling
- annotation
- multimodal
- rlhf
Data Ops · FREEMIUM
Labelbox
Labelbox
Data factory for AI teams — labeling, evals, and human data for training.
Labelbox is a platform for generating and managing training data for AI models, combining annotation tools (Annotate), data curation (Catalog), and model-assisted labeling and evaluation (Model Foundry). It now spans reinforcement-learning data, custom evals, robotics datasets, and an on-demand network of expert human labelers, metered by a usage-based Labelbox Unit (LBU).
Mature, full-featured labeling UI
Usage-based LBU pricing hard to forecast
- data-labeling
- training-data
- annotation
- evals
Data Ops · FREEMIUM
LlamaParse
LlamaIndex
Agentic document parsing that turns complex PDFs into AI-ready markdown.
LlamaParse is LlamaIndex's managed document-parsing service: it extracts text, tables, charts, and images from PDFs and 90+ other formats into clean markdown for RAG pipelines. It offers layout-aware and multimodal parsing modes and 100+ language support, and anchors the LlamaCloud platform alongside Extract, Classify, Split, and Index.
Strong on tables, charts, scanned PDFs
Cloud-only, credit-based costs add up
- document-parsing
- rag
- ocr
- pdf
Vision · FREEMIUM
Clarifai
Clarifai
Full-stack AI platform for computer vision and LLMs.
Clarifai is a full-stack AI platform for building with unstructured image, video, text, and audio data. It pairs production computer-vision models — classification, detection, visual search — with a model hub for LLMs, plus data labeling, training of custom models, and inference, all behind one API and console. A free Community tier lets you discover and run models before moving to paid usage plans.
Mature, end-to-end vision stack
Broad platform has a learning curve
- computer-vision
- model-hub
- data-labeling
- inference
Data Ops · FREEMIUM
Mindee
Mindee
AI document-processing API that turns files into structured data.
Mindee is a developer-first document-AI platform that converts photos, PDFs, and scans — invoices, receipts, IDs, financial and mail documents — into structured JSON through a single REST API, with no model training required. Beyond extraction it handles document splitting, classification, and cropping, and ships SDKs for Python, Java, PHP, and more. Billing is credit-based per page processed.
Pretrained models for common doc types
Hosted API is proprietary
- document-ai
- ocr
- idp
- extraction
Data Ops · PAID
Scale AI
Scale AI
Training data, evaluations, and enterprise GenAI from the data-labeling giant.
Scale supplies the human-annotated training data behind most frontier AI labs through its Data Engine, spanning labeling, RLHF, and expert red-teaming. On top of the data business it runs evaluation leaderboards, an enterprise GenAI platform, and Donovan, its platform for the US public sector.
Frontier-scale human data ops
Enterprise sales, no public pricing
- data-labeling
- rlhf
- evals
- training-data
Vision · FREEMIUM · Open core
CVAT
CVAT.ai
Open-source annotation platform for vision AI datasets.
Data-labeling suite for images, video, and 3D: bounding boxes, polygons, segmentation, keypoints, and object tracking, with AI-assisted labeling via SAM and custom models through its API and SDK. Ships as the MIT-licensed Community edition to self-host, the hosted CVAT Online with free and paid plans, or a self-hosted Enterprise tier.
Boxes, polygons, keypoints, 3D, and video
Self-hosting needs Docker ops effort
- annotation
- labeling
- computer-vision
- datasets
Sales · FREEMIUM
Clay
Clay Labs, Inc.
GTM data enrichment and AI research agents in a spreadsheet-style workflow.
Clay is a go-to-market platform that enriches leads from 150+ data providers via a waterfall, runs AI research agents (Claygents) to answer custom questions about each prospect, and orchestrates the results into CRM updates and outbound sequences. Teams build the logic in a familiar table interface with API and webhook hooks.
Best-in-class waterfall match rates (~78%)
Steep learning curve (4-6 weeks)
- gtm
- lead-enrichment
- sales
- ai-agents
Data Ops · FREEMIUM
Nanonets
Nanonets
AI agents for document processing and enterprise data extraction.
Nanonets automates document-heavy workflows — invoices, orders, contracts, and claims — with AI agents that read, extract, and route structured data across ERPs, email, and approval chains. It runs on its own OCR-3 extraction model and can fold in LLMs for agentic pipelines. Offered as managed cloud with VPC, single-tenant, and on-premises deployment options and regional data residency.
Handles invoices, orders, contracts, claims
Leaderboard claims are vendor-reported
- document-ai
- idp
- ocr
- extraction
Translation · PAID
Smartling
Smartling
Enterprise AI translation and localization with optional human review.
An enterprise localization platform combining a translation management system with AI translation and an in-house translator network. Content is tiered between fully automated AI translation and AI-plus-human review based on visibility and quality needs. Sold via custom enterprise quotes with platform fees plus per-word pricing.
Enterprise TMS + AI + human network
Custom enterprise quotes, opaque pricing
- localization
- enterprise
- translation-management
- human-review
Data Ops · FREEMIUM
Apify
Apify
Full-stack web scraping and browser automation platform for AI data.
A cloud platform for web scraping, data extraction, and browser automation built around 'Actors': serverless programs that crawl sites and return structured data. Its store offers tens of thousands of ready-made Actors, and their output is clean Markdown or JSON that feeds LLMs, vector databases, and RAG pipelines via LangChain and LlamaIndex. The company also maintains the open-source Crawlee crawling library for local development.
Serverless 'Actors' scale automatically
Usage-based costs add up at scale
- web-scraping
- crawling
- automation
- rag
Data Ops · PAID
V7 Go
V7 Labs
Agentic AI that automates document-heavy knowledge work and data extraction.
An operational AI platform from V7 Labs that builds and runs agents over complex documents — extracting financial, legal, and commercial terms, completing DDQs, and generating memos with source traceability. It chains foundation models from OpenAI, Anthropic, and Google into multi-step, auditable workflows aimed at finance, insurance, legal, and real-estate teams.
Source-traceable extractions
Paid-only, enterprise pricing
- document-ai
- data-extraction
- agents
- knowledge-work
Automation · FREEMIUM
HARPA AI
HARPA AI
AI browser agent for Chrome that automates web tasks.
A browser extension that fuses large language models with on-page web automation, letting an AI agent read, understand, and act on web pages — navigating, extracting data, filling forms, and monitoring sites for changes. It connects to multiple AI providers and ships 100+ preset commands for research, SEO, and writing. Available for Chrome, Edge, Firefox, Brave, and Opera.
Real web automation, not just chat
Steep learning curve
- browser-agent
- web-automation
- chrome-extension
- scraping
Data Ops · FREE · OSS
Docling
Docling Project
Toolkit that turns documents into AI-ready Markdown and JSON.
A document-processing toolkit that converts PDF, DOCX, PPTX, XLSX, HTML, images, and audio into clean Markdown or JSON for LLM and RAG pipelines. It does advanced PDF understanding — page layout, reading order, table structure, and OCR for scans — and ships a hybrid chunker plus native LangChain and LlamaIndex integrations. Small enough to run on a laptop via a Python API or CLI; MIT-licensed and community-governed.
Runs on a laptop via Python API or CLI
Lower accuracy than top hosted parsers
- document-parsing
- rag
- open-source
- pdf
Data Ops · FREE · OSS
Crawl4AI
Crawl4AI
Open-source crawler that turns the web into clean, LLM-ready Markdown.
Crawl4AI is an open-source (Apache 2.0) web crawler and scraper built for AI pipelines, converting pages into clean Markdown or structured data for RAG, agents, and data pipelines. The core runs locally with no API key, handles JS rendering, and supports optional LLM-based extraction with any provider. It installs as a Python library/CLI or deploys as a Dockerized FastAPI server; a hosted Cloud API is in closed beta.
Core runs fully locally
You run the infra
- web-scraping
- crawling
- open-source
- markdown
Search · FREEMIUM · Open core
Jina AI
Jina AI
Search-foundation APIs — Reader, embeddings, and reranker — for grounding LLMs.
A suite of search-foundation APIs for retrieval and RAG: a Reader that turns any URL or web search into LLM-ready markdown, multilingual multimodal embeddings, and a reranker. One key spans every service, the Reader is open source, and the embedding models are also released as open weights for self-hosting.
One key spans Reader, embeddings, reranker
Acquired by Elastic (Oct 2025); roadmap may shift
- search
- embeddings
- reranker
- rag
Data Ops · FREEMIUM
Reducto
Reducto
Agentic document parsing and extraction for AI teams, via one API.
A document-intelligence API that parses, splits, extracts, and edits PDFs, images, spreadsheets, and slides into clean, structured output for RAG and AI pipelines. It blends custom in-house models with frontier ones and bills via usage credits, automatically discounting pages it can parse without the heavier pipeline.
Strong on complex/nested table layouts
API-only, no app UI
- document-parsing
- ocr
- extraction
- rag
Data Ops · FREEMIUM · Open core
Chunkr
Lumina AI
Open-source document intelligence API for RAG-ready data.
A document parsing and intelligence API that turns complex PDFs, slides, Word docs, and images into clean, LLM/RAG-ready chunks. Chunkr runs layout analysis, OCR, reading-order detection, semantic chunking, and schema-based extraction, emitting HTML, Markdown, or JSON. Self-host the open-source pipeline or call the managed cloud API, which includes a free tier of 200 pages with no card required.
Self-host or call the managed API
Accuracy below Reducto on hard layouts
- document-parsing
- ocr
- rag
- open-source
Vision · PAID
Dataloop
Dataloop
Enterprise data engine for labeling and managing unstructured AI data.
An AI-ready data platform that manages, labels, and orchestrates unstructured data — images, video, LiDAR, audio, and text — across the model lifecycle. It pairs data management and human-in-the-loop annotation with a serverless pipeline layer for pre/post-processing, RLHF, and RAG, plus a model-and-app marketplace. Originally focused on computer-vision production pipelines.
Multimodal labeling (image, video, LiDAR, audio, text)
Enterprise pricing
- data-labeling
- computer-vision
- annotation
- mlops
Data Ops · FREEMIUM
Thunderbit
Thunderbit
AI web scraper that turns any page into structured data in two clicks.
A no-code AI web scraper and automation agent that runs as a Chrome extension. It visually reads a page, suggests the fields to capture, and extracts structured rows with support for pagination, subpages, and bulk lists — exporting to Excel, Google Sheets, Airtable, or Notion. Built for lead lists, price monitoring, and research without writing selectors or code.
Two-click no-code scraping
Chrome extension only
- web-scraping
- data-extraction
- chrome-extension
- no-code
Data Ops · FREEMIUM · Open core
ScrapeGraphAI
ScrapeGraphAI
Turn any webpage into structured data with one prompt-driven API call.
ScrapeGraphAI is an AI web-scraping tool that extracts structured data from pages and documents using natural-language prompts instead of CSS selectors or XPath, orchestrating LLMs in graph-style pipelines (single-page, multi-page, search, crawl). The core library is open-source under the MIT license with Python and Node SDKs; a hosted API adds a credit-based free tier and paid plans, plus integrations with LangChain, LlamaIndex, n8n, and an MCP server.
Prompt-driven, selector-free extraction
LLM cost per extracted page
- web-scraping
- extraction
- open-source
- rag
Data Ops · FREEMIUM · Open core
Label Studio
HumanSignal
Multi-type data labeling and AI evaluation across every modality.
Widely used open-source tool for labeling and annotating data across images, text, audio, video, and time-series, with a standardized export format for training and fine-tuning. ML backends can pre-label data to speed up human review, and it increasingly doubles as a human-in-the-loop AI evaluation surface. Maintained by HumanSignal, which offers a hosted Starter tier and Label Studio Enterprise.
Covers all data modalities in one tool
Self-host setup needs DevOps maturity
- data-labeling
- open-source
- annotation
- human-in-the-loop
Data Ops · FREEMIUM · Open core
Unstructured
Unstructured
ETL for LLMs — turn PDFs, decks, and emails into clean, structured data.
Ingests 64+ file types, then partitions, chunks, enriches, and embeds them into LLM-ready output, handling OCR, tables, and document hierarchy. It comes as an open-source library plus a low-code platform and API, and is a staple preprocessing layer for production RAG.
64+ file types ingested
OSS quality trails hosted partition models
- document-etl
- preprocessing
- rag
- open-source
Data Ops · FREEMIUM · Open core
Firecrawl
Firecrawl
Turn any website into clean, LLM-ready data — scrape, crawl, search.
A web data API for AI — scrape, crawl, map, and search pages into clean markdown or structured JSON, handling proxies, anti-bot, and JS rendering for you. Open-source core (AGPL) plus a hosted service; a default web-ingestion layer for agents and RAG pipelines.
Clean markdown / structured JSON output
AGPL license constrains redistribution
- web-scraping
- crawling
- rag
- open-source
