Data Ops · Firecrawl
Turn any website into clean, LLM-ready data — scrape, crawl, search.
A web data API for AI — scrape, crawl, map, and search pages into clean markdown or structured JSON, handling proxies, anti-bot, and JS rendering for you. Open-source core (AGPL) plus a hosted service; a default web-ingestion layer for agents and RAG pipelines.
Related in Data Ops
Apify
Full-stack web scraping and browser automation platform for AI data.
A cloud platform for web scraping, data extraction, and browser automation built around 'Actors': serverless programs that crawl sites and return structured data. Its store offers tens of thousands of ready-made Actors, which output clean Markdown or JSON that feeds LLMs, vector databases, and RAG pipelines via LangChain and LlamaIndex. The company also maintains the open-source Crawlee crawling library for local development.
AI insight: Maintains the open-source Crawlee library, but the platform itself is a hosted marketplace of thousands of serverless scraping 'Actors'.
Crawl4AI
Open-source crawler that turns the web into clean, LLM-ready Markdown.
Crawl4AI is an open-source (Apache 2.0) web crawler and scraper built for AI pipelines, converting pages into clean Markdown or structured data for RAG, agents, and data pipelines. The core runs locally with no API key, handles JS rendering, and supports optional LLM-based extraction with any provider. It installs as a Python library/CLI or deploys as a Dockerized FastAPI server; a hosted Cloud API is in closed beta.
AI insight: Apache-2.0, self-host-first crawler needing no API key for its core — among GitHub's most-starred (68k+) web-to-Markdown tools for LLMs.
Docling Project
Open-source toolkit that turns documents into AI-ready Markdown and JSON.
A document-processing toolkit that converts PDF, DOCX, PPTX, XLSX, HTML, images, and audio into clean Markdown or JSON for LLM and RAG pipelines. It performs advanced PDF understanding (page layout, reading order, table structure, and OCR for scans) and ships a hybrid chunker plus native LangChain and LlamaIndex integrations. Small enough to run on a laptop via a Python API or CLI; MIT-licensed and community-governed.
AI insight: Started at IBM Research, now an LF AI & Data project; its parser preserves page layout, reading order, and table structure, not just text.
Reducto
Agentic document parsing and extraction for AI teams, via one API.
A document-intelligence API that parses, splits, extracts, and edits PDFs, images, spreadsheets, and slides into clean, structured output for RAG and AI pipelines. It blends custom in-house models with frontier ones and bills via usage credits, automatically discounting pages it can parse without the heavier pipeline.
AI insight: Bills by page complexity, not a flat per-page rate — it auto-discounts simple pages so you don't overpay for an easy PDF.
ScrapeGraphAI
Turn any webpage into structured data with one prompt-driven API call.
ScrapeGraphAI is an AI web-scraping tool that extracts structured data from pages and documents using natural-language prompts instead of CSS selectors or XPath, orchestrating LLMs in graph-style pipelines (single-page, multi-page, search, crawl). The core library is open-source under the MIT license with Python and Node SDKs; a hosted API adds a credit-based free tier and paid plans, plus integrations with LangChain, LlamaIndex, n8n, and an MCP server.
AI insight: Swaps CSS selectors for LLM graph pipelines: describe the data in plain English, and the MIT core runs on any provider or local Ollama.
Julius AI
Chat with your data — an AI data analyst for CSVs, sheets, and DBs.
An AI data analyst that lets you upload CSVs, Excel files, and Google Sheets, then ask questions in plain language to clean, analyze, visualize, and model your data. It writes and runs Python behind the scenes and can generate charts, slides, and reports from the results. Pro plans add direct connectors to live databases like Snowflake, BigQuery, and Postgres.
AI insight: Writes and runs Python under the hood, and its Pro tier connects directly to live Snowflake, BigQuery, and Postgres databases.
HumanSignal
Open-source multi-type data labeling and AI evaluation.
Label Studio is a widely used open-source tool for labeling and annotating data across images, text, audio, video, and time-series, with a standardized export format for training and fine-tuning. ML backends can pre-label data to speed up human review, and it increasingly doubles as a human-in-the-loop AI evaluation surface. Maintained by HumanSignal, which offers a hosted Starter tier and Label Studio Enterprise.
AI insight: One UI labels every modality — image, text, audio, video, time-series — and ML backends can pre-annotate so humans correct, not start cold.
Unstructured
ETL for LLMs — turn PDFs, decks, and emails into clean, structured data.
Ingests 64+ file types and partitions, chunks, enriches, and embeds them into LLM-ready output, handling OCR, tables, and document hierarchy. An open-source library plus a low-code platform and API; a staple preprocessing layer for production RAG.
AI insight: Handles the unglamorous pre-RAG step — OCR, tables, and document hierarchy across 64+ file types — that makes or breaks retrieval.