Vision AI apps

Computer-vision platforms and APIs — detection, OCR, visual search, and multimodal understanding.

35 apps · researched & kept current by Claude Code


Vision · PAID
Bucket Robotics
Bucket Robotics
Computer-vision defect detection for manufacturing, trained from CAD.
Bucket Robotics builds computer-vision defect detection for manufacturing that trains from CAD files and synthetic data instead of hand-labeled photos. It generates simulated defects — burn marks, bumps, breaks — from the CAD that every modern part already has, producing production-ready vision models that deploy in minutes and adapt as parts and lines change. The system integrates into existing production lines without adding new hardware, and has drawn early customers in automotive and defense.
No hand-labeling — trains from CAD
Early-stage (founded 2024, small team)
- computer-vision
- manufacturing
- defect-detection
- synthetic-data
Data Ops · FREEMIUM
Pulse
Pulse AI
Production-grade extraction for complex documents.
A document-extraction platform that converts messy, real-world documents — financial statements, medical records, contracts, spreadsheets — into clean, LLM-ready structured data. Pulse runs its own OCR, layout, and vision models (including its Ultra extraction model) rather than wrapping a general-purpose LLM, and exposes the pipeline through an API that drops into existing data workflows. It offers a free sandbox to try, with enterprise tiers for scale; the company says it has processed over a billion document pages.
Purpose-built models for hard layouts
Cloud-only (no self-host)
- document-extraction
- ocr
- unstructured-data
- data-ingestion
Data Ops · FREEMIUM · Open core
Unstract
Zipstack
Turn unstructured documents into structured data.
An agentic document-processing platform that extracts clean, structured JSON from PDFs, scans, and other complex documents using LLMs. Its Prompt Studio gives a no-code IDE to author and test extraction prompts per field, which you then deploy as APIs or ETL pipelines into your warehouse. Built by Zipstack, Unstract is open source under AGPL-3.0 and self-hostable via Docker Compose, with a managed cloud that adds SSO, human-in-the-loop review, and compliance certifications (SOC 2, HIPAA, ISO 27001, GDPR).
Open-source (AGPL-3.0), self-hostable
AGPL-3.0 may deter some commercial use
- document-extraction
- unstructured-data
- etl
- rag
Vision · FREEMIUM
Memories.ai
Memories.ai
A 'visual memory' layer for AI — search and reason over huge video libraries.
Video understanding platform built around a large visual memory model. It ingests long-form and large-scale video, then supports natural-language search, transcription, clip retrieval, and content analysis over what the company describes as unlimited video context. Applied to security and surveillance review, sports analytics, media production, and robotics, with a free playground and on-device processing options.
Handles very long and large video sets
Newer, smaller track record
- video-understanding
- visual-memory
- video-search
- multimodal
Vision · FREEMIUM
T-Rex Label
Visincept (IDEA Research)
Zero-shot AI image annotation that batch-labels with visual prompts.
T-Rex Label is a browser-based image annotation tool built on the T-Rex2 open-set detection model. Point it at one example and its visual-prompt, zero-shot detection finds and labels matching objects across an entire dataset, with no training or fine-tuning required, and handles dense, occluded, and varied-lighting scenes. It exports to COCO and YOLO formats and integrates with tools like Roboflow and Labelbox.
No per-class training needed
Browser-only, no offline mode
- image-annotation
- object-detection
- zero-shot
- dataset-labeling
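The COCO and YOLO export formats mentioned for T-Rex Label differ mainly in how a bounding box is encoded: COCO stores `[x_min, y_min, width, height]` in pixels, while YOLO label files store a class index followed by a normalized center and size. The sketch below illustrates that conversion only. It is not T-Rex Label's exporter, and the function names are hypothetical.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x_min, y_min, width, height] in pixels
    to a YOLO box (x_center, y_center, width, height) normalized to 0..1."""
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / img_w,  # normalized x center
        (y_min + h / 2) / img_h,  # normalized y center
        w / img_w,                # normalized width
        h / img_h,                # normalized height
    )


def yolo_label_line(class_id, bbox, img_w, img_h):
    """Format one line of a YOLO label file (hypothetical helper)."""
    xc, yc, w, h = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```

For example, `yolo_label_line(0, [100, 50, 200, 100], 400, 200)` describes a box centered in a 400 by 200 image that covers half of each dimension.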
Data Ops · FREEMIUM · Open core
Datalab
Datalab
High-accuracy document parsing — PDFs and images to markdown, JSON, and HTML.
Datalab turns PDFs, images, and office documents into clean markdown, JSON, and HTML with layout, table, math, and code preservation. It is the commercial, hosted layer over the open-source Marker converter and Surya OCR toolkit, offered as a pay-as-you-go API with a free monthly allowance, while the underlying models stay free to self-host for research and small startups.
Pay-as-you-go API with free allowance
Hosted API metered per page
- document-parsing
- ocr
- pdf-to-markdown
- rag
Vision · PAID
Matroid
Matroid
No-code computer vision to detect anything in images and video.
Matroid is an enterprise platform for building custom computer vision detectors without writing code. Non-programmers train detectors to find objects, defects, people, events, and actions, then deploy them against any existing camera or video feed. It is widely used for industrial visual inspection and quality control, where it can flag cracks, weld defects, and assembly errors in real time.
Non-programmers train custom detectors
Enterprise sales, no public pricing
- computer-vision
- no-code
- object-detection
- manufacturing
Data Ops · FREEMIUM
Extend
Extend AI
Full-stack document processing platform for AI agents and pipelines.
Extend is an LLM-powered document processing platform that parses, extracts, classifies, splits, and edits complex documents — handwriting, tables, and mixed formats — into reliable structured data via API or its web Studio. It combines multiple frontier models with proprietary context engineering for messy real-world files. Used by teams at Brex, Square, Checkr, and Flatiron Health.
Handles handwriting, tables, mixed formats
Newer than incumbent IDP vendors
- document-processing
- ocr
- extraction
- vlm
Vision · FREEMIUM
Mixpeek
Mixpeek
Find any scene in your video and multimodal library.
Mixpeek is a multimodal retrieval API for searching across video, images, audio, and documents with natural language. It extracts and indexes structured features — faces, scenes, transcripts, OCR, and embeddings — over object storage like S3, GCS, and R2, then runs hybrid dense, sparse, and BM25 search with reranking. Cross-modal joins let a single query combine signals such as faces, spoken phrases, and on-screen text.
Searches video, image, audio, and docs
Developer/API-first, not no-code
- multimodal-search
- video-search
- retrieval
- embeddings
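Hybrid search of the kind described for Mixpeek has to merge ranked lists that come from different retrievers, for example a dense embedding index and BM25. One widely used way to do that is reciprocal rank fusion (RRF). The sketch below shows the general technique under that assumption. It is not Mixpeek's documented method, and `reciprocal_rank_fusion` and the constant `k=60` are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]  # hypothetical dense-retriever order
bm25_hits = ["b", "c", "a"]   # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

A reranker would then typically rescore only the top of the fused list, which keeps the expensive model off the long tail.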
Healthcare · PAID
Mecha Health
Mecha Health
Radiology foundation models that draft clinical reports from scans.
Mecha Health builds AI foundation models for radiology. It analyzes DICOM medical images with pixel- and voxel-level reasoning and generates complete, editable draft reports — structured findings and impressions — in seconds for radiologist review. It spans multiple imaging modalities and anatomies and integrates with PACS, delivering results over FHIR/HL7, deployable in the cloud or on-premise.
Spans multiple modalities and anatomies
Enterprise-only; no public pricing
- radiology
- medical-imaging
- healthcare
- foundation-models
Vision · PAID
Optifye
Optifye
Computer-vision monitoring of factory-floor efficiency from existing cameras.
Optifye is an AI computer-vision platform for manufacturing operations. It connects to a plant's existing IP/CCTV cameras to measure per-operator cycle times, detect bottlenecks and check standard-operating-procedure compliance in real time, then turns that into efficiency analytics and automated production reports. It targets labour-intensive lines across automotive, apparel, welding, medical and electronics manufacturing.
Detects production bottlenecks in real time
Worker-surveillance and privacy concerns
- computer-vision
- manufacturing
- operations
- monitoring
Vision · FREEMIUM
Datature
Datature
Build and deploy computer-vision models without code.
Datature is an end-to-end, no-code platform for computer vision. Its Label module provides AI-assisted, pixel-perfect annotation with multi-annotator review; Train offers drag-and-drop model building with hyperparameter tuning; and Deploy ships models to edge or cloud via API. It supports image classification, object detection, keypoint annotation, and semantic segmentation across industries from healthcare to manufacturing.
Covers label, train and deploy in one place
Pricing not transparent on the site
- computer-vision
- no-code
- annotation
- mlops
Vision · FREE · OSS
olmOCR
Allen Institute for AI
Open-source OCR that converts PDFs and scans into clean, structured text.
olmOCR is an open-source toolkit from the Allen Institute for AI that turns PDFs and document images into clean, reading-order plain text, preserving tables, equations, and handwriting. It runs a fine-tuned 7B vision-language model with a document-anchoring prompting technique, and is built for cheap, dataset-scale conversion for LLM training and retrieval. Released with model weights, training data, and inference code; runs on your own GPUs or via third-party inference providers.
Ships weights, training data, and code
Requires a capable GPU to self-host
- ocr
- open-source
- pdf
- document-parsing
Vision · PAID
Hyperscience
Hyperscience
Enterprise document processing that turns messy paperwork into structured data.
Hyperscience is an enterprise intelligent document processing (IDP) platform that reads, classifies, and extracts data from forms, invoices, and handwritten paperwork with high accuracy. It trains custom machine-learning models per document type and routes low-confidence cases to humans, targeting straight-through automation for high-volume back-office workflows. Sold to large enterprises and government agencies, with cloud, private-cloud, and air-gapped on-prem deployment options.
Routes low-confidence cases to humans
Enterprise sales, no public pricing
- document-processing
- idp
- ocr
- enterprise
Vision · FREEMIUM
VLM Run
Autonomi AI
Unified API gateway that extracts structured JSON from images, video, and documents.
VLM Run is a developer platform for visual AI that returns structured JSON from images, video, and documents through a single API. It combines specialized vision-language models with computer-vision tools for tasks like document parsing, structured OCR, object detection, and segmentation, and offers domain fine-tuning, dashboards, and flexible deployment. The platform is operated by Autonomi AI.
One API for images, video, and documents
Pro tier jumps to $799/mo
- visual-ai
- document-extraction
- vision-language-model
- ocr
Vision · FREEMIUM
Move AI
Move AI
Markerless 3D motion capture from ordinary video.
Markerless motion-capture technology that turns 2D video into broadcast-quality 3D animation data using computer vision, biomechanics, and physics. The Move One app captures motion from a single iPhone, while multi-camera setups serve studio production; output exports to FBX and USD for game engines and animation pipelines. Used by studios including Ubisoft, Sony, and Disney.
No markers, suits, or specialist hardware
Cloud processing; credit-based pricing
- motion-capture
- markerless
- 3d-animation
- mocap
Vision · FREEMIUM
Groundlight
Groundlight AI
Build reliable computer vision by asking plain-English questions about images.
Groundlight lets developers create visual detectors by describing what to look for in natural language, with no training dataset required. Its system pairs ML models with built-in 24/7 human labeling, so applications return reliable answers from day one and the models improve automatically over time. It ships a Python SDK and REST API, supports edge inference on hardware like Raspberry Pi, and powers monitoring, industrial inspection, and robotics use cases.
Natural-language visual queries
Narrower than general vision platforms
- computer-vision
- edge-ai
- human-in-the-loop
- no-code
Vision · PAID
Coactive AI
Coactive AI
Multimodal platform that makes images and video searchable and structured.
Coactive AI is an enterprise multimodal application platform that pulls context directly from the pixels and audio in images and video — no manual tagging or metadata required. Teams use it to semantically search, label, govern, and structure large visual libraries at scale, turning unstructured media into queryable data. It is aimed at media, retail, and other enterprises with vast image and video archives.
Search visual data with no tagging
Enterprise-only, no public pricing
- multimodal
- visual-search
- video-understanding
- data-labeling
Automation · PAID
AskUI
AskUI
Vision-based agents that automate any UI across desktop, mobile, and web.
AskUI builds and deploys computer-use agents that visually detect and operate on-screen elements across operating systems — desktop, mobile, web, embedded, and automotive HMIs — without relying on selectors or accessibility trees. Its AgentOS runtime and Python SDK let teams automate UI testing and workflows and route between models like Claude, Gemini, and OpenAI. It is aimed at enterprise automation and QA.
Cross-platform incl. embedded/HMI
Enterprise focus, niche audience
- computer-use
- ui-automation
- test-automation
- vision-agents
Vision · PAID
Lightly
Lightly
Computer-vision data curation, labeling, and model pretraining.
Lightly is a computer-vision data platform that helps teams curate the most informative samples from large image and video datasets using embeddings, active learning, and near-duplicate detection. Its suite spans LightlyStudio (curation and labeling), LightlyTrain (self-supervised pretraining and fine-tuning of vision models), and LightlyEdge (smart data selection on devices). The aim is to cut labeling cost by training on the data that actually improves models.
Embedding-based data curation and dedup
No public pricing; sales-led
- computer-vision
- data-curation
- active-learning
- self-supervised
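Embedding-based near-duplicate detection, as mentioned for Lightly, can be illustrated with a greedy filter: keep a sample only if its embedding is not too similar to anything already kept. The sketch below assumes cosine similarity and a hypothetical threshold of 0.95. It does not use or reproduce LightlyStudio's API, and a production system would use an approximate nearest-neighbor index rather than this quadratic loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(embeddings, threshold=0.95):
    """Return indices of samples to keep, skipping any sample whose
    similarity to an already-kept sample reaches the threshold."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D embeddings: the second is almost identical to the first.
samples = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = drop_near_duplicates(samples)
```

Lowering the threshold removes more samples as duplicates, which trades dataset diversity against labeling cost.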
Vision · PAID
Hive
Hive
Cloud AI APIs for content moderation, search, and generation.
Hive offers pre-trained deep-learning models delivered as cloud APIs for understanding, moderating, and generating visual, text, and audio content. Its core business is automated content moderation — flagging unsafe imagery, video, text, and audio at platform scale — alongside logo and object detection, AI-generated-content and deepfake detection, and reverse image search. The San Francisco company powers trust-and-safety pipelines for platforms including Reddit, Bluesky, and Midjourney. Models integrate with a few lines of code and run as managed or on-premise deployments.
Broad pre-trained moderation models
Enterprise sales, no public pricing
- content-moderation
- computer-vision
- deepfake-detection
- moderation-api
Vision · FREEMIUM
Clarifai
Clarifai
Full-stack AI platform for computer vision and LLMs.
Clarifai is a full-stack AI platform for building with unstructured image, video, text, and audio data. It pairs production computer-vision models — classification, detection, visual search — with a model hub for LLMs, plus data labeling, training of custom models, and inference, all behind one API and console. A free Community tier lets you discover and run models before moving to paid usage plans.
Mature, end-to-end vision stack
Broad platform has a learning curve
- computer-vision
- model-hub
- data-labeling
- inference
Vision · FREEMIUM · Open core
CVAT
CVAT.ai
Open-source annotation platform for vision AI datasets.
Data-labeling suite for images, video, and 3D: bounding boxes, polygons, segmentation, keypoints, and object tracking, with AI-assisted labeling via SAM and custom models through its API and SDK. Ships as the MIT-licensed Community edition to self-host, the hosted CVAT Online with free and paid plans, or a self-hosted Enterprise tier.
Boxes, polygons, keypoints, 3D, and video
Self-hosting needs Docker ops effort
- annotation
- labeling
- computer-vision
- datasets
Data Ops · FREEMIUM
Nanonets
Nanonets
AI agents for document processing and enterprise data extraction.
Nanonets automates document-heavy workflows — invoices, orders, contracts, and claims — with AI agents that read, extract, and route structured data across ERPs, email, and approval chains. It runs on its own OCR-3 extraction model and can fold in LLMs for agentic pipelines. Offered as managed cloud with VPC, single-tenant, and on-premises deployment options and regional data residency.
Handles invoices, orders, contracts, claims
Leaderboard claims are vendor-reported
- document-ai
- idp
- ocr
- extraction
Vision · FREEMIUM
Supervisely
Supervisely
All-in-one computer vision platform to curate, label, and train models.
A unified computer vision platform covering data curation, annotation, model training, and deployment across images, video, 3D point clouds, and medical imagery. AI-assisted labeling, experiment tracking, and a large catalog of installable apps make it customizable for most CV workflows. Free for researchers and small teams; Pro and self-hostable Enterprise editions for companies.
Images, video, 3D point cloud, DICOM
Broad platform has a learning curve
- computer-vision
- data-annotation
- labeling
- model-training
Vision · FREEMIUM
Mathpix
Mathpix
OCR and document conversion built for math, science, and STEM.
OCR and document-conversion tooling specialized for STEM content. Mathpix reads printed and handwritten math, chemistry, tables, and text from images and PDFs, exporting to LaTeX, DOCX, Markdown, Excel, ChemDraw, and more. It ships as the Snip app (web, mobile, desktop, browser extension) for individuals and teams, plus a Convert API for developers building solving, tutoring, and grading products.
Near-flawless math OCR on PDFs
Limited free tier for heavy users
- ocr
- document-conversion
- latex
- stem
Vision · PAID
Encord
Encord
Data platform to curate, label, and manage AI training data.
An enterprise data development platform for preparing high-quality training data across images, video, documents, audio, DICOM, and 3D point clouds. It pairs AI-assisted labeling (SAM auto-segmentation, object tracking) with data curation, model evaluation, and workflow tooling, plus LLM-powered data agents for document tasks. Used heavily in medical imaging, robotics, and other physical-AI domains.
DICOM/NIfTI/point-cloud support
Enterprise pricing, no free tier
- data-annotation
- training-data
- computer-vision
- medical-imaging
Vision · PAID
Reka Vision
Reka
Multimodal platform to search, reason over, and clip large volumes of video.
Reka Vision is an enterprise multimodal system that indexes large image and video libraries so teams can search by meaning, ask timestamp-aware questions, and auto-generate highlights and clips. It is built by Reka, a frontier multimodal-model lab, and is available via API, an MCP server, or a hosted app. Access is sales-led (request a demo).
Natural-language search over video archives
Sales-led, demo-gated access
- video-understanding
- multimodal
- visual-search
- video-clipping
Vision · FREEMIUM · Open core
Ultralytics YOLO
Ultralytics
YOLO models for real-time object detection and vision.
The open-source PyTorch framework behind the YOLO (You Only Look Once) family of vision models. One unified API covers object detection, instance and semantic segmentation, image classification, pose estimation, and oriented bounding boxes, with both a CLI and a Python interface. The 2026 flagship, YOLO26, is an end-to-end, NMS-free architecture tuned for edge and low-power deployment.
Real-time inference on edge and GPU
AGPL-3.0; closed-source commercial use needs a paid license
- object-detection
- segmentation
- yolo
- open-source
Vision · PAID
Dataloop
Dataloop
Enterprise data engine for labeling and managing unstructured AI data.
An AI-ready data platform that manages, labels, and orchestrates unstructured data — images, video, LiDAR, audio, and text — across the model lifecycle. It pairs data management and human-in-the-loop annotation with a serverless pipeline layer for pre/post-processing, RLHF, and RAG, plus a model-and-app marketplace. Originally focused on computer-vision production pipelines.
Multimodal labeling (image, video, LiDAR, audio, text)
Enterprise pricing
- data-labeling
- computer-vision
- annotation
- mlops
Vision · FREEMIUM · Open core
Moondream
M87 Labs
Tiny open vision-language model for efficient image understanding.
An open-weights family of small vision-language models for captioning, visual Q&A, pointing, counting, and object detection — small enough to run on-device (checkpoints down to 0.5B on Hugging Face). Run it locally with the Photon engine, or call Moondream Cloud's OpenAI-compatible API with a free monthly credit tier and pay-per-image pricing.
Open-weights, free to self-host
Small models trail frontier VLMs on hard tasks
- vision-language
- open-weights
- on-device
- object-detection
Vision · FREEMIUM
TwelveLabs
TwelveLabs
Video intelligence API: search, classify, and summarize video.
Video understanding platform built on its own multimodal foundation models — Marengo for embeddings and semantic search, Pegasus for generative tasks like summaries and captions. Developers index video once and run natural-language search, classification, and analysis via API. Free tier with usage-based pricing beyond it.
Marengo embeddings + Pegasus generation
Proprietary, closed models
- video-understanding
- search
- multimodal
- embeddings
Vision · FREEMIUM
LandingAI
LandingAI
Build vision detectors and agents from a few labeled examples.
Build vision applications with a labelling-light workflow — point at examples, get a deployable detector. Recently extended into vision agents that reason over images and PDFs without bespoke training.
Fast path to a deployable detector
Less control than custom model training
- visual-prompting
- agents
- document-ai
- no-code
Vision · FREEMIUM
Roboflow
Roboflow
Vision MLOps end-to-end. Annotate, train, deploy.
Annotation tooling, auto-labelling, hosted training, and edge deployment for computer-vision projects. Strong default when you're shipping a custom vision model rather than reaching for a multimodal LLM.
End-to-end vision MLOps
Free tier caps usage and limits project privacy
- annotation
- training
- deployment
- edge
Vision · FREEMIUM · Open core
Voxel51
Voxel51
FiftyOne — open-source vision data platform.
A toolkit for exploring, debugging, and curating vision datasets. Strong story for finding model failure modes, balancing classes, and tracking experiment drift across visual data at scale.
Open-source FiftyOne core
Vision-only focus
- open-source
- datasets
- evaluation
- python