Portfolio demo · May 11, 2026

AI-Powered Time-Series Image Anomaly Detection Platform

Inspection teams upload chronological image sequences, write a plain-English detection rule, and an open-vocabulary vision model (Grounding DINO with a gpt-4o-mini fallback) pins anomalies for human review. Annotation canvas, accept/reject workflow, and a Jobs page surface every detection run with phase-level progress.
Next.js 16
React 19
MongoDB Atlas
Replicate
Grounding DINO
OpenAI gpt-4o-mini
sharp
react-konva
Vercel Blob (private)
Auth.js v5
Recharts

Business outcome. Inspection teams in solar, manufacturing, healthcare, and agriculture cut visual review time while keeping humans in the loop on every call. The vision model proposes; the reviewer confirms or corrects with a single click; corrections feed the next run. No labelled training set required to start — a one-sentence rule is enough — and every detection lands with a timestamp, an audit trail, and an exportable report for compliance.

Built for inspection-heavy teams with piles of imagery (solar arrays, weld stations, chest X-rays, ag-tech flyovers) that want a reviewable AI pass over every image before a human signs off. The platform's central design choice is rule-first detection: every project ships with a plain-English detection rule, and an open-vocabulary vision model localizes anomalies that match the rule. Outputs auto-flow into the annotate canvas as machine suggestions, where the reviewer accepts or overrides each one in two clicks.

Pipeline

(Interactive pipeline diagram: scroll to zoom, click-drag to pan, double-click to reset.)

What's built

  • Description-driven detection. Each project carries a plain-English rule ("darkened or cracked solar cells, soot streaks, broken cell strings"). Detection routes through lib/ai/vision-judge.ts:judgeImageAgainstRule, which prefers Grounding DINO via Replicate (pixel-precise open-vocabulary localization) and falls back to gpt-4o-mini in JSON mode. With no API keys configured, a deterministic FNV-style stub still produces regions so the demo runs end-to-end on a fresh clone. (A routing sketch follows this list.)
  • Few-shot from confirmed annotations. Before each gpt-4o-mini call the detect route fetches the 3 most recent human / human-correction anomaly annotations, crops them via sharp().rotate().extract(), and attaches them to the prompt as "Confirmed anomaly examples from this project (use these as visual reference)". The model then sees both the rule and concrete pixel evidence of what the project considers an anomaly. This step is skipped on the Grounding DINO path, which takes only a text prompt and whose pixel precision makes few-shot examples less necessary. (The crop helper is sketched after this list.)
  • AI annotations + accept/reject loop. Each detection writes its regions as source: "ai" annotations on the Konva canvas, rendered as a coloured dot with a small label (not a bounding box — vision models eyeball coordinates and a misleading rectangle reads worse than a marker). Reviewers click ✓ to promote a suggestion to ground truth (source: "human-correction") so the next detect run can't overwrite it, or 🗑 to reject, or draw a new box to replace it. Re-running detect clears stale AI suggestions on that image but never touches human-authored annotations.
  • EXIF-aware coordinate frame. Browsers auto-apply EXIF orientation when reading uploaded images, so the natural width/height stored at upload time is post-rotation. Sharp does not auto-rotate by default — without an explicit .rotate() call, extract() cuts from the raw byte frame and crops drift on every phone/drone JPEG. The cropToDataUrl helper applies .rotate() before metadata + extract, and orientedImageForKey does the same for the target image passed to the vision model. All coordinate systems agree.
  • Pipeline Jobs page. Every long-running operation writes a PipelineRun row with named phases that are upserted as state transitions happen (the upsert pattern is sketched after this list). Training runs show six phases (prepare regions → embed annotations → split → fit centroid → evaluate → save) and detection runs show seven (queued → reserve → load few-shot → call vision → clear stale → save detection → write annotations), each with structured meta (counts, identifiers, backend, region count, confidence, reasoning). The Jobs page polls every 1.5s while any run is active and stops automatically once nothing is running.
  • Streaming step-by-step training. The train route returns application/x-ndjson — one JSON event per line — and the panel renders a live pipeline checklist (a route sketch follows this list). Embedding is parallelized 5-way against OpenAI, cutting wall time ~5× on a typical 25-annotation run. A failed phase turns red and the error sits inline.
  • Per-annotation split fallback. Sequence-aware 60/20/20 split (clean, no temporal leakage) for projects with ≥ 3 sequences. Falls back to a shuffled per-annotation split for sparse projects so metrics aren't stuck at 0%; the model card flags the fallback with an amber chip + warning so the optimism is visible at a glance.
  • Detection lifecycle status. Every Detection carries status: "pending" | "completed" | "failed". The detect route reserves the row before the vision call so other tabs see the run in flight; clients fan multi-image batches out 3-wide with live status badges on each image (the fan-out is sketched after this list). Failed calls render red with the error text, not a half-empty bbox.
  • Carousel sequence viewer. Fixed-size stage (560 × 360, centered) with prev/next overlay buttons, keyboard navigation, dot indicators for sequences under 12 frames, and a scrolling thumbnail strip with the active frame auto-centred for longer sequences. Deterministic image ordering — (sequenceId, capturedAt, _id) — so the Annotate and Detect sidebars line up with the gallery exactly.
  • Hydration-safe dates everywhere. All client-component date strings go through lib/utils/date.ts (UTC, no locale dependency), so there are no more SSR/client mismatch warnings on the Annotate, Detect, or Jobs pages. (A formatter sketch follows this list.)
  • JSON / CSV / PDF export. A reports route emits the full detection log in three formats; pdfkit renders a one-page summary with model metadata and the top 200 detection rows for sharing with stakeholders.
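
A few of the mechanics above, sketched under stated assumptions. First, the backend routing behind description-driven detection: a minimal sketch that assumes provider wrappers (detectWithGroundingDino, detectWithGpt4oMini) which are not shown here; those names, the Region shape, and the FNV-style stub are illustrative, not the actual exports of lib/ai/vision-judge.ts.

```ts
// Backend routing sketch: prefer Grounding DINO when a Replicate token is set,
// fall back to gpt-4o-mini in JSON mode, and use a deterministic stub with no keys.
type Region = { x: number; y: number; width: number; height: number; label: string; confidence: number };

interface JudgeResult {
  backend: "grounding-dino" | "gpt-4o-mini" | "stub";
  matched: boolean;
  regions: Region[];
  reasoning?: string;
}

// Assumed provider wrappers (implementations not shown).
declare function detectWithGroundingDino(imageUrl: string, rule: string): Promise<JudgeResult>;
declare function detectWithGpt4oMini(imageUrl: string, rule: string, fewShotCrops: string[]): Promise<JudgeResult>;

// FNV-1a-style hash so the keyless demo path yields stable pseudo-random regions.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

function stubRegions(imageUrl: string, rule: string): JudgeResult {
  const h = fnv1a(`${imageUrl}::${rule}`);
  const x = 0.1 + ((h % 1000) / 1000) * 0.6;            // normalized 0..1 coordinates
  const y = 0.1 + (((h >>> 10) % 1000) / 1000) * 0.6;
  return {
    backend: "stub",
    matched: true,
    regions: [{ x, y, width: 0.15, height: 0.15, label: rule.slice(0, 40), confidence: 0.5 }],
  };
}

export async function judgeImageAgainstRule(
  imageUrl: string,
  rule: string,
  fewShotCrops: string[] = []
): Promise<JudgeResult> {
  if (process.env.REPLICATE_API_TOKEN) {
    // Pixel-precise open-vocabulary boxes; few-shot crops are skipped on this path.
    return detectWithGroundingDino(imageUrl, rule);
  }
  if (process.env.OPENAI_API_KEY) {
    // JSON-mode VLM call; confirmed-annotation crops ride along as visual examples.
    return detectWithGpt4oMini(imageUrl, rule, fewShotCrops);
  }
  // No keys at all: deterministic stub so a fresh clone still runs end-to-end.
  return stubRegions(imageUrl, rule);
}
```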
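
The few-shot crops and the EXIF fix come down to the same sharp pattern: materialize the EXIF-rotated bytes before measuring or extracting, so every coordinate lives in the frame the browser showed the annotator. A minimal cropToDataUrl-style sketch; the signature and region shape are assumptions, not the project's actual helper.

```ts
import sharp from "sharp";

// Region in pixels, drawn in the *oriented* frame (what the browser displayed),
// matching the natural width/height recorded at upload time.
type PixelRegion = { left: number; top: number; width: number; height: number };

export async function cropToDataUrl(imageBytes: Buffer, region: PixelRegion): Promise<string> {
  // .rotate() with no arguments applies the EXIF orientation tag. Re-encoding
  // first means metadata() and extract() both see post-rotation dimensions
  // (sharp's metadata() otherwise reports the raw, pre-rotation frame).
  const orientedBytes = await sharp(imageBytes).rotate().toBuffer();
  const oriented = sharp(orientedBytes);
  const { width = 0, height = 0 } = await oriented.metadata();

  // Clamp to the oriented bounds so a box drawn at the image edge never throws.
  const left = Math.max(0, Math.min(Math.round(region.left), width - 1));
  const top = Math.max(0, Math.min(Math.round(region.top), height - 1));
  const cropWidth = Math.max(1, Math.min(Math.round(region.width), width - left));
  const cropHeight = Math.max(1, Math.min(Math.round(region.height), height - top));

  const cropped = await oriented
    .extract({ left, top, width: cropWidth, height: cropHeight })
    .jpeg({ quality: 85 })
    .toBuffer();

  return `data:image/jpeg;base64,${cropped.toString("base64")}`;
}
```

Re-encoding the rotated bytes is the ~50ms per-detection overhead noted under Tradeoffs; it buys a single coordinate frame shared by the upload metadata, the Konva canvas, and the vision model.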
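
The Jobs page's phase tracking reduces to one upsert per state transition, assuming a pipelineRuns collection whose documents carry a phases array; the collection and field names here are illustrative, not the real schema.

```ts
import { MongoClient, ObjectId } from "mongodb";

type PhaseStatus = "pending" | "running" | "done" | "failed";

export async function upsertPhase(
  client: MongoClient,
  runId: ObjectId,
  name: string,
  status: PhaseStatus,
  meta: Record<string, unknown> = {}
): Promise<void> {
  const runs = client.db().collection("pipelineRuns");

  // Try to update an existing phase entry with this name...
  const res = await runs.updateOne(
    { _id: runId, "phases.name": name },
    { $set: { "phases.$.status": status, "phases.$.meta": meta, "phases.$.updatedAt": new Date() } }
  );

  // ...and append it if this is the first transition for that phase.
  if (res.matchedCount === 0) {
    await runs.updateOne(
      { _id: runId },
      { $push: { phases: { name, status, meta, updatedAt: new Date() } } }
    );
  }
}
```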
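
The streaming train route amounts to a Next.js route handler that writes one JSON line per phase to a ReadableStream, with the embedding phase fanned out five ways. Phase names, the event shape, and the helpers (prepareRegions, embedChunk) are illustrative assumptions rather than the real route.

```ts
// Sketch of an app-router train route that streams one NDJSON event per phase.
type AnnotatedRegion = { id: string };

// Assumed helpers standing in for the real pipeline steps.
declare function prepareRegions(): Promise<AnnotatedRegion[]>;
declare function embedChunk(regions: AnnotatedRegion[]): Promise<number[][]>;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

export async function POST(_req: Request): Promise<Response> {
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      // One JSON object per line: the panel reads line-by-line and ticks the checklist.
      const emit = (event: Record<string, unknown>) =>
        controller.enqueue(encoder.encode(JSON.stringify(event) + "\n"));

      try {
        emit({ phase: "prepare-regions", status: "running" });
        const regions = await prepareRegions();
        emit({ phase: "prepare-regions", status: "done", count: regions.length });

        emit({ phase: "embed-annotations", status: "running" });
        // 5-way fan-out: split the regions into at most five chunks and embed them concurrently.
        const parts = chunk(regions, Math.max(1, Math.ceil(regions.length / 5)));
        const embeddings = (await Promise.all(parts.map(embedChunk))).flat();
        emit({ phase: "embed-annotations", status: "done", count: embeddings.length });

        // ...split, fit centroid, evaluate, and save follow the same emit pattern.
        emit({ phase: "save", status: "done" });
      } catch (err) {
        emit({ phase: "error", status: "failed", error: String(err) });
      } finally {
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "application/x-ndjson; charset=utf-8" },
  });
}
```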
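
The 3-wide detect fan-out on the client is a small worker pool over the image batch; runDetect and onStatus below are stand-ins for the real fetch call and status setter.

```ts
// Run detect calls at most `limit` at a time, updating a per-image status as each settles.
type Status = "pending" | "completed" | "failed";

export async function detectBatch(
  imageIds: string[],
  runDetect: (imageId: string) => Promise<void>,
  onStatus: (imageId: string, status: Status) => void,
  limit = 3
): Promise<void> {
  let next = 0;
  const worker = async () => {
    while (next < imageIds.length) {
      const id = imageIds[next++];
      onStatus(id, "pending");
      try {
        await runDetect(id);
        onStatus(id, "completed");
      } catch {
        onStatus(id, "failed");
      }
    }
  };
  // Three workers pull from the shared queue; each keeps going until the batch drains.
  await Promise.all(Array.from({ length: Math.min(limit, imageIds.length) }, worker));
}
```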
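
And the hydration-safe dates reduce to formatting from UTC parts only, so the server-rendered and client-rendered strings can never diverge. A sketch in the spirit of lib/utils/date.ts; the exact output format is an assumption.

```ts
// Format from UTC parts so SSR and client output are identical in every locale/timezone.
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

export function formatDateUTC(value: string | number | Date): string {
  const d = new Date(value);
  const day = String(d.getUTCDate()).padStart(2, "0");
  return `${day} ${MONTHS[d.getUTCMonth()]} ${d.getUTCFullYear()}`;
}

export function formatDateTimeUTC(value: string | number | Date): string {
  const d = new Date(value);
  const hh = String(d.getUTCHours()).padStart(2, "0");
  const mm = String(d.getUTCMinutes()).padStart(2, "0");
  return `${formatDateUTC(d)} ${hh}:${mm} UTC`;
}
```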

Tradeoffs

  • Rule-driven, not learned end-to-end. A plain-English rule is fast to set up, transparent in the UI, and works on a single uploaded image with zero training. The cost: rule wording matters, and recall on dataset-specific patterns depends on how well the user articulates "what counts as anomalous". The few-shot crops mitigate this — confirmed annotations teach the model in-context without retraining anything.
  • Grounding DINO over a generic VLM for localization. Replicate-hosted Grounding DINO returns pixel-precise boxes because that's what it's trained for. gpt-4o-mini is great at deciding whether a rule matches but eyeballs coordinates badly. The route prefers GD whenever REPLICATE_API_TOKEN is set and falls back gracefully — a Replicate cold-start adds ~10–30s, subsequent calls inside a warm window are sub-second. Cost is ~$0.005/image.
  • Markers over bounding boxes in the UI. Even with Grounding DINO, the demo renders AI results as a pin + label rather than a rectangle. A box implies pixel-precision the model can't always deliver; a marker reads as "look here" honestly. Human-drawn shapes still render as the box / polygon the user actually drew.
  • Centroid pipeline kept for metrics. The original caption-then-embed + centroid pipeline still runs at train time and produces precision / recall / F1 against held-out annotations. It also serves as the detection backend when a project has no rule. So the demo carries two backends; the schema and Jobs page handle both modes uniformly.
  • EXIF normalization on every detect call. Decoding through sharp + re-encoding the rotated bytes is a ~50ms overhead per detection. Worth it: without it, every phone/drone JPEG with an EXIF orientation tag silently misplaces few-shot crops, and the model's accuracy collapses for reasons that don't show up anywhere except by squinting at the dots.
  • Vercel Blob private store. Uploads use access: "private"; reads go through the SDK's authenticated get() in a server-side proxy so URLs never leak. Adds one hop per image render in exchange for not having to think about signed-URL expiry — and it's the only way to keep medical/industrial source imagery genuinely private on Vercel infra.
  • Per-annotation split fallback over 0% metrics. Strict sequence-aware splits prevent temporal leakage but produce nothing to evaluate on small projects with one sequence. Falling back to a shuffled per-annotation split surfaces meaningful numbers earlier, at the cost of mild optimism — the UI says so explicitly with an amber chip. (The split logic is sketched below.)
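
A sketch of the split-with-fallback logic referenced above: whole-sequence 60/20/20 buckets when the project has at least three sequences, otherwise a shuffled per-annotation split flagged so the model card can show its amber warning. Types and field names are illustrative.

```ts
type Annotation = { id: string; sequenceId: string };
type Split = { train: Annotation[]; val: Annotation[]; test: Annotation[]; fallback: boolean };

export function splitAnnotations(annotations: Annotation[], rng: () => number = Math.random): Split {
  const sequences = [...new Set(annotations.map((a) => a.sequenceId))];

  if (sequences.length >= 3) {
    // Assign whole sequences to buckets so there is no temporal leakage across the split.
    const order = shuffle(sequences, rng);
    const trainEnd = Math.max(1, Math.floor(order.length * 0.6));
    const valEnd = Math.max(trainEnd + 1, Math.floor(order.length * 0.8));
    const bucket = (s: string) =>
      order.indexOf(s) < trainEnd ? "train" : order.indexOf(s) < valEnd ? "val" : "test";
    return {
      train: annotations.filter((a) => bucket(a.sequenceId) === "train"),
      val: annotations.filter((a) => bucket(a.sequenceId) === "val"),
      test: annotations.filter((a) => bucket(a.sequenceId) === "test"),
      fallback: false,
    };
  }

  // Sparse project: shuffle individual annotations so every bucket gets something to evaluate.
  const shuffled = shuffle(annotations, rng);
  const trainEnd = Math.floor(shuffled.length * 0.6);
  const valEnd = Math.floor(shuffled.length * 0.8);
  return {
    train: shuffled.slice(0, trainEnd),
    val: shuffled.slice(trainEnd, valEnd),
    test: shuffled.slice(valEnd),
    fallback: true,
  };
}

// Fisher-Yates shuffle with an injectable RNG so tests can make the split deterministic.
function shuffle<T>(items: T[], rng: () => number): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```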

Live demo →