AI Invoice Processing and ERP Integration Platform

Business outcome. Accounts-payable and ops teams stop keying invoices into ERPs by hand. Structured extraction lifts vendor, totals, line items, and tax fields straight from the PDF and validates them against a Zod schema, so a half-parsed row never reaches the system of record. Procurement, finance, and audit can search the same archive by content and by amount without a separate database query.

Built for accounts-payable and operations teams that process a steady volume of supplier invoices and need to query them by both content and amount ("unpaid invoices over €10k", "contracts mentioning indemnity") without manual data entry sitting between the PDF inbox and the ERP.

Pipeline

[Pipeline diagram (Mermaid flowchart) not reproduced.]

What's built

Private upload. Files land in a private Vercel Blob store and are served through an authenticated proxy; blob URLs are never exposed.
Zod-typed extraction. GPT-4o-mini pulls invoice number, vendor, dates, subtotal, tax rate/amount, total, and line items against a Zod schema. The same schema is the type written to Postgres, so there are no "partially extracted" rows.
Hybrid search router. A single search box at /api/search inspects the query: numeric tokens (currency, percentages, comparisons) → SQL range query against total_amount / tax_rate; everything else → Pinecone top-5.
Retry & delete. Failed extractions can be re-run in place; deletes cascade across blob, vectors, and the DB row in one server action.
Sample data. Homepage offers a ZIP of three randomly-generated invoices (pdfkit) so visitors can try the pipeline without their own files.
Index reuse. 1024-dim embeddings configured to fit the same Pinecone Serverless cosine-1024 index used by the support chatbot. One vector store, two products.
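
The numeric-token routing described for the hybrid search router could look roughly like the sketch below. It is a minimal illustration, not the project's code: the `Route` type, the `routeQuery` name, the regexes, and the rule that a percentage maps to `tax_rate` while any other amount maps to `total_amount` are all assumptions. Only the two column names and the top-5 vector fallback come from the description above.

```typescript
// Hypothetical sketch of numeric-token routing; names and regexes are assumptions.
type Op = ">" | "<" | "=";
type Route =
  | { kind: "sql"; column: "total_amount" | "tax_rate"; op: Op; value: number }
  | { kind: "vector"; topK: number; text: string };

// A number with an optional currency symbol and an optional "k" or "%" suffix,
// e.g. "€10k", "$1,250.50", "20%".
const NUMBER = /([€$£])?\s*(\d[\d,]*(?:\.\d+)?)\s*(k\b|%)?/i;

// Comparison words or symbols and the SQL operator they imply.
const COMPARATORS: Array<[RegExp, ">" | "<"]> = [
  [/\b(over|above|more than|greater than)\b|>/i, ">"],
  [/\b(under|below|less than)\b|</i, "<"],
];

function routeQuery(query: string): Route {
  const text = query.trim();
  const m = text.match(NUMBER);
  const cmp = COMPARATORS.find(([re]) => re.test(text));
  const isPercent = m?.[3] === "%";

  // A bare number is not enough: require a currency symbol, a percent sign,
  // or a comparison before treating the query as numeric.
  const numeric =
    m !== null && (m[1] !== undefined || isPercent || cmp !== undefined);
  if (!m || !numeric) return { kind: "vector", topK: 5, text };

  const scale = m[3]?.toLowerCase() === "k" ? 1000 : 1;
  const value = parseFloat((m[2] ?? "").replace(/,/g, "")) * scale;
  return {
    kind: "sql",
    column: isPercent ? "tax_rate" : "total_amount",
    op: cmp ? cmp[1] : "=",
    value,
  };
}
```

Under these assumptions, `routeQuery("contracts mentioning indemnity")` takes the vector branch, while `routeQuery("Acme invoices over €10k")` takes the SQL branch and drops the word "Acme", which is the single-branch limitation listed under Tradeoffs.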

Tradeoffs

Zod-enforced all-or-nothing rows. Failed extractions are visibly retryable; nothing partial reaches the DB to contaminate downstream queries. The cost is occasional re-runs on edge-case PDFs instead of a half-row that "mostly works".
Numeric-token routing for the search box. Predictable and explainable, but a mixed query like "Acme invoices over €10k" falls to one branch only, so part of the query is ignored. Solvable later with a parser that splits the query, but not worth the complexity on day one.
Pinecone index shared with the chatbot. One vector store serving two products saves operational overhead and a per-product index cost, at the price of coupling their deployment lifecycle. Worth it for a portfolio; revisit if the products diverge.
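
The all-or-nothing rule from the first tradeoff can be illustrated with a small sketch. To stay dependency-free it uses a hand-rolled validator as a stand-in for the Zod schema the project actually relies on. The field names, the `parseInvoice` and `persist` helpers, and the in-memory `db` array are hypothetical, and only a subset of the extracted fields is shown.

```typescript
// Dependency-free stand-in for the Zod schema; all names here are assumptions.
interface Invoice {
  invoiceNumber: string;
  vendor: string;
  subtotal: number;
  taxAmount: number;
  total: number;
}

type ParseResult =
  | { ok: true; invoice: Invoice }
  | { ok: false; errors: string[] };

function parseInvoice(raw: unknown): ParseResult {
  const errors: string[] = [];
  const r = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;

  // Each helper records an error instead of throwing, so every problem is reported at once.
  const str = (key: string): string => {
    const v = r[key];
    if (typeof v === "string" && v.trim() !== "") return v;
    errors.push(`${key}: expected non-empty string`);
    return "";
  };
  const num = (key: string): number => {
    const v = r[key];
    if (typeof v === "number" && Number.isFinite(v) && v >= 0) return v;
    errors.push(`${key}: expected non-negative number`);
    return 0;
  };

  const invoice: Invoice = {
    invoiceNumber: str("invoiceNumber"),
    vendor: str("vendor"),
    subtotal: num("subtotal"),
    taxAmount: num("taxAmount"),
    total: num("total"),
  };

  // All-or-nothing: a single failed field rejects the whole row.
  return errors.length === 0 ? { ok: true, invoice } : { ok: false, errors };
}

// Hypothetical persistence step: only a fully valid invoice is ever written.
function persist(raw: unknown, db: Invoice[]): ParseResult {
  const result = parseInvoice(raw);
  if (result.ok) db.push(result.invoice);
  return result;
}
```

A payload missing `total` comes back as `{ ok: false, errors: [...] }` and leaves `db` untouched, which is what makes the failure visibly retryable instead of a half-row. With Zod the same shape would typically come from the schema's `safeParse` result.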

Theuy Limpanont

Automates PDF invoice intake, validates extracted fields against a strict schema, and exposes one search box that routes both content queries ("contracts mentioning indemnity") and numeric range queries ("unpaid invoices over €10k") to the right store.
