Business outcome. Accounts-payable and ops teams stop keying invoices into ERPs by hand. Structured extraction lifts vendor, totals, line items, and tax fields straight from the PDF, validated against a Zod schema so a half-parsed row never reaches the system of record. Procurement, finance, and audit can search the same archive by content and by amount without a separate database query.
Built for accounts-payable and operations teams that process a steady volume of supplier invoices and need to query them by both content and amount (for example "unpaid invoices over €10k" or "contracts mentioning indemnity"), without manual data entry sitting between the PDF inbox and the ERP.
Pipeline
What's built
- Private upload. Files land in private Vercel Blob and are served through an authenticated proxy. Blob URLs are never exposed.
- Zod-typed extraction. GPT-4o-mini pulls invoice number, vendor, dates, subtotal, tax rate/amount, total, and line items against a Zod schema. The same schema is the type written to Postgres, so there are no "partially extracted" rows.
- Hybrid search router. A single search box at /api/search inspects the query: numeric tokens (currency, percentages, comparisons) → SQL range query against total_amount/tax_rate; everything else → Pinecone top-5.
- Retry & delete. Failed extractions can be re-run in place; deletes cascade across blob, vectors, and the DB row in one server action.
- Sample data. Homepage offers a ZIP of three randomly-generated invoices (pdfkit) so visitors can try the pipeline without their own files.
- Index reuse. 1024-dim embeddings configured to fit the same Pinecone Serverless cosine-1024 index used by the support chatbot. One vector store, two products.
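To make the "Zod-typed extraction" item concrete, here is a minimal sketch of what such a schema and its all-or-nothing gate might look like. It is an illustration, not the project's actual code: only total_amount and tax_rate are named in this write-up, so every other field name, the YYYY-MM-DD date format, the reading of tax_rate as a percentage, and the parseExtraction helper are assumptions.

```typescript
import { z } from "zod";

// Hypothetical field names: only total_amount and tax_rate appear in the write-up.
const isoDate = z.string().regex(/^\d{4}-\d{2}-\d{2}$/, "expected YYYY-MM-DD");

const LineItemSchema = z.object({
  description: z.string().min(1),
  quantity: z.number().positive(),
  unit_price: z.number().nonnegative(),
});

export const InvoiceSchema = z.object({
  invoice_number: z.string().min(1),
  vendor: z.string().min(1),
  issue_date: isoDate,
  due_date: isoDate.optional(),
  subtotal: z.number().nonnegative(),
  tax_rate: z.number().min(0).max(100), // assumed to be a percentage
  tax_amount: z.number().nonnegative(),
  total_amount: z.number().nonnegative(),
  line_items: z.array(LineItemSchema).min(1),
});

// The row type is derived from the schema, so the two cannot drift apart.
export type Invoice = z.infer<typeof InvoiceSchema>;

export type ParseResult =
  | { ok: true; invoice: Invoice }
  | { ok: false; issues: string[] };

// Gate between the model output and the database: either a complete,
// typed invoice comes back, or a list of issues and nothing to insert.
export function parseExtraction(raw: unknown): ParseResult {
  const result = InvoiceSchema.safeParse(raw);
  if (result.success) {
    return { ok: true, invoice: result.data };
  }
  return {
    ok: false,
    issues: result.error.issues.map(
      (issue) => `${issue.path.join(".")}: ${issue.message}`,
    ),
  };
}
```

Under this sketch, a caller would write to Postgres only on the ok branch and mark the upload as retryable otherwise, which is one way to get the "no partially extracted rows" behaviour described above.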
Tradeoffs
- Zod-enforced all-or-nothing rows. Failed extractions are visibly retryable; nothing partial reaches the DB to contaminate downstream queries. The cost is occasional re-runs on edge-case PDFs instead of a half-row that "mostly works".
- Numeric-token routing for the search box. Predictable and explainable, but mixed queries like "Acme invoices over €10k" fall to one branch only. Solvable later with a parser that splits the query, but not worth the complexity on day one.
- Pinecone index shared with the chatbot. One vector store serving two products saves operational overhead and a per-product index cost, at the price of coupling their deployment lifecycle. Worth it for a portfolio; revisit if the products diverge.
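The numeric-token routing tradeoff is easier to see in code. The following is a rough sketch of one way such a router could work, not the project's implementation: the regular expressions, the rule that a bare number without a currency symbol, "k" suffix, percent sign, or comparison word stays on the vector branch, the invoices table name, and the routeQuery and toSql helpers are all assumptions made for illustration.

```typescript
type SqlRoute = {
  kind: "sql";
  column: "total_amount" | "tax_rate";
  op: ">" | "<" | "=";
  value: number;
};
type VectorRoute = { kind: "vector"; query: string; topK: number };
export type Route = SqlRoute | VectorRoute;

// Optional currency symbol, a number with optional thousands separators
// and decimals, optional "k" suffix, optional percent sign.
const NUMERIC_TOKEN = /([€$£])?\s*(\d+(?:,\d{3})*(?:\.\d+)?)(k\b)?\s*(%)?/i;

function detectOp(query: string): ">" | "<" | "=" {
  if (/\b(over|above|more than|greater than)\b|>/i.test(query)) return ">";
  if (/\b(under|below|less than)\b|</i.test(query)) return "<";
  return "=";
}

export function routeQuery(query: string): Route {
  const m = NUMERIC_TOKEN.exec(query);
  const op = detectOp(query);
  if (m) {
    const currency = m[1];
    const digits = m[2] ?? "0";
    const kilo = m[3];
    const percent = m[4];
    // A bare number (say, part of an invoice ID) is not enough on its own.
    const qualified = Boolean(currency || kilo || percent) || op !== "=";
    if (qualified) {
      const base = parseFloat(digits.replace(/,/g, ""));
      return {
        kind: "sql",
        column: percent ? "tax_rate" : "total_amount",
        op,
        value: kilo ? base * 1000 : base,
      };
    }
  }
  // Everything else goes to semantic search (top-5, as in the write-up).
  return { kind: "vector", query, topK: 5 };
}

// Column and operator come from closed unions, so interpolating them is
// safe; the value is always passed as a bound parameter.
export function toSql(route: SqlRoute): { text: string; values: number[] } {
  return {
    text: `SELECT * FROM invoices WHERE ${route.column} ${route.op} $1`,
    values: [route.value],
  };
}
```

With this sketch, routeQuery("Acme invoices over €10k") yields a SQL route on total_amount with op ">" and value 10000, and the word "Acme" is simply dropped. That is the "one branch only" limitation in practice. A query such as "contracts mentioning indemnity" contains no numeric token and goes to the vector branch.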