
Sifter
Turn messy document folders into queryable structured databases.
Tagline
Turn messy folders into databases
Documents as a queryable database layer
Stop building brittle extraction templates
Open-source document intelligence for real chaos
Sifter is a document database layer, not a document search tool.
The page explicitly contrasts structured aggregation with RAG similarity search and emphasizes filter/aggregate/query semantics. This is a strong category-creation angle because the product behaves like a database for files, not a retrieval layer.
The anti-template extractor for real-world document chaos.
The strongest repeated proof point is that it works across layout changes, supplier variations, and mixed document formats without per-layout configuration. That directly attacks the brittle template-extraction market.
Open-source document intelligence for teams that can’t trust black-box SaaS.
MIT license, self-hosting, BYOK, and no vendor lock-in are prominent. This angle will resonate with engineering-led buyers, privacy-conscious teams, and companies building workflows on top of extracted document data.
Primary user
Technical operations or automation lead at a mid-market company managing large folders of recurring documents
ICP #1
Finance operations manager at a 50-500 employee company drowning in supplier invoices and expense receipts
Pain
They need totals, due dates, vendor names, and month-by-month spend, but invoice formats vary by supplier and manual entry is slow and error-prone.
Why this solves
Sifter turns those files into structured rows with citations, then lets them query totals, filter unpaid invoices, and generate live dashboards without building templates for each vendor.
ICP #2
Talent operations manager at a scaling startup processing hundreds of resumes per week
Pain
Resumes come in wildly different layouts, so resume parsers miss fields and recruiters end up scanning PDFs manually.
Why this solves
Sifter can extract consistent fields like skills, years of experience, and work history across any CV format, making the folder queryable as a database instead of a pile of attachments.
ICP #3
Founding engineer at a vertical SaaS or internal tools company building document workflows for customers
Pain
They need a reliable way to extract and query data from user-uploaded documents without getting locked into a brittle vendor or template system.
Why this solves
Sifter gives them an API, SDK, webhooks, MCP, and self-hosting, so they can ship document intelligence fast while keeping control over infra, data, and LLM provider choice.
Strengths
- The positioning is unusually crisp and memorable: "Your documents are a dark database" instantly frames the category.
- The RAG-vs-Sifter comparison is concrete and persuasive because it shows exact failure modes like wrong month, wrong client, and missing invoices.
- The product surface area is clearly communicated across UI, API, SDK, MCP, webhooks, dashboards, and self-hosting, which builds trust with technical buyers.
Weaknesses
- The homepage is heavily developer-centric and risks losing non-technical operations buyers who would benefit from the product but don’t care about MCP or typed SDKs.
- The copy explains how Sifter works better than it explains the business outcome; there’s too much mechanism and not enough ROI language around hours saved, faster closes, or fewer extraction errors.
- There’s no strong vertical entry point on the page; it lists many use cases, which dilutes the message instead of choosing a wedge like invoices, contracts, or resumes.
- The pricing is simple, but the extraction-based metric is abstract and needs clearer examples of what 10, 500, or 3,000 extractions mean in real business terms.
- The landing page assumes readers already understand the pain of RAG and template extraction; that makes the argument weaker for mainstream buyers who just want document automation.
Fix these
- Create separate homepage hero variants for the top three wedges: invoices/expenses, contracts, and resumes, each with its own proof, screenshots, and query examples.
- Add an ROI block with specific outcomes like "close month-end faster," "replace manual invoice coding," or "turn a folder of PDFs into a searchable dataset in minutes."
- Lead with one primary buyer and one primary workflow instead of listing every feature; for example, "invoice intelligence for finance teams" or "contract extraction for ops teams."
- Replace some of the technical feature inventory with before/after workflow visuals showing upload → extracted table → query → dashboard → webhook.
- Add trust-building proof artifacts like sample schemas, benchmark comparisons, accuracy numbers, and a short customer quote to make the anti-template claim more credible.
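As a rough illustration of the "sample schema" and "query example" artifacts suggested above, here is a minimal Python sketch. It does not use Sifter's actual API or SDK: the schema wording, the InvoiceRow and Citation shapes, the helper functions, and the sample rows are all hypothetical, made up only to show what a before/after proof block could contain (a plain-English schema, rows that carry citations, one filter, one aggregate).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    page: int          # page the value was read from
    source_text: str   # verbatim snippet that supports the value


@dataclass(frozen=True)
class InvoiceRow:
    vendor: str
    total: float
    due_date: date
    paid: bool
    total_citation: Citation  # lets a reviewer verify the total


# Hypothetical plain-English schema: one description per field.
INVOICE_SCHEMA = {
    "vendor": "Name of the supplier that issued the invoice",
    "total": "Grand total including tax, as a number",
    "due_date": "Payment due date",
    "paid": "Whether the invoice is marked as paid",
}

# Made-up rows standing in for extraction output (not real data).
ROWS = [
    InvoiceRow("Acme Supply", 12500.00, date(2024, 2, 15), False,
               Citation(1, "Total due: $12,500.00")),
    InvoiceRow("Acme Supply", 4200.00, date(2024, 3, 1), True,
               Citation(2, "TOTAL 4,200.00 USD")),
    InvoiceRow("Northwind Freight", 18000.00, date(2024, 3, 20), False,
               Citation(1, "Amount payable 18,000.00")),
]


def unpaid_over(rows, threshold):
    """Filter: unpaid invoices whose total exceeds the threshold."""
    return [r for r in rows if not r.paid and r.total > threshold]


def spend_by_vendor(rows):
    """Aggregate: sum of invoice totals per vendor."""
    totals = defaultdict(float)
    for r in rows:
        totals[r.vendor] += r.total
    return dict(totals)


if __name__ == "__main__":
    for row in unpaid_over(ROWS, 10_000):
        cite = row.total_citation
        print(f"{row.vendor}: {row.total:.2f} (p.{cite.page}: {cite.source_text!r})")
    print(spend_by_vendor(ROWS))
```

On a landing page, the same idea would be shown with real product screenshots or real API calls; the point of the sketch is only that a filter, an aggregate, and a citation per value fit in a few lines once documents are rows.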
Drop-in replacement copy
Headline
Documents, now a database
Extract structured records from messy files and query them like data.
Turn files into rows you can trust
Upload PDFs, scans, images, receipts, contracts, or resumes and extract structured records with a schema you define in plain English or JSON. Every field includes citations back to the source text and page number.
Query folders like a real dataset
Filter, sort, aggregate, export, and search your extracted records in the web UI or via REST API. Stop opening files one by one just to find totals, missing fields, or the latest version.
Built for layout drift, not perfect templates
Sifter is designed for document collections that are mostly the same but never identical. You do not need to maintain a brittle rule per vendor, format, or page design.
Ship it into your stack fast
Use the Python SDK, TypeScript client, webhooks, and MCP server to plug extraction into internal tools or SaaS workflows. Self-host with Docker Compose and keep control of your data and model provider.
FAQ
How is this different from OCR or RAG?
OCR gives you text. RAG helps you find text. Sifter gives you structured records you can query, aggregate, and export.
What kinds of documents work best?
Homogeneous collections with variable layouts: invoices, receipts, contracts, resumes, utility bills, and scans. If the same fields repeat across messy files, Sifter is a fit.
Do I need to create templates for every vendor or layout?
No. You define the fields you want, and Sifter extracts across changing formats without per-layout configuration.
Can I trust the extracted data?
Each field includes citations with page number and source text so you can verify where the value came from.
Can I self-host Sifter?
Yes. Sifter is MIT licensed, self-hostable with Docker Compose, and supports BYOK LLM setups.
Your PDFs are a dark database. Sifter turns messy folders into queryable structured rows. No templates. No per-layout rules. Just schema-driven extraction, citations, filters, aggregates, dashboards, API, SDK, and MCP. Ship it on your docs.
RAG was built for retrieval. Sifter was built for documents that need rows. Invoices, receipts, contracts, resumes, scans. Extract fields with natural language schemas. Query them like a database. Export them like data. Dashboard them like metrics.
I kept seeing the same failure: Teams would dump PDFs into a vector store and then ask why totals, dates, and vendor names were wrong. Because search is not structure. Sifter makes document collections queryable. Think database, not chatbot.
We removed templates entirely. Instead of drawing boxes and praying layouts never change, you describe the schema in plain English. Sifter extracts the records, keeps citations, and updates dashboards automatically. Less glue code. Less breakage.
Month-end closes die in folders. One supplier uses clean PDFs. Another sends scanned images. A third changes invoice layout every quarter. Sifter normalizes all of it into rows you can filter, aggregate, and export.
Resume parsers miss the important bits. Different layouts. Different formatting. Same headache. Sifter turns a pile of CVs into structured fields you can query by skills, years of experience, roles, and history.
Query 5,000 invoices in one line: "Show unpaid invoices over $10k from last quarter." That’s the product. Upload files, define the schema, extract the rows, then filter, aggregate, export, or feed the data into your stack.
One schema, many file formats. PDFs, photos, scans, receipts, contracts, CVs. Sifter extracts the same fields even when layouts drift. Then you get typed data, citations, webhooks, dashboards, and an API your engineers won’t hate.
The best part is not OCR. It’s that ops teams can finally ask real questions: What’s overdue? What changed this month? Which clients are missing docs? Sifter turns document piles into a live dataset.
Teams do not want another parser. They want answers, tables, exports, and a system they can trust. That’s why Sifter ships with citations, self-hosting, BYOK LLM support, and open-source code.
Angle: Document database, not search tool
Most document AI tools are still thinking like search engines. They try to find text. They try to answer questions. They are built around retrieval. That works until your real problem is structure.
Invoices. Receipts. Contracts. Resumes. Utility bills. You do not want “similar documents.” You want rows. You want filters. You want aggregations. You want a dashboard that updates when new files land.
That is why I built Sifter. It turns messy folders into queryable structured databases. You define the schema in plain English or JSON. It extracts records from PDFs, scans, images, and photos. Every field gets a citation back to source text and page number. Then you can query it like data, not like a pile of attachments.
If you’ve ever built brittle templates for document intake, you know the pain. Layouts drift. Suppliers change formats. Manual review becomes the bottleneck. Sifter is the anti-template layer for document chaos.
Angle: Ops outcome and ROI
The easiest way to lose a week is to let documents stay as documents.
A finance team gets 300 invoices. A recruiting team gets 200 resumes. An ops team gets a folder of scanned PDFs. Someone has to open each one. Someone has to copy fields into a sheet. Someone has to fix mistakes when layouts change. That is not automation. That is outsourced clicking.
Sifter turns those files into structured records. Then the team can filter unpaid invoices, group spend by vendor, search candidates by skill, or export everything into the downstream system.
What changed for me while building this was simple: people do not buy extraction. They buy time back. They buy fewer errors. They buy faster closes. They buy a process that does not break every time a supplier changes a PDF.
That is the business outcome we’re optimizing for. Not “AI document processing.” A folder that behaves like a database.
Angle: Open-source trust and control
There is a reason a lot of teams hesitate to send critical documents to black-box SaaS. Data sensitivity. Vendor lock-in. Provider risk. No control over infra. No control over model choice. That is especially true when the documents are invoices, contracts, HR files, or customer-uploaded records.
Sifter was built for teams that want the benefits of document intelligence without giving up control. It is open-source. It is self-hostable with Docker Compose. It supports BYOK LLMs. It ships with an API, SDKs, webhooks, and an MCP server. That means a founding engineer can ship document workflows fast, and an ops team can keep the system inside their own trust boundary.
My view is simple: if the data matters enough to extract, it matters enough to control. That is the product.
Tagline
Turn documents into structured databases
Description
Sifter turns PDFs, scans, receipts, contracts, and resumes into queryable rows with citations. Define schemas in plain English, then filter, aggregate, export, dashboard, or connect via API, SDK, webhooks, and MCP.
Maker's first comment
I built Sifter because I kept seeing the same failure mode: teams had piles of documents, but their only options were brittle templates, noisy OCR, or RAG systems that could find text but not create structure.
That breaks down fast in real workflows. An invoice changes layout. A receipt is blurry. A resume comes in a weird format. A contract has the right clause on page 7, but the business needs the field, the table, and the trend line, not another chat answer.
Sifter is my attempt to make documents behave like data. You define the schema, Sifter extracts records with citations, and then you can query, filter, aggregate, export, or dashboard the result.
I’m launching this because I want feedback from people who live in document chaos every day: finance ops, recruiting ops, and engineers building workflows on top of uploaded files. I’d love to know where the extraction breaks, what schemas you’d want first, and whether the “document database” framing resonates or if it needs to be sharper.
Pinned maker comment
I’m especially looking for feedback on the first 10 minutes of use: does the schema flow feel obvious, does the citation layer build trust, and which wedge should be the homepage hero first — invoices, resumes, or contracts?
Meta
Invoice chaos is not an OCR problem.
Targeting finance ops managers at 50-500 employee companies. Hypothesis: teams with mixed invoice formats do not need another parser; they need structured rows they can query. Sifter turns invoices, receipts, and bills into a database with citations, filters, and dashboards.
Google Search
Parse PDFs into structured records
Targeting engineers searching for document extraction APIs. Hypothesis: builders want schema-driven extraction that works across layout changes, not template maintenance. Sifter supports PDFs, scans, images, SDKs, webhooks, MCP, and self-hosting.
Reddit Promoted
Resumes are still being manually sorted.
Targeting talent ops and startup teams drowning in CVs. Hypothesis: resume parsing fails when layouts vary, so teams need queryable fields with citations instead of brittle parsers. Sifter turns folders of resumes into structured data you can filter and export.
Subreddits
r/SideProject
Share the build, the problem, and a short demo of turning a folder of PDFs into a queryable table
Rules: No pure promo posts; show product and process, keep it honest, and engage in comments
r/indiehackers
Post the technical and business lesson: why template extractors fail and how schema-driven extraction changes the workflow
Rules: Focus on learnings and founder journey, not just a launch link
r/microsaas
Show how Sifter can be turned into a niche SaaS for invoices, resumes, or contracts
Rules: Keep it relevant to small software products and avoid generic marketing
r/EntrepreneurRideAlong
Document the launch process and ask for feedback on positioning and pricing
Rules: Posts should be transparent progress updates, not obvious ads
r/datascience
Technical post about extracting structured datasets from messy documents with citations and schema definitions
Rules: Must be technical, educational, and not framed as a sales pitch
Communities
Post a build log, a concrete workflow demo, and one lesson about why RAG fails for structured document work. Reply to every comment with specifics, not marketing.
Launch with a technical angle: open-source, self-hosted, schema-driven extraction, and the database framing. Keep the post factual and let the comments do the work.
Share a before/after example of turning messy documents into a dataset that can be filtered and aggregated. Focus on outcomes like faster reporting and fewer manual errors.
Cold outreach template
{firstName} — saw {context} and thought of Sifter because it turns messy document folders into queryable rows, not another brittle parser. If you’re still manually dealing with invoices/resumes/contracts that change format, I can show you a 2-minute demo. Want me to send it?
Product Hunt timing
Launch on Tuesday at 12:01am Pacific Time. That gives you the full day to compound votes while US buyers are awake, and it avoids weekend lag for technical founders, ops leads, and engineers who tend to browse PH during work hours.
Indie Hackers post ideas
- 01 Why I built an anti-template document extractor
- 02 From dark folder to queryable database: the product demo
- 03 What broke in RAG for invoices, resumes, and contracts
Current tone of voice
Confident, technical, and slightly provocative; it opens with the line "Your documents are a dark database" and repeatedly uses sharp contrasts like "RAG was built for retrieval. Sifter was built for this."
