Every operations team has that one process — the one where someone copies numbers from PDFs into a spreadsheet, row by row, for hours. Invoices, contracts, purchase orders, bank statements. The data is right there in the document, but getting it out feels like archaeology.
We have built PDF data extraction workflows for clients processing anywhere from 200 to 8,000 documents per month. The pattern is almost always the same: a trigger picks up the file, an OCR service reads it, a language model structures the output, and the result lands in a spreadsheet or database. The specifics vary, but the architecture does not.
This tutorial walks you through building that exact workflow from scratch using n8n (self-hosted), AWS Textract for OCR, and GPT-4o for intelligent field extraction. By the end, you will have a working pipeline that processes PDFs automatically and costs under $50/month for most SMB volumes.
What You Will Learn
- How to set up an end-to-end PDF data extraction workflow in n8n
- When to use AWS Textract vs Google Document AI vs dedicated SaaS tools
- How to write structured extraction prompts for GPT-4o that produce clean JSON
- Real cost math: what this actually costs per document at various volumes
- Common failure modes and how to handle them before they bite you in production
Prerequisites
- A self-hosted n8n instance (if you do not have one, check our guide on how to self-host n8n on AWS or DigitalOcean)
- An AWS account with Textract enabled
- An OpenAI API key with GPT-4o access
- A Google Sheet or Airtable base for output (or any database)
- Basic familiarity with n8n node configuration
Step 1: Define Your Document Schema
Before building anything, write down exactly what fields you need from each document type. This sounds obvious, but skipping this step is the number one reason extraction workflows produce messy data.
For invoices, your schema might look like this:
```json
{
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD",
  "subtotal": "number",
  "tax_amount": "number",
  "total_amount": "number",
  "currency": "string",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "amount": "number"
    }
  ]
}
```
For contracts, you might need: parties involved, effective date, termination date, key obligations, payment terms, and governing law.
The specificity matters because your GPT-4o prompt will reference this schema directly. Vague schemas produce vague outputs.
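To make the schema enforceable rather than aspirational, it helps to validate every extracted record against it in code before the data goes anywhere. A minimal sketch in plain JavaScript, using the invoice field names above (the function name and error messages are ours):

```javascript
// Minimal schema check for extracted invoice records.
// Returns a list of problems so bad extractions can be flagged for review.
function validateInvoice(doc) {
  const problems = [];
  const requiredStrings = ['vendor_name', 'invoice_number', 'currency'];
  const requiredNumbers = ['subtotal', 'tax_amount', 'total_amount'];

  for (const f of requiredStrings) {
    if (typeof doc[f] !== 'string' || doc[f].length === 0) {
      problems.push(`${f}: missing or not a string`);
    }
  }
  for (const f of requiredNumbers) {
    if (typeof doc[f] !== 'number' || Number.isNaN(doc[f])) {
      problems.push(`${f}: missing or not a number`);
    }
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(doc.invoice_date || '')) {
    problems.push('invoice_date: not YYYY-MM-DD');
  }
  if (!Array.isArray(doc.line_items)) {
    problems.push('line_items: not an array');
  }
  return problems;
}
```

An empty result means the record is safe to pass downstream; anything else goes to the review queue.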
Step 2: Set Up the n8n Workflow Trigger
You need a way to get PDFs into the workflow. Three options work well:
Option A — Google Drive Trigger: Set up an n8n Google Drive trigger node watching a specific folder. When someone drops a PDF into that folder, the workflow fires. This is the easiest approach for teams already using Google Workspace.
Option B — Webhook Trigger: Create a webhook node in n8n. Your application or form sends a POST request with the PDF as a file attachment or a URL to the file. Best for API-first setups.
Option C — Email Trigger (IMAP): Monitor a dedicated inbox (e.g., invoices@yourcompany.com). When an email with a PDF attachment arrives, extract the attachment and process it. Surprisingly common in accounting workflows.
For this tutorial, we will use the Google Drive trigger. In n8n, add a Google Drive Trigger node, authenticate it with your Google account, and set it to watch a folder called "Incoming Invoices" for new files.
Step 3: Extract Text with AWS Textract
This is where the actual OCR happens. AWS Textract is our go-to for document processing, and as an AWS Partner, we have run enough documents through it to know its strengths and limitations.
Add a Function node in n8n that calls the Textract API through the AWS SDK. Here is the configuration:

```javascript
// n8n Function node to call Textract.
// Note: on self-hosted n8n, aws-sdk must be allow-listed via the
// NODE_FUNCTION_ALLOW_EXTERNAL environment variable.
const AWS = require('aws-sdk');

const textract = new AWS.Textract({
  region: 'ap-south-1',
  accessKeyId: $env.AWS_ACCESS_KEY,
  secretAccessKey: $env.AWS_SECRET_KEY
});

const params = {
  Document: {
    // Binary file data from the trigger node arrives base64-encoded
    Bytes: Buffer.from(items[0].binary.data.data, 'base64')
  },
  FeatureTypes: ['TABLES', 'FORMS']
};

const result = await textract.analyzeDocument(params).promise();

// Flatten the LINE blocks into plain text, one line per row
const extractedText = result.Blocks
  .filter(block => block.BlockType === 'LINE')
  .map(block => block.Text)
  .join('\n');

return [{ json: { extractedText, rawBlocks: result.Blocks } }];
```

One caveat: the synchronous AnalyzeDocument call shown here accepts single-page documents. Multi-page PDFs go through the asynchronous StartDocumentAnalysis API, with the file staged in S3 (more on this under Common Issues below).
Why Textract over alternatives?
We have tested Google Document AI, Azure Form Recognizer, and Textract side by side on the same set of 500 invoices from Indian and GCC vendors. Textract performed best on scanned documents with mixed English and regional language text. Google Document AI was slightly better on clean, digitally-generated PDFs. Azure was solid across the board but the pricing is less transparent.
For most of our clients in India, UAE, and Singapore, Textract wins because of the AWS ecosystem integration and the pricing model we already operate within as an AWS Partner.
Step 4: Structure the Data with GPT-4o
Raw OCR output is messy. It is a wall of text with no structure. This is where GPT-4o earns its keep.
Add a second Function node or an OpenAI node in n8n with this prompt:
```
You are a document data extraction assistant. Extract the following fields from this invoice text and return ONLY valid JSON with no additional text.

Schema:
{
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD or null",
  "subtotal": number,
  "tax_amount": number,
  "total_amount": number,
  "currency": "INR/USD/AED/SGD",
  "line_items": [
    {
      "description": "string",
      "quantity": number,
      "unit_price": number,
      "amount": number
    }
  ]
}

Rules:
- If a field is not found, use null
- Dates must be in YYYY-MM-DD format
- All monetary values must be numbers (no currency symbols)
- If line items cannot be clearly identified, return an empty array

Document text:
{{extractedText}}
```
Critical gotcha: Always validate the JSON output before passing it downstream. GPT-4o occasionally wraps the response in markdown code fences or adds explanatory text. Add a Function node after the OpenAI call:
````javascript
let rawOutput = $json.choices[0].message.content;

// Strip markdown code fences if present
rawOutput = rawOutput.replace(/```json\n?/g, '').replace(/```\n?/g, '').trim();

try {
  const parsed = JSON.parse(rawOutput);
  return [{ json: parsed }];
} catch (e) {
  // Log the failed parse for debugging
  return [{ json: { error: 'JSON parse failed', raw: rawOutput } }];
}
````
Step 5: Route the Output
Now you have clean, structured JSON. Send it wherever it needs to go:
Google Sheets: Add a Google Sheets node. Map each JSON field to a column. This is the fastest setup for teams that live in spreadsheets.
Airtable: Use the Airtable node for teams that need relational data. Invoice records can link to vendor records, project records, and so on.
Your database: Use the HTTP Request node or a dedicated database node (Postgres, MySQL) to insert records directly.
CRM or ERP: For enterprise setups, send extracted data to Zoho, SAP, or whatever system your finance team uses via API.
Add an IF node before routing to handle the error case — if the JSON parse failed, send the document to a manual review queue instead of the automated pipeline.
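The check behind that IF node can live in a small Function node just before it. A sketch, using the invoice schema above (the branch names and rounding tolerance are ours, not n8n's):

```javascript
// Pre-routing check: decide whether an extracted record can flow through the
// automated pipeline or needs a human. Flags both upstream parse failures
// and records whose totals do not add up.
function routeDecision(doc) {
  if (doc.error) return 'manual_review'; // JSON parse failed upstream

  const sum = (doc.subtotal ?? 0) + (doc.tax_amount ?? 0);
  const total = doc.total_amount ?? 0;

  // Allow a small rounding tolerance before flagging a mismatch
  if (Math.abs(sum - total) > 0.01) return 'manual_review';

  return 'automated';
}
```

The arithmetic check catches a surprisingly common failure: the model reads the subtotal and total from different sections of the invoice.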
Real Cost Breakdown
Here is what this workflow actually costs in production. These numbers are from our own deployments.
AWS Textract pricing (as of 2025-2026):
- Text detection: $1.50 per 1,000 pages
- Form and table extraction: $15.00 per 1,000 pages
- For invoices with tables, you need form extraction: roughly $0.015 per page
GPT-4o pricing:
- Input: $2.50 per million tokens
- Output: $10.00 per million tokens
- Average invoice OCR text: ~800 tokens input, ~400 tokens output
- Cost per document: ~$0.006
n8n self-hosted: $12-20/month on DigitalOcean (unlimited executions)
Total cost per document: approximately $0.02.
Monthly cost for 5,000 documents: ~$100 (Textract) + ~$30 (GPT-4o) + $15 (server) = ~$145/month.
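The arithmetic above is easy to rerun for your own volumes. A sketch of the cost model, using the per-unit rates quoted in this section (the pages-per-document figure is our rough average; adjust everything as pricing changes):

```javascript
// Back-of-envelope monthly cost model for the Textract + GPT-4o pipeline.
// Defaults mirror the rates quoted in the cost breakdown above.
function monthlyCost(docsPerMonth, {
  pagesPerDoc = 1.3,        // invoices often run slightly over one page
  textractPerPage = 0.015,  // AnalyzeDocument with FORMS/TABLES
  inputTokens = 800,        // average OCR text per invoice
  outputTokens = 400,       // average structured JSON response
  gpt4oInputPerM = 2.50,    // $ per million input tokens
  gpt4oOutputPerM = 10.00,  // $ per million output tokens
  serverPerMonth = 15       // self-hosted n8n droplet
} = {}) {
  const textract = docsPerMonth * pagesPerDoc * textractPerPage;
  const llm = docsPerMonth *
    (inputTokens * gpt4oInputPerM + outputTokens * gpt4oOutputPerM) / 1e6;
  return { textract, llm, server: serverPerMonth, total: textract + llm + serverPerMonth };
}
```

At 5,000 documents this reproduces the ~$145/month figure; at 500 documents it lands around $30, which is where the DIY economics start to look marginal.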
Compare this to SaaS alternatives:
- Rossum: Starts at $300/month, enterprise pricing can exceed $2,000/month
- Nanonets: $499/month for their business plan
- Docsumo: Custom pricing, typically $500-$1,500/month for comparable volume
- Docparser: $99/month for 500 docs, $499/month for 5,000 docs
For SMBs processing under 10,000 documents per month, the custom n8n + Textract stack costs 60-80% less than any SaaS alternative. Where SaaS wins is when you need zero technical setup and have budget but not engineering time.
Common Issues and Fixes
Scanned PDFs with low DPI fail OCR: If the source document was scanned at below 200 DPI, Textract accuracy drops significantly. Add a pre-processing step using ImageMagick or Sharp to upscale the image before sending it to Textract. A quick fix: `convert -density 300 input.pdf -quality 90 output.pdf` (the `-density` flag must come before the input file so the PDF is rasterized at 300 DPI).
Multi-page documents need chunking: Textract handles multi-page PDFs natively (through the asynchronous StartDocumentAnalysis API, with the file staged in S3), but GPT-4o has context limits. For documents over 15 pages, chunk the text by page, process each page separately, then merge the results.
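A sketch of that page-wise chunking, relying on the fact that Textract blocks carry a Page number in multi-page output (the function name is ours):

```javascript
// Split Textract output into one text chunk per page, so each page can be
// sent to GPT-4o separately and the structured results merged afterwards.
function chunkByPage(blocks) {
  const pages = new Map();
  for (const block of blocks) {
    if (block.BlockType !== 'LINE') continue;
    const page = block.Page || 1; // single-page responses may omit Page
    if (!pages.has(page)) pages.set(page, []);
    pages.get(page).push(block.Text);
  }
  // Return chunks in page order
  return [...pages.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([page, lines]) => ({ page, text: lines.join('\n') }));
}
```

Each chunk then goes through the same extraction prompt, and a final merge step concatenates line items and reconciles totals.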
Table extraction is harder than text: OCR engines are great at reading lines of text. They struggle with complex tables — merged cells, rotated headers, tables that span pages. For table-heavy documents, use Textract's AnalyzeDocument with TABLES feature type specifically, then pass the structured table data to GPT-4o rather than raw text.
Currency and locale issues: Indian invoices use lakhs and crores notation. UAE invoices use AED with different decimal formatting. Build locale-aware validation into your JSON parsing step. We typically add a normalization function that converts all amounts to a standard format.
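A sketch of such a normalization function. The symbol list covers the currencies in our schema (extend it for your vendors), and it assumes `.` is the decimal separator, which holds for INR, USD, AED, and SGD formatting in practice:

```javascript
// Normalize amounts from mixed-locale invoices to plain numbers.
// Handles Indian digit grouping (1,23,456.78), Western grouping
// (123,456.78), and currency symbols/codes glued to the number.
function normalizeAmount(raw) {
  if (typeof raw === 'number') return raw;
  const cleaned = String(raw)
    .replace(/(INR|USD|AED|SGD|Rs\.?|₹|\$)/gi, '') // strip symbols and codes
    .replace(/,/g, '')                             // strip grouping commas
    .trim();
  const value = Number(cleaned);
  return Number.isNaN(value) ? null : value; // null signals "send to review"
}
```

Returning null instead of a guess matters: a silently wrong amount is far more expensive than a flagged one.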
Rate limiting: Both Textract and OpenAI have rate limits. If you are processing a large batch, add a Wait node in n8n between documents (2-3 seconds is usually enough) to avoid hitting API throttles.
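If you prefer retries over a fixed wait, a small backoff wrapper inside a Function node works too. A sketch (the retry count and delays are arbitrary defaults, not values either API documents):

```javascript
// Retry a throttled API call with exponential backoff: 2s, 4s, 8s by default.
async function withBackoff(fn, { retries = 3, baseDelayMs = 2000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries, surface the error
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: wrap the Textract or OpenAI call
// const result = await withBackoff(() => textract.analyzeDocument(params).promise());
```

For production batches you would also check the error code and only retry on throttling errors, not on permanent failures like malformed input.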
When NOT to Build This Yourself
We are honest about this: the DIY approach is not always the right call.
Use a SaaS tool when:
- Your team has zero technical capacity for maintenance
- You process fewer than 200 documents per month (the math does not justify custom work)
- Your documents are highly standardized and a template-based parser like Docparser handles them perfectly
- You need SOC 2 or HIPAA compliance out of the box and do not want to manage it yourself
Build custom when:
- You process 1,000+ documents per month and cost matters
- Your documents vary significantly in format and layout
- You need the data in a specific system (ERP, custom database) that SaaS tools do not integrate with natively
- You want full control over your data and processing logic
- You already run n8n or have engineering capacity
Results from Production
When we deployed a similar workflow for a client processing vendor invoices across India and UAE, the results were concrete: manual data entry time dropped from approximately 130 hours per month to under 15 hours per month (only for exception handling and validation). Error rates went from an estimated 4-5% with manual entry to under 1% with AI extraction plus human review on flagged items.
The workflow paid for itself within the first month.
Written by Rishabh Sethia, Founder & CEO
Rishabh Sethia is the founder and CEO of Innovatrix Infotech, a Kolkata-based digital engineering agency. He leads a team that delivers web development, mobile apps, Shopify stores, and AI automation for startups and SMBs across India and beyond.