Rules vs. LLMs: A Practical Battle for B2B Document Extraction


Introduction

In the world of B2B operations, extracting structured data from documents like purchase orders, invoices, and shipping notes is a perennial challenge. Many teams have relied for years on rule-based approaches, often built on top of OCR engines like Tesseract. But with the rise of large language models (LLMs), a new contender has emerged: using models like LLaMA 3, deployed locally via Ollama, to perform the same extraction tasks. We built both versions of a document extractor for a realistic B2B order scenario and put them head-to-head. This article dives into the results, trade-offs, and key takeaways from this practical comparison.


The Challenge: Extracting B2B Orders

Our testbed was a set of scanned PDF purchase orders from a mid-sized manufacturing company. Each document contained fields like order number, customer name, line items (part numbers, quantities, prices), shipping address, and totals. The goal: extract these fields accurately and consistently, with minimal human intervention. We developed two pipelines:

- A rule-based pipeline: Tesseract OCR (via pytesseract) followed by a cascade of regex patterns and position-based heuristics.
- An LLM-based pipeline: the same OCR text passed to LLaMA 3 (8B), running locally via Ollama, with a prompt requesting structured JSON.

Both solutions were tested on a sample of 50 documents with varying layouts, font sizes, and print quality.

Rule-Based Approach: Tried and True

OCR with pytesseract

The first step was converting pages to images and running Tesseract via pytesseract. For clean digital PDFs, accuracy was high—over 95% on character-level recognition. However, scanned documents with low contrast, skew, or handwritten notes introduced noise.
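A minimal sketch of this step might look like the following; pdf2image is an assumed choice for rasterization (any PDF renderer works), and both the Tesseract and Poppler binaries need to be installed alongside the Python packages:

```python
# Minimal OCR sketch: render each PDF page to an image, then run Tesseract.
# pdf2image (Poppler) is an assumption; pytesseract wraps the Tesseract binary.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str, dpi: int = 300) -> list[str]:
    """Return one string of recognized text per PDF page."""
    pages = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(page) for page in pages]

page_texts = ocr_pdf("purchase_order.pdf")  # hypothetical input file
```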

Rule-Based Parsing

After OCR, we applied a cascade of regex patterns and position-based heuristics. For example, the order number was expected to appear after the phrase "Order #" within the top 15% of the page. Line items were detected by looking for tabular structures (multiple lines with numeric columns).
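As an illustration, a simplified version of the order-number rule might look like the sketch below. Since plain OCR text loses pixel coordinates, the "top 15% of the page" heuristic is approximated here by line position (pytesseract's image_to_data output would give true bounding boxes):

```python
import re

# Hypothetical simplification of one rule in the cascade: look for
# "Order #" followed by an alphanumeric ID, but only trust matches
# near the top of the page (approximated by line offset).
ORDER_NUMBER_RE = re.compile(r"Order\s*#\s*([A-Z0-9][A-Z0-9-]*)", re.IGNORECASE)

def extract_order_number(page_text: str, top_fraction: float = 0.15) -> str | None:
    lines = page_text.splitlines()
    cutoff = max(1, int(len(lines) * top_fraction))
    match = ORDER_NUMBER_RE.search("\n".join(lines[:cutoff]))
    return match.group(1) if match else None
```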

Strengths:

- Fast: roughly 1.5 seconds per page on CPU-only hardware.
- Deterministic: the same input always produces the same output.
- Very cheap to run, with near-perfect precision on the fields it does extract.

Weaknesses:

- Brittle: new or shifted layouts break the position-based heuristics.
- High maintenance: each new template costs hours of rule writing.
- Sensitive to OCR noise from low contrast, skew, or handwritten notes.

LLM-Based Approach: The New Kid on the Block

Local LLM with Ollama and LLaMA 3

We set up Ollama on a modest machine (16GB RAM, no dedicated GPU) and pulled LLaMA 3 (8B parameters). The OCR output text was passed as part of a prompt instructing the model to extract fields in JSON format. We also experimented with passing the raw image as base64 into vision-enabled LLMs (like LLaVA via Ollama), but that dramatically increased latency.

Prompt example: "You are an order extraction assistant. Given the following raw text from a purchase order, output a JSON object with keys: order_number, customer_name, line_items (list of objects with part_number, quantity, unit_price), shipping_address, total_amount. Only return valid JSON."
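One way to wire this up is against Ollama's local REST API, which listens on localhost:11434 by default; the sketch below is illustrative, with the field schema mirroring the prompt above:

```python
import json
import requests

# Sketch of the extraction call via Ollama's local REST API.
# "format": "json" asks Ollama to constrain output to valid JSON.
PROMPT = (
    "You are an order extraction assistant. Given the following raw text "
    "from a purchase order, output a JSON object with keys: order_number, "
    "customer_name, line_items (list of objects with part_number, quantity, "
    "unit_price), shipping_address, total_amount. Only return valid JSON.\n\n"
)

def extract_with_llm(ocr_text: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": PROMPT + ocr_text,
            "format": "json",
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])
```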

Performance and Accuracy

The LLM pipeline took 5–15 seconds per page (mostly due to generation time). Accuracy was impressive: 92% for exact field matches (vs 88% for rules). More importantly, it handled layout variation gracefully. For instance, one vendor placed the order number at the bottom right; the LLM still found it, whereas the rule-based system missed it entirely.

Strengths:

- Robust to layout variation: fields are found even when templates change.
- Higher exact-match accuracy (92% vs. 88% for rules) on our test set.
- Low maintenance: adapting to a new vendor is mostly prompt tuning.

Weaknesses:

- Slow: 5–15 seconds per page on CPU, versus ~1.5 seconds for rules.
- Non-deterministic: repeated runs produced roughly 2% output variation.
- A small but real hallucination risk, such as inventing line items.
- Heavier infrastructure needs: a GPU is advisable for production throughput.


Head-to-Head Comparison

The table below summarizes key metrics across our test set of 50 B2B order documents:

| Metric | Rule-Based | LLM-Based (LLaMA 3) |
|---|---|---|
| Field extraction accuracy | 88% | 92% |
| Handling layout variation | Poor (breaks on new templates) | Good (adapts to most changes) |
| Processing time per page | ~1.5 s | ~8 s (CPU) |
| Maintenance effort (per new template) | High (2–4 hours) | Low (~10 min of prompt tuning) |
| Determinism | Yes | No (~2% output variation) |
| Infrastructure cost | Very low (CPU only) | Moderate (GPU advised) |
| Hallucination risk | None | Low but present (e.g., inventing line items) |

Both approaches had near-perfect precision on the fields they did extract. The LLM hallucinated on 3% of its extractions (values not present in the document), while the rule-based system never hallucinated but missed 10% of fields entirely.

When to Use Which

Go Rule-Based If:

- Your document templates are stable and few in number.
- You need determinism and per-page latency of a second or two.
- You must run on cheap, CPU-only infrastructure.

Go LLM-Based If:

- Layouts vary widely across vendors or change frequently.
- You want low per-template maintenance and can tolerate slower processing.
- Occasional non-determinism and a small hallucination risk are acceptable, for example with validation or human review downstream.

Conclusion

Building the same B2B document extractor twice—once with rules and once with an LLM—reveals that neither approach is universally superior. Rule-based systems remain unbeatable for speed, cost, and determinism when document templates are stable. LLMs, on the other hand, offer flexibility and higher accuracy on varied, noisy inputs, at the cost of speed and hardware requirements. A hybrid strategy could be the best of both worlds: run a fast rule-based extractor first, then fall back to an LLM when confidence is low. In our B2B scenario, the LLM approach proved more robust overall, but the right choice depends on your specific constraints.
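A sketch of what that fallback could look like is below; the required-field set, the simple presence check as a confidence proxy, and the extract_with_rules helper are all illustrative assumptions rather than part of either pipeline described above:

```python
# Illustrative hybrid router: try the cheap deterministic pass first,
# escalate to the LLM only when required fields come back empty.
# REQUIRED_FIELDS and extract_with_rules are hypothetical stand-ins.
REQUIRED_FIELDS = {"order_number", "customer_name", "line_items", "total_amount"}

def extract_order(page_text: str) -> dict:
    result = extract_with_rules(page_text)         # ~1.5 s, deterministic
    missing = REQUIRED_FIELDS - {k for k, v in result.items() if v}
    if missing:                                    # low confidence: fall back
        result = extract_with_llm(page_text)       # slower but layout-robust
    return result
```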

If you're considering automating document extraction, build a small prototype with both methods on your own documents. The gap between them is narrowing—and in many cases, LLMs are already the pragmatic winner for modern data pipelines.
