Rules vs. LLMs: A Practical Battle for B2B Document Extraction


Introduction

In the world of B2B operations, extracting structured data from documents like purchase orders, invoices, and shipping notes is a perennial challenge. Many teams have relied for years on rule-based approaches, often built on top of OCR engines like Tesseract. But with the rise of large language models (LLMs), a new contender has emerged: using models like LLaMA 3, deployed locally via Ollama, to perform the same extraction tasks. We built both versions of a document extractor for a realistic B2B order scenario and put them head-to-head. This article dives into the results, trade-offs, and key takeaways from this practical comparison.


The Challenge: Extracting B2B Orders

Our testbed was a set of scanned PDF purchase orders from a mid-sized manufacturing company. Each document contained fields like order number, customer name, line items (part numbers, quantities, prices), shipping address, and totals. The goal: extract these fields accurately and consistently, with minimal human intervention. We developed two pipelines:

- A rule-based pipeline: Tesseract OCR (via pytesseract) followed by a cascade of regex patterns and position-based heuristics.
- An LLM-based pipeline: the same OCR text passed to LLaMA 3 (8B), running locally via Ollama, with a prompt requesting structured JSON.

Both solutions were tested on a sample of 50 documents with varying layouts, font sizes, and print quality.

Rule-Based Approach: Tried and True

OCR with pytesseract

The first step was converting pages to images and running Tesseract via pytesseract. For clean digital PDFs, accuracy was high—over 95% on character-level recognition. However, scanned documents with low contrast, skew, or handwritten notes introduced noise.
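A minimal sketch of this step might look like the following; pdf2image is an assumed choice for rasterization (any PDF renderer works), and both the Tesseract and Poppler binaries need to be installed alongside the Python packages:

```python
# Minimal OCR sketch: render each PDF page to an image, then run Tesseract.
# pdf2image (Poppler) is an assumption; pytesseract wraps the Tesseract binary.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str, dpi: int = 300) -> list[str]:
    """Return one string of recognized text per PDF page."""
    pages = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(page) for page in pages]

page_texts = ocr_pdf("purchase_order.pdf")  # hypothetical input file
```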

Rule-Based Parsing

After OCR, we applied a cascade of regex patterns and position-based heuristics. For example, the order number was expected to appear after the phrase "Order #" within the top 15% of the page. Line items were detected by looking for tabular structures (multiple lines with numeric columns).
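As an illustration, a simplified version of the order-number rule might look like the sketch below. Since plain OCR text loses pixel coordinates, the "top 15% of the page" heuristic is approximated here by line position (pytesseract's image_to_data output would give true bounding boxes):

```python
import re

# Hypothetical simplification of one rule in the cascade: look for
# "Order #" followed by an alphanumeric ID, but only trust matches
# near the top of the page (approximated by line offset).
ORDER_NUMBER_RE = re.compile(r"Order\s*#\s*([A-Z0-9][A-Z0-9-]*)", re.IGNORECASE)

def extract_order_number(page_text: str, top_fraction: float = 0.15) -> str | None:
    lines = page_text.splitlines()
    cutoff = max(1, int(len(lines) * top_fraction))
    match = ORDER_NUMBER_RE.search("\n".join(lines[:cutoff]))
    return match.group(1) if match else None
```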

Strengths:

- Fast: roughly 1.5 seconds per page on CPU-only hardware.
- Deterministic: the same input always produces the same output.
- Very cheap to run, with near-perfect precision on the fields it does extract.

Weaknesses:

- Brittle: new or shifted layouts break the position-based heuristics.
- High maintenance: each new template costs hours of rule writing.
- Sensitive to OCR noise from low contrast, skew, or handwritten notes.

LLM-Based Approach: The New Kid on the Block

Local LLM with Ollama and LLaMA 3

We set up Ollama on a modest machine (16GB RAM, no dedicated GPU) and pulled LLaMA 3 (8B parameters). The OCR output text was passed as part of a prompt instructing the model to extract fields in JSON format. We also experimented with passing the raw image as base64 into vision-enabled LLMs (like LLaVA via Ollama), but that dramatically increased latency.

Prompt example: "You are an order extraction assistant. Given the following raw text from a purchase order, output a JSON object with keys: order_number, customer_name, line_items (list of objects with part_number, quantity, unit_price), shipping_address, total_amount. Only return valid JSON."
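One way to wire this up is against Ollama's local REST API, which listens on localhost:11434 by default; the sketch below is illustrative, with the field schema mirroring the prompt above:

```python
import json
import requests

# Sketch of the extraction call via Ollama's local REST API.
# "format": "json" asks Ollama to constrain output to valid JSON.
PROMPT = (
    "You are an order extraction assistant. Given the following raw text "
    "from a purchase order, output a JSON object with keys: order_number, "
    "customer_name, line_items (list of objects with part_number, quantity, "
    "unit_price), shipping_address, total_amount. Only return valid JSON.\n\n"
)

def extract_with_llm(ocr_text: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": PROMPT + ocr_text,
            "format": "json",
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])
```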

Performance and Accuracy

The LLM pipeline took 5–15 seconds per page (mostly due to generation time). Accuracy was impressive: 92% for exact field matches (vs 88% for rules). More importantly, it handled layout variation gracefully. For instance, one vendor placed the order number at the bottom right; the LLM still found it, whereas the rule-based system missed it entirely.

Strengths:

- Robust to layout variation: fields are found even when templates change.
- Higher exact-match accuracy (92% vs. 88% for rules) on our test set.
- Low maintenance: adapting to a new vendor is mostly prompt tuning.

Weaknesses:

- Slow: 5–15 seconds per page on CPU, versus ~1.5 seconds for rules.
- Non-deterministic: repeated runs produced roughly 2% output variation.
- A small but real hallucination risk, such as inventing line items.
- Heavier infrastructure needs: a GPU is advisable for production throughput.


Head-to-Head Comparison

The table below summarizes key metrics across our test set of 50 B2B order documents:

| Metric | Rule-Based | LLM-Based (LLaMA 3) |
|---|---|---|
| Field extraction accuracy | 88% | 92% |
| Handling layout variation | Poor (breaks on new templates) | Good (adapts to most changes) |
| Processing time per page | ~1.5 s | ~8 s (CPU) |
| Maintenance effort (per new template) | High (2–4 hours) | Low (~10 min of prompt tuning) |
| Determinism | Yes | No (~2% output variation) |
| Infrastructure cost | Very low (CPU only) | Moderate (GPU advised) |
| Hallucination risk | None | Low but present (e.g., inventing line items) |

Both approaches had near-perfect precision on the fields they did extract. The LLM hallucinated on 3% of its extractions (values not present in the document), while the rule-based system never hallucinated but missed 10% of fields entirely.

When to Use Which

Go Rule-Based If:

- Your document templates are stable and few in number.
- You need determinism and per-page latency of a second or two.
- You must run on cheap, CPU-only infrastructure.

Go LLM-Based If:

- Layouts vary widely across vendors or change frequently.
- You want low per-template maintenance and can tolerate slower processing.
- Occasional non-determinism and a small hallucination risk are acceptable, for example with validation or human review downstream.

Conclusion

Building the same B2B document extractor twice—once with rules and once with an LLM—reveals that neither approach is universally superior. Rule-based systems remain unbeatable for speed, cost, and determinism when document templates are stable. LLMs, on the other hand, offer flexibility and higher accuracy on varied, noisy inputs, at the cost of speed and hardware requirements. A hybrid strategy could be the best of both worlds: run a fast rule-based extractor first, then fall back to an LLM when confidence is low. In our B2B scenario, the LLM approach proved more robust overall, but the right choice depends on your specific constraints.
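A sketch of what that fallback could look like is below; the required-field set, the simple presence check as a confidence proxy, and the extract_with_rules helper are all illustrative assumptions rather than part of either pipeline described above:

```python
# Illustrative hybrid router: try the cheap deterministic pass first,
# escalate to the LLM only when required fields come back empty.
# REQUIRED_FIELDS and extract_with_rules are hypothetical stand-ins.
REQUIRED_FIELDS = {"order_number", "customer_name", "line_items", "total_amount"}

def extract_order(page_text: str) -> dict:
    result = extract_with_rules(page_text)         # ~1.5 s, deterministic
    missing = REQUIRED_FIELDS - {k for k, v in result.items() if v}
    if missing:                                    # low confidence: fall back
        result = extract_with_llm(page_text)       # slower but layout-robust
    return result
```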

If you're considering automating document extraction, build a small prototype with both methods on your own documents. The gap between them is narrowing—and in many cases, LLMs are already the pragmatic winner for modern data pipelines.
