Rule-Based vs. LLM Document Extraction: A Hands-On Comparison for B2B Orders
Introduction
Extracting structured data from business documents (purchase orders, invoices, delivery receipts) is a common yet challenging task in B2B workflows. Traditional rule-based systems have long been the default choice, but large language models (LLMs) now offer a more flexible alternative. This article presents a practical comparison between a rule-based PDF extractor built with pytesseract and an LLM-based solution powered by Ollama and LLaMA 3. Both were applied to the same realistic B2B order scenario to evaluate their strengths and weaknesses.

The B2B Order Scenario
The test dataset consisted of scanned PDF purchase orders containing fields such as order number, vendor name, line items (quantities, part numbers, descriptions), pricing, and totals. The documents varied slightly in layout and included occasional handwritten marks, simulating real-world inconsistency. The goal was to extract all relevant fields accurately and quickly, without manual intervention.
Rule-Based Extraction with Pytesseract
Implementation
For the rule-based approach, I used pytesseract, a Python wrapper for Google's Tesseract OCR engine. The workflow was:
- Preprocess the PDF pages (convert to grayscale, apply thresholding, and deskew).
- Run OCR to extract raw text and bounding boxes.
- Apply handcrafted regular expressions and layout heuristics to locate and parse fields (e.g., "Order Number:" followed by alphanumeric characters).
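The rule layer can be sketched as a small set of label-anchored regexes applied to the OCR text. This is a minimal illustration, not the production rule set: the sample text, field names, and patterns below are hypothetical, and the pytesseract call is shown only in a comment because it needs the Tesseract binary installed.

```python
import re

# In the real pipeline the text comes from OCR, e.g.:
#   import pytesseract
#   ocr_text = pytesseract.image_to_string(preprocessed_image)
# Here we use a canned sample resembling that output (hypothetical layout).
ocr_text = """
PURCHASE ORDER
Order Number: PO-48213
Vendor: Acme Industrial Supply
Total: $1,245.50
"""

# Handcrafted rules: one regex per field, anchored on its printed label.
FIELD_PATTERNS = {
    "order_number": re.compile(r"Order Number:\s*([A-Z0-9-]+)"),
    "vendor": re.compile(r"Vendor:\s*(.+)"),
    "total": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
}

def extract_fields(text: str) -> dict:
    """Apply each field regex to the OCR text; None when a rule misses."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        results[name] = match.group(1).strip() if match else None
    return results

fields = extract_fields(ocr_text)
```

The brittleness discussed below is visible even here: if a scan renders the label as "Order No." instead of "Order Number:", the first rule silently returns None.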
Strengths
- Deterministic output: Once rules were finely tuned, extraction was predictable and repeatable.
- Low resource usage: Execution was fast and could run on a basic CPU.
- Transparency: Every decision could be traced to a specific rule.
Weaknesses
- Fragility: Small layout changes (different font, margin shift, or handwritten corrections) broke many rules.
- Maintenance overhead: Each new document type required custom rules and extensive testing.
- Limited semantic understanding: The system could not interpret ambiguous or missing fields.
LLM-Based Extraction with Ollama and LLaMA 3
Implementation
For the LLM approach, I used Ollama to serve the locally hosted LLaMA 3 model (8B parameters). The pipeline was:
- Convert PDF pages to images (as before).
- Run OCR on each page image (LLaMA 3 is text-only) and pass the extracted text to the LLM along with a structured prompt specifying which fields to extract (e.g., "Extract the order number, vendor, line items, and total from this purchase order.").
- Parse the JSON object the model returns into the extracted fields.
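The prompt-and-parse steps above can be sketched as follows. The field list and canned model reply are hypothetical stand-ins; the actual Ollama HTTP call is shown only in a comment so the sketch stays self-contained (it would require a running Ollama server).

```python
import json

# Hypothetical field list; the real prompt was iterated on by hand.
FIELDS = ["order_number", "vendor", "line_items", "total"]

def build_prompt(fields):
    """Ask the model for a single JSON object covering the given fields."""
    return (
        "Extract the following fields from this purchase order and "
        "reply with a single JSON object, no extra text: "
        + ", ".join(fields)
    )

# In the real pipeline the prompt goes to the local Ollama server, e.g.:
#   import requests
#   resp = requests.post("http://localhost:11434/api/generate",
#                        json={"model": "llama3",
#                              "prompt": build_prompt(FIELDS) + "\n" + ocr_text,
#                              "stream": False})
#   raw = resp.json()["response"]
# Here a canned model reply stands in for the server round-trip.
raw = ('{"order_number": "PO-48213", "vendor": "Acme Industrial Supply", '
       '"line_items": [], "total": "1245.50"}')

def parse_response(raw, fields):
    """Parse the model's JSON and keep only the requested fields."""
    data = json.loads(raw)
    return {f: data.get(f) for f in fields}

order = parse_response(raw, FIELDS)
```

Restricting the output to the requested keys is a cheap guard against the model volunteering extra (possibly hallucinated) fields; it does not protect against wrong values, which is addressed below.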
Strengths
- Adaptability: The LLM handled layout variations without explicit rules, even with handwriting or minor occlusions.
- Zero-shot capability: No training or rule-tuning needed for new document formats.
- Contextual understanding: It could infer missing information (e.g., summing line items when no explicit total was present).
Weaknesses
- Higher latency: Inference took 5–15 seconds per page on a consumer GPU (NVIDIA RTX 3060).
- Resource demands: Required a GPU with at least 8GB VRAM, making it less accessible for low-budget setups.
- Hallucinations: Occasionally the model invented plausible-looking but incorrect values (e.g., wrong vendor name).
Head-to-Head Comparison
I evaluated both systems on 50 documents drawn from the same B2B order scenario. Key metrics were:

| Metric | Rule-Based (pytesseract) | LLM (Ollama + LLaMA 3) |
|---|---|---|
| Accuracy (field-level F1) | 0.85 | 0.93 |
| Average processing time per page | 0.4 seconds | 9.2 seconds |
| Set-up effort | 3 days of rule tweaking | 30 minutes of prompt engineering |
| Robustness to layout change | Low (broke on 20% of docs) | High (handled all variations) |
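For readers who want to reproduce the accuracy row, field-level F1 can be computed per document and micro-averaged. The exact-match convention below is one common choice and an assumption on my part; fuzzier string matching is also used in practice, and the sample gold/predicted dicts are illustrative only.

```python
def field_f1(gold: dict, pred: dict):
    """Field-level precision, recall, and F1 for one document.

    A predicted field counts as a true positive only if its value
    exactly matches the gold value (assumed matching convention).
    """
    tp = sum(1 for k, v in pred.items() if v is not None and gold.get(k) == v)
    fp = sum(1 for k, v in pred.items() if v is not None and gold.get(k) != v)
    fn = sum(1 for k, v in gold.items() if pred.get(k) != v)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative document: one field missed by the extractor.
gold = {"order_number": "PO-48213", "vendor": "Acme", "total": "1245.50"}
pred = {"order_number": "PO-48213", "vendor": "Acme", "total": None}
p, r, f = field_f1(gold, pred)
```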
When to Use Each Approach
- Rule-based is ideal for high-volume, stable document formats where speed and low cost are critical and layout is predictable.
- LLM-based shines in heterogeneous environments, fast prototyping, or when documents contain unstructured or semi-structured data.
Conclusion
Building the same B2B document extractor twice revealed clear trade-offs. The rule-based system with pytesseract offered speed and determinism but required constant maintenance. The LLM approach with Ollama and LLaMA 3 provided superior flexibility and accuracy at the cost of latency and hardware requirements. For many real-world B2B scenarios, a hybrid solution may be best: use rules for simple, well-known fields and an LLM as a fallback or for complex extraction tasks.
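The hybrid idea can be orchestrated with a few lines of routing logic: run the cheap rules first, then invoke the LLM only for fields the rules failed to fill. This is a hypothetical sketch; the two lambda stand-ins below replace the real extractors described earlier.

```python
def hybrid_extract(text, rule_extractor, llm_extractor, required_fields):
    """Rules first; fall back to the LLM only for fields the rules missed."""
    fields = rule_extractor(text)
    missing = [f for f in required_fields if not fields.get(f)]
    if missing:
        # Slow path: pay the LLM's latency only when actually needed.
        llm_fields = llm_extractor(text, missing)
        for f in missing:
            fields[f] = llm_fields.get(f)
    return fields

# Toy stand-ins for the two extractors described in this article:
rules = lambda text: {"order_number": "PO-48213", "total": None}
llm = lambda text, missing: {"total": "1245.50"}

result = hybrid_extract("raw OCR text here", rules, llm,
                        ["order_number", "total"])
```

With numbers like those in the table above, this kind of routing keeps the 0.4-second fast path for stable documents while reserving the multi-second LLM call for the minority of pages the rules cannot handle.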
This article is based on practical experiments and was first published on Towards Data Science.