Zero-LLM structured extraction pipeline — post-processes Firecrawl output into clean, structured dicts for Sibyl ingestion
Glow Uncut 1d64727738 Initial commit: zero-LLM structured extraction pipeline
Pipeline modules:
- metadata.py: OG/JSON-LD/meta tag extraction from HTML
- fields.py: Regex-based price, date, and spec extraction
- quality.py: Content quality scoring (word count, link density, etc.)
- dedup.py: Pure-Python SimHash deduplication
- pipeline.py: Pipeline orchestrator chaining all modules

66 tests, all passing. No external LLM dependency.
2026-06-14 17:58:56 +00:00

sibyl-extractor

Zero-LLM structured extraction pipeline — post-processes Firecrawl output into clean, structured dicts for Sibyl ingestion.

Pipeline

Firecrawl HTML → Metadata parser (OG/JSON-LD/meta)
               → Field extractors (regex prices, dates, specs)
               → Quality gate (word count, link density, etc.)
               → SimHash dedup
               → Structured dict → Sibyl
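As a rough illustration of the first stage, the sketch below collects Open Graph tags with the standard-library `html.parser`. The names `OpenGraphParser` and `extract_og` are hypothetical and are not taken from `metadata.py`, which may work differently and also handles JSON-LD and plain meta tags.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:*" content="..."> tags into a dict.

    Hypothetical helper for illustration; not the project's actual parser.
    """

    def __init__(self) -> None:
        super().__init__()
        self.og: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each key, without the "og:" prefix.
            self.og.setdefault(prop[3:], content)


def extract_og(html: str) -> dict[str, str]:
    """Return the Open Graph properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og
```

Self-closing `<meta ... />` tags are handled as well, because `HTMLParser` routes them through `handle_starttag` by default.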

No LLM is involved. All extraction is heuristic and regex-based, and runs in under 50 ms per page on CPU.
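To give a sense of what regex-based field extraction can look like, here is a minimal sketch for prices. The pattern, the supported currency symbols, and the `extract_prices` name are assumptions made for this example; the real patterns in `fields.py` are not shown here and are likely broader.

```python
import re

# Hypothetical pattern: a currency symbol, then an amount that is either
# comma-grouped (1,299) or plain digits (1500), with optional two-digit cents.
PRICE_RE = re.compile(r"([$€£])\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


def extract_prices(text: str) -> list[dict]:
    """Return every price found in text as {"currency", "amount"} dicts."""
    prices = []
    for match in PRICE_RE.finditer(text):
        symbol, whole, cents = match.groups()
        # Drop thousands separators and re-attach the cents, if any.
        amount = float(whole.replace(",", "") + (cents or ""))
        prices.append({"currency": symbol, "amount": amount})
    return prices
```

For instance, `extract_prices("Now $1,299.99, was €1500")` yields one entry for `$` with amount `1299.99` and one for `€` with amount `1500.0`.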

Installation

pip install -e ".[dev]"

Usage

from extractor import ExtractionPipeline

pipeline = ExtractionPipeline()
result = pipeline.process(
    url="https://example.com/product",
    html="<html>...",
    markdown="# Product..."
)

# result is an ExtractedPage with structured fields
print(result.title)
print(result.prices)
print(result.specs)
print(result.quality.passes_quality_gate)

Quality Gate Thresholds

Metric         Skip when                 Reason
Word count     < 100                     Thin content
Link density   > 0.4                     Link farm / nav-only
Headings       0 and word count < 300    Poorly structured
Sentence var   < 0.5                     Computer-generated noise
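The thresholds above can be read as a set of independent skip rules. The sketch below encodes them directly, assuming the four metrics have already been computed. `PageStats`, `skip_reasons`, and `passes_gate` are hypothetical names for this example and do not describe the actual `quality.py` API.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Precomputed metrics for one page (hypothetical container)."""
    word_count: int
    link_density: float
    heading_count: int
    sentence_var: float


def skip_reasons(stats: PageStats) -> list[str]:
    """Return the reason for every threshold the page trips."""
    reasons = []
    if stats.word_count < 100:
        reasons.append("thin content")
    if stats.link_density > 0.4:
        reasons.append("link farm / nav-only")
    if stats.heading_count == 0 and stats.word_count < 300:
        reasons.append("poorly structured")
    if stats.sentence_var < 0.5:
        reasons.append("computer-generated noise")
    return reasons


def passes_gate(stats: PageStats) -> bool:
    """A page passes only if no skip rule fires."""
    return not skip_reasons(stats)
```

Returning the list of reasons rather than a bare boolean makes it easier to log why a page was skipped.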

Tests

pytest -v

Project Structure

extractor/
├── __init__.py    # Public API
├── types.py       # Data types (ExtractedPage, QualityScore, etc.)
├── metadata.py    # HTML/OG/JSON-LD metadata extraction
├── fields.py      # Regex-based field extraction (prices, dates, specs)
├── quality.py     # Content quality scoring engine
├── dedup.py       # SimHash-based deduplication
└── pipeline.py    # Pipeline orchestrator
tests/
├── test_metadata.py
├── test_fields.py
├── test_quality.py
└── test_dedup.py
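For readers unfamiliar with SimHash, the following is a minimal pure-Python sketch of the general technique that `dedup.py` is described as using. The whitespace tokenisation, the 64-bit width, the BLAKE2b token hash, and the function names are assumptions for this example; the project's implementation may differ on all of them.

```python
import hashlib

BITS = 64  # fingerprint width assumed for this sketch


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would typically be treated as near-duplicates when the Hamming distance between their fingerprints falls below a small cutoff; the cutoff itself is a tuning choice and is not specified here.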