Zero-LLM structured extraction pipeline — post-processes Firecrawl output into clean, structured dicts for Sibyl ingestion
- Python 100%
Pipeline modules: - metadata.py: OG/JSON-LD/meta tag extraction from HTML - fields.py: Regex-based price, date, and spec extraction - quality.py: Content quality scoring (word count, link density, etc.) - dedup.py: Pure-Python SimHash deduplication - pipeline.py: Pipeline orchestrator chaining all modules 66 tests, all passing. No external LLM dependency. |
||
|---|---|---|
| extractor | ||
| tests | ||
| .gitignore | ||
| pyproject.toml | ||
| README.md | ||
sibyl-extractor
Zero-LLM structured extraction pipeline — post-processes Firecrawl output into clean, structured dicts for Sibyl ingestion.
Pipeline
Firecrawl HTML → Metadata parser (OG/JSON-LD/meta)
→ Field extractors (regex prices, dates, specs)
→ Quality gate (word count, link density, etc.)
→ SimHash dedup
→ Structured dict → Sibyl
No LLM involved. All extraction is heuristic/regex-based — runs in <50ms per page on CPU.
Installation
pip install -e ".[dev]"
Usage
from extractor import ExtractionPipeline
pipeline = ExtractionPipeline()
result = pipeline.process(
url="https://example.com/product",
html="<html>...",
markdown="# Product..."
)
# result is an ExtractedPage with structured fields
print(result.title)
print(result.prices)
print(result.specs)
print(result.quality.passes_quality_gate)
Quality Gate Thresholds
| Metric | Threshold | Skipped if... |
|---|---|---|
| Word count | < 100 | Thin content |
| Link density | > 0.4 | Link farm / nav-only |
| Headings | 0 and wc < 300 | Poorly structured |
| Sentence var | < 0.5 | Computer-generated noise |
Tests
pytest -v
Project Structure
extractor/
├── __init__.py # Public API
├── types.py # Data types (ExtractedPage, QualityScore, etc.)
├── metadata.py # HTML/OG/JSON-LD metadata extraction
├── fields.py # Regex-based field extraction (prices, dates, specs)
├── quality.py # Content quality scoring engine
├── dedup.py # SimHash-based deduplication
└── pipeline.py # Pipeline orchestrator
tests/
├── test_metadata.py
├── test_fields.py
├── test_quality.py
└── test_dedup.py