glow-all/sibyl-extractor

Zero-LLM structured extraction pipeline — post-processes Firecrawl output into clean, structured dicts for Sibyl ingestion

Python 100%

Glow Uncut 1d64727738 Initial commit: zero-LLM structured extraction pipeline	2026-06-14 17:58:56 +00:00

Pipeline modules:
- metadata.py: OG/JSON-LD/meta tag extraction from HTML
- fields.py: Regex-based price, date, and spec extraction
- quality.py: Content quality scoring (word count, link density, etc.)
- dedup.py: Pure-Python SimHash deduplication
- pipeline.py: Pipeline orchestrator chaining all modules

66 tests, all passing. No external LLM dependency.

extractor	Initial commit: zero-LLM structured extraction pipeline	2026-06-14 17:58:56 +00:00
tests	Initial commit: zero-LLM structured extraction pipeline	2026-06-14 17:58:56 +00:00
.gitignore	Initial commit: zero-LLM structured extraction pipeline	2026-06-14 17:58:56 +00:00
pyproject.toml	Initial commit: zero-LLM structured extraction pipeline	2026-06-14 17:58:56 +00:00
README.md	Initial commit: zero-LLM structured extraction pipeline	2026-06-14 17:58:56 +00:00

README.md

sibyl-extractor

Zero-LLM structured extraction pipeline — post-processes Firecrawl output into clean, structured dicts for Sibyl ingestion.

Pipeline

Firecrawl HTML → Metadata parser (OG/JSON-LD/meta)
               → Field extractors (regex prices, dates, specs)
               → Quality gate (word count, link density, etc.)
               → SimHash dedup
               → Structured dict → Sibyl
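
To make the first stage of the diagram concrete, here is a minimal sketch of OG/JSON-LD/meta extraction using only the standard library. The class `MetadataParser`, the function `extract_metadata`, and the shape of the returned dict are assumptions for illustration and are not taken from `metadata.py`.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Hypothetical parser: collects Open Graph tags, named meta tags, and JSON-LD."""

    def __init__(self):
        super().__init__()
        self.og = {}
        self.meta = {}
        self.json_ld = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            content = attr.get("content")
            if content is None:
                return
            prop = attr.get("property") or ""
            if prop.startswith("og:"):
                # Store og:title as "title", og:image as "image", and so on.
                self.og[prop[3:]] = content
            elif attr.get("name"):
                self.meta[attr["name"]] = content
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed JSON-LD is skipped rather than failing the page.


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"og": parser.og, "meta": parser.meta, "json_ld": parser.json_ld}
```

A tolerant parser is a reasonable choice here because crawled HTML is often malformed; a bad JSON-LD block is dropped while the rest of the page's metadata is still returned.
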

No LLM is involved. All extraction is heuristic or regex-based and runs in under 50 ms per page on CPU.
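
As an illustration of the regex-based approach, the following is a minimal sketch of price and ISO date extraction. The patterns and the names `extract_prices` and `extract_iso_dates` are assumptions made for this example; the actual patterns in `fields.py` may differ.

```python
import re

# Hypothetical patterns, not copied from fields.py.
# Matches a currency symbol followed by an amount such as 1,299.99 or 1499.
PRICE_RE = re.compile(
    r"(?P<currency>[$€£])\s?(?P<amount>\d+(?:,\d{3})*(?:\.\d{1,2})?)"
)
# Matches YYYY-MM-DD shaped strings; it does not validate month or day ranges.
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_prices(text: str) -> list[dict]:
    """Return every currency/amount pair found in the text."""
    prices = []
    for match in PRICE_RE.finditer(text):
        amount = float(match.group("amount").replace(",", ""))
        prices.append({"currency": match.group("currency"), "amount": amount})
    return prices


def extract_iso_dates(text: str) -> list[str]:
    """Return every YYYY-MM-DD shaped string found in the text."""
    return ISO_DATE_RE.findall(text)
```

Compiling the patterns once at import time keeps per-page cost low, which matters when the whole pipeline is expected to finish in milliseconds.
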

Installation

pip install -e ".[dev]"

Usage

from extractor import ExtractionPipeline

pipeline = ExtractionPipeline()
result = pipeline.process(
    url="https://example.com/product",
    html="<html>...",
    markdown="# Product..."
)

# result is an ExtractedPage with structured fields
print(result.title)
print(result.prices)
print(result.specs)
print(result.quality.passes_quality_gate)

Quality Gate Thresholds

Metric	Skip condition	Indicates
Word count	< 100	Thin content
Link density	> 0.4	Link farm / nav-only
Headings	0 and word count < 300	Poorly structured
Sentence var	< 0.5	Computer-generated noise
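
The thresholds in the table above can be read as a simple rule set. The sketch below assumes the four metrics have already been computed; `PageMetrics`, `gate_failures`, and the field names are hypothetical and are not the types defined in `types.py` or the logic in `quality.py`.

```python
from dataclasses import dataclass


@dataclass
class PageMetrics:
    """Hypothetical container for precomputed metrics (not the real QualityScore)."""
    word_count: int
    link_density: float
    heading_count: int
    sentence_var: float


def gate_failures(m: PageMetrics) -> list[str]:
    """Return the reasons a page would be skipped; an empty list means it passes."""
    failures = []
    if m.word_count < 100:
        failures.append("thin content")
    if m.link_density > 0.4:
        failures.append("link farm / nav-only")
    if m.heading_count == 0 and m.word_count < 300:
        failures.append("poorly structured")
    if m.sentence_var < 0.5:
        failures.append("computer-generated noise")
    return failures


def passes_quality_gate(m: PageMetrics) -> bool:
    return not gate_failures(m)
```

Returning the list of reasons instead of a bare boolean makes it easier to see why a page was skipped when tuning thresholds.
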

Tests

pytest -v

Project Structure

extractor/
├── __init__.py    # Public API
├── types.py       # Data types (ExtractedPage, QualityScore, etc.)
├── metadata.py    # HTML/OG/JSON-LD metadata extraction
├── fields.py      # Regex-based field extraction (prices, dates, specs)
├── quality.py     # Content quality scoring engine
├── dedup.py       # SimHash-based deduplication
└── pipeline.py    # Pipeline orchestrator
tests/
├── test_metadata.py
├── test_fields.py
├── test_quality.py
└── test_dedup.py
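
For readers unfamiliar with the technique behind `dedup.py`, here is a minimal pure-Python SimHash sketch. It is a generic illustration under stated assumptions (word tokens, equal weights, a 64-bit BLAKE2b token hash, and a hypothetical `max_distance` of 3), not the repository's implementation.

```python
import hashlib
import re

BITS = 64  # Fingerprint width; assumed for this sketch.


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over lowercase word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two texts as duplicates if their fingerprints differ in few bits."""
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

Because each token contributes votes independently, small edits to a page change only a few bits of the fingerprint, so near-duplicates can be found by comparing Hamming distances rather than full texts.
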