Polite web → Sibyl pipeline: Firecrawl → sibyl-extractor → quality gate → Sibyl DB


Glow Uncut ea10caee37 Initial commit: polite web → Sibyl pipeline (2026-06-14 18:19:19 +00:00)

Modules:
- pipeline.py: MiningPipeline orchestrator (Firecrawl → extractor → quality → store)
- politeness.py: Domain-level rate limiting with exponential backoff
- sibyl_store.py: Stores structured ExtractedPage data into Sibyl DB
- cli.py: CLI entry point with argparse

Firecrawl handles robots.txt/caching/retry natively. This adds domain-level throttling on top (configurable requests/sec per domain).

Tests: 10/10 passing
tests	Initial commit: polite web → Sibyl pipeline	2026-06-14 18:19:19 +00:00
web_miner	Initial commit: polite web → Sibyl pipeline	2026-06-14 18:19:19 +00:00
.gitignore	Initial commit: polite web → Sibyl pipeline	2026-06-14 18:19:19 +00:00
pyproject.toml	Initial commit: polite web → Sibyl pipeline	2026-06-14 18:19:19 +00:00
README.md	Initial commit: polite web → Sibyl pipeline	2026-06-14 18:19:19 +00:00

README.md

sibyl-web-miner

Polite web → Sibyl pipeline. Firecrawl → sibyl-extractor → quality gate → Sibyl DB.

Pipeline

URL list → DomainThrottle → Firecrawl scrape
                          → sibyl-extractor (metadata, fields, quality)
                          → Quality gate
                          → Sibyl DB (reference_documents table)
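The flow above can be sketched as a single loop. This is a minimal illustration, not the code in pipeline.py: `PageSketch`, `run_pipeline`, the `scrape` and `store` callables, and the word-count gate are hypothetical stand-ins for the throttle plus Firecrawl, sibyl-extractor, and the Sibyl DB writer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class PageSketch:
    """Hypothetical stand-in for the extractor's structured output."""
    url: str
    text: str

    @property
    def word_count(self) -> int:
        return len(self.text.split())


def run_pipeline(
    urls: Iterable[str],
    scrape: Callable[[str], Optional[str]],
    store: Callable[[PageSketch], None],
    min_words: int = 200,  # assumed default, taken from the usage example
) -> dict:
    """Scrape each URL, apply a word-count quality gate, store what passes."""
    stats = {"scraped": 0, "stored": 0, "rejected": 0, "failed": 0}
    for url in urls:
        text = scrape(url)  # stands in for DomainThrottle + Firecrawl
        if text is None:
            stats["failed"] += 1
            continue
        stats["scraped"] += 1
        page = PageSketch(url=url, text=text)  # stands in for sibyl-extractor
        if page.word_count < min_words:  # quality gate
            stats["rejected"] += 1
            continue
        store(page)  # stands in for the reference_documents insert
        stats["stored"] += 1
    return stats
```

Passing the scraper and the store in as callables keeps the sketch runnable without a Firecrawl instance or a database; the real quality gate may use more signals than word count.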

Politeness

Domain-level rate limiting: configurable requests/sec per domain
Exponential backoff: failed domains get progressively longer cooldowns (2x multiplier, 5-minute cap)
Handled by Firecrawl: robots.txt, caching, retries
Configurable page delay: minimum gap between pages
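The rate limiting and backoff rules above could look roughly like this. It is a sketch under stated assumptions, not the contents of politeness.py: the class name, method names, and the 1-second base cooldown are hypothetical. Only the 2x multiplier and the 5-minute cap come from the list above.

```python
import time
from urllib.parse import urlparse


class DomainThrottleSketch:
    """Per-domain rate limit with exponential backoff on failure (illustrative)."""

    def __init__(self, rate=2.0, base_cooldown=1.0, multiplier=2.0,
                 max_cooldown=300.0, clock=time.monotonic):
        self.min_interval = 1.0 / rate      # seconds between requests per domain
        self.base_cooldown = base_cooldown  # assumed: first failure waits 1 s
        self.multiplier = multiplier        # 2x per consecutive failure
        self.max_cooldown = max_cooldown    # 5-minute cap
        self.clock = clock                  # injectable for testing
        self._next_allowed = {}             # domain -> earliest next request time
        self._failures = {}                 # domain -> consecutive failure count

    @staticmethod
    def domain_of(url):
        return urlparse(url).netloc.lower()

    def wait_time(self, url):
        """Seconds the caller should sleep before requesting this URL."""
        domain = self.domain_of(url)
        return max(0.0, self._next_allowed.get(domain, 0.0) - self.clock())

    def record_success(self, url):
        """Reset the failure count and schedule the next allowed request."""
        domain = self.domain_of(url)
        self._failures[domain] = 0
        self._next_allowed[domain] = self.clock() + self.min_interval

    def record_failure(self, url):
        """Lengthen the domain's cooldown and return it in seconds."""
        domain = self.domain_of(url)
        n = self._failures.get(domain, 0) + 1
        self._failures[domain] = n
        cooldown = min(self.base_cooldown * self.multiplier ** (n - 1),
                       self.max_cooldown)
        self._next_allowed[domain] = self.clock() + cooldown
        return cooldown
```

A caller would sleep for `wait_time(url)` before each scrape, then report the outcome with `record_success` or `record_failure`. State is keyed by domain, so a slow or failing site does not hold up requests to other domains.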

Installation

pip install -e /opt/sibyl-extractor   # Dependency: the structured extractor
pip install -e ".[dev]"

Usage

# Single URL
python -m web_miner.cli https://example.com/page

# Multiple URLs
python -m web_miner.cli https://site1.com https://site2.com

# URL file
python -m web_miner.cli --file urls.txt

# With options:
#   --rate 2.0       2 requests/sec per domain
#   --delay 1.0      1 second between pages
#   --min-words 200  skip pages with <200 words
#   --max 50         stop after 50 pages
python -m web_miner.cli \
  --file urls.txt \
  --rate 2.0 \
  --delay 1.0 \
  --min-words 200 \
  --max 50

# JSON output (for cron)
python -m web_miner.cli --file urls.txt --json
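For orientation, the flags used above map naturally onto an argparse parser. The following is a sketch, not the actual cli.py: the defaults, help strings, `dest` names, and the one-URL-per-line file format with `#` comments are all assumptions. Only the flag names come from this README.

```python
import argparse


def build_parser():
    """Parser covering the flags shown in this README (defaults are assumed)."""
    p = argparse.ArgumentParser(prog="web_miner.cli")
    p.add_argument("urls", nargs="*", help="URLs to mine")
    p.add_argument("--file", help="file of URLs (assumed: one per line)")
    p.add_argument("--rate", type=float, default=1.0,
                   help="requests/sec per domain")
    p.add_argument("--delay", type=float, default=0.0,
                   help="minimum seconds between pages")
    p.add_argument("--min-words", type=int, default=0,
                   help="skip pages with fewer words")
    p.add_argument("--max", type=int, default=None, dest="max_pages",
                   help="stop after this many pages")
    p.add_argument("--json", action="store_true",
                   help="emit machine-readable output")
    p.add_argument("--firecrawl-url", default="http://localhost:3002")
    return p


def load_urls(lines):
    """Strip whitespace; skip blank lines and '#' comments (assumed format)."""
    cleaned = (line.strip() for line in lines)
    return [line for line in cleaned if line and not line.startswith("#")]
```

With this shape, `build_parser().parse_args(["--file", "urls.txt", "--max", "50"])` yields `args.file == "urls.txt"` and `args.max_pages == 50`.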

Firecrawl

The pipeline calls a self-hosted Firecrawl API, at http://localhost:3002 by default. Override the address with --firecrawl-url on the command line, or set it through MiningConfig.
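As a rough illustration of what a call to the self-hosted instance might look like, the sketch below builds (but does not send) a scrape request. `MiningConfigSketch` and its fields are hypothetical and do not reflect the real MiningConfig. The `/v1/scrape` path and the JSON body follow Firecrawl's v1 REST API as I understand it, so check them against the version you deploy.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class MiningConfigSketch:
    """Hypothetical config; field names are assumptions."""
    firecrawl_url: str = "http://localhost:3002"  # default from this README
    timeout: float = 30.0                         # assumed


def build_scrape_request(config, url):
    """Build a POST request asking Firecrawl for the page as markdown."""
    body = json.dumps({"url": url, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        config.firecrawl_url.rstrip("/") + "/v1/scrape",  # assumed endpoint
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be `urllib.request.urlopen(req, timeout=config.timeout)`. Building the request separately from sending it lets the address override be checked without a running Firecrawl instance.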

Cron

Set up a cron job:

# Run daily at 6am
hermes cron create \
  --name "web-miner" \
  --schedule "0 6 * * *" \
  --prompt "Run /opt/sibyl-web-miner/web_miner/cli.py --file /opt/sibyl-web-miner/urls.txt --max 100 --json"

Tests

pytest -v