Polite web → Sibyl pipeline: Firecrawl → sibyl-extractor → quality gate → Sibyl DB
Glow Uncut ea10caee37 Initial commit: polite web → Sibyl pipeline
Modules:
- pipeline.py: MiningPipeline orchestrator (Firecrawl → extractor → quality → store)
- politeness.py: Domain-level rate limiting with exponential backoff
- sibyl_store.py: Stores structured ExtractedPage data into Sibyl DB
- cli.py: CLI entry point with argparse

Firecrawl handles robots.txt/caching/retry natively.
This adds domain-level throttling on top (configurable requests/sec per domain).

Tests: 10/10 passing
2026-06-14 18:19:19 +00:00
Files (all from the initial commit):
tests
web_miner
.gitignore
pyproject.toml
README.md

sibyl-web-miner

Polite web → Sibyl pipeline. Firecrawl → sibyl-extractor → quality gate → Sibyl DB.

Pipeline

URL list → DomainThrottle → Firecrawl scrape
                          → sibyl-extractor (metadata, fields, quality)
                          → Quality gate
                          → Sibyl DB (reference_documents table)
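
As a rough illustration of the flow above, the sketch below wires a scrape step, a word-count quality gate, and a store step together. It is not the actual MiningPipeline code: `Page`, `passes_quality_gate`, `run_pipeline`, and the stats keys are hypothetical names, the scrape and store steps are passed in as plain callables standing in for Firecrawl and the Sibyl DB, throttling is left out, and the gate is assumed to check only a minimum word count.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Page:
    """Hypothetical stand-in for the extractor's structured output."""
    url: str
    text: str

    @property
    def word_count(self) -> int:
        return len(self.text.split())


def passes_quality_gate(page: Page, min_words: int = 200) -> bool:
    # Assumed rule: a minimum word count only, mirroring the --min-words flag.
    return page.word_count >= min_words


def run_pipeline(
    urls: Iterable[str],
    scrape: Callable[[str], Optional[str]],  # stands in for the Firecrawl call
    store: Callable[[Page], None],           # stands in for the Sibyl DB write
    min_words: int = 200,
    max_pages: Optional[int] = None,
) -> dict:
    """Scrape each URL, gate it on quality, and store the pages that pass."""
    stats = {"scraped": 0, "stored": 0, "skipped": 0, "failed": 0}
    for url in urls:
        # Assumption: the page limit counts successfully scraped pages.
        if max_pages is not None and stats["scraped"] >= max_pages:
            break
        raw = scrape(url)
        if raw is None:  # assumed convention: None signals a failed scrape
            stats["failed"] += 1
            continue
        stats["scraped"] += 1
        page = Page(url=url, text=raw)
        if passes_quality_gate(page, min_words):
            store(page)
            stats["stored"] += 1
        else:
            stats["skipped"] += 1
    return stats
```

Passing the scrape and store steps as callables keeps the sketch testable without a running Firecrawl instance or database.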

Politeness

  • Domain-level rate limiting: configurable requests/sec per domain
  • Exponential backoff: failed domains get progressively longer cooldowns (2x multiplier, 5-minute cap)
  • Firecrawl handles robots.txt, caching, and retries
  • Configurable page delay: minimum gap between consecutive pages
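
The per-domain rate limit and backoff rules above can be sketched roughly as follows. This is an illustration, not the code in politeness.py: the class name `DomainThrottleSketch`, its method names, and the 1-second `base_cooldown` are assumptions. Only the 2x multiplier and the 5-minute (300 s) cap come from the list above.

```python
import time
from urllib.parse import urlparse


class DomainThrottleSketch:
    """Hypothetical per-domain throttle with exponential backoff."""

    def __init__(self, rate=1.0, base_cooldown=1.0, multiplier=2.0,
                 max_cooldown=300.0, clock=time.monotonic):
        self.min_interval = 1.0 / rate      # seconds between requests per domain
        self.base_cooldown = base_cooldown  # assumed first-failure cooldown
        self.multiplier = multiplier        # 2x per consecutive failure
        self.max_cooldown = max_cooldown    # 5-minute cap
        self.clock = clock                  # injectable so tests can fake time
        self._next_allowed = {}             # domain -> earliest allowed time
        self._failures = {}                 # domain -> consecutive failure count

    @staticmethod
    def _domain(url):
        return urlparse(url).netloc

    def wait_time(self, url):
        """Seconds to wait before the next request to this URL's domain."""
        allowed_at = self._next_allowed.get(self._domain(url), 0.0)
        return max(0.0, allowed_at - self.clock())

    def record_success(self, url):
        """Reset the failure count and apply the normal rate-limit gap."""
        domain = self._domain(url)
        self._failures[domain] = 0
        self._next_allowed[domain] = self.clock() + self.min_interval

    def record_failure(self, url):
        """Grow the cooldown geometrically, capped at max_cooldown."""
        domain = self._domain(url)
        failures = self._failures.get(domain, 0) + 1
        self._failures[domain] = failures
        cooldown = min(self.base_cooldown * self.multiplier ** (failures - 1),
                       self.max_cooldown)
        self._next_allowed[domain] = self.clock() + cooldown
        return cooldown
```

A caller would sleep for `wait_time(url)` before each scrape, then call `record_success` or `record_failure` depending on the outcome. State is keyed by domain, so a failing site does not slow down requests to other sites.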

Installation

pip install -e /opt/sibyl-extractor   # Dependency: the structured extractor
pip install -e ".[dev]"

Usage

# Single URL
python -m web_miner.cli https://example.com/page

# Multiple URLs
python -m web_miner.cli https://site1.com https://site2.com

# URL file
python -m web_miner.cli --file urls.txt

# With options:
#   --rate 2.0       2 requests/sec per domain
#   --delay 1.0      1 second between pages
#   --min-words 200  skip pages with <200 words
#   --max 50         stop after 50 pages
python -m web_miner.cli \
  --file urls.txt \
  --rate 2.0 \
  --delay 1.0 \
  --min-words 200 \
  --max 50

# JSON output (for cron)
python -m web_miner.cli --file urls.txt --json

Firecrawl

The pipeline calls a self-hosted Firecrawl API at http://localhost:3002 by default. Override it with --firecrawl-url or through MiningConfig.
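
One way the documented flags could map onto a config object is sketched below with argparse. The real MiningConfig fields are not shown in this README, so `MiningConfigSketch`, its field names, and every default except the Firecrawl URL are assumptions made for illustration.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiningConfigSketch:
    """Hypothetical config; only the Firecrawl default is from the README."""
    firecrawl_url: str = "http://localhost:3002"
    rate: float = 1.0                # requests/sec per domain (assumed default)
    delay: float = 0.0               # seconds between pages (assumed default)
    min_words: int = 0               # quality-gate threshold (assumed default)
    max_pages: Optional[int] = None  # no page limit unless --max is given


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="web_miner")
    parser.add_argument("urls", nargs="*", help="URLs to mine")
    parser.add_argument("--file", help="file listing URLs")
    parser.add_argument("--firecrawl-url", default="http://localhost:3002")
    parser.add_argument("--rate", type=float, default=1.0)
    parser.add_argument("--delay", type=float, default=0.0)
    parser.add_argument("--min-words", type=int, default=0)
    parser.add_argument("--max", type=int, default=None, dest="max_pages")
    parser.add_argument("--json", action="store_true")
    return parser


def config_from_args(argv) -> MiningConfigSketch:
    """Parse CLI arguments and copy the relevant ones into the config."""
    args = build_parser().parse_args(argv)
    return MiningConfigSketch(
        firecrawl_url=args.firecrawl_url,
        rate=args.rate,
        delay=args.delay,
        min_words=args.min_words,
        max_pages=args.max_pages,
    )
```

For example, `config_from_args(["--file", "urls.txt", "--firecrawl-url", "http://firecrawl.internal:3002"])` would point the sketch at a Firecrawl instance on another host (the hostname here is made up).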

Cron

Set up a cron job:

# Run daily at 6am
hermes cron create \
  --name "web-miner" \
  --schedule "0 6 * * *" \
  --prompt "Run /opt/sibyl-web-miner/web_miner/cli.py --file /opt/sibyl-web-miner/urls.txt --max 100 --json"

Tests

pytest -v