Polite web → Sibyl pipeline: Firecrawl → sibyl-extractor → quality gate → Sibyl DB
Modules:
- pipeline.py: MiningPipeline orchestrator (Firecrawl → extractor → quality → store)
- politeness.py: Domain-level rate limiting with exponential backoff
- sibyl_store.py: Stores structured ExtractedPage data into Sibyl DB
- cli.py: CLI entry point with argparse

Firecrawl handles robots.txt/caching/retry natively. This adds domain-level throttling on top (configurable requests/sec per domain).

Tests: 10/10 passing
sibyl-web-miner
Polite web → Sibyl pipeline. Firecrawl → sibyl-extractor → quality gate → Sibyl DB.
Pipeline
URL list → DomainThrottle → Firecrawl scrape
→ sibyl-extractor (metadata, fields, quality)
→ Quality gate
→ Sibyl DB (reference_documents table)
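The stages above can be sketched roughly as follows. This is a minimal illustration, not the actual MiningPipeline: the `Page` class, `run_pipeline`, and the `scrape`/`store` callables are hypothetical stand-ins for the Firecrawl call, sibyl-extractor, and the Sibyl DB write, and throttling is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Page:
    """Hypothetical stand-in for the extractor's structured output."""
    url: str
    text: str

    @property
    def word_count(self) -> int:
        return len(self.text.split())


def run_pipeline(
    urls: Iterable[str],
    scrape: Callable[[str], Optional[str]],  # stand-in for the Firecrawl scrape
    store: Callable[[Page], None],           # stand-in for the Sibyl DB write
    min_words: int = 200,
) -> dict:
    """Push each URL through scrape -> extract -> quality gate -> store."""
    stats = {"stored": 0, "rejected": 0, "failed": 0}
    for url in urls:
        raw = scrape(url)
        if raw is None:  # scrape failed, nothing to extract
            stats["failed"] += 1
            continue
        # A real extractor would also attach metadata and fields here.
        page = Page(url=url, text=raw)
        if page.word_count < min_words:  # quality gate: drop thin pages
            stats["rejected"] += 1
            continue
        store(page)
        stats["stored"] += 1
    return stats
```

Passing the scraper and store in as callables keeps the sketch testable without a running Firecrawl instance or database.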
Politeness
- Domain-level rate limiting: configurable requests/sec per domain
- Exponential backoff: failed domains get progressively longer cooldowns (2x multiplier, 5min cap)
- Firecrawl handles: robots.txt, caching, retry
- Configurable page delay: minimum gap between pages
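A minimal sketch of how per-domain throttling with exponential backoff might look. The 2x multiplier and 5-minute cap come from the list above; the class layout, method names, and the 1-second starting cooldown are assumptions for illustration and may not match politeness.py.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Per-domain rate limiter with exponential backoff on failure (sketch)."""

    def __init__(self, rate=2.0, backoff_base=1.0, multiplier=2.0,
                 max_backoff=300.0, clock=time.monotonic):
        self.min_interval = 1.0 / rate    # seconds between requests to one domain
        self.backoff_base = backoff_base  # assumed first cooldown, in seconds
        self.multiplier = multiplier      # cooldown doubles per consecutive failure
        self.max_backoff = max_backoff    # 5 minute cap
        self.clock = clock                # injectable for testing
        self._next_ok = {}                # domain -> earliest allowed request time
        self._failures = {}               # domain -> consecutive failure count

    @staticmethod
    def domain(url):
        return urlparse(url).netloc.lower()

    def wait_time(self, url):
        """Seconds to wait before the next request to this URL's domain."""
        return max(0.0, self._next_ok.get(self.domain(url), 0.0) - self.clock())

    def record_success(self, url):
        d = self.domain(url)
        self._failures[d] = 0
        self._next_ok[d] = self.clock() + self.min_interval

    def record_failure(self, url):
        """Register a failure and return the cooldown applied to the domain."""
        d = self.domain(url)
        n = self._failures.get(d, 0) + 1
        self._failures[d] = n
        cooldown = min(self.backoff_base * self.multiplier ** (n - 1),
                       self.max_backoff)
        self._next_ok[d] = self.clock() + cooldown
        return cooldown
```

A caller would sleep for `wait_time(url)` before each scrape, then call `record_success` or `record_failure` depending on the outcome.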
Installation
pip install -e /opt/sibyl-extractor # Dependency: the structured extractor
pip install -e ".[dev]"
Usage
# Single URL
python -m web_miner.cli https://example.com/page
# Multiple URLs
python -m web_miner.cli https://site1.com https://site2.com
# URL file
python -m web_miner.cli --file urls.txt
# With options: 2 requests/sec per domain, 1 second between pages,
# skip pages with <200 words, stop after 50 pages
python -m web_miner.cli \
--file urls.txt \
--rate 2.0 \
--delay 1.0 \
--min-words 200 \
--max 50
# JSON output (for cron)
python -m web_miner.cli --file urls.txt --json
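For orientation, the flags used above could be declared with argparse roughly like this. The flag names are taken from the examples in this README; the defaults (other than the Firecrawl URL), the help text, and the `max_pages` destination name are assumptions and may differ from the real cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="web_miner.cli")
    parser.add_argument("urls", nargs="*", help="URLs to mine")
    parser.add_argument("--file", help="file containing URLs")
    parser.add_argument("--rate", type=float, default=1.0,
                        help="requests/sec per domain (default is assumed)")
    parser.add_argument("--delay", type=float, default=0.0,
                        help="minimum seconds between pages (default is assumed)")
    parser.add_argument("--min-words", type=int, default=0,
                        help="skip pages with fewer words (default is assumed)")
    parser.add_argument("--max", type=int, default=None, dest="max_pages",
                        help="stop after this many pages")
    parser.add_argument("--json", action="store_true",
                        help="emit JSON output")
    parser.add_argument("--firecrawl-url", default="http://localhost:3002",
                        help="base URL of the Firecrawl API")
    return parser
```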
Firecrawl
The pipeline calls a self-hosted Firecrawl API at http://localhost:3002 by default.
Override it with --firecrawl-url, or set it in MiningConfig.
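As a rough illustration of what a call to the self-hosted instance might look like, the sketch below builds (but does not send) a scrape request. The `/v1/scrape` path and the `url`/`formats` body fields are assumptions based on Firecrawl's v1 API; the pipeline itself may call Firecrawl differently, and `build_scrape_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_scrape_request(page_url, firecrawl_url="http://localhost:3002"):
    """Build a POST request asking Firecrawl to scrape one page as markdown."""
    body = json.dumps({"url": page_url, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        firecrawl_url.rstrip("/") + "/v1/scrape",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be a matter of `urllib.request.urlopen(req, timeout=30)` against a running instance.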
Cron
Set up a cron job:
# Run daily at 6am
hermes cron create \
--name "web-miner" \
--schedule "0 6 * * *" \
--prompt "Run /opt/sibyl-web-miner/web_miner/cli.py --file /opt/sibyl-web-miner/urls.txt --max 100 --json"
Tests
pytest -v