BatchCrawler: queue-based bulk scraping with N parallel workers #116
Reference
glow-all/true-headless-browser#116
Problem description
For high-volume scraping (10,000+ pages) we need a queue-based crawling system.
Architecture
API Design
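The following is a minimal sketch of what the Phase 1 API could look like, not the project's actual implementation. The names BatchCrawler, add, addBulk, start, requestDelay, CrawlResult and CrawlSummary come from the acceptance criteria in this issue; everything else (the concurrency and load options, the CrawlSummary fields, status 0 for failed loads, counting 2xx/3xx as success) is an assumption made for illustration. The load callback stands in for page.goto(url) so the sketch stays self-contained.

```typescript
// Field names taken from the Phase 1 criteria; duration is assumed to be in ms.
interface CrawlResult {
  url: string;
  status: number;
  domSize: number;
  duration: number;
}

// Assumed summary shape; the issue only names CrawlSummary, not its fields.
interface CrawlSummary {
  total: number;
  succeeded: number;
  failed: number;
  durationMs: number;
}

// Hypothetical stand-in for page.goto(url) plus a DOM size measurement.
type PageLoader = (url: string) => Promise<{ status: number; domSize: number }>;

interface BatchCrawlerOptions {
  concurrency: number; // N parallel workers (assumed option name)
  requestDelay: number; // ms each worker waits after a request
  load: PageLoader;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

class BatchCrawler {
  private queue: string[] = [];
  private options: BatchCrawlerOptions;
  readonly results: CrawlResult[] = [];

  constructor(options: BatchCrawlerOptions) {
    this.options = options;
  }

  add(url: string): void {
    this.queue.push(url);
  }

  addBulk(urls: string[]): void {
    for (const url of urls) this.add(url);
  }

  // Starts N workers that drain the shared queue, then reports a summary.
  async start(): Promise<CrawlSummary> {
    const startedAt = Date.now();
    const workers = Array.from({ length: this.options.concurrency }, () =>
      this.runWorker(),
    );
    await Promise.all(workers);
    const succeeded = this.results.filter(
      (r) => r.status >= 200 && r.status < 400,
    ).length;
    return {
      total: this.results.length,
      succeeded,
      failed: this.results.length - succeeded,
      durationMs: Date.now() - startedAt,
    };
  }

  // shift() is synchronous, so two workers never receive the same URL.
  private async runWorker(): Promise<void> {
    let url: string | undefined;
    while ((url = this.queue.shift()) !== undefined) {
      const t0 = Date.now();
      try {
        const { status, domSize } = await this.options.load(url);
        this.results.push({ url, status, domSize, duration: Date.now() - t0 });
      } catch {
        // Assumption: a failed load is recorded with status 0.
        this.results.push({ url, status: 0, domSize: 0, duration: Date.now() - t0 });
      }
      if (this.options.requestDelay > 0) await sleep(this.options.requestDelay);
    }
  }
}
```

In this sketch a caller would construct the crawler with a loader, queue URLs through add() or addBulk(), and await start() to receive the summary. Note that requestDelay here is applied per worker, so the effective global request rate still scales with the number of workers; a per-domain limit such as the one in Phase 3 would need a separate mechanism.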
Acceptance criteria (Phase 1: Core Queue)
- BatchCrawler.add() adds a URL to the queue
- BatchCrawler.addBulk(urls) adds multiple URLs
- BatchCrawler.start() starts processing
- page.goto(url) is called
- requestDelay between requests
- CrawlResult contains basic data (url, status, domSize, duration)
- CrawlSummary after completion
Acceptance criteria (Phase 2: DOM Extraction)
- page.extractJSONLD()
- page.extractText(): clean text without HTML
Acceptance criteria (Phase 3: Advanced)
- max 10 req/s per domain
Affected files
- src/crawler/batch-crawler.ts
- src/crawler/url-queue.ts
- src/crawler/extractors.ts
- src/pages/page.ts
- tests/unit/crawler.test.ts
Test plan (20 tests)
Dependencies
Performance Target
Risks