BatchCrawler: queue-based bulk scraping with N parallel workers #116

Open
opened 2026-06-19 20:55:56 +00:00 by Artur · 0 comments
Owner

Problem description

For large-volume scraping (10,000+ pages) we need a queue-based crawling system:

  1. URL Queue: prioritized queue of pending URLs
  2. Concurrent Workers: N parallel Page instances
  3. Error Handling: retry + fallback + dead letter queue
  4. Result Pipeline: structured extraction + storage
  5. Rate Limiting: delay between requests

Architecture

flowchart LR
    A[Seed URLs] --> B[URL Queue]
    B --> C[Dispatcher]
    C --> D[Worker 1]
    C --> E[Worker 2]
    C --> F[Worker N]
    D --> G[(Results)]
    E --> G
    F --> G
    D -->|429/503| H[Retry Queue]
    F -->|Error > 3x| I[Dead Letter Queue]
    B <--> H
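
To make the routing in the diagram concrete, here is a minimal sketch of the decision a worker could make after each attempt. The names `QueueItem`, `Route` and `route` are hypothetical and not part of the proposed API. It also assumes that every non-2xx/3xx status or thrown error counts as a retryable failure, which this issue does not specify.

```typescript
// Hypothetical queue item; `retries` counts retries already spent on this URL.
interface QueueItem {
  url: string;
  retries: number;
}

type Route = "results" | "retry" | "dead-letter";

// Taken from the Phase 1 criteria: 3 retries, then dead letter.
const MAX_RETRIES = 3;

// `status` is the HTTP status of the attempt, or null if the attempt threw.
function route(item: QueueItem, status: number | null): Route {
  const ok = status !== null && status >= 200 && status < 400;
  if (ok) return "results";
  // Assumption: all failures (429, 503, network errors, ...) are retried alike.
  return item.retries < MAX_RETRIES ? "retry" : "dead-letter";
}
```

A real implementation would probably distinguish retryable statuses (429/503) from permanent ones such as 404, but that policy is left open here.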

API Design

class BatchCrawler {
  constructor(options: {
    workers?: number;       // Parallel workers (default: 5)
    sessionPool?: SessionPool;
    requestDelay?: number;  // Delay between requests (ms)
    onResult?: (result: CrawlResult) => void;
    onError?: (error: CrawlError) => void;
  });

  add(url: string, priority?: number): void;
  addBulk(urls: string[]): void;
  start(): Promise<CrawlSummary>;
  pause(): void;
  resume(): void;
  stop(): void;
}

interface CrawlResult {
  url: string;
  status: number;
  dom: string;
  domSize: number;
  duration: number;
  errors: { type: string; message: string }[];
  network: { requests: number; bytes: number };
  extracted: {
    title?: string;
    description?: string;
    links: string[];
    meta: Record<string, string>;
    jsonld?: any[];
    text?: string;
  };
}

interface CrawlSummary {
  total: number;
  success: number;
  failed: number;
  duration: number;
  errors: { url: string; error: string }[];
}

Acceptance criteria (Phase 1: Core Queue)

  • BatchCrawler.add() adds a URL to the queue
  • BatchCrawler.addBulk(urls) adds multiple URLs
  • BatchCrawler.start() starts processing
  • N workers run in parallel (configurable)
  • Each worker calls page.goto(url)
  • Error handling: 3 retries, then dead letter
  • requestDelay is applied between requests
  • CrawlResult contains the basic data (url, status, domSize, duration)
  • CrawlSummary is returned on completion
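
The Phase 1 criteria can be sketched as a small in-memory loop. This is only an illustration under assumptions, not the planned implementation. `crawl`, `fetchPage` and `Summary` are hypothetical names, `fetchPage` stands in for `page.goto(url)` and is assumed to resolve to an HTTP status, and the delay is applied per worker after each request.

```typescript
// Stand-in for page.goto(url); assumed to resolve to the HTTP status.
type Fetch = (url: string) => Promise<number>;

interface Summary {
  total: number;
  success: number;
  failed: number;
  deadLetter: string[];
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function crawl(
  urls: string[],
  fetchPage: Fetch,
  opts: { workers?: number; requestDelay?: number; maxRetries?: number } = {},
): Promise<Summary> {
  const { workers = 5, requestDelay = 0, maxRetries = 3 } = opts;
  const queue = urls.map((url) => ({ url, retries: 0 }));
  const summary: Summary = { total: urls.length, success: 0, failed: 0, deadLetter: [] };

  // Each worker drains the shared queue until it finds it empty.
  // A worker that re-queues a failed URL keeps looping, so no retry is lost.
  async function worker(): Promise<void> {
    for (;;) {
      const item = queue.shift();
      if (!item) return;
      let ok = false;
      try {
        const status = await fetchPage(item.url);
        ok = status >= 200 && status < 400;
      } catch {
        ok = false; // thrown errors count as failed attempts
      }
      if (ok) {
        summary.success++;
      } else if (item.retries < maxRetries) {
        queue.push({ url: item.url, retries: item.retries + 1 });
      } else {
        summary.failed++;
        summary.deadLetter.push(item.url);
      }
      if (requestDelay > 0) await sleep(requestDelay);
    }
  }

  const pool: Promise<void>[] = [];
  for (let i = 0; i < workers; i++) pool.push(worker());
  await Promise.all(pool);
  return summary;
}
```

Because a URL is counted only once it either succeeds or lands in the dead letter list, `total = success + failed` holds when the promise resolves, which is what BC08 checks.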

Acceptance criteria (Phase 2: DOM Extraction)

  • Auto-extraction: title, description, links, meta tags
  • JSON-LD extraction: page.extractJSONLD()
  • Text extraction: page.extractText() returns clean text without HTML
  • Result pipeline: custom extractors
crawler.addExtractor('product', (page) => ({
  name: page.query('h1.product-title')?.text,
  price: page.query('.price')?.text,
  inStock: page.query('.add-to-cart') !== null,
}));
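
One way the custom extractor pipeline behind `addExtractor` could work is a list of named functions that are run against each page and collected by name. The sketch below assumes a hypothetical `PageLike` interface with only the `query()` call used in the example above. `ExtractorPipeline` and `run` are illustrative names, not part of the proposed API.

```typescript
// Hypothetical minimal page shape: only what the extractor example above uses.
interface PageLike {
  query(selector: string): { text: string } | null;
}

type Extractor = (page: PageLike) => unknown;

class ExtractorPipeline {
  private extractors: { name: string; fn: Extractor }[] = [];

  addExtractor(name: string, fn: Extractor): void {
    this.extractors.push({ name, fn });
  }

  // Runs every registered extractor and stores its output under its name.
  // A throwing extractor is recorded as an error instead of aborting the page.
  run(page: PageLike): Record<string, unknown> {
    const extracted: Record<string, unknown> = {};
    for (const { name, fn } of this.extractors) {
      try {
        extracted[name] = fn(page);
      } catch (err) {
        extracted[name] = { error: err instanceof Error ? err.message : String(err) };
      }
    }
    return extracted;
  }
}
```

Isolating failures per extractor is a design choice made for this sketch: one broken selector then does not discard the other extracted fields for that page.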

Acceptance criteria (Phase 3: Advanced)

  • Prioritized queue (numeric priority)
  • Per-domain rate limiting: max 10 req/s per domain
  • Pause/resume (for manual intervention)
  • State persistence: the queue is saved and restored after a crash
  • SessionPool integration
  • Webhook for results
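
For the prioritized queue, a binary heap keeps both `add` and `next` at O(log n). The sketch below is one possible shape for `url-queue.ts`. The class name `UrlQueue`, the method `next()` and the FIFO tie-break between equal priorities are assumptions. "Higher number first" follows BC16.

```typescript
interface Entry {
  url: string;
  priority: number;
  seq: number; // insertion order, used as FIFO tie-break
}

class UrlQueue {
  private heap: Entry[] = [];
  private seq = 0;

  size(): number {
    return this.heap.length;
  }

  add(url: string, priority = 0): void {
    this.heap.push({ url, priority, seq: this.seq++ });
    this.siftUp(this.heap.length - 1);
  }

  // Removes and returns the URL with the highest priority, or undefined if empty.
  next(): string | undefined {
    const top = this.heap[0];
    if (top === undefined) return undefined;
    const last = this.heap.pop()!;
    if (this.heap.length > 0) {
      this.heap[0] = last;
      this.siftDown(0);
    }
    return top.url;
  }

  // True if `a` must be served before `b`.
  private before(a: Entry, b: Entry): boolean {
    return a.priority !== b.priority ? a.priority > b.priority : a.seq < b.seq;
  }

  private siftUp(i: number): void {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (!this.before(this.heap[i], this.heap[parent])) break;
      [this.heap[i], this.heap[parent]] = [this.heap[parent], this.heap[i]];
      i = parent;
    }
  }

  private siftDown(i: number): void {
    const n = this.heap.length;
    for (;;) {
      let best = i;
      const left = 2 * i + 1;
      const right = left + 1;
      if (left < n && this.before(this.heap[left], this.heap[best])) best = left;
      if (right < n && this.before(this.heap[right], this.heap[best])) best = right;
      if (best === i) return;
      [this.heap[i], this.heap[best]] = [this.heap[best], this.heap[i]];
      i = best;
    }
  }
}
```

This version is purely in-memory. The SQLite-backed queue mentioned under Risks would need a different storage layer behind the same interface.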

Affected files

| File | Change | Effort |
|---|---|---|
| src/crawler/batch-crawler.ts | New: BatchCrawler class | ~300 lines |
| src/crawler/url-queue.ts | New: prioritized queue | ~100 lines |
| src/crawler/extractors.ts | New: DOM extraction helpers | ~150 lines |
| src/pages/page.ts | extractJSONLD, extractText methods | ~50 lines |
| tests/unit/crawler.test.ts | New: 20 tests | ~500 lines |

Test plan (20 tests)

| # | Test | Phase |
|---|---|---|
| BC01 | add() → queue has 1 item | 1 |
| BC02 | addBulk(10 URLs) → queue has 10 | 1 |
| BC03 | start() → all URLs processed | 1 |
| BC04 | N workers run in parallel (count) | 1 |
| BC05 | 3 retries on error → dead letter | 1 |
| BC06 | requestDelay is respected | 1 |
| BC07 | CrawlSummary.failed counts errors | 1 |
| BC08 | CrawlSummary.total = success + failed | 1 |
| BC09 | page.goto() is called once per URL | 1 |
| BC10 | Empty queue → done immediately | 1 |
| BC11 | extractTitle() works | 2 |
| BC12 | extractLinks() finds all `<a href>` links | 2 |
| BC13 | extractJSONLD() parses JSON-LD | 2 |
| BC14 | extractText() returns clean text | 2 |
| BC15 | Custom extractor pipeline | 2 |
| BC16 | Priority queue (high first) | 3 |
| BC17 | Per-domain rate limiting | 3 |
| BC18 | pause/resume | 3 |
| BC19 | State persistence (save/load) | 3 |
| BC20 | SessionPool integration | 3 |
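
BC13 expects `extractJSONLD()` to parse JSON-LD, but the Page API for it is not specified in this issue. As a rough illustration of the behavior a test could pin down, the following sketch works on a raw HTML string. `extractJsonLd` is a hypothetical helper, and the regex-based matching is a simplification that a real implementation would likely replace with a DOM query.

```typescript
// Hypothetical helper: collects every parseable JSON-LD block from an HTML string.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script\b[^>]*type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const found: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      found.push(JSON.parse(match[1] ?? ""));
    } catch {
      // Malformed JSON-LD blocks are skipped rather than failing the whole page.
    }
  }
  return found;
}
```

Whether malformed blocks should be skipped silently or reported in `CrawlResult.errors` is an open design question that this sketch does not settle.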

Dependencies

  • SessionPool (issue "Session Pools"): optional, for proxy rotation

Performance Target

| Metric | Target |
|---|---|
| URLs per second (1 worker) | ~5 URLs/s |
| URLs per second (5 workers) | ~20 URLs/s |
| Max URLs in queue | 1,000,000 |
| RAM per 100,000 queue items | ~50 MB |
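
To put these targets in perspective, the arithmetic below derives crawl time and queue memory from the table values. The helper names are hypothetical. The figures are targets, not measurements, and linear scaling of memory with queue size is an assumption.

```typescript
// Time to crawl `urls` pages at a sustained throughput, in seconds.
function crawlSeconds(urls: number, urlsPerSecond: number): number {
  return urls / urlsPerSecond;
}

// Queue memory in MB, assuming it scales linearly from ~50 MB per 100,000 items.
function queueMemoryMB(items: number, mbPer100k = 50): number {
  return (items / 100_000) * mbPer100k;
}

// 10,000 pages with 5 workers at ~20 URLs/s: 500 s, a bit over 8 minutes.
const fiveWorkers = crawlSeconds(10_000, 20);
// The same crawl with 1 worker at ~5 URLs/s: 2,000 s, about 33 minutes.
const oneWorker = crawlSeconds(10_000, 5);
// A full queue of 1,000,000 URLs: ~500 MB if kept entirely in RAM.
const fullQueue = queueMemoryMB(1_000_000);
```

The ~500 MB figure for a full in-memory queue is the reason the out-of-memory risk at 1M URLs is rated high.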

Risks

| Risk | Impact | Mitigation |
|---|---|---|
| Out of memory at 1M URLs | High | Stream-based queue (SQLite) |
| Rate limiting blocks all workers | Medium | Per-domain limiter instead of a global one |
| Crawl results too large | Medium | Stream results instead of holding them in RAM |
| Crash mid-crawl | Medium | State persistence + resume |
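
The per-domain limiter mitigation could look roughly like the sketch below. Each hostname gets its own schedule, so a slow or throttled domain does not delay requests to other domains. `DomainLimiter` and `reserve` are hypothetical names. The default of 10 requests per second comes from the Phase 3 criteria, and the clock is passed in so the behavior can be tested without real waiting.

```typescript
class DomainLimiter {
  // Earliest timestamp (ms) at which each hostname may be requested again.
  private nextFree: Record<string, number> = Object.create(null);
  private intervalMs: number;

  constructor(maxPerSecond = 10) {
    this.intervalMs = 1000 / maxPerSecond;
  }

  // Reserves the next slot for the URL's hostname and returns how many
  // milliseconds the caller should wait before sending the request.
  reserve(url: string, now: number = Date.now()): number {
    const host = new URL(url).hostname;
    const slot = Math.max(now, this.nextFree[host] ?? 0);
    this.nextFree[host] = slot + this.intervalMs;
    return slot - now;
  }
}
```

A worker would call `reserve(url)` and sleep for the returned duration before `page.goto(url)`. This spaces requests evenly instead of allowing bursts, which is stricter than a sliding-window "10 per second" limit.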