CLI + Corpus Runner: index.ts, corpus format, output files, --verbose, --single, --url #15

Closed
opened 2026-06-17 13:37:44 +00:00 by Artur · 1 comment
Owner

Goal

Implement the CLI entry point and corpus runner.

What to Build

index.ts (CLI Entry Point)

Usage:
  bun run crawl:strict                 # Run default corpus
  bun run crawl:strict <corpus.json>   # Run specific corpus file
  bun run crawl:strict --all           # Run all corpus files in corpus/
  bun run crawl:strict <corpus.json> --single <index>  # Run single entry
  bun run crawl:strict <url> --url     # Run single URL directly
  bun run crawl:strict --help          # Show help

Options:
  --timeout <ms>     Per-page timeout (default: 30000)
  --verbose          Print API access log
  --output <dir>     Output directory (default: ./output/)
  --json             Output as JSON

Corpus Format

[
  {
    "url": "https://example.com",
    "description": "Simple HTML page",
    "timeout": 10000,
    "expect": {
      "domContains": ["Hello World"],
      "networkRequestCount": { "min": 1, "max": 10 }
    }
  }
]

CLI Behavior

for each entry in corpus:
  print: [1/N] https://example.com — description
  try:
    result = await runPage(entry.url, { timeout: entry.timeout })
    print:   ✓ passed
    print:   DOM size: X chars
    print:   Network requests: Y
    print:   APIs used: Z
    print:   Timing: parse=Xms scripts=Yms stable=Zms
  catch err:
    print:   ✗ FAILED: error message
    print:   Stack trace (--verbose)

At end:
  print: ✓ N/N pages passed
  or:
  print: ✗ N pages failed. Implement missing APIs and rerun.
  print: Top missing APIs:
    - window.navigator.mediaDevices (used 3 times)
    - window.HTMLVideoElement (used 2 times)

Output Files

output/
  <timestamp>/
    example.com/
      dom.html          → Final DOM serialization
      network.json      → Network log
      cookies.json      → Cookie jar dump
      storage.json      → Storage dump
      apis-used.json    → APIs accessed
      timing.json       → Timing info
      error.log         → Error log (if failed)
      summary.json      → All data in one file

scripts/corpus/default.json

[
  {
    "url": "https://www.google.com",
    "description": "Google homepage",
    "timeout": 15000
  },
  {
    "url": "data:text/html,<script>document.title='test'</script>",
    "description": "Minimal inline script",
    "timeout": 5000
  }
]

Tests

Test Verifies
cli.basic.test.ts CLI runs with no args shows usage
cli.corpus.test.ts CLI runs with corpus file
cli.single.test.ts CLI runs with --single flag
cli.url.test.ts CLI runs with --url flag
cli.timeout.test.ts --timeout option works
cli.output.test.ts Output files written correctly
cli.summary.test.ts Summary printed at end
cli.error-handling.test.ts Failing page shows error, doesn't crash CLI

Definition of Done

  • index.ts with full CLI
  • Corpus format with options
  • Output directory with files
  • scripts/corpus/default.json
  • --verbose mode shows top missing APIs
  • All tests pass
  • 100% line + branch coverage
## Goal Implement the CLI entry point and corpus runner. ## What to Build ### index.ts (CLI Entry Point) ``` Usage: bun run crawl:strict # Run default corpus bun run crawl:strict <corpus.json> # Run specific corpus file bun run crawl:strict --all # Run all corpus files in corpus/ bun run crawl:strict <corpus.json> --single <index> # Run single entry bun run crawl:strict <url> --url # Run single URL directly bun run crawl:strict --help # Show help Options: --timeout <ms> Per-page timeout (default: 30000) --verbose Print API access log --output <dir> Output directory (default: ./output/) --json Output as JSON ``` ### Corpus Format ```json [ { "url": "https://example.com", "description": "Simple HTML page", "timeout": 10000, "expect": { "domContains": ["Hello World"], "networkRequestCount": { "min": 1, "max": 10 } } } ] ``` ### CLI Behavior ``` for each entry in corpus: print: [1/N] https://example.com — description try: result = await runPage(entry.url, { timeout: entry.timeout }) print: ✓ passed print: DOM size: X chars print: Network requests: Y print: APIs used: Z print: Timing: parse=Xms scripts=Yms stable=Zms catch err: print: ✗ FAILED: error message print: Stack trace (--verbose) At end: print: ✓ N/N pages passed or: print: ✗ N pages failed. Implement missing APIs and rerun. print: Top missing APIs: - window.navigator.mediaDevices (used 3 times) - window.HTMLVideoElement (used 2 times) ``` ### Output Files ``` output/ <timestamp>/ example.com/ dom.html → Final DOM serialization network.json → Network log cookies.json → Cookie jar dump storage.json → Storage dump apis-used.json → APIs accessed timing.json → Timing info error.log → Error log (if failed) summary.json → All data in one file ``` ### scripts/corpus/default.json ``` [ { "url": "https://www.google.com", "description": "Google homepage", "timeout": 15000 }, { "url": "data:text/html,<script>document.title='test'</script>", "description": "Minimal inline script", "timeout": 5000 } ] ``` ## Tests | Test | Verifies | |------|----------| | cli.basic.test.ts | CLI runs with no args shows usage | | cli.corpus.test.ts | CLI runs with corpus file | | cli.single.test.ts | CLI runs with --single flag | | cli.url.test.ts | CLI runs with --url flag | | cli.timeout.test.ts | --timeout option works | | cli.output.test.ts | Output files written correctly | | cli.summary.test.ts | Summary printed at end | | cli.error-handling.test.ts | Failing page shows error, doesn't crash CLI | ## Definition of Done - [ ] index.ts with full CLI - [ ] Corpus format with options - [ ] Output directory with files - [ ] scripts/corpus/default.json - [ ] --verbose mode shows top missing APIs - [ ] All tests pass - [ ] 100% line + branch coverage
Author
Owner

CLI + Corpus Runner: index.ts, corpus format, output files. Implementiert in cli.ts. corpus/ vorhanden.

CLI + Corpus Runner: index.ts, corpus format, output files. ✅ Implementiert in cli.ts. corpus/ vorhanden.
Artur closed this issue 2026-06-18 06:28:04 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
glow-all/true-headless-browser#15
No description provided.