Changes for version 0.005 - 2026-06-12

  • Detect: removed every body-text and HTML/title fingerprint from the block classifier. is_good now discards a page only on thin_html, an HTTP push-back status, or a final_url that landed on a known WAF/captcha challenge endpoint. Three groups of checks are gone: the size-independent $RE_WALL (which matched a bare `__cf_` substring), the size-independent "Just a moment"/"Access denied" $RE_TITLE, and the thin-gated $RE_BLOCK / $RE_JS / $RE_CAPTCHA body phrases. On a thin page they were redundant (thin_html already fails it), and on a full page they were pure false positives. Real regression: www.delphin.de served a full 386 KB 200 page carrying Cloudflare's passive /cdn-cgi/challenge-platform/.../jsd/main.js beacon (a `__cf_` token), and that page was thrown away as bot_wall_detected across all four strategies. A content-rich 200 page is now accepted as the scrape, full stop. Drops the js_required signal/reason (a thin JS shell now reads as thin_content). Soft blocks that serve one identical interstitial for every URL are caught site-level by the caller, not per-page here.
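
The three remaining rejection conditions can be sketched roughly as below. This is a minimal illustration, not the module's actual code: the function name classify_page, the byte cutoff, the status list, the challenge-URL pattern, and the http_pushback / challenge_url reason strings are all assumptions made for the example. Only thin_content, final_url, and the general shape of the rule come from the entry above.

```perl
use strict;
use warnings;

# All three values below are hypothetical; the changelog does not state them.
my $THIN_BYTES   = 2_000;                                  # assumed thin_html cutoff
my %PUSHBACK     = map { $_ => 1 } (403, 429, 503);        # assumed push-back statuses
my $RE_CHALLENGE = qr{/(?:captcha|challenge)(?:[/?]|$)}i;  # assumed challenge endpoints

# Hypothetical stand-in for is_good: returns (verdict, reason).
# Note that the page body is only measured, never pattern-matched.
sub classify_page {
    my (%page) = @_;
    my $html   = $page{html}      // '';
    my $status = $page{status}    // 0;
    my $final  = $page{final_url} // '';

    return (0, 'thin_content')  if length($html) < $THIN_BYTES;
    return (0, 'http_pushback') if $PUSHBACK{$status};
    return (0, 'challenge_url') if $final =~ $RE_CHALLENGE;
    return (1, 'ok');
}

# A full 200 page that merely mentions a __cf_ token is kept.
my $body = ('<p>content</p>' x 500) . '<!-- __cf_ beacon token -->';
my ($ok, $reason) = classify_page(
    html      => $body,
    status    => 200,
    final_url => 'https://www.example.com/',
);
print "$ok $reason\n";   # prints "1 ok"
```

Under these assumptions, a body fingerprint can no longer cause a rejection on its own; only page size, HTTP status, and the landing URL decide.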

Documentation

probe Crawl4AI / CloakBrowser / proxy reachability and print the chain
run the full WWW::Crawl4AI strategy chain against one URL

Modules

Perl client and fallback orchestrator for Crawl4AI
one strategy attempt in a WWW::Crawl4AI fallback chain
UA-agnostic REST client for the Crawl4AI Docker API
breadth-first iterator for deep_crawl, separating frontier management from crawl logic
service detection and content-quality classification for Crawl4AI
structured error class for WWW::Crawl4AI
markdown field resolution across Crawl4AI response shapes
builds Crawl4AI /crawl and /md request payloads
normalized result of a WWW::Crawl4AI strategy chain
role for a single crawl strategy in the WWW::Crawl4AI fallback chain
Crawl4AI strategy with full JS rendering (wait for networkidle)
last-resort Crawl4AI strategy delegating to a user coderef
Crawl4AI strategy attaching to CloakBrowser over CDP
cheapest Crawl4AI strategy — headless text mode, no escalation
Crawl4AI strategy routing through a configured proxy
Crawl4AI strategy with enable_stealth and randomized fingerprint
ordered list of strategy objects, pluggable at construction time