NAME
Net::Async::Crawl4AI - IO::Async Crawl4AI client with an async strategy chain
VERSION
version 0.001
SYNOPSIS
use IO::Async::Loop;
use Net::Async::Crawl4AI;
my $loop = IO::Async::Loop->new;
my $crawler = Net::Async::Crawl4AI->new(
base_url => 'http://localhost:11235',
cloakbrowser_url => $ENV{CLOAKBROWSER_CDP_URL}, # optional
poll_interval => 2,
);
$loop->add($crawler);
# Async strategy chain — escalates plain -> browser -> stealth -> ...
my $result = $crawler->markdown('https://example.com')->get;
say $result->markdown;
say $result->backend; # crawl4ai_plain / crawl4ai_stealth / ...
say $result->attempts_json;
# Low-level single crawl (no chain), returns all pages.
my $pages = $crawler->crawl_once(
WWW::Crawl4AI::Request->new( urls => 'https://example.com' )
)->get;
# Submit an async crawl job and poll it to completion.
my $done = $crawler->crawl_job_and_wait('https://example.com')->get;
# { status => 'COMPLETED', pages => [...], raw => {...} }
DESCRIPTION
IO::Async-flavoured companion to WWW::Crawl4AI. It wraps a WWW::Crawl4AI orchestrator, dispatches its request builders through Net::Async::HTTP, and returns Future objects — including a fully asynchronous run of the same visible strategy chain.
The pure building blocks (request building, page normalization, content classification via WWW::Crawl4AI::Detect, and WWW::Crawl4AI::Attempt / WWW::Crawl4AI::Result history) are shared with the synchronous WWW::Crawl4AI, so $crawler->markdown(...)->get produces the same WWW::Crawl4AI::Result the sync facade would — only non-blocking.
Must $loop->add($crawler) before use — it is a IO::Async::Notifier subclass. Without this the internal Net::Async::HTTP has no loop and requests will hang.
Constructor parameters
base_url, api_token, cloakbrowser_url, proxy_url, callback,
fallback, timeout, min_markdown, client
All forwarded to the underlying WWW::Crawl4AI. Or pass a pre-built instance as crawl4ai => $www.
Async-only keys: poll_interval, http (pre-built Net::Async::HTTP), delay_sub (CodeRef → Future, for retry/poll delays; mainly a test hook).
The retry policy (max_attempts, retry_backoff, retry_statuses, on_retry) lives on the underlying WWW::Crawl4AI::Client.
Future contract
Endpoint Futures fail as Future->fail($error, 'crawl4ai') where $error is a WWW::Crawl4AI::Error. crawl/markdown never fail for per-strategy errors: each failed strategy is an entry in the attempt history, and an all-strategies-failed run resolves to a WWW::Crawl4AI::Result with ok => 0.
crawl4ai
The underlying WWW::Crawl4AI orchestrator.
client
The underlying WWW::Crawl4AI::Client (used for request builders, response parsers and retry configuration).
poll_interval
Read/write accessor for the default job-status poll interval in seconds.
available_backends
Arrayref of backend names currently in the chain.
http
The underlying Net::Async::HTTP (lazily built and parented to this notifier).
do_request
Low-level: dispatch an HTTP::Request (typically built via $self->client->foo_request) through Net::Async::HTTP with the retry policy applied. Returns a Future of HTTP::Response.
crawl_once
$crawler->crawl_once($request, $backend?) → Future[\@pages]
Low-level single POST /crawl. Resolves to the arrayref of normalized pages (no chain, no classification). $request is a WWW::Crawl4AI::Request or a payload hashref.
md
$crawler->md($url_or_request, %opts) → Future[$markdown]
POST /md. Resolves to the markdown payload.
job_submit
$crawler->job_submit($request) → Future[{ task_id, raw }]
POST /crawl/job. Resolves to { task_id => ..., raw => {...} }.
job_status
$crawler->job_status($task_id) → Future[{ status, pages, raw }]
GET /crawl/job/$task_id. Resolves to { status, pages, raw }; fails with a type=job WWW::Crawl4AI::Error when the job reports FAILED.
health
Resolves to 1 if the Crawl4AI server answers GET /health, else 0. Never fails.
screenshot
$crawler->screenshot($url, wait_for => 2, wait_for_images => 1) → Future[$png_bytes]
$crawler->pdf($url) → Future[$pdf_bytes]
html
$crawler->html($url) → Future[$html]
execute_js
$crawler->execute_js($url, $script_or_arrayref) → Future[\%page]
llm
$crawler->llm($url, $query, %opts) → Future[$answer]
token
$crawler->token($email) → Future[\%token]
Future-returning single-URL action endpoints, mirroring WWW::Crawl4AI::Client: screenshot/pdf resolve to raw bytes, html to the preprocessed HTML, execute_js to a normalized page with js_result, llm to an answer string (needs a server-side LLM provider), and token to a JWT hash. They do not run the strategy chain.
crawl_job_and_wait
$crawler->crawl_job_and_wait($url_or_request, %opts) → Future[\%status]
Submits a crawl job (POST /crawl/job) and polls job_status every poll_interval seconds (override per call with poll_interval => N) until it reports COMPLETED. Resolves to the final status hash ({ status, pages, raw }); fails with a type=job WWW::Crawl4AI::Error on a failed job.
crawl
markdown
my $result = $crawler->markdown('https://example.com')->get;
my $result = $crawler->crawl( url => 'https://example.com' )->get;
Run the strategy chain asynchronously and resolve to a WWW::Crawl4AI::Result. Same chain, same result object as the synchronous "markdown" in WWW::Crawl4AI. Accepts a single positional URL or named arguments with a url key.
The Future never fails for per-strategy errors: each failed strategy is an entry in the attempt history. An all-strategies-failed run resolves to a Result with ok => 0.
deep_crawl
my $results = $crawler->deep_crawl('https://example.com')->get;
my $results = $crawler->deep_crawl(
'https://example.com',
max_pages => 50,
max_depth => 3,
same_host => 1,
concurrency => 8, # async-only
url_filter => sub { $_[0] !~ m{/login} },
on_page => sub { my ( $result, $depth ) = @_; ... },
min_markdown => 200, # any crawl() option is forwarded
)->get;
Asynchronous breadth-first crawl that follows the "urls" in WWW::Crawl4AI::Result of each good page. Resolves to a Future of an arrayref of WWW::Crawl4AI::Result in breadth-first order: the start URL first, then deeper pages grouped by depth (the list is reordered back to enqueue order, so a faster page completing first does not jump the queue). Same semantics as "deep_crawl" in WWW::Crawl4AI, but each depth level's frontier is crawled concurrently (up to concurrency, default 4) instead of one page at a time.
Options: max_pages (default 25), max_depth (default 2, start URL is depth 0), same_host (default true), concurrency (default 4, async-only), url_filter (($url) -> bool), on_page (($result, $depth)). URLs are deduplicated with the fragment stripped. Any remaining options are forwarded to each "crawl".
SEE ALSO
WWW::Crawl4AI, WWW::Crawl4AI::Client, WWW::Crawl4AI::Result, IO::Async, Net::Async::HTTP, Future, https://github.com/unclecode/crawl4ai
SUPPORT
Issues
Please report bugs and feature requests on GitHub at https://github.com/Getty/p5-net-async-crawl4ai/issues.
CONTRIBUTING
Contributions are welcome! Please fork the repository and submit a pull request.
AUTHOR
Torsten Raudssus <torsten@raudssus.de> https://raudss.us/
COPYRIGHT AND LICENSE
This software is copyright (c) 2026 by Torsten Raudssus.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.