NAME

Net::Async::Crawl4AI - IO::Async Crawl4AI client with an async strategy chain

VERSION

version 0.001

SYNOPSIS

use v5.10;                       # enables say
use IO::Async::Loop;
use Net::Async::Crawl4AI;
use WWW::Crawl4AI::Request;      # used by the crawl_once example below

my $loop = IO::Async::Loop->new;
my $crawler = Net::Async::Crawl4AI->new(
  base_url         => 'http://localhost:11235',
  cloakbrowser_url => $ENV{CLOAKBROWSER_CDP_URL},   # optional
  poll_interval    => 2,
);
$loop->add($crawler);

# Async strategy chain — escalates plain -> browser -> stealth -> ...
my $result = $crawler->markdown('https://example.com')->get;
say $result->markdown;
say $result->backend;        # crawl4ai_plain / crawl4ai_stealth / ...
say $result->attempts_json;

# Low-level single crawl (no chain), returns all pages.
my $pages = $crawler->crawl_once(
  WWW::Crawl4AI::Request->new( urls => 'https://example.com' )
)->get;

# Submit an async crawl job and poll it to completion.
my $done = $crawler->crawl_job_and_wait('https://example.com')->get;
# { status => 'COMPLETED', pages => [...], raw => {...} }

DESCRIPTION

IO::Async-flavoured companion to WWW::Crawl4AI. It wraps a WWW::Crawl4AI orchestrator, dispatches its request builders through Net::Async::HTTP, and returns Future objects — including a fully asynchronous run of the same visible strategy chain.

The pure building blocks (request building, page normalization, content classification via WWW::Crawl4AI::Detect, and WWW::Crawl4AI::Attempt / WWW::Crawl4AI::Result history) are shared with the synchronous WWW::Crawl4AI, so $crawler->markdown(...)->get produces the same WWW::Crawl4AI::Result the sync facade would — only non-blocking.

Call $loop->add($crawler) before use: the crawler is an IO::Async::Notifier subclass, and until it is added the internal Net::Async::HTTP has no loop, so requests hang.

Constructor parameters

base_url, api_token, cloakbrowser_url, proxy_url, callback,
fallback, timeout, min_markdown, client

All of these are forwarded to the underlying WWW::Crawl4AI. Alternatively, pass a pre-built instance as crawl4ai => $www.

Async-only keys: poll_interval, http (pre-built Net::Async::HTTP), delay_sub (CodeRef → Future, for retry/poll delays; mainly a test hook).

The retry policy (max_attempts, retry_backoff, retry_statuses, on_retry) lives on the underlying WWW::Crawl4AI::Client.

Future contract

Endpoint Futures fail as Future->fail($error, 'crawl4ai') where $error is a WWW::Crawl4AI::Error. crawl/markdown never fail for per-strategy errors: each failed strategy is an entry in the attempt history, and an all-strategies-failed run resolves to a WWW::Crawl4AI::Result with ok => 0.
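
The contract above implies two separate checks: whether the Future itself failed (an endpoint-level error) and, for crawl/markdown, whether the resolved result is usable. A minimal sketch of that distinction follows. describe_outcome is a hypothetical helper, and it assumes the result object exposes an ok accessor (the text only says the result carries ok => 0) alongside the backend accessor shown in the SYNOPSIS.

```perl
use strict;
use warnings;
use Future;

# Hypothetical helper: summarise a settled Future from this module.
# Assumption: the Result object has an `ok` accessor; `backend` is the
# accessor shown in the SYNOPSIS.
sub describe_outcome {
    my ($f) = @_;

    if ( $f->is_failed ) {
        # Endpoint Futures fail as Future->fail($error, 'crawl4ai'), so
        # ->failure yields the error followed by the category.
        my ( $error, $category ) = $f->failure;

        # Interpolating $error relies on the error object stringifying
        # usefully, which is an assumption here.
        return sprintf 'failed (%s): %s', $category // 'unknown', $error;
    }

    # A done Future may still carry a Result with ok => 0.
    my $result = $f->get;
    return $result->ok
        ? 'ok via ' . $result->backend
        : 'all strategies failed';
}
```

With a real crawler this could be applied once the Future is ready, for example describe_outcome( $crawler->markdown($url)->await ) on a Future version that provides await.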

crawl4ai

The underlying WWW::Crawl4AI orchestrator.

client

The underlying WWW::Crawl4AI::Client (used for request builders, response parsers and retry configuration).

poll_interval

Read/write accessor for the default job-status poll interval in seconds.

available_backends

Arrayref of backend names currently in the chain.

http

The underlying Net::Async::HTTP (lazily built and parented to this notifier).

do_request

Low-level: dispatch an HTTP::Request (typically built via $self->client->foo_request) through Net::Async::HTTP with the retry policy applied. Returns a Future of HTTP::Response.

crawl_once

$crawler->crawl_once($request, $backend?) → Future[\@pages]

Low-level single POST /crawl. Resolves to the arrayref of normalized pages (no chain, no classification). $request is a WWW::Crawl4AI::Request or a payload hashref.

md

$crawler->md($url_or_request, %opts) → Future[$markdown]

POST /md. Resolves to the markdown payload.

job_submit

$crawler->job_submit($request) → Future[{ task_id, raw }]

POST /crawl/job. Resolves to { task_id => ..., raw => {...} }.

job_status

$crawler->job_status($task_id) → Future[{ status, pages, raw }]

GET /crawl/job/$task_id. Resolves to { status, pages, raw }; fails with a type=job WWW::Crawl4AI::Error when the job reports FAILED.
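
For callers who want control over the polling loop, job_status can be combined with Future::Utils by hand. The sketch below is an illustration under assumptions, not the module's own implementation: poll_until_done is a hypothetical helper, and $delay is an assumed CodeRef returning a Future that resolves after the given number of seconds.

```perl
use strict;
use warnings;
use Future;
use Future::Utils qw( repeat );

# Hypothetical helper: poll job_status until the job reports COMPLETED.
# $delay is an assumed CodeRef returning a Future, for example
#   sub { $loop->delay_future( after => $_[0] ) }
sub poll_until_done {
    my ( $crawler, $task_id, $interval, $delay ) = @_;

    return repeat {
        my ($previous) = @_;    # undef on the first iteration

        # Pause only between polls, not before the first one.
        my $pause = defined $previous ? $delay->($interval) : Future->done;
        $pause->then( sub { $crawler->job_status($task_id) } );
    }
    until => sub {
        my ($trial) = @_;

        # A failed trial (for example a FAILED job) stops the loop and
        # becomes the failure of the returned Future.
        return 1 if $trial->is_failed;
        return $trial->get->{status} eq 'COMPLETED';
    };
}
```

The task id would come from job_submit, e.g. $crawler->job_submit($request)->then( sub { poll_until_done( $crawler, $_[0]{task_id}, 2, $delay ) } ). In most cases crawl_job_and_wait covers the same ground with less code.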

health

Resolves to 1 if the Crawl4AI server answers GET /health, else 0. Never fails.
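
Because health never fails, it can gate other calls without extra error handling. A small sketch, assuming a hypothetical helper name (markdown_if_healthy) and a made-up failure category ('health') that is not part of the module:

```perl
use strict;
use warnings;
use Future;

# Hypothetical helper: run the strategy chain only if the server is up.
sub markdown_if_healthy {
    my ( $crawler, $url ) = @_;

    return $crawler->health->then( sub {
        my ($up) = @_;    # 1 or 0, per the health contract
        return $crawler->markdown($url) if $up;

        # 'health' is an illustrative category, not one the module defines.
        return Future->fail( "Crawl4AI server is not healthy\n", 'health' );
    } );
}
```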

screenshot

$crawler->screenshot($url, wait_for => 2, wait_for_images => 1) → Future[$png_bytes]

pdf

$crawler->pdf($url) → Future[$pdf_bytes]

html

$crawler->html($url) → Future[$html]

execute_js

$crawler->execute_js($url, $script_or_arrayref) → Future[\%page]

llm

$crawler->llm($url, $query, %opts) → Future[$answer]

token

$crawler->token($email) → Future[\%token]

Future-returning single-URL action endpoints, mirroring WWW::Crawl4AI::Client: screenshot/pdf resolve to raw bytes, html to the preprocessed HTML, execute_js to a normalized page with js_result, llm to an answer string (needs a server-side LLM provider), and token to a JWT hash. They do not run the strategy chain.
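
Since screenshot and pdf resolve to raw bytes, they should be written with a binary layer. The following sketch fetches both concurrently with Future->needs_all and saves them. capture_page and write_bytes are hypothetical helpers, and the file naming is only an example.

```perl
use strict;
use warnings;
use Future;

# Write raw bytes with no encoding or newline translation.
sub write_bytes {
    my ( $path, $bytes ) = @_;
    open my $fh, '>:raw', $path or die "Cannot open $path: $!";
    print {$fh} $bytes;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

# Hypothetical helper: request a screenshot and a PDF at the same time,
# then save both. Resolves to the two file paths.
sub capture_page {
    my ( $crawler, $url, $basename ) = @_;

    return Future->needs_all(
        $crawler->screenshot($url),
        $crawler->pdf($url),
    )->then( sub {
        my ( $png, $pdf ) = @_;    # results arrive in request order
        return Future->done(
            write_bytes( "$basename.png", $png ),
            write_bytes( "$basename.pdf", $pdf ),
        );
    } );
}
```

needs_all fails as soon as either request fails, which suits an all-or-nothing capture; Future->wait_all would be the alternative if partial results are acceptable.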

crawl_job_and_wait

$crawler->crawl_job_and_wait($url_or_request, %opts) → Future[\%status]

Submits a crawl job (POST /crawl/job) and polls job_status every poll_interval seconds (override per call with poll_interval => N) until it reports COMPLETED. Resolves to the final status hash ({ status, pages, raw }); fails with a type=job WWW::Crawl4AI::Error on a failed job.
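
The description above names no upper bound on how long polling may run, so a caller may want to impose one. A sketch using Future->wait_any together with IO::Async::Loop's timeout_future follows. crawl_job_with_deadline and its deadline option are hypothetical, and whether cancelling the returned Future stops the module's polling promptly is an assumption that has not been verified here.

```perl
use strict;
use warnings;
use Future;

# Hypothetical helper: bound crawl_job_and_wait by a wall-clock deadline.
# `deadline` is an illustrative option consumed here, not by the module;
# all other options (such as poll_interval) are passed through.
sub crawl_job_with_deadline {
    my ( $crawler, $loop, $url, %opt ) = @_;
    my $seconds = delete $opt{deadline} // 120;

    # wait_any settles with whichever Future is ready first and cancels
    # the other one.
    return Future->wait_any(
        $crawler->crawl_job_and_wait( $url, %opt ),
        $loop->timeout_future( after => $seconds ),
    );
}
```

A call might look like crawl_job_with_deadline( $crawler, $loop, 'https://example.com', poll_interval => 1, deadline => 60 ).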

crawl

markdown

my $result = $crawler->markdown('https://example.com')->get;
my $result = $crawler->crawl( url => 'https://example.com' )->get;

Run the strategy chain asynchronously and resolve to a WWW::Crawl4AI::Result. Same chain, same result object as the synchronous "markdown" in WWW::Crawl4AI. Accepts a single positional URL or named arguments with a url key.

The Future never fails for per-strategy errors: each failed strategy is an entry in the attempt history. An all-strategies-failed run resolves to a Result with ok => 0.
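
Because each markdown call returns its own Future, several URLs can be crawled side by side. The sketch below uses Future->wait_all so that one endpoint-level failure does not discard the other results. markdown_many is a hypothetical helper, and the shape of the returned hash is only a suggestion.

```perl
use strict;
use warnings;
use Future;

# Hypothetical helper: run the strategy chain for several URLs at once.
# Resolves to { $url => { result => $result } | { error => $error } }.
sub markdown_many {
    my ( $crawler, @urls ) = @_;
    my @pending = map { $crawler->markdown($_) } @urls;

    # wait_all never fails; it yields the component Futures in order.
    return Future->wait_all(@pending)->then( sub {
        my @settled = @_;
        my %by_url;
        for my $i ( 0 .. $#urls ) {
            my $f = $settled[$i];
            $by_url{ $urls[$i] } = $f->is_failed
                ? { error  => ( $f->failure )[0] }
                : { result => scalar $f->get };
        }
        return Future->done( \%by_url );
    } );
}
```

Each result entry still needs its own usability check, since a run in which every strategy failed resolves normally rather than failing.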

deep_crawl

my $results = $crawler->deep_crawl('https://example.com')->get;
my $results = $crawler->deep_crawl(
  'https://example.com',
  max_pages   => 50,
  max_depth   => 3,
  same_host   => 1,
  concurrency => 8,                                  # async-only
  url_filter  => sub { $_[0] !~ m{/login} },
  on_page     => sub { my ( $result, $depth ) = @_; ... },
  min_markdown => 200,            # any crawl() option is forwarded
)->get;

Asynchronous breadth-first crawl that follows the "urls" in WWW::Crawl4AI::Result of each good page. Resolves to an arrayref of WWW::Crawl4AI::Result objects in breadth-first order: the start URL first, then deeper pages grouped by depth. The list is reordered back to enqueue order, so a page that completes faster does not jump the queue. Semantics match "deep_crawl" in WWW::Crawl4AI, except that each depth level's frontier is crawled concurrently (up to concurrency, default 4) instead of one page at a time.

Options: max_pages (default 25), max_depth (default 2, start URL is depth 0), same_host (default true), concurrency (default 4, async-only), url_filter (($url) -> bool), on_page (($result, $depth)). URLs are deduplicated with the fragment stripped. Any remaining options are forwarded to each "crawl".
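
To make the ordering and deduplication rules concrete, here is a pure-Perl sketch of a breadth-first walk over an in-memory link graph. It is an illustration of the described semantics, not the module's code: bfs_order and the $links hash are hypothetical, no network access is involved, and the exact way max_pages cuts off a level is an assumption.

```perl
use strict;
use warnings;

# Hypothetical model of the traversal: $links maps a URL to the URLs
# found on that page. Returns [ [ $url, $depth ], ... ] in visit order.
sub bfs_order {
    my ( $start, $links, %opt ) = @_;
    my $max_pages = $opt{max_pages} // 25;    # documented default
    my $max_depth = $opt{max_depth} // 2;     # start URL is depth 0

    # Deduplicate on the URL with its fragment stripped.
    my $strip = sub { my ($u) = @_; $u =~ s/#.*\z//s; return $u };

    my @frontier = ( $strip->($start) );
    my %seen     = ( $frontier[0] => 1 );
    my @order;

    for my $depth ( 0 .. $max_depth ) {
        my @next;
        for my $url (@frontier) {
            last if @order >= $max_pages;
            push @order, [ $url, $depth ];

            # Enqueue unseen links in the order they appear on the page.
            for my $link ( @{ $links->{$url} // [] } ) {
                my $key = $strip->($link);
                next if $seen{$key}++;
                push @next, $key;
            }
        }
        last if !@next || @order >= $max_pages;
        @frontier = @next;
    }
    return \@order;
}
```

In the real method each frontier is fetched concurrently, which is why the results have to be put back into this enqueue order before the arrayref is returned.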

SUPPORT

Issues

Please report bugs and feature requests on GitHub at https://github.com/Getty/p5-net-async-crawl4ai/issues.

CONTRIBUTING

Contributions are welcome! Please fork the repository and submit a pull request.

AUTHOR

Torsten Raudssus <torsten@raudssus.de> https://raudss.us/

COPYRIGHT AND LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

