NAME

WWW::Mixi::Scraper - yet another mixi scraper

SYNOPSIS

    use WWW::Mixi::Scraper;
    my $mixi = WWW::Mixi::Scraper->new(
      email => 'foo@bar.com', password => 'password'
    );

    my @list = $mixi->parse('http://mixi.jp/new_friend_diary.pl');
    my @list = $mixi->new_friend_diary->parse;

    my @list = $mixi->parse('http://mixi.jp/new_bbs.pl?page=2');
    my @list = $mixi->new_bbs->parse( page => 2 );

    my $diary = $mixi->parse('/view_diary.pl?id=0&owner_id=0');
    my $diary = $mixi->view_diary->parse( id => 0, owner_id => 0 );

    my @comments = @{ $diary->{comments} };

    # for testing
    my $html = read_file('/some/where/mixi.html');
    my $diary = $mixi->parse('/view_diary.pl', html => $html );
    my $diary = $mixi->view_diary->parse( html => $html );

DESCRIPTION

This is yet another 'mixi' (the largest SNS in Japan) scraper, powered by Web::Scraper. Though its API is different from, and incompatible with, the earlier WWW::Mixi, as of this writing I'm loosely trying to keep the corresponding return values looking similar (this may change in the future).

WWW::Mixi::Scraper is also pluggable, so if you want to scrape something it can't handle yet, add your own WWW::Mixi::Scraper::Plugin::<PLfileBasenameInCamel>, and it'll work for you.
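A plugin is essentially a Web::Scraper recipe living under the right package name. The skeleton below is purely illustrative: the package name, method name, and CSS selector are hypothetical, and the actual hooks expected by the dispatcher should be checked against the bundled plugins (such as the one backing view_diary.pl).

```perl
package WWW::Mixi::Scraper::Plugin::NewSomething;  # would map to new_something.pl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sketch: the method name and its arguments are
# illustrative only; consult the bundled plugins for the real
# interface before writing your own.
sub scrape {
  my ($self, $html) = @_;
  my $rule = scraper {
    # Collect link text and href from each anchor in the list area.
    process 'div.listArea a', 'links[]' => { text => 'TEXT', link => '@href' };
  };
  return $rule->scrape(\$html);
}

1;
```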

DIFFERENCES BETWEEN THE TWO

WWW::Mixi has a much longer history and is full-stack. The data it returns tends to be more complete, fine-tuned, and raw in many ways (including encoding). However, it tends to break on minor HTML changes, as it relies heavily on regexes, and it is rather monolithic.

In contrast, WWW::Mixi::Scraper should tend to survive minor HTML changes, as it relies on XPath. It returns decoded Perl strings rather than raw octets, and it is smaller and pluggable. However, its data is more or less pre-processed and tends to lose some detail, such as proper line breaks. It is also more easily polluted with garbage (partly because mixi doesn't rely much on CSS, it is hard to pin down the exact area to scrape with XPath), and XPath rules may be harder to understand and maintain.

Which to choose? It depends. For now ::Scraper is rather limited, but if all you want is rough data telling you who updated, or what was updated, ::Scraper may be a good option.

METHODS

new

Creates an object. You can pass an optional hash; the important keys are:

email, password

The credentials you use to log in.

cookie_jar

Would be passed through to WWW::Mechanize. If your cookie_jar holds valid cookies for mixi, you don't need to write your email/password in your scripts.

Any other options are passed through to WWW::Mechanize as well.
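For instance, a saved cookie jar can stand in for credentials. This is a minimal sketch; the cookie file path is hypothetical, and it assumes the jar already holds a valid mixi session:

```perl
use WWW::Mixi::Scraper;
use HTTP::Cookies;

# Reuse cookies from a previous session, so email/password need
# not appear in the script. cookie_jar is a standard
# WWW::Mechanize option and is passed through as-is.
my $mixi = WWW::Mixi::Scraper->new(
  cookie_jar => HTTP::Cookies->new(
    file     => "$ENV{HOME}/.mixi_cookies.dat",  # hypothetical path
    autosave => 1,
  ),
);
```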

parse

Takes a URI and returns the scraped data: usually an array, sometimes a hash reference, possibly something else, depending on the plugin that does the actual scraping. You can pass an optional hash whose values override the query parameters of the URI. The exception is 'html', which doesn't go into the URI but provides a raw HTML string to the scraper (mainly for testing).
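Both behaviours can be seen in a short sketch (the saved-page path is hypothetical, and $mixi is an object built as under "new"):

```perl
# A key in the optional hash overrides the same query parameter
# in the URI: this fetches page 3, not page 1.
my @list = $mixi->parse('http://mixi.jp/new_bbs.pl?page=1', page => 3);

# 'html' bypasses fetching entirely and feeds saved HTML to the
# scraper, which is handy for offline tests.
open my $fh, '<:utf8', 't/data/view_diary.html' or die $!;  # hypothetical path
my $html = do { local $/; <$fh> };
my $diary = $mixi->parse('/view_diary.pl', html => $html);
my @comments = @{ $diary->{comments} };
```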

TO DO

More scraper plugins, various levels of caching, password obfuscation, getters for minor information such as pagers, counters, and image/thumbnail sources, and maybe more docs.

Anyway, as this is a 'scraper', I don't include 'post'-related methods here. If you insist, use the WWW::Mechanize object hidden in the stash, or use WWW::Mixi.

SEE ALSO

WWW::Mixi, Web::Scraper, WWW::Mechanize

AUTHOR

Kenichi Ishigaki, <ishigaki at cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2007 by Kenichi Ishigaki.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.