The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Novel::Robot

download novel /bbs thread 小说/贴子下载器

site

support novel/forum website 支持小说/贴子站点

Novel::Robot::Parser

type

support robot ouput file type, 支持小说输出形式

Novel::Robot::Packer

INSTALL

for example, on debian, 以debian环境为例

    $ apt-get install parallel ansible calibre cpanminus
    $ cpanm Novel::Robot

example

    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t epub
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -o mytest.epub
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -o abc.epub -i 1-3

ARG

    -s : site, 指定站点

    -u : book url,小说url

    -w : writer name 作者名
    -b : book name,书名
    -f : txt file / txt file dir, 指定文本文件来源(可以是单个目录或文件)
    -c : firefox cookies.sqlite / netscape HTTP cookie file / cookie string, details in Novel::Robot::Browser

    -t : packer type (html/txt), 小说保存类型,例如txt/html
    -o : output packer filename, 保存的小说文件名

    -i : min_item_num-max_item_num, 只取 x-y 章/楼
    -j : min_page_num-max_page_num, 只取 x-y 页

    -D : only print info, not download, 只输出信息,不下载
    -v : verbose
    --progress: 显示进度条(默认不显示)

    --use_chrome      : use chrome dump_dom, only support GET method
    --with_toc        : 小说保存时是否生成目录(默认是)
    --grep_content    : 提取关键字
    --filter_content  : 过滤关键字
    --only_poster     : 贴子只看楼主
    --min_content_word_num : 贴子每楼层最小字数
    --max_process_num : 进程个数 
    --chapter_regex: 指定分割章节的正则表达式(例如:"第[ \\t\\d]+章")

    --item_list_path : xpath to extract item_list, 提取item_list的路径
    --content_path    : xpath to extract content, 提取content的路径
    --writer_path     : xpath to extract writer, 提取writer的路径
    --book_path       : xpath to extract book, 提取book的路径
    --content_regex   : regex to extract content, 提取content的正则
    --writer_regex    : regex to extract writer, 提取writer的正则
    --book_regex      : regex to extract book, 提取book的正则

    -B : board_url / board_id  版块url,或版块编号
    -q : query type, 查询类型
    -k : query keyword, 查询关键字

download novel

download novel from url

下载小说

    get_novel.pl -u [url] -t [type] -i [min_item_num-max_item_num] --cookie [cookie] -o [dst_file/dst_dir]
    get_novel.pl -u [小说目录页url] -t [目标文件类型] -i [起止章节号] --cookie [cookie登录信息] -o [目标文件名/目标文件夹]

    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t txt
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html

    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html -i 3
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html -i 3-4
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html -i 3-
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html -i -3

use chrome

阿里文学

    get_novel.pl -u "https://www.aliwx.com.cn/chapter?bid=7964189" --use_chrome --writer_path "//div[@class='chapterbox']//span" --book_regex "<title>(.+?)-" --content_path "//div[@class='chapter-content']"

公众号

    get_novel.pl -u "https://mp.weixin.qq.com/s?xxxx" --use_chrome --content_path "//div[@id='js_content']" --item_list_path "//div[@id='js_content']//a"  -w somewriter -b somebook

以firefox浏览器为例,先登录对应站点,然后用 cookies.txt 扩展导出cookies.txt,则可以下载当前登录账号所购买的小说; 也可直接指定firefox配置目录下的cookies.sqlite文件; 或者直接使用m.jjwxc.net的cookie字符串

绿晋江VIP

    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=217747" -i 33-34 --cookie cookies.txt
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=217747" -i 33-34 --cookie ~/.mozilla/firefox/*/cookies.sqlite
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=217747" -i 33-34 --cookie "name1=value1; name2=value2"

download by site, writer, book

    get_novel.pl -s lofter -t epub -w chuweizhiyu -b 时之足
    get_novel.pl -s lofter -t epub -w chuweizhiyu -b 时之足 -i 3-
    get_novel.pl -s lofter -t epub -w chuweizhiyu -b 时之足 -i 3-5

parse txt

parse chapter name with regex, convert txt to ebook

可指定章节标题的正则式,把txt文件转成电子书

    get_novel.pl -s txt -w [writer] -b [book] -f [txt_file/directory] -t [type] -r [chapter_regex]
    get_novel.pl -s txt -w [作者] -b [书名] -f [txt文件或目录] -t [目标文件类型] -r [章节标题匹配的正则式]

    get_novel.pl -s txt -w 牵机 -b 断情逐妖记 -f dq1.txt -t html
    get_novel.pl -s txt -w 牵机 -b 断情逐妖记 -f dq1.txt,dq2.txt,dir1 -r "第[ \\t\\d]+章" -t html
    get_novel.pl -s txt -f 飘灯-像妖怪一样自由.txt -t html
    get_novel.pl -s txt -f 飘灯-风尘叹.txt -t epub
    get_novel.pl -s txt -f fct.txt -w 飘灯 -b 风尘叹 -t epub

only print info

only print info, but not download, 输出小说信息(不下载)

    get_novel.pl -u [url] -D 1
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838"  -D 1

convert ebook

use calibre to convert novel file into epub/epub/..., default filename format is [writer]-[bookname].[type]

使用calibre将下载的 html格式 的小说转换成 其他格式的电子书,例如epub、epub等等。如果未指定writer及book选项,则需要将html源文件名称设置为 [作者-书名]

    conv_novel.pl -f [input_file] -o [output_file] -t [type] -w [writer] -b [book]
    conv_novel.pl -f [源文件] -o [目标文件] -t [目标文件类型(小写)] -w [作者] -b [书名]

    conv_novel.pl -f 天平-风起阿房.html -t epub
    conv_novel.pl -f mxj.html -w 施定柔 -b 迷侠记 -t epub

send email

download/convert novel, use calibre-smtp to send ebook to email address : xxx@kindle.com

下载小说并使用calibre-smtp推送到指定邮箱

local smtp service 本地已安装smtp服务

    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t epub -T "xxx@kindle.com" -F "yyy@somesite.cn"
    get_novel.pl -f fct.txt -w 飘灯 -b 风尘叹 -t epub -T "xxx@kindle.com" -F "yyy@somesite.cn"

remote smtp service 使用远程smtp服务

    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t epub -T "xxx@kindle.com" -i 1-3  -M smtp.src.com -p 587 -U xxx -P somepwd -F yyy@somesite.cn
    get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t epub -T "xxx@kindle.com" -M smtp.qq.com -p 587 -U yyy -P 'aaaaaaaaaaaaagga' -F yyy@qq.com
    get_novel.pl -f fct.txt -w 飘灯 -b 风尘叹 -t epub  -T "xxx@kindle.com" -M smtp.qq.com -p 587 -U yyy -P 'aaaaaaaaaaaaagga' -F yyy@qq.com

use ansible,push ebook to remote host, and then send email

使用ansible把电子书上传到远程服务器,再在远程服务器调用calibre-smtp在服务器直接smtp推送

bulk novel info

    get bulk novel info: <writer,book,url> , default option is not download

query

查询关键字,不下载

    get_novel.pl -s jjwxc -q 作者 -k 牵机 -i 1-10 -D 1
    get_novel.pl -s hjj -b 153 -q 贴子主题 -k 迷侠记 -D 1

board

批量获取版块/作者专栏的小说信息

    get_novel.pl -B "http://www.jjwxc.net/oneauthor.php?authorid=14644" -D 1
    get_novel.pl -B "http://www.jjwxc.net/oneauthor.php?authorid=14644" -i 1-3 -D 1
    get_novel.pl -B "http://bbs.jjwxc.net/board.php?board=153&page=0" -P 1-2 -D 1
    get_novel.pl -B "http://bbs.jjwxc.net/board.php?board=153&page=0" -i 1-20 -D 1

bulk download

下载专栏内的所有小说

    get_novel.pl -B "http://www.jjwxc.net/oneauthor.php?authorid=14644" -D 0

manually select some novels, use parallel for multiple novels download

手动选择下载部分小说,用parallel批量调用get_novel.pl获取epub

    get_novel.pl -s jjwxc -q 作者 -k 牵机 -D 1  > raw_booklist.txt
    awk -F, '$1=="牵机"' raw_booklist.txt > refine_booklist.txt
    parallel --colsep , get_novel.pl -u "{3}" -t epub :::: refine_booklist.txt

FUNCTION

    my $xs = Novel::Robot->new(
    site => 'jjwxc',
    type => 'html', 
    );

    my $index_url = 'http://www.jjwxc.net/onebook.php?novelid=2456';
    $xs->get_novel($index_url);