NAME

EBook::Tools::Unpack - An object class for unpacking E-book files into their component parts and metadata

SYNOPSIS

 use EBook::Tools::Unpack;
 my $unpacker = EBook::Tools::Unpack->new(
    'file'     => $filename,
    'dir'      => $dir,
    'encoding' => $encoding,
    'format'   => $format,
    'raw'      => $raw,
    'author'   => $author,
    'title'    => $title,
    'opffile'  => $opffile,
    'tidy'     => $tidy,
    'nosave'   => $nosave,
    );
 $unpacker->unpack;

or, more simply:

 use EBook::Tools::Unpack;
 my $unpacker = EBook::Tools::Unpack->new('file' => 'mybook.prc');
 $unpacker->unpack;

DEPENDENCIES

Perl Modules

  • HTML::Tree

  • Image::Size

  • List::MoreUtils

  • P5-Palm

  • Palm::Doc

CONSTRUCTOR

new(%args)

Instantiates a new EBook::Tools::Unpack object.

Arguments

  • file

    The file to unpack. Specifying this is mandatory.

  • dir

    The directory to unpack into. If not specified, defaults to the basename of the file.

  • encoding

    If specified, overrides the encoding to use when unpacking. This is normally detected from the file and does not need to be specified.

    Valid values are '1252' (specifying Windows-1252) and '65001' (specifying UTF-8).

  • key

    The decryption key to use if necessary (not yet implemented).

  • keyfile

    The file holding the decryption keys to use if necessary (not yet implemented).

  • language

    If specified, overrides the detected language information.

  • opffile

    The name of the file in which the metadata will be stored. If not specified, defaults to the value of dir with .opf appended.

  • raw

    If set to true, no corrections are applied to any extracted text, and a lot of raw, unparsed, unmodified data is dumped into the directory along with everything else. This is useful for debugging exactly what was in the file being unpacked and, when combined with nosave, for reducing the time needed to extract parsed data from an ebook container without actually unpacking it.

  • author

    Overrides the detected author name.

  • title

    Overrides the detected title.

  • tidy

    If set to true, the unpacker will run tidy on any HTML output files to convert them to valid XHTML. Be warned that this can occasionally change the formatting, as Tidy isn't very forgiving on certain common tricks (such as empty <pre> elements with style elements) that abuse the standard.

  • nosave

    If set to true, the unpacker will run through all of the unpacking steps except those that actually write to the disk. This is useful for testing, but also (particularly when combined with raw) can be used for extracting parsed data from an ebook container without actually unpacking it.

ACCESSOR METHODS

See "new()" for more details on what some of these mean. Note that some values cannot be autodetected until an unpack method executes.

author

dir

file

filebase

In scalar context, this is the basename of file. In list context, it actually returns the basename, directory, and extension as per fileparse from File::Basename.
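The context-sensitive behaviour described above can be sketched with wantarray and fileparse. This is only an illustration: filebase_sketch is a hypothetical stand-in that takes a path directly, and the suffix pattern (strip the final extension) is an assumption about how the module calls fileparse.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical stand-in for the filebase accessor, taking a path directly.
sub filebase_sketch {
    my ($file) = @_;
    # fileparse returns (basename, directory, extension) when given a
    # suffix pattern; the pattern here is an assumption.
    my @parts = fileparse($file, qr/\.[^.]*/);
    return wantarray ? @parts : $parts[0];
}

my $base = filebase_sketch('/home/user/mybook.prc');
my ($name, $dirpath, $ext) = filebase_sketch('/home/user/mybook.prc');
print "$base\n";
print "$name | $dirpath | $ext\n";
```

In scalar context only the basename comes back; in list context all three components are available.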

format

key

keyfile

language

This returns the language specified by the user, if any. It remains undefined if the user has not requested that a language code be set even if a language was autodetected.

opffile

raw

title

This returns the title specified by the user, if any. It remains undefined if the user has not requested a title be set even if a title was autodetected.

detected

This returns a hash containing the autodetected metadata, if any.

MODIFIER METHODS

detect_format()

Attempts to automatically detect the format of the input file. Croaks if it can't. This both sets the object internal values and returns a two-scalar list, where the first scalar is the detected format and the second is a string that may contain additional detected information (such as a title or version).

This is automatically called by "new()" if the format argument is not specified.
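As a rough illustration of the two-scalar return value, the sketch below identifies a format from the type and creator fields of a Palm database (PDB) header. It is not the module's actual detection logic: the function name, the format names 'mobi' and 'palmdoc', and the choice of the PDB name as the extra information are all assumptions made for the example.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical mapping from PDB type+creator strings to format names.
my %pdb_formats = (
    'BOOKMOBI' => 'mobi',
    'TEXtREAd' => 'palmdoc',
);

# Sketch: detect a format from the first 78 bytes of a PDB file.
sub detect_format_sketch {
    my ($header) = @_;
    croak "PDB header too short" if length($header) < 78;
    # Bytes 0-31 hold the null-padded database name;
    # bytes 60-67 hold the type and creator codes.
    my ($name, $ident) = unpack('Z32 x28 a8', $header);
    my $format = $pdb_formats{$ident}
        or croak "Unrecognized PDB identifier '$ident'";
    return ($format, $name);
}

# Build a fake 78-byte header purely for demonstration.
my $fake = pack('a32 x28 a8 x10', 'My Book', 'BOOKMOBI');
my ($format, $info) = detect_format_sketch($fake);
print "$format: $info\n";
```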

detect_from_mobi_headers()

Detects metadata values from the MOBI headers retrieved via "unpack_mobi_header()" and "unpack_mobi_exth()" and places them into the detected attribute.

gen_opf(%args)

This generates an OPF file from detected and specified metadata. It does not honor the nosave flag, and will always write its output.

Normally this is called automatically from inside the unpack methods, but if the nosave flag was set it can also be called manually after an unpack to write an OPF file anyway.

Returns the filename of the OPF file.

Arguments

  • opffile (optional)

    If specified, this overrides the object attribute opffile, and determines the filename to use for the generated OPF file. If not specified, and the object attribute opffile has somehow been cleared (the attribute is set during "new()"), it will be generated by looking at the htmlfile argument. If no value can be found, the method croaks. If a value was found somewhere other than the object attribute opffile, then the object attribute is updated to match.

  • textfile (optional)

    The file containing the main text of the document. If specified, the method will attempt to split metadata out of the file and add whatever remains to the manifest of the OPF.

unpack()

This is a dispatcher for the specific unpacking methods needed to unpack a particular format. Unless you feel a need to override the unpacking method specified or detected during object construction, it is probably better to call this than the specific unpacking methods.
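A dispatcher of this kind can be sketched by looking up a method named after the format. The DispatchDemo package below is entirely hypothetical and only illustrates the pattern; it makes no claim about how EBook::Tools::Unpack selects its methods internally.

```perl
use strict;
use warnings;
use Carp;

package DispatchDemo;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Placeholder format-specific methods for the demonstration.
sub unpack_mobi    { return 'unpacked as mobi' }
sub unpack_palmdoc { return 'unpacked as palmdoc' }

# Look up "unpack_<format>" and call it, croaking on unknown formats.
sub unpack {
    my ($self) = @_;
    my $method = $self->can('unpack_' . $self->{'format'})
        or Carp::croak("No unpacker for format '$self->{'format'}'");
    return $self->$method();
}

package main;

print DispatchDemo->new('format' => 'palmdoc')->unpack, "\n";
```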

unpack_mobi()

Unpacks Mobipocket (.prc / .mobi) files.

unpack_mobi_record0($data)

Converts the information in the header data of PDB record 0 to entries inside the datahashes attribute.

Keys

The following keys are added to datahashes:

unpack_palmdoc()

Unpacks PalmDoc / AportisDoc (.pdb) files.

usedir()

Changes the current working directory to the directory specified by the object, creating it if necessary.

PROCEDURES

No procedures are exported by default, and in fact since the final module location for some of these procedures has not yet been finalized, none are even exportable.

Consider these to be private subroutines and use at your own risk.

fix_mobi_html(%args)

Takes raw Mobipocket output text and replaces the custom tags and file position anchors.

Arguments

  • textref

    A reference to the raw document text. The procedure croaks if this is not supplied.

  • encoding

    The encoding of the raw document text. Valid values are '1252' (Windows-1252) and '65001' (UTF-8). If not specified, '1252' will be assumed.

  • filename

    The name of the output HTML file (used in generating hrefs). The procedure croaks if this is not supplied.

  • nonewlines

    If this is set to true, the procedure will not attempt to insert newlines for readability. This will leave the output in a single unreadable line, but has the advantage of reducing the processing time, especially useful if tidy is going to be run on the output anyway.
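The numeric encoding codes above correspond to encodings that Perl's core Encode module knows as 'cp1252' and 'UTF-8'. The following sketch shows one way raw text could be decoded with the documented default of '1252'; decode_mobi_text is a hypothetical helper and is not part of this module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Map the numeric codes used in this documentation to Encode names.
my %encoding_names = (
    '1252'  => 'cp1252',
    '65001' => 'UTF-8',
);

# Hypothetical helper: decode raw bytes, assuming '1252' when no code is given.
sub decode_mobi_text {
    my ($bytes, $encoding) = @_;
    $encoding = '1252' unless defined $encoding;
    my $name = $encoding_names{$encoding}
        or die "Unsupported encoding code '$encoding'\n";
    return decode($name, $bytes);
}

# Five UTF-8 bytes decode to four characters.
my $text = decode_mobi_text("caf\xC3\xA9", '65001');
printf "%d characters\n", length($text);
```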

hexstring($bindata)

Takes as an argument a scalar containing a sequence of binary bytes. Returns a string converting each octet of the data to its two-digit hexadecimal equivalent. There is no leading "0x" on the string.
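The described conversion can be reproduced with Perl's built-in unpack and the H* template. This is a sketch of the behaviour rather than the module's own code; note that H* yields lowercase digits, and the letter case of the real procedure's output is not stated here.

```perl
use strict;
use warnings;

# Sketch of the described behaviour using the core unpack function.
sub hexstring_sketch {
    my ($bindata) = @_;
    return unpack('H*', $bindata);   # two hex digits per octet, no "0x" prefix
}

print hexstring_sketch("\x00\xFFAB"), "\n";   # prints 00ff4142
```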

unpack_mobi_exth($headerdata)

Takes as an argument a scalar containing the variable-length Mobipocket EXTH data from the first record. Returns an array of hashes, each hash containing the data from one EXTH record with values from that data keyed to recognizable names.

If $headerdata doesn't appear to be an EXTH header, carps a warning and returns an empty list.

See:

http://wiki.mobileread.com/wiki/MOBI

Hash keys

  • type

    A numeric value indicating the type of EXTH data in the record. See package variable %exthtypes.

  • length

    The length of the data value in bytes.

  • data

    The data of the record.
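To make the returned structure concrete, here is a minimal sketch of an EXTH parser. It assumes the layout described on the MobileRead wiki page linked above: an 'EXTH' identifier, a 32-bit header length, a 32-bit record count, then records consisting of a 32-bit type, a 32-bit length that includes those 8 prefix bytes, and the data. The function name is hypothetical, and the name lookup via %exthtypes is left out.

```perl
use strict;
use warnings;
use Carp;

# Sketch: split EXTH data into records of { type, length, data }.
sub parse_exth_sketch {
    my ($headerdata) = @_;
    unless (defined $headerdata
            && length($headerdata) >= 12
            && substr($headerdata, 0, 4) eq 'EXTH') {
        carp "Data does not look like an EXTH header";
        return ();
    }
    my ($headerlen, $count) = unpack('x4 N N', $headerdata);
    my @records;
    my $offset = 12;    # skip identifier, header length, and record count
    for (1 .. $count) {
        last if $offset + 8 > length($headerdata);
        my ($type, $reclen) = unpack('N N', substr($headerdata, $offset, 8));
        last if $reclen < 8;    # malformed record: stop instead of looping
        my $data = substr($headerdata, $offset + 8, $reclen - 8);
        push @records,
            { 'type' => $type, 'length' => length($data), 'data' => $data };
        $offset += $reclen;
    }
    return @records;
}

# Build a one-record EXTH block by hand (type 100 is used for illustration).
my $author = 'Jane Doe';
my $exth = 'EXTH' . pack('N N', 12 + 8 + length($author), 1)
         . pack('N N', 100, 8 + length($author)) . $author;
print "$_->{'type'}: $_->{'data'}\n" for parse_exth_sketch($exth);
```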

unpack_mobi_header($headerdata)

Takes as an argument a scalar containing the variable-length Mobipocket-specific header data from the first record. Returns a hash containing values from that data keyed to recognizable names.

See:

http://wiki.mobileread.com/wiki/MOBI

keys

The returned hash will have the following keys (documented in the order in which they are encountered in the header):

identifier

This should always be the string 'MOBI'. If it isn't, the procedure croaks.

headerlength

This is the size of the complete header. If this value is different from the length of the argument, the procedure croaks.

type

A numeric code indicating what category of Mobipocket file this is.

encoding

A numeric code representing the encoding. Expected values are '1252' (for Windows-1252) and '65001' (for UTF-8).

The procedure carps a warning if an unexpected value is encountered.

uniqueid

This is thought to be a unique ID for the book, but its actual use is unknown.

Use with caution. This key may be renamed in the future if more information is found.

version

This is thought to be the Mobipocket format version. A second version code shows up again later as version2, which is usually the same on unprotected books but different on DRM-protected books.

Use with caution. This key may be renamed in the future if more information is found.

reserved

40 bytes of reserved data.

Use with caution. This key may be renamed in the future if more information is found.

nontextrecord

This is thought to be an index to the first PDB record other than the header record that does not contain the book text.

Use with caution. This key may be renamed in the future if more information is found.

titleoffset

Offset in record 0 (not from start of file) of the full title of the book.

titlelength

Length in bytes of the full title of the book.

unknownlanguage

16 bits of unknown data thought to be related to the book language.

Use with caution. This key may be renamed in the future if more information is found.

region

The specific region of language. See %mobilangcodes for an exact map of values.

The bottom two bits of this value appear to be unused (i.e. all values are multiples of 4).

language

A main language code. See %mobilangcodes for an exact map of values.

unknowndilanguage

16 bits of unknown data thought to be related to the dictionary input language.

Use with caution. This key may be renamed in the future if more information is found.

dictionaryinregion

The specific region of dictionaryinlanguage. See %mobilangcodes for an exact map of values.

dictionaryinlanguage

The language code for the DictionaryInLanguage element. See %mobilangcodes for an exact map of values.

unknowndolanguage

16 bits of unknown data thought to be related to the dictionary output language.

Use with caution. This key may be renamed in the future if more information is found.

dictionaryoutregion

The specific region of dictionaryoutlanguage. See %mobilangcodes for an exact map of values.

dictionaryoutlanguage

The language code for the DictionaryOutLanguage element. See %mobilangcodes for an exact map of values.

version2

This is another Mobipocket format version related to DRM. If no DRM is present, it should be the same as version.

Use with caution. This key may be renamed in the future if more information is found.

imagerecord

This is thought to be an index to the first record containing image data.

Use with caution. This key may be renamed in the future if more information is found.

unknown96

Unsigned long int (32-bit) at offset 96.

Use with caution. This key may be renamed in the future if more information is found.

unknown100

Unsigned long int (32-bit) at offset 100.

Use with caution. This key may be renamed in the future if more information is found.

unknown104

Unsigned long int (32-bit) at offset 104.

Use with caution. This key may be renamed in the future if more information is found.

unknown108

Unsigned long int (32-bit) at offset 108.

Use with caution. This key may be renamed in the future if more information is found.

exthflags

A 32-bit bitfield related to the Mobipocket EXTH data. If bit 6 (0x40) is set, then there is at least one EXTH record.
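The bit test described above amounts to a bitwise AND against 0x40. A minimal sketch, using a hypothetical helper name:

```perl
use strict;
use warnings;

# Returns 1 if bit 6 (0x40) of the exthflags bitfield is set, otherwise 0.
sub has_exth_records {
    my ($exthflags) = @_;
    return ($exthflags & 0x40) ? 1 : 0;
}

printf "0x%02X -> %d\n", $_, has_exth_records($_) for (0x00, 0x40, 0x50);
```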

unknown116

36 bytes of unknown data at offset 116. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

drmcode

A number thought to be related to DRM. If present and no DRM is set, contains either the value 0xFFFFFFFF (normal books) or 0x00000000 (samples). This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown156

20 bytes of unknown data at offset 156, usually zeroes. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown176

16 bits of unknown data at offset 176. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown178

16 bits of unknown data at offset 178. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown180

32 bits of unknown data at offset 180. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown184

32 bits of unknown data at offset 184. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown188

32 bits of unknown data at offset 188. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown192

32 bits of unknown data at offset 192. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown196

32 bits of unknown data at offset 196. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown200

Unknown data of unknown length running to the end of the header. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.
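The first few keys and the checks attached to them can be sketched as follows. This is an illustration only: the function name is hypothetical, and the sketch assumes that each of the first six fields is 4 bytes wide with numeric values stored big-endian, which is consistent with the offsets quoted above but is not taken from the module's source.

```perl
use strict;
use warnings;
use Carp;

# Sketch: decode the first six fields of a MOBI header and apply the
# identifier, length, and encoding checks described in this section.
sub parse_mobi_header_start {
    my ($headerdata) = @_;
    croak "Header data too short" if length($headerdata) < 24;
    my %header;
    @header{qw(identifier headerlength type encoding uniqueid version)}
        = unpack('a4 N5', $headerdata);
    croak "Identifier is not 'MOBI'"
        unless $header{'identifier'} eq 'MOBI';
    croak "Header length does not match the data supplied"
        unless $header{'headerlength'} == length($headerdata);
    carp "Unexpected encoding '$header{'encoding'}'"
        unless $header{'encoding'} == 1252 || $header{'encoding'} == 65001;
    return %header;
}

# A 24-byte header built by hand; real headers are considerably longer.
my %h = parse_mobi_header_start(pack('a4 N5', 'MOBI', 24, 2, 65001, 12345, 6));
print "type $h{'type'}, encoding $h{'encoding'}, version $h{'version'}\n";
```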

unpack_palmdoc_header

Takes as an argument a scalar containing the 16 bytes of the PalmDoc header (also used by Mobipocket). Returns a hash containing those values keyed to recognizable names.

See:

http://wiki.mobileread.com/wiki/DOC#PalmDOC

and

http://wiki.mobileread.com/wiki/MOBI

keys

The returned hash will have the following keys:

  • compression

    Possible values:

    1 - no compression
    2 - PalmDoc compression
    ?? - HuffDic?
    17480 - Mobipocket DRM?

    A warning will be carped if an unknown value is found.

  • textlength

    Uncompressed length of book text in bytes.

  • textrecords

    Number of PDB records used for book text.

  • recordsize

    Maximum size of each record containing book text. This should always be 2048 (for some Mobipocket files) or 4096 (for everything else). A warning will be carped if it isn't.

  • unused

    Two bytes that should always be zero. A warning will be carped if they aren't.

Note that the current position component of the header is discarded.
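A sketch of how such a 16-byte header might be decoded with unpack is shown below. The function name is hypothetical, and the field order and widths (16-bit compression, 16-bit unused, 32-bit text length, 16-bit record count, 16-bit record size, 32-bit current position, all big-endian) follow the MobileRead pages linked above rather than this module's source.

```perl
use strict;
use warnings;
use Carp;

# Sketch: decode a 16-byte PalmDoc header into the keys listed above.
sub parse_palmdoc_header_sketch {
    my ($headerdata) = @_;
    croak "PalmDoc header must be 16 bytes"
        unless defined $headerdata && length($headerdata) == 16;
    my %header;
    # The template stops before the final 4 bytes (current position),
    # so that component is simply never read.
    @header{qw(compression unused textlength textrecords recordsize)}
        = unpack('n n N n n', $headerdata);
    carp "Unknown compression value $header{'compression'}"
        unless grep { $header{'compression'} == $_ } (1, 2, 17480);
    carp "Unexpected record size $header{'recordsize'}"
        unless $header{'recordsize'} == 2048 || $header{'recordsize'} == 4096;
    carp "Unused bytes are not zero" unless $header{'unused'} == 0;
    return %header;
}

# Hand-built header: PalmDoc compression, 100000 bytes of text in 25 records.
my %pd = parse_palmdoc_header_sketch(pack('n n N n n N', 2, 0, 100000, 25, 4096, 0));
print "compression $pd{'compression'}, $pd{'textrecords'} records of up to $pd{'recordsize'} bytes\n";
```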

BUGS/TODO

  • DRM isn't handled. Infrastructure to support this via an external plug-in module may eventually be built, but it will never become part of the main module for legal reasons.

  • Mobipocket HuffDic encoding (used mostly on dictionaries) isn't supported yet.

  • Not all Mobipocket data is understood, so a conversion from OPF to Mobipocket .prc back to OPF will not result in all data being retained. Patches welcome.

  • Mobipocket EXTH subjectcode records may not end up attached to the correct subject element if the number of subject records differs from the number of subjectcode records. This is because the Mobipocket format leaves the EXTH subjectcode records completely unlinked from the subject records, and there is no way to detect if a subject with no associated subjectcode comes before a subject with an associated subjectcode.

    Fortunately, this should rarely be a problem with real data, as Mobipocket Creator only allows a single subject to be set, and the only other way to have a subjectcode attached to a subject is to manually edit the OPF file and insert an additional dc:Subject element with a BASICCode attribute.

    Mobipocket has indicated that they may move data currently in their custom elements and attributes to the standard <meta> elements in a future release, so this problem may become moot then.

  • Unit tests are incomplete

  • Documentation is incomplete. Accessors in particular could use some cleaning up.

  • Need to implement setter methods for object attributes

  • Palm::Doc is currently used for extraction, with a lot of code in this module dedicated to extracting information that it can't. It may be better to split out that code into a dedicated module to replace Palm::Doc completely.

  • PDB Bookmarks aren't supported. This is a weakness inherited from Palm::Doc, and will take a while to fix.

  • Import/extraction/unpacking is currently limited to PalmDoc and Mobipocket. Extraction from eReader and Microsoft Reader (.lit) is also eventually planned. Other formats may follow from there.

AUTHOR

Zed Pobre <zed@debian.org>

COPYRIGHT

Copyright 2008 Zed Pobre

Licensed to the public under the terms of the GNU GPL, version 2