NAME
File::Extract - Extract Text From Arbitrary File Types
SYNOPSIS
    use File::Extract;

    my $e = File::Extract->new();
    my $r = $e->extract($filename);

    my $e = File::Extract->new(encodings => [...]);

    my $class = "MyExtractor";
    File::Extract->register_processor($class);

    my $filter = MyCustomFilter->new;
    File::Extract->register_filter($mime_type => $filter);
DESCRIPTION
File::Extract is a framework to extract text data out of arbitrary file types, useful to collect data for indexing.
CLASS METHODS
register_processor($class)
Registers a new text extractor. The processor becomes the default processor for its MIME type, but it can be overridden for a given instance by specifying the 'processors' parameter to new().
The specified class needs to implement two functions:
- mime_type(void)
Returns the MIME type that $class can extract files from.
- extract($file)
Extracts the text from $file. Returns a File::Extract::Result object.
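The two-method interface described above can be sketched as follows. This is a minimal illustration, not code from the distribution: the class name MyExtractor, the MIME type 'text/x-my-format', and the stand-in MyExtractor::Result class are all hypothetical, and the sketch assumes both functions are invoked as methods. A real processor would return a File::Extract::Result object instead of the stand-in.

```perl
use strict;
use warnings;

# Stand-in for File::Extract::Result so the sketch runs on its own
# (hypothetical; a real processor returns a File::Extract::Result).
package MyExtractor::Result;
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub text { $_[0]{text} }

# Hypothetical processor for a made-up MIME type.
package MyExtractor;

# The MIME type this processor can extract files from.
sub mime_type { 'text/x-my-format' }

# Read the whole of $file and wrap its contents in a result object.
sub extract {
    my ($self, $file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };    # slurp the file
    close $fh;
    return MyExtractor::Result->new(text => $text);
}

package main;
```

With File::Extract loaded, such a class would then be registered with File::Extract->register_processor('MyExtractor').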
register_filter($mime_type, $filter)
Registers a filter to be applied whenever a file of the given MIME type is encountered.
METHODS
new(%args)
- magic
Returns the File::MMagic::XS object that is used by the object. Use this to modify it, set options, etc. E.g.:

    my $extract = File::Extract->new(...);
    $extract->magic->add_file_ext(t => 'text/perl-test');
    $extract->extract(...);
- filters
A hashref, keyed by MIME type, of filters to be applied to a file before attempting to extract text from it.
Here's a trivial example that prepends a line number to each line before the text is extracted.

    use File::Extract;

    my $extract = File::Extract->new(
        filters => {
            'text/plain' => [
                File::Extract::Filter::Exec->new(cmd => "perl -pe 's/^/\$. /'")
            ]
        }
    );
    my $r = $extract->extract($file);
- processors
A list of processors to be used for this instance. This overrides any processors that were previously registered via the register_processor() class method.
- encodings
A list of encodings that you expect your files to be in. This is used to re-encode and normalize the contents of the file via Encode::Guess.
- output_encoding
The final encoding that you want the extracted text to be in. The default encoding is UTF8.
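To illustrate how a list of expected encodings and an output encoding might work together, here is a minimal sketch of the general Encode::Guess technique. It does not show File::Extract's internals, which are not documented here; the helper normalize_text() and its argument layout are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper (not part of File::Extract): guess the encoding of
# $octets from the candidate list, then re-encode the text as $output.
sub normalize_text {
    my ($octets, $encodings, $output) = @_;
    $output ||= 'UTF-8';

    # guess_encoding() returns an encoding object on success, or an
    # error string when nothing matches or the guess is ambiguous.
    my $decoder = guess_encoding($octets, @$encodings);
    die "Cannot guess encoding: $decoder" unless ref $decoder;

    return encode($output, $decoder->decode($octets));
}

# Three kanji (U+65E5 U+672C U+8A9E) encoded as EUC-JP, standing in
# for the raw contents of a file.
my $raw  = encode('euc-jp', "\x{65E5}\x{672C}\x{8A9E}");
my $utf8 = normalize_text($raw, ['euc-jp'], 'UTF-8');
```

Listing only the encodings you actually expect keeps the guess unambiguous; similar candidates such as euc-jp and shiftjis can both match short inputs, in which case guess_encoding() returns an error string rather than an object.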
extract($file)
SEE ALSO
AUTHOR
Copyright 2005-2007 Daisuke Maki <daisuke@endeworks.jp>. All rights reserved.
LICENSE
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html