NAME

Treex::Block::Read::BaseReader - abstract ancestor for document readers

VERSION

version 2.20210102

DESCRIPTION

This class serves as a common ancestor for document readers that take the parameter from, a space- or comma-separated list of filenames to be loaded. It is designed to implement the Treex::Core::DocumentReader interface.

In derived classes you need to define the next_document method, and you can use the next_filename and new_document methods to do so.
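The division of labour described above can be sketched roughly as follows. So that the sketch runs without Treex installed, it uses a hypothetical stand-in base class (My::MockBaseReader) that only mimics next_filename, current_filename and new_document with plain hashes. Both package names are illustrative assumptions and not part of Treex; a real reader would extend Treex::Block::Read::BaseReader instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Treex::Block::Read::BaseReader (an assumption,
# used only so that this sketch runs without Treex).
package My::MockBaseReader;

sub new {
    my ( $class, %args ) = @_;
    # "from" is a space- or comma-separated list of filenames
    my @files = grep { length } split /[ ,]+/, $args{from};
    return bless { files => \@files, file_number => 0 }, $class;
}

# Returns the next filename, or undef when the list is exhausted.
sub next_filename {
    my ($self) = @_;
    return if $self->{file_number} >= @{ $self->{files} };
    $self->{current_filename} = $self->{files}[ $self->{file_number}++ ];
    return $self->{current_filename};
}

sub current_filename { return $_[0]{current_filename} }

# The real method returns a Treex document; a plain hash stands in for it
# here. The optional $load_from argument is accepted but ignored by the mock.
sub new_document {
    my ( $self, $load_from ) = @_;
    return {
        loaded_from => $self->current_filename,
        file_number => $self->{file_number},
    };
}

sub next_document { die "next_document must be overridden in derived classes\n" }

# A derived reader: the only method it has to define is next_document.
package My::FileNameReader;
our @ISA = ('My::MockBaseReader');

sub next_document {
    my ($self) = @_;
    my $filename = $self->next_filename();
    return if !defined $filename;    # no more files to read
    return $self->new_document($filename);
}

package main;

my $reader = My::FileNameReader->new( from => 'a.txt,b.txt c.txt' );
my @loaded;
while ( my $doc = $reader->next_document() ) {
    push @loaded, $doc->{loaded_from};
}
print join( ' ', @loaded ), "\n";
```

The derived class never touches the file list itself; it asks next_filename for the next item, stops when nothing is left, and lets new_document fill in the bookkeeping attributes.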

ATTRIBUTES

from (required)

Space- or comma-separated list of filenames, or - for STDIN.

An '@' directly in front of a file name causes this file to be interpreted as a file list, with one file name per line, e.g. '@filelist.txt' causes the reader to open 'filelist.txt' and read a list of files from it. File lists may be arbitrarily mixed with regular files in the parameter.

Similarly, you can use ! for wildcard expansion, e.g. treex -Len Read::Treex from='!dir??/file*.txt'. The single quotes are needed for two reasons: first, to prevent bash from expanding the wildcard characters itself, and second, to prevent bash from interpreting the exclamation mark as history expansion.

The @filelist and !wildcard conventions are used in several tools, e.g. 7z or javac.

(If you set this attribute via the API, you can also pass a reference to an array of strings or a Treex::Core::Files object.)
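The @filelist and !wildcard conventions can be illustrated with a minimal sketch. The helper expand_from below is hypothetical and is not the code Treex uses; it only assumes the behaviour described above (split on spaces and commas, read a file list after @, expand wildcards after !). The demonstration creates its own temporary files, so the file names in it are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of Treex): expands a "from" string into a
# flat list of filenames.
sub expand_from {
    my ($from) = @_;
    my @tokens = grep { length } split /[ ,]+/, $from;
    my @filenames;
    for my $token (@tokens) {
        if ( $token =~ s/^@// ) {
            # "@list.txt": read one filename per line from list.txt
            open my $fh, '<', $token
                or die "Cannot open file list $token: $!";
            chomp( my @listed = <$fh> );
            close $fh;
            push @filenames, grep { length } @listed;
        }
        elsif ( $token =~ s/^!// ) {
            # "!pattern": expand shell-style wildcards
            push @filenames, glob($token);
        }
        else {
            # a regular filename, or "-" for STDIN
            push @filenames, $token;
        }
    }
    return @filenames;
}

# Demonstration in a temporary directory.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(file1.txt file2.txt other.txt)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}
open my $list, '>', "$dir/filelist.txt" or die "Cannot create file list: $!";
print {$list} "x.txt\ny.txt\n";
close $list;

# A regular file, a file list and a wildcard mixed in one parameter.
my @files = expand_from("a.txt,\@$dir/filelist.txt !$dir/file?.txt");
print "$_\n" for @files;
```

Inside Perl there is no shell involved, so no quoting against bash is needed here; the single quotes matter only when the parameter is typed on the command line.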

file_stem (optional)

How to name the loaded documents. This attribute will be saved to the same-named attribute in documents and it will be used in document writers to decide where to save the files.

METHODS

next_document

This method must be overridden in derived classes. (The implementation in this class just issues a fatal error.)

next_filename

Returns the next filename (full path) to be loaded, taken from the list specified in the attribute from.

new_document($load_from?)

Returns a new empty document with the attributes loaded_from, file_stem, file_number and path pre-filled. Their values are guessed from current_filename.

current_filename

Returns the last filename returned by next_filename.

is_next_document_for_this_job

Is the document that will be returned by next_document supposed to be processed by this job? This is relevant only in parallel processing, where each job has a different $jobnumber assigned.

number_of_documents

Returns the number of documents that will be read by this reader. If is_one_doc_per_file returns true, then the number of documents equals the number of files given in from. Otherwise, this method returns undef.

SEE

Treex::Block::Read::BaseTextReader, Treex::Block::Read::Text

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.