
NAME

Genealogy::Gedcom::Reader::Lexer - An OS-independent lexer for GEDCOM data

Synopsis

Run scripts/lex.pl -help.

A typical run would be:

perl -Ilib scripts/lex.pl -i data/royal.ged -r 1 -s 1

Turn on debugging prints with:

perl -Ilib scripts/lex.pl -i data/royal.ged -r 1 -s 1 -max debug

royal.ged was downloaded from http://www.vjet.f2s.com/ftree/download.html. It's more up-to-date than the one shipped with Gedcom.

Various sample GEDCOM files may be found in the data/ directory in the distro.

Description

Genealogy::Gedcom::Reader::Lexer provides a lexer for GEDCOM data.

See the GEDCOM Specification Ged551-5.pdf.

Installation

Install Genealogy::Gedcom as you would for any Perl module:

Run:

        cpanm Genealogy::Gedcom

or run:

        sudo cpan Genealogy::Gedcom

or unpack the distro, and then either:

        perl Build.PL
        ./Build
        ./Build test
        sudo ./Build install

or:

        perl Makefile.PL
        make (or dmake or nmake)
        make test
        make install

Constructor and Initialization

new() is called as my($lexer) = Genealogy::Gedcom::Reader::Lexer -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type Genealogy::Gedcom::Reader::Lexer.

Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. input_file()]):

o input_file => $gedcom_file_name

Read the GEDCOM data from this file.

Default: ''.

o locale => $a_locale_name

Specify the locale for DateTime objects.

Default: 'en_AU'.

o logger => $logger_object

Specify a logger object.

To disable logging, just set logger to the empty string.

Default: An object of type Log::Handler.

o maxlevel => $level

This option is only used if the lexer creates an object of type Log::Handler. See Log::Handler::Levels.

Default: 'info'.

Log levels are, from highest (i.e. most output) to lowest: 'debug', 'info', 'warning', 'error'. No lower levels are used.

o minlevel => $level

This option is only used if the lexer creates an object of type Log::Handler. See Log::Handler::Levels.

Default: 'error'.

o report_items => $Boolean
o 0 => Report nothing
o 1 => Call "report()" to report, via the log, the items recognized by the lexer

This output is at log level 'info'.

Default: 0.

o strict => $Boolean

Specifies lax or strict string length checking during validation.

o 0 => String lengths can be 0, allowing blank NOTE etc records.
o 1 => String lengths must be > 0, as per the GEDCOM Specification Ged551-5.pdf.

Note: A string of length 1 - e.g. '0' - might still be an error.

Default: 0.

The upper lengths on strings are always as per the GEDCOM Specification Ged551-5.pdf. See "get_max_length($id, $line)" for details.

String lengths out of range (as with all validation failures) are reported as log messages at level 'warning'.

Methods

check_date($id, $line)

Checks the date field in the input arrayref $line, $$line[4].

$id identifies what type of record the $line is expected to be.

check_length($id, $line)

Checks the length of the data component (after the tag) on the input arrayref $line, $$line[4].

$id identifies what type of record the $line is expected to be.

cross_check_xrefs()

Ensure that all xrefs point to existing records.

See "What validation is performed?" in FAQ for details.

get_gedcom_from_file()

If the caller has requested GEDCOM data be read from a file, with the input_file option to new(), this method reads that file.

Called as appropriate by "run()", if you do not supply data with "gedcom_data([$gedcom_data])".

gedcom_data([$gedcom_data])

The [] indicate an optional parameter.

Get or set the arrayref of GEDCOM records to be processed.

This is normally only used internally, but can be used to bypass reading from a file.

Note: If supplying data this way rather than via the file, you must strip newlines etc on every line, as well as leading and trailing blanks.
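
The clean-up described in the note above could be sketched as follows. This is a minimal illustration, not code from the module: clean_records() is a hypothetical helper, and dropping blank lines is an assumption made here to mirror what happens when reading from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw GEDCOM text into an arrayref of clean records.
sub clean_records
{
	my($text) = @_;
	my(@record);

	for my $line (split(/\r\n|\n|\r/, $text) )
	{
		$line =~ s/^\s+//; # Strip leading blanks.
		$line =~ s/\s+$//; # Strip trailing blanks.

		# Assumption: blank lines are dropped, as they are for file input.
		push @record, $line if (length $line);
	}

	return \@record;
}

my($record) = clean_records("0 HEAD\r\n  1 SOUR Example  \n\n0 TRLR\n");

print map{"<$_>\n"} @$record;
```

The resulting arrayref is the kind of value one would pass to gedcom_data() before calling run().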

get_max_length($id, $line)

Get the maximum string length of the data component (after the tag) on the given $line.

$id identifies what type of record the $line is expected to be.

get_min_length($id, $line)

Get the minimum string length of the data component (after the tag) on the given $line.

Currently, this value is actually the value of strict(), i.e. 0 or 1.

$id identifies what type of record the $line is expected to be.
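
As a rough illustration of how the value of strict() doubles as the minimum length, here is a small sketch. length_in_range() is a hypothetical helper, not the module's code, and the maximum of 248 is an illustrative value only; the real maxima come from the GEDCOM Specification.

```perl
use strict;
use warnings;

# Hypothetical helper: is the length of $data within range?
# $strict is 0 or 1, and is used directly as the minimum length.
# $max_length is supplied by the caller (illustrative value below).
sub length_in_range
{
	my($data, $strict, $max_length) = @_;
	my($length) = length $data;

	return ( ($length >= $strict) && ($length <= $max_length) ) ? 1 : 0;
}

print length_in_range('', 0, 248), "\n"; # Lax: A blank NOTE passes.
print length_in_range('', 1, 248), "\n"; # Strict: A blank NOTE fails.
```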

input_file([$gedcom_file_name])

Here, the [] indicate an optional parameter.

Get or set the name of the file to read the GEDCOM data from.

items()

Returns an object of type Set::Array, which is an arrayref of items output by the lexer.

See the "FAQ" for details.

locale([$a_locale_name])

Here, the [] indicate an optional parameter.

Get or set the name of the locale to use for DateTime objects.

log($level, $s)

Calls $self -> logger -> $level($s).

logger([$logger_object])

Here, the [] indicate an optional parameter.

Get or set the logger object.

To disable logging, just set logger to the empty string.

maxlevel([$string])

Here, the [] indicate an optional parameter.

Get or set the value used by the logger object.

This option is only used if the lexer creates an object of type Log::Handler. See Log::Handler::Levels.

minlevel([$string])

Here, the [] indicate an optional parameter.

Get or set the value used by the logger object.

This option is only used if the lexer creates an object of type Log::Handler. See Log::Handler::Levels.

push_item($line, $type)

Pushes a hashref of components of the $line, with type $type, onto the arrayref of items returned by "items()".

See the "FAQ" for details.

renumber_items()

Scan the arrayref of hashrefs returned by items() and ensure the 'count' field is ok.

This is done in case array elements have been combined, e.g. when processing CONCs and CONTs for NOTEs.
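
The idea can be sketched like this, assuming items are plain hashrefs with a 'count' key as described in the "FAQ". renumber() is a hypothetical stand-in, not the module's method.

```perl
use strict;
use warnings;

# Hypothetical helper: reset 'count' so it is always the array index + 1.
sub renumber
{
	my($items) = @_;
	my($count) = 0;

	$_->{count} = ++$count for @$items;

	return $items;
}

# The gap in the counts (1, 4) stands for elements which were combined.
my($items) = renumber([{count => 1, tag => 'NOTE'}, {count => 4, tag => 'DATE'}]);

print join(', ', map{"$_->{count}: $_->{tag}"} @$items), "\n";
```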

report()

Report, via the log, the list of items recognized by the lexer.

report_items([0 or 1])

The [] indicate an optional parameter.

Get or set the value which determines whether or not to report the items recognized by the lexer.

run()

This is the only method the caller needs to call. All parameters are supplied to new(), or via previous calls to various methods.

Returns 0 for success and 1 for failure.
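
A minimal calling sketch, using only the options and methods documented here. The file name assumes the script is run from the distro's root directory. The module is loaded at run time so that the sketch still runs (and reports 'skipped') where the module or the sample file is not available.

```perl
use strict;
use warnings;

my(%option) =
(
	input_file   => 'data/royal.ged', # Assumption: Run from the distro's root.
	report_items => 1,
	strict       => 1,
);
my($status) = 'skipped';

# Only proceed if both the module and the input file can be found.
if (eval{require Genealogy::Gedcom::Reader::Lexer; 1} && -e $option{input_file})
{
	my($lexer)  = Genealogy::Gedcom::Reader::Lexer -> new(%option);
	my($result) = eval{$lexer -> run};

	# run() returns 0 for success and 1 for failure. Unrecoverable states call die.
	$status = defined($result) ? ($result == 0 ? 'success' : 'failure') : "died: $@";
}

print "Lexer status: $status\n";
```

Wrapping run() in eval is a design choice of this sketch, so that a syntax error in the GEDCOM data is reported rather than ending the script.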

strict([0 or 1])

The [] indicate an optional parameter.

Get or set the value which determines whether 0 or 1 is used as the minimum string length.

FAQ

How are user-defined tags handled?

In the same way as GEDCOM tags.

They are defined by having a leading '_', and otherwise follow the same syntax as GEDCOM tags. That is:

o At level 0, they match /(_?(?:[A-Z]{3,4}))/.
o At level > 0, they match /(_?(?:ADR[123]|[A-Z]{3,5}))/.

Each user-defined tag is stand-alone, meaning they can't be extended with CONC or CONT tags in the way some GEDCOM tags can.

See data/sample.4.ged.
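
The two patterns above can be tried out with a short script. This is only a sketch: the patterns are anchored with ^ and $ here (an assumption, so that the whole tag must match), the helper names are hypothetical, and '_MYTAG' is a made-up tag.

```perl
use strict;
use warnings;

# The patterns quoted above, anchored so the whole tag must match.
my($level_0_tag) = qr/^(_?(?:[A-Z]{3,4}))$/;
my($level_n_tag) = qr/^(_?(?:ADR[123]|[A-Z]{3,5}))$/;

# Hypothetical helper: does $tag have the right shape for $level?
sub tag_is_plausible
{
	my($level, $tag) = @_;
	my($pattern)     = ($level == 0) ? $level_0_tag : $level_n_tag;

	return ($tag =~ $pattern) ? 1 : 0;
}

# Hypothetical helper: user-defined tags have a leading '_'.
sub tag_is_user_defined
{
	my($tag) = @_;

	return (substr($tag, 0, 1) eq '_') ? 1 : 0;
}

for my $test ([0, 'INDI'], [0, '_UID'], [1, 'ADR1'], [1, '_MYTAG'], [0, '_MYTAG'])
{
	printf "%d %-6s plausible: %d, user-defined: %d\n",
		$$test[0], $$test[1], tag_is_plausible(@$test), tag_is_user_defined($$test[1]);
}
```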

How are CONC and CONT tags handled?

Nothing is done with them, meaning e.g. text flowing from a NOTE (say) onto a CONC or CONT is not concatenated.

Currently then, even GEDCOM tags are stand-alone.

How is the lexed data stored in RAM?

Items are stored in an arrayref. This arrayref is available via the "items()" method.

This method returns the same data as does "items()" in Genealogy::Gedcom::Reader.

Each element in the array is a hashref of the form:

        {
        count      => $n,
        data       => $a_string,
        level      => $n,
        line_count => $n,
        tag        => $a_tag,
        type       => $a_string,
        xref       => $a_string,
        }

Key-value pairs are:

o count => $n

Items are numbered from 1 up, so this is the array index + 1.

Note: Blank lines in the input file are skipped.

o data => $a_string

This is any data associated with the tag.

Given the GEDCOM record:

        1   NAME Given Name /Surname/

then data will be 'Given Name /Surname/', i.e. the text after the tag.

Given the GEDCOM record:

        1   SUBM @SUBM1@

then data will be 'SUBM1'.

As with xref (below), the '@' characters are stripped.

o level => $n

This is the level from the GEDCOM data.

o line_count => $n

This is the line number from the GEDCOM data.

o tag => $a_tag

This is the GEDCOM tag.

o type => $a_string

This is a string indicating what broad class the tag refers to. Values:

o (Empty string)

Used for various cases.

o Address
o Concat
o Continue
o Date

If the type is 'Date', then it has been successfully parsed.

If parsing failed, the value will be 'Invalid date'.

o Event
o Family
o File name
o Header
o Individual
o Invalid date

This type is used when parsing the date failed.

A successfully parsed date has the type 'Date'.

o Multimedia
o Note
o Place
o Repository
o Source
o Submission
o Submitter
o Trailer
o xref => $a_string

Given the GEDCOM record:

        0 @I82@ INDI

then xref will be 'I82'.

As with data (above), the '@' characters are stripped.
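
To make the shape of these hashrefs concrete, here is a sketch which splits the sample records above into their fields. split_record() is a hypothetical helper and its regexp is a simplification; it is not the module's code. 'type' is left empty because working it out requires knowing the tag's context.

```perl
use strict;
use warnings;

# Hypothetical helper: split one GEDCOM record into the fields described above.
sub split_record
{
	my($line, $count, $line_count) = @_;

	# Level, optional @xref@, tag, optional data.
	my($level, $xref, $tag, $data) =
		$line =~ /^(\d+)\s+(?:\@([^\@]+)\@\s+)?(_?[A-Z][A-Z0-9]{2,4})(?:\s+(.*))?$/
		or return;

	$xref = '' if (! defined $xref);
	$data = '' if (! defined $data);
	$data =~ s/^\@([^\@]+)\@$/$1/; # Strip the '@' characters from a pointer.

	return
	{
		count      => $count,
		data       => $data,
		level      => $level,
		line_count => $line_count,
		tag        => $tag,
		type       => '',
		xref       => $xref,
	};
}

my($count) = 0;

for my $line ('0 @I82@ INDI', '1   NAME Given Name /Surname/', '1   SUBM @SUBM1@')
{
	$count++;

	my($item) = split_record($line, $count, $count);

	print "level=$item->{level} xref='$item->{xref}' tag=$item->{tag} data='$item->{data}'\n";
}
```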

What validation is performed?

There is no perfect answer as to what should be a warning and what should be an error.

So, the author's philosophy is that unrecoverable states are errors, and the code calls 'die'. See "Under what circumstances does the code call 'die'?".

And, the log level 'error' is not used. All validation failures are logged at level warning, leaving interpretation up to the user. See "How does logging work?".

Details:

o Cross-references

Xrefs (pointers) are checked to ensure they point to an xref which exists. Each dangling xref is only reported once.

o Dates are validated
o Duplicate xrefs

Xrefs which are (potentially) pointed to are checked for uniqueness.

o String lengths

Maximum string lengths are checked as per the GEDCOM Specification Ged551-5.pdf.

Minimum string lengths are checked as per the value of the 'strict' option to new().

o Strict 'v' Mandatory

Validation is mandatory, even with the 'strict' option set to 0. 'strict' only affects the minimum string length acceptable.

o Tag nesting

Tag nesting is validated by the mechanism of nested method calls, with each method (called tag_*) knowing what tags it handles, and with each nested call handling its own tags.

This process starts with the call to tag_lineage(0, $line) in method "run()".

o Unexpected tags

The lexer reports the first unexpected tag, meaning a tag which is not a GEDCOM tag and which does not start with '_'.

All validation failures are reported as log messages at level 'warning'.
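
The cross-reference and duplicate xref checks could be sketched as below. This is an illustration working on raw records, not the module's implementation. It assumes definitions look like '0 @ID@ TAG' and that a pointer is a record whose whole data component is '@ID@'. check_xrefs() is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: return an arrayref of warnings about xrefs.
sub check_xrefs
{
	my($record) = @_;
	my(%defined, %seen, @warning);

	# Pass 1: Collect definitions, noting duplicates.
	for my $line (@$record)
	{
		if ($line =~ /^0\s+\@([^\@]+)\@\s+/)
		{
			push @warning, "Duplicate xref: $1" if ($defined{$1}++);
		}
	}

	# Pass 2: Check pointers, reporting each dangling xref only once.
	for my $line (@$record)
	{
		if ($line =~ /^[1-9]\d*\s+\S+\s+\@([^\@]+)\@$/)
		{
			next if ($defined{$1} || $seen{$1}++);

			push @warning, "Dangling xref: $1";
		}
	}

	return \@warning;
}

my($warning) = check_xrefs
([
	'0 @I1@ INDI', '1 FAMS @F1@', '0 @I1@ INDI', '1 FAMC @F2@',
	'1 FAMS @F2@', '0 @F1@ FAM', '0 TRLR',
]);

print "$_\n" for @$warning;
```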

What other validation is planned?

Here are some suggestions from the mailing list:

o Mandatory sub-tags

This means check that each tag has all its mandatory sub-tags.

o Natural (not step-) parent must be older than child
o Prior art

http://www.tamurajones.net/GEDCOMValidation.xhtml.

o Specific values for data attached to tags

Many such checks are possible. E.g. Attribute type (p 43 of GEDCOM Specification) must be one of: CAST | EDUC | NATI | OCCU | PROP | RELI | RESI | TITL | FACT.

What other features are planned?

Here are some suggestions from the mailing list:

o Persistent IDs for individuals

A proposal re UUIDs.

How does logging work?

o Debugging

When new() is called as new(maxlevel => 'debug'), each method entry is logged at level 'debug'.

This has the effect of tracing all code which processes tags.

Since the default value of 'maxlevel' is 'info', all this output is suppressed by default. Such output is mainly for the author's benefit.

o Log levels

Log levels are, from highest (i.e. most output) to lowest: 'debug', 'info', 'warning', 'error'. No lower levels are used. See Log::Handler::Levels.

'maxlevel' defaults to 'info' and 'minlevel' defaults to 'error'. In this way, levels 'info' and 'warning' are reported by default.

Currently, level 'error' is not used. Fatal errors cause 'die' to be called, since they are unrecoverable. See "Under what circumstances does the code call 'die'?".

o Reporting

When new() is called as new(report_items => 1), the items are logged at level 'info'.

o Validation failures

These are reported at level 'warning'.
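
The effect of 'minlevel' and 'maxlevel' on the four levels mentioned above can be sketched as follows. The numeric ranks are an assumption of this sketch (a higher rank means more output), not the values used by Log::Handler, and is_logged() is a hypothetical helper.

```perl
use strict;
use warnings;

# Illustrative ranks only. A higher rank means more output.
my(%rank) = (error => 0, warning => 1, info => 2, debug => 3);

# Hypothetical helper: is a message at $level output, given the 2 limits?
sub is_logged
{
	my($level, $minlevel, $maxlevel) = @_;

	return ( ($rank{$level} >= $rank{$minlevel}) && ($rank{$level} <= $rank{$maxlevel}) ) ? 1 : 0;
}

# Defaults: minlevel => 'error', maxlevel => 'info'.
for my $level (qw/debug info warning error/)
{
	printf "%-7s %s\n", $level, is_logged($level, 'error', 'info') ? 'logged' : 'suppressed';
}
```

With the defaults, only 'debug' output is suppressed. Passing maxlevel => 'debug' to new() lets it through.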

Under what circumstances does the code call 'die'?

o When there is a typo in the field name passed in to check_length()

This is a programming error.

o When an input file is not specified

This is a user (run time) error.

o When there is a syntax error in a GEDCOM record

This is a user (data preparation) error.

How do I change the version of the GEDCOM grammar supported?

By sub-classing.

What file charsets are supported?

ASCII - i.e. nothing else has been tested.

The code really ought to support ANSEL (a superset of ASCII), ASCII, UTF-8 and UTF-16 (known to GEDCOM as UNICODE).

TODO

o Test input file for binary
o Test input file for non-ASCII character sets
o Test input file for size 0
o Tighten validation

Machine-Readable Change Log

The file CHANGES was converted into Changelog.ini by Module::Metadata::Changes.

Version Numbers

Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.

References

o The original Perl Gedcom
o GEDCOM
o GEDCOM Specification
o GEDCOM Validation
o GEDCOM Tags
o Usage of non-standard tags
o http://www.tamurajones.net/FTWTEXT.xhtml

This is apparently the worst offender she's seen. Search that page for 'tags'.

o http://www.tamurajones.net/GenoPro2011.xhtml
o http://www.tamurajones.net/GenoPro2007.xhtml
o http://www.tamurajones.net/TheFTWTEXTProblem.xhtml
o http://www.tamurajones.net/FiveFreakyFeaturesYourGenealogySoftwareShouldNotHave.xhtml
o http://www.tamurajones.net/TwelveOrdinaryMustHaveGenealogySoftwareFeatures.xhtml
o Other projects

Many of these are discussed on Tamura's site.

o http://bettergedcom.wikispaces.com/
o http://www.ngsgenealogy.org/cs/GenTech_Projects
o http://gdmxml.fugal.net/
o http://www.cosoft.org/genxml/
o http://www.sunflower.com/~billk/GEDC/
o http://ancestorsnow.blogspot.com/2011/07/vged.html
o http://www.tamurajones.net/GEDCOMValidation.xhtml
o http://webtrees.net/
o http://swoodbridge.com/Genealogy/lifelines/
o http://deadendssoftware.blogspot.com/
o http://www.legacyfamilytree.com/
o https://devnet.familysearch.org/docs/api-overview

The Gedcom Mailing List

Contact perl-gedcom-help@perl.org.

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=Genealogy::Gedcom.

Author

Genealogy::Gedcom::Reader::Lexer was written by Ron Savage <ron@savage.net.au> in 2011.

Home page: http://savage.net.au/index.html.

Copyright

Australian copyright (c) 2011, Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html