NAME

MARC::Lint - Perl extension for checking validity of MARC records

SYNOPSIS

    use MARC::File::USMARC;
    use MARC::Lint;

    my $lint = MARC::Lint->new;
    my $filename = shift;

    my $file = MARC::File::USMARC->in( $filename );
    while ( my $marc = $file->next() ) {
        $lint->check_record( $marc );

        # Print the title tag
        print $marc->title, "\n";

        # Print the errors that were found
        print join( "\n", $lint->warnings ), "\n";
    } # while

Given the following MARC record:

    LDR 00000nam  22002538a 4500
    040    _aMdSSJTT
           _cMdSSJTT
    040    _aMdSSJTT
           _beng
           _cMdSSJTT
    100 14 _aWall, Larry.
    110 1  _aO'Reilly & Associates.
    245 90 _aProgramming Perl /
           _aBig Book of Perl /
           _cLarry Wall, Tom Christiansen & Jon Orwant.
    250    _a3rd ed.
    250    _a3rd ed.
    260    _aCambridge, Mass. :
           _bO'Reilly,
           _r2000.
    590 4  _aPersonally signed by Larry.
    856 43 _uhttp://www.perl.com/

the following errors are generated:

    1XX: Only one 1XX tag is allowed, but I found 2 of them.
    100: Indicator 2 must be blank but it's "4"
    245: Indicator 1 must be 0 or 1 but it's "9"
    245: Subfield _a is not repeatable.
    040: Field is not repeatable.
    260: Subfield _r is not allowed.
    856: Indicator 2 must be blank, 0, 1, 2 or 8 but it's "3"

DESCRIPTION

Module for checking the validity of MARC records. 99% of users will want to do something like what is shown in the synopsis. The other intrepid 1% will subclass MARC::Lint, override its methods, and provide their own special field-level checking.

What this means is that if you have certain requirements, such as making sure that all 952 tags have a certain call number in them, you can write a function that checks for that, and still get all the benefits of the MARC::Lint framework.
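A minimal sketch of that approach follows. It assumes that check_record() calls any check_xxx method the object provides for a field with that tag, and it invents a local rule for illustration: the package name My::Lint, the choice of subfield _a as the place the call number lives, and the warning text are all hypothetical.

```perl
use strict;
use warnings;

package My::Lint;
use parent 'MARC::Lint';

# Hypothetical local rule: every 952 must carry a call number in subfield _a.
sub check_952 {
    my ( $self, $field ) = @_;
    my $call_number = $field->subfield('a');
    if ( !defined $call_number || $call_number !~ /\S/ ) {
        $self->warn('952: Subfield _a (call number) is missing or empty.');
    }
    return;
}

package main;

use MARC::Record;
use MARC::Field;

# Build a small record whose 952 lacks the call number.
my $record = MARC::Record->new;
$record->append_fields(
    MARC::Field->new( '245', '1', '0', a => 'Programming Perl.' ),
    MARC::Field->new( '952', ' ', ' ', b => 'MAIN' ),
);

my $lint = My::Lint->new;
$lint->check_record($record);
print "$_\n" for $lint->warnings;
```

Because the custom check reports through warn(), its message lands in the same list as the built-in ones and is returned by warnings().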

EXPORT

None. Everything is done through objects.

METHODS

new()

No parameters needed. The MARC::Lint object is little more than a list of warnings and a bunch of rules.

warnings()

Returns a list of warnings found by check_record() and its brethren.

clear_warnings()

Clear the list of warnings for this linter object. It's automatically called when you call check_record().

warn( $str [, $str...] )

Create a warning message, built from strings passed, like a print statement.

Typically, you'll leave this to check_record(), but industrious programmers may want to do their own checking as well.

check_record( $marc )

Does all sorts of lint-like checks on the MARC record $marc, both on the record as a whole, and on the individual fields & subfields.

check_xxx( $field )

Various functions to check the different fields, one per tag (check_020, check_245, and so on). If the function for a tag doesn't exist, then that check is not performed.

check_020()

Looks at 020$a and reports errors if the check digit is wrong. Looks at 020$z and validates the number if hyphens are present.

Uses Business::ISBN to do validation. Thirteen-digit checking is currently done with the internal sub _isbn13_check_digit(), based on code from Business::ISBN.

TO DO (check_020):

 Fix 13-digit ISBN checking.

_isbn13_check_digit($ean)

Internal sub to determine if 13-digit ISBN has a valid checksum. The code is taken from Business::ISBN::as_ean. It is expected to be temporary until Business::ISBN is updated to check 13-digit ISBNs itself.
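For reference, the standard ISBN-13 (EAN-13) checksum can be sketched in a few lines. This is not the module's internal sub; the function names isbn13_expected_check_digit and isbn13_is_valid are hypothetical, and the code only illustrates the arithmetic: the first twelve digits are weighted alternately by 1 and 3, and the check digit brings the total to a multiple of 10.

```perl
use strict;
use warnings;

# Hypothetical helper: expected check digit for the first 12 digits
# of an ISBN-13. Returns nothing if fewer than 12 digits are supplied.
sub isbn13_expected_check_digit {
    my ($ean) = @_;
    ( my $digits = $ean ) =~ s/[^0-9]//g;    # drop hyphens and spaces
    return unless length($digits) >= 12;

    my @d   = split //, substr( $digits, 0, 12 );
    my $sum = 0;
    for my $i ( 0 .. 11 ) {
        # Weights alternate 1, 3, 1, 3, ... starting from the left.
        $sum += $d[$i] * ( $i % 2 ? 3 : 1 );
    }
    return ( 10 - $sum % 10 ) % 10;
}

# Hypothetical helper: true if the 13th digit matches the expected one.
sub isbn13_is_valid {
    my ($ean) = @_;
    ( my $digits = $ean ) =~ s/[^0-9]//g;
    return 0 unless length($digits) == 13;
    return isbn13_expected_check_digit($digits) == substr( $digits, 12, 1 ) ? 1 : 0;
}

for my $candidate ( '978-0-596-00027-1', '978-0-596-00027-2' ) {
    printf "%s: %s\n", $candidate,
        isbn13_is_valid($candidate) ? 'valid' : 'bad check digit';
}
```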

check_041( $field )

Warns if the length of a subfield's data is not evenly divisible by 3, unless the second indicator is 7. (A future implementation would ensure that each subfield is exactly 3 characters unless ind2 is 7, since subfields are now repeatable. This is not implemented here because of the large number of records that would need to be corrected.) Validates each code against the MARC Code List for Languages (http://www.loc.gov/marc/) using the data in MARC::Lint::CodeData (%LanguageCodes, %ObsoleteLanguageCodes).
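The length test and the per-code lookup can be sketched roughly as follows. This is an illustration rather than the module's code: the helper name language_code_problems is hypothetical, the two small hashes stand in for the much larger tables in MARC::Lint::CodeData, and skipping all checks when the second indicator is 7 is an assumption.

```perl
use strict;
use warnings;

# Illustrative stand-ins for %LanguageCodes and %ObsoleteLanguageCodes;
# the entries here are a small sample chosen for the example.
my %language_codes          = map { $_ => 1 } qw( eng fre ger spa ita );
my %obsolete_language_codes = map { $_ => 1 } qw( scr scc );

# Hypothetical helper: returns a list of problems found in the data of
# one 041 subfield (an empty list means nothing was flagged).
sub language_code_problems {
    my ( $data, $ind2 ) = @_;
    return () if $ind2 eq '7';    # assumed: codes from another source list

    if ( length($data) % 3 != 0 ) {
        return ("length of '$data' is not divisible by 3");
    }

    my @problems;
    # Split the string into consecutive 3-character codes.
    for my $code ( unpack '(A3)*', $data ) {
        if ( $obsolete_language_codes{$code} ) {
            push @problems, "'$code' may be obsolete";
        }
        elsif ( !$language_codes{$code} ) {
            push @problems, "'$code' is not in the code list";
        }
    }
    return @problems;
}

print "$_\n" for language_code_problems( 'engfrexxx', ' ' );
```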

check_043( $field )

Warns if any subfield a is not exactly 7 characters long. Validates each code against the MARC Code List for Geographic Areas (http://www.loc.gov/marc/) using the data in MARC::Lint::CodeData (%GeogAreaCodes, %ObsoleteGeogAreaCodes).

check_245( $field )

 -Makes sure $a exists (and is first subfield).
 -Warns if last character of field is not a period
 --Follows LCRI 1.0C, Nov. 2003 rather than MARC21 rule
 -Verifies that $c is preceded by / (space-/)
 -Verifies that initials in $c are not spaced
 -Verifies that $b is preceded by :;= (space-colon, space-semicolon, space-equals)
 -Verifies that $h is not preceded by space unless it is dash-space
 -Verifies that data of $h is enclosed in square brackets
 -Verifies that $n is preceded by . (period)
  --As part of that, looks for no-space period, or dash-space-period (for replaced ellipses)
 -Verifies that $p is preceded by , (no-space-comma) when following $n and . (period) when following other subfields.
 -Performs rudimentary article check of 245 2nd indicator vs. 1st word of 245$a (for manual verification).

 Article checking is done by internal _check_article method, which should work for 130, 240, 245, 440, 630, 730, and 830.

_check_article

Check of articles is based on code from Ian Hamilton. This version is more limited in that it focuses on English, Spanish, French, Italian and German articles. Certain possible articles have been removed if they are valid English non-articles. This version also disregards 008_language/041 codes and just uses the list of articles to provide warnings/suggestions.

Source for articles: http://www.loc.gov/marc/bibliographic/bdapp-e.html

Should work with fields 130, 240, 245, 440, 630, 730, and 830. Reports error if another field is passed in.
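The general idea of comparing a nonfiling-characters indicator with the first word of a title can be sketched as below. This is a simplified illustration, not the module's _check_article: the helper name article_warning, the short article list, and the message wording are all assumptions, and the real check handles more cases (leading punctuation, for example). It expects a numeric indicator value.

```perl
use strict;
use warnings;

# Small illustrative article list, far shorter than the Library of
# Congress list. "die" is left out on purpose, since it is also an
# ordinary English word.
my %articles = map { $_ => 1 } qw( a an the der das le la el );

# Hypothetical helper: returns a warning string, or undef when the
# indicator looks consistent with the first word of the title.
sub article_warning {
    my ( $title, $nonfiling ) = @_;
    my ($first_word) = $title =~ /^(\S+)\s/;
    my $is_article = defined $first_word && $articles{ lc $first_word };

    if ($is_article) {
        # Nonfiling characters cover the article plus the following space.
        my $expected = length($first_word) + 1;
        return "First word, $first_word, may be an article; "
            . "check indicator (expected $expected, found $nonfiling)."
            if $nonfiling != $expected;
    }
    elsif ( $nonfiling != 0 ) {
        return "First word does not appear to be an article; "
            . "check indicator ($nonfiling).";
    }
    return;
}

for my $case ( [ 'The Big Book of Perl /', 0 ], [ 'Programming Perl /', 0 ] ) {
    my $warning = article_warning(@$case);
    print defined $warning ? "$warning\n" : "No article problem found.\n";
}
```

As in the module, the result is best treated as a prompt for manual verification rather than a definitive error.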

SEE ALSO

Check the docs for MARC::Record. All software links are there.

TODO

  • Subfield 6

    For subfield 6, it should always be the 1st subfield according to MARC 21 specifications. Perhaps a generic check should be added that warns if subfield 6 is not the 1st subfield.

  • Subfield 8

    This subfield could be the 1st or 2nd subfield, so the code that checks for the 1st few subfields (check_245, check_250) should take that into account.

  • Subfield 9

    This subfield is not officially allowed in MARC, since it is locally defined. Some way needs to be made to allow messages/warnings about this subfield to be turned off (or otherwise deal with records using/allowing locally defined subfield 9).

  • 008 length and presence check

    Currently, 008 validation is not implemented in MARC::Lint, but is left to MARC::Errorchecks. It might be useful if MARC::Lint's basic validation checks included a verification that the 008 exists and is exactly 40 characters long. Additional 008-related checking and byte validation would remain in MARC::Errorchecks.

  • ISBN and ISSN checking

    020 and 022 fields are validated with the Business::ISBN and Business::ISSN modules, respectively. Business::ISBN versions between 2 and 2.02_01 are incompatible with MARC::Lint.

  • check_041 cleanup

    Splitting subfield code strings every 3 chars could probably be written more efficiently.

  • check_245 cleanup

    The article checking in particular.

  • Method for turning off checks

    Provide a way for users to skip checks more easily when using check_record, or a specific check_xxx method (e.g. skip article checking).

LICENSE

This code may be distributed under the same terms as Perl itself.

Please note that these modules are not products of or supported by the employers of the various contributors to the code.