NAME

Catmandu::MARC::Tutorial - A documentation-only module for new users of Catmandu::MARC

SYNOPSIS

  perldoc Catmandu::MARC::Tutorial

READING

Convert MARC21 records into JSON

The command below converts file data.mrc into JSON:

   $ catmandu convert MARC to JSON < data.mrc

Convert MARC21 records into MARC-XML

   $ catmandu convert MARC to MARC --type XML < data.mrc

Convert UNIMARC records into JSON, XML, ...

To read UNIMARC records, use the RAW parser to get the correct character encoding:

   $ catmandu convert MARC --type RAW to JSON < data.mrc
   $ catmandu convert MARC --type RAW to MARC --type XML < data.mrc

Create a CSV file containing all the titles

To extract data from a MARC record one needs a Fix routine. Fix is a small language to manipulate data. In the example below we extract all 245 fields from MARC:

   $ catmandu convert MARC to CSV --fix 'marc_map(245,title); retain(title)' < data.mrc

The Fix marc_map puts the MARC 245 field in the title field. The Fix retain makes sure only the title field ends up in the CSV file.

Create a CSV file containing only the 245$a and 245$c subfields

The marc_map Fix can get one or more subfields to extract from MARC:

   $ catmandu convert MARC to CSV --fix 'marc_map(245ac,title); retain(title)' < data.mrc

Create a CSV file which contains a repeated field

In the example below the 650a field can be repeated in some MARC records. We will join all the repetitions into a comma-delimited list for each record.

   $ catmandu convert MARC to CSV --fix 'marc_map(650a,subject,join:","); retain(subject)' < data.mrc

Create a list of all ISBN numbers in the data

In the previous example we saw how all subjects can be printed using a few Fix commands. When a subject is repeated in a record, it will be written on one line joined by a comma:

    subject1
    subject2, subject3
    subject4

In the example above, record 1 contained 'subject1', record 2 contained 'subject2' and 'subject3', and record 3 contained 'subject4'. What should we do when we want all the values in one long list, one value per line?

In the example below we'll print all ISBN numbers in a batch of MARC records in one long list using the Text exporter:

  $ catmandu convert MARC to Text --field_sep "\n" --fix 'marc_map(020a,isbn.\$append); retain(isbn)' < data.mrc

The first new thing is the $append in the marc_map. This will create in isbn a list of all ISBN numbers found in the 020a field. Because $ signs have a special meaning on the command line they need to be escaped with a backslash \. The Text exporter with the field_sep option makes sure that every value in the isbn list is written on a new line.

Create a list of all unique ISBN numbers in the data

Given the result of the previous command, it is now easy to create a unique list of ISBN numbers with the UNIX sort and uniq commands (uniq only removes duplicate lines that are adjacent, so the list is sorted first):

 $ catmandu convert MARC to Text --field_sep "\n" --fix 'marc_map(020a,isbn.\$append); retain(isbn)' < data.mrc | sort | uniq
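
A small sketch with made-up ISBN values (not real catmandu output) shows why the order of the input matters to uniq, and that sort -u gives a deduplicated list in one step. The variable names are only for illustration:

```shell
# Made-up ISBN values standing in for the output of the catmandu command.
isbns='9780596000271
9781449303587
9780596000271'

# uniq alone keeps both copies because they are not on adjacent lines.
with_uniq=$(printf '%s\n' "$isbns" | uniq)

# sort -u sorts and deduplicates in one step, so the repeated value disappears.
with_sort=$(printf '%s\n' "$isbns" | sort -u)

printf 'uniq:    %s lines\n' "$(printf '%s\n' "$with_uniq" | wc -l | tr -d ' ')"
printf 'sort -u: %s lines\n' "$(printf '%s\n' "$with_sort" | wc -l | tr -d ' ')"
```

Sorting changes the order of the list; when the original record order matters, a different deduplication approach would be needed.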

Create a list of the number of subjects per record

We will create a list of subjects (650a) and count the number of items in this list for each record. The CSV file will contain two columns: _id (the record identifier) and subject (the number of 650a fields).

Writing all Fixes on the command line can become tedious. In Catmandu it is possible to create a Fix script which contains all the Fix commands.

Open a text editor and create the myfix.fix file with content:

    marc_map(650a,subject.$append)
    count(subject)
    retain(_id, subject)

And execute the command:

   $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

Create a list of all ISBN numbers for records with type 920a == book

In this example we need an extra condition to match the content of the 920a field against the string book.

Open a text editor and create the myfix.fix file with content:

    marc_map(020a,isbn.$append)
    marc_map(920a,type)

    select all_match(type,"book") # select only the books
    select exists(isbn)           # select only the records with ISBN numbers

    retain(isbn)                  # only keep this field

All text after a # sign is an inline code comment.

And run the command:

    $ catmandu convert MARC to Text --field_sep "\n" --fix myfix.fix < data.mrc

Show which MARC records don't contain a 900a field matching some list of values

First we need to create a list of keys that need to be matched against our MARC records. In the example below we create a CSV file with a key,value header and all the keys that are OK:

    $ cat /tmp/mylist.txt
    key,value
    book,OK
    article,OK
    journal,OK

Next we create a Fix script that maps the MARC 900a field to a field called type. We then look up this type field in the mylist.txt file. If a match is found, the type field will contain the value from the list (OK). When no match is found, the type field keeps its original value. We reject all records that have OK as type and keep only the ones that weren't matched in the file.

Open a text editor and create the myfix.fix file with content:

    marc_map(900a,type)

    lookup(type,'/tmp/mylist.txt')

    reject all_match(type,OK)

    retain(_id,type)

And now run the command:

    $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc
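
To make the combined effect of lookup and reject concrete, here is a sketch in plain shell and awk that mimics the two steps on made-up 900a values. It is not how Catmandu implements lookup; the temporary file, the variable names and the sample values are assumptions for illustration only:

```shell
# Write the dictionary from the example to a temporary file.
list=$(mktemp)
cat > "$list" <<'EOF'
key,value
book,OK
article,OK
journal,OK
EOF

# Made-up 900a values, one per record.
types='book
thesis
journal
map'

# First pass (NR == FNR) loads key -> value from the dictionary and skips the header.
# Second pass replaces a value when its key is known (the lookup step) and prints
# only values that did not become OK (the reject step, simplified to an exact match).
unmatched=$(printf '%s\n' "$types" | awk -F, '
    NR == FNR { if (FNR > 1) map[$1] = $2; next }
    { v = ($1 in map) ? map[$1] : $1; if (v != "OK") print v }
' "$list" -)

printf '%s\n' "$unmatched"
rm -f "$list"
```

Only the values without an entry in the list remain, which is the report the Fix script above produces per record.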

Create a CSV file of all ISSN numbers found at any MARC field

To process this information we need to create a Fix script like the one below (line numbers are added here to explain the working of this script but don't need to be included in the script):

    01: marc_map('***',text.$append)
    02:
    03: filter(text,'(\b\d{4}-?\d{3}[\dxX]\b)')
    04: replace_all(text.*,'.*(\b\d{4}-?\d{3}[\dxX]\b).*',$1)
    05:
    06: do list(path:text)
    07:   unless is_valid_issn(.)
    08:     reject()
    09:   end
    10: end
    11:
    12: vacuum()
    13:
    14: select exists(text)
    15:
    16: join_field(text,' ; ')
    17:
    18: retain(_id,text)

On line 01 all the text in the MARC record is mapped into a text array. On line 03 we keep only the entries of this array that contain an ISSN-like string, using a regular expression. On line 04 replace_all deletes everything in each entry that isn't part of the ISSN number. On lines 06-10 we go over every ISSN string, check whether it has a valid checksum, and erase it when it does not. On line 12 the vacuum function removes any remaining empty fields. On line 14 we select only the records that contain a valid ISSN number. On line 16 the ISSNs are joined by a semicolon ';' into one long string. On line 18 we keep only the record id and the ISSNs for the report.

Run this Fix script (without the line numbers) using this command:

    $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc
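
The checksum test in the script relies on the ISSN check digit. As an illustration of what such a validation involves, the sketch below applies the standard ISSN modulus 11 rule in POSIX shell. The function name issn_is_valid is hypothetical, and this is not the code behind the is_valid_issn condition:

```shell
# Hypothetical helper: returns 0 when $1 is an ISSN with a correct check digit.
# Plain POSIX sh has no local variables, so the names below are global.
issn_is_valid() {
    # Drop the optional hyphen and normalise a trailing lowercase x.
    issn=$(printf '%s' "$1" | sed 's/-//; s/x$/X/')

    # Must be seven digits followed by a digit or X.
    case "$issn" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9X]) ;;
        *) return 1 ;;
    esac

    # Weighted sum of the first seven digits, weights 8 down to 2.
    sum=0
    weight=8
    rest=$issn
    while [ "$weight" -ge 2 ]; do
        digit=${rest%"${rest#?}"}    # first character of $rest
        rest=${rest#?}               # remainder without that character
        sum=$((sum + digit * weight))
        weight=$((weight - 1))
    done

    # The check digit brings the total to a multiple of 11; 10 is written as X.
    check=$(( (11 - sum % 11) % 11 ))
    if [ "$check" -eq 10 ]; then
        check=X
    fi

    # After the loop $rest holds only the eighth character.
    [ "$rest" = "$check" ]
}

for candidate in 0378-5955 2434-561X 0378-5954; do
    if issn_is_valid "$candidate"; then
        echo "$candidate valid"
    else
        echo "$candidate invalid"
    fi
done
```

A string can match the regular expression on line 03 and still fail this test, which is why the script checks each candidate after the regular expression has done the rough filtering.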

Create a MARC validator

For this example we need a Fix script that contains the validation rules we want to check. For instance, we require every record to have exactly one 245 field and a 008 control field with a date filled in. This can be coded as:

    # Check if a 245 field is present
    unless marc_has('245')
      log("no 245 field",level:ERROR)
    end

    # Check if there is more than one 245 field
    if marc_has_many('245')
      log("more than one 245 field?",level:ERROR)
    end

    # Check if 008 positions 7 to 10 contain a 4-digit number ('\d' means digit)
    unless marc_match('008/07-10','\d{4}')
      log("no 4-digit year in 008 position 7 -> 10",level:ERROR)
    end

Put this Fix script in a file myfix.fix and execute the Catmandu command with the "-D" option for logging and the Null exporter to discard the normal output:

    $ catmandu -D convert MARC to Null --fix myfix.fix < data.mrc
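
The position range 07-10 can be confusing because MARC 21 counts positions in a control field from zero. The sketch below uses a made-up 008 value and plain shell tools (not catmandu) to show which characters the rule looks at, assuming the MARC path uses the same zero-based counting:

```shell
# A made-up 008 control field: date entered (00-05), type of date (06), Date 1 (07-10), ...
f008='850101s1985    nyu           000 0 eng d'

# Positions 07-10 counted from zero are characters 8-11 for cut, which counts from one.
year=$(printf '%s' "$f008" | cut -c8-11)

# Same idea as the regular expression \d{4}: the four characters must all be digits.
case "$year" in
    [0-9][0-9][0-9][0-9]) echo "year ok: $year" ;;
    *)                    echo "no 4-digit year in 008 position 7 -> 10" ;;
esac
```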

TRANSFORMING

Add a new MARC field

In the example below we add a new 856 field to the record with a $u subfield containing the Google homepage:

   marc_add(856,u,"http://www.google.com")

A control field can be added by using the '_' subfield:

   marc_add(009,_,0123456789)

Maybe you want to copy the data from one subfield to another. Use the marc_map to store the data first in a temporary field and add it later to the new field:

   # copy a subfield
   marc_map(001,tmp)

   # maybe process the data a bit
   append(tmp,"-mytest")

   # add the contents of the tmp field to the new 009 field
   marc_add(009,_,$.tmp)

Set a MARC subfield

Set the $h subfield to a new value (or create it when it doesn't exist yet):

   marc_set(100h, test123)

Only set the 100h subfield if the first indicator is 3:

   marc_set(100[3]h, test123)

Remove a MARC (sub)field

Remove all 5xx fields (500, 501, ...):

   marc_remove(5**)

Remove all 245h fields:

   marc_remove(245h)

Append text to a MARC field

Append a period to the 500 field if there isn't one already:

  do marc_each()
    unless marc_match(500, "\.$")    # Only if the current field 500 doesn't end with a period
      marc_append(500,".")           # Add to the current 500 field a period
    end
  end

Use the Catmandu::Fix::Bind::marc_each Bind to loop over all MARC fields. Within the do ... end block only one MARC field at a time is visible to the marc_* fixes.

The marc_each binder

All marc_* fixes will operate on all MARC fields matching a MARC path. For example,

   marc_remove(856)

will remove all 856 MARC fields. In some cases you may want to change only some of the fields in a record. You could write:

  if marc_match(856u,"google")
     marc_remove(856)
  end

in the hope it would remove the 856 fields that contain the text "google" in the $u subfield. Alas, this is not what will happen. The if condition will match when the record contains one or more 856u fields containing "google". The marc_remove Fix will delete all 856 fields. To correctly remove only the 856 fields in the context of the if statement the marc_each binder is required:

  do marc_each()
    if marc_match(856u,"google")
       marc_remove(856)
    end
  end

The marc_each will loop over all MARC fields one at a time. The if statement will only match when the current MARC field is 856 and the $u field contains "google". The marc_remove(856) will only delete the current 856 field.

Inside the marc_each binder, every Fix behaves as if the record contains only one field at a time. For that reason this Fix will not work:

  do marc_each()
    if marc_match(856u,"google")
       marc_remove(900)           # <-- there is only a 856 field in the current context
    end
  end

marc_copy, marc_cut and marc_paste

The Catmandu::Fix::marc_copy, Catmandu::Fix::marc_cut and Catmandu::Fix::marc_paste Fixes are useful when complicated edits are needed in a MARC record.

The marc_copy Fix will copy the parts of a MARC record matching a MARC_PATH to a temporary variable. This temporary variable will contain an ARRAY of HASHes holding the content of the MARC fields.

For instance,

  marc_copy(650, tmp)

The tmp will contain something like:

  tmp:[
      {
          "subfields" : [
              {
                  "a" : "Perl (Computer program language)"
              }
          ],
          "ind1" : " ",
          "ind2" : "0",
          "tag" : "650"
      },
      {
          "ind1" : " ",
          "subfields" : [
              {
                  "a" : "Web servers."
              }
          ],
          "tag" : "650",
          "ind2" : "0"
      }
  ]

This structure can be edited with all the Catmandu fixes. For instance you can set the first indicator to '1':

  set_field(tmp.*.ind1 , 1)

The JSON path tmp.*.ind1 will match all the first indicators. The JSON path tmp.*.tag will match all the MARC tags. The JSON path tmp.*.subfields.*.a will match all the $a subfields. For instance, to change all 'Perl' into 'Python' in the $a subfield use this Fix:

  replace_all(tmp.*.subfields.*.a,"Perl","Python")

When the fields need to be placed back into the record the marc_paste command can be used:

   marc_paste(tmp)

This will add all 650 fields in the tmp temporary variable at the end of the record. You can change the MARC fields in place using the marc_each binder:

  do marc_each()
     # Select only the 650 fields
     if marc_has(650)
        # Create a working copy
        marc_copy(650,tmp)

        # Change some fields
        set_field(tmp.*.ind1 , 1)

        # Paste the result back
        marc_paste(tmp)
     end
  end

The marc_cut Fix works like marc_copy but will delete the matching MARC field from the record.

Rename MARC subfields

In the example below we rename each $1 subfield in the MARC record to $0 using the Catmandu::Fix::marc_cut, Catmandu::Fix::marc_paste and Catmandu::Fix::rename fixes:

    # For each marc field...
    do marc_each()
       # Cut the field into tmp..
       marc_cut(***,tmp)

       # Rename every 1 subfield to 0
       rename(tmp.*.subfields.*,1,0)

       # And paste it back
       marc_paste(tmp)
    end

The marc_each bind will loop over all the MARC fields. With marc_cut we store any field (*** matches every field) into a tmp field. The marc_cut creates an array structure in tmp which is easy to process using the Fix language. Using the rename function we search for all the subfields, and replace the field matching the regular expression 1 with 0. At the end, we paste back the tmp field into the record.

WRITING

Convert a MARC record into a MARC record (do nothing)

    $ catmandu convert MARC to MARC < data.mrc > output.mrc

Add a 900a field with value 'checked' to all records

    $ catmandu convert MARC to MARC --fix 'marc_add("900",a,"checked")' < data.mrc > output.mrc

Delete the 024 fields from all MARC records

    $ catmandu convert MARC to MARC --fix 'marc_remove("024")' < data.mrc > output.mrc

Set the 650p field to 'test' for all records

    $ catmandu convert MARC to MARC --fix 'marc_set("650p","test")' < data.mrc > output.mrc

Select only the records with 900a == book

    $ catmandu convert MARC to MARC --fix 'marc_map(900a,type); select all_match(type,book)' < data.mrc > output.mrc