NAME

dtatw-sanitize-header.perl - make DDC/DTA-friendly TEI-headers

SYNOPSIS

 dtatw-sanitize-header.perl [OPTIONS] XML_HEADER_FILE

 General Options:
  -help                  # this help message
  -verbose LEVEL         # set verbosity level (0<=LEVEL<=2)
  -quiet                 # alias for -verbose=0
  -dta , -foreign        # do/don't warn about strict DTA header compliance (default=do)
  -max-bibl-length LEN   # trim bibl fields to maximum length LEN (default=256)

 Auxiliary DB Options:   # optional BASENAME-keyed JSON-metadata Berkeley DB
  -aux-db DBFILE         # read auxiliary DB from DBFILE (default=none)
  -aux-xpath XPATH       # append <idno type="KEY"> elements to XPATH (default='fileDesc[@n="ddc-aux"]')

 XPath Options:
  -xpath ATTR=XPATH      # prepend XPATH for attribute ATTR
  -default ATTR=VAL      # default values (for textClass* attributes)

 I/O Options:
  -blanks , -noblanks    # do/don't keep 'ignorable' whitespace in XML_HEADER_FILE (default=don't)
  -base BASENAME         # use BASENAME to auto-compute field names (default=basename(XML_HEADER_FILE))
  -output FILE           # specify output file (default='-' (STDOUT))

OPTIONS AND ARGUMENTS

General Options

-h, -help

Display a brief usage summary and exit.

-v, -verbose LEVEL

Set verbosity level; values for LEVEL are:

 0: silent
 1: warnings only
 2: warnings and progress messages

-q, -quiet

Alias for -verbose=0.

-b, -basename BASENAME

Set the basename used for generated header fields. The default is the basename (non-directory portion) of XML_HEADER_FILE up to but not including the first dot (".") character, if any. In the default -dta mode, a BASENAME passed via this option is likewise truncated at its first dot; in -foreign mode, dots in a BASENAME passed via this option are preserved.
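The basename rules above can be sketched in a few lines of Perl. This is only an illustration of the documented behavior, not the script's actual code; the helper name header_basename and the sample path are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

## Hypothetical helper: compute the BASENAME used for generated fields.
##  $file     : path to XML_HEADER_FILE
##  $opt_base : value of -basename, or undef if the option was not given
##  $dta_mode : true in -dta mode, false in -foreign mode
sub header_basename {
  my ($file, $opt_base, $dta_mode) = @_;
  my $base = defined($opt_base) ? $opt_base : basename($file);
  ## truncate at the first dot, except for an explicit -basename in -foreign mode
  $base =~ s/\..*\z//s if !defined($opt_base) || $dta_mode;
  return $base;
}

print header_basename('/data/goethe_faust_1808.TEI-P5.header.xml', undef, 1), "\n";
print header_basename('/data/header.xml', 'my.corpus.doc', 0), "\n";
```

The first call falls back to the file name and yields goethe_faust_1808; the second keeps the dots because an explicit basename was passed in -foreign mode.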

-dta, -nodta

Do/don't run with DTA-specific heuristics and attempt to enforce DTA-header compliance (default: do).

-foreign

Alias for -nodta.

-l, -max-bibl-len LEN

Trim sanitized bibl fields to a maximum length of LEN characters (default=256).

Auxiliary DB Options

You can optionally use a BASENAME-keyed JSON-metadata Berkeley DB file to automatically insert additional metadata fields into an existing header.

-aux-db DBFILE

Apply auxiliary metadata from Berkeley DB file DBFILE (default=none). Keys of DBFILE should be BASENAMEs as parsed from XML_HEADER_FILE or passed in via the -basename option, and the associated values should be flat JSON objects whose keys are the names of metadata attributes for BASENAME and whose values are the values of those metadata attributes.

-aux-xpath XPATH

Append <idno type="KEY">VAL</idno> elements to XPATH (default='fileDesc[@n="ddc-aux"]') for auxiliary metadata attributes.
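To make the expected DB layout concrete, the following sketch turns one flat JSON record into <idno> elements. It is a minimal illustration under stated assumptions: a plain hash stands in for the Berkeley DB (a real run would presumably tie the hash to DBFILE, for example with DB_File), and the sample record and the helper names xml_escape and aux_idnos are hypothetical.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

## Stand-in for the Berkeley DB: BASENAME => flat JSON object (hypothetical data)
my %auxdb = (
  'goethe_faust_1808' => '{"genre":"Drama","collection":"demo"}',
);

## Minimal XML escaping for element content and attribute values
sub xml_escape {
  my $s = shift;
  $s =~ s/&/&amp;/g;
  $s =~ s/</&lt;/g;
  $s =~ s/>/&gt;/g;
  $s =~ s/"/&quot;/g;
  return $s;
}

## Return one <idno type="KEY">VAL</idno> string per auxiliary attribute,
## sorted by key; returns the empty list if BASENAME is not in the DB
sub aux_idnos {
  my ($db, $basename) = @_;
  my $json = $db->{$basename};
  return () if !defined $json;
  my $meta = decode_json($json);
  return map {
    sprintf('<idno type="%s">%s</idno>', xml_escape($_), xml_escape($meta->{$_}))
  } sort keys %$meta;
}

print "$_\n" for aux_idnos(\%auxdb, 'goethe_faust_1808');
```

In the script itself, the resulting elements are appended beneath the node selected by -aux-xpath; the sketch only shows how a record maps to elements.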

XPath Options

You can optionally specify source XPaths to override the defaults with the -xpath option.

-xpath ATTR=XPATH

Prepend XPATH to the builtin list of source XPaths for the attribute ATTR. Known attributes: author title date bibl shelfmark library dirname dtaid timestamp availability avail textClassDTA textClassDWDS textClassCorpus.

-default ATTR=VALUE

Default value for attribute ATTR. Only used for textClass* attributes.

I/O Options

-[no]keep-blanks

Do/don't retain all whitespace in input file (default=don't).

-o, -output OUTFILE

Write output to OUTFILE; default="-" (standard output).

-format LEVEL

Format output at libxml level LEVEL (default=1).

DESCRIPTION

dtatw-sanitize-header.perl applies some parsing and encoding heuristics to a TEI-XML header file XML_HEADER_FILE in an attempt to ensure compliance with DTA/D* header conventions for subsequent DDC indexing. For each supported metadata attribute, a corresponding header record is first sought by means of a first-match-wins XPath list. If no existing header record is found, a default (possibly empty) value is heuristically assigned, and the resulting value is inserted into the header at a conventional XPath location.

The metadata attributes currently supported are listed below. Source XPaths in the list are specified relative to the root <teiHeader> element, and unless otherwise noted, the first source XPath listed is also the target XPath, which is guaranteed to exist in the output header on successful script completion.

See https://kaskade.dwds.de/dstar/doc/README.html#bibliographic_metadata_attributes for details on D* metadata attribute conventions.
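The first-match-wins lookup described above can be sketched with XML::LibXML. This is an illustration rather than the script's own implementation: the helper name first_match is hypothetical, and the sample header deliberately omits the TEI namespace so that the unprefixed XPaths match as written.

```perl
use strict;
use warnings;
use XML::LibXML;

## Hypothetical, namespace-less sample header
my $xml = <<'EOXML';
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Faust</title>
      <author>Goethe, Johann Wolfgang von</author>
    </titleStmt>
  </fileDesc>
</teiHeader>
EOXML

## Return the first node matched by any XPath in the list, or undef if none match
sub first_match {
  my ($root, @xpaths) = @_;
  foreach my $xp (@xpaths) {
    my ($node) = $root->findnodes($xp);
    return $node if defined $node;
  }
  return;
}

my $root = XML::LibXML->load_xml(string => $xml)->documentElement;
my @author_xpaths = (
  'fileDesc/titleStmt/author[@n="ddc"]',            ## canonical target, absent here
  'fileDesc/titleStmt/author',                      ## matches the sample
  'fileDesc/sourceDesc/biblFull/titleStmt/author',  ## never consulted
);
my $node = first_match($root, @author_xpaths);
print((defined($node) ? $node->textContent : '(no match)'), "\n");
```

Because the list is scanned in order, an -xpath option that prepends an XPath takes priority over every builtin entry for that attribute.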

author

XPath(s):

 fileDesc/titleStmt/author[@n="ddc"]                                                    ##-- ddc: canonical target (formatted)
 fileDesc/titleStmt/author                                                              ##-- new (direct, un-formatted)
 fileDesc/sourceDesc/biblFull/titleStmt/author                                          ##-- new (sourceDesc, un-formatted)
 fileDesc/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"]                     ##-- new (direct, un-formatted)
 fileDesc/sourceDesc/biblFull/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"] ##-- new (sourceDesc, un-formatted)
 fileDesc/sourceDesc/listPerson[@type="searchNames"]/person/persName                    ##-- old

Heuristically parses and formats persName, surname, forename, and genName elements to a human-readable string. In DTA mode, defaults to the first component of the "_"-separated BASENAME.

title

XPath(s):

 fileDesc/titleStmt/title[@type="main" or @type="sub" or @type="vol"]   ##-- DTA-mode only
 fileDesc/titleStmt/title[@type="ddc"]                                  ##-- ddc: canonical target (formatted)
 fileDesc/titleStmt/title[not(@type)]
 sourceDesc[@id="orig"]/biblFull/titleStmt/title
 sourceDesc[@id="scan"]/biblFull/titleStmt/title
 sourceDesc[not(@id)]/biblFull/titleStmt/title

In DTA mode, heuristically parses and formats @type="main", @type="sub", @type="vol" elements to a human-readable string, and defaults to the second component of the "_"-separated BASENAME.

date

XPath(s):

 fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="pub"]               ##-- ddc: canonical target
 fileDesc/sourceDesc[@n="scan"]/biblFull/publicationStmt/date                           ##-- old:publDate
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"]/supplied        ##-- new:date (published, supplied)
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"]                 ##-- new:date (published)
 fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied                             ##-- new:date (generic, supplied)
 fileDesc/sourceDesc/biblFull/publicationStmt/date                                      ##-- new:date (generic)

Heuristically trims everything but digits and hyphens from the extracted date-string. In DTA mode, defaults to the final component of the "_"-separated BASENAME.
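A minimal sketch of the two date heuristics just described, assuming nothing beyond the text above; the helper names trim_date and date_from_basename and the sample strings are hypothetical.

```perl
use strict;
use warnings;

## Keep only digits and hyphens, e.g. "[ca. 1808-1810]" becomes "1808-1810"
sub trim_date {
  my $date = shift;
  $date =~ tr/0-9\-//cd;
  return $date;
}

## DTA-mode fallback: final component of the "_"-separated BASENAME
sub date_from_basename {
  my $basename = shift;
  my @parts = split(/_/, $basename);
  return @parts ? $parts[-1] : '';
}

print trim_date('[ca. 1808-1810]'), "\n";
print date_from_basename('goethe_faust_1808'), "\n";
```

The same split also explains the author and title defaults: in DTA mode they are taken from the first and second components of BASENAME respectively.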

firstDate

XPath(s):

 fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="first"]             ##-- ddc: canonical target
 fileDesc/sourceDesc[@n="orig"]/biblFull/publicationStmt/date                           ##-- old: publDate
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"]/supplied   ##-- new:date (first, supplied)
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"]            ##-- new:date (first)
 fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied                             ##-- new:date (generic, supplied)
 fileDesc/sourceDesc/biblFull/publicationStmt/date                                      ##-- new:date (generic)

Heuristically trims everything but digits and hyphens from the extracted date-string. Defaults to the publication date (see above).

bibl

XPath(s):

 fileDesc/sourceDesc[@n="ddc"]/bibl     ##-- ddc:canonical target
 fileDesc/sourceDesc[@n="orig"]/bibl    ##-- old:firstBibl, target
 fileDesc/sourceDesc[@n="scan"]/bibl    ##-- old:publBibl
 fileDesc/sourceDesc/bibl               ##-- new|old:generic

Heuristically generated from author, title, and date if not set. Ensures that the first two XPaths listed above are set in the output file.
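The fallback can be pictured roughly as follows. The "AUTHOR: TITLE. DATE" layout and the helper name make_bibl are assumptions made for illustration; the script's real output format may differ, and only the 256-character default comes from the documentation of -max-bibl-len.

```perl
use strict;
use warnings;

## Hypothetical bibl generator: join the three fields, then trim to $max_len
sub make_bibl {
  my ($author, $title, $date, $max_len) = @_;
  $max_len = 256 if !defined $max_len;   ## documented default of -max-bibl-len
  my $bibl = "$author: $title. $date";   ## assumed layout, not confirmed by the source
  $bibl = substr($bibl, 0, $max_len) if length($bibl) > $max_len;
  return $bibl;
}

print make_bibl('Goethe, Johann Wolfgang von', 'Faust', '1808'), "\n";
print make_bibl('Goethe, Johann Wolfgang von', 'Faust', '1808', 10), "\n";
```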

shelfmark

XPath(s):

 fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno/idno[@type="shelfmark"]         ##-- ddc: canonical target
 fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno[@type="shelfmark"]              ##-- -2013-08-04
 fileDesc/sourceDesc/msDesc/msIdentifier/idno/idno[@type="shelfmark"]
 fileDesc/sourceDesc/msDesc/msIdentifier/idno[@type="shelfmark"]                        ##-- new (>=2012-07)
 fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/ident[@type="shelfmark"] ##-- old (<2012-07)

library

XPath(s):

 fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/repository                           ##-- ddc: canonical target
 fileDesc/sourceDesc/msDesc/msIdentifier/repository                                     ##-- new
 fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/name[@type="repository"] ##-- old

basename (dtadir)

XPath(s):

 fileDesc/publicationStmt[@n="ddc"]/idno[@type="basename"]      ##-- new: canonical target
 fileDesc/publicationStmt/idno/idno[@type="DTADirName"]         ##-- (>=2013-09-04)
 fileDesc/publicationStmt/idno[@type="DTADirName"]              ##-- (>=2013-09-04)
 fileDesc/publicationStmt/idno[@type="DTADIRNAME"]              ##-- new (>=2012-07)
 fileDesc/publicationStmt/idno[@type="DTADIR"]                  ##-- old (<2012-07)

Heuristically set to BASENAME if not found.

dtaid

XPath(s):

 fileDesc/publicationStmt[@n="ddc"]/idno[@type="dtaid"]         ##-- ddc: canonical target
 fileDesc/publicationStmt/idno/idno[@type="DTAID"]
 fileDesc/publicationStmt/idno[@type="DTAID"]

Defaults to "0" (zero) if unset.

timestamp

XPath(s):

 fileDesc/publicationStmt/date[@type="ddc-timestamp"]           ##-- ddc: canonical target
 fileDesc/publicationStmt/date                                  ##-- DTA mode only

If not set, defaults to the last modification time of XML_HEADER_FILE, or to the current time.
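One plausible reading of this default is sketched below: use the file's modification time when it can be determined (it cannot when reading from standard input, for instance), and the current time otherwise. The helper name default_timestamp and the ISO 8601 UTC output format are assumptions; the source does not specify the timestamp format.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

## Hypothetical helper: mtime of $file if it exists, else the current time,
## formatted as an ISO 8601 UTC string (format is an assumption)
sub default_timestamp {
  my $file = shift;
  my $mtime = (defined($file) && -e $file) ? (stat($file))[9] : undef;
  $mtime = time() if !defined $mtime;
  return strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($mtime));
}

print default_timestamp($0), "\n";    ## mtime of this script, if it is a file
print default_timestamp('-'), "\n";   ## no such file: falls back to the current time
```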

availability (human-readable)

XPath(s):

 fileDesc/publicationStmt/availability[@type="ddc"]
 fileDesc/publicationStmt/availability

Defaults to "-" if unset.

avail (DWDS code)

XPath(s):

 fileDesc/publicationStmt/availability[@type="ddc_dwds"]
 fileDesc/publicationStmt/availability/@n

Defaults to "-" if unset.

textClass

Source XPath(s):

 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
 profileDesc/textClass/keywords/term ##-- dwds keywords

Target XPath:

 profileDesc/textClass/classCode[@scheme="ddcTextClassDWDS"]

textClassDTA

Source XPath(s):

 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]

Target XPath:

 profileDesc/textClass/classCode[@scheme="ddcTextClassDTA"]

DTA corpus

Source XPath(s):

 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]

Target XPath:

 profileDesc/textClass/classCode[@scheme="ddcTextClassCorpus"]

SEE ALSO

dtatw-get-header.perl(1), ...

AUTHOR

Bryan Jurish <moocow@cpan.org>