dtatw-sanitize-header.perl - make DDC/DTA-friendly TEI-headers
 dtatw-sanitize-header.perl [OPTIONS] XML_HEADER_FILE

 General Options:
  -help                    # this help message
  -verbose LEVEL           # set verbosity level (0<=LEVEL<=1)
  -quiet                   # alias for -verbose=0
  -dta , -foreign          # do/don't warn about strict DTA header compliance (default=do)
  -max-bibl-length LEN     # trim bibl fields to maximum length LEN (default=256)

 Auxiliary DB Options:     # optional BASENAME-keyed JSON-metadata Berkeley DB
  -aux-db DBFILE           # read auxiliary DB from DBFILE (default=none)
  -aux-xpath XPATH         # append <idno type="KEY"> elements to XPATH (default='fileDesc[@n="ddc-aux"]')

 XPath Options:
  -xpath ATTR=XPATH        # prepend XPATH for attribute ATTR
  -default ATTR=VAL        # default values (for textClass* attributes)

 I/O Options:
  -blanks , -noblanks      # do/don't keep 'ignorable' whitespace in XML_HEADER_FILE file (default=don't)
  -base BASENAME           # use BASENAME to auto-compute field names (default=basename(XML_HEADER_FILE))
  -output FILE             # specify output file (default='-' (STDOUT))
Display a brief usage summary and exit.
Set verbosity level; values for LEVEL are:
 0: silent
 1: warnings only
 2: warnings and progress messages
Alias for -verbose=0
Set basename for generated header fields; default is the basename (non-directory portion) of XML_HEADER_FILE up to but not including the first dot (".") character, if any. In default -dta mode, everything after the first dot character in BASENAME will be truncated even if you specify this option; in -foreign mode, dots in basenames passed in via this option are allowed.
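The basename rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the script's own code: the function name `default_basename`, its parameters, and the sample file name are hypothetical.

```python
import os

def default_basename(header_file, explicit=None, foreign=False):
    """Hypothetical helper mirroring the documented BASENAME rule."""
    if explicit is not None and foreign:
        # -foreign mode: a basename passed in explicitly keeps its dots
        return explicit
    # otherwise start from the explicit value or the non-directory file name
    name = explicit if explicit is not None else os.path.basename(header_file)
    # keep everything up to, but not including, the first dot (if any)
    return name.split(".", 1)[0]

# made-up file name, for illustration only
base = default_basename("/data/kant_kritik_1781.TEI-P5.xml")
```

With the made-up path above, `base` is `kant_kritik_1781`; an explicit `a.b` is cut to `a` in the default mode and kept whole when `foreign=True`.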
Do/don't run with DTA-specific heuristics and attempt to enforce DTA-header compliance (default: do).
Alias for -nodta.
Trim sanitized XPaths to maximum length LEN characters (default=256).
You can optionally use a BASENAME-keyed JSON-metadata Berkeley DB file to automatically insert additional metadata fields into an existing header.
Apply auxiliary metadata from Berkeley DB file DBFILE (default=none). Keys of DBFILE should be BASENAMEs as parsed from XML_HEADER_FILE or passed in via the -basename option, and the associated values should be flat JSON objects whose keys are the names of metadata attributes for BASENAME and whose values are the values of those metadata attributes.
Append <idno type="KEY">VAL</idno> elements to XPATH (default='fileDesc[@n="ddc-aux"]') for auxiliary metadata attributes.
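As a minimal sketch of the auxiliary-metadata step, the Python below turns one flat JSON object into `<idno type="KEY">VAL</idno>` children. A plain dict stands in for the Berkeley DB file, and the function name `append_aux_idnos` and the sample record are assumptions made for illustration; the script's actual implementation may differ.

```python
import json
import xml.etree.ElementTree as ET

def append_aux_idnos(parent, aux_db, basename):
    """Append one <idno type="KEY">VAL</idno> per key of basename's JSON object."""
    raw = aux_db.get(basename)
    if raw is None:
        return 0  # no auxiliary record for this basename
    record = json.loads(raw)  # flat JSON object: attribute name -> value
    for key, val in record.items():
        idno = ET.SubElement(parent, "idno", {"type": key})
        idno.text = str(val)
    return len(record)

# stand-in for the Berkeley DB: BASENAME -> flat JSON object (made-up data)
aux_db = {"kant_kritik_1781": '{"genre": "philosophy", "pages": 856}'}
aux = ET.Element("fileDesc", {"n": "ddc-aux"})
count = append_aux_idnos(aux, aux_db, "kant_kritik_1781")
xml_out = ET.tostring(aux, encoding="unicode")
```

Here the parent element is built directly rather than located by XPath, which keeps the sketch independent of any particular header layout.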
You can optionally specify source XPaths to override the defaults with the -xpath option.
Prepend XPATH to the builtin list of source XPaths for the attribute ATTR. Known attributes: author title date bibl shelfmark library dirname dtaid timestamp availability avail textClassDTA textClassDWDS textClassCorpus.
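The effect of one ATTR=XPATH argument can be pictured with the following Python sketch. It assumes the source lists are kept as a mapping from attribute name to an ordered list; the function name `prepend_xpath`, that data layout, and the `@type="custom"` predicate are hypothetical, and how the script treats unknown ATTR names is not modeled here.

```python
def prepend_xpath(xpaths, spec):
    """Apply one ATTR=XPATH spec: put XPATH in front of ATTR's source list."""
    # split on the first "=" only, since XPath predicates may contain "=" too
    attr, sep, xpath = spec.partition("=")
    if not sep or not attr or not xpath:
        raise ValueError(f"expected ATTR=XPATH, got {spec!r}")
    xpaths.setdefault(attr, []).insert(0, xpath)
    return xpaths

# two of the documented author source XPaths
builtin = {
    "author": [
        'fileDesc/titleStmt/author[@n="ddc"]',
        "fileDesc/titleStmt/author",
    ]
}
prepend_xpath(builtin, 'author=fileDesc/titleStmt/author[@type="custom"]')
```

Because lookups are first-match-wins, the prepended XPath is tried before any builtin one.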
Default value for attribute ATTR. Only used for textClass* attributes.
Do/don't retain all whitespace in input file (default=don't).
Write output to OUTFILE; default="-" (standard output).
Format output at libxml level LEVEL (default=1).
dtatw-sanitize-header.perl applies some parsing and encoding heuristics to a TEI-XML header file XML_HEADER_FILE in an attempt to ensure compliance with DTA/D* header conventions for subsequent DDC indexing. For each supported metadata attribute, a corresponding header record is first sought by means of a first-match-wins XPath list. If no existing header record is found, a default (possibly empty) value is heuristically assigned, and the resulting value is inserted into the header at a conventional XPath location.
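The first-match-wins lookup and the insertion at a conventional target location can be sketched as follows. This is an illustration in Python using the standard library's limited XPath support, not the script's code: the helper name `first_match`, the sample header, and the absence of a TEI namespace are simplifying assumptions.

```python
import xml.etree.ElementTree as ET

def first_match(header, xpaths):
    """Return the text of the first element matched by any XPath, in list order."""
    for xp in xpaths:
        node = header.find(xp)
        if node is not None:
            return node.text or ""
    return None  # nothing found: the caller supplies a default

# made-up, namespace-free header for illustration
header = ET.fromstring(
    "<teiHeader><fileDesc><titleStmt>"
    "<author>Kant, Immanuel</author>"
    "</titleStmt></fileDesc></teiHeader>"
)
author_xpaths = [
    'fileDesc/titleStmt/author[@n="ddc"]',  # canonical target, absent here
    "fileDesc/titleStmt/author",            # matches
]
found = first_match(header, author_xpaths)
value = found if found is not None else ""  # possibly empty default

# make sure the canonical target exists in the output header
if header.find(author_xpaths[0]) is None:
    stmt = header.find("fileDesc/titleStmt")
    target = ET.SubElement(stmt, "author", {"n": "ddc"})
    target.text = value
```

The sketch copies the raw value to the target; any formatting heuristics applied by the script are left out.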
The metadata attributes currently supported are listed below. Source XPaths in the list are specified relative to the root <teiHeader> element, and unless otherwise noted, the first source XPath listed is also the target XPath, which is guaranteed to exist in the output header on successful script completion.
See https://kaskade.dwds.de/dstar/doc/README.html#bibliographic_metadata_attributes for details on D* metadata attribute conventions.
XPath(s):
 fileDesc/titleStmt/author[@n="ddc"]  ##-- ddc: canonical target (formatted)
 fileDesc/titleStmt/author  ##-- new (direct, un-formatted)
 fileDesc/sourceDesc/biblFull/titleStmt/author  ##-- new (sourceDesc, un-formatted)
 fileDesc/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"]  ##-- new (direct, un-formatted)
 fileDesc/sourceDesc/biblFull/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"]  ##-- new (sourceDesc, un-formatted)
 fileDesc/sourceDesc/listPerson[@type="searchNames"]/person/persName  ##-- old
Heuristically parses and formats persName, surname, forename, and genName elements to a human-readable string. In DTA mode, defaults to the first component of the "_"-separated BASENAME.
 fileDesc/titleStmt/title[@type="main" or @type="sub" or @type="vol"]  ##-- DTA-mode only
 fileDesc/titleStmt/title[@type="ddc"]  ##-- ddc: canonical target (formatted)
 fileDesc/titleStmt/title[not(@type)]
 sourceDesc[@id="orig"]/biblFull/titleStmt/title
 sourceDesc[@id="scan"]/biblFull/titleStmt/title
 sourceDesc[not(@id)]/biblFull/titleStmt/title
In DTA mode, heuristically parses and formats @type="main", @type="sub", @type="vol" elements to a human-readable string, and defaults to the second component of the "_"-separated BASENAME.
 fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="pub"]  ##-- ddc: canonical target
 fileDesc/sourceDesc[@n="scan"]/biblFull/publicationStmt/date  ##-- old:publDate
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"]/supplied  ##-- new:date (published, supplied)
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"]  ##-- new:date (published)
 fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied  ##-- new:date (generic, supplied)
 fileDesc/sourceDesc/biblFull/publicationStmt/date  ##-- new:date (generic, supplied)
Heuristically trims everything but digits and hyphens from the extracted date-string. In DTA mode, defaults to the final component of the "_"-separated BASENAME.
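The DTA-mode defaults for author, title, and date all come from the "_"-separated BASENAME. A minimal Python sketch of that split is shown below; the function name `basename_defaults`, the sample basename, and the handling of basenames with fewer than two components are assumptions, since the documentation does not specify them.

```python
def basename_defaults(basename):
    """Derive DTA-mode default values from a "_"-separated BASENAME."""
    parts = basename.split("_")
    return {
        "author": parts[0],                            # first component
        "title": parts[1] if len(parts) > 1 else "",   # second component
        "date": parts[-1] if len(parts) > 1 else "",   # final component
    }

# made-up basename, for illustration only
defaults = basename_defaults("kant_kritik_1781")
```

For the sample basename this yields author `kant`, title `kritik`, and date `1781`.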
 fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="first"]  ##-- ddc: canonical target
 fileDesc/sourceDesc[@n="orig"]/biblFull/publicationStmt/date  ##-- old: publDate
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"]/supplied  ##-- new:date (first, supplied)
 fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"]  ##-- new:date (first)
 fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied  ##-- new:date (generic, supplied)
 fileDesc/sourceDesc/biblFull/publicationStmt/date  ##-- new:date (generic, supplied)
Heuristically trims everything but digits and hyphens from the extracted date-string. Defaults to the publication date (see above).
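The date trimming and the fallback to the publication date can be sketched in Python as follows. The function names `trim_date` and `first_date` and the sample strings are hypothetical; the sketch only mirrors the documented rule of keeping digits and hyphens.

```python
import re

def trim_date(raw):
    """Keep only ASCII digits and hyphens from a date string."""
    return re.sub(r"[^0-9-]", "", raw)

def first_date(raw_first, pub_date):
    """Trimmed first-publication date, or the publication date if none is left."""
    trimmed = trim_date(raw_first or "")
    return trimmed if trimmed else pub_date

# made-up input strings
a = trim_date("[ca. 1781-05]")
b = first_date(None, "1787")
```

Here `a` is `1781-05` and `b` falls back to `1787`.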
 fileDesc/sourceDesc[@n="ddc"]/bibl  ##-- ddc:canonical target
 fileDesc/sourceDesc[@n="orig"]/bibl  ##-- old:firstBibl, target
 fileDesc/sourceDesc[@n="scan"]/bibl  ##-- old:publBibl
 fileDesc/sourceDesc/bibl  ##-- new|old:generic
Heuristically generated from author, title, and date if not set. Ensures that the first 2 XPaths are set in the output file.
 fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno/idno[@type="shelfmark"]  ##-- ddc: canonical target
 fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno[@type="shelfmark"]  ##-- -2013-08-04
 fileDesc/sourceDesc/msDesc/msIdentifier/idno/idno[@type="shelfmark"]
 fileDesc/sourceDesc/msDesc/msIdentifier/idno[@type="shelfmark"]  ##-- new (>=2012-07)
 fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/ident[@type="shelfmark"]  ##-- old (<2012-07)

 fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/repository  ##-- ddc: canonical target
 fileDesc/sourceDesc/msDesc/msIdentifier/repository  ##-- new
 fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/name[@type="repository"]  ##-- old

 fileDesc/publicationStmt[@n="ddc"]/idno[@type="basename"]  ##-- new: canonical target
 fileDesc/publicationStmt/idno/idno[@type="DTADirName"]  ##-- (>=2013-09-04)
 fileDesc/publicationStmt/idno[@type="DTADirName"]  ##-- (>=2013-09-04)
 fileDesc/publicationStmt/idno[@type="DTADIRNAME"]  ##-- new (>=2012-07)
 fileDesc/publicationStmt/idno[@type="DTADIR"]  ##-- old (<2012-07)
Heuristically set to BASENAME if not found.
 fileDesc/publicationStmt[@n="ddc"]/idno[@type="dtaid"]  ##-- ddc: canonical target
 fileDesc/publicationStmt/idno/idno[@type="DTAID"]
 fileDesc/publicationStmt/idno[@type="DTAID"]
Defaults to "0" (zero) if unset.
 fileDesc/publicationStmt/date[@type="ddc-timestamp"]  ##-- ddc: canonical target
 fileDesc/publicationStmt/date  ##-- DTA mode only
Defaults to last modification time of XML_HEADER_FILE or the current time if not set.
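One way to read the timestamp default is "use the file's modification time, and fall back to the current time when it cannot be determined". The Python sketch below follows that reading; the function names and the ISO 8601 UTC output format are assumptions, and the script's actual format may differ.

```python
import os
import time

def default_timestamp(header_file):
    """mtime of header_file if it can be read, else the current time (epoch seconds)."""
    try:
        return os.path.getmtime(header_file)
    except OSError:
        # e.g. the file does not exist or cannot be stat'ed
        return time.time()

def format_timestamp(epoch):
    # ISO 8601 in UTC is an assumption made for this sketch
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(epoch))

stamp = format_timestamp(default_timestamp("no-such-header.xml"))
```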
 fileDesc/publicationStmt/availability[@type="ddc"]
 fileDesc/publicationStmt/availability
Defaults to "-" if unset.
 fileDesc/publicationStmt/availability[@type="ddc_dwds"]
 fileDesc/publicationStmt/availability/@n
Source XPath(s):
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
 profileDesc/textClass/keywords/term  ##-- dwds keywords
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassDWDS"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]
profileDesc/textClass/classCode[@scheme="ddcTextClassDTA"]
 profileDesc/textClass/classCode[@scheme="https://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]
 profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]
profileDesc/textClass/classCode[@scheme="ddcTextClassCorpus"]
dtatw-get-header.perl(1), ...
Bryan Jurish <moocow@cpan.org>