Data::Edit::Xml::Lint - lint xml files in parallel using xmllint, report the failure rate and reprocess linted files to fix cross references.
Create some sample xml files, some with errors, lint them in parallel and retrieve the number of errors and failing files:
for my $n(1..$N)                                     # Some projects
 {my $x = Data::Edit::Xml::Lint::new();              # New xml file linter

  my $catalog = $x->catalog = catalogName;           # Use catalog if possible
  my $project = $x->project = projectName($n);       # Project name
  my $file    = $x->file    = fileName($n);          # Target file

  $x->source = <<END;                                # Sample source
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//HPE//DTD HPE DITA Concept//EN" "concept.dtd" []>
<concept id="$project">
 <title>Project $project</title>
 <conbody>
  <p>Body of $project</p>
 </conbody>
</concept>
END

  $x->source =~ s/id="\w+?"//gs if addError($n);     # Introduce an error into some projects

  $x->lint(foo=>1);                                  # Write the source to the target file, lint using xmllint, include some attributes to be included as comments at the end of the target file
 }

Data::Edit::Xml::Lint::wait;                         # Wait for lints to complete

say STDERR Data::Edit::Xml::Lint::report($outDir, "xml")->print;  # Report total pass fail rate
Produces:
50 % success converting 3 projects containing 10 xml files on 2017-07-13 at 17:43:24

ProjectStatistics
   #  Percent  Pass  Fail  Total  Project
   1  33.3333     1     2      3  aaa
   2  50.0000     2     2      4  bbb
   3  66.6667     2     1      3  ccc

FailingFiles
   #  Errors  Project  File
   1       1  ccc      out/ccc5.xml
   2       1  aaa      out/aaa9.xml
   3       1  bbb      out/bbb1.xml
   4       1  bbb      out/bbb7.xml
   5       1  aaa      out/aaa3.xml
Once a file has been linted, it can be reread with read to obtain details about the xml including any id attributes defined (see: idDefs below) and any labels that refer to these id attributes (see: labelDefs below). Such labels provide additional identities for a node beyond that provided by the id attribute.
{catalog    => "/home/phil/hp/dtd/Dtd_2016_07_12/catalog-hpe.xml",
 definition => "bbb",
 docType    => "<!DOCTYPE concept PUBLIC \"-//HPE//DTD HPE DITA Concept//EN\" \"concept.dtd\" []>",
 errors     => 1,
 file       => "out/bbb1.xml",
 foo        => 1,
 header     => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
 idDefs     => { bbb => 1, c1 => 1 },
 labelDefs  => {
                 bbb      => "bbb",
                 c1       => "c1",
                 conbody1 => "c1",
                 conbody2 => "c1",
                 concept1 => "bbb",
                 concept2 => "bbb",
               },
 labels     => "bbb concept1 concept2",
 project    => "bbb",
 sha256     => "b00cdebf2e1837fa15140d25315e5558ed59eb735b5fad4bade23969babf9531",
 source     => "..."
}
To fix references between files, a list of files can be relinted, which performs the following actions:
Reads the specified files via read.
Constructs an id map to locate ids from the labels defined in the specified files.
Reparses each of the specified files to build a parse tree representing the xml in each file.
Calls a user supplied sub, passing it the parse tree for each specified file and the id map. The sub should traverse the parse tree, fixing attributes which make references between the files using the supplied id map.
Writes any modified parse trees back to the originating file, thus saving the changes.
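The id map step can be pictured with plain Perl data. The following is a minimal sketch and not the module's implementation: it assumes each reread file yields a hash with project, file and labelDefs keys shaped like the dump shown earlier, and that the map has the {project}{label or id} = [file, id] shape described for relint. The helper name buildIdMap and the second sample record are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical records shaped like the hash obtained by rereading a linted file.
my @lints =
 ({project   => "bbb",
   file      => "out/bbb1.xml",
   labelDefs => {bbb => "bbb", c1 => "c1", conbody1 => "c1", concept1 => "bbb"}},
  {project   => "bbb",
   file      => "out/bbb7.xml",
   labelDefs => {d1 => "d1", concept9 => "d1"}},
 );

# Build {project}{label or id} = [file, id] from the label definitions of each file.
sub buildIdMap
 {my (@records) = @_;
  my %map;
  for my $lint(@records)
   {my $defs = $lint->{labelDefs};
    for my $label(sort keys %$defs)
     {$map{$lint->{project}}{$label} = [$lint->{file}, $defs->{$label}];
     }
   }
  return \%map;
 }

my $idMap = buildIdMap(@lints);

# Resolve a label to the file and id that define it.
my ($file, $id) = @{$idMap->{bbb}{concept9}};
print join("#", $file, $id), "\n";                   # out/bbb7.xml#d1
```

A user supplied sub could consult such a map while walking a parse tree to rewrite a reference from a label to the file and id that define it. The sketch keeps only the last definition of a label, so it does not detect labels defined more than once.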
Version 20190708.
The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.
Construct a new linter
Create a new xml linter - call this method statically as in Data::Edit::Xml::Lint::new() and then fill in the relevant Attributes.
This is a static method and so should be invoked as:
Data::Edit::Xml::Lint::new
Attributes describing a lint.
Optional author of the xml - only needed if you want to generate an SDL file map.
Optional catalog file containing the locations of the DTDs used to validate the xml or use dtds to supply a DTD instead.
Number of compressed errors discovered.
Text of compressed errors.
Optional Dita topic type(concept|task|troubleshooting|reference) of the xml - only needed if you want to generate an SDL file map.
The second line: the document type extracted from the source.
Optional directory containing the DTDs used to validate the xml.
Total number of uncompressed lint errors detected by xmllint over all files.
Text of uncompressed lint errors detected by xmllint over all files.
File that the xml should be written to or read from by lint, read or relint.
File number - assigned by the caller to help debugging transformations.
The file and line number of the caller so we can identify which request for lint gave rise to a particular file
Guid or id of the outermost tag - if not supplied the first definition encountered in each file will be used on the basis that all Dita topics require an id.
The first line: the xml header extracted from source.
{id} = count - the number of times this id is defined in the xml contained in this file.
The file from which this xml was obtained.
{label or id} = id - the id of the node containing a label defined on the xml.
Optional parse tree to supply labels for the current source as the labels are present in the parse tree not in the string representing the parse tree.
Date the lint was performed by lint. We avoid adding a time as well because this then induces much longer sync times with AWS S3.
Preferred representation of the xml source, used by relint to supply a preferred representation for the source.
Maximum number of xmllint processes to run in parallel - 8 by default if linting in parallel is being used. Linting in parallel is pointless if each file is already being converted in parallel. Conversely, linting in parallel is helpful if the xml files are being converted serially.
Optional project name to allow error counts to be aggregated by project and to allow id and labels to be scoped to the files contained in each project.
List of projects in which this file is reused, which can be set via reuseFileInProject every time you discover another project in which a file is reused.
The source Xml to be written to file and linted.
Optional title of the xml - only needed if you want to generate an SDL file map.
Lint xml files in parallel
Squeeze a string so it can be safely stored inside a blank-separated list inside an xml comment.
   Parameter  Description
1  $ref       String to squeeze
Data::Edit::Xml::Lint::squeezeDitaRef
Store just the attributes in a file so that they can be retrieved later to process non xml objects referenced in the xml - like images
   Parameter    Description
1  $lint        Linter
2  %attributes  Attributes to be recorded as xml comments
Reread a linted xml file and extract the attributes associated with the lint
   Parameter  Description
1  $file      File containing xml
Example:
if (1)
 {my $x = Data::Edit::Xml::new(<<END);
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd" []>
<concept id="c1">
 <title/>
 <conbody>
 </conbody>
</concept>
END

  $x->addLabels_c2_c3_c4;
  $x->createGuidId;
  is_deeply [$x->getLabels], [qw(c1 c2 c3 c4)];

  my $l = new;                                       # Linter
  $l->catalog   = $catalog;                          # Catalog
  $l->ditaType  = -t $x;                             # Topic type
  $l->file      = fpf($outDir, q(zzz.dita));         # Output file
  $l->guid      = $x->id;                            # Guid
  $l->inputFile = q(zzz.xml);                        # Add source file information
  $l->labels    = $x;                                # Add label information to the output file so when all the files are written they can be retargeted by Data::Edit::Xml::Lint
  $l->project   = q(aaa);                            # Group files into Id scopes
  $l->title     = q(test lint);                      # Title
  $l->source    = $x->ditaPrettyPrintWithHeaders;    # Source from parse tree
  $l->lint;

  my $m = &read($l->file);
  my $y = &reload($l->file);

  ok $l->source eq $m->source;
  ok -p $x eq -p $y;
  is_deeply [$x->getLabels], [$y->getLabels];

  clearFolder($outDir, 1e2);
 }
Data::Edit::Xml::Lint::read
Reload a parse tree from a linted file restoring any labels and return the parse tree or undef if the file is not a lint file.
   Parameter  Description
1  $file      File to read
Get all the attributes minus the source of all the linted files in the specified folder
   Parameter  Description
1  $folder    Folder to search
if (1)
 {my @a = lintAttributes($outDir);
  ok $_->project eq q(aaa) for @a;
 }
Data::Edit::Xml::Lint::lintAttributes
Locate all the labels or ids in the specified files, analyze the map of labels and ids with analysisSub, parse each file, process each parse with processSub, then lint the reprocessed xml back to the original file. This allows you to reprocess the contents of each file with knowledge of where labels or ids are located in the other files associated with a project. The analysisSub(linkmap = {project}{label or id} = [file, id]) should return true if the processing of each file is to be performed subsequently. The processSub(parse tree representation of a file, id and label mapping, reloaded linter) should return true if a lint is required to save the results after each file has been processed, else false. Optionally, the analysisSub may set the preferredSource attribute to indicate the preferred representation of the xml.
   Parameter              Description
1  $processes             Maximum number of processes to use
2  $analysisSub           Analysis sub
3  $processSub            Process sub
4  @foldersAndExtensions  Folders and extensions of files to process (recursively)
Data::Edit::Xml::Lint::relint
Return the unique (file, leading id) of the specified link in the link map or () if no such definition exists
   Parameter  Description
1  $linkMap   Link map
2  $link      Label
Data::Edit::Xml::Lint::resolveUniqueLink
Return a url encoded string
   Parameter  Description
1  $s         String
Data::Edit::Xml::Lint::urlEncode
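As an illustration of what url encoding a string involves, here is a minimal sketch in pure Perl. It is an assumption rather than the module's code: the name urlEncodeSketch is hypothetical, and the choice to leave only the RFC 3986 unreserved characters unencoded may differ from what urlEncode actually does.

```perl
use strict;
use warnings;

# Percent-encode every byte that is not an RFC 3986 unreserved character.
sub urlEncodeSketch
 {my ($s) = @_;
  utf8::encode($s);                                  # Encode our copy of the string as UTF-8 bytes
  $s =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
  return $s;
 }

print urlEncodeSketch("a b/c.xml#id"), "\n";         # a%20b%2Fc.xml%23id
```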
Given a specified $link, return the unique (file, leading id, topic) of the specified link in the link map or () if no such definition exists
   Parameter    Description
1  $linkMap     Link map
2  $fileToGuid  File map
3  $link        Label
4  $sourceFile  File we are resolving from
Data::Edit::Xml::Lint::resolveDitaLink
Record the reuse of the specified file in the specified project
   Parameter  Description
1  $file      Name of file that is being reused
2  $project   Name of project in which it is reused
Data::Edit::Xml::Lint::reuseFileInProject
Return the unique definition of the specified link in the link map or undef if no such definition exists
   Parameter     Description
1  $fileToGuids  File to guids map
2  $file         File
Data::Edit::Xml::Lint::resolveFileToGuid
Return ([project; source label or id; targets count]*) of all labels or id that have multiple definitions
   Parameter   Description
1  $labelDefs  Label definitions
Data::Edit::Xml::Lint::multipleLabelDefs
Return a report showing labels and id with multiple definitions in each project ordered by most defined
   Parameter   Description
1  $labelDefs  Label and Id definitions
Data::Edit::Xml::Lint::multipleLabelDefsReport
Return ([project; label or id]*) of all labels or ids that have a single definition
Data::Edit::Xml::Lint::singleLabelDefs
Return a report showing labels or ids with just one definition, ordered by project, label name
Data::Edit::Xml::Lint::singleLabelDefsReport
Methods for reporting the results of linting several files
Analyze the results of prior lints and return a hash reporting various statistics and a printable report
   Parameter         Description
1  $outputDirectory  Directory to search
2  $filter           Optional regular expression to filter files
Data::Edit::Xml::Lint::report
Fix the dita xref href attributes in the corpus determined by foldersAndExtensions.
   Parameter                  Description
1  $maximumNumberOfProcesses  Maximum number of processes to run in parallel
2  @foldersAndExtensions      Folders and file extensions to process.

fixDitaXrefHrefs(1, $outDir, "xml");
Compressed errors over all files
Array of [number of errors, project, files] ordered from least to most errors
{docType}++ - Hash of document types encountered
[Projects with xmllint errors]
File selection filter
Number of files encountered
Number of projects defined - each project can contain zero or more files
[Projects with no xmllint errors]
Total number of passes as a percentage of all input files
A printable report of the above
Timestamp of report
Total number of errors summed file by file
Number of compressed errors
Total number of errors
Lint a file using xmllint and update the source file with the results in text format so that they are easy to search with grep.
Compress the errors so we count the ones that do not look similar. Errors typically occupy three lines with the last line containing ^ at the end to mark the location of the error.
   Parameter  Description
1  @errors    Errors
Data::Edit::Xml::Lint::compressErrors
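The idea of counting only errors that do not look alike can be sketched as follows. This is an assumption about one plausible approach, not the module's algorithm: it treats a line holding only a caret as the end of an error, and regards two errors as similar when their first lines match once a leading file:line: prefix is removed. The name compressErrorsSketch and the sample messages are illustrative.

```perl
use strict;
use warnings;

# Group lines into errors (each closed by a caret line), then count the
# distinct messages that remain after the "file:line: " prefix is dropped.
sub compressErrorsSketch
 {my (@lines) = @_;
  my (@errors, @current);
  for my $line(@lines)
   {push @current, $line;
    if ($line =~ m/\A\s*\^\s*\z/)                    # Caret line closes an error
     {push @errors, [@current];
      @current = ();
     }
   }
  push @errors, [@current] if @current;              # Trailing lines without a caret

  my %seen;
  for my $error(@errors)
   {my $message = $error->[0];
    $message =~ s/\A[^:]*:\d+:\s*//;                 # Drop file name and line number
    $seen{$message}++;
   }
  return \%seen;
 }

my @sample =
 ('out/aaa3.xml:3: element concept: validity error : Element concept does not carry attribute id',
  '<concept>',
  '        ^',
  'out/bbb1.xml:3: element concept: validity error : Element concept does not carry attribute id',
  '<concept>',
  '        ^',
 );

my $compressed = compressErrorsSketch(@sample);
print scalar(keys %$compressed), "\n";               # 1
```

Here two uncompressed errors from different files collapse to one compressed error, which is the distinction the report draws between total errors and compressed errors.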
Format the attributes section of the output file
   Parameter    Description
1  $attributes  Hash of attributes
Record the reuse of an item in the named project
   Parameter  Description
1  $project   Name of the project in which it is reused
Data::Edit::Xml::Lint::reuseInProject
Count the number of targets this link resolves to.
Data::Edit::Xml::Lint::countLinkTargets
Format a fraction as a percentage to 4 decimal places
   Parameter  Description
1  $p         Pass
2  $f         Fail
Data::Edit::Xml::Lint::p4
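The percentages in the ProjectStatistics report shown earlier are consistent with passes divided by passes plus fails, printed to 4 decimal places. A minimal sketch under that assumption follows. The name p4Sketch is hypothetical, and returning undef when there are no files at all is a choice made for this sketch, not documented behaviour.

```perl
use strict;
use warnings;

# Pass rate as a percentage with 4 decimal places.
sub p4Sketch
 {my ($p, $f) = @_;
  my $total = $p + $f;
  return undef unless $total;                        # Assumption: no files means no percentage
  return sprintf("%.4f", 100 * $p / $total);
 }

print p4Sketch(1, 2), "\n";                          # 33.3333
print p4Sketch(2, 2), "\n";                          # 50.0000
print p4Sketch(2, 1), "\n";                          # 66.6667
```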
Create a test file
   Parameter    Description
1  $project     Project name
2  $source      Source of topic
3  $target      Target of references
4  $additional  Additional text for topic
Create some tests
1 author - Optional author of the xml - only needed if you want to generate an SDL file map.
2 catalog - Optional catalog file containing the locations of the DTDs used to validate the xml or use dtds to supply a DTD instead.
3 compressedErrors - Compressed errors over all files
4 compressedErrorText - Text of compressed errors.
5 compressErrors - Compress the errors so we count the ones that do not look similar.
6 countLinkTargets - Count the number of targets this link resolves to.
7 createTest - Create a test file
8 createTests - Create some tests
9 ditaType - Optional Dita topic type(concept|task|troubleshooting|reference) of the xml - only needed if you want to generate an SDL file map.
10 docType - The second line: the document type extracted from the source.
11 docTypes - {docType}++ - Hash of document types encountered
12 dtds - Optional directory containing the DTDs used to validate the xml.
13 errors - Total number of uncompressed lint errors detected by xmllint over all files.
14 errorText - Text of uncompressed lint errors detected by xmllint over all files.
15 failingFiles - Array of [number of errors, project, files] ordered from least to most errors
16 failingProjects - [Projects with xmllint errors]
17 file - File that the xml should be written to or read from by lint, read or relint.
18 fileNumber - File number - assigned by the caller to help debugging transformations.
19 filter - File selection filter
20 fixDitaXrefHrefs - Fix the dita xref href attributes in the corpus determined by foldersAndExtensions.
21 formatAttributes - Format the attributes section of the output file
22 guid - Guid or id of the outermost tag - if not supplied the first definition encountered in each file will be used on the basis that all Dita topics require an id.
23 header - The first line: the xml header extracted from source.
24 idDefs - {id} = count - the number of times this id is defined in the xml contained in this file.
25 inputFile - The file from which this xml was obtained.
26 labelDefs - {label or id} = id - the id of the node containing a label defined on the xml.
27 labels - Optional parse tree to supply labels for the current source as the labels are present in the parse tree not in the string representing the parse tree.
28 lineNumber - The file and line number of the caller so we can identify which request for lint gave rise to a particular file
29 lint - Lint a file using xmllint and update the source file with the results in text format so that they are easy to search with grep.
30 lintAttributes - Get all the attributes minus the source of all the linted files in the specified folder
31 linted - Date the lint was performed by lint.
32 multipleLabelDefs - Return ([project; source label or id; targets count]*) of all labels or id that have multiple definitions
33 multipleLabelDefsReport - Return a report showing labels and id with multiple definitions in each project ordered by most defined
34 new - Create a new xml linter - call this method statically as in Data::Edit::Xml::Lint::new() and then fill in the relevant Attributes.
35 nolint - Store just the attributes in a file so that they can be retrieved later to process non xml objects referenced in the xml - like images
36 numberOfFiles - Number of files encountered
37 numberOfProjects - Number of projects defined - each project can contain zero or more files
38 p4 - Format a fraction as a percentage to 4 decimal places
39 passingProjects - [Projects with no xmllint errors]
40 passRatePercent - Total number of passes as a percentage of all input files
41 preferredSource - Preferred representation of the xml source, used by relint to supply a preferred representation for the source.
42 print - A printable report of the above
43 processes - Maximum number of xmllint processes to run in parallel - 8 by default if linting in parallel is being used.
44 project - Optional project name to allow error counts to be aggregated by project and to allow id and labels to be scoped to the files contained in each project.
45 read - Reread a linted xml file and extract the attributes associated with the lint
46 relint - Locate all the labels or ids in the specified files, analyze the map of labels and ids with analysisSub, parse each file, process each parse with processSub, then lint the reprocessed xml back to the original file. This allows you to reprocess the contents of each file with knowledge of where labels or ids are located in the other files associated with a project.
47 reload - Reload a parse tree from a linted file restoring any labels and return the parse tree or undef if the file is not a lint file.
48 report - Analyze the results of prior lints and return a hash reporting various statistics and a printable report
49 resolveDitaLink - Given a specified $link, return the unique (file, leading id, topic) of the specified link in the link map or () if no such definition exists
50 resolveFileToGuid - Return the unique definition of the specified link in the link map or undef if no such definition exists
51 resolveUniqueLink - Return the unique (file, leading id) of the specified link in the link map or () if no such definition exists
52 reusedInProject - List of projects in which this file is reused, which can be set via reuseFileInProject every time you discover another project in which a file is reused.
53 reuseFileInProject - Record the reuse of the specified file in the specified project
54 reuseInProject - Record the reuse of an item in the named project
55 singleLabelDefs - Return ([project; label or id]*) of all labels or ids that have a single definition
56 singleLabelDefsReport - Return a report showing labels or ids with just one definition, ordered by project, label name
57 source - The source Xml to be written to file and linted.
58 squeezeDitaRef - Squeeze a string so it can be safely stored inside a blank-separated list inside an xml comment.
59 timestamp - Timestamp of report
60 title - Optional title of the xml - only needed if you want to generate an SDL file map.
61 totalCompressedErrors - Number of compressed errors
62 totalCompressedErrorsFileByFile - Total number of errors summed file by file
63 totalErrors - Total number of errors
64 urlEncode - Return a url encoded string
This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:
sudo cpan install Data::Edit::Xml::Lint
philiprbrenan@gmail.com
http://www.appaapps.com
Copyright (c) 2016-2019 Philip R Brenan.
This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.