Name

Data::Edit::Xml::Lint - lint xml files in parallel using xmllint and report the failure rate

Synopsis

Linting and reporting

Create some sample xml files, some with errors, lint them in parallel and retrieve the number of errors and failing files:

  for my $n(1..$N)                                                              # Some projects
   {my $x = Data::Edit::Xml::Lint::new();                                       # New xml file linter

    my $catalog = $x->catalog = catalogName;                                    # Use catalog if possible
    my $project = $x->project = projectName($n);                                # Project name
    my $file    = $x->file    =    fileName($n);                                # Target file

    $x->source = <<END;                                                         # Sample source
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//HPE//DTD HPE DITA Concept//EN" "concept.dtd" []>
<concept id="$project">
 <title>Project $project</title>
 <conbody>
   <p>Body of $project</p>
 </conbody>
</concept>
END

    $x->source =~ s/id="\w+?"//gs if addError($n);                              # Introduce an error into some projects

    $x->lint(foo=>1);                                                           # Write the source to the target file, lint using xmllint, include some attributes to be included as comments at the end of the target file
   }

  Data::Edit::Xml::Lint::wait;                                                  # Wait for lints to complete

  say STDERR Data::Edit::Xml::Lint::report($outDir, "xml")->print;              # Report total pass fail rate

Produces:

 50 % success converting 3 projects containing 10 xml files on 2017-07-13 at 17:43:24

 ProjectStatistics
    #  Percent   Pass  Fail  Total  Project
    1  33.3333      1     2      3  aaa
    2  50.0000      2     2      4  bbb
    3  66.6667      2     1      3  ccc

 FailingFiles
    #  Errors  Project       File
    1       1  ccc           out/ccc5.xml
    2       1  aaa           out/aaa9.xml
    3       1  bbb           out/bbb1.xml
    4       1  bbb           out/bbb7.xml
    5       1  aaa           out/aaa3.xml
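The project statistics in a report like the one above can be derived from per-file error counts. The following is a minimal sketch in plain Perl, not the module's own code: the list of [project, errors] pairs is hypothetical data chosen to match the counts in the sample report, and the layout of the printed lines is only an approximation.

```perl
use strict;
use warnings;

# Hypothetical lint results: [project, number of errors] for each file,
# chosen to match the counts in the sample report.
my @files = (
  [aaa => 0], [aaa => 1], [aaa => 1],
  [bbb => 0], [bbb => 0], [bbb => 1], [bbb => 1],
  [ccc => 0], [ccc => 0], [ccc => 1],
);

my %stats;                                    # project => {pass, fail}
for my $f (@files) {
  my ($project, $errors) = @$f;
  $stats{$project}{pass} //= 0;
  $stats{$project}{fail} //= 0;
  $stats{$project}{$errors ? 'fail' : 'pass'}++;  # a file fails if it has any errors
}

my ($pass, $total) = (0, scalar @files);
for my $project (sort keys %stats) {
  my ($p, $f) = @{$stats{$project}}{qw(pass fail)};
  $pass += $p;
  printf "%-8s %8.4f %5d %5d %5d\n",
    $project, 100 * $p / ($p + $f), $p, $f, $p + $f;
}
printf "%d %% success over %d files\n", 100 * $pass / $total, $total;
```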

Rereading

Once a file has been linted, it can be reread with read to obtain details about the xml including any id attributes defined (see: idDefs below) and any labels that refer to these id attributes (see: labelDefs below). Such labels provide additional identities for a node beyond that provided by the id attribute.

  {catalog    => "/home/phil/hp/dtd/Dtd_2016_07_12/catalog-hpe.xml",
   definition => "bbb",
   docType    => "<!DOCTYPE concept PUBLIC \"-//HPE//DTD HPE DITA Concept//EN\" \"concept.dtd\" []>",
   errors     => 1,
   file       => "out/bbb1.xml",
   foo        => 1,
   header     => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
   idDefs     => { bbb => 1, c1 => 1 },
   labelDefs  => {
                   bbb => "bbb",
                   c1 => "c1",
                   conbody1 => "c1",
                   conbody2 => "c1",
                   concept1 => "bbb",
                   concept2 => "bbb",
                 },
   labels     => "bbb concept1 concept2",
   project    => "bbb",
   sha256     => "b00cdebf2e1837fa15140d25315e5558ed59eb735b5fad4bade23969babf9531",
   source     => "..."
  }

ReLinting

To fix references between files, a list of files can be relinted, which performs the following actions:

  1. Reads the specified files via read.

  2. Constructs an id map to locate ids from the labels defined in the specified files.

  3. Reparses each of the specified files to build a parse tree representing the xml in each file.

  4. Calls a user supplied sub, passing it the parse tree for each specified file and the id map. The sub should traverse the parse tree, fixing attributes which make references between the files using the supplied id map.

  5. Writes any modified parse trees back to the originating file, thus saving the changes.
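The id map built in step 2 and used in step 4 can be pictured with plain hashes. This is a sketch under stated assumptions rather than the module's implementation: the %files structure, the fixHref helper, the ccc labels and the file#id form of the rewritten reference are all hypothetical, and only the bbb entries are taken from the labelDefs shown under Rereading.

```perl
use strict;
use warnings;

# Hypothetical per-file details, shaped like the labelDefs returned by read.
my %files = (
  'out/bbb1.xml' => {labelDefs => {bbb => 'bbb', c1 => 'c1',
                                   conbody1 => 'c1', concept1 => 'bbb'}},
  'out/ccc5.xml' => {labelDefs => {ccc => 'ccc', concept9 => 'ccc'}},
);

# Step 2: map every label (or id) to the file and id that defines it.
my %idMap;
for my $file (sort keys %files) {
  my $defs = $files{$file}{labelDefs};
  $idMap{$_} = [$file, $defs->{$_}] for keys %$defs;
}

# Step 4: rewrite a reference to a label into file#id form (assumed format).
sub fixHref {
  my ($href) = @_;
  my $target = $idMap{$href} or return $href;  # unknown label: leave unchanged
  my ($file, $id) = @$target;
  return "$file#$id";
}

print fixHref('conbody1'), "\n";               # out/bbb1.xml#c1
```

A user supplied sub would apply something like fixHref to each referring attribute it meets while traversing the parse tree.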

Description

The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.

Constructor

Construct a new linter

new()

Create a new xml linter - call this method statically, as in Data::Edit::Xml::Lint::new()

Attributes

Attributes describing a lint

author :lvalue

Optional author of the xml - only needed if you want to generate an SDL file map

catalog :lvalue

Optional catalog file containing the locations of the DTDs used to validate the xml

compressedErrors :lvalue

Number of compressed errors

compressedErrorText :lvalue

Text of compressed errors

ditaType :lvalue

Optional Dita topic type (concept|task|troubleshooting|reference) of the xml - only needed if you want to generate an SDL file map

docType :lvalue

The second line: the document type extracted from the source

dtds :lvalue

Optional directory containing the DTDs used to validate the xml

errors :lvalue

Number of uncompressed lint errors detected by xmllint

errorText :lvalue

Text of uncompressed lint errors detected by xmllint

file :lvalue

File that the xml will be written to and read from by lint, read or relint

fileNumber :lvalue

File number - assigned early on by the caller to help debugging transformations

guid :lvalue

Guid for outermost tag - only required if you want to generate an SDL file map

header :lvalue

The first line: the xml header extracted from source

idDefs :lvalue

{id} = count - the number of times this id is defined in the xml contained in this file
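A count greater than one signals a duplicated id. The sketch below shows the idea on a small hypothetical document using a regular expression; this is a simplification for illustration only, and a real implementation would more likely count ids while walking a parse tree.

```perl
use strict;
use warnings;

# Hypothetical source in which the id "c1" is defined twice.
my $xml = <<'END';
<concept id="bbb">
 <title>Project bbb</title>
 <conbody id="c1">
   <p id="c1">Body of bbb</p>
 </conbody>
</concept>
END

# Count how many times each id is defined: {id} = count.
my %idDefs;
$idDefs{$1}++ while $xml =~ m/\bid="([^"]+)"/g;

# Any id with a count above one is a duplicate.
my @duplicates = sort grep {$idDefs{$_} > 1} keys %idDefs;
print "Duplicate ids: @duplicates\n";          # Duplicate ids: c1
```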

labelDefs :lvalue

{label or id} = id - the id of the node containing a label defined on the xml

labels :lvalue

Optional parse tree to supply labels for the current source as the labels are present in the parse tree not in the string representing the parse tree

linted :lvalue

Date the lint was performed by lint

preferredSource :lvalue

Preferred representation of the xml source, used by relint to supply a preferred representation for the source

processes :lvalue

Maximum number of xmllint processes to run in parallel - 8 by default
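One common way to cap the number of concurrent child processes in Perl is to fork a child per job and wait for one to finish whenever the limit is reached. The sketch below illustrates that general technique only and is not taken from the module: the job names are hypothetical and each child exits immediately where a real one would run xmllint.

```perl
use strict;
use warnings;

my $maxProcesses = 8;                          # same value as the documented default
my @jobs         = map {"file$_.xml"} 1..20;   # hypothetical files to lint
my ($running, $completed) = (0, 0);

for my $job (@jobs) {
  if ($running >= $maxProcesses) {             # at the limit: wait for a child to finish
    wait;
    $running--;
    $completed++;
  }
  my $pid = fork;
  die "Cannot fork: $!" unless defined $pid;
  if ($pid == 0) {                             # child: a real child would lint $job here
    exit 0;
  }
  $running++;                                  # parent: one more child in flight
}

while ($running > 0) {                         # wait for the remaining children
  wait;
  $running--;
  $completed++;
}

print "Completed $completed jobs\n";
```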

project :lvalue

Optional project name to allow error counts to be aggregated by project and to allow id and labels to be scoped to the files contained in each project

reusedInProject :lvalue

List of projects in which this file is reused

sha256 :lvalue

Sha256 hash of the string containing the xml processed by lint or read
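A hash like this makes it cheap to tell whether the xml has changed since it was last processed. The following sketch uses the core Digest::SHA and Encode modules on a hypothetical source string; how the module itself prepares the string before hashing is not stated here, so encoding to UTF-8 first is an assumption.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
use Encode      qw(encode_utf8);

# Hypothetical source string.
my $source = qq(<concept id="bbb"><title>Project bbb</title></concept>\n);

# sha256_hex works on bytes, so encode the string first.
my $before = sha256_hex(encode_utf8($source));

# Any edit to the source produces a different digest.
(my $edited = $source) =~ s/bbb/ccc/g;
my $after = sha256_hex(encode_utf8($edited));

print "Source changed\n" if $before ne $after;
```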

source :lvalue

The source Xml to be linted

title :lvalue

Optional title of the xml - only needed if you want to generate an SDL file map

Lint

Lint xml files in parallel

lint($@)

Store some xml in a file, apply xmllint in parallel and update the source file with the results

     Parameter    Description                                
  1  $lint        Linter                                     
  2  %attributes  Attributes to be recorded as xml comments  
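Recording attributes as xml comments lets them travel with the file and be recovered later. The sketch below shows one way such a round trip could work; the comment format used here is an assumption made for illustration and may differ from what the module actually writes.

```perl
use strict;
use warnings;

my %attributes = (foo => 1, project => 'bbb');  # attributes to record
my $xml        = qq(<concept id="bbb"/>\n);     # hypothetical source

# Append one comment per attribute (format assumed for illustration).
my $written = $xml . join '',
  map {"<!--$_: $attributes{$_} -->\n"} sort keys %attributes;

# Recover the attributes from the comments.
my %read;
$read{$1} = $2 while $written =~ m/<!--(\w+): (.*?) -->/g;

print "$_ = $read{$_}\n" for sort keys %read;
```

Values containing "--" cannot appear inside an xml comment, so a scheme like this would need to reject or escape them.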

lintNOP($@)

Store some xml in a file, apply xmllint in a single process rather than in parallel and update the source file with the results

     Parameter    Description                                
  1  $lint        Linter                                     
  2  %attributes  Attributes to be recorded as xml comments  

Report

Methods for reporting the results of linting several files

report($$)

Analyse the results of prior lints and return a hash reporting various statistics and a printable report

     Parameter         Description                                  
  1  $outputDirectory  Directory to search                          
  2  $filter           Optional regular expression to filter files  

Attributes

compressedErrors :lvalue

Compressed errors over all files

failingFiles :lvalue

Array of [number of errors, project, files] ordered from least to most errors

failingProjects :lvalue

[Projects with xmllint errors]

filter :lvalue

File selection filter

numberOfFiles :lvalue

Number of files encountered

numberOfProjects :lvalue

Number of projects defined - each project can contain zero or more files

passingProjects :lvalue

[Projects with no xmllint errors]

passRatePercent :lvalue

Total number of passes as a percentage of all input files

print :lvalue

A printable report of the above

timestamp :lvalue

Timestamp of report

totalCompressedErrorsFileByFile :lvalue

Total number of errors summed file by file

totalCompressedErrors :lvalue

Number of compressed errors

totalErrors :lvalue

Total number of errors

Private Methods

lintOP($$@)

Store some xml in a file, apply xmllint either in parallel or in a single process and update the source file with the results

     Parameter    Description                                
  1  $inParallel  In parallel or not                         
  2  $lint        Linter                                     
  3  %attributes  Attributes to be recorded as xml comments  

p4($$)

Format a fraction as a percentage to 4 decimal places

     Parameter  Description  
  1  $p         Pass         
  2  $f         Fail         
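The percentages in the sample report are consistent with passes divided by passes plus fails. A minimal sketch of that calculation follows; percentPass is a hypothetical name, and both the formula and the handling of a zero total are assumptions rather than the module's confirmed behaviour.

```perl
use strict;
use warnings;

# Format passes as a percentage of all files to 4 decimal places.
sub percentPass {
  my ($pass, $fail) = @_;
  my $total = $pass + $fail;
  return sprintf "%.4f", $total ? 100 * $pass / $total : 0;  # avoid dividing by zero
}

print percentPass(1, 2), "\n";                 # 33.3333
print percentPass(2, 2), "\n";                 # 50.0000
print percentPass(2, 1), "\n";                 # 66.6667
```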

Index

1 author

2 catalog

3 compressedErrors

4 compressedErrorText

5 ditaType

6 docType

7 dtds

8 errors

9 errorText

10 failingFiles

11 failingProjects

12 file

13 fileNumber

14 filter

15 guid

16 header

17 idDefs

18 labelDefs

19 labels

20 lint

21 linted

22 lintNOP

23 lintOP

24 new

25 numberOfFiles

26 numberOfProjects

27 p4

28 passingProjects

29 passRatePercent

30 preferredSource

31 print

32 processes

33 project

34 report

35 reusedInProject

36 sha256

37 source

38 timestamp

39 title

40 totalCompressedErrors

41 totalCompressedErrorsFileByFile

42 totalErrors

Installation

This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:

  sudo cpan install Data::Edit::Xml::Lint

Author

philiprbrenan@gmail.com

http://www.appaapps.com

Copyright

Copyright (c) 2016-2018 Philip R Brenan.

This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.