Name

Data::Edit::Xml::Reuse - Reuse Xml via the Dita conref facility.

Synopsis

Reusing Identical Content

Data::Edit::Xml::Reuse scans an entire document corpus looking for opportunities to reuse identical Xml via the Dita conref facility. Duplicated identical content is moved to a separate Xml file called the dictionary. Duplicated content in the corpus is replaced with references to the singular content in the dictionary. Larger blocks of identical content are favored over smaller blocks of content where possible.

Data::Edit::Xml::Reuse provides parameters that qualify the minimum size of a block of content and the minimum number of references to a block of content to be moved to the dictionary.
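
The effect of these two thresholds can be illustrated with a minimal sketch in core Perl. It is not the module's implementation: the helper name reusableBlocks and its argument layout are hypothetical, and it assumes that identical content is recognized by comparing MD5 digests of the text under each tag.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper, not part of the module: given [tag, content] pairs,
# return [tag, count, content] for each block that passes both thresholds.
sub reusableBlocks
 {my ($blocks, $minimumLength, $minimumReferences) = @_;
  my %seen;                                             # {tag}{md5} => [count, content]
  for my $block(@$blocks)
   {my ($tag, $content) = @$block;
    next if length($content) < $minimumLength;          # Too short to be worth reusing
    my $md5   = md5_hex(encode_utf8($content));         # md5_hex needs bytes, not characters
    my $entry = $seen{$tag}{$md5} //= [0, $content];
    $entry->[0]++;                                      # Count one more reference
   }
  my @reusable;
  for my $tag(sort keys %seen)
   {for my $md5(sort keys %{$seen{$tag}})
     {my ($count, $content) = @{$seen{$tag}{$md5}};
      push @reusable, [$tag, $count, $content] if $count >= $minimumReferences;
     }
   }
  \@reusable
 }

my $reusable = reusableBlocks
 ([[p => q(Made in Sweden)],
   [p => q(For further information, please visit our web site.)],
   [p => q(For further information, please visit our web site.)],
   [p => q(short)],
   [p => q(short)]],
  32, 2);                                               # minimumLength, minimumReferences

printf "%s x%d: %s\n", @$_ for @$reusable;
```

In this sketch only the repeated sentence qualifies: the other paragraphs are either shorter than 32 characters or occur only once.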

The following example checks a corpus of Dita Xml documents held in the folder inputFolder. A copy of the corpus, with a conref replacing each block of identical content under the table and p tags, is placed in outputFolder, provided that such content is at least 32 characters long and is referenced at least 4 times:

  use Data::Edit::Xml::Reuse;

  my $x = Data::Edit::Xml::Reuse::reuse
   (inputFolder       => q(in),
    outputFolder      => q(out),
    reportsFolder     => q(reports),
    minimumLength     => 32,
    minimumReferences => 4,
    tags              => {map {$_=>1} qw(table p)},
   );

The actual number of times each block of content was reused can be found in the report:

 lists/reused_content_by_tag.txt

in the reportsFolder.

Matching Similar Content

Optionally, Data::Edit::Xml::Reuse will also report similar content using the:

  matchSimilarTagContent => 0.9,

keyword. Content under the specified tags that matches with at least the specified level of confidence, a number between 0 and 1, is assigned a guid as its id attribute and written to the report:

  similar/tag_blocks_by_vocabulary.txt

in the reportsFolder.

The tags containing similar content will have this guid listed on their xtrf attribute making it easy to locate related content using grep.

The report, combined with the id and xtrf attributes, helps identify similar text, in situ, perhaps to be standardized further and eventually reused.
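
As an alternative to grep, the same lookup can be sketched in core Perl. The helper name linesMentioningGuid is hypothetical and the sketch assumes the id and xtrf attributes are written with double quotes on a single line, as in the example output later in this document.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper, not part of the module: return [file, line number, line]
# for each line in the named files that carries the guid on an id or xtrf attribute.
sub linesMentioningGuid
 {my ($guid, @files) = @_;
  my @found;
  for my $file(@files)
   {open(my $in, '<', $file) or die "Cannot read $file: $!";
    while(my $line = <$in>)
     {chomp $line;
      push @found, [$file, $., $line]
        if $line =~ m/\b(?:id|xtrf)="[^"]*\Q$guid\E[^"]*"/;
     }
    close $in;
   }
  @found
 }

# Demonstration on two small files modelled on the example output
my $eee = q(GUID-3c8810e0-d8aa-0484-84b8-a57230b756de);
my $fff = q(GUID-7472a890-4587-8393-9c34-0aa3859d2e21);

my %source =
 ('1.xml' => qq(<p id="$eee" xtrf="$fff">aaa bbb ccc ddd eee</p>\n),
  '2.xml' => qq(<p id="$fff" xtrf="$eee">aaa bbb ccc ddd fff</p>\n),
  '3.xml' => qq(<p>Made in Sweden</p>\n),
 );

my $folder = tempdir(CLEANUP => 1);
my @xmlFiles;
for my $name(sort keys %source)
 {my $file = "$folder/$name";
  open(my $out, '>', $file) or die "Cannot write $file: $!";
  print {$out} $source{$name};
  close $out;
  push @xmlFiles, $file;
 }

print join(':', @$_), "\n" for linesMentioningGuid($fff, @xmlFiles);
```

Each guid appears once as an id on the block it names and once on the xtrf of the block it resembles, so a search for one guid returns both members of the pair.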

Description

Reuse Xml via the Dita conref facility.

Version 20191221.

The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.

Reuse Xml

Reuse Xml via Dita conrefs.

reuse(%)

Check Xml for reuse opportunities.

     Parameter    Description
  1  %attributes  Reuse attributes

Example:

  if (1) {
    owf($in[0], <<END);                                                           # Base file
    <concept id="c">
      <title>Ordering information</title>
      <conbody>
        <p>For further information, please visit our web site or contact your local sales company.</p>
        <p>Made in Sweden</p>
        <table>
          <tbody>
            <row>
              <entry><p>Ice Associates</p></entry>
              <entry><p>North Pole 1</p></entry>
            </row>
          </tbody>
        </table>
        <p>aaa bbb ccc ddd eee</p>
      </conbody>
    </concept>
  END

    owf($in[1], <<END);                                                           # Similar file
    <concept id="c">
      <title>Ordering information</title>
      <conbody>
        <p>For further information, please visit our web site or contact your local sales company.</p>
        <p>Copyright © 2018 - 2019. All rights reserved.</p>
        <p>Made in Norway</p>
        <table>
          <tbody>
            <row>
              <entry><p>Ice Associates</p></entry>
              <entry><p>North Pole 1</p></entry>
            </row>
          </tbody>
        </table>
        <p>aaa bbb ccc ddd fff</p>
      </conbody>
    </concept>
  END

    my $dictionary  = fpf($outputFolder, qw(dictionary xml));                     # Dictionary file name

    my $r = Data::Edit::Xml::Reuse::reuse
     (dictionary             => $dictionary,                                      # Reuse request
      inputFolder            => $inputFolder,
      matchSimilarTagContent => 0.5,
      outputFolder           => $outputFolder,
      reportsFolder          => $reportsFolder,
      tags                   => {map {$_=>1} qw(p table)},
     );

    ok readFile($dictionary) eq <<END;                                            # Resulting dictionary
  <concept id="dictionary">
    <title>Dictionary</title>
    <conbody>
      <p id="GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51">For further information, please visit our web site or contact your local sales company.</p>
      <table id="GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4">
        <tbody>
          <row>
            <entry>
              <p>Ice Associates</p>
            </entry>
            <entry>
              <p>North Pole 1</p>
            </entry>
          </row>
        </tbody>
      </table>
    </conbody>
  </concept>
  END

    if (my $h =                                                                   # Similar XML report
      readFile(fpe($reportsFolder, qw(similar tag_blocks_by_vocabulary txt))))
     {ok index($h, <<END) > 0;
     Similar  Tag_Content          Md5Sum
  1        2  aaa bbb ccc ddd eee  GUID-3c8810e0-d8aa-0484-84b8-a57230b756de
  2           aaa bbb ccc ddd fff  GUID-7472a890-4587-8393-9c34-0aa3859d2e21
  END
     }

    ok readFile(fpe($testFolder, qw(out 1 xml))) eq <<END;                        # Deduplicated XML file - Sweden
  <concept id="c">
    <title>Ordering information</title>
    <conbody>
      <p conref="dictionary/xml#dictionary/GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51"/>
  <!-- For further information, please visit our web site or contact your local sales company. -->
      <p>Made in Sweden</p>
      <table conref="dictionary/xml#dictionary/GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4"/>
  <!-- <tbody><row><entry><p>Ice Associates</p></entry><entry><p>North Pole 1</p></entry></row></tbody> -->
      <p id="GUID-3c8810e0-d8aa-0484-84b8-a57230b756de" xtrf="GUID-7472a890-4587-8393-9c34-0aa3859d2e21">aaa bbb ccc ddd eee</p>
    </conbody>
  </concept>
  END

    ok readFile(fpe($testFolder, qw(out 2 xml))) eq <<END;                        # Deduplicated XML file - Norway
  <concept id="c">
    <title>Ordering information</title>
    <conbody>
      <p conref="dictionary/xml#dictionary/GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51"/>
  <!-- For further information, please visit our web site or contact your local sales company. -->
      <p>Copyright © 2018 - 2019. All rights reserved.</p>
      <p>Made in Norway</p>
      <table conref="dictionary/xml#dictionary/GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4"/>
  <!-- <tbody><row><entry><p>Ice Associates</p></entry><entry><p>North Pole 1</p></entry></row></tbody> -->
      <p id="GUID-7472a890-4587-8393-9c34-0aa3859d2e21" xtrf="GUID-3c8810e0-d8aa-0484-84b8-a57230b756de">aaa bbb ccc ddd fff</p>
    </conbody>
  </concept>
  END

   }
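
The ids in the dictionary above have the shape of a 32 digit hexadecimal value split 8-4-4-4-12, and the similarity report lists them under the heading Md5Sum. A minimal sketch of how an id of this shape could be derived, assuming (this is not confirmed here) that it is an MD5 digest of the content. The helper name guidFromContent is hypothetical, and because the exact text the module digests is unknown, the sketch is not expected to reproduce the ids shown above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper, not part of the module: format the MD5 digest of some
# content as GUID-8-4-4-4-12, the shape of the ids in the example dictionary.
sub guidFromContent
 {my ($content) = @_;
  my $md5 = md5_hex(encode_utf8($content));             # 32 lowercase hexadecimal digits
  'GUID-'.join '-', unpack 'A8 A4 A4 A4 A12', $md5;     # Split the digits 8-4-4-4-12
 }

my $guid = guidFromContent(q(aaa bbb ccc ddd eee));
print "$guid\n";
```

An id derived from the content itself has the useful property that identical content always receives the same id, whichever file it was found in.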

Data::Edit::Xml::Reuse Definition

Attributes used by the reuser.

Input fields

dictionary - The dictionary file into which to store the duplicate Xml.

fileExtensions - The extensions of the Xml files to examine in the inputFolder.

getFileUrl - An optional url to retrieve a specified file from the server running xref used in generating html reports. The complete url is obtained by appending the fully qualified file name to this value.

htmlFolder - Folder into which to write reports as html.

inputFolder - A folder containing the Xml files with extensions named in fileExtensions to be analyzed for reuse.

matchSimilarTagContent - Confidence level between 0 and 1: match content under tags with this level of confidence.

maximumColumnWidth - Truncate columns in text reports to this length or allow any width if undef.

minimumLength - The minimum length content must have to be considered for matching.

minimumReferences - The minimum number of references content must have before it can be reused.

outputFolder - A folder into which to write the deduplicated Xml.

reportsFolder - A folder into which reports will be written.

tags - {tag=>1}: only tags that appear as keys in this hash with true values are considered.

Output fields

inputFiles - The files selected from inputFolder for analysis because their extensions matched fileExtensions.

matchBlocks - [[md5, content]*] blocks of content that match with the confidence level expressed by matchSimilarTagContent.

matchInBlock - {md5 => matchBlocks} : index into matchBlocks by md5 sum.

reusableContent - {tag}{md5sum}{content}++ potentially reusable content.

timeEnded - Time the run ended.

timeStart - Time the run started.

Private Methods

newReuse(%)

Create a new reuser.

     Parameter    Description
  1  %attributes  Attributes

formatTables($$%)

Format reports.

     Parameter  Description
  1  $reuse     Reuser
  2  $data      Table to be formatted
  3  %options   Options

loadInputFiles($)

Load the names of the files to be processed.

     Parameter  Description
  1  $reuse     Reuser
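
Selecting files by extension, as described for inputFolder and fileExtensions, can be sketched with the core File::Find module. This is an illustration under assumptions rather than the module's own code: the helper name filesWithExtensions is hypothetical, and whether the module searches sub folders or compares extensions case-insensitively is not stated in this document.

```perl
use strict;
use warnings;
use File::Find qw(find);
use File::Temp qw(tempdir);

# Hypothetical helper, not part of the module: return, in sorted order, the
# files at or below a folder whose extension is one of those requested.
sub filesWithExtensions
 {my ($folder, @extensions) = @_;
  my %wanted = map {lc($_) => 1} @extensions;           # Extensions as a lookup set
  my @files;
  find({no_chdir => 1, wanted => sub                    # no_chdir: $_ holds the full path
         {return unless -f $_;                          # Plain files only
          my ($extension) = m/\.([^.\/]+)\z/;           # Text after the last dot
          push @files, $_ if defined($extension) and $wanted{lc $extension};
         }}, $folder);
  my @sorted = sort @files;
  @sorted
 }

# Demonstration on a small temporary corpus
my $corpus = tempdir(CLEANUP => 1);
mkdir "$corpus/sub" or die "Cannot create folder: $!";
for my $name(qw(a.xml b.dita c.txt sub/d.xml))
 {open(my $out, '>', "$corpus/$name") or die "Cannot write $name: $!";
  print {$out} "<concept/>\n";
  close $out;
 }

my @selected = filesWithExtensions($corpus, qw(xml dita));
print "$_\n" for @selected;
```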

ffc($$)

First few characters of a string with white space normalized.

     Parameter  Description
  1  $reuse     Reuser
  2  $string    String
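
A minimal sketch of what normalizing white space and keeping the first few characters might look like. The helper name firstFewCharacters and its explicit $length parameter are assumptions made for illustration: the signature above shows that ffc takes the reuser and a string, and this document does not say where the length comes from.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's ffc: trim the string, collapse each
# run of white space to one space, then keep at most $length characters.
sub firstFewCharacters
 {my ($string, $length) = @_;
  (my $s = $string) =~ s/\A\s+|\s+\z//g;                # Remove leading and trailing white space
  $s =~ s/\s+/ /g;                                      # Collapse internal white space
  substr($s, 0, $length)
 }

print firstFewCharacters("  aaa \n\t bbb   ccc ", 7), "\n";
```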

reuseParams($)

Tabulate reuse parameters.

     Parameter  Description
  1  $reuse     Reuser

analyzeOneFile($$)

Analyze one input file.

     Parameter  Description
  1  $reuse     Reuser
  2  $file      File to analyze

analyzeInputFiles($)

Analyze the input files.

     Parameter  Description
  1  $reuse     Reuser

conRefOneFile($$)

Conref one file.

     Parameter  Description
  1  $reuse     Reuser
  2  $file      File to analyze

conRef($)

Replace common text with conrefs.

     Parameter  Description
  1  $reuse     Reuser

reportSimilarContent($)

Report content likely to be similar on the basis of their vocabulary.

     Parameter  Description
  1  $reuse     Reuser
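
One simple way to compare blocks by vocabulary is to measure how many distinct words they share. The sketch below uses the Jaccard measure (shared words divided by all distinct words). That measure is a stand-in chosen for illustration: this document does not state how the module computes its confidence level, and the helper name vocabularySimilarity is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: Jaccard similarity of the
# sets of distinct words in two strings, a number between 0 and 1.
sub vocabularySimilarity
 {my ($first, $second) = @_;
  my %first  = map {lc($_) => 1} $first  =~ m/(\w+)/g;  # Distinct words in the first string
  my %second = map {lc($_) => 1} $second =~ m/(\w+)/g;  # Distinct words in the second string
  my $shared = grep {$second{$_}} keys %first;          # Words present in both
  my $total  = keys(%first) + keys(%second) - $shared;  # Words present in either
  $total ? $shared / $total : 0
 }

my $similarity = vocabularySimilarity
 (q(aaa bbb ccc ddd eee),
  q(aaa bbb ccc ddd fff));

printf "%.2f\n", $similarity;
```

The two paragraphs from the example share four of six distinct words, so under this stand-in measure they would count as similar at a threshold of 0.5 but not at 0.9.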

Index

1 analyzeInputFiles - Analyze the input files.

2 analyzeOneFile - Analyze one input file.

3 conRef - Replace common text with conrefs.

4 conRefOneFile - Conref one file.

5 ffc - First few characters of a string with white space normalized.

6 formatTables - Format reports.

7 loadInputFiles - Load the names of the files to be processed.

8 newReuse - Create a new reuser.

9 reportSimilarContent - Report content likely to be similar on the basis of their vocabulary.

10 reuse - Check Xml for reuse opportunities.

11 reuseParams - Tabulate reuse parameters.

Installation

This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:

  sudo cpan install Data::Edit::Xml::Reuse

Author

philiprbrenan@gmail.com

http://www.appaapps.com

Copyright

Copyright (c) 2016-2019 Philip R Brenan.

This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.