Data::Edit::Xml::Reuse - Reuse Xml via the Dita conref facility.
Data::Edit::Xml::Reuse scans an entire document corpus looking for opportunities to reuse identical Xml via the Dita conref facility. Duplicated identical content is moved to a separate Xml file called the dictionary. Duplicated content in the corpus is replaced with references to the singular content in the dictionary. Larger blocks of identical content are favored over smaller blocks of content where possible.
Data::Edit::Xml::Reuse provides parameters that qualify the minimum size of a block of content and the minimum number of references to a block of content to be moved to the dictionary.
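The selection rule described above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not the module's implementation: the helpers guidFromContent and reusableBlocks are hypothetical names, length is measured in characters, and the idea that an id is an MD5 sum laid out in GUID style is an assumption based on the shape of the ids shown later in this document.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: format the MD5 sum of some content in GUID style (8-4-4-4-12)
sub guidFromContent
 {my ($content) = @_;
  my $m = md5_hex(encode_utf8($content));                 # 32 hexadecimal characters
  'GUID-'.join '-', unpack 'A8 A4 A4 A4 A12', $m;
 }

# Hypothetical helper: choose the blocks that qualify for the dictionary.
# $blocks is an array of strings, one entry per occurrence of a block in the corpus.
sub reusableBlocks
 {my ($blocks, $minimumLength, $minimumReferences) = @_;
  my %count;
  $count{$_}++ for grep {length($_) >= $minimumLength} @$blocks;   # Long enough
  my %r;
  for my $content(keys %count)
   {next if $count{$content} < $minimumReferences;                 # Referenced often enough
    $r{guidFromContent($content)} = $content;
   }
  \%r
 }

my @blocks =
 ((q(<p>For further information, please contact us.</p>))         x 4,   # Long and frequent
  (q(<p>OK</p>))                                                  x 9,   # Frequent but too short
  (q(<p>This long paragraph occurs only once in the corpus.</p>)) x 1);  # Long but unique

my $r = reusableBlocks(\@blocks, 32, 4);
print "$_ => $$r{$_}\n" for sort keys %$r;
```

With minimumLength set to 32 and minimumReferences set to 4, only the first block qualifies: the second is too short and the third occurs only once.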
The following example checks a corpus of Dita Xml documents held in folder inputFolder. A copy of the corpus, with a conref replacing each block of identical content under the table and p tags, is placed in the outputFolder, provided that such content is at least 32 characters long and has at least 4 references to it:
use Data::Edit::Xml::Reuse;

my $x = Data::Edit::Xml::Reuse::reuse
 (inputFolder       => q(in),
  outputFolder      => q(out),
  reportsFolder     => q(reports),
  minimumLength     => 32,
  minimumReferences => 4,
  tags              => {map {$_=>1} qw(table p)},
 );
The actual number of times each block of content was reused can be found in report:
lists/reused_content_by_tag.txt
in the reportsFolder.
Optionally, Data::Edit::Xml::Reuse will also report similar content using the:
matchSimilarTagContent => 0.9,
keyword. Content under the specified tags that matches to the specified level of confidence between 0 and 1 is assigned a guid id attribute and written to report:
similar/tag_blocks_by_vocabulary.txt
The tags containing similar content will have this guid listed on their xtrf attribute making it easy to locate related content using grep.
The report, combined with the id and xtrf attributes, helps identify similar text, in situ, perhaps to be standardized further and eventually reused.
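As an alternative to grep, the same lookup can be sketched in Perl. This is a minimal illustration: findGuid is a hypothetical helper, not part of the module, and it assumes the deduplicated files end in .xml or .dita and that each tag of interest sits on a single line.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return "file:line" for every line that mentions $guid
# in an id or xtrf attribute, searching all Xml files below $folder.
sub findGuid
 {my ($folder, $guid) = @_;
  my @found;
  find({no_chdir => 1, wanted => sub
   {return unless -f $_ and /\.(?:xml|dita)\z/i;          # Assumed file extensions
    open(my $f, '<', $_) or die "Cannot read $_: $!";
    while(my $line = <$f>)
     {push @found, "$_:$."
        if $line =~ /\b(?:id|xtrf)="[^"]*\Q$guid\E[^"]*"/;
     }
    close $f;
   }}, $folder);
  @found = sort @found;
  @found
 }

my $folder = q(out);                                      # Assumed output folder
if (-d $folder)
 {print "$_\n" for findGuid($folder, q(GUID-7472a890-4587-8393-9c34-0aa3859d2e21));
 }
```

Each line printed names a file and line number where the guid appears, either as the id of the block itself or on the xtrf attribute of a block with similar content.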
Version 20191221.
The following sections describe the methods in each functional area of this module. For an alphabetic listing of all methods by name see Index.
Reuse Xml via Dita conrefs.
Check Xml for reuse opportunities.
   Parameter    Description
1  %attributes  Reuse attributes
Example:
if (1)
 {owf($in[0], <<END);                                                           # Base file
<concept id="c">
  <title>Ordering information</title>
  <conbody>
    <p>For further information, please visit our web site or contact your local sales company.</p>
    <p>Made in Sweden</p>
    <table>
      <tbody>
        <row>
          <entry><p>Ice Associates</p></entry>
          <entry><p>North Pole 1</p></entry>
        </row>
      </tbody>
    </table>
    <p>aaa bbb ccc ddd eee</p>
  </conbody>
</concept>
END

  owf($in[1], <<END);                                                           # Similar file
<concept id="c">
  <title>Ordering information</title>
  <conbody>
    <p>For further information, please visit our web site or contact your local sales company.</p>
    <p>Copyright © 2018 - 2019. All rights reserved.</p>
    <p>Made in Norway</p>
    <table>
      <tbody>
        <row>
          <entry><p>Ice Associates</p></entry>
          <entry><p>North Pole 1</p></entry>
        </row>
      </tbody>
    </table>
    <p>aaa bbb ccc ddd fff</p>
  </conbody>
</concept>
END

  my $dictionary = fpf($outputFolder, qw(dictionary xml));                      # Dictionary file name

  my $r = Data::Edit::Xml::Reuse::reuse
   (dictionary             => $dictionary,                                      # Reuse request
    inputFolder            => $inputFolder,
    matchSimilarTagContent => 0.5,
    outputFolder           => $outputFolder,
    reportsFolder          => $reportsFolder,
    tags                   => {map {$_=>1} qw(p table)},
   );

  ok readFile($dictionary) eq <<END;                                            # Resulting dictionary
<concept id="dictionary">
  <title>Dictionary</title>
  <conbody>
    <p id="GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51">For further information, please visit our web site or contact your local sales company.</p>
    <table id="GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4">
      <tbody>
        <row>
          <entry>
            <p>Ice Associates</p>
          </entry>
          <entry>
            <p>North Pole 1</p>
          </entry>
        </row>
      </tbody>
    </table>
  </conbody>
</concept>
END

  if (my $h =                                                                   # Similar XML report
    readFile(fpe($reportsFolder, qw(similar tag_blocks_by_vocabulary txt))))
   {ok index($h, <<END) > 0;
   Similar  Tag_Content          Md5Sum
1        2  aaa bbb ccc ddd eee  GUID-3c8810e0-d8aa-0484-84b8-a57230b756de
2           aaa bbb ccc ddd fff  GUID-7472a890-4587-8393-9c34-0aa3859d2e21
END
   }

  ok readFile(fpe($testFolder, qw(out 1 xml))) eq <<END;                        # Deduplicated XML file - Sweden
<concept id="c">
  <title>Ordering information</title>
  <conbody>
    <p conref="dictionary/xml#dictionary/GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51"/>
    <!-- For further information, please visit our web site or contact your local sales company. -->
    <p>Made in Sweden</p>
    <table conref="dictionary/xml#dictionary/GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4"/>
    <!-- <tbody><row><entry><p>Ice Associates</p></entry><entry><p>North Pole 1</p></entry></row></tbody> -->
    <p id="GUID-3c8810e0-d8aa-0484-84b8-a57230b756de" xtrf="GUID-7472a890-4587-8393-9c34-0aa3859d2e21">aaa bbb ccc ddd eee</p>
  </conbody>
</concept>
END

  ok readFile(fpe($testFolder, qw(out 2 xml))) eq <<END;                        # Deduplicated XML file - Norway
<concept id="c">
  <title>Ordering information</title>
  <conbody>
    <p conref="dictionary/xml#dictionary/GUID-63271233-3e86-9ac1-5fe8-086bc8b37b51"/>
    <!-- For further information, please visit our web site or contact your local sales company. -->
    <p>Copyright © 2018 - 2019. All rights reserved.</p>
    <p>Made in Norway</p>
    <table conref="dictionary/xml#dictionary/GUID-e7a012a5-8ff5-6f12-5b96-0137c5c0a0b4"/>
    <!-- <tbody><row><entry><p>Ice Associates</p></entry><entry><p>North Pole 1</p></entry></row></tbody> -->
    <p id="GUID-7472a890-4587-8393-9c34-0aa3859d2e21" xtrf="GUID-3c8810e0-d8aa-0484-84b8-a57230b756de">aaa bbb ccc ddd fff</p>
  </conbody>
</concept>
END
 }
Attributes used by the reuser.
dictionary - The dictionary file into which to store the duplicate Xml.
fileExtensions - The extensions of the Xml files to examine in the inputFolder.
getFileUrl - An optional url to retrieve a specified file from the server running xref used in generating html reports. The complete url is obtained by appending the fully qualified file name to this value.
htmlFolder - Folder into which to write reports as html.
inputFolder - A folder containing the Xml files with extensions named in fileExtensions to be analyzed for reuse.
matchSimilarTagContent - Confidence level between 0 and 1: match content under tags with this level of confidence.
maximumColumnWidth - Truncate columns in text reports to this length or allow any width if undef.
minimumLength - The minimum length content must have to be considered for matching.
minimumReferences - The minimum number of references content must have before it can be reused.
outputFolder - A folder into which to write the deduplicated Xml.
reportsFolder - A folder into which reports will be written.
tags - {tag=>1} only consider tags that appear as keys in this hash with true values.
inputFiles - The files selected from inputFolder for analysis because their extensions matched fileExtensions.
matchBlocks - [[md5, content]*] blocks of content that match with the confidence level expressed by matchSimilarTagContent.
matchInBlock - {md5 => matchBlocks} : index into matchBlocks by md5 sum.
reusableContent - {tag}{md5sum}{content}++ potentially reusable content.
timeEnded - Time the run ended.
timeStart - Time the run started.
Create a new reuser.
   Parameter    Description
1  %attributes  Attributes
Format reports.
   Parameter  Description
1  $reuse     Reuser
2  $data      Table to be formatted
3  %options   Options
Load the names of the files to be processed.
   Parameter  Description
1  $reuse     Reuser
First few characters of a string with white space normalized.
   Parameter  Description
1  $reuse     Reuser
2  $string    String
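The idea of showing the first few characters of a string with white space normalized can be sketched as below. This is not the module's ffc method: firstFewChars is a hypothetical stand-alone function, it takes an explicit length instead of a reuser object, and the default length of 32 is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-alone sketch: collapse runs of white space to single
# spaces, trim both ends, then keep at most $length characters.
sub firstFewChars
 {my ($string, $length) = @_;
  $length //= 32;                                         # Assumed default
  my $s = $string =~ s/\s+/ /gr;                          # Normalize white space
  $s =~ s/\A | \z//g;                                     # Trim leading and trailing space
  substr($s, 0, $length)
 }

print firstFewChars("  aaa\n\t bbb   ccc ", 7), "\n";
```

Normalizing before truncating keeps report columns readable even when the source Xml is indented or spread over several lines.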
Tabulate reuse parameters.
   Parameter  Description
1  $reuse     Reuser
Analyze one input file.
   Parameter  Description
1  $reuse     Reuser
2  $file      File to analyze
Analyze the input files.
Conref one file.
Replace common text with conrefs.
Report content likely to be similar on the basis of their vocabulary.
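How the module scores vocabulary similarity is not described here. As an illustration only, one common way to compare two blocks by vocabulary is the Jaccard index over their sets of words, which yields a value between 0 and 1 that can be compared with a threshold such as matchSimilarTagContent. The function vocabularySimilarity below is a hypothetical name, and the choice of the Jaccard index is an assumption, not the module's documented method.

```perl
use strict;
use warnings;

# Hypothetical sketch: Jaccard index of the word sets of two strings,
# that is, the number of shared words divided by the number of distinct words.
sub vocabularySimilarity
 {my ($s, $t) = @_;
  my %s = map {(lc($_) => 1)} $s =~ /\w+/g;               # Vocabulary of first string
  my %t = map {(lc($_) => 1)} $t =~ /\w+/g;               # Vocabulary of second string
  my $common = grep {$t{$_}} keys %s;                     # Words in both
  my $total  = keys(%s) + keys(%t) - $common;             # Words in either
  $total ? $common / $total : 0
 }

my $similarity = vocabularySimilarity(q(aaa bbb ccc ddd eee), q(aaa bbb ccc ddd fff));
printf "%.2f\n", $similarity;
print "similar at 0.5\n" if $similarity >= 0.5;
```

Under this assumed measure, the two paragraphs from the example share four of six distinct words, so they would pass a threshold of 0.5 but not one of 0.9.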
1 analyzeInputFiles - Analyze the input files.
2 analyzeOneFile - Analyze one input file.
3 conRef - Replace common text with conrefs.
4 conRefOneFile - Conref one file.
5 ffc - First few characters of a string with white space normalized.
6 formatTables - Format reports.
7 loadInputFiles - Load the names of the files to be processed.
8 newReuse - Create a new reuser.
9 reportSimilarContent - Report content likely to be similar on the basis of their vocabulary.
10 reuse - Check Xml for reuse opportunities.
11 reuseParams - Tabulate reuse parameters.
This module is written in 100% Pure Perl and, thus, it is easy to read, comprehend, use, modify and install via cpan:
sudo cpan install Data::Edit::Xml::Reuse
philiprbrenan@gmail.com
http://www.appaapps.com
Copyright (c) 2016-2019 Philip R Brenan.
This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.