NAME

MsOffice::Word::Surgeon - tamper with the guts of Microsoft docx documents

SYNOPSIS

  my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename);

  # extract plain text
  my $text = $surgeon->plain_text;

  # anonymize
  my %alias = ('Claudio MONTEVERDI' => 'A_____', 'Heinrich SCHÜTZ' => 'B_____');
  my $pattern = join "|", keys %alias;
  my $replacement_callback = sub {
    my %args =  @_;
    my $replacement = $surgeon->change(to_delete  => $args{matched},
                                       to_insert  => $alias{$args{matched}},
                                       run        => $args{run},
                                       xml_before => $args{xml_before},
                                      );
    return $replacement;
  };
  $surgeon->replace(qr[$pattern], $replacement_callback);

  # save the result
  $surgeon->overwrite; # or ->save_as($new_filename);

DESCRIPTION

Purpose

This module supports a few operations for modifying or extracting text from Microsoft Word documents in '.docx' format -- hence the name 'surgeon'. Since a surgeon does not give life, there is no support for creating fresh documents; if you have such needs, use one of the other packages listed in the "SEE ALSO" section.

Some applications for this module are :

  • content extraction in plain text format;

  • unlinking fields (equivalent of performing Ctrl-Shift-F9 on the whole document);

  • regex replacements within text, for example for :

    • anonymization, i.e. replacement of names or addresses by aliases;

    • templating, i.e. replacement of special markup by contents coming from a data tree (see also MsOffice::Word::Template).

  • pretty-printing the internal XML structure.

Operating mode

The format of Microsoft .docx documents is described in http://www.ecma-international.org/publications/standards/Ecma-376.htm and http://officeopenxml.com/. An excellent introduction can be found at https://www.toptal.com/xml/an-informal-introduction-to-docx. Internally, a document is a zipped archive, where the member named word/document.xml stores the main document contents, in XML format.

The present module does not parse all details of the whole XML structure because it only focuses on text nodes (those that contain literal text) and run nodes (those that contain text formatting properties). All remaining XML information, for example for representing sections, paragraphs, tables, etc., is stored as opaque XML fragments; these fragments are re-inserted at proper places when reassembling the whole document after having modified some text nodes.
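
The archive layout described above can be explored with core Perl modules alone. The following is a minimal sketch, not the module's own code: it builds a tiny in-memory ZIP archive containing a member named word/document.xml and reads it back. The XML body is a made-up stand-in, and the use of IO::Compress::Zip and IO::Uncompress::Unzip is a choice made for this illustration.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip   $ZipError);
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical minimal document body; a real docx holds far more markup.
my $xml_in = '<w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>';

# Build an in-memory ZIP archive with a single member named word/document.xml.
my $archive;
zip \$xml_in => \$archive, Name => 'word/document.xml'
  or die "zip failed: $ZipError";

# Read that member back by name, as one could do with a real .docx file.
my $xml_out;
unzip \$archive => \$xml_out, Name => 'word/document.xml'
  or die "unzip failed: $UnzipError";

print "$xml_out\n";
```

With a real document, the first argument to unzip would be the path of the .docx file instead of a reference to an in-memory buffer.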

METHODS

Constructor

new

  my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename);
  # or simply : ->new($filename);

Builds a new surgeon instance, initialized with the contents of the given filename.

Contents restitution

contents

Returns a Perl string with the current internal XML representation of the document contents.

original_contents

Returns a Perl string with the XML representation of the document contents, as it was in the ZIP archive before any modification.

indented_contents

Returns an indented version of the XML contents, suitable for inspection in a text editor. This is produced by "toString" in XML::LibXML::Document and therefore is returned as an encoded byte string, not a Perl string.
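
Because the result is a byte string, it may need decoding before it is treated as text. The sketch below shows the difference on a stand-in value; it assumes the bytes are UTF-8 encoded, which is an assumption of this example rather than something stated here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the value returned by indented_contents: UTF-8 encoded bytes.
my $bytes = "<w:t>Sch\xC3\xBCtz</w:t>";

# Assumption: the XML is encoded in UTF-8.
my $string = decode('UTF-8', $bytes);

# The two-byte sequence for the u-umlaut becomes a single character.
printf "%d bytes, %d characters\n", length($bytes), length($string);
```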

plain_text

Returns the text contents of the document, without any markup. Paragraphs and breaks are converted to newlines; all other formatting instructions are ignored.

runs

Returns a list of MsOffice::Word::Surgeon::Run objects. Each of these objects holds an XML fragment; joining all fragments restores the complete document.

  my $contents = join "", map {$_->as_xml} $self->runs;

Modifying contents

cleanup_XML

  $surgeon->cleanup_XML;

Applies several other methods for removing unnecessary nodes within the internal XML. This method successively calls "reduce_all_noises", "unlink_fields", "suppress_bookmarks" and "merge_runs".

reduce_noise

  $surgeon->reduce_noise($regex1, $regex2, ...);

This method is used for removing unnecessary information in the XML markup. It applies the given list of regexes to the whole document, suppressing matches. The final result is put back into $self->contents. Regexes may be given either as qr/.../ references, or as names of builtin regexes (described below). Regexes are applied to the whole XML contents, not only to run nodes.

noise_reduction_regex

  my $regex = $surgeon->noise_reduction_regex($regex_name);

Returns the builtin regex corresponding to the given name. Known regexes are :

  proof_checking       => qr(<w:(?:proofErr[^>]+|noProof/)>),
  revision_ids         => qr(\sw:rsid\w+="[^"]+"),
  complex_script_bold  => qr(<w:bCs/>),
  page_breaks          => qr(<w:lastRenderedPageBreak/>),
  language             => qr(<w:lang w:val="[^/>]+/>),
  empty_run_props      => qr(<w:rPr></w:rPr>),
  soft_hyphens         => qr(<w:softHyphen/>),
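
The effect of these regexes can be tried on a plain string, without the module. This sketch copies two of the regexes from the list above and applies them to a hypothetical XML fragment; the fragment and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Two of the builtin regexes, copied from the list above.
my %noise = (
  revision_ids => qr(\sw:rsid\w+="[^"]+"),
  page_breaks  => qr(<w:lastRenderedPageBreak/>),
);

# Hypothetical XML fragment of the kind produced by MsWord.
my $xml = '<w:r w:rsidRPr="00A1B2C3"><w:lastRenderedPageBreak/><w:t>Hello</w:t></w:r>';

# Suppress every match of every regex.
$xml =~ s/$_//g for values %noise;

print "$xml\n";   # <w:r><w:t>Hello</w:t></w:r>
```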

reduce_all_noises

  $surgeon->reduce_all_noises;

Applies all regexes from the previous method.

unlink_fields

  my @names_of_ASK_fields = $self->unlink_fields;

Removes all fields from the document, just leaving the current value stored in each field. This is the equivalent of performing Ctrl-Shift-F9 on the whole document.

The return value is a list of names of ASK fields within the document. Such names should then be passed to the "suppress_bookmarks" method (see below).

suppress_bookmarks

  $surgeon->suppress_bookmarks(@names_to_erase);

Removes bookmarks markup in the document. This is useful because MsWord may silently insert bookmarks in unexpected places; therefore some searches within the text may fail because of such bookmarks.

By default, this method only removes the bookmark markup, leaving the contents of the bookmark intact. However, when the name of a bookmark belongs to the list @names_to_erase, its contents are also removed. Currently this is used for suppressing ASK fields, because such fields contain a bookmark content that is never displayed by MsWord.

merge_runs

  $surgeon->merge_runs(no_caps => 1); # optional arg

Walks through all runs of text within the document, trying to merge adjacent runs when possible (i.e. when both runs have the same properties, and there is no other XML node in between).

This operation is a prerequisite before performing replace operations, because documents edited in MsWord often have run boundaries across sentences or even in the middle of words; so regex searches can only be successful if those artificial boundaries have been removed.

If the argument no_caps => 1 is present, the merge operation will also convert runs with the w:caps property, putting all letters into uppercase and removing the property; this makes more merges possible.
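
The need for merging can be seen on a simplified string. In this sketch a word is split across two runs, so a regex search fails until the boundary is removed. The markup is hypothetical, and the one-line substitution only handles this trivial case of adjacent runs without properties; it is not the algorithm used by merge_runs, which compares run properties.

```perl
use strict;
use warnings;

# Hypothetical markup: one word split across two runs that have no properties.
my $xml = '<w:r><w:t>Monte</w:t></w:r><w:r><w:t>verdi</w:t></w:r>';

my $found_before = $xml =~ /Monteverdi/ ? 1 : 0;

# Naive merge for this simplified case only: drop the boundary between the runs.
(my $merged = $xml) =~ s{</w:t></w:r><w:r><w:t>}{}g;

my $found_after = $merged =~ /Monteverdi/ ? 1 : 0;

print "before: $found_before, after: $found_after\n";   # before: 0, after: 1
```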

replace

  $surgeon->replace($pattern, $replacement, %replacement_args);

Replaces all occurrences of $pattern regex within the text nodes by the given $replacement. This is not exactly like a search-replace operation performed within MsWord, because the search does not cross boundaries of text nodes. In order to maximize the chances of successful replacements, the "cleanup_XML" method is automatically called before starting the operation.

The argument $pattern can be either a string or a reference to a regular expression. It should not contain any capturing parentheses, because that would perturb text splitting operations.
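
Perl's builtin split shows why capturing parentheses are a problem: every capture group in the separator adds items to the returned list. The sketch below assumes, for illustration only, that text is split around matches in a comparable way; the sample text and patterns are made up. Grouping with (?:...) avoids the issue.

```perl
use strict;
use warnings;

my $text = 'Dear Mr SMITH, dear Ms JONES.';

# A single outer capture returns the matches between the surrounding text pieces.
my @ok  = split /((?:Mr|Ms) [A-Z]+)/, $text;

# An extra capturing group inside the pattern adds unexpected items to the list.
my @bad = split /((Mr|Ms) [A-Z]+)/, $text;

printf "%d items without inner captures, %d with\n", scalar(@ok), scalar(@bad);
```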

The argument $replacement can be either a fixed string, or a reference to a callback subroutine that will be called for each match.

The %replacement_args hash can be used to pass information to the callback subroutine. That hash will be enriched with three entries :

matched

The string that has been matched by $pattern.

run

The run object in which this text resides.

xml_before

The XML fragment (possibly empty) found before the matched text.

The callback subroutine may return either plain text or structured XML. See the "SYNOPSIS" for an example of a replacement callback.
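
As a complement, here is a sketch of a callback that returns plain text and receives extra information through %replacement_args. The {{name}} placeholder syntax, the %data hash and the data key are hypothetical choices for this example. The callback is called directly, imitating the arguments documented above, so the snippet runs without a document.

```perl
use strict;
use warnings;

# Hypothetical data for a tiny templating scheme with {{name}} placeholders.
my %data = (composer => 'Claudio MONTEVERDI', city => 'Venice');

my $callback = sub {
  my %args  = @_;                               # matched, run, xml_before, plus caller-supplied entries
  my ($key) = $args{matched} =~ /(\w+)/;        # name inside the braces
  return $args{data}{$key} // $args{matched};   # leave unknown placeholders untouched
};

# Direct call that imitates what replace() is documented to pass to the callback.
my $result = $callback->(matched => '{{composer}}', run => undef, xml_before => '', data => \%data);
print "$result\n";   # Claudio MONTEVERDI

# With a real document this would be, following the signature shown earlier:
#   $surgeon->replace(qr/\{\{\w+\}\}/, $callback, data => \%data);
```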

The following special keys within %replacement_args are interpreted by the replace() method itself, and therefore are not passed to the callback subroutine :

keep_xml_as_is

If true, no call is made to the "cleanup_XML" method before performing the replacements.

dont_overwrite_contents

If true, the internal XML contents is not modified in place; the new XML after performing replacements is merely returned to the caller.

change

  my $xml = $surgeon->change(
    to_delete   => $text_to_delete,
    to_insert   => $text_to_insert,
    author      => $author_string,
    date        => $date_string,
    run         => $run_object,
    xml_before  => $xml_string,
  );

This method generates markup for MsWord tracked changes. Users can then manually review those changes within MsWord and accept or reject them. This is best used in collaboration with the "replace" method : the replacement callback can call $self->change(...) to generate tracked change marks in the document.

All parameters are optional, but either to_delete or to_insert (or both) must be present. The parameters are :

to_delete

The string of text to delete (usually this will be the matched argument passed to the replacement callback).

to_insert

The string of new text to insert.

author

A short string that will be displayed by MsWord as the "author" of this tracked change.

date

A date (and optional time) in ISO format that will be displayed by MsWord as the date of this tracked change. The current date and time will be used by default.

run

A reference to the MsOffice::Word::Surgeon::Run object surrounding this tracked change. The formatting properties of that run will be copied into the <w:r> nodes of the deleted and inserted text fragments.

xml_before

An optional XML fragment to be inserted before the <w:t> node of the inserted text.

This method delegates to the MsOffice::Word::Surgeon::Change class for generating the XML markup.
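
When an explicit date is wanted, an ISO-format string can be produced with the core POSIX module. This is a sketch under assumptions: the choice of UTC with a trailing Z and the author value are illustrative, and the hash merely collects arguments in the shape shown in the signature above.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# ISO 8601 timestamp in UTC, for example 2024-01-31T12:00:00Z.
my $date = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime);

# Hypothetical argument list for change(); run and xml_before would come
# from the replacement callback.
my %change_args = (
  to_delete => 'Claudio MONTEVERDI',
  to_insert => 'A_____',
  author    => 'anonymizer',
  date      => $date,
);

print "$_ => $change_args{$_}\n" for sort keys %change_args;
```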

SEE ALSO

The https://metacpan.org/pod/Document::OOXML distribution on CPAN also manipulates docx documents, but with another approach : internally it uses XML::LibXML and XPath expressions for manipulating XML nodes. The API has some intersections with the present module, but there are also some differences : Document::OOXML has more support for styling, while MsOffice::Word::Surgeon has more flexible mechanisms for replacing text fragments.

Other programming languages also have packages for dealing with docx documents; here are some references :

https://docs.microsoft.com/en-us/office/open-xml/word-processing

The C# Open XML SDK from Microsoft

http://www.ericwhite.com/blog/open-xml-powertools-developer-center/

Additional functionalities built on top of the XML SDK.

https://poi.apache.org

An open source Java library from the Apache foundation.

https://www.docx4java.org/trac/docx4j

Another open source Java library, competitor to Apache POI.

https://phpword.readthedocs.io/en/latest/

A PHP library dealing not only with Microsoft OOXML documents but also with OASIS and RTF formats.

https://pypi.org/project/python-docx/

A Python library, documented at https://python-docx.readthedocs.io/en/latest/.

As far as I can tell, most of these libraries provide objects and methods that closely reflect the complete XML structure : for example they have classes for paragraphs, styles, fonts, inline shapes, etc.

The present module is much simpler but also much more limited : it was optimised for dealing with the text contents and offers no support for presentation or paging features. However, it has the rare advantage of providing an API for regex substitutions within Word documents.

The MsOffice::Word::Template module relies on the present module, together with the Perl Template Toolkit, to implement a templating system for Word documents.

AUTHOR

Laurent Dami, <dami AT cpan DOT org>

COPYRIGHT AND LICENSE

Copyright 2019, 2020 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.