NAME

MsOffice::Word::Surgeon - tamper with the guts of Microsoft docx documents, with regexes

SYNOPSIS

  my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename);

  # extract plain text
  my $main_text    = $surgeon->document->plain_text;
  my @header_texts = map {$surgeon->part($_)->plain_text} $surgeon->headers;

  # anonymize
  my %alias = ('Claudio MONTEVERDI' => 'A_____', 'Heinrich SCHÜTZ' => 'B_____');
  my $pattern = join "|", keys %alias;
  my $replacement_callback = sub {
    my %args =  @_;
    my $replacement = $surgeon->new_revision(to_delete  => $args{matched},
                                             to_insert  => $alias{$args{matched}},
                                             run        => $args{run},
                                             xml_before => $args{xml_before},
                                            );
    return $replacement;
  };
  $surgeon->document->replace(qr[$pattern], $replacement_callback);

  # save the result
  $surgeon->overwrite; # or ->save_as($new_filename);

VERSION

WARNING: this is version 2.0. Due to internal refactorings, some changes made to the application programming interface (API) are incompatible with version 1. Client programs may need some minor adaptations.

DESCRIPTION

Purpose

This module supports a few operations for modifying or extracting text from Microsoft Word documents in '.docx' format -- hence the name 'surgeon'. Since a surgeon does not give life, there is no support for creating fresh documents; if you have such needs, use one of the other packages listed in the "SEE ALSO" section. To my knowledge, this is the only solution (even in other languages) for applying regular expressions to the contents of Word documents.

Some applications for this module are :

  • content extraction in plain text format;

  • unlinking fields (equivalent of performing Ctrl-Shift-F9 on the whole document)

  • regex replacements within text, for example for :

    • anonymization, i.e. replacement of names or addresses by aliases;

    • templating, i.e. replacement of special markup by contents coming from a data tree (see also MsOffice::Word::Template).

  • pretty-printing the internal XML structure
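
For the anonymization case, the regex is usually built from a list of names. The following is a minimal sketch in plain Perl, independent of this module. It escapes regex metacharacters with quotemeta and sorts the names longest first, so that a name which is a prefix of another one does not win the alternation. The names, aliases and sample sentence are invented for illustration; the resulting regex could then be passed to the replace method as in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical names and aliases, for illustration only.
my %alias = (
    'Anna SMITH'       => 'A_____',
    'Anna SMITH-JONES' => 'B_____',
    'J. DOE'           => 'C_____',
);

# Longest names first, so that 'Anna SMITH-JONES' is tried before 'Anna SMITH';
# quotemeta keeps the dot in 'J. DOE' from matching any character.
my $pattern = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) || $a cmp $b }
              keys %alias;
my $regex = qr/($pattern)/;

# Plain string substitution, just to check the behaviour of the regex.
my $text = 'Anna SMITH-JONES met J. DOE and Anna SMITH; JX DOE was absent.';
(my $anonymized = $text) =~ s/$regex/$alias{$1}/g;

print "$anonymized\n";   # B_____ met C_____ and A_____; JX DOE was absent.
```

Without the sort, the outcome for 'Anna SMITH-JONES' would depend on the order in which keys happens to return the names.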

Operating mode

The format of Microsoft .docx documents is described in http://www.ecma-international.org/publications/standards/Ecma-376.htm and http://officeopenxml.com/. An excellent introduction can be found at https://www.toptal.com/xml/an-informal-introduction-to-docx. Internally, a document is a zipped archive, where the member named word/document.xml stores the main document contents, in XML format.

The present module does not parse all details of the whole XML structure because it only focuses on text nodes (those that contain literal text) and run nodes (those that contain text formatting properties). All remaining XML information, for example for representing sections, paragraphs, tables, etc., is stored as opaque XML fragments; these fragments are re-inserted at proper places when reassembling the whole document after having modified some text nodes.
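
The idea of editing only text nodes while leaving the rest of the XML untouched can be sketched in a few lines of plain Perl. This is an illustration of the principle under simplifying assumptions, not the module's actual implementation: the XML sample is hand-written, and the sketch ignores complications such as a word being split across several runs.

```perl
use strict;
use warnings;

# Hand-written sample of WordprocessingML, for illustration only.
my $xml = '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello </w:t></w:r>'
        . '<w:r><w:t xml:space="preserve">world</w:t></w:r></w:p>';

my @texts;    # literal contents of the text nodes, in document order

# Visit each <w:t> node; everything outside those nodes is left untouched.
(my $modified = $xml) =~ s{(<w:t(?:\s[^>]*)?>)(.*?)(</w:t>)}{
    # copy the captures first: the inner substitution would reset them
    my ($open, $text, $close) = ($1, $2, $3);
    push @texts, $text;
    $text =~ s/world/Perl/;
    $open . $text . $close;
}ges;

my $plain_text = join '', @texts;
print "$plain_text\n";   # Hello world
print "$modified\n";
```

In the modified string, the run properties (<w:rPr>) and the attributes of the <w:t> nodes are exactly those of the input; only the literal text has changed.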

METHODS

Constructor

new

  my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename);
  # or simply : ->new($filename);

Builds a new surgeon instance, initialized with the contents of the given filename.

Accessors

docx

Path to the .docx file.

zip

Instance of Archive::Zip associated with this file.

parts

Hashref to MsOffice::Word::Surgeon::PackagePart objects, keyed by their part name in the ZIP file. There is always a 'document' part. Currently, other optional parts may be headers and footers. Future versions may include other parts like footnotes or endnotes.

document

Shortcut to $surgeon->part('document') -- the MsOffice::Word::Surgeon::PackagePart object corresponding to the main document. See the PackagePart documentation for operations on part objects. In addition, the following operations are supported directly as methods of the $surgeon object and are automatically delegated to the document part : contents, original_contents, indented_contents, plain_text, replace.

headers

  my @header_parts = $surgeon->headers;

Returns the ordered list of names of header members stored in the ZIP file.

footers

  my @footer_parts = $surgeon->footers;

Returns the ordered list of names of footer members stored in the ZIP file.

Other methods

part

  my $part = $surgeon->part($part_name);

Returns the MsOffice::Word::Surgeon::PackagePart object corresponding to the given part name.

all_parts_do

  my $result = $surgeon->all_parts_do($method_name => %args);

Calls the given method on all part objects. Results are accumulated in a hash, with part names as keys to the results.
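
As a small usage sketch, the helper below collects the plain text of every part. It assumes that the result is returned as a hash reference, as the scalar assignment in the signature above suggests; the helper name dump_all_text and the file name report.docx are hypothetical.

```perl
use strict;
use warnings;

# Return the plain text of every part, one labelled block per part.
# $surgeon is expected to be a MsOffice::Word::Surgeon instance.
sub dump_all_text {
    my ($surgeon) = @_;

    # assumption: all_parts_do returns a hashref keyed by part name
    my $texts = $surgeon->all_parts_do('plain_text');

    return join '', map { "== $_ ==\n$texts->{$_}\n" } sort keys %$texts;
}

# Typical use (requires the module and a real file):
#   my $surgeon = MsOffice::Word::Surgeon->new(docx => 'report.docx');
#   print dump_all_text($surgeon);
```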

xml_member

  my $xml = $surgeon->xml_member($member_name);
  # or
  $surgeon->xml_member($member_name, $new_xml);

Reads or writes the given member name in the ZIP file, with utf8 decoding or encoding.

save_as

  $surgeon->save_as($docx_file);

Writes the ZIP archive into the given file.

overwrite

  $surgeon->overwrite;

Writes the updated ZIP archive into the initial file.

new_revision

  my $xml = $surgeon->new_revision(
    to_delete   => $text_to_delete,
    to_insert   => $text_to_insert,
    author      => $author_string,
    date        => $date_string,
    run         => $run_object,
    xml_before  => $xml_string,
  );

This method is syntactic sugar for instantiating the MsOffice::Word::Surgeon::Revision class and returning XML markup for MsWord revisions (a.k.a. "tracked changes") generated by that class. Users can then manually review those revisions within MsWord and accept or reject them. This is best used together with the "replace" method : the replacement callback can call $surgeon->new_revision(...) to generate revision marks in the document.

Either to_delete or to_insert (or both) must be present. Other parameters are optional. The parameters are :

to_delete

The string of text to delete (usually this will be the matched argument passed to the replacement callback).

to_insert

The string of new text to insert.

author

A short string that will be displayed by MsWord as the "author" of this revision.

date

A date (and optional time) in ISO format that will be displayed by MsWord as the date of this revision. The current date and time will be used by default.

run

A reference to the MsOffice::Word::Surgeon::Run object surrounding this revision. The formatting properties of that run will be copied into the <w:r> nodes of the deleted and inserted text fragments.

xml_before

An optional XML fragment to be inserted before the <w:t> node of the inserted text.
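
When a date other than the current one is wanted, the date parameter can be produced with the core POSIX module. This is a sketch which assumes that the ISO 8601 form in UTC with a trailing "Z" is acceptable; the helper name iso_date is hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Format an epoch time as an ISO 8601 date and time in UTC.
sub iso_date {
    my ($epoch) = @_;
    return strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
}

my $date_string = iso_date(time);
print "$date_string\n";

# The string can then be passed along with the other parameters :
#   $surgeon->new_revision(to_insert => $text_to_insert,
#                          date      => $date_string);
```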

SEE ALSO

The https://metacpan.org/pod/Document::OOXML distribution on CPAN also manipulates docx documents, but with another approach : internally it uses XML::LibXML and XPath expressions for manipulating XML nodes. The API has some intersections with the present module, but there are also some differences : Document::OOXML has more support for styling, while MsOffice::Word::Surgeon has more flexible mechanisms for replacing text fragments.

Other programming languages also have packages for dealing with docx documents; here are some references :

https://docs.microsoft.com/en-us/office/open-xml/word-processing

The C# Open XML SDK from Microsoft

http://www.ericwhite.com/blog/open-xml-powertools-developer-center/

Additional functionalities built on top of the XML SDK.

https://poi.apache.org

An open source Java library from the Apache foundation.

https://www.docx4java.org/trac/docx4j

Another open source Java library, competitor to Apache POI.

https://phpword.readthedocs.io/en/latest/

A PHP library dealing not only with Microsoft OOXML documents but also with OASIS and RTF formats.

https://pypi.org/project/python-docx/

A Python library, documented at https://python-docx.readthedocs.io/en/latest/.

As far as I can tell, most of these libraries provide objects and methods that closely reflect the complete XML structure : for example they have classes for paragraphs, styles, fonts, inline shapes, etc.

The present module is much simpler but also much more limited : it was optimised for dealing with the text contents and offers no support for presentation or paging features. However, it has the rare advantage of providing an API for regex substitutions within Word documents.

The MsOffice::Word::Template module relies on the present module, together with the Perl Template Toolkit, to implement a templating system for Word documents.

AUTHOR

Laurent Dami, <dami AT cpan DOT org>

COPYRIGHT AND LICENSE

Copyright 2019-2022 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.