NAME

Wubot::Reactor::HTMLStrip - strip HTML data from a field

VERSION

version 0.2_002

SYNOPSIS

  - name: strip HTML from 'title' field and store results in the field title_text
    plugin: HTMLStrip
    config:
      field: title

  - name: strip HTML from the title field in-situ
    plugin: HTMLStrip
    config:
      field: title
      target_field: title

DESCRIPTION

The HTMLStrip plugin uses the Perl module HTML::Strip to remove HTML from a field. The original field content is not overwritten by default.

If you do not specify a 'target_field', the HTML-stripped content is stored in a newly created field that has the same name as the original field plus the suffix '_text'. For example, if 'field' is 'subject', the results go into 'subject_text'.

If you specify a 'target_field' in the config, the HTML-stripped text is stored in that field instead. To replace the contents of an existing field with the HTML-stripped content, set 'field' and 'target_field' to the same field name.
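
As an illustration of the third case, where the stripped text goes into a separately named field, a rule might look like the following sketch. The field names 'body' and 'body_plain' are hypothetical and chosen only for this example; the original 'body' field would be left untouched.

  - name: strip HTML from the body field and store the result in body_plain
    plugin: HTMLStrip
    config:
      field: body
      target_field: body_plain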

HTML::Strip can leave many \xA0 (non-breaking space) characters in the text, which can be difficult to deal with. HTMLStrip therefore replaces each such character with a single space.

If the resulting text is flagged as utf8 (according to utf8::is_utf8), it is passed to utf8::encode(), which converts the character string into UTF-8 encoded bytes, before it is stored in the new field.