Wubot::Reactor::HTMLStrip - strip HTML data from a field
  - name: strip HTML from 'title' field and store results in the field title_text
    plugin: HTMLStrip
    config:
      field: title

  - name: strip HTML from the title field in-situ
    plugin: HTMLStrip
    config:
      field: title
      target_field: title
The HTMLStrip plugin uses the Perl module HTML::Strip to remove HTML from a field. By default, the original field content is not overwritten.

If you do not specify a 'target_field', the HTML-stripped content is stored in a newly created field that has the same name as the original field plus _text. For example, if you use the 'subject' field, the results go into 'subject_text' by default.

If you specify a 'target_field' in the config, the HTML-stripped text is stored in that field instead. To replace the contents of an existing field with the HTML-stripped content, set 'field' and 'target_field' to the same field.
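The field-naming rule above can be sketched in a few lines of Perl. This is an illustration only, not the plugin's actual code: the helper name resolve_target_field is hypothetical, and it assumes the config arrives as a hash reference with the 'field' and 'target_field' keys described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Wubot): pick the field that receives
# the HTML-stripped text, following the rules described in the text.
sub resolve_target_field {
    my ($config) = @_;

    # An explicit target_field wins; otherwise append "_text" to the
    # name of the source field.
    return defined $config->{target_field}
        ? $config->{target_field}
        : $config->{field} . '_text';
}

# Default: a new field named after the source field.
print resolve_target_field( { field => 'subject' } ), "\n";

# In-situ replacement: field and target_field are the same.
print resolve_target_field( { field => 'title', target_field => 'title' } ), "\n";
```

The first call yields 'subject_text' and the second yields 'title', matching the two cases in the examples at the top.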
HTML::Strip can leave many \xA0 (non-breaking space) characters in the text, which can be difficult to deal with, so HTMLStrip replaces all such characters with a single space.
If the resulting text is flagged as UTF-8 (according to utf8::is_utf8), it is passed to utf8::encode() before being stored.
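The two clean-up steps can be sketched as follows. This is a minimal illustration using only core Perl, not the plugin's actual implementation: the helper name clean_stripped_text is hypothetical, the input is assumed to have already been through HTML::Strip, and both the one-for-one replacement of \xA0 and the order of the two steps are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Wubot): post-process text from which
# the HTML has already been removed.
sub clean_stripped_text {
    my ($text) = @_;

    # Assumption: each non-breaking space (U+00A0) becomes one plain space.
    $text =~ s/\xA0/ /g;

    # If Perl has the string flagged as UTF-8 internally, convert the
    # characters into UTF-8 encoded bytes (this also clears the flag).
    utf8::encode($text) if utf8::is_utf8($text);

    return $text;
}

my $cleaned = clean_stripped_text("Hello\xA0world");
print "$cleaned\n";
```

After utf8::encode() the value is a plain byte string, so its length is counted in bytes rather than characters.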