Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. This module is used to reduce the noise content in web pages and thus identify the content...
JZHANG/HTML-ContentExtractor-0.03 - 23 Jun 2007 01:36:57 GMT
HTML::Content::ContentExtractor attempts to extract the content from HTML documents. It attempts to remove tags, scripts and boilerplate text from the documents by trying to find the region of the HTML document that has the maximum ratio of words to ...
JTAVERNI/HTML-Content-Extractor-0.01 - 22 Aug 2005 03:38:43 GMT
- HTML::Content::Extractor - Extracts the main text of a publication from an HTML page, along with the main media content associated with that text
- HTML::WordTagRatio::Ratio - Default module for determining the ratio of words to tags in a range of tokens in an HTML document.
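
The words-to-tags idea mentioned in the HTML::Content::ContentExtractor and HTML::WordTagRatio::Ratio descriptions can be illustrated with a small sketch. This is not the API or the implementation of either module: it is written in Python for illustration, the function names (`tokenize`, `best_region`, `extract_content`) are hypothetical, and the score `words / (tags + 1)` is an assumed stand-in for whatever ratio the modules actually compute.

```python
import re

# A token is either a whole tag ("<...>") or a run of non-space text (a "word").
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Return a list of (text, is_tag) tokens, with script/style bodies removed."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    return [(m.group(0), m.group(0).startswith("<"))
            for m in TOKEN_RE.finditer(html)]


def best_region(tokens):
    """Return (start, end) of the token range with the highest words / (tags + 1).

    Assumption: the "+ 1" keeps the score finite for tag-free ranges and makes
    longer runs of plain text score higher than shorter ones.
    """
    # tag_prefix[i] = number of tag tokens in tokens[:i]
    tag_prefix = [0]
    for _, is_tag in tokens:
        tag_prefix.append(tag_prefix[-1] + is_tag)

    best, best_score = (0, 0), -1.0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            tags = tag_prefix[j] - tag_prefix[i]
            words = (j - i) - tags
            score = words / (tags + 1)
            if score > best_score:
                best, best_score = (i, j), score
    return best


def extract_content(html):
    """Return the words of the best-scoring region as a single string."""
    tokens = tokenize(html)
    start, end = best_region(tokens)
    return " ".join(text for text, is_tag in tokens[start:end] if not is_tag)
```

The exhaustive search over token ranges is quadratic in the number of tokens, which is acceptable for a short page but would need a smarter search for large documents.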