Andrea Holstein

NAME

Lingua::DE::Sentence - Perl extension for splitting German texts into their sentences.

SYNOPSIS

    use Lingua::DE::Sentence;
    my $nr = 0;                              # running sentence number
    my $sentences = get_sentences($text);    # $text holds the text to split
    foreach (@$sentences) {
        print $nr++, "\t$_\n";
    }

    or

    use Lingua::DE::Sentence;
    my $nr = 0;                              # running sentence number
    my ($sentences, $positions) = get_sentences($text);
    for my $i (0 .. $#$sentences) {
        # each position is a pair: [start, end]
        print "\n", $nr++, "\t",
              $positions->[$i]->[0], "-", $positions->[$i]->[1],
              "\t", $sentences->[$i];
    }

DESCRIPTION

The Lingua::DE::Sentence module contains the function get_sentences, which splits text into its constituent sentences. The result is either the list of sentences in the text, or that list plus a list of their absolute positions in the text. It is based on a regular expression that finds possible sentence endings and on many small rules that handle exceptions such as acronyms or numbers.

There is a large list of known abbreviations and a smaller list of known file extensions, which are used to distinguish acronyms and filenames from sentence endings. Both lists can be extended or replaced if needed.

EXPORT

get_sentences by default.

You can further export the following functions: get_sentences, get_acronyms, set_acronyms, add_acronyms, get_file_extensions, set_file_extensions, add_file_extensions.

ALGORITHM

Basically, I use a "big" regular expression to find possible sentence endings. This regular expression finds punctuation marks (.?!) or sequences of punctuation marks like ?? or !?, possibly followed by quotation marks or brackets like "'), but never by a comma. An empty line is interpreted as a sentence end, too. And so, of course, is the end of the text.
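
The search for candidates described above could look roughly like the following. This is only a minimal sketch, not the module's actual regular expression: the pattern, the name $CANDIDATE and the helper candidate_endings are made up for illustration. The possessive quantifiers (++ and *+, Perl 5.10 or later) keep the regex engine from backtracking to a shorter match when a comma follows.

```perl
use strict;
use warnings;

# Hypothetical pattern, NOT the module's real one.
my $CANDIDATE = qr/
    [.?!]++         # one or more punctuation marks, such as ? or ?? or !?
    ["')\]]*+       # optional closing quotation marks or brackets
    (?!,)           # but never directly followed by a comma
  |
    \n[ \t]*\n      # an empty line ends a sentence, too
  |
    \z              # and so does the end of the text
/x;

# Return the offsets just behind every candidate sentence ending.
sub candidate_endings {
    my ($text) = @_;
    my @ends;
    while ($text =~ /$CANDIDATE/g) {
        push @ends, pos($text);
        last if pos($text) >= length $text;   # end of text reached
    }
    return @ends;
}
```

Every offset returned here is only a candidate; it still has to pass the exception checks before it counts as a real sentence ending.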

Then the candidate sentence endings are checked for exceptions. To do this, I take two substrings: the first runs from the last sentence ending to the current position, the second starts at the current position and is 100 characters long. This way I can test the surrounding text without any slow substitution and without using $`, ... . Before checking, I cut leading spaces and any other junk from the beginning of the sentence and throw it away. I use some heuristics:

Empty sentences

Sentences without any word character don't make any sense.

Enumerations

Something like 7 .. 24 or 1, 2, ... is not a sentence ending.

Abbreviations

In German, one letter plus dot is nearly always an acronym. Life ain't easy, though. In an earlier version I had implemented the following rule: every lowercase letter like a. or b. is interpreted as an acronym. Uppercase letters can be regular words, e.g. "Spieler A schoss den Ball zu Spieler B.". I decided to treat I, X, V and S as acronyms (I, X and V are Roman numerals, S. stands for "Seite"). For the other uppercase letters, I looked at where I found them: only if they occurred in a short sentence (less than 25 chars) were they treated as acronyms. That sounds strange, but it is a cool and functional algorithm. Of course, something like "S.u.S.e" or "z.B." is always an abbreviation. But names follow no rules, e.g. J. Edgar Hoover or F. A. Lange. So now every single letter plus dot is an abbreviation for me. I am working on a solution that looks a little bit ahead: if A. is followed by words like 'der', 'die' or 'das', it is often really a sentence end.

Another form of abbreviation is the known acronym; I have listed about 370 of them. I hope that is enough for most cases.

Last, I check whether the word before the dot ends with a lot of consonants, or whether the word consists only of consonants or only of vowels. This lets me interpret "Dtschl." correctly.
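
The abbreviation checks can be sketched like this. It is a rough illustration under my own assumptions, not the module's code: the tiny %KNOWN list, the function name looks_like_abbreviation and the threshold of five trailing consonants are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical mini list; the module ships its own, much longer list.
my %KNOWN = map { $_ => 1 } qw(bzw ca usw Nr Dr Prof);

my $CONSONANT = qr/[b-df-hj-np-tv-z]/i;
my $VOWEL     = qr/[aeiou]/i;

# Guess whether the word directly before a dot is an abbreviation.
sub looks_like_abbreviation {
    my ($word) = @_;
    return 1 if $word =~ /\A[A-Za-z]\z/;            # one letter plus dot
    return 1 if $KNOWN{$word};                      # known acronym
    return 1 if $word =~ /\A(?:$CONSONANT)+\z/;     # only consonants
    return 1 if $word =~ /\A(?:$VOWEL)+\z/;         # only vowels
    return 1 if $word =~ /(?:$CONSONANT){5,}\z/;    # ends with many consonants (threshold assumed)
    return 0;
}
```

With these assumptions, looks_like_abbreviation("Dtschl") is true, while an ordinary word such as "Haus" is not flagged.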

Ordinal-Numbers

0., 1., 2., ..., 39. look just like 1st, 2nd, 3rd, ..., 39th. In more than 50 % of cases that is exactly what they are. So a sentence cannot end on these numbers. Of course, saying "Ich wurde geboren im Jahre 1843." is O.K. Numbers ending in 00 like 100, 1000, ... are used even more often as 100th, 1000th, ... . Of course, 1900, 2000 and 2100 are year numbers, not 1900th. I have taken that into account, too.
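
Written down as code, the number rules above might look like this. It is a small sketch only; the function name looks_like_ordinal is hypothetical and the module may implement the check quite differently.

```perl
use strict;
use warnings;

# Decide whether the number in front of a dot is probably an ordinal
# (so the dot is not a sentence ending).
sub looks_like_ordinal {
    my ($number) = @_;
    return 0 unless $number =~ /\A\d+\z/;                # digits only
    return 0 if $number =~ /\A(?:1900|2000|2100)\z/;     # year numbers
    return 1 if $number <= 39;                           # 0. to 39.
    return 1 if $number =~ /00\z/;                       # 100., 1000., ...
    return 0;
}
```

A year like 1843 falls through all rules, so the dot after it can still end the sentence.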

Rational Numbers, IP-Numbers, Phone-Numbers

Something like 127.32.2345.0 or 123.5 is handled correctly.
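
One simple way to recognize such dots is to look at the characters on both sides of them. The sketch below assumes two substrings, one ending at the dot and one starting right after it; the name dot_inside_number is made up, and this is not necessarily how the module does it.

```perl
use strict;
use warnings;

# $before: text up to and including the candidate dot
# $after:  text directly following the dot
sub dot_inside_number {
    my ($before, $after) = @_;
    # a digit on both sides of the dot means we are inside a number
    return ($before =~ /\d\.\z/ && $after =~ /\A\d/) ? 1 : 0;
}
```

Because only the ends of the two substrings are inspected, no substitution and no $` is needed.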

URLs

URLs often contain dots and question marks. Whatever looks like a URL is interpreted correctly. For me, a URL is something starting with http, file, ftp, ... . Or it is a sequence of lowercase words separated by punctuation marks. Lowercase is important, because many people don't put a space after the dot, but even they start their sentences with an uppercase word.

Punctuation in brackets

In German, it is common to mark parts of a sentence with "(!)", "(?)", "(?!)", ... E.g.: "Ich muss mich auf verschiedene (!) Browser einrichten." An opening bracket directly before a punctuation mark signals that.
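
As a rough sketch, the URL test could be written like this for a single whitespace-delimited token. The function name looks_like_url, the short scheme list and the set of separator characters are my assumptions for the example, not the module's real rules.

```perl
use strict;
use warnings;

# Guess whether a token (text between two whitespaces) is a URL.
sub looks_like_url {
    my ($token) = @_;
    # starts with a known scheme (list assumed, not complete)
    return 1 if $token =~ m{\A(?:https?|file|ftp):}i;
    # or: lowercase words separated by single punctuation marks
    return 1 if $token =~ m{\A[a-z]+(?:[./?][a-z]+)+\z};
    return 0;
}
```

A token like "ende.Dann" is not matched, because the uppercase letter after the dot suggests a new sentence with a missing space.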
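
This check only needs the text in front of the current position. A minimal sketch, assuming $before ends with the punctuation marks just found (the name punctuation_in_brackets is hypothetical):

```perl
use strict;
use warnings;

# $before: text up to and including the punctuation marks just found
sub punctuation_in_brackets {
    my ($before) = @_;
    # an opening bracket directly before the punctuation marks
    return $before =~ /\([.?!]+\z/ ? 1 : 0;
}
```

For "Ich muss mich auf verschiedene (!" the function returns true, so the "!" is not taken as a sentence ending.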

Filenames

Many documents contain strings like "readme.html", "dokument1.doc" and so on. I have a short list of common file extensions. If the word after the dot consists only of consonants (like html, ...), I also treat it as a file extension (or as something else strange). I hope that solves the problem.

Although there are many rules, they are implemented to run fast. There are no substitutions, no $`, ... .
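
A sketch of the filename rule, looking only at the text behind the dot. The %EXT list and the name looks_like_file_extension are invented for this example; the module has its own list of extensions.

```perl
use strict;
use warnings;

# Hypothetical mini list; the module has its own list.
my %EXT = map { $_ => 1 } qw(html htm txt doc pdf);

# $after: text directly following the dot
sub looks_like_file_extension {
    my ($after) = @_;
    return 0 unless $after =~ /\A([A-Za-z]+)/;   # a word must follow at once
    my $word = lc $1;
    return 1 if $EXT{$word};                     # known file extension
    return 1 if $word !~ /[aeiou]/;              # only consonants, like "tgz"
    return 0;
}
```

If a space follows the dot, no word follows at once and the function returns false, so a normal sentence ending is not affected.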

FUNCTIONS

get_sentences( $text )

The get_sentences function takes a scalar containing ASCII text as an argument. In scalar context it returns a reference to an array of the sentences the text has been split into. In list context it returns a list of two references: one to the array of sentences and one to an array of the absolute positions of the sentences. Every position is an array of two elements; the first is the start and the second is the end of the sentence. Calling the function in list context takes a little more time (ca. 5%) because of the extra work for the position list. Returned sentences are trimmed of whitespace and senseless beginnings.

get_acronyms( )

This function returns the currently defined list of acronyms.

set_acronyms( @my_acronyms )

This function replaces the predefined acronym list with the given list. Feel free to suggest missing acronyms to me.

add_acronyms( @acronyms )

This function is used for adding acronyms not supported by this code.

get_file_extensions( )

This function returns the currently defined list of file extensions.

set_file_extensions( @my_file_extensions )

This function replaces the predefined file extension list with the given list. Feel free to suggest missing file extensions to me.

add_file_extensions( @extensions )

This function is used for adding file extensions not supported by this code.

BUGS

Sentences like 'Spieler A schoss den Ball zu Spieler B.' are misinterpreted: B. is always treated as an acronym. The same goes for sentences which end with small numbers.

Many abbreviations and file extensions are still missing; feel free to contact me.

If a sentence starts with the incorrect quotes >>quote<<, the '>>' characters are removed. It's not really a bug, it's a feature. The module assumes that these are quotations from emails like

  Andrea Holstein wrote:
  > ...
  > ...
  >
  >

You should use the right form of quoting: <<quote>>.

Some texts use this form of quoting: ,,quote''. Well, the commas are removed, too.

This module tries to use a German locale setting: on a POSIX OS it tries to set the locale to de_DE. On a non-POSIX OS, or if no German locale is installed, the module won't work.

One of the greatest bugs is surely my bad English. Sorry.

AUTHOR

Andrea Holstein <andrea_holsten@yahoo.de>

SEE ALSO

       Lingua::EN::Sentence
       Text::Sentence

COPYRIGHT

       Copyright (c) 2001 Andrea Holstein. All rights reserved.

       This library is free software.
       You can redistribute it and/or modify it under the same terms as Perl itself.


