<html><head>
<title>French stemming algorithm</title></head>
<body bgcolor="WHITE">
<h1 align="center">French stemming algorithm</h1>
<table width="75%" align="center" cols="1">
<tbody><tr><td>
<br> <h2>Links to resources</h2>
<dl><dd><table cellpadding="0">
<tbody><tr><td>
Romance language stemmers
</td></tr></tbody></table></dd></dl>
</td></tr>
<tr><td bgcolor="lightpink">
<br><br>
Here is a sample of French vocabulary, with the stemmed forms that will
be generated with this algorithm.
<br><br>
<dl><dd><table cellpadding="0">
<tbody><tr><td> <b>word</b> </td>
<td></td><td> </td>
<td></td><td> <b>stem</b> </td>
<td></td><td> </td>
<td></td><td> <b>word</b> </td>
<td></td><td> </td>
<td></td><td> <b>stem</b> </td>
</tr>
<tr><td>
continu<br>
continua<br>
continuait<br>
continuant<br>
continuation<br>
continue<br>
continué<br>
continuel<br>
continuelle<br>
continuellement<br>
continuelles<br>
continuels<br>
continuer<br>
continuera<br>
continuerait<br>
continueront<br>
continuez<br>
continuité<br>
continuons<br>
contorsions<br>
contour<br>
contournait<br>
contournant<br>
contourne<br>
contours<br>
contractait<br>
contracté<br>
contractée<br>
contracter<br>
contractés<br>
contractions<br>
contradictoirement<br>
contradictoires<br>
contraindre<br>
contraint<br>
contrainte<br>
contraintes<br>
contraire<br>
contraires<br>
contraria<br>
</td>
<td></td><td> <tt><b> => </b></tt> </td>
<td></td><td>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continuel<br>
continuel<br>
continuel<br>
continuel<br>
continuel<br>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continuon<br>
contors<br>
contour<br>
contourn<br>
contourn<br>
contourn<br>
contour<br>
contract<br>
contract<br>
contract<br>
contract<br>
contract<br>
contract<br>
contradictoir<br>
contradictoir<br>
contraindr<br>
contraint<br>
contraint<br>
contraint<br>
contrair<br>
contrair<br>
contrari<br>
</td>
<td></td><td> </td>
<td></td><td>
main<br>
mains<br>
maintenaient<br>
maintenait<br>
maintenant<br>
maintenir<br>
maintenue<br>
maintien<br>
maintint<br>
maire<br>
maires<br>
mairie<br>
mais<br>
maïs<br>
maison<br>
maisons<br>
maistre<br>
maitre<br>
maître<br>
maîtres<br>
maîtresse<br>
maîtresses<br>
majesté<br>
majestueuse<br>
majestueusement<br>
majestueux<br>
majeur<br>
majeure<br>
major<br>
majordome<br>
majordomes<br>
majorité<br>
majorités<br>
mal<br>
malacca<br>
malade<br>
malades<br>
maladie<br>
maladies<br>
maladive<br>
</td>
<td></td><td> <tt><b> => </b></tt> </td>
<td></td><td>
main<br>
main<br>
mainten<br>
mainten<br>
mainten<br>
mainten<br>
maintenu<br>
maintien<br>
maintint<br>
mair<br>
mair<br>
mair<br>
mais<br>
maï<br>
maison<br>
maison<br>
maistr<br>
maitr<br>
maîtr<br>
maîtr<br>
maîtress<br>
maîtress<br>
majest<br>
majestu<br>
majestu<br>
majestu<br>
majeur<br>
majeur<br>
major<br>
majordom<br>
majordom<br>
major<br>
major<br>
mal<br>
malacc<br>
malad<br>
malad<br>
malad<br>
malad<br>
malad<br>
</td>
</tr>
</tbody></table></dd></dl>
</td></tr>
<tr><td>
<br><br>
<br> <h2>The stemming algorithm</h2>
Letters in French include the following accented forms,
<dl><dd>
<b><i>â à ç ë é ê è ï î ô û ù</i></b>
</dd></dl>
The following letters are vowels:
<dl><dd>
<b><i>a e i o u y â à ë é ê è ï î ô û ù</i></b>
</dd></dl>
Assume the word is in lower case. Then put into upper case <b><i>u</i></b> or <b><i>i</i></b> preceded
and followed by a vowel, and <b><i>y</i></b> preceded or followed by a vowel. <b><i>u</i></b> after <b><i>q</i></b> is
also put into upper case. For example,
<dl><dd><table cellpadding="0">
<tbody><tr><td> jouer </td><td></td><td> <tt>-></tt> </td><td></td><td> joUer
</td></tr><tr><td> ennuie </td><td></td><td> <tt>-></tt> </td><td></td><td> ennuIe
</td></tr><tr><td> yeux </td><td></td><td> <tt>-></tt> </td><td></td><td> Yeux
</td></tr><tr><td> quand </td><td></td><td> <tt>-></tt> </td><td></td><td> qUand
</td></tr></tbody></table></dd></dl>
(The upper case forms are not then classed as vowels.)
<br><br>
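The marking rule above can be sketched in Python as follows. This is only an illustration: the names <tt>VOWELS</tt> and <tt>mark_vowels</tt> are invented here, and the sketch scans once from left to right, so words in which two candidate letters are adjacent may be resolved differently from the Snowball <tt>prelude</tt> routine.

```python
# Vowel set taken from the list above; the upper-case markers are deliberately absent.
VOWELS = set("aeiouyâàëéêèïîôûù")


def mark_vowels(word: str) -> str:
    """Upper-case u/i between vowels, y next to a vowel, and u after q."""
    chars = list(word)
    for i, ch in enumerate(word):
        # The left neighbour is read from the partly marked word, so a letter
        # that has already been upper-cased no longer counts as a vowel.
        prev_is_vowel = i > 0 and chars[i - 1] in VOWELS
        next_is_vowel = i + 1 < len(word) and word[i + 1] in VOWELS
        if ch in ("u", "i") and prev_is_vowel and next_is_vowel:
            chars[i] = ch.upper()
        elif ch == "y" and (prev_is_vowel or next_is_vowel):
            chars[i] = "Y"
        elif ch == "u" and i > 0 and word[i - 1] == "q":
            chars[i] = "U"
    return "".join(chars)


for w in ("jouer", "ennuie", "yeux", "quand"):
    print(w, "->", mark_vowels(w))
```

The four words printed are the examples from the table above.
<br><br>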
If the word begins with two vowels, <i>RV</i> is the region after the third
letter, otherwise the region after the first vowel not at the beginning of
the word, or the end of the word if these positions cannot be found.
<br><br>
For example,
<br><pre>    a i m e r     a d o r e r     v o l e r
         |...|         |.....|       |.....|
</pre>
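As a rough sketch of the <i>RV</i> definition, the function below returns the index at which <i>RV</i> starts. The name <tt>rv_start</tt> is an assumption made for this example, and the word is assumed to have been through the vowel marking already.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")


def rv_start(word: str) -> int:
    """Index where RV begins; len(word) means RV is empty."""
    n = len(word)
    # Word begins with two vowels: RV is the region after the third letter.
    if n >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, n)
    # Otherwise: the region after the first vowel not at the beginning.
    for i in range(1, n):
        if word[i] in VOWELS:
            return i + 1
    return n


for w in ("aimer", "adorer", "voler"):
    print(w, "RV =", w[rv_start(w):])
```
<br><br>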
<i>R</i>1 is the region after the first non-vowel following a vowel, or the end of
the word if there is no such non-vowel.
<i>R</i>2 is the region after the first non-vowel following a vowel in <i>R</i>1, or the
end of the word if there is no such non-vowel.
<br><br>
For example:
<br><pre>    f a m e u s e m e n t
         |......R1.......|
               |...R2....|
</pre>
Note that <i>R</i>1 can contain <i>RV</i> (<i>adorer</i>), and <i>RV</i> can contain <i>R</i>1 (<i>voler</i>).
<br><br>
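The definitions of <i>R</i>1 and <i>R</i>2 can be sketched with one helper applied twice. The names <tt>after_vowel_then_non_vowel</tt> and <tt>r1_r2</tt> are hypothetical and exist only for this illustration.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")


def after_vowel_then_non_vowel(word: str, start: int) -> int:
    """Index just past the first non-vowel that follows a vowel, from start."""
    n = len(word)
    i = start
    while i < n and word[i] not in VOWELS:  # skip to the first vowel
        i += 1
    while i < n and word[i] in VOWELS:      # skip to the next non-vowel
        i += 1
    return i + 1 if i < n else n            # end of word if none is found


def r1_r2(word: str) -> tuple[int, int]:
    """Start indices of R1 and R2."""
    r1 = after_vowel_then_non_vowel(word, 0)
    r2 = after_vowel_then_non_vowel(word, r1)
    return r1, r2


r1, r2 = r1_r2("fameusement")
print("R1 =", "fameusement"[r1:], " R2 =", "fameusement"[r2:])
```

Applied to <i>adorer</i> and <i>voler</i>, the same helper gives <i>R</i>1 regions <i>orer</i> and <i>er</i>, which is consistent with the note above about <i>R</i>1 and <i>RV</i> containing one another.
<br><br>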
Below, ‘delete if in <i>R</i>2’ means that a found suffix should be removed if it
lies entirely in <i>R</i>2, but not if it overlaps <i>R</i>2 and the rest of the word.
‘delete if in <i>R</i>1 and preceded by <i>X</i>’ means that <i>X</i> itself does not have to
come in <i>R</i>1, while ‘delete if preceded by <i>X</i> in <i>R</i>1’ means that <i>X</i>, like the
suffix, must be entirely in <i>R</i>1.
<br><br>
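The three phrasings can be told apart in code. The sketch below assumes a region is represented by its start index, and all three function names are invented for the example.

```python
def ends_in_region(word: str, suffix: str, region_start: int) -> bool:
    """The suffix is found and lies entirely inside the region."""
    return word.endswith(suffix) and len(word) - len(suffix) >= region_start


def in_region_and_preceded_by(word: str, suffix: str, x: str, region_start: int) -> bool:
    """'in R and preceded by X': only the suffix has to be in the region."""
    stem = word[: len(word) - len(suffix)]
    return ends_in_region(word, suffix, region_start) and stem.endswith(x)


def preceded_by_in_region(word: str, suffix: str, x: str, region_start: int) -> bool:
    """'preceded by X in R': X and the suffix must both be in the region."""
    return ends_in_region(word, x + suffix, region_start)


# fameusement has R1 starting at index 3 and R2 starting at index 6.
print(ends_in_region("fameusement", "ement", 6))                    # suffix in R2
print(in_region_and_preceded_by("fameusement", "ement", "eus", 6))  # eus may lie outside R2
print(preceded_by_in_region("fameusement", "ement", "eus", 6))      # eus is not inside R2
```
<br><br>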
Start with step 1.
<br><br>
Step 1: Standard suffix removal
<dl><dd>
Search for the longest among the following suffixes, and perform the
action indicated.
<br><br>
</dd><dl><dt><b><i>ance iqUe isme able iste eux ances iqUes ismes ables istes</i></b>
</dt><dd>delete if in <i>R</i>2
<br><br>
</dd><dt><b><i>atrice ateur ation atrices ateurs ations</i></b>
</dt><dd>delete if in <i>R</i>2
</dd><dd>if preceded by <b><i>ic</i></b>, delete if in <i>R</i>2, else replace by <b><i>iqU</i></b>
<br><br>
</dd><dt><b><i>logie logies</i></b>
</dt><dd>replace with <b><i>log</i></b> if in <i>R</i>2
<br><br>
</dd><dt><b><i>usion ution usions utions</i></b>
</dt><dd>replace with <b><i>u</i></b> if in <i>R</i>2
<br><br>
</dd><dt><b><i>ence ences</i></b>
</dt><dd>replace with <b><i>ent</i></b> if in <i>R</i>2
<br><br>
</dd><dt><b><i>ement ements</i></b>
</dt><dd>delete if in <i>RV</i>
</dd><dd>if preceded by <b><i>iv</i></b>, delete if in <i>R</i>2 (and if further preceded by <b><i>at</i></b>,
delete if in <i>R</i>2), otherwise,
</dd><dd>if preceded by <b><i>eus</i></b>, delete if in <i>R</i>2, else replace by <b><i>eux</i></b>
if in <i>R</i>1, otherwise,
</dd><dd>if preceded by <b><i>abl</i></b> or <b><i>iqU</i></b>, delete if in <i>R</i>2, otherwise,
</dd><dd>if preceded by <b><i>ièr</i></b> or <b><i>Ièr</i></b>, replace by <b><i>i</i></b> if in <i>RV</i>
<br><br>
</dd><dt><b><i>ité ités</i></b>
</dt><dd>delete if in <i>R</i>2
</dd><dd>if preceded by <b><i>abil</i></b>, delete if in <i>R</i>2, else replace by <b><i>abl</i></b>,
otherwise,
</dd><dd>if preceded by <b><i>ic</i></b>, delete if in <i>R</i>2, else replace by <b><i>iqU</i></b>, otherwise,
</dd><dd>if preceded by <b><i>iv</i></b>, delete if in <i>R</i>2
<br><br>
</dd><dt><b><i>if ive ifs ives</i></b>
</dt><dd>delete if in <i>R</i>2
</dd><dd>if preceded by <b><i>at</i></b>, delete if in <i>R</i>2 (and if further preceded by <b><i>ic</i></b>,
delete if in <i>R</i>2, else replace by <b><i>iqU</i></b>)
<br><br>
</dd><dt><b><i>eaux</i></b>
</dt><dd>replace with <b><i>eau</i></b>
<br><br>
</dd><dt><b><i>aux</i></b>
</dt><dd>replace with <b><i>al</i></b> if in <i>R</i>1
<br><br>
</dd><dt><b><i>euse euses</i></b>
</dt><dd>delete if in <i>R</i>2, else replace by <b><i>eux</i></b> if in <i>R</i>1
<br><br>
</dd><dt><b><i>issement issements</i></b>
</dt><dd>delete if in <i>R</i>1 and preceded by a non-vowel
<br><br>
</dd><dt><b><i>amment</i></b>
</dt><dd>replace with <b><i>ant</i></b> if in <i>RV</i>
<br><br>
</dd><dt><b><i>emment</i></b>
</dt><dd>replace with <b><i>ent</i></b> if in <i>RV</i>
<br><br>
</dd><dt><b><i>ment ments</i></b>
</dt><dd>delete if preceded by a vowel in <i>RV</i>
</dd></dl></dl>
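To illustrate the "longest suffix, then action" pattern of step 1, here is a minimal sketch covering only a few of the rules above (<b><i>logie</i></b>, <b><i>usion</i></b>/<b><i>ution</i></b>, <b><i>ence</i></b> and <b><i>euse</i></b>, with their plurals). It is not the full step: the function names are hypothetical, and the region starts <tt>r1</tt> and <tt>r2</tt> are assumed to have been computed beforehand.

```python
# Suffixes that are replaced when they lie in R2 (subset of step 1 only).
R2_REPLACE = {
    "logie": "log", "logies": "log",
    "usion": "u", "ution": "u", "usions": "u", "utions": "u",
    "ence": "ent", "ences": "ent",
}
EUSE = ("euse", "euses")


def longest_suffix(word, suffixes):
    """Longest listed suffix that the word ends with, or None."""
    matches = [s for s in suffixes if word.endswith(s)]
    return max(matches, key=len) if matches else None


def step1_subset(word, r1, r2):
    """Apply the handful of step 1 rules listed above; r1, r2 are start indices."""
    suffix = longest_suffix(word, list(R2_REPLACE) + list(EUSE))
    if suffix is None:
        return word
    start = len(word) - len(suffix)
    stem = word[:start]
    if suffix in R2_REPLACE:
        return stem + R2_REPLACE[suffix] if start >= r2 else word
    # euse / euses: delete if in R2, else replace by eux if in R1
    if start >= r2:
        return stem
    if start >= r1:
        return stem + "eux"
    return word


print(step1_subset("majestueuse", 3, 5))
print(step1_subset("fameuse", 3, 6))
```

For <i>majestueuse</i> the suffix lies in <i>R</i>2 and is deleted, while for <i>fameuse</i> it lies only in <i>R</i>1 and is replaced by <b><i>eux</i></b>.
<br><br>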
In steps 2<i>a</i> and 2<i>b</i> all tests are confined to the <i>RV</i> region.
<br><br>
Do step 2<i>a</i> if either no ending was removed by step 1, or if one of the endings
<b><i>amment</i></b>, <b><i>emment</i></b>, <b><i>ment</i></b>, <b><i>ments</i></b> was found.
<br><br>
Step 2<i>a</i>: Verb suffixes beginning <b><i>i</i></b>
<dl><dd>
Search for the longest among the following suffixes and if found,
delete if preceded by a non-vowel.
<br><br>
</dd><dl><dd>
<b><i>îmes ît îtes i ie ies ir ira irai iraIent irais irait iras
irent irez iriez irions irons iront is issaIent issais issait
issant issante issantes issants isse issent isses issez issiez
issions issons it</i></b>
</dd></dl><br>(Note that the non-vowel itself must also be in <i>RV</i>.)
</dl>
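Step 2<i>a</i> might be sketched as below. The function name <tt>step2a</tt> and the convention of passing the start index of <i>RV</i> as <tt>rv</tt> are assumptions of this example; both the suffix and the non-vowel before it are required to lie in <i>RV</i>.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

I_SUFFIXES = (
    "îmes ît îtes i ie ies ir ira irai iraIent irais irait iras "
    "irent irez iriez irions irons iront is issaIent issais issait "
    "issant issante issantes issants isse issent isses issez issiez "
    "issions issons it"
).split()


def step2a(word: str, rv: int) -> str:
    """Delete the longest i-verb suffix in RV if a non-vowel in RV precedes it."""
    region = word[rv:]  # all tests are confined to RV
    matches = [s for s in I_SUFFIXES if region.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    before = len(word) - len(suffix) - 1  # index of the preceding letter
    if before >= rv and word[before] not in VOWELS:
        return word[: len(word) - len(suffix)]
    return word


print(step2a("finir", 2))  # RV is "nir"
print(step2a("agir", 3))   # RV is "r", so "ir" is not inside RV
```
<br><br>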
Do step 2<i>b</i> if step 2<i>a</i> was done, but failed to remove a suffix.
<br><br>
Step 2<i>b</i>: Other verb suffixes
<dl><dd>
Search for the longest among the following suffixes, and perform the
action indicated.
<br><br>
</dd><dl><dt><b><i>ions</i></b>
</dt><dd>delete if in <i>R</i>2
<br><br>
</dd><dt><b><i>é ée ées és èrent er era erai eraIent erais erait eras erez
eriez erions erons eront ez iez</i></b>
</dt><dd>delete
<br><br>
</dd><dt><b><i>âmes ât âtes a ai aIent ais ait ant ante antes ants as asse
assent asses assiez assions</i></b>
</dt><dd>delete
</dd><dd>if preceded by <b><i>e</i></b>, delete
</dd></dl><br>(Note that the <b><i>e</i></b> that may be deleted in this last step must also be in
<i>RV</i>.)
</dl>
If the last step to be obeyed - either step 1, 2<i>a</i> or 2<i>b</i> - altered the word,
do step 3.
<br><br>
Step 3
<dl><dd>
Replace final <b><i>Y</i></b> with <b><i>i</i></b> or final <b><i>ç</i></b> with <b><i>c</i></b>
</dd></dl>
Alternatively, if the last step to be obeyed did not alter the word, do
step 4
<br><br>
Step 4: Residual suffix
<dl><dd>
If the word ends <b><i>s</i></b>, not preceded by <b><i>a</i></b>, <b><i>i</i></b>, <b><i>o</i></b>, <b><i>u</i></b>, <b><i>è</i></b> or <b><i>s</i></b>, delete it.
<br><br>
In the rest of step 4, all tests are confined to the <i>RV</i> region.
<br><br>
Search for the longest among the following suffixes, and perform the
action indicated.
<br><br>
</dd><dl><dt><b><i>ion</i></b>
</dt><dd>delete if in <i>R</i>2 and preceded by <b><i>s</i></b> or <b><i>t</i></b>
<br><br>
</dd><dt><b><i>ier ière Ier Ière</i></b>
</dt><dd>replace with <b><i>i</i></b>
<br><br>
</dd><dt><b><i>e</i></b>
</dt><dd>delete
<br><br>
</dd><dt><b><i>ë</i></b>
</dt><dd>if preceded by <b><i>gu</i></b>, delete
</dd></dl><br>(So note that <b><i>ion</i></b> is removed only when it is in <i>R</i>2 - as well as being
in <i>RV</i> - and preceded by <b><i>s</i></b> or <b><i>t</i></b> which must be in <i>RV</i>.)
</dl>
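A possible reading of step 4 in Python is given below. The name <tt>step4</tt> is hypothetical, and <tt>rv</tt> and <tt>r2</tt> are assumed to be the start indices of <i>RV</i> and <i>R</i>2 computed before any suffix was removed.

```python
KEEP_WITH_S = "aiouès"
RESIDUAL = ("ion", "ier", "ière", "Ier", "Ière", "e", "ë")


def step4(word: str, rv: int, r2: int) -> str:
    """Residual suffix removal; rv and r2 are region start indices."""
    # Final s is dropped unless preceded by a, i, o, u, è or s.
    if len(word) >= 2 and word[-1] == "s" and word[-2] not in KEEP_WITH_S:
        word = word[:-1]
    region = word[rv:]  # the remaining tests are confined to RV
    matches = [s for s in RESIDUAL if region.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    stem = word[:start]
    if suffix == "ion":
        # in R2, and preceded by s or t which must itself be in RV
        if start >= r2 and start - 1 >= rv and stem[-1] in "st":
            return stem
    elif suffix in ("ier", "ière", "Ier", "Ière"):
        return stem + "i"
    elif suffix == "e":
        return stem
    elif suffix == "ë":
        if start - 2 >= rv and stem.endswith("gu"):
            return stem
    return word


for w, rv, r2 in (("maires", 2, 6), ("maïs", 2, 4), ("mais", 2, 4)):
    print(w, "->", step4(w, rv, r2))
```

The three words come from the sample vocabulary: <i>maires</i> loses both <b><i>s</i></b> and <b><i>e</i></b>, <i>maïs</i> loses only its <b><i>s</i></b>, and <i>mais</i> is left alone because its <b><i>s</i></b> follows <b><i>i</i></b>.
<br><br>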
Always do steps 5 and 6.
<br><br>
Step 5: Undouble
<dl><dd>
If the word ends <b><i>enn</i></b>, <b><i>onn</i></b>, <b><i>ett</i></b>, <b><i>ell</i></b> or <b><i>eill</i></b>, delete the last letter.
</dd></dl>
Step 6: Un-accent
<dl><dd>
If the word ends <b><i>é</i></b> or <b><i>è</i></b> followed by at least one non-vowel, remove
the accent from the <b><i>e</i></b>.
</dd></dl>
And finally:
<dl><dd>
Turn any remaining <b><i>I</i></b>, <b><i>U</i></b> and <b><i>Y</i></b> letters in the word back into lower case.
</dd></dl>
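Steps 5 and 6 and the final lower-casing are small enough to sketch together. The function names <tt>undouble</tt>, <tt>unaccent</tt> and <tt>unmark</tt> are made up for this example.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")


def undouble(word: str) -> str:
    """Step 5: drop the last letter of enn, onn, ett, ell or eill."""
    if word.endswith(("enn", "onn", "ett", "ell", "eill")):
        return word[:-1]
    return word


def unaccent(word: str) -> str:
    """Step 6: é or è followed by one or more final non-vowels becomes e."""
    i = len(word)
    while i > 0 and word[i - 1] not in VOWELS:  # skip the trailing non-vowels
        i -= 1
    if 0 < i < len(word) and word[i - 1] in "éè":
        return word[: i - 1] + "e" + word[i:]
    return word


def unmark(word: str) -> str:
    """Finally: turn the marker letters I, U and Y back into lower case."""
    return word.replace("I", "i").replace("U", "u").replace("Y", "y")


print(undouble("continuell"))
print(unaccent("célèbr"))
print(unmark("joU"))
```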
</td></tr>
<tr><td bgcolor="lightblue">
<br> <h2>The same algorithm in Snowball</h2>
<br><pre><dl><dd>
routines (
           prelude postlude mark_regions
           RV R1 R2
           standard_suffix
           i_verb_suffix
           verb_suffix
           residual_suffix
           un_double
           un_accent
)

externals ( stem )

integers ( pV p1 p2 )

groupings ( v keep_with_s )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a^   hex 'E2'  // a-circumflex
stringdef a`   hex 'E0'  // a-grave
stringdef c,   hex 'E7'  // c-cedilla
stringdef e"   hex 'EB'  // e-diaeresis (rare)
stringdef e'   hex 'E9'  // e-acute
stringdef e^   hex 'EA'  // e-circumflex
stringdef e`   hex 'E8'  // e-grave
stringdef i"   hex 'EF'  // i-diaeresis
stringdef i^   hex 'EE'  // i-circumflex
stringdef o^   hex 'F4'  // o-circumflex
stringdef u^   hex 'FB'  // u-circumflex
stringdef u`   hex 'F9'  // u-grave

define v 'aeiouy{a^}{a`}{e"}{e'}{e^}{e`}{i"}{i^}{o^}{u^}{u`}'

define prelude as repeat goto (

    (  v [ ('u' ] v <- 'U') or
           ('i' ] v <- 'I') or
           ('y' ] <- 'Y')
    )
    or
    (  ['y'] v <- 'Y' )
    or
    (  'q' ['u'] <- 'U' )
)

define mark_regions as (

    $pV = limit
    $p1 = limit
    $p2 = limit  // defaults

    do (
        ( v v next ) or ( next gopast v )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)

define postlude as repeat (

    [substring] among(
        'I' (<- 'i')
        'U' (<- 'u')
        'Y' (<- 'y')
        ''  (next)
    )
)

backwardmode (

    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define standard_suffix as (
        [substring] among(

            'ance' 'iqUe' 'isme' 'able' 'iste' 'eux'
            'ances' 'iqUes' 'ismes' 'ables' 'istes'
               ( R2 delete )
            'atrice' 'ateur' 'ation'
            'atrices' 'ateurs' 'ations'
               ( R2 delete
                 try ( ['ic'] (R2 delete) or <-'iqU' )
               )
            'logie'
            'logies'
               ( R2 <- 'log' )
            'usion' 'ution'
            'usions' 'utions'
               ( R2 <- 'u' )
            'ence'
            'ences'
               ( R2 <- 'ent' )
            'ement'
            'ements'
            (
                RV delete
                try (
                    [substring] among(
                        'iv'   (R2 delete ['at'] R2 delete)
                        'eus'  ((R2 delete) or (R1<-'eux'))
                        'abl' 'iqU'
                               (R2 delete)
                        'i{e`}r' 'I{e`}r'      //)
                               (RV <-'i')      //)--new 2 Sept 02
                    )
                )
            )
            'it{e'}'
            'it{e'}s'
            (
                R2 delete
                try (
                    [substring] among(
                        'abil' ((R2 delete) or <-'abl')
                        'ic'   ((R2 delete) or <-'iqU')
                        'iv'   (R2 delete)
                    )
                )
            )
            'if' 'ive'
            'ifs' 'ives'
            (
                R2 delete
                try ( ['at'] R2 delete ['ic'] (R2 delete) or <-'iqU' )
            )
            'eaux' (<- 'eau')
            'aux'  (R1 <- 'al')
            'euse'
            'euses'((R2 delete) or (R1<-'eux'))
            'issement'
            'issements'(R1 non-v delete) // verbal

            // fail(...) below forces entry to verb_suffix. -ment typically
            // follows the p.p., e.g 'confus{e'}ment'.

            'amment'   (RV fail(<- 'ant'))
            'emment'   (RV fail(<- 'ent'))
            'ment'
            'ments'    (test(v RV) fail(delete))
                       // v is e,i,u,{e'},I or U
        )
    )

    define i_verb_suffix as setlimit tomark pV for (
        [substring] among (
            '{i^}mes' '{i^}t' '{i^}tes' 'i' 'ie' 'ies' 'ir' 'ira' 'irai'
            'iraIent' 'irais' 'irait' 'iras' 'irent' 'irez' 'iriez'
            'irions' 'irons' 'iront' 'is' 'issaIent' 'issais' 'issait'
            'issant' 'issante' 'issantes' 'issants' 'isse' 'issent' 'isses'
            'issez' 'issiez' 'issions' 'issons' 'it'
                (non-v delete)
        )
    )

    define verb_suffix as setlimit tomark pV for (
        [substring] among (

            'ions'
                (R2 delete)

            '{e'}' '{e'}e' '{e'}es' '{e'}s' '{e`}rent' 'er' 'era' 'erai'
            'eraIent' 'erais' 'erait' 'eras' 'erez' 'eriez' 'erions'
            'erons' 'eront' 'ez' 'iez'

            // 'ons' //-best omitted

                (delete)

            '{a^}mes' '{a^}t' '{a^}tes' 'a' 'ai' 'aIent' 'ais' 'ait' 'ant'
            'ante' 'antes' 'ants' 'as' 'asse' 'assent' 'asses' 'assiez'
            'assions'
                (delete
                 try(['e'] delete)
                )
        )
    )

    define keep_with_s 'aiou{e`}s'

    define residual_suffix as (
        try(['s'] test non-keep_with_s delete)
        setlimit tomark pV for (
            [substring] among(
                'ion'           (R2 's' or 't' delete)
                'ier' 'i{e`}re'
                'Ier' 'I{e`}re' (<-'i')
                'e'             (delete)
                '{e"}'          ('gu' delete)
            )
        )
    )

    define un_double as (
        test among('enn' 'onn' 'ett' 'ell' 'eill') [next] delete
    )

    define un_accent as (
        atleast 1 non-v
        [ '{e'}' or '{e`}' ] <-'e'
    )
)

define stem as (

    do prelude
    do mark_regions
    backwards (

        do (
            (
                 ( standard_suffix or
                   i_verb_suffix or
                   verb_suffix
                 )
                 and
                 try( [ ('Y'   ] <- 'i' ) or
                        ('{c,}'] <- 'c' )
                 )
            ) or
            residual_suffix
        )

        // try(['ent'] RV delete) // is best omitted

        do un_double
        do un_accent
    )
    do postlude
)
</dd></dl>
</pre>
</td></tr></tbody></table>
</body></html>