<html><head>
<title>French stemming algorithm</title></head>
<body bgcolor="WHITE">
<h1 align="center">French stemming algorithm</h1>
<table width="75%" align="center" cols="1">
<tbody><tr><td>
<br> <h2>Links to resources</h2>
<dl><dd><table cellpadding="0">
<tbody><tr><td><a href="http://snowball.tartarus.org/"> Snowball main page</a>
</td></tr><tr><td><a href="http://snowball.tartarus.org/french/stem.sbl"> The stemmer in Snowball</a>
</td></tr><tr><td><a href="http://snowball.tartarus.org/french/stem.c"> The ANSI C stemmer</a>
</td></tr><tr><td><a href="http://snowball.tartarus.org/french/stem.h"> - and its header</a>
</td></tr><tr><td><a href="http://snowball.tartarus.org/french/voc.txt"> Sample French vocabulary</a>
</td></tr><tr><td><a href="http://snowball.tartarus.org/french/output.txt"> Its stemmed equivalent</a>
</td></tr><tr><td><a href="http://snowball.tartarus.org/french/diffs.txt"> Vocabulary + stemmed equivalent in two columns</a>
</td></tr><tr><td><a href="http://snowball.tartarus.org/french/tarball.tgz"> Tar-gzipped file of all of the above</a>
<br><br>
</td></tr><tr><td><a href="http://snowball.tartarus.org/french/stop.txt"> French stop word list</a>
</td></tr></tbody></table></dd></dl>
<dl><dd><table cellpadding="0">
<tbody><tr><td><a href="http://snowball.tartarus.org/french/stem-MS-DOS-Latin-I.sbl"> The stemmer in Snowball - MS DOS Latin I encodings</a>
</td></tr></tbody></table></dd></dl>
<dl><dd><table cellpadding="0">
<tbody><tr><td>Romance language stemmers
</td></tr></tbody></table></dd></dl>
</td></tr>
<tr><td bgcolor="lightpink">
<br><br>
Here is a sample of French vocabulary, with the stemmed forms that will
be generated with this algorithm.
<br><br>
<dl><dd><table cellpadding="0">
<tbody><tr><td> <b>word</b> </td>
<td></td><td> </td>
<td></td><td> <b>stem</b> </td>
<td></td><td>        </td>
<td></td><td> <b>word</b> </td>
<td></td><td> </td>
<td></td><td> <b>stem</b> </td>
</tr>
<tr><td>
continu<br>
continua<br>
continuait<br>
continuant<br>
continuation<br>
continue<br>
continué<br>
continuel<br>
continuelle<br>
continuellement<br>
continuelles<br>
continuels<br>
continuer<br>
continuera<br>
continuerait<br>
continueront<br>
continuez<br>
continuité<br>
continuons<br>
contorsions<br>
contour<br>
contournait<br>
contournant<br>
contourne<br>
contours<br>
contractait<br>
contracté<br>
contractée<br>
contracter<br>
contractés<br>
contractions<br>
contradictoirement<br>
contradictoires<br>
contraindre<br>
contraint<br>
contrainte<br>
contraintes<br>
contraire<br>
contraires<br>
contraria<br>
</td>
<td></td><td>  <tt><b> =&gt; </b></tt>  </td>
<td></td><td>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continuel<br>
continuel<br>
continuel<br>
continuel<br>
continuel<br>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continu<br>
continuon<br>
contors<br>
contour<br>
contourn<br>
contourn<br>
contourn<br>
contour<br>
contract<br>
contract<br>
contract<br>
contract<br>
contract<br>
contract<br>
contradictoir<br>
contradictoir<br>
contraindr<br>
contraint<br>
contraint<br>
contraint<br>
contrair<br>
contrair<br>
contrari<br>
</td>
<td></td><td> </td>
<td></td><td>
main<br>
mains<br>
maintenaient<br>
maintenait<br>
maintenant<br>
maintenir<br>
maintenue<br>
maintien<br>
maintint<br>
maire<br>
maires<br>
mairie<br>
mais<br>
maïs<br>
maison<br>
maisons<br>
maistre<br>
maitre<br>
maître<br>
maîtres<br>
maîtresse<br>
maîtresses<br>
majesté<br>
majestueuse<br>
majestueusement<br>
majestueux<br>
majeur<br>
majeure<br>
major<br>
majordome<br>
majordomes<br>
majorité<br>
majorités<br>
mal<br>
malacca<br>
malade<br>
malades<br>
maladie<br>
maladies<br>
maladive<br>
</td>
<td></td><td>  <tt><b> =&gt; </b></tt>  </td>
<td></td><td>
main<br>
main<br>
mainten<br>
mainten<br>
mainten<br>
mainten<br>
maintenu<br>
maintien<br>
maintint<br>
mair<br>
mair<br>
mair<br>
mais<br>
maï<br>
maison<br>
maison<br>
maistr<br>
maitr<br>
maîtr<br>
maîtr<br>
maîtress<br>
maîtress<br>
majest<br>
majestu<br>
majestu<br>
majestu<br>
majeur<br>
majeur<br>
major<br>
majordom<br>
majordom<br>
major<br>
major<br>
mal<br>
malacc<br>
malad<br>
malad<br>
malad<br>
malad<br>
malad<br>
</td>
</tr>
</tbody></table></dd></dl>
</td></tr>
<tr><td>
<br><br>
<br> <h2>The stemming algorithm</h2>
Letters in French include the following accented forms,
<dl><dd>
<b><i>â   à   ç   ë   é   ê   è   ï   î   ô   û   ù</i></b>
</dd></dl>
The following letters are vowels:
<dl><dd>
<b><i>a   e   i   o   u   y   â   à   ë   é   ê   è   ï   î   ô   û   ù</i></b>
</dd></dl>
Assume the word is in lower case. Then put into upper case <b><i>u</i></b> or <b><i>i</i></b> preceded
and followed by a vowel, and <b><i>y</i></b> preceded or followed by a vowel. <b><i>u</i></b> after <b><i>q</i></b> is
also put into upper case. For example,
<dl><dd><table cellpadding="0">
<tbody><tr><td> jouer </td><td></td><td> <tt>-&gt;</tt> </td><td></td><td> joUer
</td></tr><tr><td> ennuie </td><td></td><td> <tt>-&gt;</tt> </td><td></td><td> ennuIe
</td></tr><tr><td> yeux </td><td></td><td> <tt>-&gt;</tt> </td><td></td><td> Yeux
</td></tr><tr><td> quand </td><td></td><td> <tt>-&gt;</tt> </td><td></td><td> qUand
</td></tr></tbody></table></dd></dl>
(The upper case forms are not then classed as vowels - see <a href="http://snowball.tartarus.org/texts/vowelmarking.html"> note</a> on vowel
marking.)
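<br><br>
The marking rule can be sketched in Python as follows. This is only an illustrative sketch: the names <tt>VOWELS</tt> and <tt>mark_vowels</tt> are assumptions, not part of the Snowball distribution, and every letter is tested against the original lower-case word. The Snowball prelude scans from left to right, so results may differ on longer runs of consecutive vowels.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def mark_vowels(word: str) -> str:
    """Upper-case u, i and y where the text says they act as consonants."""
    chars = list(word)
    n = len(chars)
    out = chars[:]
    for i, ch in enumerate(chars):
        prev_v = i > 0 and chars[i - 1] in VOWELS
        next_v = n > i + 1 and chars[i + 1] in VOWELS
        if ch in "ui" and prev_v and next_v:
            out[i] = ch.upper()          # u or i between two vowels
        elif ch == "y" and (prev_v or next_v):
            out[i] = "Y"                 # y next to a vowel
        elif ch == "u" and i > 0 and chars[i - 1] == "q":
            out[i] = "U"                 # u after q
    return "".join(out)

for w in ("jouer", "ennuie", "yeux", "quand"):
    print(w, "->", mark_vowels(w))
```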
<br><br>
If the word begins with two vowels, <i>RV</i> is the region after the third
letter, otherwise the region after the first vowel not at the beginning of
the word, or the end of the word if these positions cannot be found.
<br><br>
For example,
<br><pre>    a i m e r     a d o r e r     v o l e r
         |...|         |.....|       |.....|
</pre>
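<br><br>
A minimal Python sketch of the <i>RV</i> definition, assuming the word has already been through vowel marking (so upper-case letters are not vowels). The function name <tt>rv_start</tt> is illustrative; it returns the offset at which <i>RV</i> begins.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word: str) -> int:
    """Offset where RV begins, or len(word) if no position is found."""
    n = len(word)
    if n >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        # Word begins with two vowels: RV is the region after the third letter.
        return 3 if n >= 3 else n
    for i in range(1, n):
        if word[i] in VOWELS:
            # First vowel not at the beginning of the word.
            return i + 1
    return n

for w in ("aimer", "adorer", "voler"):
    print(w, "RV =", w[rv_start(w):])
```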
<i>R</i>1 is the region after the first non-vowel following a vowel, or the end of
the word if there is no such non-vowel.
<i>R</i>2 is the region after the first non-vowel following a vowel in <i>R</i>1, or the
end of the word if there is no such non-vowel.
(See <a href="http://snowball.tartarus.org/texts/r1r2.html"> note</a> on <i>R</i>1 and <i>R</i>2.)
<br><br>
For example:
<br><pre>    f a m e u s e m e n t
         |......R1.......|
               |...R2....|
</pre>
Note that <i>R</i>1 can contain <i>RV</i> (<i>adorer</i>), and <i>RV</i> can contain <i>R</i>1 (<i>voler</i>).
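<br><br>
The two regions can be computed with one helper applied twice. The sketch below is an assumption-laden illustration (the names <tt>r1_r2</tt> and <tt>past_vowel_nonvowel</tt> are invented here); it returns start offsets rather than substrings.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def r1_r2(word: str):
    """Return the start offsets of R1 and R2 for a vowel-marked word."""
    def past_vowel_nonvowel(start: int) -> int:
        # Position just after the first non-vowel that follows a vowel,
        # looking only at letters from `start` onwards.
        for i in range(start + 1, len(word)):
            if word[i] not in VOWELS and word[i - 1] in VOWELS:
                return i + 1
        return len(word)

    r1 = past_vowel_nonvowel(0)
    r2 = past_vowel_nonvowel(r1)
    return r1, r2

r1, r2 = r1_r2("fameusement")
print("R1 =", "fameusement"[r1:], " R2 =", "fameusement"[r2:])
```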
<br><br>
Below, &#8216;delete if in <i>R</i>2&#8217; means that a found suffix should be removed if it
lies entirely in <i>R</i>2, but not if it overlaps <i>R</i>2 and the rest of the word.
&#8216;delete if in <i>R</i>1 and preceded by <i>X</i>&#8217; means that <i>X</i> itself does not have to
come in <i>R</i>1, while &#8216;delete if preceded by <i>X</i> in <i>R</i>1&#8217; means that <i>X</i>, like the
suffix, must be entirely in <i>R</i>1.
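<br><br>
These two conventions can be expressed as small predicates. The helper names below are hypothetical, and the offsets 3 and 6 are the <i>R</i>1 and <i>R</i>2 starting positions of <i>fameusement</i> from the example above.

```python
def suffix_in_region(word: str, suffix: str, region_start: int) -> bool:
    """True if the word ends with suffix and the suffix lies entirely in the region."""
    return word.endswith(suffix) and len(word) - len(suffix) >= region_start

def preceded_by(word: str, suffix: str, x: str, region_start: int = 0) -> bool:
    """True if x immediately precedes the suffix and x begins at or after region_start.

    With region_start left at 0, x may lie anywhere ("in R1 and preceded by X").
    Passing a region offset requires x itself to be inside that region
    ("preceded by X in R1").
    """
    stem = word[:len(word) - len(suffix)]
    return len(stem) - len(x) >= region_start and stem.endswith(x)

word, r1, r2 = "fameusement", 3, 6
print(suffix_in_region(word, "ement", r2))       # suffix wholly inside R2
print(suffix_in_region(word, "eusement", r2))    # overlaps R2 and the rest of the word
print(preceded_by(word, "ement", "eus"))         # eus need not be in any region
print(preceded_by(word, "ement", "eus", r2))     # eus would have to be in R2
```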
<br><br>
Start with step 1
<br><br>
Step 1: Standard suffix removal
<dl><dd>
Search for the longest among the following suffixes, and perform the
action indicated.
<br><br>
</dd><dl><dt><b><i>ance   iqUe   isme   able   iste   eux   ances   iqUes   ismes   ables   istes</i></b>
</dt><dd>delete if in <i>R</i>2
<br><br>
</dd><dt><b><i>atrice   ateur   ation   atrices   ateurs   ations</i></b>
</dt><dd>delete if in <i>R</i>2
</dd><dd>if preceded by <b><i>ic</i></b>, delete if in <i>R</i>2, else replace by <b><i>iqU</i></b>
<br><br>
</dd><dt><b><i>logie   logies</i></b>
</dt><dd>replace with <b><i>log</i></b> if in <i>R</i>2
<br><br>
</dd><dt><b><i>usion   ution   usions   utions</i></b>
</dt><dd>replace with <b><i>u</i></b> if in <i>R</i>2
<br><br>
</dd><dt><b><i>ence   ences</i></b>
</dt><dd>replace with <b><i>ent</i></b> if in <i>R</i>2
<br><br>
</dd><dt><b><i>ement   ements</i></b>
</dt><dd>delete if in <i>RV</i>
</dd><dd>if preceded by <b><i>iv</i></b>, delete if in <i>R</i>2 (and if further preceded by <b><i>at</i></b>,
delete if in <i>R</i>2), otherwise,
</dd><dd>if preceded by <b><i>eus</i></b>, delete if in <i>R</i>2, else replace by <b><i>eux</i></b>
if in <i>R</i>1, otherwise,
</dd><dd>if preceded by <b><i>abl</i></b> or <b><i>iqU</i></b>, delete if in <i>R</i>2, otherwise,
</dd><dd>if preceded by <b><i>ièr</i></b> or <b><i>Ièr</i></b>, replace by <b><i>i</i></b> if in <i>RV</i>
<br><br>
</dd><dt><b><i>ité   ités</i></b>
</dt><dd>delete if in <i>R</i>2
</dd><dd>if preceded by <b><i>abil</i></b>, delete if in <i>R</i>2, else replace by <b><i>abl</i></b>,
otherwise,
</dd><dd>if preceded by <b><i>ic</i></b>, delete if in <i>R</i>2, else replace by <b><i>iqU</i></b>, otherwise,
</dd><dd>if preceded by <b><i>iv</i></b>, delete if in <i>R</i>2
<br><br>
</dd><dt><b><i>if   ive   ifs   ives</i></b>
</dt><dd>delete if in <i>R</i>2
</dd><dd>if preceded by <b><i>at</i></b>, delete if in <i>R</i>2 (and if further preceded by <b><i>ic</i></b>,
delete if in <i>R</i>2, else replace by <b><i>iqU</i></b>)
<br><br>
</dd><dt><b><i>eaux</i></b>
</dt><dd>replace with <b><i>eau</i></b>
<br><br>
</dd><dt><b><i>aux</i></b>
</dt><dd>replace with <b><i>al</i></b> if in <i>R</i>1
<br><br>
</dd><dt><b><i>euse   euses</i></b>
</dt><dd>delete if in <i>R</i>2, else replace by <b><i>eux</i></b> if in <i>R</i>1
<br><br>
</dd><dt><b><i>issement   issements</i></b>
</dt><dd>delete if in <i>R</i>1 and preceded by a non-vowel
<br><br>
</dd><dt><b><i>amment</i></b>
</dt><dd>replace with <b><i>ant</i></b> if in <i>RV</i>
<br><br>
</dd><dt><b><i>emment</i></b>
</dt><dd>replace with <b><i>ent</i></b> if in <i>RV</i>
<br><br>
</dd><dt><b><i>ment   ments</i></b>
</dt><dd>delete if preceded by a vowel in <i>RV</i>
</dd></dl></dl>
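<br><br>
As an illustration of the longest-suffix search, here is a Python sketch covering only those step 1 rules that have no nested condition. It is deliberately partial: the table <tt>RULES</tt> and the function names are assumptions made for this example, and because the suffix table is incomplete the result can differ from the full algorithm for words ending in one of the omitted suffixes.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def regions(word):
    """Return (r1, r2) start offsets as defined in the text."""
    def past_vowel_nonvowel(start):
        for i in range(start + 1, len(word)):
            if word[i] not in VOWELS and word[i - 1] in VOWELS:
                return i + 1
        return len(word)
    r1 = past_vowel_nonvowel(0)
    return r1, past_vowel_nonvowel(r1)

# Each rule: (suffixes, actions). Actions are tried in order; the first whose
# region contains the whole suffix supplies the replacement.
RULES = [
    (("ance", "iqUe", "isme", "able", "iste", "eux",
      "ances", "iqUes", "ismes", "ables", "istes"), (("R2", ""),)),
    (("logie", "logies"), (("R2", "log"),)),
    (("usion", "ution", "usions", "utions"), (("R2", "u"),)),
    (("ence", "ences"), (("R2", "ent"),)),
    (("euse", "euses"), (("R2", ""), ("R1", "eux"))),
    (("eaux",), (("ANY", "eau"),)),
    (("aux",), (("R1", "al"),)),
]

def step1_subset(word):
    """Apply the subset of step 1 rules listed in RULES to a marked word."""
    r1, r2 = regions(word)
    limits = {"R2": r2, "R1": r1, "ANY": 0}
    best = None
    for suffixes, actions in RULES:
        for s in suffixes:
            if word.endswith(s) and (best is None or len(s) > len(best[0])):
                best = (s, actions)        # keep the longest matching suffix
    if best is None:
        return word
    suffix, actions = best
    start = len(word) - len(suffix)
    for region, replacement in actions:
        if start >= limits[region]:
            return word[:start] + replacement
    return word

for w in ("majestueuse", "heureuse", "chevaux", "bateaux"):
    print(w, "->", step1_subset(w))
```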
In steps 2<i>a</i> and 2<i>b</i> all tests are confined to the <i>RV</i> region.
<br><br>
Do step 2<i>a</i> if either no ending was removed by step 1, or if one of the endings
<b><i>amment</i></b>, <b><i>emment</i></b>, <b><i>ment</i></b>, <b><i>ments</i></b> was found.
<br><br>
Step 2<i>a</i>: Verb suffixes beginning <b><i>i</i></b>
<dl><dd>
Search for the longest among the following suffixes and if found,
delete if preceded by a non-vowel.
<br><br>
</dd><dl><dd>
<b><i>îmes   ît   îtes   i   ie   ies   ir   ira   irai   iraIent   irais   irait   iras
  irent   irez   iriez   irions   irons   iront   is   issaIent   issais   issait
  issant   issante   issantes   issants   isse   issent   isses   issez   issiez
  issions   issons   it</i></b>
</dd></dl><br>(Note that the non-vowel itself must also be in <i>RV</i>.)
</dl>
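<br><br>
Step 2<i>a</i> might be sketched in Python like this. The function takes the marked word and the <i>RV</i> offset as arguments (both assumed to be computed elsewhere) and returns the new word together with a flag saying whether a suffix was removed. The names are illustrative only.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

I_SUFFIXES = (
    "îmes", "ît", "îtes", "i", "ie", "ies", "ir", "ira", "irai", "iraIent",
    "irais", "irait", "iras", "irent", "irez", "iriez", "irions", "irons",
    "iront", "is", "issaIent", "issais", "issait", "issant", "issante",
    "issantes", "issants", "isse", "issent", "isses", "issez", "issiez",
    "issions", "issons", "it",
)

def step2a(word: str, rv: int):
    """Remove the longest i-verb suffix found in RV if a non-vowel in RV precedes it."""
    region = word[rv:]
    for suffix in sorted(I_SUFFIXES, key=len, reverse=True):
        if region.endswith(suffix):
            start = len(word) - len(suffix)
            # The preceding non-vowel must itself lie in RV.
            if start - 1 >= rv and word[start - 1] not in VOWELS:
                return word[:start], True
            return word, False   # longest match found, but condition failed
    return word, False

print(step2a("maintenir", 2))
print(step2a("finissait", 2))
```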
Do step 2<i>b</i> if step 2<i>a</i> was done, but failed to remove a suffix.
<br><br>
Step 2<i>b</i>: Other verb suffixes
<dl><dd>
Search for the longest among the following suffixes, and perform the
action indicated.
<br><br>
</dd><dl><dt><b><i>ions</i></b>
</dt><dd>delete if in <i>R</i>2
<br><br>
</dd><dt><b><i>é   ée   ées   és   èrent   er   era   erai   eraIent   erais   erait   eras   erez
  eriez   erions   erons   eront   ez   iez</i></b>
</dt><dd>delete
<br><br>
</dd><dt><b><i>âmes   ât   âtes   a   ai   aIent   ais   ait   ant   ante   antes   ants   as   asse
  assent   asses   assiez   assions</i></b>
</dt><dd>delete
</dd><dd>if preceded by <b><i>e</i></b>, delete
</dd></dl><br>(Note that the <b><i>e</i></b> that may be deleted in this last step must also be in
<i>RV</i>.)
</dl>
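<br><br>
A comparable sketch for step 2<i>b</i>, again assuming that the <i>RV</i> and <i>R</i>2 offsets are supplied by the caller. The grouping labels (<tt>"r2"</tt>, <tt>"plain"</tt>, <tt>"a"</tt>) and the function name are invented for this example.

```python
IONS = ("ions",)
E_SUFFIXES = (
    "é", "ée", "ées", "és", "èrent", "er", "era", "erai", "eraIent", "erais",
    "erait", "eras", "erez", "eriez", "erions", "erons", "eront", "ez", "iez",
)
A_SUFFIXES = (
    "âmes", "ât", "âtes", "a", "ai", "aIent", "ais", "ait", "ant", "ante",
    "antes", "ants", "as", "asse", "assent", "asses", "assiez", "assions",
)

def step2b(word: str, rv: int, r2: int):
    """Apply the 'other verb suffixes' rules; return (new_word, removed?)."""
    candidates = ([(s, "r2") for s in IONS]
                  + [(s, "plain") for s in E_SUFFIXES]
                  + [(s, "a") for s in A_SUFFIXES])
    candidates.sort(key=lambda c: len(c[0]), reverse=True)  # longest first
    region = word[rv:]
    for suffix, kind in candidates:
        if not region.endswith(suffix):
            continue
        start = len(word) - len(suffix)
        if kind == "r2":
            # 'ions' is deleted only when it lies in R2.
            return (word[:start], True) if start >= r2 else (word, False)
        stem = word[:start]
        if kind == "a" and start - 1 >= rv and stem.endswith("e"):
            stem = stem[:-1]      # also drop a preceding e that lies in RV
        return stem, True
    return word, False

print(step2b("continuait", 2, 6))
print(step2b("contractions", 2, 7))
print(step2b("mangeait", 2, 8))
```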
If the last step to be obeyed - either step 1, 2<i>a</i> or 2<i>b</i> - altered the word,
do step 3
<br><br>
Step 3
<dl><dd>
Replace final <b><i>Y</i></b> with <b><i>i</i></b> or final <b><i>ç</i></b> with <b><i>c</i></b>
</dd></dl>
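<br><br>
Step 3 is a single-character replacement at the end of the word. A short sketch, with an illustrative function name:

```python
def step3(word: str) -> str:
    """Replace a final Y with i, or a final ç with c."""
    if word.endswith("Y"):
        return word[:-1] + "i"
    if word.endswith("ç"):
        return word[:-1] + "c"
    return word

# 'paY' is what remains of 'payé' after vowel marking and step 2b.
print(step3("paY"))
```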
Alternatively, if the last step to be obeyed did not alter the word, do
step 4
<br><br>
Step 4: Residual suffix
<dl><dd>
If the word ends <b><i>s</i></b>, not preceded by <b><i>a</i></b>, <b><i>i</i></b>, <b><i>o</i></b>, <b><i>u</i></b>, <b><i>è</i></b> or <b><i>s</i></b>, delete it.
<br><br>
In the rest of step 4, all tests are confined to the <i>RV</i> region.
<br><br>
Search for the longest among the following suffixes, and perform the
action indicated.
<br><br>
</dd><dl><dt><b><i>ion</i></b>
</dt><dd>delete if in <i>R</i>2 and preceded by <b><i>s</i></b> or <b><i>t</i></b>
<br><br>
</dd><dt><b><i>ier   ière   Ier   Ière</i></b>
</dt><dd>replace with <b><i>i</i></b>
<br><br>
</dd><dt><b><i>e</i></b>
</dt><dd>delete
<br><br>
</dd><dt><b><i>ë</i></b>
</dt><dd>if preceded by <b><i>gu</i></b>, delete
</dd></dl><br>(So note that <b><i>ion</i></b> is removed only when it is in <i>R</i>2 - as well as being
in <i>RV</i> - and preceded by <b><i>s</i></b> or <b><i>t</i></b> which must be in <i>RV</i>.)
</dl>
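<br><br>
Step 4 can be sketched as below. The caller is assumed to supply the <i>RV</i> and <i>R</i>2 offsets computed before any suffix was removed; the function name is hypothetical.

```python
def step4(word: str, rv: int, r2: int) -> str:
    """Residual suffix removal for a word left unaltered by steps 1, 2a and 2b."""
    # Drop a final s unless the letter before it is a, i, o, u, è or s.
    if word.endswith("s") and len(word) >= 2 and word[-2] not in "aiouès":
        word = word[:-1]

    region = word[rv:]
    # Candidates are listed longest first, so the first hit is the longest match.
    for suffix in ("ière", "Ière", "ion", "ier", "Ier", "e", "ë"):
        if not region.endswith(suffix):
            continue
        start = len(word) - len(suffix)
        if suffix == "ion":
            # Must be in R2 and preceded by s or t, which must be in RV.
            if start >= r2 and start - 1 >= rv and word[start - 1] in "st":
                word = word[:start]
        elif suffix in ("ier", "ière", "Ier", "Ière"):
            word = word[:start] + "i"
        elif suffix == "e":
            word = word[:start]
        else:
            # ë is deleted only when preceded by gu inside RV.
            if start - 2 >= rv and word[start - 2:start] == "gu":
                word = word[:start]
        break
    return word

print(step4("contraires", 2, 8))
print(step4("contraction", 2, 7))
```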
Always do steps 5 and 6.
<br><br>
Step 5: Undouble
<dl><dd>
If the word ends <b><i>enn</i></b>, <b><i>onn</i></b>, <b><i>ett</i></b>, <b><i>ell</i></b> or <b><i>eill</i></b>, delete the last letter
</dd></dl>
Step 6: Un-accent
<dl><dd>
If the word ends <b><i>é</i></b> or <b><i>è</i></b> followed by at least one non-vowel, remove
the accent from the <b><i>e</i></b>.
</dd></dl>
And finally:
<dl><dd>
Turn any remaining <b><i>I</i></b>, <b><i>U</i></b> and <b><i>Y</i></b> letters in the word back into lower case.
</dd></dl>
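<br><br>
Steps 5 and 6 and the final lower-casing are simple enough to sketch together. The function names are illustrative assumptions; note that any remaining upper-case <i>I</i>, <i>U</i> or <i>Y</i> still counts as a non-vowel when step 6 runs.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def undouble(word: str) -> str:
    """Step 5: drop the last letter of a final enn, onn, ett, ell or eill."""
    for ending in ("enn", "onn", "ett", "ell", "eill"):
        if word.endswith(ending):
            return word[:-1]
    return word

def unaccent(word: str) -> str:
    """Step 6: é or è followed by one or more final non-vowels becomes e."""
    i = len(word)
    while i > 0 and word[i - 1] not in VOWELS:
        i -= 1                      # skip the trailing run of non-vowels
    if len(word) > i and i > 0 and word[i - 1] in "éè":
        return word[:i - 1] + "e" + word[i:]
    return word

def restore_case(word: str) -> str:
    """Turn the marker letters I, U and Y back into lower case."""
    return word.replace("I", "i").replace("U", "u").replace("Y", "y")

def finish(word: str) -> str:
    return restore_case(unaccent(undouble(word)))

for w in ("continuell", "collèg", "joU"):
    print(w, "->", finish(w))
```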
</td></tr>
<tr><td bgcolor="lightblue">
<br> <h2>The same algorithm in Snowball</h2>
<br><pre><dl><dd>
routines (
    prelude postlude mark_regions
    RV R1 R2
    standard_suffix
    i_verb_suffix
    verb_suffix
    residual_suffix
    un_double
    un_accent
)

externals ( stem )

integers ( pV p1 p2 )

groupings ( v keep_with_s )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a^ hex 'E2' // a-circumflex
stringdef a` hex 'E0' // a-grave
stringdef c, hex 'E7' // c-cedilla
stringdef e" hex 'EB' // e-diaeresis (rare)
stringdef e' hex 'E9' // e-acute
stringdef e^ hex 'EA' // e-circumflex
stringdef e` hex 'E8' // e-grave
stringdef i" hex 'EF' // i-diaeresis
stringdef i^ hex 'EE' // i-circumflex
stringdef o^ hex 'F4' // o-circumflex
stringdef u^ hex 'FB' // u-circumflex
stringdef u` hex 'F9' // u-grave

define v 'aeiouy{a^}{a`}{e"}{e'}{e^}{e`}{i"}{i^}{o^}{u^}{u`}'

define prelude as repeat goto (
    ( v [ ('u' ] v &lt;- 'U') or
          ('i' ] v &lt;- 'I') or
          ('y' ] &lt;- 'Y')
    )
    or
    ( ['y'] v &lt;- 'Y' )
    or
    ( 'q' ['u'] &lt;- 'U' )
)

define mark_regions as (
    $pV = limit
    $p1 = limit
    $p2 = limit // defaults
    do (
        ( v v next ) or ( next gopast v )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)

define postlude as repeat (
    [substring] among(
        'I' (&lt;- 'i')
        'U' (&lt;- 'u')
        'Y' (&lt;- 'y')
        '' (next)
    )
)

backwardmode (

    define RV as $pV &lt;= cursor
    define R1 as $p1 &lt;= cursor
    define R2 as $p2 &lt;= cursor

    define standard_suffix as (
        [substring] among(
            'ance' 'iqUe' 'isme' 'able' 'iste' 'eux'
            'ances' 'iqUes' 'ismes' 'ables' 'istes'
                ( R2 delete )
            'atrice' 'ateur' 'ation'
            'atrices' 'ateurs' 'ations'
                ( R2 delete
                  try ( ['ic'] (R2 delete) or &lt;-'iqU' )
                )
            'logie'
            'logies'
                ( R2 &lt;- 'log' )
            'usion' 'ution'
            'usions' 'utions'
                ( R2 &lt;- 'u' )
            'ence'
            'ences'
                ( R2 &lt;- 'ent' )
            'ement'
            'ements'
                (
                    RV delete
                    try (
                        [substring] among(
                            'iv' (R2 delete ['at'] R2 delete)
                            'eus' ((R2 delete) or (R1&lt;-'eux'))
                            'abl' 'iqU'
                                (R2 delete)
                            'i{e`}r' 'I{e`}r' //)
                                (RV &lt;-'i') //)--new 2 Sept 02
                        )
                    )
                )
            'it{e'}'
            'it{e'}s'
                (
                    R2 delete
                    try (
                        [substring] among(
                            'abil' ((R2 delete) or &lt;-'abl')
                            'ic' ((R2 delete) or &lt;-'iqU')
                            'iv' (R2 delete)
                        )
                    )
                )
            'if' 'ive'
            'ifs' 'ives'
                (
                    R2 delete
                    try ( ['at'] R2 delete ['ic'] (R2 delete) or &lt;-'iqU' )
                )
            'eaux' (&lt;- 'eau')
            'aux' (R1 &lt;- 'al')
            'euse'
            'euses'((R2 delete) or (R1&lt;-'eux'))
            'issement'
            'issements'(R1 non-v delete) // verbal
            // fail(...) below forces entry to verb_suffix. -ment typically
            // follows the p.p., e.g 'confus{e'}ment'.
            'amment' (RV fail(&lt;- 'ant'))
            'emment' (RV fail(&lt;- 'ent'))
            'ment'
            'ments' (test(v RV) fail(delete))
                // v is e,i,u,{e'},I or U
        )
    )

    define i_verb_suffix as setlimit tomark pV for (
        [substring] among (
            '{i^}mes' '{i^}t' '{i^}tes' 'i' 'ie' 'ies' 'ir' 'ira' 'irai'
            'iraIent' 'irais' 'irait' 'iras' 'irent' 'irez' 'iriez'
            'irions' 'irons' 'iront' 'is' 'issaIent' 'issais' 'issait'
            'issant' 'issante' 'issantes' 'issants' 'isse' 'issent' 'isses'
            'issez' 'issiez' 'issions' 'issons' 'it'
                (non-v delete)
        )
    )

    define verb_suffix as setlimit tomark pV for (
        [substring] among (
            'ions'
                (R2 delete)
            '{e'}' '{e'}e' '{e'}es' '{e'}s' '{e`}rent' 'er' 'era' 'erai'
            'eraIent' 'erais' 'erait' 'eras' 'erez' 'eriez' 'erions'
            'erons' 'eront' 'ez' 'iez'
            // 'ons' //-best omitted
                (delete)
            '{a^}mes' '{a^}t' '{a^}tes' 'a' 'ai' 'aIent' 'ais' 'ait' 'ant'
            'ante' 'antes' 'ants' 'as' 'asse' 'assent' 'asses' 'assiez'
            'assions'
                (delete
                 try(['e'] delete)
                )
        )
    )

    define keep_with_s 'aiou{e`}s'

    define residual_suffix as (
        try(['s'] test non-keep_with_s delete)
        setlimit tomark pV for (
            [substring] among(
                'ion' (R2 's' or 't' delete)
                'ier' 'i{e`}re'
                'Ier' 'I{e`}re' (&lt;-'i')
                'e' (delete)
                '{e"}' ('gu' delete)
            )
        )
    )

    define un_double as (
        test among('enn' 'onn' 'ett' 'ell' 'eill') [next] delete
    )

    define un_accent as (
        atleast 1 non-v
        [ '{e'}' or '{e`}' ] &lt;-'e'
    )
)

define stem as (
    do prelude
    do mark_regions
    backwards (
        do (
            (
                ( standard_suffix or
                  i_verb_suffix or
                  verb_suffix
                )
                and
                try( [ ('Y' ] &lt;- 'i' ) or
                       ('{c,}'] &lt;- 'c' )
                )
            ) or
            residual_suffix
        )
        // try(['ent'] RV delete) // is best omitted
        do un_double
        do un_accent
    )
    do postlude
)
</dd></dl>
</pre>
</td></tr></tbody></table>
</body></html>