Package TEES :: Package Utils :: Package Libraries :: Module PorterStemmer :: Class PorterStemmer

Class PorterStemmer

Instance Methods
 
__init__(self)
The main part of the stemming algorithm starts here.
 
cons(self, i)
cons(i) is TRUE <=> b[i] is a consonant.
 
m(self)
m() measures the number of consonant sequences between k0 and j.
 
vowelinstem(self)
vowelinstem() is TRUE <=> k0,...j contains a vowel
 
doublec(self, j)
doublec(j) is TRUE <=> j,(j-1) contain a double consonant.
 
cvc(self, i)
cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant and also if the second c is not w,x or y.
 
ends(self, s)
ends(s) is TRUE <=> k0,...k ends with the string s.
 
setto(self, s)
setto(s) sets (j+1),...k to the characters in the string s, readjusting k.
 
r(self, s)
r(s) calls setto(s) only if m() > 0; it is used further down, in step2() and step3().
 
step1ab(self)
step1ab() gets rid of plurals and -ed or -ing.
 
step1c(self)
step1c() turns terminal y to i when there is another vowel in the stem.
 
step2(self)
step2() maps double suffixes to single ones.
 
step3(self)
step3() deals with -ic-, -full, -ness etc.
 
step4(self)
step4() takes off -ant, -ence etc., in context <c>vcvc<v>.
 
step5(self)
step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
 
stem(self, p, i, j)
In stem(p,i,j), p is a char pointer, and the string to be stemmed is from p[i] to p[j] inclusive.
Method Details

__init__(self)
(Constructor)

The main part of the stemming algorithm starts here. b is a buffer holding a word to be stemmed. The letters are in b[k0], b[k0+1] ... ending at b[k]. In fact k0 = 0 in this demo program. k is readjusted downwards as the stemming progresses. Zero termination is not in fact used in the algorithm.

Note that only lower case sequences are stemmed. Forcing to lower case should be done before stem(...) is called.

m(self)

m() measures the number of consonant sequences between k0 and j.
If c is a consonant sequence and v a vowel sequence, and <..>
indicates optional presence, then:

   <c><v>       gives 0
   <c>vc<v>     gives 1
   <c>vcvc<v>   gives 2
   <c>vcvcvc<v> gives 3
   ....
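
The measure can be illustrated with a short standalone sketch. This is not the class's own code: the function names `is_consonant` and `measure` are hypothetical, and the sketch works on a plain Python string rather than on the buffer b with indices k0 and j. It assumes lower-case input and the usual Porter rule that y counts as a consonant at the start of a word or after a vowel.

```python
def is_consonant(word, i):
    """Return True if word[i] acts as a consonant (hypothetical helper)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of the word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the 'vc' pairs in stem, i.e. vowel-run to consonant-run transitions."""
    count = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cur_is_consonant = is_consonant(stem, i)
        if prev_is_vowel and cur_is_consonant:
            count += 1
        prev_is_vowel = not cur_is_consonant
    return count


for word in ("tree", "trouble", "oaten"):
    print(word, measure(word))
```

Counting transitions from a vowel run to a consonant run is equivalent to counting the vc pairs in the table above, since the optional leading <c> and trailing <v> never add a transition.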

cvc(self, i)

cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant
and also if the second c is not w, x or y. This is used when trying to
restore an e at the end of a short word, e.g.

   cav(e), lov(e), hop(e), crim(e), but
   snow, box, tray.
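
A minimal sketch of this test, checked against the examples above. The names `is_consonant` and `ends_cvc` are hypothetical, and the sketch inspects the last three letters of a plain string instead of taking an index i into the buffer.

```python
def is_consonant(word, i):
    """Return True if word[i] acts as a consonant (hypothetical helper)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of the word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_cvc(word):
    """True if word ends consonant-vowel-consonant and the last letter is not w, x or y."""
    i = len(word) - 1
    if i < 2:
        return False
    has_cvc_shape = (
        is_consonant(word, i - 2)
        and not is_consonant(word, i - 1)
        and is_consonant(word, i)
    )
    return has_cvc_shape and word[i] not in "wxy"


for word in ("cav", "lov", "hop", "crim", "snow", "box", "tray"):
    print(word, ends_cvc(word))
```

The first four stems pass the test, so an e may be restored; snow, box and tray have the same shape but are excluded by the w, x, y rule.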

step1ab(self)

step1ab() gets rid of plurals and -ed or -ing. e.g.

   caresses  ->  caress
   ponies    ->  poni
   ties      ->  ti
   caress    ->  caress
   cats      ->  cat

   feed      ->  feed
   agreed    ->  agree
   disabled  ->  disable

   matting   ->  mat
   mating    ->  mate
   meeting   ->  meet
   milling   ->  mill
   messing   ->  mess

   meetings  ->  meet
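
The plural cases in the first group of examples can be reproduced with a few suffix checks. The sketch below covers only that plural-removal part, not the -ed and -ing handling; `strip_plural` is a hypothetical name, and the function returns a new string where the class itself adjusts its buffer in place.

```python
def strip_plural(word):
    """Remove a plural ending from a lower-case word (sketch of the plural part only)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress stays caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for word in ("caresses", "ponies", "ties", "caress", "cats"):
    print(word, "->", strip_plural(word))
```

The order of the checks matters: the longer suffixes are tested first so that -sses and -ss words are not caught by the plain -s rule.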

step2(self)

step2() maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.
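
As a rough illustration of the m() > 0 condition, the sketch below applies a handful of double-suffix rules to plain strings. It is not the class's implementation: `STEP2_RULES`, `apply_step2`, `measure` and `is_consonant` are hypothetical names, and the rule table is a small assumed subset chosen for the example, not the full list used by step2().

```python
# Assumed subset of double-suffix rules, for illustration only.
STEP2_RULES = [
    ("ization", "ize"),
    ("ational", "ate"),
    ("fulness", "ful"),
    ("ousness", "ous"),
]


def is_consonant(word, i):
    """Return True if word[i] acts as a consonant (hypothetical helper)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the 'vc' pairs in stem."""
    count = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cur_is_consonant = is_consonant(stem, i)
        if prev_is_vowel and cur_is_consonant:
            count += 1
        prev_is_vowel = not cur_is_consonant
    return count


def apply_step2(word):
    """Replace the first matching double suffix if the remaining stem has measure > 0."""
    for suffix, replacement in STEP2_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word  # suffix matched, but the stem is too short
    return word


for word in ("relational", "rational", "hopefulness"):
    print(word, "->", apply_step2(word))
```

Here "relational" becomes "relate" because the stem "rel" has measure 1, while "rational" is left alone because the stem "r" has measure 0.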

step3(self)

step3() deals with -ic-, -full, -ness etc., using a similar strategy to step2().

stem(self, p, i, j)

In stem(p,i,j), p is a char pointer, and the string to be stemmed is from p[i] to p[j] inclusive. Typically i is zero and j is the offset to the last character of a string (p[j+1] == '\0'). The stemmer adjusts the characters p[i] ... p[j] and returns the new end-point of the string, k. Stemming never increases word length, so i <= k <= j. To turn the stemmer into a module, declare 'stem' as extern, and delete the remainder of this file.