Class PorterStemmer

Instance Methods

__init__(self)
The main part of the stemming algorithm starts here.

cons(self, i)
cons(i) is TRUE <=> b[i] is a consonant.

m(self)
m() measures the number of consonant sequences between k0 and j.

vowelinstem(self)
vowelinstem() is TRUE <=> k0,...j contains a vowel

doublec(self, j)
doublec(j) is TRUE <=> j,(j-1) contain a double consonant.

cvc(self, i)
cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant and also if the second c is not w,x or y.

ends(self, s)
ends(s) is TRUE <=> k0,...k ends with the string s.

setto(self, s)
setto(s) sets (j+1),...k to the characters in the string s, readjusting k.

r(self, s)
r(s) calls setto(s) if m() > 0; it is used by the later steps.

step1ab(self)
step1ab() gets rid of plurals and -ed or -ing.

step1c(self)
step1c() turns terminal y to i when there is another vowel in the stem.

step2(self)
step2() maps double suffixes to single ones.

step3(self)
step3() deals with -ic-, -ful, -ness etc.

step4(self)
step4() takes off -ant, -ence etc., in context <c>vcvc<v>.

step5(self)
step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.

stem(self, p, i, j)
In stem(p,i,j), p is a char pointer, and the string to be stemmed is from p[i] to p[j] inclusive.

Method Details

__init__(self)
(Constructor)

The main part of the stemming algorithm starts here. b is a buffer holding a word to be stemmed. The letters are in b[k0], b[k0+1] ... ending at b[k]. In fact k0 = 0 in this demo program. k is readjusted downwards as the stemming progresses. Zero termination is not in fact used in the algorithm.

Note that only lower case sequences are stemmed. Forcing to lower case should be done before stem(...) is called.
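The buffer bookkeeping described above can be illustrated with a small stand-in. This is only a sketch: StemBuffer and current() are hypothetical names, and the detail that ends() records the stem end in j is an assumption based on how setto() is described in the method summary.

```python
class StemBuffer:
    """Hypothetical stand-in for the stemmer state (b, k0, k, j)."""

    def __init__(self, word):
        self.b = word            # buffer holding the word to be stemmed
        self.k0 = 0              # offset of the first letter
        self.k = len(word) - 1   # offset of the last letter, moves down
        self.j = self.k          # general offset, set by ends() (assumed)

    def ends(self, s):
        """True if b[k0..k] ends with s; on success j marks the stem end."""
        length = len(s)
        if length > self.k - self.k0 + 1:
            return False
        if self.b[self.k - length + 1:self.k + 1] != s:
            return False
        self.j = self.k - length
        return True

    def setto(self, s):
        """Set b[j+1..] to s and readjust k to the new last letter."""
        self.b = self.b[:self.j + 1] + s + self.b[self.j + len(s) + 1:]
        self.k = self.j + len(s)

    def current(self):
        """The live part of the buffer, b[k0..k] inclusive."""
        return self.b[self.k0:self.k + 1]


buf = StemBuffer("ponies")
if buf.ends("ies"):
    buf.setto("i")
print(buf.current())  # poni
```

Letters beyond k may stay in b; they are simply ignored, which is why no terminator is needed.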

m(self)

m() measures the number of consonant sequences between k0 and j.
If c is a consonant sequence and v a vowel sequence, and <..>
indicates arbitrary presence,

   <c><v>       gives 0
   <c>vc<v>     gives 1
   <c>vcvc<v>   gives 2
   <c>vcvcvc<v> gives 3
   ....
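A minimal sketch of the measure, written as standalone functions on a plain string rather than on the class buffer. The names is_consonant and measure are illustrative, and the treatment of y (a consonant at the start of the word or after a vowel) is assumed to match cons().

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (assumed to mirror cons())."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y counts as a consonant at the start or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. the number of vc pairs."""
    count = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_is_vowel:
            count += 1
        prev_is_vowel = not consonant
    return count


for w in ("tree", "by", "trouble", "oats", "troubles", "oaten"):
    print(w, measure(w))
# tree 0, by 0, trouble 1, oats 1, troubles 2, oaten 2
```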

cvc(self, i)

cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant
and also if the second c is not w,x or y. This is used when trying to
restore an e at the end of a short word, e.g.

   cav(e), lov(e), hop(e), crim(e), but
   snow, box, tray.
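The test can be sketched on a plain string as follows. ends_cvc is an illustrative name; it checks the last three letters of the word instead of taking an index i, and is_consonant is assumed to behave like cons().

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (assumed to mirror cons())."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_cvc(word):
    """True if word ends consonant-vowel-consonant, last letter not w, x, y."""
    n = len(word)
    if n < 3:
        return False
    if (not is_consonant(word, n - 1)
            or is_consonant(word, n - 2)
            or not is_consonant(word, n - 3)):
        return False
    return word[-1] not in "wxy"


print([w for w in ("cav", "lov", "hop", "crim") if ends_cvc(w)])
# ['cav', 'lov', 'hop', 'crim']
print([w for w in ("snow", "box", "tray") if ends_cvc(w)])
# []
```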

step1ab(self)

step1ab() gets rid of plurals and -ed or -ing. e.g.

   caresses  ->  caress
   ponies    ->  poni
   ties      ->  ti
   caress    ->  caress
   cats      ->  cat

   feed      ->  feed
   agreed    ->  agree
   disabled  ->  disable

   matting   ->  mat
   mating    ->  mate
   meeting   ->  meet
   milling   ->  mill
   messing   ->  mess

   meetings  ->  meet
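The examples above can be reproduced with a string-based sketch of this step. It is not the class's own code: it works on Python strings instead of the buffer b, all helper names are illustrative, and it assumes the input is lower case and longer than two letters.

```python
def is_consonant(w, i):
    ch = w[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(w, i - 1)
    return True

def measure(s):
    count, prev_is_vowel = 0, False
    for i in range(len(s)):
        consonant = is_consonant(s, i)
        if consonant and prev_is_vowel:
            count += 1
        prev_is_vowel = not consonant
    return count

def has_vowel(s):
    return any(not is_consonant(s, i) for i in range(len(s)))

def ends_double_consonant(s):
    return len(s) >= 2 and s[-1] == s[-2] and is_consonant(s, len(s) - 1)

def ends_cvc(s):
    n = len(s)
    if n < 3 or not is_consonant(s, n - 1) or is_consonant(s, n - 2) \
            or not is_consonant(s, n - 3):
        return False
    return s[-1] not in "wxy"

def step1ab(word):
    # Part a: plurals
    if word.endswith("sses") or word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Part b: -eed, -ed, -ing
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if not has_vowel(stem):
                return word      # suffix is kept when the stem has no vowel
            word = stem
            break
    else:
        return word              # neither -ed nor -ing
    # Tidy up the stem left after removing -ed or -ing
    if word.endswith(("at", "bl", "iz")):
        return word + "e"
    if ends_double_consonant(word):
        return word if word[-1] in "lsz" else word[:-1]
    if measure(word) == 1 and ends_cvc(word):
        return word + "e"
    return word

for w in ("caresses", "ponies", "agreed", "disabled", "matting", "meetings"):
    print(w, "->", step1ab(w))
```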

step2(self)

step2() maps double suffixes to single ones, so -ization (= -ize plus -ation) maps to -ize, etc. Note that the string before the suffix must give m() > 0.
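A hedged sketch of the mapping, using only a subset of the suffix pairs from Porter's published algorithm. STEP2_SUFFIXES and the string-based helpers are illustrative names, and the rule that the first matching suffix decides the outcome (with no fallback to a shorter suffix) is an assumption about how the method is structured.

```python
def is_consonant(w, i):
    ch = w[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(w, i - 1)
    return True

def measure(s):
    count, prev_is_vowel = 0, False
    for i in range(len(s)):
        consonant = is_consonant(s, i)
        if consonant and prev_is_vowel:
            count += 1
        prev_is_vowel = not consonant
    return count

# Subset only; longer alternatives are listed before shorter ones.
STEP2_SUFFIXES = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("ization", "ize"),
    ("ation", "ate"),
    ("izer", "ize"),
    ("fulness", "ful"),
    ("ousness", "ous"),
    ("iveness", "ive"),
]

def step2(word):
    for suffix, replacement in STEP2_SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # The string before the suffix must give m() > 0
            if measure(stem) > 0:
                return stem + replacement
            return word  # assumed: no fallback to a shorter suffix
    return word

print(step2("vietnamization"))  # vietnamize
print(step2("relational"))      # relate
print(step2("rational"))        # rational (stem "r" has measure 0)
```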

step3(self)

step3() deals with -ic-, -ful, -ness etc., using a strategy similar to step2.

stem(self, p, i, j)

In stem(p,i,j), p is a char pointer, and the string to be stemmed is from p[i] to p[j] inclusive. Typically i is zero and j is the offset to the last character of a string (p[j+1] == '\0'). The stemmer adjusts the characters p[i] ... p[j] and returns the new end-point of the string, k. Stemming never increases word length, so i <= k <= j. To turn the stemmer into a module, declare 'stem' as extern, and delete the remainder of this file.
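The calling convention can be sketched with a toy stand-in. toy_stem and stem_token are hypothetical and do not implement the real algorithm; a list of characters plays the role of the char pointer. The description above reads as if carried over from a C implementation, so the actual Python method may return the stemmed string rather than k; check the source before relying on either behaviour.

```python
def toy_stem(p, i, j):
    """Hypothetical toy following the documented contract: returns k."""
    if j - i >= 3 and p[j - 2:j + 1] == list("ies"):
        return j - 2                 # "ies" -> "i"
    if j > i and p[j] == "s" and p[j - 1] != "s":
        return j - 1                 # drop a plural "s"
    return j                         # nothing to remove


def stem_token(word, stem=toy_stem):
    """Lower-case word, stem p[0..len-1] inclusive, cut at the end-point."""
    p = list(word.lower())           # only lower case sequences are stemmed
    if not p:
        return ""
    i, j = 0, len(p) - 1             # both offsets are inclusive
    k = stem(p, i, j)
    assert i <= k <= j               # stemming never increases word length
    return "".join(p[i:k + 1])


print(stem_token("Ponies"))  # poni
print(stem_token("CATS"))    # cat
```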
