From: Jordi Mas (jmas@softcatala.org)
Date: Tue Jul 15 2003 - 14:44:19 EDT
Dom Lachowicz wrote:
> For Catalan, you just need to add a list of the 200 or
> so most common *meaningless* words in the language.
> Like:
> 
> the, a, an, he, she, of, ...
Hello Dom and the others,
The summarizer should not only have stop words (meaningless words); it should
also know the most common words in every language.
What I wanted to mention is that if you select the stop words manually, the
quality of the summarisation is going to be low and the algorithm is not
going to perform well. If you use Word, you may be familiar with the concept
of not performing well when doing summarisation.
The right way of getting a list of common words for a language is to get a
corpus (a collection of documents), calculate the relative word frequency
(the number of times each word appears across all documents, relative to the
total word count) and then select the 200 or 300 most common words. Then you
are going to have exactly what you need. The corpus should also contain texts
from different areas of human knowledge.
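To make the counting step concrete, here is a minimal sketch in Python. It is only an illustration and not code from the summarizer library: the function name `common_words`, the `\w+` tokenizer and the default of 200 words are all my own assumptions, and a real corpus would need more careful tokenization.

```python
import re
from collections import Counter


def common_words(documents, n=200):
    """Return the n most frequent words in a corpus with their relative frequency.

    `documents` is an iterable of plain-text strings (hypothetical input format).
    The tokenizer is a naive assumption: lowercase runs of word characters.
    """
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"\w+", text.lower()))

    total = sum(counts.values())
    if total == 0:
        return []

    # Relative frequency = occurrences of the word / total words in the corpus.
    return [(word, count / total) for word, count in counts.most_common(n)]


if __name__ == "__main__":
    corpus = ["The cat and the dog.", "The bird sings."]
    for word, freq in common_words(corpus, n=3):
        print(f"{word}\t{freq:.3f}")
```

The words at the top of this list are the candidates for the stop-word list; the rest of the table gives the summarizer its idea of how common each word is in the language.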
I know that not everyone has a corpus handy, but we should do this with care
at least for the major languages (English, Spanish, German); if not, we are
not going to perform well.
I would also suggest implementing another algorithm in the library that has
been proven to be effective for text summarisation. Lots of texts contain
phrases like "In conclusion", etc., that should definitely raise the score of
the sentence, and phrases like "As we said before" or "As you have already
seen" that should lower it. This works well, especially for formal texts.
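A rough sketch of this cue-phrase idea in Python could look like the following. The phrase lists contain only the examples from this mail, and the function name and the weights of 1.0 are assumptions for illustration; real lists would have to be written per language.

```python
# Phrases that suggest a sentence is important (example from this mail only).
BONUS_PHRASES = ("in conclusion",)

# Phrases that suggest a sentence repeats earlier material.
PENALTY_PHRASES = ("as we said before", "as you have already seen")


def cue_phrase_adjustment(sentence, bonus=1.0, penalty=1.0):
    """Return a score adjustment for one sentence based on cue phrases.

    The weights are hypothetical; they would be tuned against the
    summarizer's own sentence scores.
    """
    lowered = sentence.lower()
    adjustment = 0.0
    for phrase in BONUS_PHRASES:
        if phrase in lowered:
            adjustment += bonus
    for phrase in PENALTY_PHRASES:
        if phrase in lowered:
            adjustment -= penalty
    return adjustment
```

The result would simply be added to whatever score the frequency-based algorithm already gives the sentence.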
Finally, one common problem in text summarisation is that the selected
sentences assume knowledge that you may no longer have. For example, if you
select "He will do the course with them" or "Also, ..." you no longer have
the text these refer to. We can have a list of pronouns (pronombres in
Spanish) and, if any are present, give the sentence a lower score, because we
should favour sentences that do not refer to text we no longer have.
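As a small sketch of that penalty in Python: the English word lists, the function name and the weight of 0.5 per word are all assumptions of mine, and each language would need its own lists.

```python
import re

# Hypothetical English pronoun list; every language needs its own.
REFERENCE_WORDS = {"he", "she", "it", "they", "him", "her", "them", "his", "their"}

# Connectives that, at the start of a sentence, point back to earlier text.
LEADING_CONNECTIVES = {"also", "however", "therefore"}


def dangling_reference_penalty(sentence, per_word=0.5):
    """Return a penalty for words that refer to text outside the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    penalty = per_word * sum(1 for word in words if word in REFERENCE_WORDS)
    if words and words[0] in LEADING_CONNECTIVES:
        penalty += per_word
    return penalty
```

With these lists, "He will do the course with them" is penalised twice (for "he" and "them"), while a sentence without pronouns keeps its full score.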
Those are my five cents. If you think that some of this is interesting, I can
give you guys a hand, or two :-)
Best Regards,
--
Jordi Mas i Hernàndez - Abiword developer - http://www.abisource.com
jmas@softcatala.org - Softcatalà member - http://www.softcatala.org
Personal Homepage http://www.softcatala.org/~jmas