Grammar Checking, my Take. (was ..)

From: <msevior_at_physics.unimelb.edu.au>
Date: Thu Feb 10 2005 - 02:02:23 CET

> > Btw, why is all this integrated in the main tree,
>> with the grammar
>> > checker being a plugin?
>> >
>>
>> Because it's all pure XP code, has a negligible
>> impact on performance and
>> way too hard to write as a plugin.
>>
>> Feel free to correct me.
>
> I'd very much like for us to design some
> IGrammarChecker interface that gets implemented via
> this plugin. If done correctly, that gains us at
> least:
>

I fully agree, Dom.

> 1) Preference on whether to Grammar Check blocks or
> not automatically, much like the automatic spell
> checking toggle

I think this code should just live in AbiWord proper. We're a long way
towards getting this done already.

> 2) Support for multiple languages

Yes, but see later.

> 3) The basis for some grammar-checking dialog, ala
> MSWord
>

Yes, but...

> We should then move a lot of the code that's currently
> in this plugin up into AbiWord proper. I think that
> the plugin itself should be fairly stupid with regard
> to anything != grammar. This means that some piece of
> code in AbiWord proper should intercept the BLKGRAMMAR
> Listener::notify() signal, tokenize the sentence
> accordingly, and pass it to the appropriate grammar
> checking plugin.
>

We can do this. But I want to play a bit more with the plugin first. In
particular, right now we just mark a whole sentence good/bad.

I read a bit more of the link-grammar documentation last night. It appears
possible to be much more precise about the location of the faulty grammar
in a sentence. If we highlight the bad region(s) of a sentence we can help
the user locate the incorrect part of the sentence.

So to my mind we need to change the current API to allow the grammar
checker to do this.

So we change:

  bool LinkGrammarWrap::parseSentence(const char * szSentence);

to

 bool LinkGrammarWrap::parseSentence( PieceOfText * pSentence);

Where:

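class ABI_EXPORT AbiGrammarErrors
{
 public:
  UT_uint32 m_iErrLow;         // low end of the faulty region
  UT_uint32 m_iErrHigh;        // high end of the faulty region
  UT_UTF8String m_sErrorDesc;  // hint or description of the problem
};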

class ABI_EXPORT PieceOfText
{
 public:
  PieceOfText(void);
  virtual ~PieceOfText(void);
  UT_uint32 iInLow;
  UT_uint32 iInHigh;
  UT_UTF8String sText;
  bool m_bGrammarChecked;
  bool m_bGrammarOK;
  UT_GenericVector<AbiGrammarErrors *> m_vecGrammarErrors;
  UT_UTF8String m_sSuggestion;
};

So the grammar checker can fill in the vector m_vecGrammarErrors with
more precise squiggle locations and maybe even provide hints or
descriptions of the problem.
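
As a rough illustration of the idea, here is a minimal sketch of a checker
reporting one faulty region. It is not AbiWord code: GrammarError and
Sentence are simplified stand-ins for AbiGrammarErrors and PieceOfText that
use std types instead of the UT_* classes, reportError() is a hypothetical
helper, and treating the offsets as inclusive and relative to the sentence
is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for AbiGrammarErrors (std types, not UT_* types).
struct GrammarError {
    std::uint32_t low;   // offset of the first faulty character (assumed inclusive)
    std::uint32_t high;  // offset of the last faulty character (assumed inclusive)
    std::string desc;    // hint or description of the problem
};

// Simplified stand-in for PieceOfText.
struct Sentence {
    std::string text;
    bool checked = false;
    bool ok = true;
    std::vector<GrammarError> errors;
};

// Hypothetical helper: the checker calls this once per faulty region.
void reportError(Sentence& s, std::uint32_t low, std::uint32_t high,
                 const std::string& desc) {
    assert(low <= high && high < s.text.size());
    s.errors.push_back({low, high, desc});
    s.checked = true;
    s.ok = false;
}
```

The squiggle code would then walk the errors vector and underline only
[low, high] for each entry, rather than the whole sentence.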

The other thing we need to do is to provide a framework for ignoring
grammar errors. My idea is to turn the sentence into either a 32-bit or
64-bit hash code. Sentences with ignored grammar errors have their hash codes
saved in a sorted UT_Vector in the document. Sentences marked as having
bad grammar have their hash codes compared with these and if they match,
the sentences are not displayed as having bad grammar. We can do a binary
search for hashcodes so this should be very fast. The hash codes are
exported and imported with the document.
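
A minimal sketch of the ignore list, under some assumptions: std::vector
stands in for UT_Vector, 32-bit FNV-1a is used purely as an example hash
(no hash function has been chosen yet), and the class and method names are
made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// 32-bit FNV-1a; an example choice only, any stable hash would do.
std::uint32_t hashSentence(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Hypothetical per-document list of ignored sentences, stored as sorted hashes.
class IgnoredSentences {
public:
    // Insert the hash at its sorted position, skipping duplicates.
    void ignore(const std::string& sentence) {
        const std::uint32_t h = hashSentence(sentence);
        auto it = std::lower_bound(m_hashes.begin(), m_hashes.end(), h);
        if (it == m_hashes.end() || *it != h)
            m_hashes.insert(it, h);
    }

    // Binary search over the sorted hashes.
    bool isIgnored(const std::string& sentence) const {
        assert(std::is_sorted(m_hashes.begin(), m_hashes.end()));
        return std::binary_search(m_hashes.begin(), m_hashes.end(),
                                  hashSentence(sentence));
    }

    // What would be exported with the document.
    const std::vector<std::uint32_t>& hashes() const { return m_hashes; }

private:
    std::vector<std::uint32_t> m_hashes;  // always kept sorted
};
```

One thing to keep in mind: two different sentences can share a hash, in
which case a real error would be silently hidden. A 64-bit hash makes that
much less likely at the cost of a slightly larger document.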

Regarding the current plugin architecture, I think we merely need to
detect the presence of a grammar checker of the correct language. Your
original plugin interface allowed that already. Once we have a pointer to
it and a well defined interface we do not need to use the
AV_Listener::notify(..) technique to execute code.

So I agree that something like the spell-checker interface would be better.
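
To make that concrete, here is a rough sketch of what such an interface
could look like. Only the name IGrammarChecker comes from this thread. The
checkSentence() method, the GrammarCheckerRegistry class, the language tags
and the ToyChecker are all assumptions for illustration, not existing
AbiWord code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical interface a grammar plugin would implement.
class IGrammarChecker {
public:
    virtual ~IGrammarChecker() {}
    // Returns true if the sentence is considered grammatical.
    virtual bool checkSentence(const std::string& sentence) = 0;
};

// Hypothetical registry: AbiWord proper looks up a checker by language tag.
class GrammarCheckerRegistry {
public:
    void registerChecker(const std::string& lang, IGrammarChecker* checker) {
        assert(checker != nullptr);
        m_checkers[lang] = checker;
    }

    // Returns nullptr when no checker is installed for that language.
    IGrammarChecker* find(const std::string& lang) const {
        auto it = m_checkers.find(lang);
        return it == m_checkers.end() ? nullptr : it->second;
    }

private:
    std::map<std::string, IGrammarChecker*> m_checkers;  // pointers not owned
};

// Toy checker for demonstration only: accepts anything ending in a period.
class ToyChecker : public IGrammarChecker {
public:
    bool checkSentence(const std::string& s) override {
        return !s.empty() && s.back() == '.';
    }
};
```

With a pointer obtained from find(), the caller invokes checkSentence()
directly, so nothing needs to go through AV_Listener::notify(..).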

Now regarding *link-grammar*.

There is no doubt in my mind that this is already a very powerful and
useful piece of software. It is very good about spotting bad grammar.
However, right now we only identify a complete sentence as being
grammatically incorrect. This makes the user's job of determining where a
sentence has gone wrong harder than it needs to be.

Everyone interested should read the documentation on the website at:
http://www.link.cs.cmu.edu/link/index.html

link-grammar works by marking up English words with a description of how
the words may be linked together. It uses these definitions to identify
correct sequences of words and then determines if the word sequences
satisfy the rules of an English sentence. It outputs not only whether a
sentence is correct but also all the valid links it has found within the
sentence.

It is brilliant at finding words in a sentence that should be removed. We
should exploit this to mark only these words with squiggles. This will
make the task of identifying where a sentence is grammatically incorrect
much easier.
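
Here is a small sketch of the squiggle side of this. It does not call
link-grammar at all: the per-word "linked" flags are a hypothetical input
that the plugin would have to derive from the parser's output, and words
are assumed to be separated by single spaces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns inclusive [low, high] character offsets for every word whose
// "linked" flag is false, i.e. the words that should get a squiggle.
std::vector<std::pair<std::size_t, std::size_t>>
squiggleRanges(const std::string& sentence, const std::vector<bool>& linked) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t pos = 0;
    std::size_t word = 0;
    while (pos < sentence.size()) {
        // Skip the spaces between words.
        while (pos < sentence.size() && sentence[pos] == ' ') ++pos;
        if (pos >= sentence.size()) break;
        const std::size_t start = pos;
        // Advance to the end of the current word.
        while (pos < sentence.size() && sentence[pos] != ' ') ++pos;
        assert(word < linked.size());  // expects one flag per word
        if (!linked[word]) ranges.push_back({start, pos - 1});
        ++word;
    }
    return ranges;
}
```

For "The cat the sat" with the third word flagged as unlinked, this yields
the single range 8 to 10, so only the stray "the" is underlined.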

It will be harder to precisely determine cases where words are left out
and where a sentence is just totally wrong but I think we can develop
heuristics to improve things. I am personally interested in working on
this and I invite anyone else interested to read the link-grammar
documentation and join in. I plan on working on this immediately.

I am by no means an expert at Natural Language parsing nor do I know much
of the formal study of grammar. However it appears that the link-grammar
framework could be extended to other languages. I believe that English has
rather a sloppy grammatical structure compared to other languages. It may
well be that link-grammar will work even better for many other languages.

What is needed are language dictionaries with correctly defined
link-grammar mark-up. Interested people could take one of the open source
spell-check lists of words and start the mark-up process. It appears to be
a perfect open source project that can be parallelized and which will work
with the Open-Source feedback process. If we wish to pursue this we should
engage the huge translation projects of the GNOME and KDE community.

Regarding a MSWord style Grammar Checking dialog:

The MSWord grammar checking dialog provides a somewhat useful suggestion
for how sentences should be constructed. link-grammar does not do this.
What it does do well is provide information on correct sequences of words.
We should make sure we play to link-grammar's strengths rather than copy how
MSWord does things. I encourage Alan and other people interested in
developing a GUI for link-grammar to build the command line tool "parse"
(it's really very easy to build), read the documentation, play with the tools,
see the strengths of the code and think about how best to show that to
users.

Right now there is no "default" solution for grammar checkers in the
Open-Source world. link-grammar really does work very well. I wrote most
of this email in AbiWord with grammar checking turned on and it has really
helped.

That said, I know of other efforts to write Open Source grammar checkers.
It would be great to hear of their progress.

What do other people think about what I've written?
There is a lot in this email!

Cheers!

Martin
Received on Thu Feb 10 02:03:41 2005
