From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Mon Mar 18 2002 - 09:37:43 EST
--- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
>
> This bug surfaced in the MS Word importer, but it will plague many
> others. It has to do with the handling of the so-called mirror
> characters, such as ( ) or [ ].
> 
> In Unicode, mirror characters are defined semantically. For instance,
> character U+0028 is defined as 'opening parenthesis'. In a
> left-to-right context an opening parenthesis looks like '(', but in a
> right-to-left context it looks like ')'. AW uses this semantic
> definition, and it displays the correct glyph depending on the context.
> 
> A problem arises when we import documents from some other format that
> does not use the semantic Unicode definition. For instance, MS Word
> will not use U+0028 for an opening parenthesis in RTL context, but
> instead it will store U+0029 ('closing parenthesis') in its place. So,
> when we load the document and analyse the context, we will display the
> glyph for a closing parenthesis in RTL context, which is '(', while
> the author intended ')'.
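A minimal Python sketch of the mismatch described above. The MIRROR
table is a hand-written subset for illustration only (Python's
unicodedata module exposes the mirrored flag but not the partner
character), and displayed_glyph is a hypothetical stand-in for what a
renderer does, not AbiWord code.

```python
import unicodedata

# Hand-written subset of the Unicode mirroring pairs; a full table would
# come from the Unicode data file BidiMirroring.txt.
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[",
          "{": "}", "}": "{", "<": ">", ">": "<"}

def displayed_glyph(ch, rtl):
    """Glyph shown for ch when mirroring is applied semantically."""
    if rtl and unicodedata.mirrored(ch):
        return MIRROR.get(ch, ch)
    return ch

# Semantic storage: U+0028 looks like '(' in LTR and ')' in RTL.
print(displayed_glyph("\u0028", rtl=False))  # (
print(displayed_glyph("\u0028", rtl=True))   # )

# Visual storage, as described for MS Word: U+0029 sits where the author
# wanted ')' in RTL text, so a semantic renderer shows '(' instead.
print(displayed_glyph("\u0029", rtl=True))   # (
```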
> 
> This is a serious bug that needs some fix before the 1.0 release. I
> see two possible avenues:
> 
> (1) The MS Word importer carries out the analysis of the context and
> translates any mirroring characters in RTL context to the correct
> Unicode values. The problems with this are that (a) the importer was
> not designed to analyse the context; it handles one character at a
> time, and getting it to do this properly would not be entirely simple,
> though it is surmountable; and (b) we will have to redo this in every
> importer that handles a file format with the same problem (plain
> text, etc.).
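One way option (1) could look, sketched in Python under a strong
assumption: that the file format tells the importer the direction of
each run, so a character-at-a-time importer only needs one extra bit of
state. The class and method names here are hypothetical, the swap table
is a hand-picked subset, and neutral characters between runs of
different direction would still need real bidi analysis, which this
does not attempt.

```python
# Swap table for a hand-picked subset of mirrored pairs (illustrative only).
SWAP = str.maketrans("()[]{}<>", ")(][}{><")

class MirrorFixingImporter:
    """Hypothetical character-at-a-time importer with one bit of state."""

    def __init__(self):
        self.rtl = False   # direction of the run currently being read
        self.chars = []    # characters handed on to the document

    def set_run_direction(self, rtl):
        # Called whenever the file format signals a new run's direction.
        self.rtl = rtl

    def put_char(self, ch):
        # In an RTL run the file stores the visual bracket, so swap it
        # to the semantic code point; LTR runs pass through unchanged.
        self.chars.append(ch.translate(SWAP) if self.rtl else ch)

    def text(self):
        return "".join(self.chars)

imp = MirrorFixingImporter()
imp.set_run_direction(True)
for ch in ")\u05d0\u05d1(":   # brackets as stored by the foreign format
    imp.put_char(ch)
imp.set_run_direction(False)
for ch in " (ok)":
    imp.put_char(ch)
print(imp.text() == "(\u05d0\u05d1) (ok)")  # True
```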
> 
> (2) A second solution would be to add a method to our edit methods
> which would scan through the document for any mirror characters in RTL
> context and replace them with their mirror images. This method would
> be called by the importer once the entire document is loaded. The main
> advantages of this are (a) it can be used by any importer that needs
> it; (b) when the document has been loaded, the context has already
> been analysed, so that it is easy to identify the offending
> characters. The main disadvantage is that the character-by-character
> scanning of the document and the deletion/insertion operations carried
> out on the offending characters will prolong the document loading,
> which could be noticeable on large files. Also, with the incremental
> loader in place, the initial appearance of the document before the
> loading is completed would be incorrect (unless we can call this
> fixing method while doing the loading -- that should be possible, I
> think, if it is part of the BlockLayout class rather than an
> independent edit method).
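A rough Python sketch of the post-load pass in option (2). It assumes
the loaded document can be viewed as a list of characters with a
parallel list of resolved embedding levels, where odd levels mean
right-to-left as in the Unicode bidirectional algorithm. fix_mirrored
and the MIRROR subset are hypothetical names for illustration, not
existing AbiWord methods.

```python
# Hand-written subset of mirrored pairs (illustrative only).
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[",
          "{": "}", "}": "{", "<": ">", ">": "<"}

def fix_mirrored(chars, levels):
    """Swap mirrored characters at odd (RTL) levels in place.

    Returns the number of characters that were replaced.
    """
    fixed = 0
    for i, (ch, level) in enumerate(zip(chars, levels)):
        if level % 2 == 1 and ch in MIRROR:
            chars[i] = MIRROR[ch]
            fixed += 1
    return fixed

# Five LTR characters followed by three characters at RTL level 1, with
# the brackets stored visually, as the foreign format left them.
doc = list("a(b) )\u05d0(")
levels = [0, 0, 0, 0, 0, 1, 1, 1]

print(fix_mirrored(doc, levels))              # 2
print("".join(doc) == "a(b) (\u05d0)")        # True
```

The pass is a single linear scan, so the cost concern above is mostly
about how expensive each replacement is in the real document model, not
about the scan itself.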
> 
> I would appreciate some comments on this, especially if someone has a
> better idea of how to fix this.
I'm strongly in favour of solution 1 above. If the problem is just
with certain formats then we should fix only those formats, especially
when those formats otherwise support RTL documents. Otherwise we are
second-guessing which character the user wanted in which position in
which document, and that will surely lead to subtle bugs.
Andrew Dunbar.
=====
http://linguaphile.sourceforge.net http://www.abisource.com