From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Sat Feb 16 2002 - 06:52:23 GMT
On Sat, 16 Feb 2002, Anthony Fok wrote:
> Hello all,
> 
> On Fri, Feb 15, 2002 at 11:42:02AM +0000, Andrew Dunbar wrote:
> > ----- Forwarded message from Fencol Yung -----
> > > I found that the following Unicode Private Use
> > > characters defined in
> > > AbiWord 0.9.6 conflict with our charset:
> > > 
> > > UCS_FIELDSTART 		0xe000
> > > UCS_FIELDEND 		0xe001
> > > UCS_BOOKMARKSTART 	0xe002
> > > UCS_BOOKMARKEND 	0xe003
> > >
> > > I read in the mail archives that these characters were actually moved from
> > > elsewhere. If I move them to another position to resolve the conflict,
> > > will that create any drawbacks? Actually, what is the purpose of those
> > > special characters?
> 
> Yes, these four codepoints are at the very beginning of the Unicode
> PUA, and thus clash with at least 2 major charsets: GB18030 and
> BIG5-HKSCS.  (GB18030: New simplified Chinese encoding standard that
> maps to Unicode one-to-one; BIG5-HKSCS: Extension to the traditional
> Chinese Big5 encoding, by the Hong Kong government.)
> 
> As a matter of fact, we ran into the same problem about a
> month ago when our product was being certified for GB18030 compliance
> at the official Chinese Testing Agency.  Three of the GB18030 test
> documents map to Unicode PUA U+E000 to U+E765.  Loading the first of
> these 3 test documents would cause AbiWord to crash immediately.
> 
> We had to move these {FIELD,BOOKMARK}{START,END} out of the way (to
> U+F000..U+F003 temporarily) in order to pass the certification.
> 
> But I agree that even putting them in U+F000..U+F003 is problematic.
We should move to a 32-bit internal representation so we can find codes that
won't be touched.
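
A rough sketch of why 32 bits would help, not a description of any existing AbiWord code: Unicode codepoints stop at U+10FFFF, so a 32-bit character type has room for values that no valid document text can decode to. The type name `Char32`, the base value and the helper names below are all assumptions made up for illustration.

```cpp
#include <cstdint>

// Hypothetical 32-bit character type; not an existing AbiWord typedef.
using Char32 = std::uint32_t;

// Unicode codepoints end at U+10FFFF, so larger 32-bit values can never
// be produced by decoding valid document text.
constexpr Char32 kMaxUnicode = 0x10FFFF;

// Assumed internal control codes, placed far outside the Unicode range.
// The base value is arbitrary and purely illustrative.
constexpr Char32 kInternalBase  = 0x80000000u;
constexpr Char32 kFieldStart    = kInternalBase + 0;
constexpr Char32 kFieldEnd      = kInternalBase + 1;
constexpr Char32 kBookmarkStart = kInternalBase + 2;
constexpr Char32 kBookmarkEnd   = kInternalBase + 3;

// True for any value that cannot be a Unicode codepoint.
constexpr bool is_internal_code(Char32 c) {
    return c > kMaxUnicode;
}

// What an importer might do: anything outside Unicode in incoming data is
// replaced with U+FFFD so it cannot be mistaken for an internal code.
constexpr Char32 sanitize_imported(Char32 c) {
    return c > kMaxUnicode ? Char32{0xFFFD} : c;
}
```

With this layout a PUA character such as U+E000 passes through an importer unchanged, while the internal codes can only ever be created by the application itself.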
> 
> > Yes I moved them.  I think we only had the first two at that time. 
> > They are used internally by AbiWord. And are assumed not to be
> > imported by any document. This is not a good assumption and we really
> > need to redesign this part of the code IMHO to use some kind of
> > out-of-band data instead of overriding the characters. Possibly we
> > can find some true "never to be used" characters but I doubt it.  The
> > people who know these parts of the code (please grep for them) should
> > be able to discuss the whys, wherefores, and possible solutions in
> > this list.  Before I moved them, they conflicted (from memory) with illegal
> > and/or BOM codes, which messed with importers and exporters and
> > generally seemed like a bad idea.
> 
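
For illustration only, here is one way the "out-of-band data" idea could look: the text buffer holds nothing but document characters, and field and bookmark boundaries live in a separate list of offsets. Every name here (`MarkerKind`, `Marker`, `MarkedText`, `add_marker`) is hypothetical and not taken from the AbiWord source.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Kinds of boundary that are currently encoded as in-band characters.
enum class MarkerKind { FieldStart, FieldEnd, BookmarkStart, BookmarkEnd };

// A boundary stored beside the text: it sits before text[offset].
struct Marker {
    std::size_t offset;
    MarkerKind  kind;
};

struct MarkedText {
    std::u16string      text;     // document characters only
    std::vector<Marker> markers;  // kept sorted by offset

    // Insert a marker, keeping the list ordered by offset. Markers that
    // share an offset stay in insertion order.
    void add_marker(std::size_t offset, MarkerKind kind) {
        auto it = markers.begin();
        while (it != markers.end() && it->offset <= offset) {
            ++it;
        }
        markers.insert(it, Marker{offset, kind});
    }
};
```

Because no codepoint is reserved, a document that really contains U+E000 is stored as-is and cannot be confused with a field boundary. The cost is that every edit to `text` must also shift the offsets in `markers`, which this sketch does not attempt.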
> > Hope someone has a good idea to fix this.  Merely moving them around
> > is probably going to keep breaking somebody's private stuff here and
> > there...
> 
> I agree.  Nevertheless, I think we do need to move them now until a
> better solution is found.  The Unicode PUA is in U+E000..U+F8FF.
> The range U+E000..U+E765 is explicitly set as three User Defined
> Areas (UDAs) in the GB18030 standard, so we must stay out of this
> range, otherwise AbiWord would not comply with GB18030 (mandatory in
> Mainland China).  (Yes, the mapping table goes higher, but no one uses
> anything that high yet, not even the GB18030 test documents.  :-)
> 
> U+E000..U+F848 maps to the EUDC (End-User Defined Characters) in the
> CP950 / BIG5 / BIG5-HKSCS standard as compatibility codepoints, and it
> would be best to stay out of these ranges too.  There is no mapping
> from BIG5-HKSCS to U+F849..U+F8FF.
> 
> So, for now, U+F849..U+F8FF is free.  I suggest putting the four
> AbiWord internal control codes in U+F850..U+F853 for now.  Yes, this
> only treats the symptom, but it is important to have this fix _now_ for
> GB18030 and HKSCS compliance.  This will do until the real cure comes. 
> :-)
> 
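
The range arithmetic above can be checked mechanically. The following is a minimal sketch that uses only the ranges quoted in this thread; the `Range` type and helper names are invented for the example, and the constants simply mirror the proposed values rather than reproducing the attached patch.

```cpp
#include <cstdint>

// Inclusive codepoint range. Illustrative type, not from the AbiWord source.
struct Range {
    std::uint32_t first;
    std::uint32_t last;
};

// Ranges as quoted in this thread. Only the GB18030 UDA block is modelled;
// the full GB18030 mapping table extends higher.
constexpr Range kUnicodePUA = {0xE000, 0xF8FF};  // BMP Private Use Area
constexpr Range kGB18030UDA = {0xE000, 0xE765};  // GB18030 User Defined Areas
constexpr Range kBig5EUDC   = {0xE000, 0xF848};  // CP950 / BIG5-HKSCS EUDC

constexpr bool contains(Range r, std::uint32_t c) {
    return c >= r.first && c <= r.last;
}

// True if c is a PUA codepoint that neither charset range above covers.
constexpr bool is_unclaimed_pua(std::uint32_t c) {
    return contains(kUnicodePUA, c) && !contains(kGB18030UDA, c) &&
           !contains(kBig5EUDC, c);
}

// Proposed temporary positions for the four internal control codes.
constexpr std::uint32_t UCS_FIELDSTART    = 0xF850;
constexpr std::uint32_t UCS_FIELDEND      = 0xF851;
constexpr std::uint32_t UCS_BOOKMARKSTART = 0xF852;
constexpr std::uint32_t UCS_BOOKMARKEND   = 0xF853;

// Fail the build if a control code is moved back into a claimed range.
static_assert(is_unclaimed_pua(UCS_FIELDSTART), "FIELDSTART clashes");
static_assert(is_unclaimed_pua(UCS_FIELDEND), "FIELDEND clashes");
static_assert(is_unclaimed_pua(UCS_BOOKMARKSTART), "BOOKMARKSTART clashes");
static_assert(is_unclaimed_pua(UCS_BOOKMARKEND), "BOOKMARKEND clashes");
```

The old positions U+E000..U+E003 and the interim U+F000..U+F003 both fail `is_unclaimed_pua`, which matches the clashes described above.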
The real cure is a 32-bit internal representation. That will come after 1.0.
Thanks very much for this workaround.
> A patch is attached.  Thanks!  :-)
> 
Even better :-)
Martin