From: F J Franklin (F.J.Franklin@sheffield.ac.uk)
Date: Wed May 08 2002 - 10:34:21 EDT
> > Support is there but incomplete. Byte sequences longer than 3 bytes
> > will cause problems, and there isn't a UTF-8 -> UCS-4 conversion yet.
>
> Sorry to keep whining about this but it was all in my lost huge Unicode
> patch over a year ago. UTF-8 sequences can be up to 6 bytes long. We
> should probably leave it up to iconv anyway since we have to handle
> things like overlong sequences, illegal sequences etc. iconv should
> handle this. I think my implementation used the ByteBuf class so that
> it could handle UCS-2 and UCS-4 properly without worrying about all
> those null bytes looking like string terminators and stuff.
Andrew, Andrew, I know. The reason why only 3-byte sequences are handled
is that the routine was written for Abi's then-internal UCS-2, whose
characters never need more than 3 bytes in UTF-8. Now that Abi uses
UCS-4 internally I'll add the code to handle sequences of up to 6 bytes.
In general I support the use of iconv for conversion between encodings,
but conversion between validated UTF-8 and UCS-4 is trivial and the
[UT_]UTF8String class was designed to handle the conversion without
resorting to iconv.
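To illustrate why I call the validated case trivial, here is a minimal
sketch in C++. It is not the actual [UT_]UTF8String code: the names
utf8_sequence_length and utf8_to_ucs4 are hypothetical, the use of
std::string and std::vector is an assumption made for brevity, and the
input is assumed to have been validated already (no overlong or illegal
sequences), so the decoder does no checking of its own beyond an assert.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of any real AbiWord class).
// Length in bytes of the sequence introduced by lead byte b, using the
// scheme of up to 6 bytes discussed in this thread.
// Returns 0 if b cannot start a sequence.
static std::size_t utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    if ((b & 0xFC) == 0xF8) return 5;  // 111110xx
    if ((b & 0xFE) == 0xFC) return 6;  // 1111110x
    return 0;                          // continuation byte, 0xFE or 0xFF
}

// Decode UTF-8 that the caller has ALREADY validated into UCS-4.
// Assumption: overlong and illegal sequences were rejected earlier.
std::vector<std::uint32_t> utf8_to_ucs4(const std::string& in)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        const std::size_t len = utf8_sequence_length(lead);
        assert(len != 0 && i + len <= in.size()); // validated-input precondition

        // Payload bits of the lead byte: all 7 for ASCII, else 7 - len.
        std::uint32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));

        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            cp = (cp << 6) | (c & 0x3Fu);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

All the hard work (overlong forms, stray continuation bytes, truncated
input) is in the validation step that this sketch deliberately leaves
out, which is where iconv or an equivalent check would still earn its
keep.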
Ciao, Frank
ps. BTW, do you know anything about the overheads of using various iconv
implementations? or their thread-safety, for that matter? (Genuinely
curious/worried...)
Francis James Franklin
F.J.Franklin@shef.ac.uk
"No, she really likes me. She told me I look like Britney Spears, and why
would you say that to somebody you don't like?"
--- Elle Woods