Subject: Re: Patch: RecognizeContents for UTF-8
From: Vlad Harchev (hvv@hippo.ru)
Date: Mon Apr 09 2001 - 10:38:25 CDT
On Mon, 9 Apr 2001, Andrew Dunbar wrote:
 Hi, 
> Here's my patch to allow loading UTF-8 files regardless
> of the filename extension.  Including the .txt which
> is generally the case on Windows at least.
 Hmm, why all .txt files are utf8 files on Windows? Only ASCII files are
really utf8 files, any other encoding is not a subset of utf8 - so I think
nothing should be changed in IE_Imp_Text::RecognizeSuffix.
 Also, the AW will get an assertion failure (will crash) for any file with
8bit content if it's not a valid utf8 sequence - that's an incorrect
behaviour - 
        UT_ASSERT(UT_SHOULD_NOT_HAPPEN);
should definitely be removed.
Stylistic note: there is no need to surround return value in parenthesis.
 
> Andrew.
> 
> 
> --- ie_imp_Text.cpp.orig	Wed Feb  7 08:55:08 2001
> +++ ie_imp_Text.cpp	Mon Apr  9 23:08:12 2001
> @@ -250,7 +250,10 @@
>  
>  bool IE_Imp_Text::RecognizeSuffix(const char * szSuffix)
>  {
> -	return (UT_stricmp(szSuffix,".txt") == 0);
> +	// TODO: We give the other guys a chance, since this
> +	// TODO: importer is so generic.  Does this seem
> +	// TODO: like a sensible strategy?
> +	return(false);
>  }
>  
>  UT_Error IE_Imp_Text::StaticConstructor(PD_Document * pDocument,
> 
> --- ie_imp_UTF8.cpp.orig	Wed Feb  7 08:55:08 2001
> +++ ie_imp_UTF8.cpp	Sun Apr  8 00:20:56 2001
> @@ -308,8 +308,58 @@
>  
>  bool IE_Imp_UTF8::RecognizeContents(const char * szBuf, UT_uint32
> iNumbytes)
>  {
> -	// TODO: Not yet written
> -	return(false);
> +	bool bSuccess = false;
> +	const char *p = szBuf;
> +
> +	while (p < szBuf + iNumbytes)
> +	{
> +		int len;
> +		
> +		if ((*p & 0x80) == 0)				// ASCII
> +		{
> +			++p;
> +			continue;
> +		}
> +		else if (*p == 0xfe || *p == 0xff)		// BOM markers?  RFC2279 says
> illegal
> +		{
> +			UT_DEBUGMSG(("  BOM?\n"));
> +			break;
> +		}
> +		else if ((*p & 0xfe) == 0xfc)			// lead byte in 6-byte sequence
> +			len = 6;
> +		else if ((*p & 0xfc) == 0xf8)			// lead byte in 5-byte sequence
> +			len = 5;
> +		else if ((*p & 0xf8) == 0xf0)			// lead byte in 4-byte sequence
> +			len = 4;
> +		else if ((*p & 0xf0) == 0xe0)			// lead byte in 3-byte sequence
> +			len = 3;
> +		else if ((*p & 0xe0) == 0xc0)			// lead byte in 2-byte sequence
> +			len = 2;
> +		else						// not UTF-8 lead byte
> +		{
> +			UT_DEBUGMSG(("  not utf-8 lead byte\n"));
> +			UT_ASSERT(UT_SHOULD_NOT_HAPPEN);
> +			return(false);
> +		}
> +	
> +		while (--len)
> +		{
> +			++p;
> +			if (p >= szBuf + iNumbytes)
> +			{
> +				UT_DEBUGMSG(("  out of data!\n"));
> +				//return(false);
> +				break;
> +			}
> +			if ((*p & 0xc0) == 0x80)
> +				bSuccess = true;
> +			else
> +				return(false);
> +		}
> +		++p;
> +	}
> +	
> +	return(bSuccess);
>  }
>  
>  bool IE_Imp_UTF8::RecognizeSuffix(const char * szSuffix)
 Best regards,
  -Vlad
This archive was generated by hypermail 2b25 : Mon Apr 09 2001 - 11:23:35 CDT