Re: A slight regression in 2.9.1 text:p loading...

From: Martin Sevior <msevior_at_gmail.com>
Date: Tue Jul 19 2011 - 02:15:25 CEST

Hi Ben,

Thanks very much for this clear investigation of a very convoluted spec!

Just one point about your commit here:

+ UT_UCS4String s = ODi_textp_fold_whitespace( pBuffer, length );
+ s = ODi_textp_compact_two_or_more_spaces( s );
+ if(!m_bContentWritten)
         {
- printf(">>> content written!\n");
- m_charData += UT_UCS4String (pBuffer, length, true);
+ s = ODi_textp_trim_whitespace_leading( s );
         }
+ m_charData += s;
+
+ printf(">>> content written!\n");
+ UT_DEBUGMSG(("charData() s:%s\n", s.utf8_str() ));
     }

lets, turn that printf(">>> content written \n");

into a UT_DEBUGMSG so we don't put debug info into a users terminal
while we're loading an odt doc!

I should never have left that code there :-(

Cheers

Martin

On Tue, Jul 19, 2011 at 9:58 AM, Ben Martin
<monkeyiq@users.sourceforge.net> wrote:
> Hmm, one flaw to my flush() plan is that flush() is called to
> appendSpan() during a load. In reality it might not matter so much
> because the pathological case that I see causing problems might be so
> rare.
>
> The problem case would be with
> <text:p>foo <text:span>bar </text:span> bzzzzt</text:p>
>
> where the spaces after bar and before bzzzzt seem to require being
> coalesced to a single space. Point 2 of the ODF spec appears to
> concatenate the text of all text:p, Point 3 trims leading/trailing
> whitespace, Point 4 then coalesces 2+ spaces to 1 in the content.
>
> If one coalesces the intermediate spaces in the above, it does raise the
> question of where the remaining space should be placed (inside or
> outside the text:span close). So maybe I'm reading point 2 slightly
> wrong.
>
> My current fix which I think I'll commit shortly does much the same as
> Martin's code but also performs newline and space coalescing for the
> initial call to charData(). The doc from 11847 seems to render the same
> for my code and 2.9.1.
>
>
>
> On Mon, 2011-07-18 at 18:31 +1000, Ben Martin wrote:
>> Hi Martin,
>>
>>   Unsurprisingly, this is actually a bit more complex than I originally
>> thought :) I was reading over the part of the ODF spec I cited, and
>> traces of charData() calls, and the bug that ran text together when
>> whitespace is eaten 12813.
>>
>> A fairly simple fragment created with OO is the following, which should
>> all be on one line.
>>
>> <text:p text:style-name="Standard">
>> Hello <text:span text:style-name="T1">there</text:span> you
>> </text:p>
>>
>> In this case one believes that the spaces are significant. Abiword's
>> charData might be fed these during a load:
>> "Hello "
>> "there"
>> " you"
>>
>> The spec goes:
>> 2) The character data of the paragraph element and of all descendant
>> elements for which the OpenDocument schema permits the inclusion of
>> character data for the element itself and all its ancestor elements up
>> to the paragraph element, is concatenated in document order.
>>
>> 3) Leading " " (U+0020, SPACE) characters at the start of the resulting
>> text and trailing SPACE characters at the end of the resulting text are
>> removed.
>>
>> So it would seem that accruing m_charData in the charData() method and
>> treating the removal of leading and trailing whitespace in flush() might
>> be more appropriate. The charData() method should probably fold anything
>> in
>> Z = { U+0009, U+000D, U+000A )
>> to the SPACE = U+0020 char as per the spec, and maybe collapse
>> contiguous sequences of 2+ spaces to 1 space too.
>>
>> The folding of "other" whitespace into spaces and the collapsing must be
>> done in charData() needed so that "text:line-break" and "text:s" code
>> can still simply work on m_charData. If folding/collapsing was delayed
>> to flush() then those particular cases would have their effect ignored.
>>
>> This does leave the sticky case of a text:s at the start of a text:p.
>> The spec makes it hard to tell if one or more text:s at the beginning of
>> a text:p should actually stick around or not. It seems not, looking at
>> the following quote from the spec 6.1.3:
>>     The <text:s> element is used to represent
>>     the [UNICODE] character " " (U+0020, SPACE).
>>
>> If this was a special case for leading whitepsace in a text:p then I
>> would imagine that 6.1.2 would explicitly mention it.
>>
>> Of course, once I'm somewhat happy I'll check 11847 against 2.9.1 and my
>> modified trunk.
>>
>> On Mon, 2011-07-18 at 11:04 +1000, Martin Sevior wrote:
>> > Hi Ben,
>> >
>> > Yes, I did write that code! it was revision 29316 with the commit comment:
>> >
>> > "Properly fix bug 11847 Spaces inserted before paragraphs."
>> >
>> > So when you do your fixes, check that bug 11847 remains fixed :-)
>> >
>> > Cheers
>> >
>> > Martin
>> >
>> > On Mon, Jul 18, 2011 at 10:50 AM, Martin Sevior <msevior@gmail.com> wrote:
>> > > Hi Ben,
>> > >
>> > > This seems vaguely familiar to me but revision  29384 appears t be a
>> > > commit by uwog that cleans up a bit of code. As far as I can tell
>> > > if(!m_bContentWritten) {...}  was present before 29384.
>> > >
>> > > I think I might have put that code in place but I can't recall the
>> > > reason. I'll do a bit of digging with viewVC
>> > >
>> > > Cheers
>> > >
>> > > Martin
>> > >
>> > >
>> > > On Sun, Jul 17, 2011 at 4:58 PM, Ben Martin
>> > > <monkeyiq@users.sourceforge.net> wrote:
>> > >> Hi,
>> > >>  I dug into the loading of hand crafted ODT files in 2.9.1 and have
>> > >> filed a bug 13112 [1]. I've been looking into fixing the bug, but also
>> > >> don't want to step on anyone's toes in doing so.
>> > >>
>> > >> The gist of what I see being the problem is a special case in void
>> > >> ODi_TextContent_ListenerState::charData() where
>> > >> if(!m_bContentWritten) {...} is called. I'm not sure what use case
>> > >> prompted this special case to be added, so I don't want to blindly
>> > >> change it without discussion first.
>> > >>
>> > >>  It seems from "6.1.2 White Space Characters" of the spec [2], in
>> > >> particular page 120 of [3], that leading and trailing whitespace is to
>> > >> be removed, and internal whitespace is to be collapsed. ie, a stream of
>> > >> 2+ SPACE to be replaced with a single SPACE.
>> > >>
>> > >> Prior to collapsing anything in
>> > >> Z = { U+0009, U+000D, U+000A ) is first replaced with
>> > >> SPACE = U+0020.
>> > >>
>> > >> The current code, when given something like the following will preserve
>> > >> the newline after magic;
>> > >>
>> > >> <text:p>
>> > >>  I am a magic
>> > >>     and special wizard<text:span>....
>> > >> </text:p>
>> > >>
>> > >> This is because the entire string "\n  I am a magic\n     and special
>> > >> wizard" will be passed to charData() on the first call when
>> > >> m_bContentWritten == false;
>> > >>
>> > >> FWIW the special case was brought in at revision 29384.
>> > >>
>> > >>
>> > >> [1] http://bugzilla.abisource.com/show_bug.cgi?id=13112
>> > >> [2] http://docs.oasis-open.org/office/v1.2/cos01/
>> > >> [3]
>> > >> http://docs.oasis-open.org/office/v1.2/cos01/OpenDocument-v1.2-cos01-part1.pdf
>> > >>
>> > >>
>> > >
>> >
>>
>
>
Received on Tue Jul 19 02:15:36 2011

This archive was generated by hypermail 2.1.8 : Tue Jul 19 2011 - 02:15:36 CEST