Hmm, one flaw to my flush() plan is that flush() is called to
appendSpan() during a load. In reality it might not matter so much
because the pathological case that I see causing problems might be so
rare.
The problem case would be with
<text:p>foo <text:span>bar </text:span> bzzzzt</text:p>
where the spaces after bar and before bzzzzt seem to require being
coalesced to a single space. Point 2 of the ODF spec appears to
concatenate the text of all text:p, Point 3 trims leading/trailing
whitespace, Point 4 then coalesces 2+ spaces to 1 in the content.
If one coalesces the intermediate spaces in the above, it does raise the
question of where the remaining space should be placed (inside or
outside the text:span close). So maybe I'm reading point 2 slightly
wrong.
My current fix which I think I'll commit shortly does much the same as
Martin's code but also performs newline and space coalescing for the
initial call to charData(). The doc from 11847 seems to render the same
for my code and 2.9.1.
On Mon, 2011-07-18 at 18:31 +1000, Ben Martin wrote:
> Hi Martin,
>
> Unsurprisingly, this is actually a bit more complex than I originally
> thought :) I was reading over the part of the ODF spec I cited, and
> traces of charData() calls, and the bug that ran text together when
> whitespace is eaten 12813.
>
> A fairly simple fragment created with OO is the following, which should
> all be on one line.
>
> <text:p text:style-name="Standard">
> Hello <text:span text:style-name="T1">there</text:span> you
> </text:p>
>
> In this case one believes that the spaces are significant. Abiword's
> charData might be fed these during a load:
> "Hello "
> "there"
> " you"
>
> The spec goes:
> 2) The character data of the paragraph element and of all descendant
> elements for which the OpenDocument schema permits the inclusion of
> character data for the element itself and all its ancestor elements up
> to the paragraph element, is concatenated in document order.
>
> 3) Leading " " (U+0020, SPACE) characters at the start of the resulting
> text and trailing SPACE characters at the end of the resulting text are
> removed.
>
> So it would seem that accruing m_charData in the charData() method and
> treating the removal of leading and trailing whitespace in flush() might
> be more appropriate. The charData() method should probably fold anything
> in
> Z = { U+0009, U+000D, U+000A )
> to the SPACE = U+0020 char as per the spec, and maybe collapse
> contiguous sequences of 2+ spaces to 1 space too.
>
> The folding of "other" whitespace into spaces and the collapsing must be
> done in charData() needed so that "text:line-break" and "text:s" code
> can still simply work on m_charData. If folding/collapsing was delayed
> to flush() then those particular cases would have their effect ignored.
>
> This does leave the sticky case of a text:s at the start of a text:p.
> The spec makes it hard to tell if one or more text:s at the beginning of
> a text:p should actually stick around or not. It seems not, looking at
> the following quote from the spec 6.1.3:
> The <text:s> element is used to represent
> the [UNICODE] character " " (U+0020, SPACE).
>
> If this was a special case for leading whitepsace in a text:p then I
> would imagine that 6.1.2 would explicitly mention it.
>
> Of course, once I'm somewhat happy I'll check 11847 against 2.9.1 and my
> modified trunk.
>
> On Mon, 2011-07-18 at 11:04 +1000, Martin Sevior wrote:
> > Hi Ben,
> >
> > Yes, I did write that code! it was revision 29316 with the commit comment:
> >
> > "Properly fix bug 11847 Spaces inserted before paragraphs."
> >
> > So when you do your fixes, check that bug 11847 remains fixed :-)
> >
> > Cheers
> >
> > Martin
> >
> > On Mon, Jul 18, 2011 at 10:50 AM, Martin Sevior <msevior@gmail.com> wrote:
> > > Hi Ben,
> > >
> > > This seems vaguely familiar to me but revision 29384 appears t be a
> > > commit by uwog that cleans up a bit of code. As far as I can tell
> > > if(!m_bContentWritten) {...} was present before 29384.
> > >
> > > I think I might have put that code in place but I can't recall the
> > > reason. I'll do a bit of digging with viewVC
> > >
> > > Cheers
> > >
> > > Martin
> > >
> > >
> > > On Sun, Jul 17, 2011 at 4:58 PM, Ben Martin
> > > <monkeyiq@users.sourceforge.net> wrote:
> > >> Hi,
> > >> I dug into the loading of hand crafted ODT files in 2.9.1 and have
> > >> filed a bug 13112 [1]. I've been looking into fixing the bug, but also
> > >> don't want to step on anyone's toes in doing so.
> > >>
> > >> The gist of what I see being the problem is a special case in void
> > >> ODi_TextContent_ListenerState::charData() where
> > >> if(!m_bContentWritten) {...} is called. I'm not sure what use case
> > >> prompted this special case to be added, so I don't want to blindly
> > >> change it without discussion first.
> > >>
> > >> It seems from "6.1.2 White Space Characters" of the spec [2], in
> > >> particular page 120 of [3], that leading and trailing whitespace is to
> > >> be removed, and internal whitespace is to be collapsed. ie, a stream of
> > >> 2+ SPACE to be replaced with a single SPACE.
> > >>
> > >> Prior to collapsing anything in
> > >> Z = { U+0009, U+000D, U+000A ) is first replaced with
> > >> SPACE = U+0020.
> > >>
> > >> The current code, when given something like the following will preserve
> > >> the newline after magic;
> > >>
> > >> <text:p>
> > >> I am a magic
> > >> and special wizard<text:span>....
> > >> </text:p>
> > >>
> > >> This is because the entire string "\n I am a magic\n and special
> > >> wizard" will be passed to charData() on the first call when
> > >> m_bContentWritten == false;
> > >>
> > >> FWIW the special case was brought in at revision 29384.
> > >>
> > >>
> > >> [1] http://bugzilla.abisource.com/show_bug.cgi?id=13112
> > >> [2] http://docs.oasis-open.org/office/v1.2/cos01/
> > >> [3]
> > >> http://docs.oasis-open.org/office/v1.2/cos01/OpenDocument-v1.2-cos01-part1.pdf
> > >>
> > >>
> > >
> >
>
This archive was generated by hypermail 2.1.8 : Tue Jul 19 2011 - 01:59:00 CEST