overview -- revisions and the piece table

From: Paul Rohr (paul@abisource.com)
Date: Sat May 18 2002 - 13:40:39 EDT

Next message: Paul Rohr: "design q -- are revisions linear or parallel?"

Previous message: David Chart: "Re: revision marks"
Next in thread: Tomas Frydrych: "Re: overview -- revisions and the piece table"
Maybe reply: Tomas Frydrych: "Re: overview -- revisions and the piece table"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

While I haven't had time to carefully read and think through all the design
work going on in the "revision marks" threads, I'm happy to see that people
are seriously engaging with the complexity of the problem.

The following conceptual overview of how the piece table works might be
helpful for folks who aren't familiar with it.

piece table concepts
--------------------
If you dig into the implementation of the piece table, you'll note that it
includes:

  I. the entire document as originally loaded,
  II. all new content & formatting added during the edit session,
  III. an audit history of all the changes, in order, and
  IV. bookkeeping information to construct a "current" view of the document.

Of these, the rest of AbiWord spends most of its time dealing with IV. III
is created in response to specific edit methods, and is also somewhat
visible via the undo/redo mechanism that allows you to replay them. By
comparison, the first two are pretty low-level and the rest of the AbiWord
never sees them.

For example, the drawing code just references IV, as do the exporters.

revision marks
--------------
AFAICT, the most useful thing about revision marks is the ability to tell
the difference between two states of a document:

- what it was like when someone loaded it, and
- what it was like when they saved it.

In short, while you do want to know overall what the diff between those two
states was -- content that got added, deleted, moved, or reformatted by a
specific contributor -- you *don't* need to know what all of the specific
typing glitches in between were. For example, if an editor retyped a
sentence fifteen different ways before saving it, that's just not useful
information. [Plus which, it's a pain in the ass to keep track of it.]

the insight
-----------
If you look at how the piece table is constructed in memory, there are
basically five chunks of memory.

  - a contiguous buffer of the original document's text (read-only)
  - a contiguous buffer of all new text typed (append-only)
  - the original document's properties (again, read-only)
  - new properties added (again, append-only)
  - a heck of a lot of bookkeeping to tell which parts of each are being
    used in the current version.

In short, I believe that this is enough information to detect all of the
relevant changes at export time.

caveat
------
The following example just shows how to detect text-level changes. There's
definitely an additional level of complexity needed to capture formatting
changes, too.

(Indeed, FWIW, I sometimes think that we should take advantage of the fact
that our file format is text-based, and use a tailored XML-aware version of
diff to handle all of the low-level work here.)

BTW, *do* people use revision marks to keep track of formatting changes too?
What does the UI look like?

for example
-----------
Say that the original document was:

The quick brown fox jumped over the lazy dog.

After a few cuts and pastes, plus a little typing, the new document reads:

The lazy green fox slept next to the quick dog.

In short, the text stored in the piece table can be divided into the
following five equivalence classes. The first three subdivide the content
from the original document as follows:

1. text that didn't change

The quick brown fox jumped over the lazy dog.
^^^^ ^^^^ ^^^^ ^^^^

2. text that's no longer used

The quick brown fox jumped over the lazy dog.
^^^^^^ ^^^^^^^^^^^^

3. text that got moved:

The quick brown fox jumped over the lazy dog.
^^^^^ ^^^^

The other two equivalence classes come from the "other half" of the piece
table, namely:

4. new text that got used

sloppieysleppt on top of ofver greenmowvemauve next to
^^^^ ^ ^^^^^ ^^^^^^^

5. new text that got abandoned

sloppieysleppt on top of ofver greenmowvemauve next to
^^^^^^^^ ^ ^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^

Yes, that's actually what the piece table looks like at the lowest levels.
[Indeed, it's usually even messier.] There are two big streams of text:

- one as imported
- one of all the characters typed in this edit session

All of the rest of the PT machinery just assembles together the relevant
pieces of the text as needed. It's a really, really cool data structure.

so what?
--------
Take a look at the five equivalence classes in the above example. When we
currently persist the file format, the piece table extracts an ordered
sequence of fragments of text that corresponds to the following equivalence
classes:

  1. unchanged text
  3. moved text
  4. new text

Even if we're supporting revision marks, there's still no reason to export
the following equivalence class:

5. new-and-abandoned

However, we *will* want to export the other four, and differentiate them
accordingly:

  1. unchanged text
  2. deleted text
  3. moved text
  4. new text

My hope is that when revision tracking is turned on, we can algorithmically
identify which equivalence class a given frag is in, and then emit and flag
them, in order, at export time. For example, the following are easy:

  #4 is used, and comes from the append buffer
  #1 is used, and comes from the read-only buffer
  #2 is *not* used, and comes from the read-only buffer

The first two (#4 and #1) are easy to detect and sequence properly. #2 can
be inferred (by a brute force scan, if needed), although getting them
ordered properly will be more work.

Thus, the only screw case will be #3. As with #1 and #4, sequencing them is
easy -- that's what the PT is already doing for us -- but it's not
immediately obvious what the best way is to differentiate #3 from #1.
Export-time alternatives might include:

  - scanning change records,
  - maintaining a stack to detect reordered frags, or
  - something more clever.

To be honest, I forget how Jeff and I actually coded the copy/paste logic.
It very well may be that equivalence class #3 gets decomposed into a pair of
add/delete operations. If so, then this isn't even an issue.

next steps
----------
The above explanation just addresses whether and how to detect changes
between the version loaded and the version about to be saved. It does NOT
address:

- how that should be reflected in the file format, NOR
- how small a subset of CVS's branching & merging features to reinvent.

As we've seen, the latter can quickly get way too complex. Still, I hope
that people find this a useful starting point.

Enjoy!

Paul

Next message: Paul Rohr: "design q -- are revisions linear or parallel?"
Previous message: David Chart: "Re: revision marks"
Next in thread: Tomas Frydrych: "Re: overview -- revisions and the piece table"
Maybe reply: Tomas Frydrych: "Re: overview -- revisions and the piece table"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

This archive was generated by hypermail 2.1.4 : Sat May 18 2002 - 13:43:12 EDT