What data structure is used to store the text in a QTextDocument?

I’ve always wondered what data structures are used by text editors to work with text so fast and smoothly, especially when the user is typing in the middle of a large document. I had already known about such data structures as ropes and gap buffers or splitting the document into smaller chunks (lines or paragraphs), but the implementation of a QTextDocument in Qt got my attention because it looked like the whole document content were stored in a huge continuous string:

class Q_GUI_EXPORT QTextDocumentPrivate : public QObjectPrivate
{
    // ...

private:
    QString text; // <-- the whole content, including tables and images, is indeed stored here :-)
};

The fact that this algorithm is absolutely undocumented and Google couldn’t find anything about it on the Internet, encouraged me to study the source code and write this blog post.

The best way to figure out how something works is to make some experiments with it and an empty document looks like the most obvious starting point:

QTextDocument doc;
qDebug() << doc.toRawText(); // ""
qDebug() << QTextDocumentPrivate::get(&doc)->buffer(); // "\u2029"

As expected, the internal buffer of an empty document is empty too. The code point 0x2029 is just a paragraph separator defined in Unicode Annex 13 and can be ignored for now. Qt omits it when calling doc.toRawText(), but it still appears in some places, the most prominent example being QTextDocument::characterCount(), which always returns 1 for an empty document.

Appending some text just appends it to the buffer too: (a pedantic reader might already notice a catch in the output below, but don’t worry if you didn’t - me too)

QTextCursor cursor(&doc);
cursor.insertText("Hello, World");
qDebug() << doc.toRawText(); // "Hello, World"
qDebug() << QTextDocumentPrivate::get(&doc)->buffer(); // "\u2029Hello, World"

Things are getting interesting when replacing the first word with another one:

cursor.movePosition(QTextCursor::StartOfBlock);
cursor.movePosition(QTextCursor::EndOfWord, QTextCursor::KeepAnchor);
cursor.insertText("Goodbye");
qDebug() << doc.toRawText(); // "Goodbye, World"
qDebug() << QTextDocumentPrivate::get(&doc)->buffer(); // "\u2029Hello, WorldGoodbye"

What happened here? The output of doc.toRawText() shows the intended text, but obviously differs from the internal buffer, whose content looks strange.

To explain this, we need to consider a very interesting data structure called piece table (please, read the linked Wikipedia article first if you’re unaware what it is) and another part of QTextDocument - its fragment map:

class Q_GUI_EXPORT QTextDocumentPrivate : public QObjectPrivate
{
public:
    typedef QFragmentMap<QTextFragmentData> FragmentMap;

private:
    QString text; // <-- the whole content, including tables and images, is indeed stored here :-)
    FragmentMap fragments; // <-- but what is it?!
};

What is a fragment map? At first sight, it looks like a typical red-black binary search tree, but with an interesting property: it has no keys! Indeed, each node stores its “size” and the total size of its left child, but not the position in the final text displayed to the user.

How is it possible? Imagine, a widget (e.g. QTextEdit) wants to find the piece (Qt calls it a “fragment”, hence the name) containing the 1000th character in the text. It starts with the root node and checks if the given position is larger than its “left size”, but less or equal than its “left size” plus the node’s own size. If it is, the node (and, thus, the fragment) was found. If not, it checks if the position is smaller or equal than the node’s “left size”. If it is, the algorithm is repeated for the left subtree. Otherwise, for the right one. Since the tree is (almost) balanced, the time is logarithmic in the worst case.

It’s probably better to explain it by repeating the “Goodbye, World” experiment, but checking the document’s fragment map after each step this time.

An empty document contains nothing but the paragraph separator in its buffer and a single black node pointing to it: (note that “position” refers to the position in the buffer, not to the position in the final text; node indexes start with 1, index 0 is used both for the header and the NIL nodes)

The fragment map of an empty document.

When inserting “Hello, World”, the new text is appended to the buffer and a new node in the fragment tree is created:

The fragment map after inserting 'Hello, World'.

Here is the catch mentioned above: obviously, the paragraph separator should be placed at the end of a paragraph (side note: paragraphs are called “blocks” in Qt), not at the beginning! But that’s how the data structure works: new content can only be appended at the end, and its position is changed (remapped) using the fragment map.

When the first word is replaced with “Goodbye”, a bit more complicated things happen. First, the first fragment is cut in two:

Intermediate state of the fragment map when replacing the word 'Hello' with 'Goodbye'.

The buffer hasn’t been changed yet, only the fragment map.

A side note: despite rich text formatting being out of scope of this post, I wanted to show this intermediate step separately, because that’s also how formatting works in Qt. For instance, if the user just wanted to make “Hello” pink, a new node fragment would be created too and its “format” property (not shown on the picture) changed. The buffer itself stays untouched when only text formatting is changed.

Now, the new text (“Goodbye”) is appended to the buffer and the pointer of the first fragment is adjusted:

The fragment map after replacing the word 'Hello' with 'Goodbye'.

The text “Hello” (shown gray on the picture) is garbage now and not used anymore.

When the undo/redo feature of QTextDocument is enabled (it is by default), this unused text is still referenced by the items on the undo stack: if the user presses Ctrl+Z, the old fragment map is restored, and now it’s “Goodbye” being the garbage.

When this history is unnecessary and disabled, a kind of “garbage collection” happens. Qt tracks the number of unused characters in the member variable called unreachableCharacterCount. When it becomes too high, a new buffer is created and only the reachable text (“Goodbye, World” in this case) is copied into it. A new single-node tree is also created, and its only node spans the whole buffer.

That’s how text is stored in QTextDocument. Note that it also has another tree called “block map” to split the text in blocks (paragraphs). Maybe I’ll write another blog post about it in the future, but it’s not that interesting as the fragment map covered above.