Some ideas for Lempel-Ziv 77 variants

Job van der Zwan, July 10, 2022

Like many programmers I have the following problem: I like to come up with possible solutions to problems I don't actually have. For example, I have been mulling over some ideas for Lempel-Ziv compression schemes. I also don't have any interest in writing and maintaining a new compression format.

Some of these ideas feel like they would be especially relevant on limited hardware, like an embedded context, but since don't work in embedded contexts I have no idea of how well that hunch holds up nor a realistic use-case to test it on. This is a braindump of these ideas in case there's something useful among it.

What this article is going to do is describe variations for the Lempel-Ziv 77 compression algorithm. If you're already familiar and comfortable with this algorithm you can skip the next sections, but I want to make this text as accessible as possible so we'll also introduce the "base"algorithm.

The basics of a Lempel-Ziv 77 compression scheme

Let's start with setting up a "baseline" to make variations of. Say that we wish to compress the lyrics of an old George Gershwin song:

You like po-tay-to and I like po-tah-to
You like to-may-to and I like to-mah-to
Po-tay-to, po-tah-to, to-may-to, to-mah-to
Let's call the whole thing off

We assume UTF8 encoding, so one byte per character. Clearly there is some repetition in this text. The Lempel-Ziv family of compressors works by basically telling the computer "ok, here I want you to repeat n bytes from m spaces earlier". Let's look at the first sentence as an example:

You like po-ta-yto and I like po-tah-to

I underlined the repeated substring: " like po-ta". Note that we should include the leading space, for a total of eleven characters. What an LZ compressor does is basically say:

You like po-tay-to and I(11,20)h-to

The eleven refers to the length of the repeated substring, and the twenty refers to how far back in the text we can find the substring that we copy over. While conceptually simple, the devil is in the implementation details. For starters, how does the decompressor now when we're dealing with bytes, and when we're dealing with one of these (length, distance) pairs? There is an entire zoo of different variations of LZ algorithms, and honestly a lot of them kind of just boil down to having a different answer to this question. One of the more elegant and simple approaches (in my opinion) is the LZ4 algorithm.

How to encode Lempel-Ziv compression "LZ4" style

LZ4 is a recent-ish LZ77 flavor by Yann Collet. The main design goal is to be really, really fast. Obviously that partially depends on how efficiently its algorithm is implemented, but the design of algorithm itself also matters. In that regard, the algorithm is fast by being so shockingly simple that when implemented it's really just a couple of for-loops copying data all the time. That simplicity makes it a good starting point for an example implementation.

One observation of LZ4 is that we're not actually dealing with pairs of (length, distance), we're dealing with triplets of (uncompressed bytes length, copied bytes length, copied bytes distance). Let's look at that compressed sentence again:

You like po-tay-to and I(11,20)h-to

Before we reach the (11,20) pair, we first go through an uncompressed substring of bytes: "You like po-tay-to and I", which is a total of 24 characters. So we can rewrite this first part as follows:

(24,11,20)You like po-tay-to and I

Using triplets solves our problem of distinguishing uncompressed bytes and pairs. We start by reading three bytes. The first tells us the length of uncompressed bytes, the second the length of copied (and therefore compressed) bytes, and the third tells us from how far back we should start copying those bytes.

Then, once we're done with that first section, we read a new triplet, followed by the uncompressed bytes, followed by copying the bytes indicated by the triplet. Then we read another triplet, and so on until we reach the end of the text:

(24,11,20)You like po-tay-to and I(5,9,40)h-to
(4,17,40)to-m(0,5,20)(0,5,40)h-to
(1,8,71)P(1,10,61),(1,10,53),(1,11,43),
(30,0,0)Let's call the whole thing off

Since this doesn't really show how much shorter it is, here it is again with the triplets replaced by XYZ characters representing each byte in the triplet:

XYZYou like po-tay-to and IXYZh-to
XYZto-mXYZXYZh-to
XYZPXYZ,XYZ,XYZ,
XYZLet's call the whole thing off

Some observations:

A byte can have values between 0 and 255. What if our uncompressed or compressed byte length is longer than that?
Each triplet has a cost, an "overhead" so to speak, of three bytes. To actually save data we have to copy substrings of four bytes or longer
Being only able to copy from the previous 255 bytes gives us a really small window. With larger inputs we're much more likely to find longer repeated substrings if we can look further back

Let's address these issues one by one.

Encoding lengths bigger than a byte value

There are a couple of ways we could handle it. For example we could make each entry in our triplet two bytes by default. That would let us handle lengths up to 65536. That would have two downsides though: most of the time these lengths are much smaller than that, meaning we waste a byte, and also, what if we have lengths larger than that? We need something a little more flexible

To solve this Yann Colet came up with an extremely simple (read: fast) option. If one of our triplet values is 255 we read another byte and add it to our current value. If that byte also has a value of 255 we repeat this. Let's use a concrete example: say we want to encode the triplet (724, 255, 12). This would convert to the following sequence of bytes:

[255, 255, 214, 255, 0, 12]

Let's decode this: we read the first byte, which sets the uncompressed bytes length to 255. Because the byte we read was 255, we read another byte, which also happens to be 255. We add that to the current length, which becomes 510, and read another byte. This one has a value of 214, which adds up to 714. Because it's less than 255 we know we are done with the uncompressed byte length.

Now we move on to the compressed byte length. Unfortunately it's exactly 255, so we have to read another byte with a value of zero to determine that we don't have a larger compressed byte length. Then we can finally move on to the distance byte, which reads twelve. And now we've successfully decoded the triplet (724, 255, 12)