2016-09-20

LZMA parametrization

One fascinating property of the LZMA data compression format is that it is actually a family of formats with three numeric parameters that can be set:

  • The “Literal context bits” (lc) sets the number of bits of the previous literal (a byte) that will be used to index the probability model. With 0 the previous literal is ignored, with 8 you have a full 256 x 256 Markov chain matrix, with probability of getting literal j when the previous one was i.
  • The “Literal position” (lp) will take into account the position of each literal in the uncompressed data, modulo 2^lp. For instance, lp=1 is a better fit for 16-bit data.
  • The pb parameter (position bits) plays the same role in the more general context where repetitions occur.
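
As a rough illustration of how lc and lp pick a probability table for a literal, here is a minimal Python sketch. The index formula follows the publicly documented LZMA specification; the function name `literal_context` is made up for this example and is not taken from Zip-Ada.

```python
def literal_context(pos: int, prev_byte: int, lc: int, lp: int) -> int:
    """Index of the literal probability table used for the byte at `pos`.

    The low `lp` bits of the position and the high `lc` bits of the
    previous byte are combined, so there are 2**(lc + lp) tables.
    """
    pos_part = pos & ((1 << lp) - 1)    # position modulo 2**lp
    prev_part = prev_byte >> (8 - lc)   # top lc bits of the previous byte
    return (pos_part << lc) + prev_part


if __name__ == "__main__":
    # (lc, lp) = (0, 1): only the parity of the position matters, so the
    # low and high bytes of 16-bit samples get separate statistics.
    print([literal_context(p, 0xAB, lc=0, lp=1) for p in range(4)])
    # (lc, lp) = (8, 0): the whole previous byte selects the table.
    print(literal_context(5, 0xAB, lc=8, lp=0))
```

With lc=0 the previous byte drops out of the index entirely, which is the "previous literal is ignored" case described above.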

For instance when (lc, lp, pb) = (8, 0, 0) you have a simple Markov model similar to the one used by the old "Reduce" format for Zip archives. Of course the encoding of this Markov-compressed data is much smarter with LZMA than with "Reduce".
Additionally, you have a non-numeric parameter which is the choice of the LZ77 algorithm – the first stage of LZMA.

The stunning thing is how much changes in these parameters affect compression quality. Let’s take a kind of binary data that is difficult to compress losslessly: raw audio files (.wav), 16-bit PCM.
Running Zip-Ada's lzma_enc with the -b (benchmark) option tries all combinations: 900 different combinations of parameters in total! For many .wav files (but not all), the combination leading to the smallest .lzma archive is (0, 1, 0); see the list at the bottom [1].
It means that the previous byte is useless for predicting the next one, and that the compression has an affinity with 16-bit alignment, which seems to make sense. The data seems pretty random, but the magic of LZMA manages to squeeze about 15% off the raw data, without loss. The fortuitous repetitions are not helpful: the weakest LZ77 implementation gives the best result! Actually, pushing this logic further, I have implemented for this purpose a “0-level” LZ77 [2] that doesn’t do any LZ compression. It gives the best output for most raw sound data. Amazing, isn’t it? It seems that repetitions are so rare that each one costs a very large code through the range encoder, while slightly and temporarily weakening the probability of outputting a literal; see the probability evolution curves in the article “LZMA compression - a few charts”.
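
A similar brute-force search can be sketched with Python's standard `lzma` module. This is not Zip-Ada's lzma_enc and will not reproduce the figures above: the module wraps liblzma, which only accepts lc + lp <= 4 (so 75 combinations instead of the full grid), it offers no "level 0" match finder, and the input here is a synthetic signal standing in for a .wav payload. The helper names `make_pcm16`, `lzma_size` and `benchmark` are invented for the example.

```python
import lzma
import random
import struct


def make_pcm16(n_samples: int, seed: int = 1) -> bytes:
    """Synthetic 16-bit little-endian mono signal (smoothed noise).

    This only stands in for real audio data.
    """
    rng = random.Random(seed)
    level = 0.0
    samples = []
    for _ in range(n_samples):
        level = 0.995 * level + rng.gauss(0.0, 300.0)
        value = int(level + rng.gauss(0.0, 40.0))
        samples.append(max(-32768, min(32767, value)))
    return struct.pack("<%dh" % n_samples, *samples)


def lzma_size(data: bytes, lc: int, lp: int, pb: int) -> int:
    """Size of the .lzma (FORMAT_ALONE) stream for one parameter triple."""
    filters = [{"id": lzma.FILTER_LZMA1, "dict_size": 1 << 20,
                "lc": lc, "lp": lp, "pb": pb}]
    blob = lzma.compress(data, format=lzma.FORMAT_ALONE, filters=filters)
    # Sanity check: the stream must decode back to the original bytes.
    assert lzma.decompress(blob, format=lzma.FORMAT_ALONE) == data
    return len(blob)


def benchmark(data: bytes) -> dict:
    """Compressed size for every (lc, lp, pb) that liblzma accepts."""
    sizes = {}
    for lc in range(5):
        for lp in range(5 - lc):        # liblzma requires lc + lp <= 4
            for pb in range(5):
                sizes[(lc, lp, pb)] = lzma_size(data, lc, lp, pb)
    return sizes


if __name__ == "__main__":
    sizes = benchmark(make_pcm16(10_000))
    # Show the five smallest results, best first.
    for params, size in sorted(sizes.items(), key=lambda kv: kv[1])[:5]:
        print(params, size)
```

To try it on a real recording, replace `make_pcm16(...)` with the raw bytes of the file; which triple wins depends on the data.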
Graphically, the ordered compressed sizes look like this:

[Chart: compressed sizes, sorted]

and the various parameters look like this:

[Chart: The 900 parameter combinations]

[Chart: The best 100 combinations]

Many thanks to Stephan Busch, who maintains what is, to my knowledge, the only public data compression corpus with enough size and variety to be really meaningful for “real life” usage of data compression. You can find the benchmark at http://www.squeezechart.com/ . Stephan is always keen to share his knowledge about compression methods.
____
[1] Here is the directory listing, sorted by descending size (the original file is a2.wav).

37'960 a2.wav
37'739 w_844_l0.lzma
37'715 w_843_l0.lzma
37'702 w_842_l0.lzma
37'696 w_841_l0.lzma
37'693 w_840_l0.lzma
37'547 w_844_l2.lzma
...
32'733 w_020_l0.lzma
32'717 w_010_l1.lzma
32'717 w_010_l2.lzma
32'707 w_011_l1.lzma
32'707 w_011_l2.lzma
32'614 w_014_l0.lzma
32'590 w_013_l0.lzma
32'577 w_012_l0.lzma
32'570 w_011_l0.lzma
32'568 w_010_l0.lzma

[2] In the package LZMA.Encoding you will find the very sophisticated "Level 0" algorithm:

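    if level = Level_0 then
      --  "Level 0": no LZ77 matching at all; every byte is sent as a literal.
      while More_bytes loop
        LZ77_emits_literal_byte(Read_byte);
      end loop;
    else
      --  All other levels: run the actual LZ77 match finder.
      My_LZ77;
    end if;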

Hope you appreciate ;-)

2016-09-10

LZMA compression - a few charts

Here are a few plots that I have set up while exploring the LZMA compression format.

You can pick and choose various LZ77 variants - for LZMA as well as for other LZ77-based formats like Deflate. Of course this choice can be extended to the compression formats themselves. There are two ways of dealing with this choice.
  1. You compress your data with all variants and choose the smallest size: brute force, post-selection. This is what the ReZip recompression tool does.
  2. You have a criterion for selecting a variant before the compression, and hope it will be good enough. This is what Zip.Compress with the Preselection method does (and the ZipAda tool with -eps).
If computing resources (time, or even energy costs: think of massive backups) are somewhat limited, you'll be happy with the second way.
An obvious criterion that emerges from playing with recompression is the uncompressed size (one of the things you know before trying to compress).


Obviously the BT4 variant (one of the LZ77 match finders in the LZMA SDK) is better on larger sizes than IZ_10 (Info-Zip's match finder for their Deflate implementation), but is that always the case? It is difficult to tell from this chart. But if you cumulate the differences, things start to become interesting.


Funny, isn't it? The criterion would be to choose IZ_10 for sizes smaller than the x-value where the green curve reaches its bottom, and BT4 for sizes larger than that x-value.
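
The cumulated-difference criterion can be sketched in a few lines of Python. The sign convention (IZ_10 size minus BT4 size, cumulated over files sorted by uncompressed size) is an assumption about how the curve was built, and both the function name `best_threshold` and the sample figures are made up for illustration.

```python
from itertools import accumulate


def best_threshold(records):
    """records: iterable of (uncompressed_size, size_iz10, size_bt4).

    Returns (threshold, cumulated_minimum): files up to and including
    `threshold` bytes would go to IZ_10, larger ones to BT4.
    """
    ordered = sorted(records)
    # Negative difference: IZ_10 wins on that file; positive: BT4 wins.
    diffs = [iz - bt for _, iz, bt in ordered]
    cumulated = list(accumulate(diffs))
    # The bottom of the cumulated curve marks where IZ_10 stops paying off.
    lowest = min(range(len(cumulated)), key=cumulated.__getitem__)
    return ordered[lowest][0], cumulated[lowest]


if __name__ == "__main__":
    # Invented sizes, only to exercise the function.
    sample = [
        (100, 60, 70),
        (500, 300, 310),
        (2000, 1100, 1105),
        (10000, 5200, 5000),
        (50000, 26000, 24000),
    ]
    print(best_threshold(sample))
```

The cumulated curve falls as long as IZ_10 produces the smaller output and rises once BT4 takes over, so its minimum gives the size at which to switch.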

Another (hopefully) interesting chart is the way the probability model in LZMA (this time, it's the "MA" part explained last time) is adapted to new data. The increasing curves show the effect of a series of '0' on a certain probability value used for range encoding; the decreasing curves show the effect of a series of '1'. On the x-axis you have the number of steps.
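
The adaptation behind such curves can be reproduced with a short Python sketch. It uses the update rule of the LZMA reference design: 11-bit probabilities, moved by 1/32 of the remaining gap at each step. The function names `update` and `curve`, and the starting value 1024 (probability 0.5), are choices made for this example.

```python
PROB_BITS = 11               # probabilities are 11-bit integers
PROB_ONE = 1 << PROB_BITS    # 2048 stands for probability 1.0
MOVE_BITS = 5                # each step closes 1/32 of the gap


def update(prob: int, bit: int) -> int:
    """New estimate of the probability of a '0' after coding `bit`."""
    if bit == 0:
        return prob + ((PROB_ONE - prob) >> MOVE_BITS)
    return prob - (prob >> MOVE_BITS)


def curve(start: int, bit: int, steps: int) -> list:
    """Successive probability values when the same bit is coded repeatedly."""
    values = [start]
    for _ in range(steps):
        values.append(update(values[-1], bit))
    return values


if __name__ == "__main__":
    print(curve(1024, 0, 5))   # a run of '0': the value increases
    print(curve(1024, 1, 5))   # a run of '1': the value decreases
```

Because of the integer shift, the value never reaches 0 or 2048: the curves flatten out just short of both ends, so the range encoder can always code either bit.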


2016-09-01

Negative rates - ever lower!

CHF interest rates. Source: BNS (Swiss National Bank).

Two more tick marks, all at once...
