Thursday, 18 August 2016

LZMA compression explained

This summer vacation's project was completed almost on schedule: write an LZMA encoder while still enjoying the vacation. That meant working early in the morning and late in the evening, when everybody else was sleeping, and having fun (biking, canoeing, visiting caves and amazing dinosaur facsimiles, enjoying special beers, ...) the rest of the day.

Well, "schedule" is a bit of a stretch, because with a topic as tricky as data compression, it is difficult to tell when, or even whether, you will succeed...

LZMA is a compression format invented by Igor Pavlov, which combines LZ77 compression with range encoding.

With LZ77, imagine you are copying a text, character by character, but want to take some shortcuts. You send either single characters, or a pair of numbers (distance, length) meaning "please copy 'length' characters, starting back 'distance' characters in the copied text, from the point where the cursor is right now". That's it!
LZ77 is a well-covered subject and is the first stage of many compression algorithms. Basically, you can pick and choose an implementation, judging it by the final compressed size.
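To make the "copy with shortcuts" idea concrete, here is a minimal sketch of the receiving side, in Python rather than Ada. The token representation (a one-character string for a literal, a (distance, length) tuple for a copy) and the name lz77_decode are my own choices for illustration; they are not taken from Zip-Ada or from the LZMA format.

```python
def lz77_decode(tokens):
    """Rebuild a text from literals and (distance, length) copy commands."""
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.append(token)                 # single character, sent as is
        else:
            distance, length = token
            start = len(out) - distance       # 'distance' characters back from the cursor
            for i in range(length):
                # Character by character, so a copy may overlap the text it
                # is producing (length > distance), e.g. for long repetitions.
                out.append(out[start + i])
    return "".join(out)


print(lz77_decode(["a", "b", "c", (3, 6)]))
```

The overlapping copy is what makes a run such as "aaaaa" cheap: one literal followed by (1, 4).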

Range encoding is a fascinating way of compressing a message of any nature. Say you want to send a very large number N, but with fewer digits. It's possible, if some of the digits (0 to 9) appear more frequently than others. The method is the following.
You begin with a range, say [0, 1000[ (that is, the integers 0 to 999).
You subdivide it into ten intervals, corresponding to the digits 0 to 9, and sized according to their probability of occurrence, p0 .. p9. The first digit of N is perhaps 3, and its corresponding interval is, say, [295, 405[.
Then, you continue with the second digit by subdividing [295, 405[ into ten intervals. If the second digit is 0, you have perhaps now [295, 306[, representing the partial message "30". You see, of course, that if you want to stick with integers (with computers you don't have infinite precision anyway), you quickly lose precision when you set up the ten intervals with the probabilities p0 .. p9. The solution is to append a 0 to the bounds of the interval from time to time, when the width is too small. So, if you decide to multiply everything by 10 each time the width is less than 100, then the interval for "30" will now be [2950, 3060[.
Some n digits later (after n subdivisions and some x10 when needed), your interval will perhaps look like [298056312, 298056701[. The bounds become larger and larger: second problem. Solution: you see that the leftmost digits won't change anymore. You can get rid of them and send them as a chunk of the compressed message. The compression is better when some symbols are much more frequent than others: the closer a symbol's probability is to 1, the more the range width is preserved. If the probability were exactly 1, the width wouldn't change at all and this trivial message, made of the same symbol repeated, wouldn't take any space in its compressed form! It is an absurd case, but it shows why compression methods such as LZMA are extremely good for very redundant data.
That's how the basic range encoding works.
Then, a funny thing is that you can encode a mix of different alphabets (say digits '0' to '9' and letters 'A' to 'Z') or even the same alphabet, but with different probabilities depending on the context, provided the decoder knows what to use when. That's all for range encoding (you will find a more detailed description in the original article [1]).
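The interval subdivision can be tried out in a few lines. The following is only a toy sketch in Python: it sidesteps the precision problem by using exact fractions on [0, 1[ instead of integers with the "x10" trick, so it is not a practical encoder, and the three-symbol alphabet with its probabilities is made up for the example.

```python
from fractions import Fraction


def cumulative(probs):
    """Give each symbol its own sub-interval [start, end[ of [0, 1[."""
    intervals, start = {}, Fraction(0)
    for symbol, p in probs.items():
        intervals[symbol] = (start, start + p)
        start += p
    return intervals


def encode(message, probs):
    """Narrow the interval once per symbol; return (low, width)."""
    intervals = cumulative(probs)
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        start, end = intervals[symbol]
        low, width = low + width * start, width * (end - start)
    return low, width


def decode(value, length, probs):
    """Any value inside the final interval leads back to the message."""
    intervals = cumulative(probs)
    out = []
    for _ in range(length):
        for symbol, (start, end) in intervals.items():
            if start <= value < end:
                out.append(symbol)
                value = (value - start) / (end - start)  # zoom into the sub-interval
                break
    return "".join(out)


# Hypothetical probabilities: '3' is frequent, '0' and '7' are rarer.
probs = {"3": Fraction(1, 2), "0": Fraction(1, 4), "7": Fraction(1, 4)}
low, width = encode("3033", probs)
print(low, width, decode(low, 4, probs))
```

The final width is the product of the probabilities of the symbols sent, which is why frequent symbols cost little: they barely shrink the interval.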

LZMA's range encoder works exclusively on a single, binary alphabet (0's and 1's), so the range is always divided into two parts. But it works with lots of contextual probabilities. With some parameters you can have millions of different probabilities in the model! The probabilities are not known in advance, so in this respect LZMA is a purely adaptive compression method: the encoder and the decoder adapt the probabilities as the symbols are sent and received. After each bit encoded, sent, received and decoded, the entire probability set is (and has to be) in exactly the same state in the encoder and in the decoder.
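Here is a sketch of such an adaptive binary range coder, in Python rather than Ada. It is not the LZMA.Encoding source: it follows the scheme of the reference LZMA coder as far as I know it (11-bit probabilities, an adaptation shift of 5, byte-wise renormalization below 2**24), and the two-entry model where the context is simply the previous bit is a made-up example, far simpler than LZMA's real contexts.

```python
PROB_BITS = 11                       # probabilities are 11-bit integers
PROB_INIT = 1 << (PROB_BITS - 1)     # 1024 / 2048: 0 and 1 equally likely
MOVE_BITS = 5                        # adaptation speed
TOP = 1 << 24                        # renormalize when the width gets below this


def adapt(prob, bit):
    """Nudge the probability of a 0 towards the bit that was just seen."""
    if bit == 0:
        return prob + (((1 << PROB_BITS) - prob) >> MOVE_BITS)
    return prob - (prob >> MOVE_BITS)


class Encoder:
    def __init__(self):
        self.low, self.width = 0, 0xFFFFFFFF
        self.cache, self.cache_size = 0, 1
        self.out = bytearray()

    def _shift_low(self):
        # Output the pending byte(s) once we know whether a carry affects
        # them; a top byte of 0xFF stays pending.
        if self.low < 0xFF000000 or self.low >= (1 << 32):
            carry = self.low >> 32
            pending = self.cache
            while self.cache_size > 0:
                self.out.append((pending + carry) & 0xFF)
                pending = 0xFF
                self.cache_size -= 1
            self.cache = (self.low >> 24) & 0xFF
        self.cache_size += 1
        self.low = (self.low & 0x00FFFFFF) << 8

    def encode_bit(self, probs, context, bit):
        bound = (self.width >> PROB_BITS) * probs[context]
        if bit == 0:
            self.width = bound
        else:
            self.low += bound
            self.width -= bound
        probs[context] = adapt(probs[context], bit)
        while self.width < TOP:
            self.width <<= 8
            self._shift_low()

    def finish(self):
        for _ in range(5):           # flush the pipeline
            self._shift_low()
        return bytes(self.out)


class Decoder:
    def __init__(self, data):
        self.data, self.pos = data, 1        # byte 0 is always zero
        self.width, self.code = 0xFFFFFFFF, 0
        for _ in range(4):
            self.code = (self.code << 8) | self._next_byte()

    def _next_byte(self):
        byte = self.data[self.pos] if self.pos < len(self.data) else 0
        self.pos += 1
        return byte

    def decode_bit(self, probs, context):
        bound = (self.width >> PROB_BITS) * probs[context]
        if self.code < bound:
            self.width, bit = bound, 0
        else:
            self.code -= bound
            self.width -= bound
            bit = 1
        probs[context] = adapt(probs[context], bit)   # same update as the encoder
        while self.width < TOP:
            self.width <<= 8
            self.code = (self.code << 8) | self._next_byte()
        return bit


def compress(bits):
    probs, previous, encoder = [PROB_INIT, PROB_INIT], 0, Encoder()
    for bit in bits:
        encoder.encode_bit(probs, previous, bit)      # context = previous bit
        previous = bit
    return encoder.finish()


def decompress(data, count):
    probs, previous, decoder, bits = [PROB_INIT, PROB_INIT], 0, Decoder(data), []
    for _ in range(count):
        previous = decoder.decode_bit(probs, previous)
        bits.append(previous)
    return bits
```

Both sides call the same adapt function on the same bit, in the same order, which is what keeps the two probability sets identical. In this sketch the decoder has to be told how many bits to expect; a real stream would carry a length or an end-of-stream marker.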

Developing an encoder from scratch, even if you have open-source code to reproduce, is fun, but debugging it is a pain. A bug feels like something that doesn't work in a maths PhD thesis: there is no way to get help from anybody or by browsing the Web. By nature, the compressed data will not contain any redundancy that would help you fix bugs. The decoder is confused by faulty compressed data and cannot say why. For range encoding, it is worse: as in the example, the digits sent have nothing to do with the message to be encoded. The interval subdivision, the shipping of the leading interval digits, and the appending of trailing '0's occur in a way which is completely asynchronous. So, the good tactic is, as elsewhere, to simplify and break the issues down into the simplest possible cases.
First, manage to encode an empty message (wow!). It seems trivial, but the range encoder works like a pipeline; you need to initialize it and flush it correctly. Then, an empty message with the end-of-stream marker. And so on.
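Another source of help for LZMA is the probability set: as said before, it needs to be identical in the encoder and in the decoder at every point, so the first place where the two sets differ tells you where to look.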
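One way to exploit this, sketched in Python under my own assumptions (it is not code from Zip-Ada): have the encoder and the decoder each record a snapshot of their probabilities after every bit, then look for the first step where the snapshots differ. The name first_divergence and the trace format (a list of tuples of probabilities) are hypothetical.

```python
def first_divergence(encoder_trace, decoder_trace):
    """Return (step, encoder_state, decoder_state) at the first mismatch, or None.

    Each trace is a list of snapshots of the probability set, one per bit.
    A missing snapshot (one side stopped earlier) is reported as None.
    """
    steps = max(len(encoder_trace), len(decoder_trace))
    for step in range(steps):
        enc = encoder_trace[step] if step < len(encoder_trace) else None
        dec = decoder_trace[step] if step < len(decoder_trace) else None
        if enc != dec:
            return step, enc, dec
    return None


# Made-up traces: the two sides disagree from the third bit on.
enc = [(1056, 1024), (1086, 1024), (1052, 1024)]
dec = [(1056, 1024), (1086, 1024), (1086, 992)]
print(first_divergence(enc, dec))
```

The step number points at the bit where encoder and decoder stopped agreeing, which is usually much closer to the bug than the point where the decoder finally produces garbage.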

The results of this effort in a few numbers:
  • LZMA.Encoding, started July 28th, first working version August 16th (revision 457).
  • Fewer than 450 lines - including lots of comments and some debugging code to be removed!
  • 5 bugs had to be fixed.

In my (of course biased) opinion, this is the first LZMA encoder that a normal human can understand by reading the source code.

Zip-Ada's Zip.Compress makes use of LZMA encoding since revision 459.

The source code is available here (main SourceForge SVN repository) or here (GitHub mirror).

Back to the vacation topic (which is what you often do when you're back from vacation): a tourist info sign was just perfect for a 32x32-pixel "info" icon for the AZip archive manager.

The beautiful sign

By the way, some other things are beautiful in this town (St-Ursanne, on the Doubs river)...

[1] G. N. N. Martin, Range encoding: an algorithm for removing redundancy
   from a digitized message, Video & Data Recording Conference,
   Southampton, UK, July 24-27, 1979.

Thursday, 7 July 2016

GLOBE_3D: now, a bit of fog...


Here is the code activating the fog in the background.

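    if foggy then
      Enable (FOG);
      Fog (FOG_MODE, LINEAR);           --  Fog grows linearly with depth
      Fog (FOG_COLOR, fog_colour(0)'Unchecked_Access);
      Hint (FOG_HINT, FASTEST);         --  Favour speed over quality
      Fog (FOG_START, 1.0);             --  Depth at which the fog begins
      Fog (FOG_END, 0.4 * fairly_far);  --  Depth at which the fog hides everything
    end if;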

As usual with GL, it looks very obvious, but (as usual too) it is one of the few combinations that actually work.
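For the curious, the linear mode boils down to a simple blend. The following Python sketch shows the arithmetic that GL's fixed-function linear fog performs per fragment, as I understand it; the function names are mine, and the value used for fairly_far is an arbitrary stand-in since the real one is defined elsewhere in GLOBE_3D.

```python
def linear_fog_factor(depth, fog_start, fog_end):
    """1.0 means no fog at all, 0.0 means fully fogged; clamped to [0, 1]."""
    factor = (fog_end - depth) / (fog_end - fog_start)
    return max(0.0, min(1.0, factor))


def apply_fog(colour, fog_colour, depth, fog_start, fog_end):
    """Blend an RGB colour with the fog colour according to the depth."""
    f = linear_fog_factor(depth, fog_start, fog_end)
    return tuple(f * c + (1.0 - f) * g for c, g in zip(colour, fog_colour))


fairly_far = 1000.0                  # hypothetical value, for illustration only
fog_start, fog_end = 1.0, 0.4 * fairly_far
for depth in (1.0, 200.5, 400.0, 900.0):
    print(depth, linear_fog_factor(depth, fog_start, fog_end))
```

Everything nearer than FOG_START is untouched, and everything beyond FOG_END takes the fog colour entirely, which is why a FOG_END well below the far clipping distance makes the background fade out.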

Wednesday, 6 July 2016

GLOBE_3D Release 2016-07-05 - "Blender edition"

GLOBE_3D is a GL Object Based 3D engine realized with the Ada programming language.

Latest additions:
  • Use of Generic Image Decoder (GID) in GL.IO; now most image formats are supported for textures and other bitmaps to be used with GLOBE_3D (or any GL app)
  • New Wavefront format (.obj / .mtl) importer
  • Doom 3 / Quake 4 map importer more complete
  • Unified GNAT project file (.gpr), allowing you to select the target Operating System (Windows, Linux, Mac) and compilation mode (fast, debug, small) for demos, tools, etc.
  • Project file for ObjectAda 9.1+ updated
The first two points facilitate the import of 3D models from software such as Blender.
Here is an example:
Click to enlarge
Coincidentally, the Wavefront file format is so simple that you can also write 3D models "by hand" in that format. An example made in an Excel sheet is provided along with the importer, in the ./tools/wavefront directory.
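To give an idea of how little there is to it, here is a small Python sketch that writes a tetrahedron as Wavefront text, using only the "o" (object name), "v" (vertex) and "f" (face, with 1-based vertex indices) records. The function name to_wavefront is made up, and I have not checked this particular output against the GLOBE_3D importer, so take it as an illustration of the format rather than a tested model.

```python
def to_wavefront(name, vertices, faces):
    """Return the text of a minimal .obj file.

    vertices: list of (x, y, z); faces: list of tuples of 0-based vertex indices.
    """
    lines = [f"o {name}"]
    lines += [f"v {x} {y} {z}" for x, y, z in vertices]
    # Wavefront face records count vertices from 1, not 0.
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines) + "\n"


tetrahedron = to_wavefront(
    "tetra",
    vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)],
    faces=[(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)],
)
print(tetrahedron)
```

A spreadsheet works just as well: one column for the record letter, three for the numbers, then export as space-separated text.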


Monday, 4 July 2016

Hit and sunk

New in June 2016: all so-called "risk-free" rates, up to 30 years, are negative.
Currently, the only risk you do not run with these bonds is that of getting rich...


Sunday, 3 July 2016

GLOBE_3D: non-convex objects with transparency

It's stunning how the inventors of GL addressed from the beginning, in 1991, subtle issues that pop up when you display 3D objects in your own program 25 years later.
For instance, take this model:

No alpha test.

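It is a cross-shaped object (seen from above); its texture has lots of transparency.
In the red rectangle you see the issue: the face in front was displayed before the face behind. Its transparent pixels had already been recorded in the depth buffer, so the face behind was rejected there and the background shows through instead.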
There is no bullet-proof rule for sorting faces, and GL has a per-screen-pixel depth buffer that allows displaying faces in an arbitrary order. So we don't want to introduce imperfect face sorting just for dealing with this kind of object.
Fortunately, the GL geniuses have invented a solution for that issue too:
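    Enable    (ALPHA_TEST);
    --  Fragments with alpha <= 0.05 are discarded: they change neither
    --  the colour buffer nor the depth buffer.
    AlphaFunc (GREATER, 0.05);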
Et voilà...
Alpha test.
The model, "herbe01.obj", is in the ./tools/wavefront directory in the GLOBE_3D repository.
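The interplay between drawing order, depth buffer and alpha test can be replayed on a single pixel. This is a deliberately simplified Python model of my own, not GLOBE_3D or GL code: blending is reduced to "a fully transparent fragment leaves the colour unchanged", and the fragment tuples and names are invented for the example.

```python
def render_pixel(fragments, alpha_test=False, threshold=0.05):
    """Return the colour seen at one pixel.

    fragments: list of (depth, colour, alpha) in drawing order.
    """
    colour, nearest = "background", float("inf")
    for depth, frag_colour, alpha in fragments:
        if alpha_test and not alpha > threshold:
            continue                  # discarded: colour and depth both untouched
        if depth < nearest:           # depth test: only nearer fragments pass
            nearest = depth           # the depth is written even if transparent
            if alpha > 0.0:
                colour = frag_colour  # simplified blending
    return colour


front = (1.0, "front face", 0.0)      # transparent texel of the face in front
behind = (2.0, "face behind", 1.0)    # opaque texel of the face behind

print(render_pixel([front, behind]))                   # front drawn first: the bug
print(render_pixel([behind, front]))                   # lucky drawing order
print(render_pixel([front, behind], alpha_test=True))  # order no longer matters
```

Without the alpha test, the result depends on the drawing order; with it, the transparent fragment never reaches the depth buffer, so the face behind is drawn whatever the order.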



Wednesday, 22 June 2016

GLOBE_3D: most image formats now available for textures

The texture loader in GL.IO was around 15 years old and supported only the Targa (.tga) format for textures, plus a few sub-formats of Windows bitmaps (.bmp).
In order to make things easy when dealing with various models, e.g. those imported from Blender, the old code for reading images has been wiped out and the loader now uses GID for the job, supporting JPEG and PNG in addition. For instance, the Blender model below uses the JPEG format for its textures.

Futuristic Combat Jet (hi poly version) by Dennis Haupt (DennisH2010)

The following Blender model has a single PNG texture projected on a complicated surface called a Mandelbulb (never heard of it before!):

Mandelbulb 3D Panorama 3 by DennisH2010



Tuesday, 21 June 2016

Wavefront importer for GLOBE_3D

Basically, it is now possible to import a model saved in Blender as a Wavefront (.obj) model, and turn it into a GLOBE_3D object:

Futuristic Combat Jet by Dennis Haupt (DennisH2010)
