2016-04-03

Zip-Ada v.50

There is a new version of Zip-Ada @ http://unzip-ada.sf.net .

*** 

In a nutshell, there are now, finally, fast *and* efficient compression methods available.

* Changes in '50', 31-Mar-2016:
  - Zip.Compress.Shrink is slightly faster
  - Zip.Compress.Deflate has new compression features:
     - Deflate_Fixed is much faster, with slightly better compression
     - Deflate_1 was added: strength similar to zlib, level 6
     - Deflate_2 was added: strength similar to zlib, level 9
     - Deflate_3 was added: strength similar to 7-Zip, method=deflate, level 5

I use the term "similar" because the compression strength depends on the algorithms used and on the data, so it may differ from case to case. In the following charts, we have a comparison on the two best-known benchmark data sets ("corpora"), where the similarity with zlib (= Info-Zip, prefix iz_ below) holds, but not at all with 7-Zip-with-Deflate.
In blue, you see non-Deflate formats (BZip2 and LZMA), just as a reminder that the world doesn't stop at Deflate, although Deflate is the topic of this article.
In green, you have Zip archives made by Zip-Ada.
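
To see for yourself how much the result depends on the data, here is a minimal sketch in Python (not Ada, and not using Zip-Ada) that runs zlib's Deflate at levels 1, 6 and 9. The two inputs are made up for illustration: one is very redundant, the other is pseudo-random bytes. The helper name deflate_sizes is hypothetical.

```python
import random
import zlib

def deflate_sizes(data, levels=(1, 6, 9)):
    """Return {level: compressed size in bytes} using zlib's Deflate."""
    return {level: len(zlib.compress(data, level)) for level in levels}

# Two made-up inputs (assumptions, not the benchmark corpora):
# a very redundant text and pseudo-random bytes of the same length.
redundant = b"0,0,0,1,0,0,0,0\n" * 5000
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(len(redundant)))

for name, data in (("redundant", redundant), ("noisy", noisy)):
    # Sizes include zlib's small wrapper (header and checksum).
    print(name, len(data), deflate_sizes(data))
```

On the redundant input every level shrinks the data a lot; on the pseudo-random input no level gains anything, so the gap between levels all but disappears.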

[Charts: compression comparison on the two benchmark corpora]

Here is the biggest surprise I've had when testing randomly chosen data: a 162MB sparse integer matrix (among a bunch of results for a Kaggle challenge), which is very redundant data. First, 7-Zip in Deflate mode gives a comparatively poor compression ratio. Don't worry about 7-Zip: the LZMA mode, native to 7-Zip, is second best in the list. The most surprising aspect is that the Shrink format (LZW algorithm) has a compressed size only 5.6% larger than the best Deflate (here, KZip).

[Chart: compression comparison on the sparse integer matrix]

Typically the penalty for LZW (used for GIF images) is from 25% to 100% compared to the best Deflate (used for PNG images). Of course, at the other end of the redundancy spectrum, data that are closer to random are also more difficult to compress, and the differences between LZW and Deflate necessarily narrow.
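
A rough way to measure such a penalty on your own data is sketched below in Python. It is a textbook LZW, not the exact variant used by the Shrink format, and the size is only an estimate based on code widths growing from 9 bits up to an assumed cap of 13 bits. Deflate is represented by zlib at level 9, not by KZip. All function names are hypothetical.

```python
import zlib

MAX_BITS = 13  # assumed cap on the code width for this sketch

def lzw_codes(data):
    """Textbook LZW: return the list of dictionary codes for `data`."""
    table = {bytes([i]): i for i in range(256)}
    codes, w = [], b""
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                        # keep extending the current match
        else:
            codes.append(table[w])        # emit the longest known string
            if len(table) < (1 << MAX_BITS):
                table[wc] = len(table)    # learn the new string
            w = bytes([byte])
    if w:
        codes.append(table[w])
    return codes

def lzw_size_estimate(data):
    """Estimated LZW output size in bytes, with codes of 9..MAX_BITS bits."""
    bits = 0
    for i, _ in enumerate(lzw_codes(data)):
        table_size = min(256 + i, 1 << MAX_BITS)
        bits += min(MAX_BITS, max(9, table_size.bit_length()))
    return (bits + 7) // 8

def penalty_percent(data):
    """LZW size relative to zlib level 9, as a percentage penalty."""
    deflate = len(zlib.compress(data, 9))  # includes zlib's small wrapper
    return 100.0 * (lzw_size_estimate(data) / deflate - 1.0)

# Made-up sample; replace it with the contents of a real file.
sample = b"the quick brown fox jumps over the lazy dog\n" * 200
print(round(penalty_percent(sample), 1), "% penalty for LZW on this sample")
```

The figure obtained this way depends heavily on the input, so it should not be expected to match the numbers quoted above.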

About Deflate

As you perhaps know, the Deflate format, invented around 1989 by the late Phil Katz for his PKZip program, performs compression in two steps by combining an LZ77 algorithm with Huffman encoding.
In this edition of Zip-Ada, two known algorithms (one for LZ77, one for finding an appropriate Huffman encoding based on an alphabet's statistics) are combined, probably for the first time, within the same software.
Additionally, the determination of compressed blocks' boundaries is done by an original algorithm (the Taillaule algorithm) based on similarities between Huffman code sets.
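
To illustrate what "finding a Huffman encoding based on an alphabet's statistics" means, here is a minimal Python sketch of the textbook heap-based construction, which turns symbol counts into code lengths. It is not the algorithm used in Zip-Ada, and it ignores Deflate's 15-bit limit on code lengths; the function name is hypothetical.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {symbol: code length in bits} from the symbol counts of `data`."""
    freq = Counter(data)
    if len(freq) == 1:
        # A lone symbol still needs one bit.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, smallest symbol in subtree, symbols in subtree).
    # The second field is unique per entry, so ties are broken deterministically.
    heap = [(w, sym, [sym]) for sym, w in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        w1, t1, s1 = heapq.heappop(heap)
        w2, t2, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes all their symbols one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, min(t1, t2), s1 + s2))
    return lengths

# 'a' occurs 4 times, 'b' twice, 'c' once: frequent symbols get shorter codes.
print(huffman_code_lengths(b"aaaabbc"))
```

In Deflate the symbols are not raw bytes but the literals, match lengths and distances produced by the LZ77 step; the principle of giving shorter codes to more frequent symbols is the same.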

*** 

Zip-Ada is a library for dealing with the Zip compressed archive
file format. It supplies:

 - compression with the following sub-formats ("methods"):
     Store, Reduce, Shrink (LZW) and Deflate
 - decompression for the following sub-formats ("methods"):
     Store, Reduce, Shrink (LZW), Implode, Deflate, BZip2 and LZMA
 - encryption and decryption (portable Zip 2.0 encryption scheme)
 - unconditional portability - within the limits of the compiler's provided
     integer types and the target architecture's capacity
 - input (archive to decompress or data to compress) can be any data stream
 - output (archive to build or data to extract) can be any data stream
 - types Zip_info and Zip_Create_info to handle archives quickly and easily
 - cross-format compatibility with a wide variety of tools and file formats
     based on the Zip format: 7-zip, Info-Zip's Zip, WinZip, PKZip, Java's
     JARs, OpenDocument files, MS Office 2007+, Nokia themes, and many others
 - task safety: this library can be used ad libitum in parallel processing
 - endian-neutral I/O

Enjoy!
