Monday, December 1, 2014

.NET Compression Libraries Benchmark

I often find myself needing a compression library in a software development project, and after a few tests with disappointing results I usually fall back to spawning 7Zip as an external process. This time however, the application specification prohibits any data written to disk, which 7Zip needs to do in order to work correctly. I needed to do everything in memory, and so I had to test some of the C# libraries available for compression as well as the different compression algorithms.

Here is a full list of the tested libraries:

 

 Compression 101

Before I get to the results of the benchmark, I think it is important to understand the different archive formats and compression algorithms in order to compare the results with other benchmarks/and or tests. Lets get to it!

A compression algorithm can be compared to a small machine that takes in some data, do some math on it and transform it to some new data - just smaller in size. It sounds simple, but it involves a lot of math to do it efficient. The best compression algorithms are the ones which have a high compression ratio while only using small amounts of RAM and CPU. One such algorithm is the very well known DEFLATE which internally uses the famous LZ77 algorithm created by Abraham Lempel and Jacob Ziv. The DEFLATE algorithm has become the de facto standard when it comes to compression, and it is used in nearly all instances when a compression algorithm is needed. There are many compression algorithms out there, some are better than DEFLATE and also very wide spread. Examples include LZW, BWT, LZSS and LZMA.

Each of the compression algorithms only take in a single file, either as a blob of bytes, or a stream of bytes. The question is, how do compression applications such as 7Zip, WinZip and WinRAR compress multiple files?

Enter the world of archive formats. We all know about Zip, BZip2, GZip and RAR. They all simply describe the content of themselves either in terms of files or streams of data, along with some metadata such as original size, file name and path. The archive formats are more complex that you would think, simply because there are a lot of features. Some of them support multiple data streams (for multi-threading), encryption, password protection, file name obfuscation, self-extracting archives and more.

Figure 1: The relationship between files, compression algorithms and archive formats.

Many of the archiving formats are now synonymous with the compression algorithm they use. We say "Zip the file" or "Make a 7z of the folder". In reality, you can interchange the different archive formats with any of the compression algorithms, and as such you can have combinations such as Zip with LZMA or 7z with DEFLATE. That being said, there are some default compression algorithms in some archive formats. I've listed some examples here in archive/algorithm format:
  • Zip / DEFLATE
  • 7zip / LZMA
  • Bzip2 / BWT
  • Gzip / DEFLATE
  • GIF / LZW
  • PNG / DEFLATE
  • Tar / none
  • Rar / LZSS
The avid reader noticed that I put a couple of image formats in the list (GIF, PNG). Both of them use compression algorithms to minimize the size of the image data, and are therefore technically archive formats. The same applies to Microsoft Excel 2013 files (xlsx / DEFLATE) and Java Archives (Jar / DEFLATE).

In the Linux world, the solution to compressing multiple files was solved using an archiving format called Tar. It is a pretty simply format designed for tape archiving, which preserves the filesystem information such as timestamps, owner information and file size on the files held inside the archive. To compress a directory, Tar simply store all the files within the directory inside a single file, then compress it using DEFLATE (gzip) or BWT (bzip2).

 Benchmark

Now that we know the difference between compression algorithms and archive formats, we can continue to the benchmark. The libraries tested here are overlapping in the sense that they are simply different implementations of different compression algorithms, and even though the name of the library might contain Zip, it has nothing to do with the actual Zip archiving format, but everything to do with the compression algorithm used by default in Zip files.

The benchmark uses a 128 MB test file that contains images, music, text and some random data. The file represent arbitrary data usually found on a filesystem.

Library Type Packed (MB)   Pack (ms)     Unpack (ms)    Ratio   Speed (MB/s)
.NET 4.5.1 Zip 34.10 2,921 895 3.58 43.82
ICSharpCode Zip 34.07 6,373 1,524 3.58 20.08
ICSharpCode Gzip 34.07 6,347 1,511 3.58 20.17
ICSharpCode BZip2 35.69 45,002 6,811 3.42 2.84
7z SDK LZMA 28.51 40,235 3,254 4.28 3.18
Zlib.NET Zlib 33.97 4,837 985 3.59 26.46
DotNETZip Zip 33.97 5,120 1,482 3.59 25.00
DotNETZip Gzip 35.67 31,739 4,556 3.42 4.03
DotNETZip Zlib 33.97 4,509 887 3.59 28.39
LZMA-managed   LZMA 28.07 35,555 7,402 4.35 3.60
QuickLZ QuickLZ 42.90 1,015 1,010 2.85 126.11

 

Conclusion

The fastest algorithm was QuickLZ, but it is also the worst compression ratio. The best compression ratio was gained by LZMA, but it is also the third slowest compression speed. It would seem the built-in DeflateStream in .NET is the overall winner when it comes to speed vs. compression ratio.

Monday, July 9, 2012

The Power of Wolfram Alpha - Now in a .NET API

Update: Here is the direct link to WolframAlpha.net.

In an ongoing project of mine, I needed access to a good source of information that is free, powerful and provides some sort of API. My first thought was on Wolfram|Alpha, a powerful search engine that have thousands of databases filled with useful information and a really great interpreter that tries to guess what you mean by the 22th president of the USA. If you have not yet tried Wolfram|Alpha, I highly encourage you to take a look at it.

 
The results from Wolfram|Alpha when looking up "Google stock"

Wolfram|Alpha has a REST based API that outputs beautiful XML data ready to parse. They really put some work into making it easy to use, so I though it would be a no-brainer to find an existing C# library that would parse it and give me objects that I could work with. However, it turns out that there are only a few of them, and only a fraction of those support the 2.0 API that Wolfram|Alpha introduced back in January 2011. I decided to code an API myself that could be easily extended with new features and would provide me with the full API.

I present to you, the all new WolframAlpha.net API. It is based on the open source RestSharp framework for parsing the XML data, so a big thanks goes out to the guys behind it for an awesome job.

Monday, January 30, 2012

Introduction To WP7 Development and Farseer Slides

I've put up the slides from my presentation at Nordic Game Jam 2012. Have fun!