Optimal Compression for Binary Data

The Problem

My current project uses 5mb of disk space to save a 2D map. There could be hundreds of these maps. I tried compressing the files, but couldn't get them any smaller than 2mb.

This was an improvement, but still unacceptable. A better approach was needed. This article describes how I managed to compress my 5mb of map data into just 36kb.

Understanding Compression

The objective is not to naively throw compression algorithms at the file, but instead to give the file more compression potential. A file has better compression potential when it contains larger blocks of identical byte sequences and when there are smaller differences between non-identical bytes.

This can be done in two simple steps.

The Implementation - Part 1

The first step, very simply, is to organise like-blocks together.

The following code will create files that compress very badly, because each integer written is going to vary wildly:

For i = 0 to n
    WriteInt(myThing[i].iType)
    WriteInt(myThing[i].iColor)
    WriteInt(myThing[i].iCost)
Next

In contrast, the following code will write blocks of data that contain lots of consecutive identical values. The resulting file will be much more compressible.

For i = 0 to n
    WriteInt(myThing[i].iType)
Next

For i = 0 to n
    WriteInt(myThing[i].iColor)
Next

For i = 0 to n
    WriteInt(myThing[i].iCost)
Next

The Implementation - Part 2

The next step is to write a pre-compression filter. This can very simply be done using a difference filter or subtraction filter.

This works by writing only the difference between values as they are written, instead of the values themselves.

The following is all it takes:

For i = 0 to n
    If i = 0 then
        WriteInt(myThing[i].iType)
    Else
        WriteInt(myThing[i].iType - myThing[i-1].iType)
    Endif
Next

As a bonus tip, if you can make sure that you only have positive integer differences, perhaps by adding a large value such as (1 ShiftLeft 30) to each integer written, this becomes much more effective:

Const OFFSET = 1073741824 // (1 Shl 30)

For i = 0 to n
    If i = 0 then
        WriteInt(myThing[i].iType + OFFSET)
    Else
        WriteInt(myThing[i].iType - myThing[i-1].iType + OFFSET)
    Endif
Next

Further Work

If you are saving data that has two dimensions, e.g. a map or an image, you can improve the compression potential further by using filters such as Average or Paeth that work on more than one dimension. You can even choose the best method to use on a line-by-line basis. This is how PNG images work.

If your data contains real/floating-point numbers, consider reducing their resolution. These numbers are really difficult to compress, so reducing the resolution will help a great deal. Here's an example, including an offset of 1.0, assuming Real values between 0.0 and 1.0:

Const FACTOR = 3600.0 // Reduce to compress,
                      // increase for accuracy.

Function CompressReal(r: Real) : Integer
Begin
    result = ToInt((r + 1.0) * FACTOR)
End

Function DecompressReal(r: Integer) : Real
Begin
    result = (ToReal(r) / FACTOR) - 1.0
    If result < -1.0 then result = -1.0
    If result > 1.0 then result = 1.0
End

The "compressed" real numbers are then saved as integers:

For i = 0 to n
    If i = 0 then
        WriteInt(CompressReal(myThing[i].rFoobar))
    Else
        WriteInt(CompressReal(myThing[i].rFoobar - myThing[i-1].rFoobar)))
    Endif
Next

Stay Subscribed

@golightlyb, subscribe to the RSS feed or get updates by e-mail.

You can also contact me directly - I make an effort to reply to every e-mail.

Optimal Compression for Binary Data - Articles

Main Navigation

Featured Items

Main Article