Enhanced Data Compression

Release DP4 4.622 supports an enhanced method of compressing data. The table below shows the effectiveness of the new style compression by comparing the size (in kilobytes) of the data in various tables in its uncompressed state, the size with traditional DP4 compression, and the size it requires with new style compression enabled. The database used is a Chameleon "central" database containing real data.

TableRecord LengthRecord CountUncompressed SizeOld Compressed SizeActual SizeCompression Ratio
TRANSACTION3682,935,0551,054,786278,12975,93692.8% (13.8:1)
TRANSACTION_LINE2245,000,8291,093,932469,966 85,48392.1% (12.8:1)
TRANS_DT_LINE 136296,80339,42016,1253,35291.5%(11.8:1)
TRANS_MD_LINE 3123,109,492947,424206,57456,72194% (16.7:1)
TRANSACTION_XREF4838,0411,7841,11345174%(4:1)
TRANS_EJ_LINE 11229,763,2143,255,3521,398,633533,88083.5% (6.1:1)
TXN_FTP_Files 35215625371243294% (17:1)

It is clear from the table above that the compression ratio varies considerably between tables. For example the TRANSACTION_XREF table compression ratio is hampered by the fact that at least 186K of space is required just for its timestamps and one header byte per record (Full sized DP4 headers would occupy 446K on their own for this table!). The relatively poor compression ratio achieved for TRANS_EJ_LINE is probably a reflection on the variability of the EJLINE data.

Overall the size of the database DAT file decreased from 2.26GB to 0.72GB - so that the new compression is 3 times more effective on this database than traditional DP4 compression. Because REORGDB manages compression automatically in this release, much larger reductions in database size will be achieved on databases for which tables were not already compressed.

Technical Overview

New style compression cannot be turned on in advance on a table. It will be enabled on any particular table when you reorganise a database for which enhanced compression is enabled and it contains sufficient data for the new style compression to become effective. The compression process estimates how much it will be able to compress the data by, and if the amount of compression exceeds the size of the compression information that would be required does actually do the compression, and writes compression information to the DCI (data compression information) file. By default REORGDB is allowed to change the compression algorithm for all tables, but MAKEDB has a new option allowing you to disable this automatic compression management on particular tables. This may be useful on tables that contain relatively low amounts of data but which are frequently amended because compressed records may need to be moved when they are updated, which can be expensive. (However, if you use the new "Two File Index" database structure, the cost of moving a record is much less than with previous DP4 releases, and if you enable the "Advanced Transaction Processor", records are ALWAYS moved when updated, so in either of these two cases there is little point in using this option).

Tables for which you disable compression management will usually not be compressed at all, but you can enable traditional DP4 compression in MAKEDB.

The size of the compression information is quite large - as much as slightly in excess of 1K for every byte of a table, though this will only be the case for tables with a lot of non character data. For tables with less data, the compression algorithm will group bytes or characters according to the similarity of the data distribution in order to reduce the size of the compression information. This will have the effect of rendering the compression somewhat less effective, but also ensures that dissimilar data added later is more likely to be compressed moderately well.

Extrapolation of database size from a small database is therefore unreliable with new compression, at least until the database contains enough data for a full sized compression info record for each compressed table.

You can use the count records option of dbcheck to check how each table has been compressed. Records that are "compressed" have been compressed with the new compression algorithm. Records that are "squeezed" have been compressed using the old DP4 compression algorithm.

Compression requires two passes through each table, and can only be done economically when there is a good index file: REORGDB -corrupt will not be able to build an optimally compressed database but will continue to use the old compression information.

The process of compressing a table goes like this:

  1. Fetch all the data in a table and count up all the bytes in all the records and determine how best to compress records in this table. This involves the construction of a 1028 byte record for each byte in the data part of table and also for each of the possible 256 character values.
  2. Repeatedly combine compression records for bytes with similar compression records where this does not reduce the total effectiveness of the compression by more than the size of the compression record. (It is not possible to do this optimally, but in practice the result will be close to optimal)
  3. Write the data structure that will permit the general purpose compression engine to compress this table effectively to the DCI compression file.
  4. Rewrite all the data records using the new compression structure to a temporary file.

Once all tables have been compressed REORGDB writes the new data file and additionally compresses the record headers as well.

Undoubtedly a ZIP or similar compression program could compress the data better, but by nothing like the same margin as with the current DP4. It can do better for two main reasons:

When further new records are added to the database they will be compressed with the same data structure. If the new data is generally similar to the data already present the compression will be very effective. If it is not then it won't be. Compression will never actually grow a record - the database engine will just post the record uncompressed instead, or fall back on the old DP4 compression scheme.

For example, suppose a million transaction records have been posted containing dates from the year 2004, and the database has been reorganised. The compression will certainly know that 2004 is a frequently occurring number, and will almost certainly squash it very effectively. But come 2005 the two bytes for the year that were previously highly effectively compressed will no longer be so. On its own this won't matter much, but after a year when the transaction numbers are also in a completely different range, the PLUs being sold have changed, and new operators are doing transactions... then the compression will start being quite ineffective and a REORGDB will be well worth doing.

Implications

Databases without Enhanced Data Compression

If you decide not to enable enhanced data compression for a database REORGDB may still make your data file considerably smaller when you reorganise the database. This will happen for two reasons: