Release DP4 4.622 supports an enhanced method of compressing data. The table below shows the effectiveness of the new style compression by comparing the size (in kilobytes) of the data in various tables in its uncompressed state, the size with traditional DP4 compression, and the size it requires with new style compression enabled. The database used is a Chameleon "central" database containing real data.
| Table | Record Length | Record Count | Uncompressed Size | Old Compressed Size | Actual Size | Compression Ratio |
|---|---|---|---|---|---|---|
| TRANSACTION | 368 | 2,935,055 | 1,054,786 | 278,129 | 75,936 | 92.8% (13.8:1) |
| TRANSACTION_LINE | 224 | 5,000,829 | 1,093,932 | 469,966 | 85,483 | 92.1% (12.8:1) |
| TRANS_DT_LINE | 136 | 296,803 | 39,420 | 16,125 | 3,352 | 91.5%(11.8:1) |
| TRANS_MD_LINE | 312 | 3,109,492 | 947,424 | 206,574 | 56,721 | 94% (16.7:1) |
| TRANSACTION_XREF | 48 | 38,041 | 1,784 | 1,113 | 451 | 74%(4:1) |
| TRANS_EJ_LINE | 112 | 29,763,214 | 3,255,352 | 1,398,633 | 533,880 | 83.5% (6.1:1) |
| TXN_FTP_Files | 352 | 1562 | 537 | 124 | 32 | 94% (17:1) |
It is clear from the table above that the compression ratio varies considerably between tables. For example the TRANSACTION_XREF table compression ratio is hampered by the fact that at least 186K of space is required just for its timestamps and one header byte per record (Full sized DP4 headers would occupy 446K on their own for this table!). The relatively poor compression ratio achieved for TRANS_EJ_LINE is probably a reflection on the variability of the EJLINE data.
Overall the size of the database DAT file decreased from 2.26GB to 0.72GB - so that the new compression is 3 times more effective on this database than traditional DP4 compression. Because REORGDB manages compression automatically in this release, much larger reductions in database size will be achieved on databases for which tables were not already compressed.
New style compression cannot be turned on in advance on a table. It will be enabled on any particular table when you reorganise a database for which enhanced compression is enabled and it contains sufficient data for the new style compression to become effective. The compression process estimates how much it will be able to compress the data by, and if the amount of compression exceeds the size of the compression information that would be required does actually do the compression, and writes compression information to the DCI (data compression information) file. By default REORGDB is allowed to change the compression algorithm for all tables, but MAKEDB has a new option allowing you to disable this automatic compression management on particular tables. This may be useful on tables that contain relatively low amounts of data but which are frequently amended because compressed records may need to be moved when they are updated, which can be expensive. (However, if you use the new "Two File Index" database structure, the cost of moving a record is much less than with previous DP4 releases, and if you enable the "Advanced Transaction Processor", records are ALWAYS moved when updated, so in either of these two cases there is little point in using this option).
Tables for which you disable compression management will usually not be compressed at all, but you can enable traditional DP4 compression in MAKEDB.
The size of the compression information is quite large - as much as slightly in excess of 1K for every byte of a table, though this will only be the case for tables with a lot of non character data. For tables with less data, the compression algorithm will group bytes or characters according to the similarity of the data distribution in order to reduce the size of the compression information. This will have the effect of rendering the compression somewhat less effective, but also ensures that dissimilar data added later is more likely to be compressed moderately well.
Extrapolation of database size from a small database is therefore unreliable with new compression, at least until the database contains enough data for a full sized compression info record for each compressed table.
You can use the count records option of dbcheck to check how each table has been compressed. Records that are "compressed" have been compressed with the new compression algorithm. Records that are "squeezed" have been compressed using the old DP4 compression algorithm.
Compression requires two passes through each table, and can only be done economically when there is a good index file: REORGDB -corrupt will not be able to build an optimally compressed database but will continue to use the old compression information.
The process of compressing a table goes like this:
Once all tables have been compressed REORGDB writes the new data file and additionally compresses the record headers as well.
Undoubtedly a ZIP or similar compression program could compress the data better, but by nothing like the same margin as with the current DP4. It can do better for two main reasons:
When further new records are added to the database they will be compressed with the same data structure. If the new data is generally similar to the data already present the compression will be very effective. If it is not then it won't be. Compression will never actually grow a record - the database engine will just post the record uncompressed instead, or fall back on the old DP4 compression scheme.
For example, suppose a million transaction records have been posted containing dates from the year 2004, and the database has been reorganised. The compression will certainly know that 2004 is a frequently occurring number, and will almost certainly squash it very effectively. But come 2005 the two bytes for the year that were previously highly effectively compressed will no longer be so. On its own this won't matter much, but after a year when the transaction numbers are also in a completely different range, the PLUs being sold have changed, and new operators are doing transactions... then the compression will start being quite ineffective and a REORGDB will be well worth doing.
The new compression is also a fairly effective encryption mechanism. Looking at the raw data will tell you very little about the content of a record. Tampering with the data is effectively impossible without using proper DP4 programs. Only an extremely able hacker would be able to reverse engineer the data (and he'd have to get past the DP4 anti-debug code first as well).
It is vitally important to back up the database properly. The data for the compression engine is written to a new database file (DBNAME.DCI in the DAT directory). If this file is ever lost or damaged the database data file will be completely useless, at least for those tables that are compressed using the new scheme, and for which the compression information is lost. In fact a DCI file is always created, and needed for, reading databases unless they use a 4.620 or earlier structure, as it contains information necessary for reading and writing record headers properly.
The DCI file is only ever updated by REORGDB, so you can check for corruptions by comparing the DCI file with a backed up version of the file.
A single corrupt bit anywhere in a data record will almost always cause the whole of the rest of the data in that record to be completely corrupt. You can look on this as a plus or a minus depending on whether you would rather know when corruptions occur or be able to ignore them.
Note that any corruption in the data file in the data area of one record will only corrupt that one record. Also DP4 will almost always be able to tell the record is corrupt, because it will fail to encounter the "end of byte stream" symbol for the record before it has reached the actual end of the byte stream. A corruption will not cause DP4 to crash, because it keeps track of exactly where it has got to, and in any case all possible input is valid up to the point where the byte stream is completely decoded or the decompress buffer is full.
A consequence of implementing a sophisticated compression mechanism is that WINZIP etc will not be able to compress the data file themselves very well. Therefore it may paradoxically:
Typically a ZIP file for an enhanced compressed DAT file will by 10 to 15% bigger than the ZIP file for the equivalent current DP4 database.
The new compression is considerably more expensive in terms of processor time than the DP4 algorithm. Testing on a variety of databases indicates that scanning through a whole table using the primary index takes between 1.5 and around 3 times as long using a compressed database than an uncompressed one where both have been reorganised, which sounds bad. However, reading the compressed data may still be more than ten fimes faster than reading uncompressed data from a database that has not recently been reorganised. For data accessed over a LAN, or accessed without using buffered fetches, the difference between reading data from a compressed and an uncompressed database will probably be negligible. The penalty for using compression is much less when reading with secondary indexes than the primary index, because the additional caching made possible by data compression means that the number of non sequential disk reads is much slower than for the uncompressed database: reading a whole table on a secondary index can be between 1.2 and 2.3 times slower than for an uncompressed database.
Records in the DAT file are stored in a completely different format from traditional DP4 records. Each record begins with a four byte timestamp. This is followed by a variable number of header bytes, usually between 1 and 4 (fewer bytes are used for frequently occuring length/typenr combinations). The database manager translates headers between the internal format and the equivalent DP4 header format, so there is no need to know much about this format. It should be noted that the FDUMPM program cannot understand new style DAT files.
In fact this new style of header is always enabled unless the database is using a structure compatible with an older DP4 release.
If a database without compression is very large, and has not been recently been reorganised using the −ALTSORT option, the first time it is reorganised using the new REORGDB program the reorganisation will be very slow, because of the way the data is read. Subsequent reorganisations will probably be faster, because data that has not changed will be sorted correctly.
If you decide not to enable enhanced data compression for a database REORGDB may still make your data file considerably smaller when you reorganise the database. This will happen for two reasons: