Configuring the Index file Structure

Unfortunately the content of this section is rather difficult. However, it is well worth trying to understand it, as it will enable you to make the best decisions for your database. For those of you in a hurry, for most databases and applications upgrading your database to use the "Two File Index" is likely to be a very good decision.

As well as the index file structure from previous DP4 releases, release 4.622 supports three revised methods of arranging the database index file. These three methods are all fairly similar, but differ slightly in how they treat secondary indexes. Which option you choose can make a difference of close to 50% in how large your database index file will be. The three methods differ in how they assign a "row id" to each record. The "row id" is connected with the mysterious entity known as the "find list", which is used by DP4 when navigating through secondary indexes.

Row IDs and Index entries in traditional DP4

In a traditional DP4 database, every record has a unique row id, which is assigned when the record is first posted to the database: (row ids are just assigned from a field in the index header, which is incremented each time a record is added to the data file; when a database is reorganised the row id is changed so that it matches the record number within the data file, which is why the ordering of secondary indexes can change at REORGDB time). The row id is not actually part of the record in the data file, but is stored as a suffix after the record's primary key in the index. The row id is in fact used as part of every single key in the index file. For the primary key, the row id is superfluous from the point of view of indexing. However for secondary indexes it is very important - it is used what distingishes between the index entries for records with identical secondary keys. Each time you read a record from the database, DP4 remembers its row id. A subsequent fetch(NEXT) or fetch(PREV) will use that stored row id, to determine where exactly you were in the index. If you have some kind of recursive data structure (such as the procedures in a QAB program), you can prevent DP4 from getting confused between nested scans using the same index, either by using a different memory area for reading the record at each level of recursion, or by using just one memory area, but remembering the appropriate primary key for the level you are currently at, and then calling rec_fetch_main() on the saved main index to re-establish the row id. (DP4 utilities use the latter technique frequently to avoid having to allocate hundreds of in memory record areas for a deep recursion). For this reason storing the row id as part of the primary key is also very important.

As well as the row id, each key in a traditional DP4 index file, contains a pointer to the corresponding data record. Each key also has two bytes of house-keeping associated with it. Also, most, but not all, keys can be compressed, by only storing the part of the key that is different from the previous key in the index. This means that every primary key in a traditional DP4 index file contributes a minimum of 11 bytes to the index file, and every secondary key contributes a minimum of 7 bytes (or 11 bytes if the key is unique). (Obviously the total size of the key will vary considerably depending on the number of key fields for the index; also, since the index has an (inverted) tree structure, each additional key will contribute some small fraction to the entries "higher up" (nearer the root) in the tree). In a typical DP4 database index the row id and data pointer together form a very significant proportion of the index file, probably two thirds or more.

Alternatives to the Row ID/Data Pointer Index

Note that the row id is only really needed to facilitate the reading of data on a secondary index, and note also that it does not really matter how the row id for each record is assigned, as long as its value is unique to that record. Further it is obvious that for most purposes the address of the record in the data file would serve as a perfectly adequate row id. Finally, note that storing the address of the record as part of every single key wastes space, since if there is a table which cross references row ids and data addresses we can easily look the address of the record up from its row id. Hence we have two possible ways to improve on the traditional DP4 index structure:

I'd be interested to hear of any data anyone can come with on comparative performance and database sizes using real databases for the various types of index (though the Archive Index should only ever be used where appropriate. Unless the "Archive Index" or "Single File Index" are shown to have real benefits I will remove support for them from the 4.622 final release.

  1. Keep using an invented row id, and use it as an index into a table of record address pointers (the PTR file). This means that the index file does not have to contain any data file pointers, which makes updates (record amendments) cheaper to perform. This type of index, henceforth called the "Two file index" is much smaller than a traditional DP4 index, often by as much as 40-50%. With this style of index primary keys can occupy as little as six bytes (if your database has less than 16 million records), or 7 bytes if you need up to 4 billion records. Non unique secondary keys will usually occupy only two or three bytes. The system database is actually built with two byte sequence numbers, so that many primary keys only take 5 or 6 bytes.

    Using this style of index enables a new REORGDB option: "Defragment Database". This typically runs extremely quickly, requiring little more than the time needed to copy the database files. Defragmentation signficantly improves the performance of batch programs such as reports, by ensuring data for each table is allocated as near as possibly sequentially.

  2. Instead of using an invented row id, just use the address of the record within the data file as a row id. Unfortunately this method of indexing causes some loss of functionality on secondary indexes, as explained below. This type of index, called the "Archive index" is similar in size to the "Two file index". However you will need a four byte data pointer unless your database is less than 16MB, which can offset the benefit of not having a PTR file.

The third new method of indexing is essentially the same as that used in previous releases of DP4, but with a slightly more efficient use of space in the index file, and the possibility of varying the number of bytes used for the row id and data pointer size. It is referred to as the "Single file index" in REORGDB. This type of index may be slightly faster than a two file index at performing random lookups (such as fetching information from a catalog based on an item code), but it is significantly larger than a "Two file index", so will usually be slower at everything else.

Problems with the "Archive Index"

Using the data file pointer as the row id (the "Archive Index") has one significant drawback: if a record is updated, and has to be moved within the data file, which is very likely to happen if the record is compressed using the new compression algorithm, and quite likely to happen if it is compressed using the old algorithm, then its row id would have to change. This will upset any ongoing search through the table on a secondary index, causing the same record to be read more than once, or possibly missed altogether (if the update is made by another program). Therefore this method of assigning a row id is not suited to systems which need to use the database like this, unless the affected applications can cope with possibly reading the same record twice while scanning through the table, and missing out some records does not matter, or you know the database will only be used by one user at a time. However this type of index will work well for databases that contain only static data, such as a "program" database, or a database that has been retained for archive purposes only, or a database that only needs primary indexes.

Comparison of Index Size and Performance for "Archive" and "Two file" index

In many, but by no means all, cases, the "Archive" method of indexing the data will give the smallest index file. The "Two file" index may give an index that is smaller, if you can afford to use a row id with fewer bytes than are needed for the data pointer size and there are a lot of secondary indexes. This is actually a common scenario: if your database will never contain more than 16 million records, but has a data file bigger than 16 Megabytes, then you will need only three bytes for a unique row id, but four bytes for the data file pointer. In this case, for any table that has more than four or more indexes active, the "Two file" index will create an index that is smaller. The "Two file" index also does not cause any loss of functionality on the database. Potentially it may slightly slow down database reads, since one extra disk access may be needed to read a record from the database in the case where caching is ineffective (though because the index is smaller, it may well be flatter, completely eliminating this disadvantage). Tests indicate that when scanning through a whole table this method of indexing is actually faster than a traditional DP4 index by around 10-15%.The Archive index is the fastest index for scanning through a table. However record amendments are much more expensive for this type of index than either of the other two index types.

Reducing the size of Secondary Indexes

It is quite common to add additional key components to a secondary index to ensure that the index is unique. This practice is not necessarily to be recommended from release 4.622, for the following reason:

Conclusion

For most databases, the "Two file index" structure will give the best overall performance, both in terms of small size and fast operation. However for "read only" databases, the "Archive index" may be preferable. The "Single file index" is only a marginal improvement on traditional DP4 databases, but may be a good choice if you need to support larger databases than were possible with traditional DP4, and do not have the time to update back up and other scripts that access DP4 database files directly and which are unaware of the new PTR file.