Recently discovered bugs in DP4 Database Manager - 8th March 2006

Since this article was originally written, more bugs have been found in the database manager which, unlike the ones originally described, can occur on any system. Please see the last section, on the DBRECOV mechanism.

Record Locking Bugs

We have recently been extending the scope of the tests run against DP4. As a result we have discovered two closely related and long-standing bugs in the database manager. Both bugs can corrupt data files, and are therefore potentially very serious. However, the bugs can only arise where record locking is used, and only when a very particular set of circumstances arises, so in practice few systems will be affected.

The bugs affect all releases of DP4 on all platforms. Currently the fixes have only been applied to the WIN32 and WINCE sets of DP4, in the 4.621 and 4.622 sets and the corresponding 4.5xx releases. They will be applied to other releases and versions of DP4 on request, and to the next release for all supported platforms.

Details

Both bugs arise when a record that has been locked immediately follows a hole in the database and the locked record is moved or deleted. In two different scenarios, when the lock is released, the old location of the record in the data file is updated instead of the correct location. If the hole was reused and the new record inserted there also uses part of the released space, then the new record will be corrupted. A particularly bad corruption will occur if the old record header is now in the middle of another record header.

There are two ways this erroneous unlocking could arise:

In the first scenario a user locks and then unlocks a record, and then locks it again. While the record is unlocked another user deletes the record, or amends it so that the record is moved (which can happen if the data in the record is compressed, or data independence was invoked). When the first user locks the record again, DP4 reuses the original lock information if the user has not done a checkpoint in the meantime. Therefore the lock is recorded against the wrong data file address, and when the lock is released an incorrect update is made to the data file.
In the second scenario the user locks the record, but another program does a delete which causes a cascaded delete of the locked record. DP4 allows deletes in this circumstance - a child delete "trumps" any lock on the record. However, when this happened DP4 did not break the original program's lock, and when the lock was released the data file would be corrupted as described.

In the fixed version of the database manager, the code does not reuse old lock information which might be outdated, and when a record is deleted as part of a child delete, any lock on that record is broken, so that there will be no later unlock. Additionally, whenever a lock flag is to be set or unset in the data file, DP4 now checks that the record header is as expected, and if it is not, internal error 86 is generated.
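The stale-address problem and the new header check can be illustrated with a small Python model. This is only a sketch under stated assumptions: the DataFile class, set_lock_flag, the LOCKED bit, the record names and the addresses are all hypothetical and are not DP4 code. Only the idea of checking the record header before changing a lock flag, and reporting internal error 86 on a mismatch, is taken from the description above.

```python
# Hypothetical model: none of these names come from DP4 itself.
LOCKED = 0x01


class InternalError(Exception):
    """Stands in for DP4's internal error 86."""


class DataFile:
    def __init__(self):
        # address -> [record_id, flags]; a toy stand-in for record headers
        self.headers = {}

    def set_lock_flag(self, address, record_id, locked, check_header=True):
        header = self.headers[address]
        # Fixed behaviour: refuse to touch a header that is not the expected one
        if check_header and header[0] != record_id:
            raise InternalError(86)
        header[1] = (header[1] | LOCKED) if locked else (header[1] & ~LOCKED)


def demo(check_header):
    f = DataFile()
    f.headers[100] = ["A", 0]
    f.set_lock_flag(100, "A", True, check_header)   # user 1 locks A at address 100
    # Another user moves A to address 200; record B reuses the freed space
    # and is locked by a third user.
    f.headers[200] = f.headers.pop(100)
    f.headers[100] = ["B", LOCKED]
    # User 1 releases the lock using the stale address 100
    f.set_lock_flag(100, "A", False, check_header)
    return f.headers
```

With check_header=False the release clears the lock flag in record B's header and leaves A marked as locked at its new address. With check_header=True the mismatch is reported instead and nothing is written.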

Bugs in DBRECOV mechanism

Effect of Second System Failure while DBRECOV is Running

A long-standing problem with DBRECOV has been discovered. A second system failure might occur while DBRECOV is running. DBRECOV was intended to cope with this possibility, so while the transaction log is being posted it writes a new rollback log, which allows the database recovery to be restarted if necessary. However, the flag indicating that rollback recovery is required is not set in a timely manner (i.e. before any updates at all are made to the database), so a later DBRECOV may not be aware of the need to apply the rollback file. If this happens the database will be corrupt even though DBRECOV claims to have recovered it successfully. The bug is not in DBRECOV itself, but in the support for the rollback file built into the database manager. The bug has always been present, but is more serious in release 4.620 and beyond because the rollback file is allowed to grow to a larger size.
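The importance of setting the flag before the first update can be shown with a toy model in Python. This is a sketch under assumptions, not the database manager's code: the dictionary layout, the names post_updates, restart and rollback_required, and the exact point at which the late flag is set are all invented for illustration.

```python
class Crash(Exception):
    """Simulated second system failure."""


def new_db():
    return {"data": {"x": 1, "y": 1}, "rollback_log": [], "rollback_required": False}


def post_updates(db, updates, flag_first, crash_after=None):
    """Post updates, saving old values to a rollback log so they can be undone."""
    if flag_first:
        db["rollback_required"] = True        # set before any update at all
    for n, (key, value) in enumerate(updates, start=1):
        db["rollback_log"].append((key, db["data"][key]))
        db["data"][key] = value
        if n == crash_after:
            raise Crash()
        # Late variant: the flag is first set here, after an update is on disk.
        # When flag_first is True this is a harmless repeat.
        db["rollback_required"] = True
    # Normal completion: the rollback information is no longer needed
    db["rollback_log"].clear()
    db["rollback_required"] = False


def restart(db):
    """A later recovery run: undo partial work only if the flag asks for it."""
    if db["rollback_required"]:
        for key, old in reversed(db["rollback_log"]):
            db["data"][key] = old
    db["rollback_log"].clear()
    db["rollback_required"] = False
    return db["data"]


def demo(flag_first):
    db = new_db()
    try:
        post_updates(db, [("x", 2), ("y", 2)], flag_first, crash_after=1)
    except Crash:
        pass
    return restart(db)
```

In this model demo(True) restores both values to 1, while demo(False) leaves x updated and y not, because the restarted recovery never learns that the rollback log should be applied.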

Failure to Write all Necessary Information to Rollback File

Another problem is that in some circumstances DP4SRVR.W32 does not write information about holes to the rollback log. This usually results in a checksum error (particularly when DYNABACK -COPY is used), but can cause more serious corruptions in rare circumstances.

DP4 reuses space from deleted records by creating holes. It also merges adjacent holes to try to manage space efficiently. When a hole is changed or used it should be written to the rollback log. In one particular case this did not happen. Therefore, if a rollback log is applied by dbrecov or dynaback, exactly one hole may be incorrect. This should cause system error 87,13 or dbcheck error 17. However, the problem is usually masked: when dbrecov/dynaback apply the transaction log, the hole is restored to its correct length without DP4 noticing the corruption. The code that checks holes is in the same function that writes holes to the rollback log, so applying the transaction log normally does the one thing that repairs the corruption without its ever being noticed. The only trace left of the error is the checksum error in the DAT file.

However, it might happen that the corrupt hole was created after the last checkpoint. In this case DBRECOV would not unwittingly do the right thing to repair the hole, and it would be a matter of luck whether the next database activity did so or not. If it did, the database would have a checksum error. If not, a bad hole would be left, so that the data file was corrupt.
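The effect of a hole change that never reaches the rollback log can be sketched as follows. This Python model is hypothetical throughout: the {address: length} representation, the function names and the addresses are invented, and the choice of a merge with a preceding hole as the unlogged case is an assumption made purely for illustration, since the particular case affected in DP4 is not specified here.

```python
def set_hole(holes, rollback_log, address, length, log_it=True):
    """Create, resize or remove (length 0) a hole, saving its prior state first."""
    if log_it:
        rollback_log.append((address, holes.get(address, 0)))
    if length:
        holes[address] = length
    else:
        holes.pop(address, None)


def delete_record(holes, rollback_log, address, length, log_merge=True):
    """Free a record's space, merging it into a hole that ends where it starts."""
    for start, size in list(holes.items()):
        if start + size == address:
            # Adjacent hole found: grow it instead of creating a second hole
            set_hole(holes, rollback_log, start, size + length, log_it=log_merge)
            return
    set_hole(holes, rollback_log, address, length)


def apply_rollback(holes, rollback_log):
    """Restore every logged hole to its earlier state, newest entry first."""
    for address, old_length in reversed(rollback_log):
        set_hole(holes, [], address, old_length, log_it=False)


def demo(log_merge):
    holes = {100: 20}            # hole list as it stood at the last checkpoint
    rollback_log = []
    delete_record(holes, rollback_log, 120, 30, log_merge)
    apply_rollback(holes, rollback_log)
    return holes
```

When the merge is logged, rolling back returns the hole at address 100 to length 20. When it is not, the hole keeps length 50 after the rollback and so overlaps the space of the record that the rollback has reinstated, which is the kind of single incorrect hole described above.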

Checksum Corruptions on New Databases

Another bug in the DBRECOV mechanism can cause checksum corruptions in the DAT file, but never anything more serious. This bug is present in 4.621 and earlier releases, but not in 4.622. The problem can only occur a few times on any particular database, and will therefore not occur once a database has been in use for a while. The bug is as follows:

In the primary PC record for each table the pc_flags field has two flags, PC_DAT_ADDED and PC_DAT_DELETED.

When a table is initially created these flags are not set. The first is set the first time data is added to the table. The second is set the first time data is deleted. The first is used for deciding whether data independence is needed in makedb/makelink. The second is not actually used at all, and was removed in 4.622. When either flag is set the database manager writes the previous state of the PC record to the rollback log, but until 4.622 there was a bug in the way it did this: it forced the generation field in the record header to the current db generation, whereas it was usually 0 in the record actually on the database. So if you have to run dbrecov just after data has been added to or removed from a table for the first time, the checksum will be wrong if the generation number of the database is not 0.
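A short Python sketch shows why a forced generation field upsets the checksum. Everything concrete here is an assumption: the two-field record layout is invented, and CRC32 is used only as a stand-in because the checksum DP4 actually uses is not described in this article.

```python
import struct
import zlib


def record_bytes(generation, pc_flags):
    # Hypothetical layout: 4-byte generation in the header, then 4-byte pc_flags
    return struct.pack("<II", generation, pc_flags)


def rollback_image(on_disk_generation, pc_flags, db_generation, buggy):
    """Bytes saved to the rollback log as the 'previous state' of the PC record."""
    # Buggy behaviour: the generation is forced to the current db generation
    generation = db_generation if buggy else on_disk_generation
    return record_bytes(generation, pc_flags)


def checksum_matches_after_rollback(db_generation, buggy):
    original = record_bytes(0, 0)      # PC record as created: generation 0, no flags
    expected = zlib.crc32(original)    # checksum before the flag was set
    restored = rollback_image(0, 0, db_generation, buggy)
    return zlib.crc32(restored) == expected
```

In this model the restored record only checksums correctly under the buggy behaviour when the database generation happens to be 0, which matches the condition given above.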