The purpose of this part of the guide is to provide some information on how the DP4 SQL Engine executes SQL statements, so that developers have the best chance of writing statements in the most efficient form. You must not develop your system in such a way as to rely on the DP4 SQL Engine executing statements in a particular way: wherever possible we will seek to improve the SQL Engine by adding new optimisations, which may mean the database is accessed in a different way.
The code of the SQL Engine is very complex. When a particular statement is being compiled the Engine may transform the statement in many ways to reduce the number of accesses to the database to the minimum and to minimise the amount of extra work it has to undertake, for example to sort records. These transformations take the form of scoring the various parts of the statement in various ways, to determine which to execute first. It is very difficult to explain these transformations in simple terms, especially for the SELECT statement. In case of doubt use the AUXDEBUG ADC to discover exactly how a statement is executed.
The following explanations mention various flags from time to time:
Before considering the SELECT statement the other statements are briefly considered.
An INSERT statement will be executed by a call to rec_fectch_main(EQUAL). If there is a matching record, then the INSERT will fail, unless the table is auto-increment. If a table is auto-increment any call to INSERT will result in a call to rec_autoinc() followed by a call to rec_post_with_flags() specifying either CHECK_PARENT, if omit_checks is not set, or NO_CHECK if it is.
If the go faster flag is set then the record is not fetched initially, and the INSERT may erroneously update an existing record. Therefore,when go faster is set, INSERT statements are only safe on auto-increment tables .
For an explanation of how UPDATE reads the records it will update, see the explanation of the WHERE clause for the SELECT statement.
The updating done by UPDATE statements is executed by a call to rec_post_with_flags specifying the UPDATE_ONLY flag. This will cause a commit failure at the next COMMIT WORK if the record does not exist. If the omit_checks flag is not set the CHECK_PARENT and CHECK_CHANGE flags are also specified.
If an UPDATE statement changes the primary key of the record, it works differently:
For an explanation of how DELETE reads the records it will update, see the explanation of the WHERE clause for the SELECT statement.
When deleting records the flags are specified as follows:
The processing of the SELECT statement can be extremely complex, especially where there are several joins. The goal of the SQL Engine is to produce the correct result set for a query with the least amount of database access through query optimisation. In practice this goal is not always currently achieved, as there are some optimisations which could be performed but which are not. Queries more complicated than the optimiser can deal with may well be executed in a fairly inefficient manner.
The remarks in this section also apply to ON clauses in SQL2 joins. ON clauses are treated as though they were ANDed with the WHERE clause.
The optimisations performed by DP4 SQL are mainly concerned with identifying the best index to use for reading a table, and reading as few records as possible from a table. For any table in a query the whole table will be read unless an index can be found with N keys for which keys 1 through M, 1 ≤ M ≤ N, there is a condition on the keyfield of the form fieldname=expression or expression=value (where expression is an arbitrary expression that evaluates to a result of the apppropriate type (perhaps even a subquery). What is more there must not be any OR which would allow these fields to take other values. It is much better to issue two queries without an OR than to this query:
An alternative, which is optimised, would be to use a UNION of two subqueries.
Important possible classes of optimisation are not performed. For example there is no optimisation which uses start and stop points for searching where BETWEEN, or IN clauses are used, nor of LIKE or MAX searches. Such optimisations will be added as time, resources and general usefulness permit. However many potential optimisations are extremely complex in the general case. Compared with earlier releases the current version of DP4 SQL will optimise many computer generated or modified queries better. Some software has a habit of fully parenthesising all expressions (or predicates to use the official jargon), and prior to the release 4.521/4.617 DP4 SQL gave up trying to optimise when it encountered such parentheses.
To clarify, here are some examples:
This query is optimised by reading records on the primary key with ordernr=1. As there is at most one of these, only one record is read. Similarly queries using IN are not optimised.
This query is not currently optimised. All the records in the order table are read, and the matching records are found by filtering.
This query is optimised. Two subqueries each read one record from the order table.
This query is not optimised. Similary queries using BETWEEN are not optimised. This is the most important class of optimisation that is not performed, and optimisation of this type of query may be added in future.
How this query is executed will depend on whether there is an active secondary index that can read the order table by customer number - if there is then it will be used.
Where there are multiple optimisable predicates in the WHERE clause, and there are several candidate indexes, the SQL Engine will pick an index according to one of various rules.
Prior to release 4.523 the SQL engine looked for the index that was most completely used, i.e. the one for which the fewest key components were unnaccounted for after all leading keys were matched with conditions in the WHERE clause. It did not consider further indexes if it identified an index that is completely used. Unfortunately this meant that it was not always the best index that was necessarily used. For example consider the following query
The SQL Engine picked the CUSTOMER index rather than the ORDERHEADER index, because the INDEXDATETIME, which could also be used, has a time component as well. However, even if the query specifies a time as well the SQL Engine continued to use the CUSTOMER index, if it happens to come across it first (which will depend on the internal numbering of the tables), because it is completely used. This might well not be the best choice.
From 4.523 the default behaviour is pick the prime index, if all its keyfields are accounted for, or to pick the index that accounts for the most predicates otherwise. Compared with the previous algorithm this one may sometimes perform additional dictionary look ups, (and sometimes fewer), and will usually, but not always be more efficient. The situation where a worse index might be picked is where an index on two or more components is less discriminating than one on a single key. For example a SELECT on some kind of credit card table that sepecified a WHERE clause such as NAME='Aardvark' and YEAR_EXPIRY="01" and MONTH_EXPIRY="05" would most likely work faster if a NAME index is used than an index on the YEAR/MONTH_EXPIRY combination. Therefore in 4.523 you can use a compiler hint to specify that the old rule should be used, or that the primary index should be used.
ORDER BY and GROUP BY clauses can both affect the index used for queries, but only if the compilation of the WHERE clause had not already caused an index to be selected.
For an ORDER BY clause the SQL Engine will search for an index that has a leading subset of the keyfields which correspond in order, to the fields in the ORDER BY clause, i.e. the field with key precedence 1 in the index is the first field in the order by clause, the field with key precedence 2 is the second field in the order by clause, and so on. It does not matter if there are more keys in the index than fields in the ORDER BY clause. However, in the current implementation the SQL Engine only uses the ORDER BY field to select an index if all the ORDER BY fields can be matched with an index. This is because the sort algorithm used by the SQL Engine, "quick sort" does not take advantage of any existing ordering in the list. It is possible that the sort algorithm will be changes in future to natural merge sort, used by Report Writer and QAB, in which case ORDER BY will have more influence on the choice of index.
This optimisation is only done if all the fields in the ORDER BY clause are ordered the same way, i.e. all ascending or all descending.
For a GROUP BY clause a similar optimisation is performed, but the SQL Engine just searches for an index that contains all the fields grouped on. The order of the fields is not relevant.
The use of DISTINCT on a column may also require the SQL Engine to sort data, and according to internal documenation also affects the choice of index. However, this does not appear to be the case currently.
So far we have considered only queries on a single table. A very important part of optimisation is concerned with optimising the number of fetches required to construct the result set of joins. You may find it helpful to refer to SQL Joins, which is adapted from information previously published on the DP4 User group. This explains how joins are constructed, and introduces some of the issues involved in optimising join construction.
The basic example of a join is the following:
which can also be written as:
There are two possible efficient schemes for performing this query:
For each order line fetch the order header if it is not the same as the previous order header.
Depending on other aspects of the SELECT statement the SQL Engine may use either of these strategies. In fact for the query as written here the SQL Engine will use the first strategy. If the order of the table names was interchanged it would use the second strategy.
The choice of strategy may be influenced by the type of join used, and by the WHERE clause. This will be illustrated by some examples.
Because a right outer join is used this query must return a row
for every orderheader record whether or not there are order lines
for it. This means that the second stragey above cannot be used any
more. The same applies to
orderheader left join
orderline. The first strategy would be ruled out by
orderline left join orderheader or
right join orderline
This statement is in the form which would normally cause the first strategy to be used. However the WHERE clause has a condition on the orderline table which can be optimised (because there is a MATERIAL index that can be used to read the order line table. Therefore the SQL Engine uses a revised version of the second strategy:
It should be noted that because the material index is used, in
principle the order lines may not be retrieved in order number
sequence, so more order header accesses may be required, and an
order by ordernumber will need a sort.
This statement is the reverse case of the previous one: instead of using the second strategy as would be the case without the customer number condition, a revised version of the second stragey is used:
In general the SQL Engine will reorder a query so that any condition that can be evaluated without having to join all the tables is evaluated first. Then if the condition fails, the remaining records needed for the join do not have to be read from the database.
This statement raises the question of what happens when there are two conditions of this type. Now there is no perfect strategy for reading the data, as there are conflicting requirements for which query to execute first. As with the original query with no WHERE clause, the strategy used for creating the result set will depend on the order of the tables in the join.
As written the statement uses this strategy:
If the order of reading the tables were reversed so the
select ordernumber,line from
orderheader,orderline then this strategy would be used:
This is possibly not optimal - it might better to read all the order lines for the current order and select the lines that apply. On the other hand, although the order lines are read from the database repeatedly, if there are only relatively few order lined for this particular material the database manager will probably cache them efficiently, and they will be fetched in a single block fetch operation, (actually there will be at least three calls to fetch the records, but the first will only fetch one order line, and the last won't fetch any, so the reads will be cheap), whereas in the other case every order line is read at least once. In this example the strategy used when the the orderline table is listed first is almost certainly superior, as the number of reads is at most the twice the number of order lines for the material, where as in the second case the number of reads is this number multiplied by the number of orders for the customer, though the actual amount of disk access may be less if there are relatively few orders for a particular material relative to the total number of orders.
Where more than two tables are read in a single select statement the current SQL engine can occasionally make very poor decisions about the best order to read the tables in. From version 4.523 you can control the order tables are read in, and therefore force the query to be executed in a more efficient way.
Some implementations of SQL allow a programmer to supply hints to the SQL Engine as to how a query should be executed. It is reasonable to do this, because as the relatively simple example above shows there is sometimes no absolutely right way to execute a query. In many cases the best strategy will be different depending on the relative number of records in the tables concerned. The DP4 SQL Engine did not support hints until release 4.523. The hint syntax used by DP4 may affect portability of your SQL to some extent - it will cause problems if your SQL engine does not accept C style comments (an extension invented by Oracle). Although SQL has some powerful features, it also has many glaring gaps, of which this is one of the most prominent (another being the impossibility of simply fetching one record in a particular direction relative to one you know about already).
The most important thing to do is to keep statements simple. In many cases there is no particular reason to generate a hugely complex query. If you are accessing the DP4 SQL Engine from a C/C++ program, or indeed any sufficiently powerful language, using either ODBC or pipes to SQLEXE, then you can probably write your own code to combine results from several simpler queries in an appropriate manner.
Other than this, the only real influence you have over the execution of the statement is the order tables are listed in. The first listed table is the least likely to be read multiple times. Therefore you should put the table for which reading repeatedly would be most inefficient first in the list of tables and join other tables from that one, and whenever possible ensure that tabels are listed in an order which will allow efficient use of indexes. If you use the /*DP4SQL O0*/ compiler hint mentioned in the next section the SQL Engine will not change the order of your tables at all, and you can experiment with ordering the tables in different ways to get the best performance.
The order of predicates in a WHERE clause, and the order of expressions on either side of a comparison has no effect on the efficiency of your SQL in DP4.
From version 4.523 you can affect the optimisation behaviour of the SQL Engine to some extent, by using a compiler hint that takes the form of a C style comment. (It might be considered better to use an ANSI or ODBC escape syntax, but the comment syntax has the merit that it can be placed anywhere in the SQL statement and is syntactically independent of it). The compiler hint takes the following form:
/* DP4SQL optionvalue... */
The value will always be a single digit. Currently the only
O to specify how to reorder the tables in
a multi table SELECT, and
I to specify how to choose
the indexes to be used. For example the comment
O0*/ will force the SQL Engine to access tables in the exact
order specified in the FROM clause of a SELECT. You can specify a
default value for the
I options in
[SQLLIB] section of the DP4 Configuration file.
Specify a value for the
O option by adding a line of
table_sort=n. Specify a value for the
I option by adding a line of the form
The meanings of the various values for the options are given in the table below:
|O||0||The tables are accessed in exactly the order specified in the FROM clause|
|O||1||The tables are reordered so that where possible tables which are used for grouped or higher precedence sorted fields are accessed earliest. It is not possible to reorder tables where one or other is outer joined.|
|O||2||The tables are reordered so that tables which are mentioned most often in the WHERE clause are accessed first. Tables mentioned equally often are sorted so that where possible (not outer joined) tables which are used for grouped or higher precedence sorted fields are accessed earliest.|
|O||3||The tables are reordered so that where possible child tables are accessed before parent tables that are mentioned in a condition, and otherwise those which are mentioned most often in the WHERE clause are accessed first. Tables mentioned equally often are sorted so that where possible tables which are used for grouped or sorted fields are accessed earliest.|
|I||0||All tables are accessed using the prime index|
|I||1||Tables are accessed using the first index where the total number of keyfields minus the number of leading keys specified by a predicate is as small as possible.|
|I||2||Tables are accessed using the prime index if all its keys are specified by predicates, or the first index encountered that has the largest leading subset of its keys specified by a predicate.|
There should only be at most one hint for each option in a statement. Each hint option applies on a global level to the whole statement - you cannot use multiple hints to force different behaviours on different tables.