SQL Joins

In DP4 the word join is typically used to denote the action of fetching a parent record whose primary key is available from corresponding foreign key fields in a child table. This type of join is very easy using the built in processing in QAB and other DP4 utilities such as ReportWriter. The Index analysis program (MAKELINK) analyses the data dictionary and constructs a table called @PC_RECORD (short for parent child record) which describes all the possible direct (and indirect) joins between tables. Except in the case where there is more than one instance of a foreign key in the child table (possible in DP4 through the role concept) a join is typically thought of as the unique parent record that corresponds to the child. In the DP4 API a program joins a table by copying the key fields from the child record that has already been read into the key fields of the parent and calling a DP4 fetch function to read the EQUAL record on the main index.

Typically we might do something like this to retrieve an order and the associated customer record:

askf_n(1,1,&order.order_nr); /* input order number */
rec_fetch_main(&EQUAL,&order.l);
customer.cust_nr = order.cust_nr;
rec_fetch_main(EQUAL,&customer.l);

In SQL a join is any kind of SELECT statement where columns (fields in DP4 terminology) are fetched from more than one table.

The fundamental SQL join is the cross or Cartesian product of the tables, consisting of all possible pairs of rows selecting one row from the first table , and one row from the second table . Any more useful join is expressed by restricting the rows returned through the use of a WHERE clause, or an equivalent construct (of which there are several in the SQL2 standard). In SQL the traditional way of expressing the DP4 code above would be something like:

SELECT ORDERHEADER.*,CUSTOMER.* FROM ORDERHEADER,CUSTOMER WHERE ORDERHEADER.ORDER_NR=? AND CUSTOMER.CUST_NR=ORDERHEADER.ORDER_NR

In order to return the single order and customer record the SQL engine notionally constructs a temporary table consisting of all possible pairs of orders and customers and then filters it leaving only the one row of interest. Of course you have to hope that the SQL engine does not work like this in practice. If you have a database of 10 customers and 50 orders it won't matter too much. If you have 10000 customers and 50000 orders then a SQL engine that worked like this would take all week to find one record.

A SQL engine will contain various optimisations that enable it to recognise certain kinds of query and execute them efficiently by limiting the records actually read from the database in some way.

It is important to realise that any SQL engine will have only a limited range of optimisations. When the SQL engine is unable to optimise a query, it really will construct the Cartesian product of the all the tables used in the query, and then filter it. Therefore you should satisfy yourself that any queries you construct are capable of being executed within a reasonable time on a database of the intended size, and not a "noddy" development database with just a few items of test data.

In SQL a join can be constructed using any kind of valid expression. Thus you could equally well have:

SELECT ORDERHEADER.*,CUSTOMER.* FROM ORDERHEADER,CUSTOMER WHERE ORDERHEADER.ORDER_NR=? AND CUSTOMER.CUST_NR=ORDERHEADER.ORDER_NR-10000

This obviously fetches the customer who placed the order number 10000 less than the one you are thinking of now.
More interestingly, though not especially useful is this example:

SELECT ORDERHEADER.*,CUSTOMER.* FROM ORDERHEADER,CUSTOMER WHERE ORDERHEADER.ORDER_NR=? AND CUSTOMER.CUST_NR<ORDERHEADER.CUST_NR.

This would retrieve the list of all customers with a customer number less than the customer for this order.

In SQL the parent child concept, which is central to DP4, is almost invisible.

SELECT ORDERHEADER.*,ORDERLINE.* from ORDERHEADER,ORDERLINE WHERE ORDERNUMBER=? AND ORDERLINE.ORDERNUMBER=ORDERHEADER.ORDERNUMBER

This query is syntactically identical to the first example. However, since presumably there can be many order lines in a single order, the query may return many rows.

Multi-table joins

DP4 also supports the notion of an indirect join . For example the CUSTOMER table might well have an ACCOUNT_MANAGER field which was a foreign key into an EMPLOYEE table. A program that takes orders might need to check the customer's credit rating. If the customer has a poor payment history it might want to look up the account manager and automatically send a memo to him, requesting authorisation of the order. The DP4 program has to add the following code:

employee.emp_nr = customer.account_manager.emp_nr;
rec_fetch_main(EQUAL,&employee.l);

Note that the program cannot retrieve the "correct" employee details until it has first read the appropriate customer record.
The SQL for this would be something like:

SELECT ORDERHEADER.*,CUSTOMER.*,EMPLOYEE.* FROM ORDERHEADER,CUSTOMER,EMPLOYEE WHERE ORDERHEADER.ORDER_NR=? AND CUSTOMER.CUST_NR=ORDERHEADER.ORDER_NR AND EMPLOYEE.EMPNR=CUSTOMER.EMP_NR$ACCOUNT_MANAGER

(The details would vary according to how the ACCOUNT_MANAGER foreign key was named. Only DP4 would support CUSTOMER.ACCOUNT_MANAGER.EMP_NR as other databases lack roles)

Join failure and Outer Joins.

Consider our first example again (fetching an order and its associated customer record). Up until now we have not considered the possibility that an order record might exist for which the customer record was missing for some reason. In DP4 this situation is called join failure . It is usually up to the application to decide how to handle this situation. Note that the DP4 program will manage to retrieve the order even though the customer does not exist.

In SQL the situation is a little different. Recall that the result of a query is a subset of the Cartesian product of the rows. Now if there is no customer with a particular customer number, any query involving selecting a customer with a customer number equal to that number will necessarily return 0 rows. This could potentially cause confusion - a naive programmer might jump to the conclusion that when the

SELECT ORDERHEADER.*,CUSTOMER.* FROM ORDERHEADER,CUSTOMER WHERE ORDERHEADER.ORDER_NR=? AND CUSTOMER.CUST_NR=ORDERHEADER.ORDER_NR

query fails it is because there is no order with that number.

One potential solution is to issue two SQL queries instead of one query with a join. However this is likely to be inefficient and if the join is complex then it may not be an option. Because of this the notion of an Outer join was added. Roughly speaking, in an outer join if there is no record that satisfies the condition on the record an imaginary "null" record is created, and this record is deemed to satisfy the condition involving that record. "Regular" joins are therefore sometimes known as Inner joins .

The notion of outer joins was not part of SQL originally. As a result outer joins have been implemented in different ways by different database products. For example in MS SQL Server outer joins are apparently specified by using a syntax using *= instead of = in where conditions. In Oracle a completely different syntax using (+) was invented. The SQL2 standard introduced many extensions allowing the specification of joins of various types. Unfortunately this standard is far from being universal as yet - Oracle 7 does not understand SQL2 joins for example, and Crystal reports does not use SQL2 syntax to specify outer joins. The SQL2 syntax has the merit of allowing multiple outer joins to be specified in a non-ambiguous way, which is not the case with the other syntaxes.

Before going any further it would be best to explain the concept of an outer join more accurately. In the example we want our query to return an order record, whether or not the corresponding customer record exists. The Oracle and Crystal reports syntax for doing this is as follows:

SELECT ORDERHEADER.*,CUSTOMER.* FROM ORDERHEADER,CUSTOMER WHERE ORDERHEADER.ORDER_NR=? AND CUSTOMER.CUST_NR (+) = ORDERHEADER.CUST_NR.

The (+) directs Oracle, (or other databases that understand this syntax) to return a NULL row when there is no customer with the customer number matching the customer number in the order. All the fields in the customer record returned in this case will be NULL. (I am not quite sure what the value in the CUST_NR field itself would or should be. In the DP4 implementation this field would also be NULL, which is not perhaps ideal). It should be noted that the field the (+) follows is significant. It establishes which table has imaginary null records added. I should perhaps stress that these imaginary nulls are only hovering in the wings to be called upon when there is no other way of satisfying the condition of the join. If this were not the case the output of the query would be peppered with unwanted null records.

If we had a more general query (missing out the condition on the order number) we would end up with a list of all orders on the database, whether or not the corresponding customer record existed. Another query we might like to issue is one that lists the most recent order for any customer.

SELECT CUSTOMER.*,MAX(ORDERNUMBER) FROM CUSTOMER, ORDER WHERE ORDERHEADER.CUSTNR =CUSTOMER.CUSTNR GROUP BY CUSTNR

This query would miss out customers who had not placed an order yet. In order to get them included as well we would specify an outer join. (The GROUP BY clause is needed otherwise the query would only return the customer who had placed the most recent order of all)

SELECT CUSTOMER.*,MAX(ORDERNUMBER) FROM CUSTOMER, ORDER WHERE ORDERHEADER.CUSTNR (+) = CUSTOMER.CUSTNR GROUP BY CUSTNR

Note that now the (+) is appearing after the ORDER field, because we want imaginary nulls in the ORDER table.

This query could equally well be written:

SELECT CUSTOMER.*,MAX(ORDERNUMBER) FROM CUSTOMER, ORDER WHERE CUSTOMER.CUSTNR = ORDERHEADER.CUSTNR (+) GROUP BY CUSTNR

In SQL2, rather than using a WHERE clause, joins can be specified using various join operators and a USING or ON clause. The ON clause contains the part of the condition that is concerned with limiting which rows of the two tables being joined correspond.

Using SQL2 syntax the last two queries would probably be expressed as:

SELECT ORDERHEADER.*,CUSTOMER.* FROM ORDER JOIN CUSTOMER ON CUSTOMER.CUST_NR = ORDERHEADER.CUST_NR WHERE ORDERHEADER.ORDER_NR=?
SELECT CUSTOMER.*,MAX(ORDERNUMBER) FROM CUSTOMER JOIN ORDER ON ORDERHEADER.CUSTNR=CUSTOMER.CUSTNR GROUP BY CUSTNR

However the queries can equally well be expressed the other way around:

SELECT ORDERHEADER.*,CUSTOMER.* FROM CUSTOMER JOIN ORDER ON CUSTOMER.CUST_NR = ORDERHEADER.CUST_NR WHERE ORDERHEADER.ORDER_NR=?
SELECT CUSTOMER.*,MAX(ORDERNUMBER) FROM ORDER JOIN CUSTOMER ON ORDERHEADER.CUSTNR=CUSTOMER.CUSTNR GROUP BY CUSTNR

These queries would be bound to return the same result set, though possibly in a different order. (This is an example of where optimisation of queries can cause trouble - in DP4 the "natural" form of the queries is more likely to be executed more optimally, than back to front equivalents.) In all four cases this syntax joins the tables using the inner join, so will not list the order if the customer is missing, or the customers without orders. If desired the word INNER can precede the word JOIN. (Probably this is rarely if ever done, though doubtless pedantic reference books will recommend programmers to use it "to avoid ambiguity")

The first two cases can be converted into outer joins by using the word OUTER:

SELECT ORDERHEADER.*,CUSTOMER.* FROM ORDER OUTER JOIN CUSTOMER ON CUSTOMER.CUST_NR = ORDERHEADER.CUST_NR WHERE ORDERHEADER.ORDER_NR=?
SELECT CUSTOMER.*,MAX(ORDERNUMBER) FROM CUSTOMER OUTER JOIN ORDER ON ORDERHEADER.CUSTNR=CUSTOMER.CUSTNR GROUP BY CUSTNR

However, if we consider the alternate form of these queries it should become obvious that this syntax is potentially ambiguous. In SQL2 there are both LEFT and RIGHT outer joins.

In some sense the table on the left of a LEFT outer join is the "main" table as is the table on the right of a RIGHT outer join. Normally the SQL engine will want to discover records in that table first before attempting to discover which if any records in the other table match.

The notions of left and right outer join are essentially completely equivalent. However, there are probably cases where one or the other (usually LEFT) is more natural than the other.

If a SQL2 join expression specifies the use of LEFT or RIGHT then the word OUTER is optional (and superfluous). If OUTER is specified on its own it implies a left outer join.

It should be noted that using the SQL2 syntax a left outer join will most naturally correspond to an Oracle statement where the (+) appeared on the right hand side of a comparison.

In DP4 SQL you should refrain from using right outer joins where possible. Internally the DP4 SQL engine generates right outer joins by interchanging the two tables and using a left join instead. In a few cases join expressions involving nested joins cannot be converted properly resulting in an internal error from the DP4 SQL engine. This will be fixed when possible if it becomes necessary.

In ODBC, outer joins are considered to be an extension, and are implemented using a so called escape syntax.

SELECT ORDERHEADER.ORDERNUMBER,CUSTOMER.NAME FROM ORDER LEFT OUTER JOIN CUSTOMER ON CUSTOMER.CUST_NR = ORDERHEADER.CUST_NR WHERE ORDERHEADER.ORDER_NR=?

becomes instead:

SELECT ORDERHEADER.ORDERNUMBER,CUSTOMER.NAME FROM {OJ ORDER LEFT OUTER JOIN CUSTOMER ON CUSTOMER.CUST_NR = ORDERHEADER.CUST_NR } WHERE ORDERHEADER.ORDER_NR=?

In the ODBC version of outer joins I believe that the word OUTER is mandatory in theory. ODBC drivers for databases such as Oracle are likely to mangle this syntax into the form expected by the database's native SQL support. This is a bit of a joker in the pack, as some drivers generate completely invalid SQL as a result. For example one early driver generated SELECT statements with two WHERE clauses if both an OUTER and an ordinary join were used.

Multiple joins

If we return to our example where we also joined the account manager record from the employee table you will see that a nested outer join may be necessary. In this case the FROM part of the query using inner joins would be something like:

ORDER JOIN (CUSTOMER JOIN EMPLOYEE ON CUSTOMER.ACCOUNT_MANAGER.EMP_NR=EMPLOYEE.EMP_NR) ON ORDERHEADER.CUSTNR=CUSTOMER.CUSTNR

Our interpretation of the syntax of nested joins is as follows:

However inner,union or cross joins are combined with each other or a single outer join the end result will be the same. Where multiple OUTER joins are used this may not be the case since in a nested JOIN expression different ways of expressing the same join can yield differing rows with imaginary null records.