Wednesday, July 23, 2008

Semi Join algorithm

This article was written in oracle9i release 9.2.0.6.0. The below test result may varry from Oracle one release to another release.

A “semi-join” between two tables returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be returned.

Suppose you have the DEPT and EMP tables in the SCOTT schema.

SQL> select count(*) from emp;

COUNT(*)
----------
458752
SQL> select count(*) from dept;

COUNT(*)
----------
4
SQL> EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCOTT',TABNAME => 'EMP
',ESTIMATE_PERCENT => 10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1', CASCADE => TRUE);

PL/SQL procedure successfully completed.

SQL> EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME => 'SCOTT',TABNAME => 'DEP
T',ESTIMATE_PERCENT => 10, METHOD_OPT => 'FOR ALL COLUMNS SIZE 1', CASCADE => TRUE);

PL/SQL procedure successfully completed.

Let us test the semi join.

We want a list all departments in dept table if department has atleast one employee in emp table.

Here is conventional way to write the query.

SELECT D.deptno, D.dname FROM dept D, emp E WHERE E.deptno = D.deptno ORDER BY D.deptno;

The problem here is, if the department has 100 employees, then department will appear 100 times. We can eliminate the duplicate records by using the DISTINCT clause. But again, oracle would do more work to get the output.

Here is total time it took to complete the query

SQL> declare
2 l_start NUMBER DEFAULT DBMS_UTILITY.GET_TIME;
3 v_cnt number;
4 begin
5 SELECT count(distinct D.deptno) into v_cnt
6 FROM dept D, emp E
7 WHERE E.deptno = D.deptno
8 ORDER BY D.deptno;
9 dbms_output.put_line(to_char(ROUND(ROUND((DBMS_UTILITY.GET_TIME - l_start)/100, 2)/60,3)));
10 end;
11 /
.009
Here is the execution plan for the query

SQL> SELECT count(distinct D.deptno)
2 FROM dept D, emp E
3 WHERE E.deptno = D.deptno
4 ORDER BY D.deptno;

COUNT(DISTINCTD.DEPTNO)
-----------------------
3
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=160 Card=1 Bytes
=6)

1 0 SORT (GROUP BY)
2 1 NESTED LOOPS (Cost=160 Card=455570 Bytes=2733420)
3 2 INDEX (FAST FULL SCAN) OF 'IDX_EMP' (NON-UNIQUE) (Cost
=160 Card=455570 Bytes=1366710)

4 2 INDEX (UNIQUE SCAN) OF 'SYS_C0010206' (UNIQUE)

We can use semi join between the DEPT and EMP tables instead of a conventional join:

The below query will list the departments that have at least one employee. Whether a department has one employee or 100, the department will appear only once in the query output. Moreover, Oracle will move on to the next department as soon as it finds the first employee in a department, instead of finding all of the employees in each department.

Here is total time it took to complete the query

SQL> declare
2 l_start NUMBER DEFAULT DBMS_UTILITY.GET_TIME;
3 v_cnt number;
4 begin
5 SELECT count(D.deptno) into v_cnt
6 FROM dept D
7 WHERE EXISTS
8 (
9 SELECT 1
10 FROM emp E
11 WHERE E.deptno = D.deptno
12 )
13 ORDER BY D.deptno;
14 dbms_output.put_line(to_char(ROUND(ROUND((DBMS_UTILITY.GET_TIME - l_start)/100, 2)/60,3)));
15 end;
16 /
.001

Here is the execution plan

SQL> SELECT count(D.deptno)
2 FROM dept D
3 WHERE EXISTS
4 (
5 SELECT 1
6 FROM emp E
7 WHERE E.deptno = D.deptno
8 )
9 ORDER BY D.deptno;

COUNT(D.DEPTNO)
---------------
3

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=641 Card=1 Bytes =6)

1 0 SORT (AGGREGATE)
2 1 NESTED LOOPS (SEMI) (Cost=641 Card=3 Bytes=18)
3 2 INDEX (FULL SCAN) OF 'SYS_C0010206' (UNIQUE)
4 2 INDEX (FAST FULL SCAN) OF 'IDX_EMP' (NON-UNIQUE) (Cost
=160 Card=341678 Bytes=1025034)

Here is the another good candidate for SEMI-JOIN algorithim.

Display list of gold-status customers who have placed an order within the last three days. You might start with a query that looks like:

SELECT C.short_name, C.customer_id
FROM customers C
WHERE C.customer_type = 'Gold'
AND EXISTS(
SELECT 1
FROM orders O
WHERE O.customer_id = C.customer_id
AND O.order_date > SYSDATE - 3)
ORDER BY C.short_name
/

Note : If the query contains a DISTINCT/UNION clause, then Oracle cannot transform EXISTS or IN clauses into something that could use a semi-join access path. Semi join can be used in Nested loop, Hash Join & Sort Merge. If semi join method is not possible, then optimizer use the conventional join with lowest cost. Oracle provides hints(NL_SJ, HASH_SJ, and MERGE_SJ) to enforce semi join algorithm. The hint is applied to the subquery of the EXISTS or IN clause, not the main body of the query itself.

No comments: