Feb 2, 2008

OraFusion.com - Designing Efficient ETL Processes in Oracle

OraFusion.com - Designing Efficient ETL Processes in Oracle

Oracle Articles

Designing Efficient ETL Processes in Oracle:


Introduction:

Data conversion - the process of migrating application data - is a major component of many database related projects. This process essentially entails three sub-processes: 1) Extracting data from the source system, 2) Transforming data to the format required by the target system and 3) Loading data into the target system - hence the abbreviation ETL. Oracle offers a variety of database features for implementing ETL processes. Additionally, there is the relatively new Oracle Warehouse Builder - a full featured ETL tool. There are also other tools by third party vendors such as Informatica. However, despite this recent proliferation of ETL tools, many Oracle based data migration projects continue to use homegrown programs (mainly PL/SQL) for ETL. This is mainly because smaller projects cannot justify the cost and learning time required for effective use of commercial ETL tools. Besides, doing it oneself offers much more control over the process.

As with any programming project there are efficient and (infinitely more) inefficient ways to design and implement ETL processes. This article offers some tips for the design of efficient ETL programs in an Oracle environment. The tips are listed in no particular order. I hope you find them helpful in your projects

Tip 1: Use external tables to load data from external flat files

The traditional workhorse for loading external (flat file) data into Oracle is the SQL*Loader utility. Since version 9i, external tables offer an easier, and almost as performant, route to load data. Essentially, external tables enable us to access data in an external file via an Oracle table. Here's a simple example:
The following data is available in a text file called test.txt:

scott,cio
kailash,dba
jim,developer

We want to pull this data into Oracle using an external table. The first step is to create a directory object in Oracle, which points to the operating system directory in which the above file resides. Assume this is an operating system directory called c:\test. The SQL to make this directory accessible from Oracle is:

create or replace directory test_dir as 'c:\test';

The user must have the create any directory privilege in order to execute this SQL. We then define an external table based on our text file:

create table ext_test(
name varchar2(10),
job varchar2(10))
organization external
(type oracle_loader
default directory test_dir
access parameters (fields terminated by ',')
location (test_dir:'test.txt'))
reject limit 0

It would take us too far afield to go into each of the clauses in the above statement. Check the Oracle docs for further details. The relevant book is the Oracle Utilities Guide. There are also several third-party articles available online - do a google search on "oracle external table" to locate some of these.

To access the external data we simply issue the following SQL statement from within Oracle:

select * from ext_test;

There are several access parameter options available to handle more complicated record formats - check the Oracle documentation for details.

Tip 2: Use set-based operations to process data

Most ETL processes deal with large numbers of records. Each record must be cleaned at the field level and then loaded into the target system. Programmers unfamiliar with SQL tend to use procedural methods ("for" loops, for example) to deal with such situations. With some thought, however, it is often possible to construct an SQL statement that is equivalent to the procedural technique. The method using SQL should be preferred, as SQL processes the entire batch of records as a single set. Provided the statement isn't too complex, the SQL statement will generally run faster than the equivalent procedural code. Here's an (admittedly contrived) example to illustrate the point:

--execute from SQL Plus

set timing on

create table test (
id number,
name varchar2(128));

alter table test add constraint pk_test primary key (id);

declare
cursor c_test is
select
object_id,
object_name
from
all_objects;
begin
for r_test in c_test loop
insert into test
(id, name)
values
(r_test.object_id,lower(r_Test.object_name));
end loop;
commit;
end;
.
/

truncate table test;

declare
begin
insert into test (id, name)
select
object_id,
lower(object_name)
from
all_objects;
commit;
end;
.
/

You may need a larger number of rows to see an appreciable difference between the two methods. The advantage of the SQL based technique becomes apparent with greater data volumes.

Here are some tips that may help in converting procedural code to SQL:

  1. Replace procedural control statements with SQL decode or case: Often a procedural statement such as if....then....else can be replaced by a single SQL statement that uses the decode function or case statement. For simple conditionals the SQL option will almost always outperform procedural code. The decode function is useful when your branching criterion is based on discrete values - for example "Male" or "Female". The case statement would be preferred if the criterion is based on a continuous range - eg: any criterion that uses > or &lt. Here's an example of the two in action:

    --DECODE: Classify by gender

    select
    ename,decode(gender,'M','Male','F','Female','Unknown')
    from
    employees;

    --CASE: is the current day a weekend or weekday?

    select
    case when to_char(sysdate,'D')>'1' and to_char(sysdate,'D')<'6' then
    'Weekday'
    else
    'Weekend'
    end as day_type
    from
    dual;

  2. Pivoting: A common requirement of ETL processes is to pivot a table about a particular column, i.e. to convert rows to columns based on values in a specific column. This is easily achieved using the decode function. Consider the following example: we want to flatten out data in a table called SALES based on the MONTH column (which contains numbers ranging from 1 to 12). The SALES table has the following columns: REGION, YEAR, MONTH and SALES_DOLLARS (primary key REGION, YEAR and MONTH). The sales data is to be flattened out with each month's dollars appearing in a separate column - JAN_SALES, FEB_SALES .... DEC_SALES - appearing as a separate column. Here's how this can be done:

    select
    region
    year,
    min(decode(month,1,amount)) jan_sales,
    min(decode(month,2,amount)) feb_sales,
    min(decode(month,3,amount)) mar_sales,
    min(decode(month,4,amount)) apr_sales,
    min(decode(month,5,amount)) may_sales,
    min(decode(month,6,amount)) jun_sales,
    min(decode(month,7,amount)) jul_sales,
    min(decode(month,8,amount)) aug_sales,
    min(decode(month,9,amount)) sep_sales,
    min(decode(month,10,amount)) oct_sales,
    min(decode(month,11,amount)) nov_sales,
    min(decode(month,12,amount)) dec_sales
    from
    sales
    group by
    region,
    year

    The grouping is required to collapse twelve records for the region / year into a single record. Note that you can use any grouping function (AVG, MAX, SUM, for example) if all key columns other than MONTH are to be preserved in the flattened table. If this isn't so, you will have to use SUM as you will be summing over records for a fixed month. In either case you should be sure you understand what you're getting!

  3. "Unpivoting": This is the converse of the above. Here we want to move data from several columns into a single column, based on some criterion. Say we have a table called FLAT_SALES with columns REGION, YEAR, JAN_SALES,....DEC_SALES (produced by the above query). An ETL process requires us to get the data in normalised format, i.e. columns REGION, YEAR, MONTH, SALES_DOLLARS, as in the original SALES table. Here's a query that will achieve this:

    select
    t1.region region,
    t1.year year,
    t2.r month,
    decode(t2.r,
    1,jan_sales,
    2,feb_sales,
    3,mar_sales,
    4,apr_sales,
    5,may_sales,
    6,jun_sales,
    7,jul_sales,
    8,aug_sales,
    9,sep_sales,
    10,oct_sales,
    11,nov_sales,
    12,dec_sales) sales_dollars
    from
    flat_sales t1,
    (select
    rownum r
    from
    all_objects
    where
    rownum<=12) t2

    The table t2 serves to produce 12 rows for each original row via a cartesian product (no joins). The query uses a decode on rownum to pick out the sales value for each month from the flattened table. Note that we could have used any table or view with 12 or more rows in the definition of t2. I chose ALL_OBJECTS as this view is generally available to all schemas, and contains a large number of rows.

  4. Using built-in functions: One can achieve fairly complex data transformations using Oracle's character SQL functions. The nice thing about these functions is that they can be embedded in SQL, so you get the full advantage of set-based operations. The usual suspects include: LENGTH, LOWER, LPAD, REPLACE, RPAD, SUBSTR, UPPER, (L)(R)TRIM. Some "lesser known" functions I have found useful in my ETL efforts are:
    • CHR Returns the character associated with the specified ASCII code. This is useful when you want to remove non-printable characters such as carriage return (CR) and linefeed (LF) from input strings. You would do this using CHR in tandem with REPLACE like so:
      REPLACE(input_string,CHR(13)||CHR(10)), 13 and 10 being the ASCII codes for CR and LF respectively.
    • INITCAP This capitalizes the first letter in each word of the input string. Every other letter is lower cased.
    • INSTR Returns the location of a specified string within the input string.
    All these functions have a range of invocation options. Check the Oracle SQL documentation for full details.

  5. Analytics: Since version 8i, Oracle has introduced analytical extensions to SQL. These enable one to do a range of procedural operations using SQL (yes, you read that right - procedural operations using SQL). Some operations possible using analytical SQL include: sub-grouping (different levels of grouping within a statement), ranking and other statistics over subsets of data, comparing values in different rows, and presenting summary and detailed data using a single SQL statement. Check out my article on analytic SQL to find out more.
Tom Kyte's site is a great resource for tips on converting procedural code to SQL - try searching on the keywords "ETL" and "analytic functions" for a start.

OK, after extolling the virtues of using SQL we have to admit that there are situations in which procedural code becomes unavoidable. In such cases you can still take advantage of set based processing by using bulk binding within your procedural code. Sections of the Oracle documentation on bulk binding are available here and here (free registration required). An introductory article on bulk binding is available here (dbasupport.com).

Tip 3: TRUNCATE tables when deleting all data

Deleting all data from a table is best done using the truncate statement. A truncate runs faster than a delete because it simply resets the table's high water mark to zero. Truncate is a very quick operation. Indexes are also truncated along with the table. A truncate is DDL and therefore cannot be rolled back - be sure you consider this before using truncate. Note that tables referenced by enabled foreign key constraints cannot be truncated unless the keys are disabled first. Here's a block of code that uses truncate to clean out the SALES table (which is referenced by enabled foreign key constraints):

declare

cursor c_referenced_by is
select
t1.constraint_name constraint_name,
t1.table_name table_name
from
user_constraints t1,
user_constraints t2
where
t1.constraint_type='R'
and
t1.r_constraint_name=t2.constraint_name
and
t2.table_name ='SALES';

begin

for r_referenced_by in c_referenced_by loop

execute immediate
'alter table '||r_referenced_by.table_name||' disable constraint '
||r_referenced_by.constraint_name;

end loop;

execute immediate
'truncate table sales';

end;

Tip 4: Use direct path inserts for loading data

A direct path insert offers a quick way to load large volumes of data into a table. It does this by bypassing the buffer cache and writing directly to datafiles. A direct path insert is done using the insert into ...select.. idiom, together with an append hint. The append hint causes the rows to inserted above the table's high water mark - any free space below the high water mark is not used. Here is the syntax for a direct path insert into SALES from SALES_STAGE:

insert /*+ append*/ into
sales
(region,
year,
month,
sales_dollars)
select
region,
year,
month,
sales_dollars
from
sales_stage;

A direct path insert by itself is faster than a conventional insert. Just how much mileage one gets depends on the volume of data loaded. The performance of direct path inserts can be further enhanced by doing the following before the load:
  1. Bypassing constraint checks by disabling all constraints. Remember to re-enable these after the load is done.
  2. Suppressing redo generation by putting the table in nologging mode. If your table has a large number of non-unique indexes, you might also consider setting these to unusable state. This will suppress redo generation associated with index maintenance. The indexes can then be rebuilt after the load. Note that unique indexes should not be set unusable because the load will fail with an ORA-26026: unique index initially in unusable state error. Unique indexes can, however, be dropped before and recreated after the load. This is, in effect, what happens when the primary key constraint is disabled (as discussed in the previous point) - the unique index associated with the primary key is dropped.
    An important consequence of suppressing redo is that the operation is unrecoverable. Be sure to coordinate such operations with your DBA so that she can schedule a backup of the database or relevant tablespaces after the load.
Here's a block of code that bypasses constraints checks and suppresses redo generation for the above insert:

declare

cursor c_constraints is
select
table_name,
constraint_name
from
user_constraints
where
table_name='SALES';

cursor c_indexes is
select
index_name
from
user_indexes
where
table_name='SALES'
and
uniqueness<>'UNIQUE';

begin

for r_constraints in c_constraints loop

execute immediate
'alter table '||r_constraints.table_name||' disable constraint '
||r_constraints.constraint_name;

end loop;

--optional truncate to clean out the table
--disable referencing constraints, if needed (see tip 3)

--execute immediate
--'truncate table sales';

for r_indexes in c_indexes loop

execute immediate
'alter index '||r_indexes.index_name||' unusable';

end loop;

execute immediate
'alter session set skip_unusable_indexes=true';

execute immediate
'alter table sales nologging';

insert /*+ append*/ into
sales
(region,
year,
month,
sales_dollars)
select
region,
year,
month,
sales_dollars
from
sales_stage;

execute immediate
'alter session set skip_unusable_indexes=false';

for r_indexes in c_indexes loop

execute immediate
'alter index '||r_indexes.index_name||' rebuild nologging';

end loop;

execute immediate 'alter table sales logging';

for r_constraints in c_constraints loop

execute immediate
'alter table '||r_constraints.table_name||' enable constraint '

||r_constraints.constraint_name;

end loop;

commit;

end;

For small data volumes, the overhead of disabling constraints and indexes, and making the table nologging will swamp the benefits gained. In general, the larger the load the greater the benefit of the foregoing actions. As always, benchmark before implementation in your loads.
The above code uses a lot of dynamic SQL so it can be heavy on database resources. The use of dynamic SQL is unavoidable because we need to perform the operations within a module (PL/SQL block, procedure or package). However, this should not cause performance problems for business users because ETL batch processes normally run during off-peak hours.

Tip 5: Use the MERGE command for upserts

A common requirement is the need to perform an "upsert" - i.e.: update a row if it exists, insert it if it doesnt. In Oracle (versions 9i and better) this can be done in one step using the MERGE statement. Here's an example that uses the EMP table of SCOTT schema. The example uses a table EMP_STAGE that holds updated and new records that are to be upserted into EMP. You need to be logged into the SCOTT schema (or any other schema with a copy of SCOTT.EMP) in order to run the example:

--create EMP_STAGE

create table emp_stage as select * from emp where 1=2;

--insert update records in EMP_STAGE

insert into emp_stage values
(7369,'SMITH','CLERK',7902,to_date('17-12-1980','dd-mm-yyyy'),1800,NULL,20);

insert into emp_stage values
(7499,'ALLEN','SALESMAN',7698,to_date('20-2-1981','dd-mm-yyyy'),2200,300,30);

insert into emp_stage values
(7521,'WARD','SALESMAN',7698,to_date('22-2-1981','dd-mm-yyyy'),1250,500,30);

insert into emp_stage values
(7839,'KING','PRESIDENT',NULL,to_date('17-11-1981','dd-mm-yyyy'),9500,NULL,10);

insert into emp_stage values
(7782,'CLARK','MANAGER',7839,to_date('9-6-1981','dd-mm-yyyy'),8500,NULL,10);

--insert new records in EMP_STAGE

insert into emp_stage values
(7940,'WEBSTER','DBA',7782,to_date('23-1-1985','dd-mm-yyyy'),7000,NULL,10);

insert into emp_stage values
(7945,'HAMILL','DEVELOPER',7782,to_date('21-5-1985','dd-mm-yyyy'),6000,NULL,10);

insert into emp_stage values
(7950,'PINCHON','ANALYST',7782,to_date('20-10-1985','dd-mm-yyyy'),6000,NULL,10);

commit;

--MERGE records into EMP

merge into
emp e
using
emp_stage es
on
(e.empno=es.empno)
when matched then
update
set
e.sal=es.sal
when not matched then
insert
(e.empno,
e.ename,
e.job,
e.mgr,
e.hiredate,
e.sal,
e.comm,
e.deptno)
values
(es.empno,
es.ename,
es.job,
es.mgr,
es.hiredate,
es.sal,
es.comm,
es.deptno);

commit;

Three rows are inserted and five existing rows are updated by the above.
In Oracle 10g one can also add a conditional clauses to the insert and update portions of the merge statement. For example:

--MERGE records into EMP, except for DEVELOPERS

merge into
emp e
using
emp_stage es
on
(e.empno=es.empno)
when matched then
update
set
e.sal=es.sal
when not matched then
insert
(e.empno,
e.ename,
e.job,
e.mgr,
e.hiredate,
e.sal,
e.comm,
e.deptno)
values
(es.empno,
es.ename,
es.job,
es.mgr,
es.hiredate,
es.sal,
es.comm,
es.deptno)
where
es.job<>'DEVELOPER'
;

In this case the record for employee 7945 is not inserted.

The conditional clause can be in the update portion as well. In 10g it is also possible to delete rows from the destination table based on conditional criteria. Check the documentation for details.

Tip 6: Use heterogeneous services to access data in non-Oracle relational databases

Database links are often used to transfer small volumes of data between Oracle databases. It is less well known that database links can also be set up between Oracle and non-Oracle databases. This is a useful feature, as ETL processes often need to access and transfer data to Oracle from third-party databases such as MS SQL Server. The standard way to do this is by exporting data from the non-Oracle database to a flat file, and then importing the data into Oracle via SQL Loader or external tables. Oracle Heterogeneous Services provides a single-step option to achieve the transfer. Heterogeneous Services come in two flavours:
  1. Generic Heterogeneous Services: This option, which is bundled with the Oracle Server uses ODBC to connect to the non-Oracle database. A tutorial on accessing SQL Server using Generic Heterogeneous Services is available here .
  2. Transparent Gateways: These are extra cost options that are optimised for specific databases. They offer better performance compared to the Generic option because they are designed to exploit optimisations and features specific to particular databases. The mechanics of setting up Transparent Gateways is quite similar to the Generic option. Check the documentation that comes with the specific Transparent Gateway for further details. The documentation is hidden away in the relevant gateway installation directory. For example, the documentation for the MS SQL Server gateway sits in ORACLE_HOME/tg4msql.A tutorial on accessing SQL Server using Transparent Gateways is available here .
A warning on data transfer techniques that uses database links: the efficiency of the transfer depends on your network bandwidth and the quantity of data to be transferred. I have used it (with great success) to transfer small to moderate quantities of data (~100000 rows, rowsize ~1000 bytes) over co-located machines within corporate networks.

Closing Remarks

ETL processes present a technical challenge as they entail complex transformations and loads of large quantities of data within ever-shrinking time windows. As developers we need to use all the tricks in the book to speed up our ETL procedures. In this article I've outlined some of the techniques that I have used, with success, in several projects. I hope you find them useful in your work.

Back to the top

No comments:

Post a Comment