Designing Efficient ETL Processes in Oracle

Introduction

Data conversion - the process of migrating application data - is a major component of many database-related projects. The process essentially entails three sub-processes: 1) extracting data from the source system, 2) transforming data to the format required by the target system, and 3) loading data into the target system - hence the abbreviation ETL. Oracle offers a variety of database features for implementing ETL processes. Additionally, there is the relatively new Oracle Warehouse Builder - a full-featured ETL tool - and there are tools by third-party vendors such as Informatica. Despite this proliferation of ETL tools, however, many Oracle-based data migration projects continue to use homegrown programs (mainly PL/SQL) for ETL. This is largely because smaller projects cannot justify the cost and learning time required for effective use of commercial ETL tools. Besides, doing it oneself offers much more control over the process.

As with any programming project, there are efficient and (infinitely more) inefficient ways to design and implement ETL processes. This article offers some tips for the design of efficient ETL programs in an Oracle environment. The tips are listed in no particular order. I hope you find them helpful in your projects.

Tip 1: Use external tables to load data from external flat files

The traditional workhorse for loading external (flat file) data into Oracle is the SQL*Loader utility. Since version 9i, external tables offer an easier, and almost as performant, route. Essentially, external tables enable us to access data in an external file via an Oracle table. Here's a simple example. The following data is available in a text file called test.txt:

scott,cio
kailash,dba
jim,developer

We want to pull this data into Oracle using an external table. The first step is to create a directory object in Oracle, which points to the operating system directory in which the above file resides. Assume this is an operating system directory called c:\test. The SQL to make this directory accessible from Oracle is:

create or replace directory test_dir as 'c:\test';

The user must have the create any directory privilege in order to execute this SQL. We then define an external table based on our text file:

create table ext_test(
  name varchar2(10),
  job varchar2(10))
organization external
  (type oracle_loader
   default directory test_dir
   access parameters (fields terminated by ',')
   location (test_dir:'test.txt'))
reject limit 0;

It would take us too far afield to go into each of the clauses in the above statement. Check the Oracle docs for further details - the relevant book is the Oracle Utilities Guide. There are also several third-party articles available online; a Google search on "oracle external table" will locate some of these.

To access the external data we simply issue the following SQL statement from within Oracle:

select * from ext_test;

There are several access parameter options available to handle more complicated record formats - check the Oracle documentation for details.
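To give a flavour of what is available, here is a slightly more defensive version of the same external table (a sketch; the bad file and log file names are my own choices, not from the original example):

create table ext_test2(
  name varchar2(10),
  job varchar2(10))
organization external
  (type oracle_loader
   default directory test_dir
   access parameters
     (records delimited by newline
      --rows that fail conversion go to the bad file instead of aborting the load
      badfile test_dir:'test.bad'
      logfile test_dir:'test.log'
      fields terminated by ',' optionally enclosed by '"'
      missing field values are null)
   location (test_dir:'test.txt'))
reject limit unlimited;

Rows that fail conversion are diverted to the bad file and noted in the log file rather than failing the whole load - useful when the incoming feed is not entirely trustworthy.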
Tip 2: Use set-based operations to process data

Most ETL processes deal with large numbers of records. Each record must be cleaned at the field level and then loaded into the target system. Programmers unfamiliar with SQL tend to use procedural methods ("for" loops, for example) to deal with such situations. With some thought, however, it is often possible to construct an SQL statement that is equivalent to the procedural technique. The SQL version should be preferred, as SQL processes the entire batch of records as a single set. Provided the statement isn't too complex, the SQL will generally run faster than the equivalent procedural code. Here's an (admittedly contrived) example to illustrate the point:

--execute from SQL*Plus
set timing on

create table test (
  id number,
  name varchar2(128));

alter table test add constraint pk_test primary key (id);

--procedural version: row-by-row inserts in a cursor loop
declare
  cursor c_test is
    select object_id, object_name from all_objects;
begin
  for r_test in c_test loop
    insert into test (id, name)
    values (r_test.object_id, lower(r_test.object_name));
  end loop;
  commit;
end;
/

truncate table test;

--set-based version: a single insert...select
insert into test (id, name)
select object_id, lower(object_name) from all_objects;
commit;

You may need a larger number of rows to see an appreciable difference between the two methods; the advantage of the SQL-based technique becomes apparent with greater data volumes.

After extolling the virtues of SQL, we have to admit that there are situations in which procedural code becomes unavoidable. In such cases you can still take advantage of set-based processing by using bulk binding within your procedural code - the relevant sections of the Oracle documentation (and several introductory articles online) cover the feature in detail.
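To give a concrete flavour of bulk binding, here is a minimal sketch (my own, not from the original article) that loads the same TEST table in batches of 1000 rows using BULK COLLECT and FORALL:

declare
  cursor c_test is
    select object_id, object_name from all_objects;
  type t_id_tab is table of test.id%type;
  type t_name_tab is table of test.name%type;
  l_ids   t_id_tab;
  l_names t_name_tab;
begin
  open c_test;
  loop
    --fetch the next batch of up to 1000 rows into the collections
    fetch c_test bulk collect into l_ids, l_names limit 1000;
    exit when l_ids.count = 0;
    --bind the whole batch into a single insert execution
    forall i in 1 .. l_ids.count
      insert into test (id, name)
      values (l_ids(i), lower(l_names(i)));
  end loop;
  close c_test;
  commit;
end;
/

Each FORALL statement hands the whole batch to the SQL engine in a single context switch, which is where the speedup over the row-by-row loop comes from.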
Tip 3: TRUNCATE tables when deleting all data

Deleting all data from a table is best done using the truncate statement. A truncate runs faster than a delete because it simply resets the table's high water mark rather than removing rows one at a time, making it a very quick operation regardless of the table's size. Indexes are truncated along with the table. Two caveats, though. First, truncate is DDL and therefore cannot be rolled back - be sure you consider this before using it. Second, a table referenced by enabled foreign key constraints cannot be truncated unless those constraints are disabled first. Here's a block of code that uses truncate to clean out the SALES table (which is referenced by enabled foreign key constraints):

declare
  cursor c_referenced_by is
    select t1.constraint_name constraint_name,
           t1.table_name table_name
    from user_constraints t1, user_constraints t2
    where t1.constraint_type = 'R'
    and t1.r_constraint_name = t2.constraint_name
    and t2.table_name = 'SALES';
begin
  --disable all foreign keys that reference SALES
  for r_referenced_by in c_referenced_by loop
    execute immediate 'alter table '||r_referenced_by.table_name||
      ' disable constraint '||r_referenced_by.constraint_name;
  end loop;
  execute immediate 'truncate table sales';
end;
/

Remember to re-enable the disabled constraints once the table (and any child tables) have been reloaded - the next tip shows how to enable constraints dynamically.

Tip 4: Use direct path inserts for loading data

A direct path insert offers a quick way to load large volumes of data into a table. It does this by bypassing the buffer cache and writing directly to the datafiles. A direct path insert is done using the insert into...select... idiom together with an append hint. The append hint causes the rows to be inserted above the table's high water mark - any free space below the high water mark is not used. Here is the syntax for a direct path insert into SALES from SALES_STAGE:

insert /*+ append */ into
  sales (region, year, month, sales_dollars)
select region, year, month, sales_dollars
from sales_stage;

A direct path insert by itself is faster than a conventional insert; just how much mileage one gets depends on the volume of data loaded.
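Incidentally, there is a convenient (if unofficial) way to confirm that the append hint was actually honoured: after a direct path insert, Oracle will not let the same transaction read the table again until you commit:

insert /*+ append */ into
  sales (region, year, month, sales_dollars)
select region, year, month, sales_dollars
from sales_stage;

select count(*) from sales;
--fails with ORA-12838: cannot read/modify an object after modifying it in parallel

commit;
--after the commit the table can be queried as usual

If the query succeeds instead, the hint was silently ignored (this happens, for example, when the table has enabled foreign keys or active triggers) and the insert was performed conventionally.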
The performance of direct path inserts can be further enhanced by doing the following before the load (each step appears in the code below):

- disable the constraints on the target table;
- mark the table's non-unique indexes unusable (and tell the session to skip unusable indexes), rebuilding them after the load;
- put the table in NOLOGGING mode so the load generates minimal redo, restoring LOGGING afterwards.

declare
  cursor c_constraints is
    select table_name, constraint_name
    from user_constraints
    where table_name = 'SALES';
  cursor c_indexes is
    select index_name
    from user_indexes
    where table_name = 'SALES'
    and uniqueness <> 'UNIQUE';
begin
  --disable constraints on the target table
  for r_constraints in c_constraints loop
    execute immediate 'alter table '||r_constraints.table_name||
      ' disable constraint '||r_constraints.constraint_name;
  end loop;
  --optional truncate to clean out the table
  --(disable referencing constraints first, if needed - see tip 3)
  --execute immediate 'truncate table sales';
  --mark non-unique indexes unusable; they are rebuilt after the load
  for r_indexes in c_indexes loop
    execute immediate 'alter index '||r_indexes.index_name||' unusable';
  end loop;
  execute immediate 'alter session set skip_unusable_indexes=true';
  execute immediate 'alter table sales nologging';
  insert /*+ append */ into
    sales (region, year, month, sales_dollars)
  select region, year, month, sales_dollars
  from sales_stage;
  execute immediate 'alter session set skip_unusable_indexes=false';
  --rebuild the indexes and restore logging
  for r_indexes in c_indexes loop
    execute immediate 'alter index '||r_indexes.index_name||' rebuild nologging';
  end loop;
  execute immediate 'alter table sales logging';
  for r_constraints in c_constraints loop
    execute immediate 'alter table '||r_constraints.table_name||
      ' enable constraint '||r_constraints.constraint_name;
  end loop;
  commit;
end;
/

For small data volumes, the overhead of disabling constraints and indexes, and of making the table NOLOGGING, will swamp the benefits gained. In general, the larger the load, the greater the benefit of the foregoing actions. As always, benchmark before implementing this in your own loads.

The above code uses a lot of dynamic SQL, so it can be heavy on database resources. The use of dynamic SQL is unavoidable because we need to perform the DDL operations from within a module (PL/SQL block, procedure or package). However, this should not cause performance problems for business users, because ETL batch processes normally run during off-peak hours.
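If you want to see the effect of NOLOGGING and the append hint for yourself, one simple benchmark (a sketch; it assumes your account can query the V$ views) is to measure the redo your session generates during each variant of the load:

--redo generated by the current session so far
select s.value redo_bytes
from v$mystat s, v$statname n
where s.statistic# = n.statistic#
and n.name = 'redo size';

Run the query before and after the load and subtract the two values; the direct path NOLOGGING version should generate far less redo than a conventional insert.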
Tip 5: Use the MERGE command for upserts

A common requirement is the need to perform an "upsert" - that is, update a row if it exists, insert it if it doesn't. In Oracle (versions 9i and later) this can be done in one step using the MERGE statement. Here's an example that uses the EMP table of the SCOTT schema, together with a table EMP_STAGE that holds the updated and new records to be upserted into EMP. You need to be logged into the SCOTT schema (or any other schema with a copy of SCOTT.EMP) in order to run the example:

--create EMP_STAGE
create table emp_stage as select * from emp where 1=2;

--insert update records in EMP_STAGE
insert into emp_stage values (7369,'SMITH','CLERK',7902,to_date('17-12-1980','dd-mm-yyyy'),1800,NULL,20);
insert into emp_stage values (7499,'ALLEN','SALESMAN',7698,to_date('20-2-1981','dd-mm-yyyy'),2200,300,30);
insert into emp_stage values (7521,'WARD','SALESMAN',7698,to_date('22-2-1981','dd-mm-yyyy'),1250,500,30);
insert into emp_stage values (7839,'KING','PRESIDENT',NULL,to_date('17-11-1981','dd-mm-yyyy'),9500,NULL,10);
insert into emp_stage values (7782,'CLARK','MANAGER',7839,to_date('9-6-1981','dd-mm-yyyy'),8500,NULL,10);

--insert new records in EMP_STAGE
insert into emp_stage values (7940,'WEBSTER','DBA',7782,to_date('23-1-1985','dd-mm-yyyy'),7000,NULL,10);
insert into emp_stage values (7945,'HAMILL','DEVELOPER',7782,to_date('21-5-1985','dd-mm-yyyy'),6000,NULL,10);
insert into emp_stage values (7950,'PINCHON','ANALYST',7782,to_date('20-10-1985','dd-mm-yyyy'),6000,NULL,10);
commit;

--MERGE records into EMP
merge into emp e
using emp_stage es
on (e.empno = es.empno)
when matched then
  update set e.sal = es.sal
when not matched then
  insert (e.empno, e.ename, e.job, e.mgr, e.hiredate, e.sal, e.comm, e.deptno)
  values (es.empno, es.ename, es.job, es.mgr, es.hiredate, es.sal, es.comm, es.deptno);
commit;

Three rows are inserted and five existing rows are updated by the above.

In Oracle 10g one can also add conditional clauses to the insert and update portions of the merge statement. For example:

--MERGE records into EMP, except for DEVELOPERS
merge into emp e
using emp_stage es
on (e.empno = es.empno)
when matched then
  update set e.sal = es.sal
when not matched then
  insert (e.empno, e.ename, e.job, e.mgr, e.hiredate, e.sal, e.comm, e.deptno)
  values (es.empno, es.ename, es.job, es.mgr, es.hiredate, es.sal, es.comm, es.deptno)
  where es.job <> 'DEVELOPER';

In this case the record for employee 7945 is not inserted. The conditional clause can be in the update portion as well. In 10g it is also possible to delete rows from the destination table based on conditional criteria - check the documentation for details.
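For illustration, here is one plausible shape for such a statement (my own sketch, not from the original example) - the DELETE clause hangs off the update branch:

--update matched rows, then remove any updated row belonging to a developer
merge into emp e
using emp_stage es
on (e.empno = es.empno)
when matched then
  update set e.sal = es.sal
  delete where (e.job = 'DEVELOPER')
when not matched then
  insert (e.empno, e.ename, e.job, e.mgr, e.hiredate, e.sal, e.comm, e.deptno)
  values (es.empno, es.ename, es.job, es.mgr, es.hiredate, es.sal, es.comm, es.deptno);

The DELETE WHERE condition is evaluated against rows after they have been updated, and only rows actually touched by the update branch are candidates for deletion. With the sample data above no matched row in EMP has job DEVELOPER, so nothing is deleted - the clause here is purely illustrative.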
Tip 6: Use heterogeneous services to access data in non-Oracle relational databases

Database links are often used to transfer small volumes of data between Oracle databases. It is less well known that database links can also be set up between Oracle and non-Oracle databases. This is a useful feature, as ETL processes often need to access and transfer data to Oracle from third-party databases such as MS SQL Server. The standard way to do this is to export data from the non-Oracle database to a flat file, and then import the data into Oracle via SQL*Loader or external tables. Oracle Heterogeneous Services provides a single-step option to achieve the transfer. Heterogeneous Services come in two flavours: Generic Connectivity, which connects through ODBC or OLE DB drivers and is included with the database, and Transparent Gateways, which are separately licensed products tailored to specific non-Oracle databases.
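Once the heterogeneous connection is configured, the remote database is queried through an ordinary database link. The sketch below is hypothetical - the link name, credentials and TNS alias are placeholders, and the alias must point to a listener entry configured for the Heterogeneous Services agent:

--link name, credentials and TNS alias are placeholders
create database link mssql_link
connect to "etl_user" identified by "etl_password"
using 'mssql_dsn';

--query the remote SQL Server table through the link
select "OrderID", "OrderDate"
from "Orders"@mssql_link;

Note that case-sensitive object and column names on the non-Oracle side typically have to be enclosed in double quotes, as shown.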
Closing Remarks

ETL processes present a technical challenge, as they entail complex transformations and loads of large quantities of data within ever-shrinking time windows. As developers we need to use all the tricks in the book to speed up our ETL procedures. In this article I've outlined some of the techniques that I have used, with success, in several projects. I hope you find them useful in your work.

Feb 2, 2008