Feb 29, 2008

Ten Super-Useful Commands in Linux


Linux has many very useful commands, such as ps, top, vmstat and so on; these are widely used and well known. There are other commands that are extremely useful but far less widely used. If you take full advantage of them, they can sometimes accomplish complex tasks with surprisingly little effort. Here are a few of them:

Could the command-line tools you've forgotten or never knew save time and some frustration?

One incarnation of the so-called 80/20 rule has been associated with software systems. It has been observed that 80% of a user population regularly uses only 20% of a system's features. Without backing this up with hard statistics, my 20+ years of building and using software systems tell me that this hypothesis is probably true. The collection of Linux command-line programs is no exception to this generalization. Of the dozens of shell-level commands offered by Linux, perhaps only ten are commonly understood and utilized, and the remaining majority are virtually ignored.

Which of these dogs of the Linux shell have the most value to offer? I'll briefly describe ten of the less popular but useful Linux shell commands, those from which I have gotten some mileage over the years. Specifically, I've chosen to focus on commands that parse and format textual content.

The working examples presented here assume a basic familiarity with command-line syntax, simple shell constructs and some of the not-so-uncommon Linux commands. Even so, the command-line examples are fairly well commented and straightforward. Whenever practical, the output of usage examples is presented under each command-line execution.

The following eight commands parse, format and display textual content. Although not all of the provided examples demonstrate this, be aware that each of these commands will read from standard input if no file arguments are presented.

  • Head/Tail

Display a given number of lines from the beginning or the end of a file; tail can also follow a file as it grows, which makes it well suited to watching log files as they change.

As their names imply, head and tail are used to display some amount of the top or bottom of a text block. head presents the beginning of a file to standard output while tail does the same with the end of a file. Review the following commented examples:

## (1) displays the first 6 lines of a file
head -6 readme.txt
## (2) displays the last 25 lines of a file
tail -25 mail.txt

Here's an example of using head and tail in concert to display the 11th through 20th line of a file.

# (3)
head -20 file | tail -10

Manual pages show that the tail command has more command-line options than head. One of the more useful tail options is -f. When it is used, tail does not return when end-of-file is detected; instead, it sleeps for a period and checks for new lines of data that may have been appended since the last read, and continues doing so until it is explicitly interrupted.

## (4) display ongoing updates to the given
## log file

tail -f /usr/tmp/logs/daemon_log.txt

Imagine that a dæmon process was continually appending activity logs to the /usr/tmp/logs/daemon_log.txt file. Using tail -f at a console window, for example, will more or less track all updates to the file in real time. (The -f option is applicable only when tail's input is a file).

If you give multiple arguments to tail, you can track several log files in the same window.

## track the mail log and the server error log
## at the same time.

tail -f /var/log/mail.log /var/log/apache/error_log

  • tac–Concatenate in Reverse

Display a file (or standard input) with the line order reversed; the name is easy to remember as cat spelled backwards.

What is cat spelled backwards? Well, that's what tac's functionality is all about. It concatenates files just as cat does, but with the line order reversed. So what's its usefulness? It can be used on any task that requires ordering elements in a last-in, first-out (LIFO) manner. Consider the following command line to list the three most recently established user accounts, from the most recent through the least recent.

# (5) last 3 /etc/passwd records - in reverse
$ tail -3 /etc/passwd | tac
curly:x:1003:100:3rd Stooge:/homes/curly:/bin/ksh
larry:x:1002:100:2nd Stooge:/homes/larry:/bin/ksh
moe:x:1001:100:1st Stooge:/homes/moe:/bin/ksh

  • nl–Numbered Line Output

nl is a simple line-numbering filter: it prefixes each input line with a line number, lets you choose the starting number, the separator character and more, as the examples below show.
nl displays input with each line numbered in the left margin, in a format dictated by command-line options. It provides a plethora of options that specify every detail of its numbered output. The following commented examples demonstrate some of those options:

# (6) Display the first 4 entries of the password
# file - numbers to be three columns wide and
# padded by zeros.
$ head -4 /etc/passwd | nl -nrz -w3
001	root:x:0:1:Super-User:/:/bin/ksh
002	daemon:x:1:1::/:
003	bin:x:2:2::/usr/bin:
004	sys:x:3:3::/:
#
# (7) Prepend ordered line numbers followed by an
# '=' sign to each line — start at 101.
$ nl -s= -v101 Data.txt
101=1st Line …
102=2nd Line …
103=3rd Line …
104=4th Line …
105=5th Line …
…….

  • fmt–format

fmt reformats a file to fit within a specified line width so that it displays neatly; a few other useful options are available as well, as described below.
The fmt command is a simple text formatter that focuses on making textual data conform to a maximum line width. It accomplishes this by joining and breaking lines around white space. Imagine that you need to maintain textual content that was generated with a word processor. The exported text may contain lines whose lengths vary from very short to much longer than a standard screen length. If such text is to be maintained in a text editor (like vi), fmt is the command of choice to transform the original text into a more maintainable format. The first example below shows fmt being asked to reformat file contents as text lines no greater than 60 characters long.

# (8) No more than 60 char lines
$ fmt -w 60 README.txt > NEW_README.txt
#
# (9) Force uniform spacing:
# 1 space between words, 2 between sentences
$ echo "Hello World. Hello Universe." |
fmt -u -w80
Hello World. Hello Universe.

  • fold–Break Up Input

fold is somewhat similar to fmt, but it is used in more specialised situations, as described below.
fold is similar to fmt but is used typically to format data that will be used by other programs, rather than to make the text more readable to the human eye. The commented examples below are fairly easy to follow:

# (10) format text in 3 column width lines
$ echo oxoxoxoxo | fold -w3
oxo
xox
oxo
# (11) Parse by triplet-char strings -
# search for 'xox'
$ echo oxoxoxoxo | fold -w3 | grep "xox"
xox
# (12) One way to iterate through a string of chars
$ for i in $(echo 12345 | fold -w1)
> do
> ### perform some task …
> print $i
> done
1
2
3
4
5

The example above is not the most efficient way to do this; a better approach is:

for i in $(seq 5); do print $i; done

seq(1) allows you to define the start, stop, step and more.
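
For example (the numbers here are arbitrary), seq can produce a stepped or zero-padded sequence directly:

## count from 2 to 10 in steps of 2
$ seq 2 2 10
2
4
6
8
10
## equalize widths by padding with leading zeros
$ seq -w 8 12
08
09
10
11
12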

  • tr

tr is a simple pattern translator. Its practical application overlaps a bit with other, more complex tools, such as sed and awk [with larger binary footprints]. tr is quite useful for simple textual replacements, deletions and additions. Its behavior is dictated by "from" and "to" character sets provided as the first and second argument. The general usage syntax of tr is as follows:

# (13) tr usage
tr [options] "set1" ["set2"] < input > output

Note that tr does not accept file arguments; it reads from standard input and writes to standard output. When two character sets are provided, tr operates on the characters contained in "set1" and performs some amount of substitution based on "set2".
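
Here are a few quick invocations that show how the two sets interact; the file names are placeholders:

## (13a) upper-case the input
$ echo "Hello World" | tr 'a-z' 'A-Z'
HELLO WORLD
## (13b) delete carriage returns from a DOS-format text file
$ tr -d '\r' < dos_file.txt > unix_file.txt
## (13c) squeeze runs of spaces down to a single space
$ echo "too    many     spaces" | tr -s ' '
too many spaces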

  • pr

Compared with tr and the commands above, pr is mainly concerned with formatting text for printing.
pr shares features with simpler commands like nl and fmt, but its command-line options make it ideal for converting text files into a format that's suitable for printing. pr offers options that allow you to specify page length, column width, margins, headers/footers, double line spacing and more.

Aside from being the best suited formatter for printing tasks, pr also offers other useful features. These features include allowing you to view multiple files vertically in adjacent columns or columnizing a list in a fixed number of columns.
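
To give a feel for typical invocations (the file names and header text below are placeholders), here are a few common uses:

## paginate a report with a custom header and 60-line pages
$ pr -h "Quarterly Report" -l60 report.txt > report_for_print.txt
## view two files side by side in adjacent columns
$ pr -m -w100 notes_a.txt notes_b.txt
## columnize a directory listing into 3 columns, no headers or trailers
$ ls /usr/bin | pr -3 -t -w75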

  • Miscellaneous

The following two commands are specialized parsers used to pick apart file path pieces.

Basename/Dirname
The basename and dirname commands are useful for presenting portions of a given file path. Quite often in scripting situations, it's convenient to be able to parse and capture a file name or the containing-directory name portions of a file path. These commands reduce this task to a simple one-line command. (There are other ways to approach this using the Korn shell or sed "magic", but basename and dirname are more portable and straightforward).

basename is used to strip off the directory, and optionally, the file suffix parts of a file path. Consider the following trivial examples:

# (14) Parse out the Java class name
$ basename /usr/local/src/java/TheClass.java .java
TheClass
# (15) Parse out the file name.
$ basename srcs/C/main.c
main.c

dirname is used to display the containing directory path, as much of the path as is provided. Consider the following examples:

# (16) absolute and relative directory examples
$ dirname /homes/curly/.profile
/homes/curly
$ dirname curly/.profile
curly
#
# (17) From any korn-shell script, the following
# line will assign the directory from where
# the script was launched
SCRIPT_HOME="$(dirname $(whence $0))"
#
# (18)
# Okay, how about a non-trivial practical example?
# List all directories (under $PWD that contain a
# file called 'core'.
$ for i in $(find $PWD -name core)
> do
> dirname $i
> done | sort -u
bin
rje/gcc
src/C

That wraps up these commands. I hope they prove useful in your work; for more detailed usage, consult the man pages.

Author: RickyZhu  From: Ricky's Space, RickyZhu Blog


A nice piece; the author clearly has a solid grounding in Linux, so I'm reposting it here in full.



Haha, search 周详珍 via Google

 
 

Happy Chinese New Year!

New Year's Eve alone, New Year's Eve far from home

I turned on the lights in every room

Leaning on the balcony, I watched a few little boys running around downstairs, setting off firecrackers

Fireworks filled the sky, lighting up the night over this city

Crackling and popping, announcing the arrival of the new year

Gazing at the graceful arcs the fireworks traced

Childhood scenes of my family gathered around the stove flickered faintly against the night sky

Looking up at the stars, that deep sky still holds so many of the dreams of my youth

...

A rare heavy snowfall has given me the experience of spending the New Year alone, far from home

I have drifted through city after city, and the girl I loved for years is now another man's wife

Life, who can ever explain it?

...

Back in my room, I wanted to listen to 水木年华's <在他乡>

A loud, bustling city; a night both lively and quiet; and me, still carrying my dreams

I said to myself: Happy New Year!

February 07, From: zhouxiangz.space.live.com



After all these years, are you still drifting from place to place?

"Perhaps this really is an age of wandering: some of us are destined to be on the road, some are preparing to set out, and others, though they have already stopped, carry memories of wandering that long ago entered their blood and became part of their lives. We have a thousand reasons to head for somewhere far away: a poor family, the hope of a more decent life, or simply the longing to roam to the ends of the earth...
Wandering gives us a shared backdrop to our lives: a place far from home. On that unfamiliar soil we have laughed, we have poured out sweat, and we have shed tears. Those unfamiliar cities carry the glory and the dreams, the humiliation and the struggles, and the unforgettable love of countless people like you and me. We have surely left part of our youth behind in those places; perhaps no one notices and no one cares, but a quiet patch of grass always knows, and so does a shaft of sunlight falling through the window of a rented room. These are the bygone days treasured deep within the weathered years.
We change these far-off cities in our own humble ways, and they change us in ways far more powerful, indifferent or seductive: the course of our lives, our feelings and our future, and the way we look at this world.
A life that has never wandered is a bland one; a life lost in wandering is a pitiable one. The city away from home stands like cast iron while you and I flow through it like water. Which city will you pause in tomorrow? The long streets you walk in silence, pack on your back, may hold nothing but the din of traffic. The sky you gaze at through the gaps between buildings may hold no familiar flock of birds. But you must never forget: whatever you choose, be yourself and do not let yourself down, and give the effort and the courage that this demands. We are all wanderers; that was settled the moment we were born into this age. Let us wish each other well and give each other our deepest blessings."
(I came across this piece in a magazine one day and something quietly stirred in me. I liked it, so I'm reposting it here for all of you who, like me, are drifting far from home.)
From: my friend 北北's blog

Feb 2, 2008

OraFusion.com - Designing Efficient ETL Processes in Oracle


Designing Efficient ETL Processes in Oracle:


Introduction:

Data conversion - the process of migrating application data - is a major component of many database related projects. This process essentially entails three sub-processes: 1) Extracting data from the source system, 2) Transforming data to the format required by the target system and 3) Loading data into the target system - hence the abbreviation ETL. Oracle offers a variety of database features for implementing ETL processes. Additionally, there is the relatively new Oracle Warehouse Builder - a full featured ETL tool. There are also other tools by third party vendors such as Informatica. However, despite this recent proliferation of ETL tools, many Oracle based data migration projects continue to use homegrown programs (mainly PL/SQL) for ETL. This is mainly because smaller projects cannot justify the cost and learning time required for effective use of commercial ETL tools. Besides, doing it oneself offers much more control over the process.

As with any programming project there are efficient and (infinitely more) inefficient ways to design and implement ETL processes. This article offers some tips for the design of efficient ETL programs in an Oracle environment. The tips are listed in no particular order. I hope you find them helpful in your projects.

Tip 1: Use external tables to load data from external flat files

The traditional workhorse for loading external (flat file) data into Oracle is the SQL*Loader utility. Since version 9i, external tables offer an easier, and almost as performant, route to load data. Essentially, external tables enable us to access data in an external file via an Oracle table. Here's a simple example:
The following data is available in a text file called test.txt:

scott,cio
kailash,dba
jim,developer

We want to pull this data into Oracle using an external table. The first step is to create a directory object in Oracle, which points to the operating system directory in which the above file resides. Assume this is an operating system directory called c:\test. The SQL to make this directory accessible from Oracle is:

create or replace directory test_dir as 'c:\test';

The user must have the create any directory privilege in order to execute this SQL. We then define an external table based on our text file:

create table ext_test(
name varchar2(10),
job varchar2(10))
organization external
(type oracle_loader
default directory test_dir
access parameters (fields terminated by ',')
location (test_dir:'test.txt'))
reject limit 0;

It would take us too far afield to go into each of the clauses in the above statement. Check the Oracle docs for further details. The relevant book is the Oracle Utilities Guide. There are also several third-party articles available online - do a google search on "oracle external table" to locate some of these.

To access the external data we simply issue the following SQL statement from within Oracle:

select * from ext_test;

There are several access parameter options available to handle more complicated record formats - check the Oracle documentation for details.
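
For instance, a slightly richer access parameter clause can skip a header record, allow quoted fields and tolerate missing trailing fields. The file name and columns below are invented purely for illustration:

create table ext_emp(
name varchar2(10),
job varchar2(10))
organization external
(type oracle_loader
default directory test_dir
access parameters
(records delimited by newline
skip 1
fields terminated by ',' optionally enclosed by '"'
missing field values are null)
location ('emp.csv'))
reject limit unlimited;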

Tip 2: Use set-based operations to process data

Most ETL processes deal with large numbers of records. Each record must be cleaned at the field level and then loaded into the target system. Programmers unfamiliar with SQL tend to use procedural methods ("for" loops, for example) to deal with such situations. With some thought, however, it is often possible to construct an SQL statement that is equivalent to the procedural technique. The method using SQL should be preferred, as SQL processes the entire batch of records as a single set. Provided the statement isn't too complex, the SQL statement will generally run faster than the equivalent procedural code. Here's an (admittedly contrived) example to illustrate the point:

--execute from SQL Plus

set timing on

create table test (
id number,
name varchar2(128));

alter table test add constraint pk_test primary key (id);

declare
cursor c_test is
select
object_id,
object_name
from
all_objects;
begin
for r_test in c_test loop
insert into test
(id, name)
values
(r_test.object_id,lower(r_Test.object_name));
end loop;
commit;
end;
.
/

truncate table test;

declare
begin
insert into test (id, name)
select
object_id,
lower(object_name)
from
all_objects;
commit;
end;
.
/

You may need a larger number of rows to see an appreciable difference between the two methods. The advantage of the SQL based technique becomes apparent with greater data volumes.

Here are some tips that may help in converting procedural code to SQL:

  1. Replace procedural control statements with SQL decode or case: Often a procedural statement such as if....then....else can be replaced by a single SQL statement that uses the decode function or case statement. For simple conditionals the SQL option will almost always outperform procedural code. The decode function is useful when your branching criterion is based on discrete values - for example "Male" or "Female". The case statement would be preferred if the criterion is based on a continuous range - e.g., any criterion that uses > or <. Here's an example of the two in action:

    --DECODE: Classify by gender

    select
    ename,decode(gender,'M','Male','F','Female','Unknown')
    from
    employees;

    --CASE: is the current day a weekend or weekday?

    select
    case when to_char(sysdate,'D')>'1' and to_char(sysdate,'D')<'7' then
    'Weekday'
    else
    'Weekend'
    end as day_type
    from
    dual;

  2. Pivoting: A common requirement of ETL processes is to pivot a table about a particular column, i.e. to convert rows to columns based on values in a specific column. This is easily achieved using the decode function. Consider the following example: we want to flatten out data in a table called SALES based on the MONTH column (which contains numbers ranging from 1 to 12). The SALES table has the following columns: REGION, YEAR, MONTH and SALES_DOLLARS (primary key REGION, YEAR and MONTH). The sales data is to be flattened out with each month's dollars appearing in a separate column - JAN_SALES, FEB_SALES .... DEC_SALES. Here's how this can be done:

    select
    region,
    year,
    min(decode(month,1,sales_dollars)) jan_sales,
    min(decode(month,2,sales_dollars)) feb_sales,
    min(decode(month,3,sales_dollars)) mar_sales,
    min(decode(month,4,sales_dollars)) apr_sales,
    min(decode(month,5,sales_dollars)) may_sales,
    min(decode(month,6,sales_dollars)) jun_sales,
    min(decode(month,7,sales_dollars)) jul_sales,
    min(decode(month,8,sales_dollars)) aug_sales,
    min(decode(month,9,sales_dollars)) sep_sales,
    min(decode(month,10,sales_dollars)) oct_sales,
    min(decode(month,11,sales_dollars)) nov_sales,
    min(decode(month,12,sales_dollars)) dec_sales
    from
    sales
    group by
    region,
    year

    The grouping is required to collapse twelve records for the region / year into a single record. Note that you can use any grouping function (AVG, MAX, SUM, for example) if all key columns other than MONTH are to be preserved in the flattened table. If this isn't so, you will have to use SUM as you will be summing over records for a fixed month. In either case you should be sure you understand what you're getting!

  3. "Unpivoting": This is the converse of the above. Here we want to move data from several columns into a single column, based on some criterion. Say we have a table called FLAT_SALES with columns REGION, YEAR, JAN_SALES,....DEC_SALES (produced by the above query). An ETL process requires us to get the data in normalised format, i.e. columns REGION, YEAR, MONTH, SALES_DOLLARS, as in the original SALES table. Here's a query that will achieve this:

    select
    t1.region region,
    t1.year year,
    t2.r month,
    decode(t2.r,
    1,jan_sales,
    2,feb_sales,
    3,mar_sales,
    4,apr_sales,
    5,may_sales,
    6,jun_sales,
    7,jul_sales,
    8,aug_sales,
    9,sep_sales,
    10,oct_sales,
    11,nov_sales,
    12,dec_sales) sales_dollars
    from
    flat_sales t1,
    (select
    rownum r
    from
    all_objects
    where
    rownum<=12) t2

    The table t2 serves to produce 12 rows for each original row via a cartesian product (there is no join condition). The query uses a decode on rownum to pick out the sales value for each month from the flattened table. Note that we could have used any table or view with 12 or more rows in the definition of t2. I chose ALL_OBJECTS as this view is generally available to all schemas, and contains a large number of rows.

  4. Using built-in functions: One can achieve fairly complex data transformations using Oracle's character SQL functions. The nice thing about these functions is that they can be embedded in SQL, so you get the full advantage of set-based operations. The usual suspects include: LENGTH, LOWER, LPAD, REPLACE, RPAD, SUBSTR, UPPER, (L)(R)TRIM. Some "lesser known" functions I have found useful in my ETL efforts are:
    • CHR Returns the character associated with the specified ASCII code. This is useful when you want to remove non-printable characters such as carriage return (CR) and linefeed (LF) from input strings. You would do this using CHR in tandem with REPLACE like so:
      REPLACE(input_string,CHR(13)||CHR(10)), 13 and 10 being the ASCII codes for CR and LF respectively.
    • INITCAP This capitalizes the first letter in each word of the input string. Every other letter is lower cased.
    • INSTR Returns the location of a specified string within the input string.
    All these functions have a range of invocation options. Check the Oracle SQL documentation for full details.

  5. Analytics: In version 8i, Oracle introduced analytical extensions to SQL. These enable one to do a range of procedural operations using SQL (yes, you read that right - procedural operations using SQL). Some operations possible using analytical SQL include: sub-grouping (different levels of grouping within a statement), ranking and other statistics over subsets of data, comparing values in different rows, and presenting summary and detailed data using a single SQL statement. A short example appears just after this list. Check out my article on analytic SQL to find out more.
Tom Kyte's site is a great resource for tips on converting procedural code to SQL - try searching on the keywords "ETL" and "analytic functions" for a start.
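
To give a flavour of the syntax, here is a small sketch against the SALES table described earlier (REGION, YEAR, MONTH, SALES_DOLLARS); it ranks each month within its region and year and works out each month's share of the yearly total:

--rank months within each region/year and compute each month's
--share of that year's sales
select
region,
year,
month,
sales_dollars,
rank() over (partition by region, year order by sales_dollars desc) month_rank,
round(100*sales_dollars/sum(sales_dollars) over (partition by region, year),1) pct_of_year
from
sales;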

OK, after extolling the virtues of using SQL, we have to admit that there are situations in which procedural code becomes unavoidable. In such cases you can still take advantage of set-based processing by using bulk binding within your procedural code. The Oracle documentation covers bulk binding in some detail, and introductory articles are available online (dbasupport.com, for example).
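
As a rough sketch of the idea - reusing the TEST table and the ALL_OBJECTS source from the earlier timing example - the row-by-row cursor loop can be rewritten to fetch and insert in batches:

declare
--collection types to hold one batch of rows
type t_id_tab is table of test.id%type;
type t_name_tab is table of test.name%type;
l_ids t_id_tab;
l_names t_name_tab;
cursor c_src is
select
object_id,
lower(object_name)
from
all_objects;
begin
open c_src;
loop
--fetch up to 1000 rows at a time instead of one by one
fetch c_src bulk collect into l_ids, l_names limit 1000;
exit when l_ids.count = 0;
--one context switch per batch rather than one per row
forall i in 1 .. l_ids.count
insert into test (id, name)
values (l_ids(i), l_names(i));
end loop;
close c_src;
commit;
end;
/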

Tip 3: TRUNCATE tables when deleting all data

Deleting all data from a table is best done using the truncate statement. A truncate runs much faster than a delete because it simply resets the table's high water mark rather than processing individual rows; indexes are truncated along with the table. Truncate is DDL and therefore cannot be rolled back - be sure you consider this before using it. Note also that tables referenced by enabled foreign key constraints cannot be truncated unless those constraints are disabled first. Here's a block of code that uses truncate to clean out the SALES table (which is referenced by enabled foreign key constraints):

declare

cursor c_referenced_by is
select
t1.constraint_name constraint_name,
t1.table_name table_name
from
user_constraints t1,
user_constraints t2
where
t1.constraint_type='R'
and
t1.r_constraint_name=t2.constraint_name
and
t2.table_name ='SALES';

begin

for r_referenced_by in c_referenced_by loop

execute immediate
'alter table '||r_referenced_by.table_name||' disable constraint '
||r_referenced_by.constraint_name;

end loop;

execute immediate
'truncate table sales';

end;

Tip 4: Use direct path inserts for loading data

A direct path insert offers a quick way to load large volumes of data into a table. It does this by bypassing the buffer cache and writing directly to datafiles. A direct path insert is done using the insert into ... select ... idiom, together with an append hint. The append hint causes the rows to be inserted above the table's high water mark - any free space below the high water mark is not used. Here is the syntax for a direct path insert into SALES from SALES_STAGE:

insert /*+ append*/ into
sales
(region,
year,
month,
sales_dollars)
select
region,
year,
month,
sales_dollars
from
sales_stage;

A direct path insert by itself is faster than a conventional insert. Just how much mileage one gets depends on the volume of data loaded. The performance of direct path inserts can be further enhanced by doing the following before the load:
  1. Bypassing constraint checks by disabling all constraints. Remember to re-enable these after the load is done.
  2. Suppressing redo generation by putting the table in nologging mode. If your table has a large number of non-unique indexes, you might also consider setting these to unusable state. This will suppress redo generation associated with index maintenance. The indexes can then be rebuilt after the load. Note that unique indexes should not be set unusable because the load will fail with an ORA-26026: unique index initially in unusable state error. Unique indexes can, however, be dropped before and recreated after the load. This is, in effect, what happens when the primary key constraint is disabled (as discussed in the previous point) - the unique index associated with the primary key is dropped.
    An important consequence of suppressing redo is that the operation is unrecoverable. Be sure to coordinate such operations with your DBA so that she can schedule a backup of the database or relevant tablespaces after the load.
Here's a block of code that bypasses constraint checks and suppresses redo generation for the above insert:

declare

cursor c_constraints is
select
table_name,
constraint_name
from
user_constraints
where
table_name='SALES';

cursor c_indexes is
select
index_name
from
user_indexes
where
table_name='SALES'
and
uniqueness<>'UNIQUE';

begin

for r_constraints in c_constraints loop

execute immediate
'alter table '||r_constraints.table_name||' disable constraint '
||r_constraints.constraint_name;

end loop;

--optional truncate to clean out the table
--disable referencing constraints, if needed (see tip 3)

--execute immediate
--'truncate table sales';

for r_indexes in c_indexes loop

execute immediate
'alter index '||r_indexes.index_name||' unusable';

end loop;

execute immediate
'alter session set skip_unusable_indexes=true';

execute immediate
'alter table sales nologging';

insert /*+ append*/ into
sales
(region,
year,
month,
sales_dollars)
select
region,
year,
month,
sales_dollars
from
sales_stage;

execute immediate
'alter session set skip_unusable_indexes=false';

for r_indexes in c_indexes loop

execute immediate
'alter index '||r_indexes.index_name||' rebuild nologging';

end loop;

execute immediate 'alter table sales logging';

for r_constraints in c_constraints loop

execute immediate
'alter table '||r_constraints.table_name||' enable constraint '
||r_constraints.constraint_name;

end loop;

commit;

end;

For small data volumes, the overhead of disabling constraints and indexes, and making the table nologging, will swamp the benefits gained. In general, the larger the load, the greater the benefit of the foregoing actions. As always, benchmark before implementing these techniques in your own loads.
The above code uses a lot of dynamic SQL so it can be heavy on database resources. The use of dynamic SQL is unavoidable because we need to perform the operations within a module (PL/SQL block, procedure or package). However, this should not cause performance problems for business users because ETL batch processes normally run during off-peak hours.

Tip 5: Use the MERGE command for upserts

A common requirement is the need to perform an "upsert" - i.e. update a row if it exists, insert it if it doesn't. In Oracle (versions 9i and better) this can be done in one step using the MERGE statement. Here's an example that uses the EMP table of the SCOTT schema. The example uses a table EMP_STAGE that holds updated and new records that are to be upserted into EMP. You need to be logged into the SCOTT schema (or any other schema with a copy of SCOTT.EMP) in order to run the example:

--create EMP_STAGE

create table emp_stage as select * from emp where 1=2;

--insert update records in EMP_STAGE

insert into emp_stage values
(7369,'SMITH','CLERK',7902,to_date('17-12-1980','dd-mm-yyyy'),1800,NULL,20);

insert into emp_stage values
(7499,'ALLEN','SALESMAN',7698,to_date('20-2-1981','dd-mm-yyyy'),2200,300,30);

insert into emp_stage values
(7521,'WARD','SALESMAN',7698,to_date('22-2-1981','dd-mm-yyyy'),1250,500,30);

insert into emp_stage values
(7839,'KING','PRESIDENT',NULL,to_date('17-11-1981','dd-mm-yyyy'),9500,NULL,10);

insert into emp_stage values
(7782,'CLARK','MANAGER',7839,to_date('9-6-1981','dd-mm-yyyy'),8500,NULL,10);

--insert new records in EMP_STAGE

insert into emp_stage values
(7940,'WEBSTER','DBA',7782,to_date('23-1-1985','dd-mm-yyyy'),7000,NULL,10);

insert into emp_stage values
(7945,'HAMILL','DEVELOPER',7782,to_date('21-5-1985','dd-mm-yyyy'),6000,NULL,10);

insert into emp_stage values
(7950,'PINCHON','ANALYST',7782,to_date('20-10-1985','dd-mm-yyyy'),6000,NULL,10);

commit;

--MERGE records into EMP

merge into
emp e
using
emp_stage es
on
(e.empno=es.empno)
when matched then
update
set
e.sal=es.sal
when not matched then
insert
(e.empno,
e.ename,
e.job,
e.mgr,
e.hiredate,
e.sal,
e.comm,
e.deptno)
values
(es.empno,
es.ename,
es.job,
es.mgr,
es.hiredate,
es.sal,
es.comm,
es.deptno);

commit;

Three rows are inserted and five existing rows are updated by the above.
In Oracle 10g one can also add conditional clauses to the insert and update portions of the merge statement. For example:

--MERGE records into EMP, except for DEVELOPERS

merge into
emp e
using
emp_stage es
on
(e.empno=es.empno)
when matched then
update
set
e.sal=es.sal
when not matched then
insert
(e.empno,
e.ename,
e.job,
e.mgr,
e.hiredate,
e.sal,
e.comm,
e.deptno)
values
(es.empno,
es.ename,
es.job,
es.mgr,
es.hiredate,
es.sal,
es.comm,
es.deptno)
where
es.job<>'DEVELOPER'
;

In this case the record for employee 7945 is not inserted.

The conditional clause can be in the update portion as well. In 10g it is also possible to delete rows from the destination table based on conditional criteria. Check the documentation for details.
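
As an illustrative sketch (the conditions are invented; adapt them to your own rules), a 10g merge can restrict the update and prune rows in the same pass:

--MERGE: update salaries except for the PRESIDENT; rows that are
--updated and end up with sal below 1500 are then deleted from EMP
merge into
emp e
using
emp_stage es
on
(e.empno=es.empno)
when matched then
update
set
e.sal=es.sal
where
e.job<>'PRESIDENT'
delete where
e.sal<1500;

commit;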

Tip 6: Use heterogeneous services to access data in non-Oracle relational databases

Database links are often used to transfer small volumes of data between Oracle databases. It is less well known that database links can also be set up between Oracle and non-Oracle databases. This is a useful feature, as ETL processes often need to access and transfer data to Oracle from third-party databases such as MS SQL Server. The standard way to do this is by exporting data from the non-Oracle database to a flat file, and then importing the data into Oracle via SQL Loader or external tables. Oracle Heterogeneous Services provides a single-step option to achieve the transfer. Heterogeneous Services come in two flavours:
  1. Generic Heterogeneous Services: This option, which is bundled with the Oracle Server, uses ODBC to connect to the non-Oracle database. Tutorials on accessing SQL Server using Generic Heterogeneous Services are widely available online.
  2. Transparent Gateways: These are extra-cost options that are optimised for specific databases. They offer better performance compared to the Generic option because they are designed to exploit optimisations and features specific to particular databases. The mechanics of setting up Transparent Gateways is quite similar to the Generic option. Check the documentation that comes with the specific Transparent Gateway for further details. The documentation is hidden away in the relevant gateway installation directory; for example, the documentation for the MS SQL Server gateway sits in ORACLE_HOME/tg4msql. Tutorials on accessing SQL Server using Transparent Gateways are also available online.
A warning on data transfer techniques that use database links: the efficiency of the transfer depends on your network bandwidth and the quantity of data to be transferred. I have used them (with great success) to transfer small to moderate quantities of data (~100000 rows, row size ~1000 bytes) between co-located machines within corporate networks.
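
Once the ODBC data source, the heterogeneous services init file and the listener entry are configured (that setup is beyond the scope of this sketch), the Oracle side reduces to an ordinary database link. The link name, credentials, TNS alias and remote object names below are purely illustrative:

--create a link that routes through heterogeneous services
create database link mssql_link
connect to "etl_user" identified by "etl_password"
using 'mssql_hs';

--pull rows from the SQL Server table straight into an Oracle staging table
insert into sales_stage
select "Region", "Year", "Month", "SalesDollars"
from "SalesExport"@mssql_link;

commit;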

Closing Remarks

ETL processes present a technical challenge as they entail complex transformations and loads of large quantities of data within ever-shrinking time windows. As developers we need to use all the tricks in the book to speed up our ETL procedures. In this article I've outlined some of the techniques that I have used, with success, in several projects. I hope you find them useful in your work.
