25 Aug 2010

Database vs ETL processing - a comparison

Source data

The POINT table/file holds 100 records similar to this. Point ranges from 1000 to 5000.

CustNo
Point
1054
2781
1055
3050
1056
3714
1057
3385
1058
3910
1059
3041
1060
3301
1061
4615
1062
4300
1063
5126

The CUSTOMER table/file holds 100 records similar to shown below.

CustNo
Name
Address
Town
Postcode
Sex
1054
Kirby Roman
390-4028 Vitae Avenue
Twin Falls
QO8 9DS
F
1055
Brianna Kaufman
Ap #414-8109 Nunc St.
Aberdeen
MD01 6PY
F
1056
Quon Maldonado
1214 Mi St.
Nevada City
EJ7 3BW
F
1057
Sarah Burke
897-4509 Integer Rd.
DuBois
J2 0ZW
F
1058
Alice Pickett
3730 Sed Street
Grambling
C6G 8YI
F
1059
Odette Richard
P.O. Box 920, 6838 Felis Av.
Oklahoma City
O5 0QM
F
1060
Ursa Nunez
310 Euismod Street
Pierre
P30 8FZ
F
1061
Caryn Walker
Ap #969-5153 Ipsum. Rd.
Idaho Falls
GF1 2VR
F
1062
Dai Lawrence
907-2556 Suspendisse Rd.
Pass Christian
B3R 2JV
F
1063
Claudia Calhoun
P.O. Box 932, 3401 Cras Avenue
Poughkeepsie
XU8 7LC
F

Requirement

1. Select only customers with points > 1000
2. Join with customer details on CustNo to populate customer details with points
3. Based on customer gender, create salutation field in output as Mr/Ms First name Surname
4. Create 3 feeds named Gold, Silver & Bronze based on following logic
• Gold - point > 4001
• Silver - point between 2001 and 4000
• Bronze - point between 1001 and 2000

All output feeds will have following fields:

CustNo
Salutation
Address
City PostCode
Point
Category (Gold/Silver/Bronze)

The output will look like this.

CustNo
Name
Address
Town Postcode
Point
Category
1073
Ms Cassidy Stokes
9679 Malesuada Ave
Sedalia BX4G 0HV
4120
Gold
1074
Ms Germane Olson
P.O. Box 631, 2307 Lobortis Road
Darlington N1H 3IM
5402
Gold
1044
Mr Emery Andrews
6721 Quis Av.
Needham X3Z 9SJ
3158
Silver
1045
Mr Leonard Logan
P.O. Box 554, 591 Massa. Road
Minnetonka LF3F 6ON
2569
Silver
1030
Mr Stewart Sanford
Ap #246-7429 Vitae Av.
Gadsden V26 6GZ
1522
Bronze

NB. There will actually be 3 output feeds - for gold, silver and bronze customers.

ETL Solution

The Clover ETL solution will be like shown in next figure.




Database Solution

The solution shown here uses Oracle database.

There will be a table named PROCESS_FLOW which will control the overall flow as explained further.

The sample contents of this table will be like this for the particular example we are discussing here.


STEP
SOURCE_VIEW
TARGET_TABLE
ACTION
DESCRIPTION
10
CUSTOMER_FILTER
CUSTOMER_FILTER_TABLE
TRUNCATE INSERT
selecting customers with point > 1000
20
CUSTOMER_JOIN
CUSTOMER_JOIN_TABLE
TRUNCATE INSERT
join above step with customer on CustNo
30
CUSTOMER_GOLD
CUSTOMER_GOLD_TABLE
TRUNCATE INSERT
select category = "GOLD" from above output



CREATE OR REPLACE VIEW CUSTOMER_FILTER AS SELECT * FROM POINT WHERE POINT > 1000;
CREATE TABLE CUSTOMER_FILTER_TABLE AS SELECT * FROM CUSTOMER_FILTER WHERE 1=2;
INSERT INTO CUSTOMER_FILTER_TABLE SELECT * FROM CUSTOMER_FILTER;
COMMIT;

CREATE OR REPLACE VIEW CUSTOMER_JOIN AS
SELECT P.CUSTNO, C.CUSTNAME, C.ADDRESS, C.TOWN, C.POSTCODE, P.POINT, C.SEX
,CASE WHEN P.POINT > 4000 THEN 'GOLD' WHEN (P.POINT > 2000 AND P.POINT <= 4000) THEN 'SILVER' WHEN (P.POINT > 1000 AND P.POINT <= 2000) THEN 'BRONZE' END CATEGORY FROM CUSTOMER_FILTER_TABLE P JOIN CUSTOMER C ON P.CUSTNO = C.CUSTNO; INSERT INTO CUSTOMER_JOIN_TABLE SELECT CJ.CUSTNO, DECODE(SEX,'M','Mr','F','Ms','??') || ' ' || CJ.CUSTNAME , CJ.ADDRESS, CJ.TOWN || ' ' || CJ.POSTCODE , NULL,CJ.POINT, CJ.SEX,CJ.CATEGORY FROM CUSTOMER_JOIN CJ; COMMIT; SELECT * FROM CUSTOMER_JOIN_TABLE; -- POSTCODE WILL BE NULL AS WE COMBINED THAT WITH TOWN IN PREVIOUS FIELD CREATE OR REPLACE VIEW CUSTOMER_GOLD AS SELECT * FROM CUSTOMER_JOIN_TABLE WHERE CATEGORY = 'GOLD'; CREATE TABLE CUSTOMER_GOLD_TABLE AS SELECT * FROM CUSTOMER_GOLD WHERE 1=2; INSERT INTO CUSTOMER_GOLD_TABLE SELECT * FROM CUSTOMER_GOLD; COMMIT; SELECT * FROM CUSTOMER_GOLD; -- list of GOLD customers


So, we can see how involved database processing steps are!

Difference between DB & ETL processing

Topic
DB
ETL
Parallel processing
DB process is sequential.
ETL process is parallel.
Phasing
DB process each step in a different phase.
Independent ETL components may be in same phase.
I/O
DB process lands data in disk (= tables).
ETL process can run in pipeline manner, without landing data to disk thus reducing I/O.
Intuitiveness
DB process is not intuitive. It makes maintaining them very difficult.
ETL process is a diagram. Very easy to understand and modify.
Platform specific
DB process is DB specific.
ETL process is generic. Heterogeneous environments can implement ETL processes in similar fashion.
Debugging
Debugging in DB is very cumbersome.
Debugging on ETL is often visual - so easy to spot problem.

Conclusion

An ETL process is much easier to maintain than DB specific process.

No comments:

Post a Comment