Cloud Computing and ETL
Nov 03, 2009 by David Loshin in Customer Data Integration, Data Enrichment, Data Integration, Data Migration
I have been working on a small project involving thoughts about cloud computing, high-performance programming models, and transient applications such as the T (transformation) part of ETL. What I mean by “transient” is that the operation is not an ongoing operational activity, but is basically a batch process that is executed when needed. That being said, if the data sets being extracted have to be subjected to a lot of modifications (parsing, standardization, normalization, aggregation, reduction, summarization, etc.) and that takes a long time on a single server, would it not make sense to attempt to speed up the execution by employing multiple processors? This certainly makes sense if the operations are largely independent, and a lot of the T is. For example, parsing out the tokens in name strings can be done in parallel, as can standardization and normalization.
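To make the name-parsing example concrete, here is a minimal sketch in Python of what "independent" buys you: each name string is tokenized and standardized on its own, so the records can simply be mapped across a pool of worker processes. The function names (parse_name, transform_parallel) and the toy standardization rules are my own assumptions for illustration, not part of any particular ETL tool.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parse_name(raw):
    """Tokenize one name string and standardize its casing.

    The rules here (split on commas/whitespace, drop trailing periods,
    capitalize) are illustrative assumptions, not a real standard.
    """
    tokens = raw.replace(",", " ").split()
    tokens = [t.strip(".").capitalize() for t in tokens]
    return {"raw": raw, "tokens": tokens}


def transform_parallel(records, workers=4, executor_cls=ProcessPoolExecutor,
                       chunksize=1000):
    """Apply parse_name to every record using a pool of workers.

    Each record is independent of the others, so no coordination is
    needed beyond handing out chunks; map() preserves input order.
    """
    with executor_cls(max_workers=workers) as pool:
        # chunksize batches records per task for process pools;
        # thread pools accept the argument but ignore it.
        return list(pool.map(parse_name, records, chunksize=chunksize))
```

When run as a script with the default process pool, the call to transform_parallel should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly. Passing executor_cls=ThreadPoolExecutor is handy for quick experiments, though threads will not speed up CPU-bound parsing in standard Python.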
The programming is eminently doable, so the challenge is then the E and the L. The time to extract the data is going to be the same, as is the time for loading, but the time to shove your data into a cloud and pull it out again puts a damper on the runtime. That forces you to consider how to interleave the execution of the transformations with the forwarding of the results back to the target data set. If the target is an analytical database that also lives in the cloud, that might ease the challenge.
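One way to picture the interleaving is to cut the extracted data into chunks, transform the chunks concurrently, and forward each transformed chunk to the target as soon as it is ready rather than waiting for the whole batch. The sketch below is only a rough illustration under my own assumptions: local threads stand in for cloud workers, transform is a trivial placeholder, and load is a hypothetical callback representing whatever writes to the target data set.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def transform(batch):
    """Placeholder transformation: trim whitespace and upper-case each row."""
    return [row.strip().upper() for row in batch]


def run_pipeline(rows, load, chunk_size=2, workers=4):
    """Transform chunks concurrently and load each one as it completes.

    `load` is a hypothetical callback taking one transformed chunk.
    Returns the number of rows handed to `load`.
    """
    loaded = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # For simplicity all chunks are submitted up front; a real
        # pipeline would likely bound the number of chunks in flight.
        futures = [pool.submit(transform, b) for b in chunks(rows, chunk_size)]
        for fut in as_completed(futures):
            result = fut.result()
            load(result)  # forwarding overlaps with the remaining transforms
            loaded += len(result)
    return loaded
```

Because chunks are loaded in completion order, the target receives them in no guaranteed order, which is fine for set-oriented loads but would need a sequence key if order matters.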
Perhaps some interesting ideas that probably need a little more thought, so if you have any experiences, let me know!





mahfoud bala
Jun 29, 2012
Hi,
MapReduce tasks work on text files. To perform Extract tasks in the cloud with the MapReduce paradigm, one would need to develop MapReaders that read from source formats such as relational databases (MySQL, Oracle, PostgreSQL, ...), XML documents, Excel files, ... and then transform these source tuples into text format to load them into the DSA store (NFS or DFS on the Master Node).
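As a rough illustration of the reader idea described above, the following Python sketch turns the tuples returned by a query into delimited text lines that could then be written to a staging file. It uses the standard library's sqlite3 module purely as a stand-in for MySQL, Oracle, or PostgreSQL, and the function name rows_as_text is a hypothetical choice, not an existing MapReduce API.

```python
import sqlite3


def rows_as_text(conn, query, sep="\t"):
    """Yield one delimited text line per tuple returned by `query`.

    NULLs become empty fields. Values containing the separator are not
    escaped here; a real reader would need to handle that.
    """
    for row in conn.execute(query):
        yield sep.join("" if value is None else str(value) for value in row)
```

Each yielded line could be appended to a text file on the staging store, which is the form the comment says the MapReduce tasks expect.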