I have been working on a small project involving thoughts about cloud computing, high-performance programming models, and transient applications such as the T (transformation) part of ETL. By “transient” I mean that the operation is not an ongoing operational activity, but rather a batch process that is executed when needed. That being said, if the extracted data sets have to be subjected to a lot of modifications (parsing, standardizing, normalizing, aggregating, reducing, summarizing, etc.) and that takes a long time on a single server, would it not make sense to speed up the execution by employing multiple processors? This certainly makes sense if the operations are largely independent, and much of the T is. For example, parsing out the tokens in name strings can be done in parallel, as can standardization and normalization.
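To make the name-parsing example concrete, here is a minimal sketch of fanning a transformation out across multiple processors using Python's `multiprocessing` module. The `parse_name` function is a hypothetical, deliberately simplified stand-in for a real parsing/standardization step:

```python
from multiprocessing import Pool

def parse_name(raw):
    """Hypothetical transform: split a raw name string into
    standardized (uppercased, punctuation-stripped) tokens."""
    tokens = raw.strip().split()
    return [t.strip(".,").upper() for t in tokens]

if __name__ == "__main__":
    names = ["Smith, John Q.", "de la Cruz, Maria", "O'Brien, Patrick"]
    # Each name string is independent, so the records can be
    # parsed in parallel across worker processes.
    with Pool(processes=4) as pool:
        parsed = pool.map(parse_name, names)
    print(parsed)
```

Because each record is independent, this scales out naturally; the same pattern applies to standardization and normalization steps.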
The programming is eminently doable, so the challenge is then the E and the L. The time to extract the data is going to be the same, as is the time for loading, but the time to shove your data into a cloud and pull it out again puts a damper on the runtime. That forces you to consider how to interleave the execution of the transformations with the forwarding of the results back to the target data set. If the target is an analytical database that also lives in the cloud, that might ease the challenge.
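One way to interleave transformation with loading is a producer/consumer pipeline: transformed records are handed to a loader thread as they are produced, rather than waiting for the whole batch to finish. A minimal sketch, in which `transform` and the "load" step are hypothetical placeholders:

```python
import queue
import threading

def transform(record):
    # Placeholder transformation: normalize whitespace and case.
    return record.strip().lower()

def loader(q, loaded):
    # Consume transformed records as they arrive; in a real
    # pipeline this would write to the target data set.
    while True:
        item = q.get()
        if item is None:  # sentinel: no more records
            break
        loaded.append(item)

def run_pipeline(records):
    # A bounded queue keeps the transformer from running
    # arbitrarily far ahead of the loader.
    q = queue.Queue(maxsize=100)
    loaded = []
    t = threading.Thread(target=loader, args=(q, loaded))
    t.start()
    for r in records:
        q.put(transform(r))  # transformation overlaps with loading
    q.put(None)
    t.join()
    return loaded
```

The same shape works with processes instead of threads, or with the loader streaming results into a cloud-resident analytical database so the transformed data never has to make a round trip back out.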
Perhaps these are some interesting ideas that need a little more thought, so if you have any experiences, let me know!