Journal: International Journal of Applied Mathematics and Computer Science
Electronic ISSN: 2083-8492
Publication year: 2019
Volume: 29
Issue: 1
Pages: 1-11
DOI: 10.2478/amcs-2019-0005
Publisher: De Gruyter Open
Abstract: Today’s ETL tools provide capabilities to develop custom code as user-defined functions (UDFs) to extend the expressiveness
of the standard ETL operators. However, while this allows us to easily add new functionalities, it also comes with the
risk that the custom code is not intended to be optimized, e.g., by parallelism, and for this reason, it performs poorly for
data-intensive ETL workflows. In this paper we present a novel framework, which allows the ETL developer to choose a
design pattern in order to write parallelizable code and generates a configuration for the UDFs to be executed in a distributed
environment. This enables ETL developers with minimum expertise in distributed and parallel computing to develop UDFs
without taking care of parallelization configurations and complexities. We perform experiments on large-scale datasets
based on TPC-DS and BigBench. The results show that our approach significantly reduces the effort of ETL developers
and at the same time generates efficient parallel configurations to support complex and data-intensive ETL tasks.