文章基本信息

标题：Integrated Querying of SQL database data and S3 data in Amazon Redshift
本地全文：下载
作者：Mengchu Cai ; Martin Grund ; Anurag Gupta 等
期刊名称：Bulletin of the Technical Committee on Data Engineering
出版年度：2018
卷号：41
期号：2
页码：82-90
出版社：IEEE Computer Society
摘要：Amazon Redshift features integrated, in-place access to data residing in a relational database and to data residing in the Amazon S3 object storage. In this paper we discuss associated query planning and processing aspects. Redshift plays the role of the integration query processor, in addition to the usual processor of queries over Redshift tables. In particular, during query execution, every compute node of a Redshift cluster issues (sub)queries over S3 objects, employing a novel multi-tenant (sub)query execu- tion layer, called Amazon Redshift Spectrum, and merges/joins the results in an streaming and parallel fashion. The Spectrum layer offers massive scalability, with independent scaling of storage and compu- tation. Redshifts optimizer determines how to minimize the amount of data scanned by Spectrum, the amount of data communicated to Redshift and the number of Spectrum nodes to be used. In particular, Redshifts query processor dynamically prunes partitions and pushes subqueries to Spectrum, recogniz- ing which objects are relevant and restricting the subqueries to a subset of SQL that is amenable to Spectrums massively scalable processing. Furthermore, Redshift employs novel dynamic optimization techniques in order to formulate the subqueries. One such technique is a variant of semijoin reduction, which is combined in Redshift with join-aggregation query rewritings. Other optimizations and rewrit- ings relate to the memory footprint of query processing in Spectrum, and the SQL functionality that it is being supported by the Spectrum layer. The users of Redshift use the same SQL syntax to access scalar Redshift and external tables..