Abstract: Motivated by recent active research on online analytical processing (OLAP), we develop a computation- and storage-efficient algorithm for estimating equation (EE) estimation in massive data sets using a “divide-and-conquer” strategy. In each partition of the data set, we compress the raw data into low-dimensional statistics and then discard the raw data. We then obtain an approximation to the EE estimator, the aggregated EE (AEE) estimator, by solving an equation aggregated from the low-dimensional statistics saved in all partitions. These low-dimensional statistics are the EE estimates and the first-order derivatives of the estimating equations in each partition. We show that, under proper partitioning and some regularity conditions, the AEE estimator is strongly consistent and asymptotically equivalent to the EE estimator. A major application of the AEE technique is to support fast OLAP of EE estimations for data warehousing technologies such as data cubes and data streams. It can also be used to reduce computation time and overcome the memory constraints posed by massive data sets. Simulation studies show that the AEE estimator provides efficient storage and a remarkable reduction in computational time, especially in its applications to data cubes and data streams.
Keywords: massive data sets; estimating equation; data compression; aggregation; consistency; asymptotic normality; data cube
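
The compress-then-aggregate idea in the abstract can be illustrated with a minimal sketch for the special case of least squares, where the estimating equation is linear in the parameter. The sketch assumes that the aggregated equation reduces to a derivative-weighted combination of the partition estimates; this form, along with the function names `compress_partition` and `aggregate` and the simulated data, is an illustrative assumption and is not taken from the text above.

```python
import numpy as np

def compress_partition(X, y):
    """Reduce one partition to (beta_k, A_k) for least squares.

    A_k = X'X is the negative first-order derivative of the
    least-squares estimating equation sum_i x_i (y_i - x_i' beta).
    """
    A = X.T @ X
    beta = np.linalg.solve(A, X.T @ y)  # partition-level EE estimate
    return beta, A

def aggregate(stats):
    """Combine saved statistics (assumed form): solve
    (sum_k A_k) beta = sum_k A_k beta_k."""
    A_sum = sum(A for _, A in stats)
    rhs = sum(A @ beta for beta, A in stats)
    return np.linalg.solve(A_sum, rhs)

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, p, K = 10_000, 3, 10
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# Each partition is compressed and its raw rows are no longer needed.
stats = [
    compress_partition(Xk, yk)
    for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K))
]

beta_aee = aggregate(stats)
beta_full = np.linalg.solve(X.T @ X, X.T @ y)  # estimate from all raw data
```

In this linear case the aggregated estimate coincides with the full-data estimate up to floating-point error, since the estimating equation is linear in the parameter. For nonlinear estimating equations the two would generally differ, and the abstract's claim is asymptotic equivalence under suitable partitioning. Each partition stores only a length-p vector and a p-by-p matrix.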