Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2015
Volume: 80
Issue: 2
Publisher: Journal of Theoretical and Applied
Abstract: Big Data is a popular term for the processing of very large data sets; it brings many opportunities to academia, industry, and society. Big data holds great promise for discovering patterns and heterogeneities that cannot be found with small data. It also poses unique computational and statistical challenges, including scalability and storage. Among these challenges are noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. Most of the problems stem from the size of the data combined with a large number of attributes. Irrelevant attributes add noise to the data and increase the size of the model. Moreover, datasets with many attributes may contain groups of attributes that are correlated, and all the attributes in such a group may be measuring the same feature. One way of dealing with this problem is to eliminate attributes (dimensions) that do not exhibit large variance and hence do not affect the clusters. Several techniques exist for discarding such attributes or dimensions, such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). We review these techniques in this paper with respect to clustering. We plan to use principal component analysis and kernel methods for dimensionality reduction, which is an essential preprocessing technique for large-scale data sets. It can be used to improve both the efficiency and effectiveness of classifiers.
Keywords: Big Data; Dimensionality Reduction; Feature Extraction; Fuzzy; Term Data
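
The idea in the abstract of dropping low-variance, correlated dimensions can be illustrated with a minimal PCA sketch. This is not the paper's implementation: the toy data set, the function names (`center`, `covariance`, `top_component`, `project`), and the use of power iteration to find the leading principal component are all assumptions made for illustration. In practice a library routine (for example an SVD-based PCA) would normally be used instead.

```python
import math

def center(rows):
    """Subtract the per-attribute mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (d x d) of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the variance along v.
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cv[i] for i in range(d))
    return v, eigenvalue

def project(centered, v):
    """Coordinates of each row along direction v (one dimension per row)."""
    return [sum(r[j] * v[j] for j in range(len(v))) for r in centered]

# Hypothetical toy data: attribute 2 duplicates attribute 1 (scaled by 2),
# and attribute 3 is near-constant noise with very small variance.
rows = [[float(t), 2.0 * t, 0.01 * (-1) ** t] for t in range(10)]

centered = center(rows)
cov = covariance(centered)
component, variance_along = top_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance_along / total_variance  # share of variance retained
scores = project(centered, component)        # 3 attributes reduced to 1

print("component:", [round(x, 4) for x in component])
print("explained variance ratio:", round(explained, 6))
```

Because two of the three attributes measure the same underlying quantity and the third barely varies, a single component captures almost all of the variance, so the three-attribute rows can be replaced by the one-dimensional `scores` before clustering or classification.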