Abstract: The Gene Expression Omnibus (GEO) database is an excellent public source of whole-transcriptome profiles of multiple cancers. The main challenge is that such large-scale genomic data are hard to access for people without a background in bioinformatics or computer science, which complicates data analysis, sharing and visualization. Here, we present an integrated bioinformatics pipeline and a normalized dataset that has been preprocessed using a robust statistical methodology, allowing others to perform large-scale meta-analysis without time-consuming data mining and statistical correction. Comprising 1,118 patient-derived samples, the normalized dataset includes primary non-small cell lung cancer (NSCLC) tumors and paired normal lung tissues from ten independent GEO datasets, facilitating differential expression analysis. The data have been merged, normalized, batch effect-corrected and filtered to remove genes with low variance using multiple open-source R packages integrated into our workflow. Overall, this dataset (with associated clinical metadata) better represents the diseased population and serves as a powerful tool for early predictive biomarker discovery.
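To make the merge, batch-correction and variance-filtering steps concrete, the following is a minimal sketch on toy data. It is not the pipeline described above, which relies on R packages; it is written in standard-library Python purely for illustration. The gene names, expression values, batch labels and the variance threshold are all hypothetical, and per-batch mean-centering is used here only as a crude stand-in for a proper batch-effect correction method (it can also remove real biological signal when batches differ in composition).

```python
from statistics import mean, pvariance

def merge_on_shared_genes(datasets):
    """Concatenate samples across datasets, keeping only genes present in all."""
    shared = set.intersection(*(set(d) for d in datasets.values()))
    merged = {g: [] for g in sorted(shared)}
    batches = []
    for name, data in datasets.items():
        n_samples = len(next(iter(data.values())))
        batches.extend([name] * n_samples)  # one batch label per sample
        for g in merged:
            merged[g].extend(data[g])
    return merged, batches

def center_by_batch(expr, batches):
    """Subtract each batch's per-gene mean (simplified stand-in, see lead-in)."""
    corrected = {}
    for gene, values in expr.items():
        batch_means = {
            b: mean(v for v, lab in zip(values, batches) if lab == b)
            for b in set(batches)
        }
        corrected[gene] = [v - batch_means[lab] for v, lab in zip(values, batches)]
    return corrected

def drop_low_variance(expr, min_variance):
    """Keep only genes whose variance across all samples reaches the threshold."""
    return {g: v for g, v in expr.items() if pvariance(v) >= min_variance}

# Hypothetical toy input: two "studies" with two samples each.
datasets = {
    "study_A": {"GENE1": [5.0, 7.0], "GENE2": [2.0, 2.0], "GENE3": [1.0, 1.0]},
    "study_B": {"GENE1": [9.0, 13.0], "GENE2": [4.0, 4.0]},
}

merged, batches = merge_on_shared_genes(datasets)   # GENE3 is dropped (not shared)
corrected = center_by_batch(merged, batches)
filtered = drop_low_variance(corrected, min_variance=0.1)  # hypothetical threshold
```

In this toy case GENE2 differs between the two studies only by a constant offset, so after centering it has zero variance and is filtered out, while GENE1 retains its within-study variation.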