Abstract: This paper explores how to improve the bag-of-words (BOW) model for human action recognition in real-world environments. Traditional codebook learning relies solely on appearance-based local features, so the spatial and temporal correlations among local features are ignored. This leads to considerable mismatch between sample vectors and the noisy visual words that result from background clutter. To improve the performance of BOW modeling in realistic settings, we propose a novel action modeling approach. First, two-level feature selection is applied as a preprocessing step before codebook learning to remove noisy features, yielding descriptive and discriminative features. Then, spatial-temporal pyramid matching (STPM) is employed in the feature coding process, so that human actions are modeled using not only the appearance similarity between local features but also the relationships among features in space and time. We validate our approach on several benchmark datasets, and experimental results show that it significantly outperforms K-means clustering on more challenging datasets such as the KTH, UCF Sports, and YouTube datasets.
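To make the STPM coding idea concrete, the following is a minimal sketch of a spatio-temporal pyramid histogram, not the authors' implementation. It assumes each local feature has already been assigned a visual-word index and has (x, y, t) coordinates normalized to [0, 1]. The level weights follow the standard spatial pyramid matching scheme extended to the time axis, which is an assumption here; the names `stpm_histogram` and `intersection_kernel` are hypothetical.

```python
import numpy as np

def stpm_histogram(positions, words, vocab_size, levels=2):
    """Concatenate weighted visual-word histograms over a space-time pyramid.

    positions:  (N, 3) array of (x, y, t), each normalized to [0, 1] (assumed).
    words:      (N,) integer visual-word index per feature.
    vocab_size: number of visual words in the codebook.
    levels:     finest pyramid level L; level l splits each axis into 2**l cells.
    """
    positions = np.asarray(positions, dtype=float).reshape(-1, 3)
    words = np.asarray(words, dtype=int)
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Cell index along each axis; clip so a coordinate of 1.0 lands in the last cell.
        idx = np.minimum((positions * cells).astype(int), cells - 1)
        cell_id = (idx[:, 0] * cells + idx[:, 1]) * cells + idx[:, 2]
        # One histogram bin per (cell, visual word) pair.
        flat = cell_id * vocab_size + words
        hist = np.bincount(flat, minlength=cells ** 3 * vocab_size).astype(float)
        # Assumed weighting: coarse levels count less than fine levels.
        if level == 0:
            weight = 1.0 / 2 ** levels
        else:
            weight = 1.0 / 2 ** (levels - level + 1)
        parts.append(weight * hist)
    vec = np.concatenate(parts)
    # Normalize by the number of features so videos of different length are comparable.
    return vec / len(words) if len(words) else vec

def intersection_kernel(a, b):
    """Histogram intersection: the sum of bin-wise minima."""
    return float(np.minimum(a, b).sum())

# Two toy videos with the same visual words but different space-time layouts.
h1 = stpm_histogram([[0.1, 0.1, 0.1], [0.9, 0.9, 0.9]], [0, 1], vocab_size=2, levels=1)
h2 = stpm_histogram([[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]], [0, 1], vocab_size=2, levels=1)
print(intersection_kernel(h1, h1), intersection_kernel(h1, h2))
```

In this toy case the two videos contain identical visual words, so a plain BOW histogram could not tell them apart, whereas the pyramid similarity drops when the same words occur at different space-time locations.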