出版社:Electronics and Telecommunications Research Institute
摘要:Human–object interaction (HOI) detection is a popular computer vision task that detects interactions between humans and objects. This task can be useful in many applications that require a deeper understanding of semantic scenes. Current HOI detection networks typically consist of a feature extractor followed by detection layers comprising small filters (eg, 1 × 1 or 3 × 3). Although small filters can capture local spatial features with a few parameters, they fail to capture larger context information relevant for recognizing interactions between humans and distant objects owing to their small receptive regions. Hence, we herein propose a three‐stream HOI detection network that employs a context convolution module (CCM) in each stream branch. The CCM can capture larger contexts from input feature maps by adopting combinations of large separable convolution layers and residual‐based convolution layers without increasing the number of parameters by using fewer large separable filters. We evaluate our HOI detection method using two benchmark datasets, V‐COCO and HICO‐DET, and demonstrate its state‐of‐the‐art performance.