There are multiple cues in an image which reveal what action a person is
performing. For example, a jogger has a pose that is characteristic of the
action, but the scene (e.g. road, trail) and the presence of other joggers can
be an additional source of information. In this work, we exploit the simple
observation that actions are accompanied by contextual cues to build a strong
action recognition system. We adapt RCNN to use more than one region for
classification while still maintaining the ability to localize the action. We
call our system R*CNN. The action-specific models and the feature maps are
trained jointly, allowing for action specific representations to emerge. R*CNN
achieves 89% mean AP on the PASCAL VOC Action dataset, outperforming all other
approaches in the field by a significant margin.
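
The scoring rule the abstract describes, a primary (person) region combined via a max with the most informative secondary context region, can be sketched as follows. This is a minimal illustration under stated assumptions: the names (rstar_cnn_score, w_primary, w_context) are made up, and NumPy vectors stand in for the jointly trained CNN feature maps; it is not the authors' implementation.

```python
import numpy as np

def rstar_cnn_score(primary_feat, secondary_feats, w_primary, w_context):
    """Sketch of scoring one action class for one person instance.

    primary_feat:    feature vector of the person (primary) region
    secondary_feats: feature vectors of candidate context regions
    w_primary, w_context: per-action weight vectors (learned jointly
                          with the features in the actual system)
    """
    primary_score = w_primary @ primary_feat
    # The max lets each action pick whichever context region is most
    # informative (e.g. the road for jogging, or other joggers nearby).
    context_score = max(w_context @ f for f in secondary_feats)
    return primary_score + context_score

# Illustrative usage with random stand-in features.
rng = np.random.default_rng(0)
person = rng.standard_normal(4096)
contexts = [rng.standard_normal(4096) for _ in range(10)]
w_p, w_c = rng.standard_normal(4096), rng.standard_normal(4096)
print(rstar_cnn_score(person, contexts, w_p, w_c))
```

Because the max is taken per action, each action-specific model is free to attend to a different kind of context, which is consistent with the abstract's point that action-specific representations emerge from joint training.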