Thesis or Dissertation To Better Exploration of Action Recognition in Videos

Hang Nga, Do

pp.1 - 113 , 2015-03-25 , The University of Electro-Communications
Our overall purpose in this dissertation is the automatic construction of a large-scale action database from Web data, which could support further exploration of action recognition. We conducted large-scale experiments on 100 human actions and 12 non-human actions and obtained promising results. This dissertation consists of six chapters. In the following, we briefly introduce the content of each chapter.

In Chapter 1, recent approaches to action recognition are described, together with the necessity of building a large-scale action database and its difficulties. Our work to solve the problem is then concisely explained.

In Chapter 2, the first work, which introduces a framework for automatically extracting relevant video shots of specific actions from Web videos, is described in detail. This framework first selects relevant videos for a given action from among thousands of Web videos using tag co-occurrence, and then divides the selected videos into video shots. The video shots are then ranked based on their visual linkage. The top-ranked video shots are expected to be the shots most related to the action. Moreover, our method of adopting Web images for shot ranking is also introduced. Finally, large-scale experiments on 100 human actions and 12 non-human actions and their results are described.

In Chapter 3, the second work, which aims to further improve the shot ranking of the above framework by proposing a novel ranking method, is introduced. Our proposed ranking method, called VisualTextualRank, is an extension of a conventional method, VisualRank, which is applied to shot ranking in Chapter 2. VisualTextualRank effectively employs both textual information and visual information extracted from the data. Our experimental results showed that using our method instead of the conventional ranking method could obtain more relevant shots.

In Chapter 4, the third work, which aims to obtain more informative and representative features of videos, is described. Based on a conventional method of extracting spatio-temporal features, which was adopted in Chapter 2 and Chapter 3, we propose to extract spatio-temporal features with triangulation of dense SURF keypoints. Shape features of the triangles, along with visual features and motion features of their points, are taken into account to form our features. By applying our method of feature extraction to the framework introduced in Chapter 2, we show that more relevant video shots can be retrieved at the top. Furthermore, the effectiveness of our method is also validated on action classification for UCF-101 and UCF-50, which are well-known large-scale data sets. The experimental results demonstrate that our features are comparable and complementary to the state of the art.

In Chapter 5, the final work, which focuses on recognition of hand-motion-based actions, is introduced. We propose a system of hand detection and tracking for unconstrained videos and extract hand-movement-based features from the detected and tracked hand regions. These features are expected to help improve results for hand-motion-based actions. To evaluate the performance of our system on hand detection, we use the Video-Pose2.0 dataset, which is a challenging dataset with uncontrolled videos. To validate the effectiveness of our features, we conduct experiments on fine-grained action recognition with the "playing instruments" group in the UCF-101 data set. The experimental results show the efficiency of our system.

In Chapter 6, our works, with their major points and findings, are summarized. We also consider the potential of applying the results obtained by our works to further research.
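As an illustration of the shot-ranking step summarized for Chapter 2, the following is a minimal sketch of a VisualRank-style ranking: a damped power iteration over a column-normalized shot similarity matrix. It is not the dissertation's implementation. The function name `visual_rank`, the `bias` parameter, the damping value 0.85, and the toy similarity values are all assumptions made for the example.

```python
def visual_rank(similarity, damping=0.85, bias=None, iterations=100, tol=1e-9):
    """Rank items by a damped random walk over a visual similarity graph.

    similarity: square list of lists; similarity[i][j] >= 0 is the
                (hypothetical) visual similarity between shots i and j.
    bias:       optional prior over shots summing to 1 (uniform if None).
    """
    n = len(similarity)
    if bias is None:
        bias = [1.0 / n] * n

    # Column-normalize so that each column sums to 1.
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    norm = [
        [similarity[i][j] / col_sums[j] if col_sums[j] > 0 else 1.0 / n
         for j in range(n)]
        for i in range(n)
    ]

    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [
            damping * sum(norm[i][j] * rank[j] for j in range(n))
            + (1.0 - damping) * bias[i]
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:  # stop once the scores have stabilized
            break
    return rank


# Toy data (made up): shots 0-2 look alike, shot 3 is an outlier.
sim = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.85, 0.05],
    [0.8, 0.85, 0.0, 0.1],
    [0.1, 0.05, 0.1, 0.0],
]
scores = visual_rank(sim)
order = sorted(range(len(scores)), key=lambda i: -scores[i])
```

In this sketch, shots that are visually linked to many other shots accumulate higher scores, so the outlier ends up at the bottom of `order`. A non-uniform `bias` would be one way to inject prior knowledge about which shots are likely relevant.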
