The amount of video data on the Internet has been increasing rapidly. These videos vary widely in content and are, in most cases, of low quality, so robust techniques for video indexing are in strong demand. In automatic semantic video indexing, a user submits a textual query for a desired object or scene to a search system, which returns the video shots that contain that object or scene. Many techniques developed in speech research have been successfully applied to this task. For example, a method that combines Gaussian mixture model (GMM) supervectors with support vector machines (SVMs) was recently shown to be very effective. In this method, speech technologies such as speaker verification and speaker adaptation play very important roles. In this lecture, we first introduce the activities of the NIST TRECVID workshop, which showcases state-of-the-art video search technologies. We then discuss several techniques for achieving robustness against the variety of Internet video: SIFT and HOG features, bag of visual words, Fisher kernels, multimodal frameworks, and fast tree search.
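As a rough illustration of the GMM supervector idea mentioned above, the following Python sketch adapts the means of a diagonal-covariance background GMM to the local features of one video shot using MAP adaptation, as in speaker adaptation, and stacks the adapted means into a single vector that could then be passed to an SVM. The function name `gmm_supervector`, the relevance factor `tau`, and the scaling by mixture weight and standard deviation are assumptions borrowed from common speaker-verification practice; they are not necessarily the exact formulation presented in the lecture, and the SVM training step is omitted.

```python
import numpy as np

def gmm_supervector(features, weights, means, variances, tau=20.0):
    """Hypothetical helper: MAP-adapt GMM means to one shot's features.

    features:  (T, D) local descriptors extracted from one video shot
    weights:   (K,)   mixture weights of the background GMM
    means:     (K, D) component means of the background GMM
    variances: (K, D) diagonal covariances of the background GMM
    tau:       relevance factor (assumed value, controls adaptation strength)
    """
    # Log-density of every feature under every diagonal Gaussian: shape (T, K).
    diff = features[:, None, :] - means[None, :, :]
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))

    # Posterior probability of each component given each feature.
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth- and first-order statistics of the shot.
    n_k = post.sum(axis=0)            # (K,)
    first_order = post.T @ features   # (K, D)

    # MAP update of the means: interpolate between prior mean and data mean.
    adapted = (tau * means + first_order) / (tau + n_k)[:, None]

    # Scale each adapted mean and concatenate into one supervector.
    scaled = np.sqrt(weights)[:, None] * adapted / np.sqrt(variances)
    return scaled.ravel()


# Usage with synthetic data (all values are made up for illustration).
rng = np.random.default_rng(0)
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_variances = np.ones((2, 2))
shot_features = rng.normal(size=(50, 2))
supervector = gmm_supervector(shot_features, ubm_weights, ubm_means, ubm_variances)
```

The resulting vector has length K × D regardless of how many descriptors the shot contains, which is what allows shots of different lengths to be compared with a fixed-dimensional SVM kernel.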