Departmental Bulletin Paper 〈Project Introduction〉 Ultra-Large-Scale Corpus Construction Project: Building a Japanese Web Corpus: Use and Application
Building NINJAL Web Japanese Corpus : Use and Application

ASAHARA, Masayuki (浅原 正幸)

6(1), pp. 1-10, 2015-06, National Institute for Japanese Language and Linguistics (国立国語研究所)
ISSN: 2185-0100 (print), 2185-0119 (online)
In 2011, the National Institute for Japanese Language and Linguistics launched a corpus compilation project with the aim of constructing a ten-billion-word Web corpus. The project was split into the following four sub-projects: page collection, linguistic annotation, release, and preservation. In the page collection stage, crawling began during the fourth quarter of 2012. We crawled 100 million URLs every three months as fixed-point observations. This paper presents the basic statistics of the crawled data and discusses possible theoretical and practical implications of these language resources.
