Departmental Bulletin Paper: Document Classification Method Using Visual Character Representations and Deep Learning

島田, 大輔

Languages such as Chinese and Japanese have significantly more characters than other languages, and their sentences consist of concatenated words with a wide variety of inflected forms; appropriate word segmentation is therefore quite difficult. In this study, we propose a new and efficient document classification technique for such languages. The proposed method is characterized by two components: a new “image-based character embedding” method and a character-level convolutional neural network with “wildcard training.” The first encodes each character based on its visual structure and preserves that structure. The second treats some of the input characters as wildcards to prevent over-fitting of the classifier. We confirmed that our method outperformed conventional ones on Japanese document classification tasks without data pre-processing.

Key Words: document classification, deep learning, convolutional neural network, Japanese character
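
The idea of treating some input characters as wildcards can be illustrated with a minimal sketch. The abstract does not specify how a wildcard is represented or how characters are selected, so the following assumes that a wildcard is an all-zero embedding vector and that each character is replaced independently with a fixed probability. The function name `apply_wildcards` and the parameter `wildcard_rate` are hypothetical and chosen only for illustration.

```python
import random


def apply_wildcards(embedded_chars, wildcard_rate, rng):
    """Return a copy of a character-embedding sequence with random wildcards.

    Assumption (not stated in the abstract): a wildcard is an all-zero
    vector, and each character is replaced independently with probability
    `wildcard_rate`. This would be applied only during training.
    """
    if not 0.0 <= wildcard_rate <= 1.0:
        raise ValueError("wildcard_rate must be in [0, 1]")
    result = []
    for vec in embedded_chars:
        if rng.random() < wildcard_rate:
            # Hide this character from the classifier.
            result.append([0.0] * len(vec))
        else:
            # Keep the character's embedding unchanged (copied).
            result.append(list(vec))
    return result


# Toy stand-ins for image-based character embeddings (three characters,
# four dimensions each); the values are arbitrary.
sentence = [
    [0.2, 0.9, 0.1, 0.4],
    [0.7, 0.3, 0.8, 0.5],
    [0.6, 0.1, 0.2, 0.9],
]
noisy = apply_wildcards(sentence, wildcard_rate=0.3, rng=random.Random(0))
```

Because a different subset of characters is hidden each time a document is presented, the classifier cannot rely on any single character, which is one plausible reading of how this kind of training discourages over-fitting.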
