Word segmentation and part-of-speech tagging for Thai language using minimum text and conditional random field
Issued Date
2020
Language
eng
File Type
application/pdf
No. of Pages/File Size
70 leaves
Other identifier(s)
b212173
Rights
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Physical Location
National Institute of Development Administration. Library and Information Center
Citation
Kannikar Paripremkul (2020). Word segmentation and part-of-speech tagging for Thai language using minimum text and conditional random field. Retrieved from: https://repository.nida.ac.th/handle/662723737/5642.
Title
Word segmentation and part-of-speech tagging for Thai language using minimum text and conditional random field
Abstract
Thai word segmentation and Part-of-Speech (POS) tagging remain active research areas. However, previous studies mostly focus on rule-based models or generative models such as the Hidden Markov Model (HMM), which may not be suitable for segmenting unknown words. In this research, we present a novel technique for word segmentation in languages without explicit word boundary delimiters, such as Thai, Chinese, or Korean. This research proposes using a machine learning model, the Conditional Random Field (CRF), to segment Thai formal and informal words, including unknown words, teen slang, and loanwords.
To avoid word ambiguity, the word segmentation method is separated into three parts: (1) Minimum Text Unit (MTU) segmentation (the MTU being the smallest unit of a Thai word), (2) syllable segmentation, and (3) word segmentation. In word segmentation, Longest Matching with pattern rules is used to assign word units. Pattern rules that follow Thai language structure for combining characters are also created to avoid segmentation errors. To select features for the CRF, existing research and the Thai language system are evaluated. For the character features of the CRF, we present both a general character level and more fine-grained levels of vowels. Front vowels, for example, can be separated into two categories: (a) front vowels that can have other characters placed in front of them and (b) front vowels that cannot have other characters placed in front of them.
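The Longest Matching step described above can be illustrated with a minimal sketch. It assumes the input has already been split into syllable-like units and that a word lexicon is available; the function name `longest_matching`, the `max_units` window, and the toy lexicon are hypothetical, and the thesis's pattern rules are not modeled here.

```python
def longest_matching(units, lexicon, max_units=6):
    """Greedily merge consecutive units into the longest word found in the lexicon.

    units:     list of strings (assumed to be syllable-level units)
    lexicon:   set of known words (a stand-in for a dictionary)
    max_units: assumed upper bound on how many units one word may span
    """
    words = []
    i = 0
    while i < len(units):
        end = min(len(units), i + max_units)
        # Try the longest candidate first, then shrink the window.
        for j in range(end, i, -1):
            candidate = "".join(units[i:j])
            # A single unit is always accepted as a fallback (unknown word).
            if j == i + 1 or candidate in lexicon:
                words.append(candidate)
                i = j
                break
    return words


# Toy example: two syllables that form one dictionary word, then a third word.
toy_lexicon = {"โรงเรียน", "โรง", "เรียน", "ดี"}
print(longest_matching(["โรง", "เรียน", "ดี"], toy_lexicon))
```

Because the match is greedy, a long dictionary entry always wins over shorter ones that cover the same units, which is why rules or a statistical model are typically needed on top of it to resolve ambiguous cases.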
In the POS tagging procedure, each word is assigned a POS tag by the CRF model. POS tags are revised from an existing corpus to reduce the complexity of usage by grouping uncertain POS tags together. Training data from this existing corpus is re-segmented using the proposed word segmentation method, primarily focusing on the accuracy of word units according to the official Thai dictionary. For the features used in POS tagging, we experimented with several options and chose those found to be best suited to the CRF method.
The performance of the proposed techniques is evaluated using common measurements, namely precision, recall, and F-score. The results are also compared to those of other state-of-the-art methods. In word segmentation, our proposed techniques are compared to a system using a convolutional neural network (CNN) that segments text into words. In terms of POS tagging performance, we compare our techniques to a well-known open API for the Thai language called PyThaiNLP, which uses a perceptron algorithm for tagging parts of speech. The proposed approaches achieve high scores on all test data, especially in word segmentation. Our analysis also suggests a need to collect more training data, which may improve segmentation accuracy as well as the results of POS tagging, since both parts of the model are related.
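As an illustration of the measurements named above, the following sketch computes precision, recall, and F-score for word segmentation. It assumes a word counts as correct only when both of its character boundaries match the reference segmentation; this scoring convention and the helper names `word_spans` and `segmentation_prf` are assumptions, not details taken from the thesis.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, predicted_words):
    """Return (precision, recall, f_score) for one segmented text.

    A predicted word is correct only if the same (start, end) span
    also appears in the gold segmentation.
    """
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example with Latin placeholders: only the first word matches.
print(segmentation_prf(["ab", "c", "de"], ["ab", "cd", "e"]))
```

The F-score here is the harmonic mean of precision and recall, F = 2PR / (P + R), so it is high only when both values are high.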
Description
Thesis (Ph.D. (Computer Science and Information Systems))--National Institute of Development Administration, 2020