Word segmentation and part-of-speech tagging for Thai language using minimum text and conditional random field

Kannikar Paripremkul

dc.contributor.advisor	Ohm Sornil	th
dc.contributor.author	Kannikar Paripremkul	th
dc.date.accessioned	2022-03-03T04:22:08Z
dc.date.available	2022-03-03T04:22:08Z
dc.date.issued	2020	th
dc.date.issuedBE	2563	th
dc.description	Thesis (Ph.D. (Computer Science and Information Systems))--National Institute of Development Administration, 2020	th
dc.description.abstract	Thai word segmentation and Part-of-Speech (POS) tagging are still very active research areas. However, previous studies mostly focus on rule-based models or generative models such as the Hidden Markov Model (HMM), which may not be suitable for segmenting unknown words. In this research, we present a novel technique for word segmentation in languages without explicit word boundary delimiters, such as Thai, Chinese, or Korean. This research uses a machine learning model, the Conditional Random Field (CRF), to segment Thai formal and informal words, including unknown words, teen slang, and loanwords. To avoid word ambiguity, the word segmentation method is separated into three parts: (1) Minimum Text Unit (MTU) segmentation (the MTU being the smallest unit of a Thai word), (2) syllable segmentation, and (3) word segmentation. In word segmentation, Longest Matching with pattern rules is used to assign word units. Pattern rules that follow the Thai language structure for combining characters are also created to avoid segmentation errors. To select features for the CRF, existing research and the Thai language system are examined. For the character features of the CRF, we present both general character classes and more fine-grained levels of vowels. Front vowels, for example, can be separated into two categories: (a) front vowels that can have other characters placed in front of them and (b) front vowels that cannot have other characters placed in front of them. In the POS tagging procedure, each word is assigned a POS tag by the CRF model. POS tags are revised from an existing corpus to reduce the complexity of usage by grouping uncertain POS tags together. Training data from this existing corpus is re-segmented using the proposed word segmentation method, primarily focusing on the accuracy of word units according to the official Thai dictionary. 
For the features used in POS tagging, we experimented with several options and chose the features found to be best suited to the CRF method. The performance of the proposed techniques is evaluated using common measurements, namely precision, recall, and F-score. The results are also compared to those of other state-of-the-art methods. In word segmentation, our proposed techniques are compared to a system that uses a convolutional neural network (CNN) to segment text into words. In terms of POS tagging performance, we compare our techniques to a well-known open API for the Thai language called PyThaiNLP, which uses a perceptron algorithm for tagging parts of speech. The proposed approaches achieve high scores on all test data, especially in word segmentation. Our analysis also suggests a need to collect more training data, which may improve segmentation accuracy as well as the results of POS tagging, since both parts of the model are related.	th
dc.format.extent	70 leaves	th
dc.format.mimetype	application/pdf	th
dc.identifier.doi	10.14457/NIDA.the.2020.128	th
dc.identifier.other	b212173	th
dc.identifier.uri	https://repository.nida.ac.th/handle/662723737/5642	th
dc.language.iso	eng	th
dc.publisher	National Institute of Development Administration	th
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	th
dc.subject	Word Segmentation	th
dc.subject	Part-of-Speech Tagging	th
dc.subject	Minimum Text Unit	th
dc.subject	Conditional Random Field	th
dc.subject.other	Thai language	th
dc.title	Word segmentation and part-of-speech tagging for Thai language using minimum text and conditional random field	th
dc.type	text--thesis--doctoral thesis	th
mods.genre	Dissertation	th
mods.physicalLocation	National Institute of Development Administration. Library and Information Center	th
thesis.degree.department	School of Applied Statistics	th
thesis.degree.discipline	Computer Science and Information Systems	th
thesis.degree.grantor	National Institute of Development Administration	th
thesis.degree.level	Doctoral	th
thesis.degree.name	Doctor of Philosophy	th
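
Illustrative note on the abstract. The abstract describes a word-assembly step based on Longest Matching and an evaluation based on precision, recall, and F-score. The following is a minimal Python sketch of those two ideas only, not the thesis's implementation: the thesis's pattern rules and CRF stages are not reproduced, the input is assumed to be already split into syllables, and the names longest_match, word_spans, prf, and max_units, as well as the toy lexicon, are hypothetical. Scoring by exact word spans is also an assumption about how the measurements are computed.

```python
from typing import List, Set, Tuple


def longest_match(units: List[str], lexicon: Set[str], max_units: int = 6) -> List[str]:
    """Greedy longest matching over pre-segmented units (e.g. syllables).

    At each position, take the longest run of units whose concatenation is
    in the lexicon; if no multi-unit run matches, emit the single unit.
    """
    words: List[str] = []
    i = 0
    while i < len(units):
        take = 1  # fallback: an unmatched unit becomes a word on its own
        for n in range(min(max_units, len(units) - i), 1, -1):
            if "".join(units[i:i + n]) in lexicon:
                take = n
                break
        words.append("".join(units[i:i + take]))
        i += take
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmentation to the set of (start, end) offsets of its words."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def prf(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F-score by exact span match."""
    p_spans, g_spans = word_spans(predicted), word_spans(gold)
    correct = len(p_spans & g_spans)
    precision = correct / len(p_spans) if p_spans else 0.0
    recall = correct / len(g_spans) if g_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example: "students go to school", given as five syllables.
units = ["นัก", "เรียน", "ไป", "โรง", "เรียน"]
lexicon = {units[0] + units[1], units[3] + units[4], units[2]}  # hypothetical entries
gold = [units[0] + units[1], units[2], units[3] + units[4]]

predicted = longest_match(units, lexicon)
print(predicted)
print(prf(predicted, gold))   # longest matching against the gold segmentation
print(prf(units, gold))       # baseline: every syllable treated as a word
```

On this toy input, longest matching recovers the three gold words, while the one-syllable-per-word baseline matches only one of three gold words (precision 0.2, recall about 0.33, F-score 0.25). Greedy matching of this kind depends entirely on lexicon coverage, which is one reason a learned model such as a CRF is attractive for unknown words.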