Word segmentation and part-of-speech tagging for Thai language using minimum text and conditional random field

dc.contributor.advisor: Ohm Sornil [th]
dc.contributor.author: Kannikar Paripremkul [th]
dc.description: Thesis (Ph.D. (Computer Science and Information Systems))--National Institute of Development Administration, 2020 [th]
dc.description.abstract: Thai word segmentation and Part-of-Speech (POS) tagging remain active research areas. However, previous studies mostly focus on rule-based models or generative models such as the Hidden Markov Model (HMM), which may not be suitable for segmenting unknown words. In this research, we present a novel technique for word segmentation in languages without explicit word boundary delimiters, such as Thai, Chinese, or Korean. The research applies a machine learning model, the Conditional Random Field (CRF), to segment Thai formal and informal words, including unknown words, teen slang, and loanwords. To avoid word ambiguity, the segmentation method is separated into three parts: (1) Minimum Text Unit (MTU) segmentation (the MTU being the smallest unit of a Thai word), (2) syllable segmentation, and (3) word segmentation. In word segmentation, Longest Matching with pattern rules is used to assign word units. Pattern rules that follow Thai language structure for combining characters are also created to avoid segmentation errors. To select features for the CRF, existing research and the Thai language system are evaluated. For the character features of the CRF, we present both a general character level and more fine-grained levels of vowels. Front vowels, for example, can be separated into two categories: (a) front vowels that can have other characters placed in front of them and (b) front vowels that cannot have other characters placed in front of them. In the POS tagging procedure, each word is assigned a POS tag by the CRF model. POS tags are revised from an existing corpus to reduce the complexity of usage by grouping uncertain POS tags together. Training data from this existing corpus is re-segmented using the proposed word segmentation method, primarily focusing on the accuracy of word units according to the official Thai dictionary.
For the features used in POS tagging, we experimented with several options and chose the features found to be best suited to the CRF method. The performance of the proposed techniques is evaluated using common measurements, namely precision, recall, and F-score. The results are also compared to those of other state-of-the-art methods. In word segmentation, our proposed techniques are compared to a system that uses a convolutional neural network (CNN) to segment text into words. In terms of POS tagging performance, we compare our techniques to a well-known open API for the Thai language called PyThaiNLP, which uses a perceptron algorithm for tagging parts of speech. The approaches proposed by this research achieve high scores on all test data, especially in word segmentation. Our analysis also suggests a need to collect more training data, which may improve segmentation accuracy as well as the results of POS tagging, since both parts of the model are related. [th]
dc.format.extent: 70 leaves [th]
dc.publisher: National Institute of Development Administration [th]
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. [th]
dc.subject: Word Segmentation [th]
dc.subject: Part-of-Speech Tagging [th]
dc.subject: Minimum Text Unit [th]
dc.subject: Conditional Random Field [th]
dc.subject.other: Thai language [th]
dc.title: Word segmentation and part-of-speech tagging for Thai language using minimum text and conditional random field [th]
dc.type: text--thesis--doctoral thesis [th]
mods.physicalLocation: National Institute of Development Administration. Library and Information Center [th]
thesis.degree.department: School of Applied Statistics [th]
thesis.degree.discipline: Computer Science and Information Systems [th]
thesis.degree.grantor: National Institute of Development Administration [th]
thesis.degree.name: Doctor of Philosophy [th]
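
Illustrative note on the abstract: the Longest Matching step it mentions can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the thesis's implementation: it matches over raw characters rather than MTUs or syllables, it omits the pattern rules, and the function name `longest_match_segment` and the four-word toy lexicon are hypothetical.

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right longest matching against a lexicon (a set of words)."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit one character as an unknown unit.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy lexicon, chosen only to show a segmentation ambiguity.
toy_lexicon = {"ตา", "ตาก", "กลม", "ลม"}
print(longest_match_segment("ตากลม", toy_lexicon))
```

With this lexicon the greedy pass picks "ตาก" + "ลม", although "ตา" + "กลม" is also a valid reading of the same string. Ambiguities of this kind are one reason a purely greedy matcher is usually combined with additional rules or a statistical model.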