WORD SEGMENTATION AND PART-OF-SPEECH TAGGING FOR THAI LANGUAGE USING MINIMUM TEXT AND CONDITIONAL RANDOM FIELD

dc.contributor: Kannikar Paripremkul [en]
dc.contributor: Kannikar Paripremkul [th]
dc.contributor.advisor: Ohm Sornil [en]
dc.contributor.advisor: โอม ศรนิล [th]
dc.contributor.other: National Institute of Development Administration. School of Applied Statistics [en]
dc.date.accessioned: 2022-03-03T04:22:08Z
dc.date.available: 2022-03-03T04:22:08Z
dc.date.issued: 13/8/2021
dc.identifier.uri: https://repository.nida.ac.th/handle/662723737/5642
dc.description: Doctor of Philosophy (Computer Science and Information Systems) (Ph.D.(Computer Science and Information Systems)) [en]
dc.description: Doctor of Philosophy (Computer Science and Information Systems) (Ph.D.(Computer Science and Information Systems)) [th]
dc.description.abstract: Thai word segmentation and part-of-speech (POS) tagging remain active research areas. However, previous studies mostly focus on rule-based models or generative models such as the Hidden Markov Model (HMM), which may not be suitable for segmenting unknown words. In this research, we present a novel technique for word segmentation in languages without explicit word boundary delimiters, such as Thai, Chinese, or Korean. This research proposes using a machine learning model, the Conditional Random Field (CRF), to segment Thai formal and informal words, including unknown words, teen slang, and loanwords. To avoid word ambiguity, the word segmentation method is separated into three parts: (1) Minimum Text Unit (MTU) segmentation (the MTU being the smallest unit of a Thai word), (2) syllable segmentation, and (3) word segmentation. In word segmentation, Longest Matching with pattern rules is used to assign word units. Pattern rules that follow Thai language structure for combining characters are also created to avoid segmentation errors. To select features for the CRF, existing research and the Thai language system are evaluated. For the character features of the CRF, we present both a general character level and more fine-grained levels of vowels. Front vowels, for example, can be separated into two categories: (a) front vowels that can have other characters placed in front of them and (b) front vowels that cannot have other characters placed in front of them. In the POS tagging procedure, each word is assigned a POS tag by the CRF model. POS tags are revised from an existing corpus to reduce the complexity of usage by grouping uncertain POS tags together. Training data from this existing corpus is re-segmented using the proposed word segmentation method, primarily focusing on the accuracy of word units according to the official Thai dictionary. For the features used in POS tagging, we experimented with several options and chose those found to be best suited to the CRF method. The performance of the proposed techniques is evaluated using common measurements, namely precision, recall, and F-score. The results are also compared to those of other state-of-the-art methods. In word segmentation, our proposed techniques are compared to a system that uses a convolutional neural network (CNN) to segment text into words. In terms of POS tagging performance, we compare our techniques to a well-known open API for the Thai language called PyThaiNLP, which uses a perceptron algorithm for tagging parts of speech. The proposed approaches achieve high scores on all test data, especially in word segmentation. Our analysis also suggests a need to collect more training data, which may improve segmentation accuracy as well as the results of POS tagging, since both parts of the model are related. [en]
dc.description.abstract: - [th]
dc.language.iso: en
dc.publisher: National Institute of Development Administration
dc.rights: National Institute of Development Administration
dc.subject: Word Segmentation [en]
dc.subject: Part-of-Speech Tagging [en]
dc.subject: Minimum Text Unit [en]
dc.subject: Conditional Random Field [en]
dc.subject: Longest Match [en]
dc.subject.classification: Computer Science [en]
dc.title: WORD SEGMENTATION AND PART-OF-SPEECH TAGGING FOR THAI LANGUAGE USING MINIMUM TEXT AND CONDITIONAL RANDOM FIELD [en]
dc.title: WORD SEGMENTATION AND PART-OF-SPEECH TAGGING FOR THAI LANGUAGE USING MINIMUM TEXT AND CONDITIONAL RANDOM FIELD [th]
dc.type: Dissertation [en]
dc.type: Dissertation [th]
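
Illustrative note: the abstract names Longest Matching as the word-assignment step and precision, recall, and F-score as the evaluation measures. The sketch below shows one plain way these two ideas could fit together. It is not the dissertation's implementation: it matches over raw characters rather than MTUs or syllables, it applies no pattern rules, and the function names (`longest_match`, `word_spans`, `prf`) and the tiny lexicon are assumptions made only for this example.

```python
from typing import List, Set, Tuple


def longest_match(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy left-to-right longest matching against a lexicon (hypothetical helper)."""
    max_len = max((len(w) for w in lexicon), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon entry matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            # No entry matched: emit one character as an unknown token.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmentation into a set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    pos = 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def prf(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F-score; a word counts only if both boundaries match."""
    pred_spans, gold_spans = word_spans(predicted), word_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Assumed toy lexicon: the string "ตากลม" can be read as "ตา|กลม" or "ตาก|ลม".
lexicon = {"ตา", "กลม", "ตาก", "ลม"}
predicted = longest_match("ตากลม", lexicon)
scores = prf(predicted, ["ตา", "กลม"])
```

With this toy lexicon, greedy matching picks the longer prefix "ตาก" first, so if the intended reading is "ตา|กลม" neither predicted word matches the gold segmentation. This is the kind of ambiguity that the abstract's staged segmentation and pattern rules are meant to reduce.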

