WORD SEGMENTATION AND PART-OF-SPEECH TAGGING FOR THAI LANGUAGE USING MINIMUM TEXT AND CONDITIONAL RANDOM FIELD

by Kannikar Paripremkul; Ohm Sornil

Title:

WORD SEGMENTATION AND PART-OF-SPEECH TAGGING FOR THAI LANGUAGE USING MINIMUM TEXT AND CONDITIONAL RANDOM FIELD

Advisor:

Ohm Sornil
โอม ศรนิล

Issued date:

13/8/2021

Publisher:

National Institute of Development Administration

Abstract:

Thai word segmentation and Part-of-Speech (POS) tagging are still very active research areas. However, previous studies mostly focus on rule-based models or generative models such as the Hidden Markov Model (HMM), which may not be suitable for segmenting unknown words. In this research, we present a novel technique to deal with the problem of word segmentation for a language without explicit word boundary delimiters, like Thai, Chinese, or Korean. This research applies a machine learning model, the Conditional Random Field (CRF), to segment Thai formal and informal words, including unknown words, teen slang, and loanwords. To avoid word ambiguity, the word segmentation method is separated into three parts: (1) Minimum Text Unit (MTU) segmentation (the MTU being the smallest unit of a Thai word), (2) syllable segmentation, and (3) word segmentation. In word segmentation, Longest Matching with pattern rules is used to assign word units. Pattern rules that follow Thai language structure for combining characters are also created to avoid segmentation errors. To select features for the CRF, existing research and the Thai language system are evaluated. For the character features of the CRF, we present both a general character level and more fine-grained levels of vowels. Front vowels, for example, can be separated into two categories: (a) front vowels that can have other characters placed in front of them and (b) front vowels that cannot have other characters placed in front of them. In the POS tagging procedure, each word is assigned a POS tag by the CRF model. POS tags are revised from an existing corpus to reduce the complexity of usage by grouping uncertain POS tags together. Training data from this existing corpus is re-segmented using the proposed word segmentation method, primarily focusing on the accuracy of word units according to the official Thai dictionary.
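The Longest Matching step named in the abstract can be illustrated with a minimal sketch. It assumes the text has already been split into syllables and that a word dictionary is available. The function name `longest_match`, the `max_len` limit, and the toy dictionary are hypothetical, and the thesis's pattern rules are not reproduced here.

```python
def longest_match(syllables, dictionary, max_len=4):
    """Greedily merge consecutive syllables into the longest dictionary word.

    Illustrative only: `max_len` (maximum syllables per word) and the
    dictionary contents are assumptions, not values from the thesis.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + size])
            # A single syllable is always accepted as a fallback,
            # so unknown units still advance the scan.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy example: "โรงเรียน" (school) is two syllables, "ดี" (good) is one.
toy_dictionary = {"โรงเรียน", "ดี"}
print(longest_match(["โรง", "เรียน", "ดี"], toy_dictionary))
```

In this sketch an unknown syllable is passed through as its own unit; the abstract indicates that the actual method relies on the CRF and pattern rules to handle unknown words.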
For the features used in POS tagging, we experimented with several options and chose those found to be best suited to the CRF method. The performance of the proposed techniques is evaluated using common measurements, namely precision, recall, and F-score. The results are also compared to those of other state-of-the-art methods. In word segmentation, our proposed techniques are compared to a system using a convolutional neural network (CNN) that segments text into words. In terms of POS tagging performance, we compare our techniques to a well-known open API for the Thai language called PyThaiNLP, which uses a perceptron algorithm for tagging parts of speech. The approaches proposed by this research achieve high scores on all test data, especially in word segmentation. Our analysis also suggests a need to collect more training data, which may improve segmentation accuracy as well as the results of POS tagging, since both parts of the model are related.
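As a rough illustration of the evaluation measures named above, the following sketch computes precision, recall, and F-score for a segmentation by comparing word spans (character offsets) between a reference and a predicted segmentation. Matching on exact word spans is an assumption made for this example; the abstract does not state how the thesis counts correct words. The helper names `spans` and `prf` are hypothetical.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def prf(gold_words, pred_words):
    """Return (precision, recall, F-score) over exactly matching word spans."""
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example: the prediction merges the last two reference words.
print(prf(["ab", "c", "de"], ["ab", "cde"]))
```

Here one of the two predicted words matches one of the three reference words, so precision is 1/2 and recall is 1/3.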

Keyword(s):

Word Segmentation
Part-of-Speech Tagging
Minimum Text Unit
Conditional Random Field
Longest Match

Type:

Text

Language:

eng

URI:

https://repository.nida.ac.th/handle/662723737/5642

Files in this item (CONTENT)

  • 5720431002.pdf ( 2,896.33 KB )

All information resources in the Wisdom Repository are to be used for teaching, learning, and research purposes only, and the source must be cited every time they are used. Modifying the content and making further copies is prohibited, as is any use for commercial purposes, under any circumstances.



This item appears in the following Collection(s)

  • Dissertations, Theses, Term Papers [191]

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.

Copyright © National Institute of Development Administration | สถาบันบัณฑิตพัฒนบริหารศาสตร์
Library and Information Center | สำนักบรรณสารการพัฒนา
Email: NIDAWR@nida.ac.th    Chat: Facebook Messenger    Facebook: NIDAWisdomRepository
 

 
