|dc.description.abstract||In the information age, data have grown increasingly large in both dimension (the
number of features) and volume. Data mining processes, such as data classification
and data clustering, can be time-consuming when performed on high-dimensional data and
can produce poor results because of the so-called curse of dimensionality.
Feature selection is a fundamental technique that keeps only the most
significant features and eliminates irrelevant and redundant features from the full
set of features.
Filter-based feature selection is the focus of this dissertation.
This technique takes less time to select significant features, especially for
high-dimensional data, but cannot guarantee an optimal feature set.
Filter-based feature selection comprises two important parts: the search
process and criterion function evaluation. Floating search is commonly used for the
search process. It is a heuristic search that does not take much time but
cannot guarantee an optimal feature set. The second part relies on a criterion function,
which is an independent measure used to evaluate and select feature subsets without
actually running a data mining algorithm. Therefore, it does not inherit any bias of
the data mining algorithm. Usually, only one criterion function is used, so only one
characteristic of the data is considered at a time. In this dissertation, two criterion
functions are proposed for feature evaluation. The two functions complement
each other, so two or more characteristics of the data can be considered together to
select features effectively.
Noise, ambiguity, and uncertainty in data, which are frequently found in
real-world problems, can affect the data mining process. Hence, fuzzy logic was applied
in this dissertation to cope with these problems. A membership function was needed in
the fuzzy logic to fuzzify the original data and to infer fuzzy values from it. The fuzzy
values were then passed through the feature selection process instead of the original data.
A genetic algorithm (GA), rather than a human expert, was used to determine the irregular
shape of the membership function.
In the experiments, the two proposed criterion functions were found to be
effective at selecting features that increase the accuracy of data classification. The
proposed method outperforms two existing methods: the hybrid method and the filter-based
method with one criterion function. The experimental results also show that the proposed
method with fuzzy logic enhances classification accuracy. It outperforms some
wrapper-based feature selection methods, which are widely known to achieve
higher accuracy than filter-based methods.
The proposed feature selection method can also be used to reduce data
dimension for unsupervised learning problems, such as data clustering. Unlike in
supervised learning problems, the data objects have no class label attribute to guide
their grouping into clusters. Hence, it is not an easy task to select discriminant
features for unsupervised learning problems. Criterion functions, or measures, for
unsupervised learning problems were therefore also proposed for use with the proposed
method. The experimental results showed that the proposed method can help
improve clustering accuracy when compared with the results from other
approaches. Therefore, the proposed feature selection method can be used for both
supervised and unsupervised learning problems.||th