﻿Feature Subset Selection Using A Genetic Algorithm
Jihoon Yang and Vasant Honavar
Arti cial Intelligence Research Group
Department of Computer Science
226 Atanaso Hall
Iowa State University
Ames, IA 50011. U.S.A.
fyang|honavarg@cs.iastate.edu
Abstract
Many practical pattern classi cation applications require a careful selection of attributes or features (from
amuch larger set) to represent the patterns to be classi ed. This feature subset selection problem is a multicriterion
optimization problem. We propose a solution to this problem using a genetic algorithm. Our experiments
demonstrate the feasibility of this approach for feature subset selection in the automated design of neural
network pattern classi ers.
1 Introduction
Many practical pattern classi cation tasks (e.g., medical diagnosis) require learning of an appropriate classi cation
function that assigns a given input pattern (typically represented using a vector of attribute or feature values) to
one of a nite set of classes. The choice of features, attributes, or measurements used to represent patterns that are
presented to a classi er a ect (among other things):
The accuracy of the classi cation function that can be learned using an inductive learning algorithm (e.g., a
decision tree induction algorithm or a neural network learning algorithm): The attributes used to describe the
patterns implicitly de ne a pattern language. If the language is not expressive enough, it would fail to capture
the information that is necessary for classi cation and hence regardless of the learning algorithm used, the
accuracy of the classi cation function learned would be limited by this lack of information.
The time needed for learning a su ciently accurate classi cation function: For a given representation of
the classi cation function, the attributes used to describe the patterns implicitly determine the search space
that needs to be explored by the learning algorithm. An abundance of irrelevant attributes can unnecessarily
increase the size of the search space, and hence the time needed for learning a su ciently accurate classi cation
function.
The number of examples needed for learning a su ciently accurate classi cation function: All other things
being equal, the larger the number of attributes used to describe the patterns in a domain of interest, the
larger is the number of examples needed to learn a classi cation function to a desired accuracy.
The cost of performing classi cation using the learned classi cation function: In many practical applications
e.g., medical diagnosis, patterns are described using observable symptoms as well as results of diagnostic tests.
Di erent diagnostic tests might have di erent costs as well as risks associated with them. For instance, an
invasive exploratory surgery can be much more expensive and risky than say, a blood test.
This presents us with a feature subset selection problem in automated design of pattern classi ers. The feature
subset selection problem refers the task of identifying and selecting a useful subset of attributes to be used to
represent patterns from a larger set of often mutually redundant, possibly irrelevant, attributes with di erent
Vasant Honavar's research is partially supported by grants from the National Science Foundation (IRI-9409580 and IRI-9643299)
and the John Deere Foundation.
1
associated measurement costs and/or risks. An example of such a scenario which is of signi cant practical interest
is the task of selecting a subset of clinical tests (each with di erent nancial cost, diagnostic value, and associated
risk) to be performed as part of a medical diagnosis task. Other examples of feature subset selection problem include
large scale data mining applications, power system control, and so on. Since exhaustive search over all possible
combinations of features is not feasible, most current approaches to feature subset selection assume monotonicity
of some measure of classi cation performance so that adding features is guaranteed to not worsen performance and
rely on a combination of heuristic search and branch and bound search [Ripley, 1996]. While they appear to work
well with conventional statistical classi ers, their performance can be quite poor with non-linear classi ers such as
neural networks [Ripley, 1996]. Such feature selection techniques fail to deal with multiple selection criteria (e.g.,
classi cation accuracy, feature measurement cost, etc.). Furthermore, in many practical scenarios the monotonicity
assumption is not satis ed. Thus, alternative approaches to the feature selection problem are of interest.
2 Genetic Selection of Feature Subsets for Neural Network Pattern
Classi ers
The feature subset selection problem described above is essentially a multi-criterion optimization problem. The
multiple criteria to be optimized include the accuracy of classi cation, cost and risk associated with classi cation
which in turn depends on the selection of attributes used to describe the patterns. Evolutionary algorithms o er
a particularly attractive approach tomulti-criteria optimization problems. This paper explores an approach to
feature subset selection using a genetic algorithm. The results presented in this paper are based on using a neural
network pattern classi ers although the general approach can be used with any inductive learning algorithm. (The
interested reader is referred to [Honavar, 1994; Langley, 1995; Mitchell, 1997] for surveys of di erent approaches
to inductive learning). Neural networks { densely interconnected networks of relatively simple computing elements
(e.g., threshold or sigmoid neurons) { o er an attractive framework for the design of pattern classi ers for realworld
real-time pattern classi cation tasks on account of their potential for parallelism and fault and noise tolerance
[Ripley, 1996; Hassoun, 1995; Gallant, 1993]. The classi cation function realized by a neural network is determined
by the functions computed by the neurons, the connectivity of the network, and the parameters (weights) associated
with the connections. It is well-known that multi-layer networks of non-linear computing elements (e.g., threshold
neurons) can realize any classi cation function : < n ! C or : D n ! C where C is a nite set of classes and n
is a nite number of discrete or real valued attributes, < is the set of real numbers, and D is a nite set of discrete
values. If the attributes are symbolic, they have to be rst mapped to numeric values using appropriate coding
schemes. While evolutionary algorithms are generally quite e ective for rapid global search of large search spaces in
mult-modal optimization problems, neural networks o er a particularly attractive approach to netuning solutions
once promising regions in the search space have been identi ed [Mitchell, 1996]. Against this background, genetic
algorithms o er an attractive approach to feature subset selection for neural network pattern classi ers.
However, the use of genetic algorithms for feature subset selection for neural network pattern classi ers trained
using traditional neural network training algorithms such as backpropagation presents some practical problems:
Traditional neural network learning algorithms (e.g., backpropagation) perform an error gradient guided search
for a suitable setting of weights in the weight space determined by a user-speci ed network architecture. This
ad hoc choice of network architecture often inappropriately constrains the search for an appropriate setting
of weights. For example, if the network has fewer neurons than necessary, the learning algorithm will fail to
nd the desired classi cation function. If the network has far more neurons than necessary, it can result in
over tting of the training data leading to poor generalization. In either case, it would make it di cult to
evaluate the usefulness of a feature subset employed to describe (or represent) the training patterns used to
train the neural network.
Gradient based learning algorithms although mathematically well-founded for unimodal search spaces, can
get caught in local minima of the error function. This can complicate the evaluation of the usefulness of a
feature subset employed to describe (or represent) the training patterns used to train the neural networks.
Atypical run of a genetic algorithm involves many generations. In each generation, evaluation of an individual
(a feature subset) involves training neural networks and computing their accuracy and cost. This can make
the tness evaluation rather expensive since gradient based algorithms are typically quite slow. The problem
is further exacerbated by the fact that multiple neural networks have to be used to sample the space of ad-hoc
2
choices of network architecture in order to get a reliable tness estimate for each feature subset represented
in the population.
Fortunately, constructive neural network learning algorithms [Honavar & Uhr, 1993; Gallant, 1990; Gallant,
1993] eliminate the need for ad-hoc, and often inappropriate a-priori choices of network architectures; and can
potentially discover near-minimal networks whose size is commensurate with the complexity of the classi cation
task that is implicitly speci ed by the training data. Several new, provably convergent, and relatively e cient
constructive learning algorithms for multi-category real as well as discrete valued pattern classi cation tasks have
begun to appear in the literature [Parekh et al., 1995; Parekh et al., 1996; Yang et al., 1997; Yang et al., 1996].
Many of these algorithms have demonstrated very good performance in terms of reduced network size, learning time,
and generalization in a number of experiments with both arti cial and fairly large real-world data sets [Honavar &
Uhr, 1993; Parekh et al., 1996; Yang et al., 1997; Yang et al., 1996; Gallant, 1990].
DistAl [Yang et al., 1997] is a simple and fast constructive neural network learning algorithm for pattern classi
cation. The results presented in this paper are based experiments using neural networks constructed by DistAl.
The key idea behind DistAl is to add hidden neurons one at a time based on a greedy strategy which ensures that
the hidden neuron correctly classi es a maximal subset of training patterns belonging to a single class. Correctly
classi ed examples can then be eliminated from further consideration. The process terminates when this process
results in an empty training set (when the network correctly classi es the entire training set). When this happens,
the training set becomes linearly separable in the transformed space de ned by the hidden neurons. In fact, it is
possible to set the weights on the hidden to output neuron connections without going through an iterative process.
It is straightforward to show that DistAl is guaranteed to converge to 100% classi cation accuracy on any nite
training set in time that is polynomial in the number of training patterns. Experiments reported in [Yang et al.,
1997] show that DistAl, despite its simplicity, yields classi ers that compare quite favorably with those generated
using more sophisticated (and substantially more computationally demanding) constructive learning algorithms.
This makes DistAl an attractive choice for experimenting with evolutionary approaches to feature subset selection
for neural network pattern classi ers.
3 Implementation Details
Our experiments were run using a standard genetic algorithm [Goldberg, 1989; Mitchell, 1996] with rank-based
selection strategy. The reported results are based on 10 independent runs for each classi cation task. The relevant
parameter settings were:
Population size: 50
Number of generation: 300
Probability of crossover: 0.5
Probability ofmutation: 0.01
probability of selection of the highest ranked individual: 0.6
Each individual in the population represents a candidate solution to the feature subset selection problem. Let
m be the total number of attributes available to choose from to represent the patterns to be classi ed. In a medical
diagnosis task, these would be observable symptoms and a set of possible diagnostic tests that can be performed on
the patient. (Note that given m such attributes, there exist 2 m possible feature subsets. Thus, for large values of
m, exhaustive search is not feasible).
Each individual in the population is represented by a binary vector of dimension m. If a bit is a 1, it means
that the corresponding attribute is selected. A value of 0 indicates that the corresponding attribute is not selected.
The tness of an individual is determined by evaluating the neural network constructed by DistAl using a training
set whose patterns are represented using only the selected subset of features. If an individual has n bits turned on,
the corresponding neural network has n input nodes.
The tness function has to combine two di erent criteria { the accuracy of the classi cation function realized
by the neural network and the cost of performing classi cation. The accuracy of the classi cation function can
be estimated by calculating the percentage of patterns in a test set (generated independently from the training
set) that are correctly classi ed by the neural network in question. A number of di erent measures of the cost of
classi cation suggest themselves: cost of measuring the value of a particular attribute needed for classi cation (or
3
the cost of performing the necessary test in a medical diagnosis application), the risk involved, etc. To keep things
simple, we chose a relatively simple form of a 2-criteria tness function de ned as follows:
fitness(x) =accuracy(x) ,
cost(x)
+ costmax
accuracy(x)+1
where fitness(x) is the tness of the feature subset represented by x, accuracy(x) is the test accuracy of the neural
network classi er trained using DistAl using the feature subset represented by x, and costmax is an upper bound on
the costs of candidate solutions. In this case, it is simply the sum of the costs associated with all of the attributes.
This is clearly a somewhat ad hoc choice. However, it does discourage trivial solutions (e.g., a zero cost solution
with a very low accuracy) from being selected over reasonable solutions which yield high accuracy at a moderate
cost. It also ensures that 8x 0 fitness(x) (100 + costmax). In practice, de ning suitable tradeo s between
the multiple objectives has to be based on knowledge of the domain. In general, it is a non-trivial task to combine
multiple optimization criteria into a single tness function. A wide variety of approaches have been examined in
the utility theory literature [Keeney & Rai a, 1976].
4 Experiments
4.1 Description of Datasets
The experiments reported here used real world datasets as well as carefully constructed arti cial data sets to explore
the feasibility of using genetic algorithms for feature subset selection for neural network classi ers. The real-world
datasets were obtained from the machine learning data repository at the University of California at Irvine [Murphy
& Aha, 1994].
4.1.1 Arti cial Datasets
This dataset was constructed from the training data for the parity problem to explore the e ectiveness of the
genetic algorithm in selecting an appropriate subset of relevant attributes in the presence of redundant attributes so
as to minimize the cost and maximize the accuracy of the resulting neural network pattern classi er. The modi ed
training set is constructed as follows: The original attributes are replicated once (to introduce redundancy) thereby
doubling the number of attributes. Then an additional set of irrelevant attributes are generated and are assigned
random boolean values. The numberofsuch attributes was 7 for the original 3-parity dataset and 12 for the original
6-parity dataset. In the case of 3-parity 100 7-bit random vectors were generated and augmented with the 6-bit
vectors (corresponding to the original 3 bits plus an identical set of 3 bits). In the case of 6-bit parity, 300 random
12-bit vectors were generated and augmented with 12 bits (corresponding to the original 6 bits plus an identical set
of 6 bits). Each attribute in the resulting dataset is assigned a random cost between 0 and 9. In each case, 2/3 of
the dataset was used for training and the rest for testing.
4.1.2 Real-world Datasets
In our experiments with real world data sets, our objective was to compare the neural networks built using feature
subsets selected by the genetic algorithm with those that use the entire set of attributes available. In the absence
of information about the costs of measuring the attributes, we assumed that each attribute has zero cost associated
with it. Thus the focus was on identifying a minimal subset of attributes that yield high accuracy neural network
classi ers. In practical applications where fairly accurate estimates of the measurement costs of various attributes
are known, it would be straightforward to use them in tness evaluation.
Glass Identi cation (Glass): This dataset contains 214 examples belonging to 6 classes. Each pattern is
described by 9 real-valued attributes. 142 examples are randomly selected for training and the remaining 72
are used for testing.
BUPA Liver Disorders (Liver): This dataset contains 345 examples belonging to 2 classes. Each pattern is
described by 6 real-valued attributes. 230 randomly chosen examples are used for training and the remaining
115 examples are used for testing.
4
Wine Recognition (Wine): This dataset contains a total of 178 examples belonging to 3 classes. Each pattern
is described by 13 real-valued attributes. 120 examples are randomly selected for training and the remaining
58 examples are used for testing.
Wisconsin Diagnostic Breast Cancer (WDBC): This dataset contains a total of 569 examples belonging to
2 classes. Each pattern is described by 30 real-valued attributes. 380 examples are randomly selected for
training and the remaining 189 examples are used for testing.
4.2 Experimental Results
As pointed out earlier, the results reported here are based on 10 independent runs of the genetic algorithm for each
classi cation task.
In following tables, Dim denotes the number of attributes selected, Accuracy is the measured accuracy on the
test set, and Hidden is the number of hidden neurons in the neural network constructed by DistAl.
4.2.1 Arti cial Datasets
The results are shown in Table 1.
All attributes GA-selected subset
Dataset Dim Accuracy Hidden Dim Accuracy Hidden
3-bit Parity 13 63.6 9 3.5 0.7 100 0.0 3.9 1.6
6-bit Parity 24 64.0 30 6.1 0.3 100 0.0 4.5 1.5
Table 1: Comparison of neural network pattern classi ers constructed using the entire set of attributes against those
constructed using the best GA-selected subset. In both cases, DistAl was used to construct the neural network from
the training set.
The highest tness feature subset identi ed by the GA contained the original attributes from the parity problem
(although it did not always choose the minimum cost attributes). Furthermore, the generalization accuracy of the
resulting neural network was 100%. On the other hand, when the entire set of attributes was used, the generalization
accuracy of the resulting neural network was signi cantly lower. This is explained by the fact that the additional
attributes were assigned random binary values (and hence have little structure that can be exploited by an inductive
learning algorithm such asDistAl). The fact that some of the runs resulted in feature subsets with not necessarily
minimum cost suggests the possibility of improving the results by the use of a more principled choice of a tness
function that combines accuracy and cost.
4.2.2 Real-world Datasets
As pointed out earlier, in the absence of cost estimates for the attributes used to describe the patterns, the experiments
with real-world datasets assumed the same cost (more speci cally, zero cost) for each attribute. Thus, these
experiments compare the measured generalization of neural networks constructed by DistAl using a feature subset
selected by the GA with that of neural networks constructed using all of the attributes present in the respective
datasets.
The results shown in Table 2 indicate that the networks constructed using GA-selected subset of attributes
compare quite favorably with networks that use all of the attributes. Notably, the generalization accuracy on Wine
dataset improved from 96.6% to 99% using only about half of the attributes. The corresponding networks used a
comparable number of hidden neurons (but substantially fewer connections).
5 Summary and Discussion
The results presented in this paper indicate that genetic algorithms o er an attractive approach to solving the feature
subset selection problem (under a di erent cost and performance constraints) in inductive learning of neural network
pattern classi ers. This nds applications in cost-sensitive design of classi ers for tasks such as medical diagnosis,
5
All attributes GA-selected subset
Dataset Dim Accuracy Hidden Dim Accuracy Hidden
Glass 9 54.2 27 6.5 1:0 56.7 1:8 33.0 2:8
Liver 6 71.3 23 5.3 0:6 71.0 0:8 23.2 0:6
WDBC 30 92.1 14 16.4 3:4 92.1 0:0 14.0 0:0
Wine 13 96.6 4 7.0 1:5 99.1 1:2 4.6 0:9
Table 2: Comparison of neural network pattern classi ers for some real-world datasets constructed using the entire
set of attributes against those constructed using the best GA-selected subset. In both cases, DistAl was used to
construct the neural network from the training set.
computer vision, among others. Other applications of interest include automated data mining and knowledge
discovery from datasets with an abundance of irrelevant or redundant attributes. In such cases, identifying a relevant
subset that adequately captures the regularities in the data can be particularly useful. The GA-based approach
to feature subset selection does not rely on monotonicity assumptions that are used in traditional approaches to
feature selection which often limits their applicability to real-world classi cation and knowledge acquisition tasks.
Some directions for further research include: Application of GA-based approaches to feature subset selection
to large-scale pattern classi cation tasks that arise in power systems control, gene sequence recognition, and data
mining and knowledge discovery; Extensive experimental (and whenever feasible, theoretical) comparison of the performance
of the proposed approach with that of conventional methods for feature subset selection; More principled
design of multi-objective tness functions for feature subset selection using domain knowledge as well as mathematically
well-founded tools of multi-attribute utility theory. Some of these topics are the focus of our ongoing
research.
References
Gallant, S. (1990). Perceptron Based Learning Algorithms. IEEE Transactions on Neural Networks, 1(2), 179{191.
Gallant, S. (1993). Neural Network Learning and Expert Systems. Cambridge, MA: MIT Press.
Goldberg, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. New York: Addison-
Wesley.
Hassoun, M. (1995). Fundamentals of Arti cial Neural Networks. Boston, MA: MIT Press.
Honavar, V. (1994). Toward Learning Systems That Integrate Multiple Strategies and Representations. Pages 615{
644 of: Honavar, V., & Uhr, L. (eds), Arti cial Intelligence and Neural Networks: Steps Toward Principled
Integration. Academic Press: New York.
Honavar, V., & Uhr, L. (1993). Generative Learning Structures and Processes for Connectionist Networks. Information
Sciences, 70, 75{108.
Keeney, R., & Rai a, H. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeo s. New York:
Wiley.
Langley, P. (1995). Elements of Machine Learning. Palo Alto, CA: Morgan Kaufmann.
Mitchell, M. (1996). An Introduction to Genetic algorithms. Cambridge, MA: MIT Press.
Mitchell, T. (1997). Machine Learning. New York: McGraw Hill.
Murphy, P., & Aha, D. (1994). UCI Repository of Machine Learning Databases. Department of Information and
Computer Science, University of California, Irvine, CA. [http://www.ics.uci.edu/ MLRepository.html].
Parekh, R., Yang, J., & Honavar, V. (1995). Constructive Neural Network Learning Algorithms for Multi-Category
Classi cation. Tech. rept. TR95-15a. Department of Computer Science, Iowa State University.
6
Parekh, R., Yang, J., & Honavar, V. (1996). Constructive Neural Network Learning Algorithms for Real-Valued
Multi-Category Pattern Classi cation. Tech. rept. TR 96-14. Department of Computer Science, Iowa State
University.
Ripley, B. (1996). Pattern Recognition and Neural Networks. New York: Cambridge University Press.
Yang, J., Parekh, R., & Honavar, V. (1996). MTiling - A Constructive Neural Network Learning Algorithm
for Multi-Category Pattern Classi cation. Pages 182{187 of: Proceedings of the World Congress on Neural
Networks'96.
Yang, J., Parekh, R., & Honavar, V. (1997). Design of a Family of E cient Constructive Learning Algorithms based
on Sequential Learning. To appear.
7
