﻿Towards a More Efficient Evolutionary Induction
of Bayesian Networks
Carlos Cotta 1 and Jorge Muruzábal 2
1 Dept. Lenguajes y Ciencias de la Computación, ETSI Informática,
University of Málaga, Campus de Teatinos, 29071 - Málaga, SPAIN
ccottap@lcc.uma.es
2 Grupo de Estadística y Ciencias de la Decisión, ESCET,
University Rey Juan Carlos, 28933 - Móstoles, SPAIN
j.muruzabal@escet.urjc.es
Abstract. Bayesian networks (BNs) constitute a useful tool to model
the joint distribution of a set of random variables of interest. This paper
is concerned with the network induction problem. We propose a number
of hybrid recombination operators for extracting BNs from data. These
hybrid operators make use of phenotypic information in order to guide
the processing of information during recombination. The performance of
these new operators is analyzed with respect to that of their genotypic
counterparts. It is shown that these hybrid operators provide notably
improved and rather robust results. Some remarks on the future of the
area are also laid out.
1 Introduction
A Bayesian Network (BN) is a graphical model postulating a joint distribution
for a target set of random variables. One of the main advantageous features of
this model is the fact that it provides a neat separation between qualitative and
quantitative aspects of this distribution. On one hand, the qualitative aspects are
given by the underlying graphical structure, a Directed Acyclic Graph (DAG) G.
On the other hand, quantitative aspects are provided by the set of probabilities
attached to this DAG, say θ = θ(G).
Two well-defined problems can be identified within this context: the network
induction problem (learning an appropriate BN model), and the inference
problem (determining the predictive conditional distribution at some variable
of interest given a BN model and the values taken by certain other variables).
While the latter arises when a BN has already been identified and is to be deployed
in a given application, the former appears as a previous step. The focus
of this work is precisely on this induction problem.
The main issue in the induction problem is learning the structure or DAG G
(a variety of methods can be used to learn the probabilities θ). This turns out
to be NP -hard [6], and hence the use of heuristic algorithms is in order [11]. In
this sense, evolutionary algorithms [2] (EAs) emerge as interesting candidates
for this task. Here we concentrate on the use of EAs for BN induction. More
precisely, we explore in detail the role of recombination for this purpose.
The organization of the paper is as follows. Section 2 presents details of the
BN framework and lays out the basic learning problem addressed later. Section
3 introduces the new operators and Section 4 reviews the empirical evidence.
Finally, Section 5 closes with some discussion and prospects for future research.
2 Background
This section provides basic ideas and notational details about both BNs and some
scoring metrics used for evaluation purposes. A brief overview of EA approaches
for evolving DAGs is provided too.
2.1 Bayesian Networks
As mentioned above, a BN is a tuple (G, θ), where G is a DAG and θ = θ(G)
is a set of probability distributions attached to nodes in G. The DAG specifies
a number of links or arcs among variables or nodes. If we denote the whole
set of variables as X = {X1, X2, ..., Xn}, each variable Xj has a set of parents
denoted by Πj = {Xi ∈ X | {Xi → Xj} ∈ G}. Then, the DAG G represents the
(skeleton) joint distribution P (X) = � n
i=1 P (Xi | Πi). Note that at least one of
the Πi is empty; we talk of root nodes in this case.
A standard BN model arises when data are assumed to follow independent
Multinomial distributions, that is, P (Xi = k | Πi = j) = θijk, where j = 1, ..., qi;
k = 1, ..., ri; ri is the number of distinct values that Xi can assume and qi is
the number of different configurations that Πi can present. Hence, θ = {θijk}
collects all parameters in G and we have �
k θijk = 1 for all i and j.
Given a DAG G and a data matrix D with n columns and an arbitrary
number of exchangeable rows (N), the likelihood of the network probabilities
θ is given by the double product of the above Multinomials: P (D|G, θ) =
�n �qi �ri i=1 j=1 k=1 θNijk
ijk , where Nijk is the absolute frequency of value k in Xi
when its parent configuration Πi assumes state j. Given maximum likelihood
estimators (MLE) ˆθ = ˆθ(G, D) of θ, P (D|G, ˆθ) can serve as a rudimentary
scoring metric. We nevertheless focus on an alternative Bayesian measure. Con-
sider the marginal likelihood
�
P (D|G) = P (D|G, θ)π(θ|G)dθ, (1)
where π(θ|G) is a prior distribution on θ. If this measure is combined with a
prior distribution on DAG structures π(G), the log-posterior
F (G) = log π(G|D) = log π(G) + log P (D|G) (2)
is obtained. We will be using this Bayesian measure F for some particular choices
of π(G) and π(θ|G). To be precise, the former is chosen as π(G) ∝ N −g/2 ,
where g = �n i=1 (ri − 1)qi is the number of free θ parameters in the model. This
choice penalizes complex (i.e., highly dense) DAGs, and is closely related to the
asymptotic Bayesian Information Criterion [10] or BIC 1 . As to the π(θ|G), it
is taken to be the product of independent (conjugate) Dirichlet distributions. In
the case of no missing data and noninformative Dirichlet hyperparameters α, it is
then possible to perform in closed form the integration leading to P (D|G) above.
This key result adds to the computational tractability of the approach and will
be considered here too. The hyperparameters α can be interpreted in terms of
equivalent sample size [11]. The noninformative choice αi = 1 is usually adopted
ri
following theoretical considerations related to likelihood equivalence [12].
2.2 Evolutionary Induction of DAGs
Typically, the EA approach for designing BNs evolves DAG structures which
–when submitted for fitness calculation– are augmented with ˆ θ parameters and
fed into a scoring function. The internal representation of DAGs turns out to be
a crucial issue here. In essence, choices regarding this aspect can be classified
within two main categories: direct and indirect.
Direct approaches are those in which the search is conducted over the space of
all possible DAGs, say SDAG. An obvious potential problem in these approaches
is the generation of infeasible solutions (i.e., digraphs with cycles). This can be
avoided in two different ways. On one hand, a precedence order among variables
can be assumed; then, it suffices to evolve the upper triangular portion of the
adjacency matrix of the graph to obtain feasible DAGs (alternatively, closed
operators in SDAG can be defined; we will return to this point below). On the
other hand, a repair function can be used to remove cycles before evaluation.
See [14] for a comparison of both approaches.
As regards indirect approaches, these use an auxiliary space Saux to conduct
the search. Elements from Saux are then fed to a suitable (decoder) algorithm to
obtain the actual BNs they represent. Consider, for example, the search in the
space of n−element permutations [13]; the construction heuristic K2 [7] is subsequently
used to build the actual BN. This approach has the advantage of filtering
out infeasible solutions while it also introduces problem-specific knowledge.
There are some drawbacks though. For example, it may be difficult to design parsimonious
operators for exploring Saux in some situations (the Saux −→ SDAG
mapping can hinder the design of such operators). Also, some good regions of
SDAG may be unreachable by certain decoders.
Direct approaches do not face these hurdles as long as reproductive operators
manipulate meaningful information units, that is, the main idea is to depart from
the classical (purely syntactic) crossover in order to achieve semantically-sound
recombination. This goal can be seen as an upgrading process, and hence the
identification of syntactically-correct information units is still required. In this
sense, a generic analysis of the syntaxis of these units has been done in [8]. These
1
BIC = log P (D|G, ˆ�) − g
log N
2
authors show that minimal transmission units 2 have the following structure:
T (Xi � Xj, Φ) = {Xi � Xj}
T (Xi → Xj, Φ) = {Xi → Xj} ∪ {Xr � Xs | C ⊕ sr = 1}
where Φ is the partially-defined descendant DAG at any intermediate step of
recombination, Xi → Xj (resp. Xi � Xj) represents the decision of including in
(resp. excluding from) Φ the arc from Xi to Xj, C ⊕ = C ∞ Φ XOR C ∞ Φ∪{Xi→Xj} ,
and C∞ Ψ is the transitive closure of a graph Ψ. The next section shows how the
units in Eq. (3) can be endowed with semantic information.
3 Bayesian Network Recombination
Semantically-aware operators are defined in terms of phenotypic information.
This implies that recombination no longer takes place at the DAG level, but at
the BN level. Two hybrid-operator templates (and phenotypic measures to be
used therein) are discussed below.
3.1 Genetic vs. Allelic Recombination
Any DAG G can be viewed as the composition of a number of basic units η x ij ,
where i, j are nodes and x ∈ {1, 0} indicates whether the corresponding directed
arc is present in G or not. Informally speaking, each ηij is a gene, whereas each
ηx ij is an allele for that gene. Two recombination approaches are thus possible.
A first possibility is to focus on individual genes. A recombination operator
following this criterion must process all genes (in any suitable –not necessarily
fixed– order) to construct a valid solution. For each gene, a decision must be
made on whether to use the allele from either the father G or the mother H3 .
While a genotypic operator would make these decisions at random, the use of
phenotypic information is proposed here. More precisely, a Boolean function
β –taking the two parent BNs (G and H) and the partially-built child (Ξ)
as input– determines the value of each gene. The pseudocode of this generic
operator (termed PheGT for ‘phenotypic gene transmission’) is as follows:
1. for i ∈ {1..n} do ΠΞ i ← ∅
2. for i ∈ {1..n} do Υi ← (ΠG i ∪ ΠH i )
3. while ∃Υj �= ∅ do
(a) Pick Xi ∈ Υj
(b) Υj ← Υj \ {Xi}
(c) if β(ηij, G, H, Ξ) then
i. ΠΞ j ← ΠΞ j ∪ {Xi}
ii. for [Xk � Xs] ∈ T (Xi → Xj, Ξ) do Υs ←− Υs \ {Xk}
2 Transmission units can be seen as minimal –not necessarily elementary– pieces of
information that have to transmitted as a whole from parents to offspring so as to
ensure feasibility of the latter.
3 Extensions to multiparent recombination are straightforward.
(3)
A different template arises when the emphasis is put on individual alleles. In
this case, all η1 ij alleles taken from either parent are put on a common bag. Then,
the operator iteratively decides which alleles are extracted and injected (together
with the corresponding transmission unit) into the child. The operator may
decide to terminate transmission at any point along the process; all unspecified
genes are given the default value η0 ij in this case. Phenotypic information can be
used here for both deciding the order in which alleles are picked (using a selection
function σ), and for determining when to stop (using a Boolean function τ). The
corresponding template of this operator (termed PheAT for ‘phenotypic allele
transmission’) is as follows:
1. for i ∈ {1..n} do ΠΞ i ← ∅
2. Υ ←− 〈η1 ij | Xi ∈ ΠG j ∪ ΠH j 〉
3. while ¬τ(Υ, Ξ) do
(a) η1 ij ← σ(Υ, Ξ)
(b) Υ ← Υ \ {η1 ij }
(c) ΠΞ j ← ΠΞ j ∪ {Xi}
(d) for [Xk � Xs] ∈ T (Xi → Xj, Ξ) do Υ ← Υ \ {η1 ks }
Some possible instantiations of the above templates (β for PheGT; σ and τ
for PheAT) are discussed next.
3.2 Phenotypic Measures
The mutual information MI(Xj, Xi) criterion has often been the choice for measuring
the merit of single alleles η1 ij [16]. However, this measure is known to have
some limitations due to its myopic nature [9]. The updated MI measure, namely
the Conditional Mutual Information measure [9]
CMI(Xj, Xi || Πj \ {Xi}) = � P (Πj \ {Xi}) � P (Xj, Xi | Πj \ {Xi})
log
P (Xj, Xi | Πj \ {Xi})
P (Xj | Πj \ {Xi})P (Xi | Πj \ {Xi})
reflects the strength of the association between Xj and Xi once the effect of
Πj \ {Xi} is taken into account. While this CMI measure deserves further
attention, in this work we have decided to explore a somewhat simpler but
nonetheless interesting variant thereof. Specifically, given that Xi ∈ Πj in either
parent, we consider the grand average
µij = ri
rjqj
(4)
� V ar(Xi, y, w), (5)
where the sum ranges across both the qj
ri different values w that Πj \ {Xi} can
take and the rj different values y that Xj can take. The inner term V ar(Xi, y, w)
is defined as the variance of the probabilities P (Xj = y | Xi = z, Πj \{Xi} = w)
across the ri different values z that Xi can take. These probabilities are of course
nothing but P (Xj = y | Πj = (z, w)) = θ j(z,w)y with our earlier notation. As
usual, these theoretical µij are replaced in practice by their MLE ˆµ ij based on
ˆθ. If for some (y, w) the estimate of V ar(Xi, y, w) is close to 0, we conclude that
any Xi = z adds nothing new to what w already tells us about y. It is easy
to see that Eq. (4) is also close to 0 in this case. Conversely, if V ar(Xi, y, w) is
relatively large, then it does matter what Xi has to say in that situation, so we
would tend to use both Xi and Πj \ {Xi} when predicting Xj (in this particular
DAG and in general – recall that there is no explicit conservation law for the
Π ′ js from parents to children).
This µ measure can be used within the operator templates presented earlier.
We begin with β (central in PheGT). According to Eq. (5), µij is always in [0,
0.25]. Thus, a first option is to use µ ′ ij = 4µij as our transmission probability:
β(ηij, ·) ≡ URand(0, 1) < µ ′ ij . Since this can be a rather demanding criterion
for arc transmission, we also consider the more relaxed µ ′′
ij = 2√ µij.
As regards PheAT, an allele-selection function σ is required. This admits a
direct instantiation since we can always pick the allele with the highest µij value
(no rescaling required here). A simple (genotypic) criterion has been chosen in
turn for the termination function τ. Specifically, a random number is first drawn
from a Binomial(ν,φ) distribution, where φ = 1/2 and ν/2 approximates the
parents’ mean number of arcs, and arc transmission is terminated as soon as the
child reaches the desired number of arcs (or no transmittable arc remains). With
this choice, we can compare PheAT to a pure genotypic version (picking alleles
at random).
4 Experimental Results
We have tested a steady-state EA (popsize = 100, maxevals = 15000, crossover
rate pX = .9, mutation rate pm = 1/n2 ), using tournament selection (tournament
size = 3). No fine tuning of these parameters was attempted. The initial
population is obtained by generating DAGs at random4 . The goal is to minimize
the fitness function −F (G) = − log P (D|G) + g
2 log N, see Eq. (2) above.
Two networks have been chosen to benchmark the proposed approach: the
ALARM network, a 37-variable network for monitoring patients in the intensive
care unit [3], and the INSURANCE network, a 27-variable BN for evaluating
car insurance risks [4]. However, due to space constraints, we concentrate here
on the former (similar qualitative results have been obtained with the latter).
Training sets of N = 2, 000 examples were simulated from the ALARM network.
We have tried both the phenotypic (PheGT and PheAT) as well as the genotypic
(GT and AT) operators. Two variants of each operator have been considered
in turn: respectful and non-respectful. The property of respect [15] refers
in this case to the initial transmission of all arcs shared by the parent DAGs.
Since acyclicity is enforced at all times, inclusion of (some of) these arcs may
be impossible later5 . Hence, this initial transmission introduces an important
4 The size of the sets Πj is limited by qj < 2 11 .
5 For example consider DAGs G and H such that {Xi → Xj, Xj → Xk} ∈ G, and
{Xk → Xi, Xi → Xj} ∈ H. If arcs {Xj → Xk, Xk → Xi} are transmitted to the
child, it will be impossible to transmit the common arc Xi → Xj as well.
Table 1. Results of the different crossover operators on the ALARM network (averaged
for 10 runs).
−F ∗
− log P (D|G)
Operator best mean ± std.dev best mean ± std.dev.
GT 28718.18 29888.03 ± 584.18 26931.97 28045.57 ± 631.24
AT 29470.42 29901.69 ± 311.24 27338.98 28085.83 ± 382.90
PheGT 29314.62 30038.57 ± 528.11 27646.22 28614.16 ± 567.53
PheAT 25596.55 26115.35 ± 469.76 24106.94 24726.66 ± 534.44
GT R
24290.02 24807.84 ± 328.75 22617.82 23111.70 ± 285.17
AT R
24490.63 24896.07 ± 269.84 22806.67 23139.50 ± 224.46
PheGT R 23929.07 24493.83 ± 317.99 22291.76 22891.72 ± 368.90
PheAT R 23861.68 24430.92 ± 379.92 22216.06 22729.01 ± 299.41
PheGT R 2 23944.63 24245.57 ± 149.02 22422.18 22671.80 ± 170.73
HC 24528.41 24732.48 ± 168.53 22907.42 23000.66 ± 86.04
ALARM 24922.33 22987.90
qualitative change in behavior. As regards the µ ′ ij
vs. µ′′ ij choice in PheGT, we
consider only the respectful variant and mark the latter option with a subscript.
For comparative purposes, a hill climbing (HC) algorithm has also been tested.
This HC performs single-arc insertions and deletions and has been run for the
same number of evaluations as the EAs (re-start was performed each time stagnation
was reached). Table 1 shows the results.
A quick inspection of these results leads to several conclusions of interest.
Note first that the introduction of respect yields a substantial improvement in all
performance measures. It can also be seen that the phenotypic operators clearly
outperform 6 their genotypic counterparts (thus confirming the usefulness of the
phenotypic approach), whereas the HC algorithm lies somewhere in between.
Additionally, the networks provided by PheAT and PheGT are definitely better
(in terms of the selected measures) than the original network. This feature is due
to the finite size of the training set and indicates that our best operators achieve
some refinement in the BNs they produce. Finally, the non-linear mapping lead-
ing to µ ′′
ij in PheGTR 2 provides the best overall results. Clearly, this option boosts
the ability of PheGT for exploring and finding improved structures.
The structural properties of the evolved BNs are consistent with the analysis
above. Table 2 shows the total number of θ parameters (g) as well as the BIC
measure discussed earlier. It can be seen that the networks provided by AT and
PheAT are slightly more complex on average than those produced by GT and
PheGT, whereas all of them tend to be simpler than the true network. Note
also that the phenotypic crossover operators manage to produce networks of
similar BIC than the original ALARM network; moreover, in some cases they
interestingly provide even lower values.
6 Significantly, using a standard (two-sample) t-test. The same holds when using a
test set different from the training set, so as to evaluate overfitting.
Table 2. Structural properties of the networks evolved by the different crossover operators
for the ALARM benchmark (averaged for 10 runs).
#parameters BIC
Operator min mean ± std.dev. max best mean ± std.dev.
GT 415 484.8 ± 65.29 659 28613.96 29779.14 ± 583.73
AT 375 477.8 ± 81.59 629 29372.47 29795.73 ± 308.49
PheGT 266 374.8 ± 62.55 488 29210.28 29946.81 ± 530.89
PheAT 307 365.4 ± 58.89 507 25511.70 26032.53 ± 466.78
GT R
376 446.3 ± 38.17 505 24194.88 24708.69 ± 324.30
AT R
425 462.2 ± 30.74 536 24396.27 24796.51 ± 266.89
PheGT R 386 429.2 ± 27.65 483 23842.55 24433.58 ± 329.83
PheAT R 387 451.2 ± 44.28 514 23730.24 24350.72 ± 418.22
PheGT R 2 381 414.1 ± 20.78 441 23859.39 24155.89 ± 147.98
HC 403 455.7 ± 37.12 518 24434.77 24635.45 ± 164.56
ALARM 509 24049.43
5 Summary and the Likely Future
We have described and evaluated several new recombination operators for evolving
BNs. These operators are based on phenotypic information and thus depart
from previously proposed genotypic crossover and phenotypic mutation. It has
been shown that our phenotypic variants produce satisfactory results in problems
of moderate complexity. In this sense, the observance of the property of respect
has revealed itself as a crucial factor for the performance of these operators.
Perhaps the most challenging line of research in the wider BN induction
problem refers to the possibility of performing the search over the space of
BN equivalence classes, say SEq (rather than SDAG as above). Two BNs are
(Markov) equivalent if they encode the same statistical model, that is, the same
set of independence and conditional independence statements. Let [G] denote
the equivalence class of a DAG G. Given training data generated by G, many
DAG-based algorithms use scoring measures that indeed score equally all members
of [G]. Hence, all that can be reasonably asked in this case is to reach some
DAG in [G]: these algorithms can not be expected to reconstruct G exactly.
Note that the marginal likelihood P (D|G) in Eq. (1) is one of such metrics, yet
our fitness measure F (G) –dependent also on g– is not. It is still interesting to
imagine how such an alternative search process could be carried out by an EA
similar to the above. For one thing, search strategies that spend most of their
time within the same equivalence class would seem rather inefficient (since they
inadvertedly keep proposing the same model). We conclude by briefly providing
some preliminary insights on this matter.
It turns out that equivalence classes can be compactly represented by (certain
class of) partially directed acyclic graphs or PDAGs [1, 5]. PDAGs include
directed as well as undirected arcs. Chickering [5] provides an algorithm that
takes a given DAG G and outputs the PDAG ¯G that uniquely represents its
equivalence class [G]. Since ¯G and G have the same connectivity pattern (ig-
ANAPHYLAXIS
TPR
HYPOVOLEMIA
LVFAILURE
LVEDVOLUME STROKEVOLUME
PCWP CVP
CO
BP
PULMEMBOLUS
PAP
SHUNT
HISTORY
FIO2
PVSAT
SAO2
INTUBATION
CATECHOL
VENTALV
ARTCO2
ERRLOWOUTPUT
HRBP
PRESS
INSUFFANESTH
KINKEDTUBE
MINVOL
HR
HRSAT
DISCONNECT
VENTLUNG
EXPCO2
ERRCAUTER
HREKG
VENTTUBE
MINVOLSET
VENTMACH
Fig. 1. The ALARM network. The reversible arcs are: MINVOLSET→VENTMACH,
PULMEMBOLUS→PAP, ANAPHYLAXIS→TPR and LVFAILURE→HISTORY. We
can visualize the key PDAG representing [ALARM] by making these arcs undirected.
noring directionality), all undirected arcs in ¯G correspond to reversible arcs in
G, whereas all directed arcs in ¯G are compelled: they show up throughtout [G].
A reasonable assessment of BN quality would require correct directionality for
compelled arcs but just connectivity for reversible arcs. In the ALARM network,
for example, we find 4 reversible and 42 compelled arcs, see Figure 1.
To conclude, consider now potential mutation and crossover operators for
some parent PDAGs ¯G and ¯H. A first issue refers to PDAG validity: not all
PDAGs represent equivalence classes. Chickering [5] presents various operators
designed to modify a given ¯G so that the resulting PDAG effectively represents
a different equivalence class. For example, both directed and undirected arcs
can be added or deleted (compelled arcs can sometimes be reversed also). The
familiar mutation operators found in the evolutionary DAG arena (e.g., [16]) can
thus be extended along this way.
As regards crossover, an obvious approach would randomly instantiate ¯G
and ¯H so as to obtain DAGs G and H from which phenotypic measures could
be derived as above. The main challenge remains about how to meaningfully
incorporate this DAG-based information when defining the offspring ¯K derived
from ¯G and ¯H. We are currently exploring some ideas in this direction.
Acknowledgement
The authors are partially supported by grants TIC99-0754-C03-03 and TIC2001-
0175-C03-03 from the Spanish CICYT agency. We appreciate the assistance of
David Chickering in running his LABEL-EDGES algorithm.
References
1. S.A. Andersson, D. Madigan, and M.D. Perlman. A characterization of markov
equivalence classes for acyclic digraphs. Annals of Statistics, 25:505–541, 1997.
2. Th. Bäck, D.B. Fogel, and Z. Michalewicz. Handbook of Evolutionary Computation.
Oxford University Press, New York NY, 1997.
3. I.A. Beinlich, H.J. Suermondt, R.M. Chavez, and G.F. Cooper. The ALARM
monitoring system: A case study with two probabilistic inference techniques for
belief networks. In J. Hunter, J. Cookson, and J. Wyatt, editors, Proceedings
of the Second European Conference on Artificial Intelligence and Medicine, pages
247–256, Berlin, 1989. Springer-Verlag.
4. J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks
with hidden variables. Machine Learning, 29:213–244, 1997.
5. D.M. Chickering. Learning equivalence classes of bayesian-network structures. Submitted
manuscript, 2001.
6. D.M. Chickering, D. Geiger, and D. Heckermann. Learning bayesian networks is
NP-complete. In D. Fisher and H.-J. Lenz, editors, Learning from data: AI and
Statistics V, pages 121–130, New York NY, 1996. Springer-Verlag.
7. G. Cooper and E. Herskovits. A bayesian method for the induction of probabilistic
networks from data. Machine Learning, 9:309–347, 1992.
8. C. Cotta and J.M. Troya. Analyzing directed acyclic graph recombination. In
B. Reusch, editor, Computational Intelligence: Theory and Applications, volume
2206 of Lecture Notes in Computer Science, pages 739–748. Springer-Verlag, Berlin
Heidelberg, 2001.
9. N. Friedman, I. Nachman, and D. Pe’er. Learning bayesian network structures from
massive datasets: The sparse candidate algorithm. In H. Dubios and K. Laskey,
editors, Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence,
pages 206–215, San Francisco CA, 1999. Morgan Kaufmann.
10. D. Geiger, D. Heckerman, and C. Meek. Asymptotic model selection for directed
networks with hidden variables. In E. Horvitz and F.V. Jensen, editors, Proceedings
of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence, pages
283–290, San Francisco CA, 1996. Morgan Kaufmann.
11. D. Heckerman. A tutorial on learning with bayesian networks. In M.I. Jordan,
editor, Learning in Graphical Models, pages 301–354. Kluwer, Dordrecht, 1998.
12. D. Heckerman, D. Geiger, and D.M. Chickering. Learning bayesian networks: the
combination of knowledge and statistical data. Machine Learning, 20(3):197–243,
1995.
13. P. Larrañaga, C.M.H. Kuijpers, R.H. Murga, and Y. Yurramendi. Learning
bayesian network structures by searching for the best ordering with genetic algorithms.
IEEE Transactions on Systems, Man and Cybernetics, 26(4):487–493,
1996.
14. P. Larrañaga, M. Poza, Y. Yurramendi, R.H. Murga, and C.M. H. Kuijpers. Structure
learning of bayesian networks by genetic algorithms: A performance analysis
of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence,
10(9):912–926, 1996.
15. N.J. Radcliffe. Equivalence class analysis of genetic algorithms. Complex Systems,
5:183–205, 1991.
16. M.L. Wong, W. Lam, and K.S. Leung. Using evolutionary programming and minimum
description length principle for data mining of bayesian networks. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 21(2):174–178, 1999.
