﻿NC ms 2933
Universal Approximation Capability of
Cascade Correlation for Structures  
Barbara Hammer ¡
, Alessio Micheli ¢
, and Alessandro Sperduti £
July 21, 2004
Abstract
Cascade correlation (CC) constitutes a training method for neural
networks which determines the weights as well as the neural architec-
ture during training. Various extensions of CC to structured data have
been proposed: recurrent cascade correlation (RCC) for sequences, re-
cursive cascade correlation (RecCC) for tree structures with limited
¤ This work has been partially supported by MIUR grant 2002093941 004. We
would like to thank two anonymous referees for profound and valuable comments
on an earlier version of the manuscript.
¥ Department of Mathematics/Computer Science, University of Osnabrück, D-
49069 Osnabrück, Germany, e-mail: hammer@informatik.uni-osnabrueck.de
Italia
¦ Dipartimento di Informatica, Università di Pisa, Pisa, Italia
§ Dipartimento di Matematica Pura ed Applicata, Università di Padova, Padova,
1
fan-out, and contextual recursive cascade correlation (CRecCC) for
rooted directed positional acyclic graphs (DPAGs) with limited fan-in
and fan-out. We show that these models possess the universal approxi-
mation property in the following sense: given a probability measure  
on the input set, every measurable function from sequences into a real
vector space can be approximated by a sigmoidal RCC up to any de-
sired degree of accuracy up to inputs of arbitrary small probability. Ev-
ery measurable function from tree structures with limited fan-out into a
real vector space can be approximated by a sigmoidal RecCC with mul-
tiplicative neurons up to any desired degree of accuracy up to inputs of
arbitrary small probability. For sigmoidal CRecCC networks with mul-
tiplicative neurons, we show the universal approximation capability for
functions on an important subset of all DPAGs with limited fan-in and
fan-out for which a specific linear representation yields unique codes.
We give one sufficient structural condition for the latter property which
can easily be tested: the enumeration of ingoing and outgoing edges
should be compatible. This property can be fulfilled for every DPAG
with fan-in and fan-out two via reenumeration of children and parents,
and for larger fan-in and fan-out via an expansion of the fan-in and
fan-out and reenumeration of children and parents. In addition, the re-
sult can be generalized to the case of IO-isomorphic transductions of
structures. Thus, CRecCC networks constitute the first neural models
for which the universal approximation capability of functions involving
fairly general acyclic graph structures is proved.
2
1 Introduction
Pattern recognition methods such as neural networks or support vector machines
constitute powerful machine learning tools in a widespread area of applications.
Usually, they process data in the form of real vectors of fixed and finite dimension-
ality. Canonical extensions of these basic models to sequential data which occur in
language processing, bioinformatics, or time-series prediction, for example, are pro-
vided by recurrent networks or statistical counterparts (Bengio and Frasconi, 1996;
Kremer, 2001; Sun, 2001). Recently, an increasing interest in the possibility to deal
also with more complex non-standard data has been observed and several models
have been proposed within this topic: specifically designed kernels such as string
and tree kernels or kernels derived from statistical models constitute a canonical in-
terface for kernel methods for structures (Haussler, 1999; Jaakkola, Diekhans, and
Haussler, 2000; Leslie, Eskin, and Noble, 2002; Lodhi et al., 2000; Watkins, 1999).
Alternatively, recurrent neural models can be extended to complex dependencies
which occur in tree structures, graphs, or spatial data (Baldi et al., 1999; Fras-
coni, Gori, and Sperduti, 1998; Goller and Küchler, 1996; Hammer, 2000; Micheli,
Sona, and Sperduti, 2000; Pollastri et al., 2002; Sperduti, Majidi, and Starita, 1996;
Sperduti and Starita, 1997; Wakuya and Zurada, 2001). These structure processing
models have the advantage that complex data such as spatial data, tree structures, or
graphs can be directly tackled and thus a possibly time consuming and usually not
information preserving encoding of the structures by a finite number of real-valued
features can be avoided. Successful applications of these models have been reported
in different areas such as image and document processing, logic, natural language
processing, chemistry, DNA processing, homology detection, or protein structure
prediction (Baldi et al., 1999; Diligenti, Frasconi, and Gori, 2003; Goller, 1997;
3
Jaakkola, Diekhans, and Haussler, 2000; Leslie, Eskin, and Noble, 2002; Lodhi et
al., 2000; deMauro et al., 2003; Pollastri et al., 2002; Sturt et al., 2003; Vullo and
Frasconi, 2003). Here, we are interested in a variant of so-called recursive networks
which constitute a straightforward generalization of well known recurrent networks
to tree structures and for which excellent results also for large learning tasks have
been reported (Frasconi, 2002).
Training recurrent neural networks faces severe problems such as the problem
of long-term dependencies, and the design of efficient training algorithms is still
a challenging problem of ongoing research (Bengio, Simard, and Frasconi, 1994;
Hammer and Steil, 2002). Since the number of potentially different structures in-
creases exponentially with the size of the data, the space is often only sparsely
covered by a given training set. In addition, the generalization ability might be
fundamentally worse compared to simple pattern recognition tools because the ca-
pacity of the models (measured e.g. in terms of the VC dimension) depends on the
structures (Hammer, 2001). For recursive models for structures, the long-term de-
pendencies problem is often reduced because structural representations via trees or
graphs yield more compact representation of data. However, the other problems
remain. Thus the design of efficient training algorithms which yield sparse models
with good generalization ability is a key issue for recursive models for structures.
Cascade correlation (CC) has been proposed by (Fahlmann and Lebiere, 1990)
as a particularly efficient training algorithm for feedforward networks which simul-
taneously determines the architecture of the network and the parameters. Hidden
neurons are created and trained consecutively, so to maximize the correlation of
the hidden neuron with the current residual error. Thus, hidden neurons can be
used for error correction in consecutive steps, which often leads to very sparse so-
lutions with good generalization ability. CC proved to be particularly appropriate
4
for training problems where standard feedforward network training is difficult such
as the two spirals problem (Fahlmann and Lebiere, 1990). The main difference of
CC networks with respect to feedforward networks lies in the particularly efficient
constructive training method. Since any feedforward network structure can be em-
bedded into the structure of a CC network and vice versa, the principled represen-
tational capabilities of CC networks and standard feedforward networks coincide
(Hornik, Stinchcombe, and White, 1989).
Like standard feedforward networks, CC only deals with vectors of fixed di-
mensionality. Recurrent cascade correlation (RCC) has been proposed as a gener-
alization of CC to sequences of unlimited length (Fahlmann, 1991). Recently, more
powerful models for tree structures and acyclic directed graphs have also been pro-
posed and successfully applied as prediction models for chemical structures: recur-
sive cascade correlation and contextual recursive cascade correlation, respectively
(Bianucci et al., 2000; Micheli, 2003; Micheli, Sona, and Sperduti, 2003b; Micheli,
Sona, and Sperduti, 2002; Micheli, Sona, and Sperduti, 2000; Sperduti, Majidi, and
Starita, 1996). The efficient training algorithm of CC can be transferred to these
latter models and very good results have been reported. Here, we are interested in
the in-principle representational capabilities of these models for different type of
structures.
RCC extends simple CC by recurrent connections such that input sequences can
be processed step by step. Thereby, the iterative training scheme of hidden neurons
is preserved and thus training is very efficient. However, because of this itera-
tive training scheme, hidden neurons have to be independent of all hidden neurons
which are introduced later. This causes the fact that a RCC network, unlike RNNs,
is not fully recurrent. Recurrent connections of a hidden neuron are pointing to-
wards the neuron itself and towards hidden neurons which are introduced later, but
5
not towards previous hidden neurons. Because of this fact, RNNs can in general
not be embedded into RCC networks: the two models differ with respect to the
network architectures. RNNs constitute universal approximators (Funahashi and
Nakamura, 1993) and the question now occurs whether the restricted topology of
RCC networks, which makes the particularly efficient iterative training scheme pos-
sible, preserves this property. This is not clear because local recurrence instead of
full connectivity of neurons might severely limit the representational capabilities: it
has been shown by (Frasconi and Gori, 1996) that the number of functions which
can be implemented by a specific locally recurrent network architecture with the
Heaviside activation function can be limited regardless of the number of neurons
and thus restrictions apply. The work presented in (Giles et al., 1995) explicitely
investigates RCC architectures and shows that these networks equipped with the
Heaviside function or a monotonically increasing activation function cannot imple-
ment all finite automata. Thus, in the long term limit, RCC networks are strictly
weaker than fully connected recurrent networks.
However, the universal approximation capability of recurrent networks usually
refers to the possibility of approximating every given function on a finite time hori-
zon. This question has not yet been answered in the literature and it will be inves-
tigated in this article. We will show that RCC networks are universal approxima-
tors with respect to a finite time horizon, i.e. they can approximate every function
up to inputs of arbitrary small probability arbitrarily well. Hence RCC networks
and recurrent networks do not differ with respect to this notion of approximation
capability. Note that RCC networks, unlike RNN networks, are built and trained
iteratively, adding one hidden neuron at a time. Thus, the architecture is determined
automatically and only small parts of the network are adapted during a training
cycle.
6
As already mentioned, RCC networks have been generalized to recursive cas-
cade correlation (RecCC) networks for tree structured inputs (Bianucci et al., 2000;
Sperduti, Majidi, and Starita, 1996; Micheli, 2003). Recursive networks constitute
the analogous extension of recurrent networks to tree structures (Frasconi, Gori, and
Sperduti, 1998; Goller and Küchler, 1996; Sperduti and Starita, 1997). For the lat-
ter, the universal approximation capability has been established in (Hammer, 2000).
However, RecCC networks share the particularly efficient iterative training scheme
of RCC and CC networks and thus recurrence is restricted to recursive connections
of hidden neurons to neurons introduced later, but not in the opposite direction also
in RecCC networks. Thus it is not surprising that the same restrictions as for RCC
networks apply and some functions, which can be represented by fully recursive
neural networks, cannot be represented by RecCC networks in the long term limit
because of these restrictions (Sperduti, 1997). However, the practically relevant
representational capabilities if restricted to a finite horizon, i.e. input trees with re-
stricted maximum depth, is not yet answered in the literature. We will show in this
article that RecCC networks with multiplicative neurons possess the universal ap-
proximation property for finite time horizon and thus constitute valuable and fast
alternatives to fully connected recursive networks for practical applications.
Recurrent and recursive models put a causality assumption on data processing:
sequences are processed from one end of the sequences to the other; and trees are
processed from the leaves to the root. Thus, the state variables associated to a se-
quence entry depend on the left context, but not on the right one, and state variables
associated to internal nodes in a tree depend on their children, but not the other
way around. This causality assumption might not be justified by the data. For time
series, the causality assumption implies that events in the future depend on events
in the past, but not vice versa, which is reasonable. For spatial data such as sen-
7
tences or DNA strings, the same causality assumption implies that the value at an
interior position of the sequence does only depend on the left part of the sequence,
but not on the right one. This is obviously not true in general. A similar argument
can be stated for tree structures and more general graphs such as they occur e.g. in
logic or chemistry. Therefore extensions of recursive models for sequences and tree
structures have been proposed which also take additional contextual information
into account. The model proposed in (Baldi et al., 1999) trains two RNNs simul-
taneously to predict the secondary structure of proteins. The networks are thereby
transforming the given sequence in reverse directions such that the full informa-
tion is available at each position of the sequence. A similar approach is proposed
in (Pollastri et al., 2002) for grids, whereby four recursive networks are trained
simultaneously in order to integrate all available information for each vertex of a
regular two-dimensional lattice. Similar problems for sequences have been tackled
by a cascade correlation approach including context in (Micheli, Sona, and Sper-
duti, 2000). Another model with similar ideas has been proposed in (Wakuya and
Zurada, 2001). Note that several of these models simply construct disjoint recursive
networks according to different causality assumptions, and afterwards integrate the
full information in a feedforward network. However, all reported experiments show
that the accuracy of the models can significantly be improved by this integration of
context if spatial data or graph structures are dealt with.
The restricted recurrence of cascade correlation offers a particularly simple and
elegant and, as we will see, fundamentally different way to include contextual in-
formation: the hidden neurons are consecutively trained and, once they are trained,
frozen. Thus we can introduce recursive as well as contextual connections of these
neurons to neurons introduced later without getting cyclic (i.e. ill-defined) depen-
dencies. I.e., given a vertex in a tree or graph, the hidden neuron activation for
8
this vertex can have direct access to the state of its children and its parents for
each previous hidden neuron. Thereby, the effective iterative training scheme of CC
is preserved. Obviously, this possibility of context integration essentially depends
on the recurrence being restricted and it cannot be transferred to fully recursive
networks. This generalization of RecCC networks towards context integration has
recently been proposed in (Micheli, Sona, and Sperduti, 2003b; Micheli, Sona, and
Sperduti, 2002). These models are applied to prediction tasks for chemical struc-
tures, and they compare favorably to models without context integration. Plots of
the principal components of the hidden neurons’ states of the contextual models in-
dicate that the additional contextual information is effectively used by the network.
Unlike the above mentioned approaches, contextual information is here directly in-
tegrated into the recursive part of the network such that the information can be used
in an earlier stage than e.g. in (Baldi et al., 1999; Pollastri et al., 2002). Thanks to
these characteristics, in the cascade approach, the processing of contextual informa-
tion is efficiently extended, with respect to the previous approches, from sequence
or grids to more complex structured data such as trees and acyclic graphs.
The question now occurs whether the “in principle” extension to contextual pro-
cessing and the improved accuracy in experiments can be accompanied by more
general theoretical results on the universal approximation capibility of the model.
A first study of the enlarged capacity of the model can be found in (Micheli, Sona,
and Sperduti, 2003b; Micheli, 2003), where it is shown that, in a directed acyclic
graph, contextual RecCC (CRecCC) takes into account the context of each vertex,
including both successors and predecessors of the vertex (as formally proven in
(Micheli, Sona, and Sperduti, 2003a)). Moreover, the integration of context extends
the class of functions which can be approximated in the sense that more general
structures can be distinguished: contextual models can differentiate between ver-
9
tices with different parents but the same content and children. Finally, contextual
RecCC can implement classes of contextual functions where a desired output for a
given vertex may depend on the predecessors of the vertex, up to the whole structure
(contextual transductions) (Micheli, Sona, and Sperduti, 2003b; Micheli, 2003). Of
course, contextual RecCC networks can still implement all the functions which are
computable by RCC models (Micheli, Sona, and Sperduti, 2003b; Micheli, 2003).
We will show in this paper that the proposed notion of context yields univer-
sal approximators for rooted directed acyclic graphs with appropriate positioning
of children and parents. The argumentation can be transferred to the more general
case of IO-isomorphic transduction of such structures. Hence context integration
enlarges the class of functions which can be approximated to acyclic graph struc-
tures.
We will now first formally introduce the various network models and clarify
the notation. We then consider the universal approximation capability of cascade
correlation networks in detail. We start with the simplest case, the universal approx-
imation property of standard RCC networks. Thereby, only two parts of the proof
are specific for the case of sequences, the other parts can be transferred to more
general cases. We then prove the universal approximation capability of RecCCs,
substuting the specific parts of the previous proof, and finally investigate CRecCCs.
The latter case is particularly interesting, since CRecCCs do not constitute univer-
sal approximators for all structures, but only a subset. We provide one example for
graph structures which cannot be differentiated by these models. We then define a
linear encoding of acyclic graph structures and discuss one sufficient property which
can easily be tested and which ensures that this encoding is unique. This property
requires that positioning of children and parents can be done in a compatible way.
We show that this fact can always be achieved via reenumeration and expansion by
10
empty vertices for directed positional acyclic graphs with limited fan-in and fan-out.
Structures which fulfill this property are referred to as directed acyclic bipositional
graphs (DBAGs). Afterwards, we show the universal approximation for contextual
RecCC networks on DBAGs. This result can immediately be extended to general
graph structures for which the proposed encoding is unique. An informal discussion
of this property is given. We conclude with a discussion.
2 Cascade correlation models for structures
Standard feedforward neural networks (FNN) consist of neurons connected in an
acyclic directed graph. Neuron computes the function
of the outputs¡�of its predecessors,
¡£¢¥¤§¦¨¢©�����¢��¢�¡�����¢��
where�¢�����
bias, constitutes � and¦£¢�����
the weights,�¢��� is the
are
the activation function of neuron . The input
neurons, i.e. neurons without predecessors, are directly set to external values, the
output of the network can be found at specified output neurons. All other neurons
are called hidden neurons. Often, neurons are arranged in layers with full connec-
tivity of the neurons of one layer to the consecutive one. Activation functions¦¥¢
which are of interest in this article include the logistic function
the Heaviside function
� ����������¡�� �
sgd�¡���¤
otherwise � �
��
� if¡���� � � �
H�¡���¤
and the identity id�¡���¤�¡. More generally, a function� squashing activation
is a monotonic function such that values�������£�exist with���������� ���¡���¤
�§��� �
11
��. A
of ¢ ¤ ���¡�� �£�,��������¡ ��
network which uses activation functions from a set
functions is referred to as an¢ -FNN. It is well known that FNNs with only
one hidden layer with a squashing activation function and linear outputs fulfill a
universal approximation property (Hornik, Stinchcombe, and White, 1989). The
universal approximation property holds for FNNs, if for a given probability measure
on�£, any ��£���¤, and   �we can find a FNN
which computes a function©such that
Borel-measurable� any¥�¦¨§
� �£�������¡����©¥�¡���§�¦���¥��
There exist alternative characterizations of the hidden neurons activation function
 �¡
for which the universal approximation property holds, e.g. being analytic (Scarselli
and Tsoi, 1998).
Often, more powerful neurons computing products of inputs are considered
(Alquezar and Sanfeliu, 1995; Forcada and Carrasco, 1995; Hornik, Stinchcombe,
and White, 1989). Let us denote the indices of predecessors of neuron by ��������� �.
A multiplicative neuron degree��
of �computes a function of the form
¦¨¢����
����� � ¡£¢¤
to the weights and�¢is the bias as above. We will ex-
refers
plicitely indicate if multiplicative neurons are used. Otherwise, the term ‘neuron’
��� where�¢����������
refers to standard neurons.
�¢���
�����������������¢����������¢�����¢����������¡����������¡���
Standard neural network training first chooses the network architecture and then
fits the parameters on the given training set. In contrast, cascade correlation (CC)
determines the architecture and the parameters simultaneosly. CC starts with a min-
imum architecture without hidden neurons, and iteratively adds hidden neurons un-
less the approximation accuracy is satisfiable. Each hidden neuron is connected to
12
...
x 2
x 1
Figure 1: Left: An example of a cascade correlation network with cascaded hidden
neurons¡£¢added iteratively to the network. Right: A recursive cascade correlation
network where recursive connections, indicated by thick lines marked with   , are
used to store contextual information from the first part of a processed sequence.
Thereby, the activation of the neuron of the previous time step is propagated through
recurrent connections. Note that recurrent connections point from¡�to¡¢¡ but not
vice versa.
all inputs and all previous hidden neurons. The weights associated to these connec-
tions are trained in such a way to maximize the correlation between their oputput
and the residual error of the network constructed so far. Afterwards, the weights
entering the hidden neuron are frozen. All weights to the outputs are re-trained ac-
cording to the given task. If needed, new hidden neurons are iteratively added in the
same way. One CC network with two hidden neurons is depicted in Fig. 1. Since
only one neuron is trained at each stage, this algorithm is very fast. Moreover, it
usually provides excellent classification results with a small number of hidden neu-
rons which, because of the specific training method (Fahlmann and Lebiere, 1990),
serve as error correcting units. This training scheme is also used for all further cas-
cade correlation models which are introduced in this article. Since we are here not
interested in the specific training method but in the in-principle representation and
13
r
...
r
r
approximation capabilities of the architectures, we will not explain the training pro-
cedure in more details. The only point of interest in this paper is that the iterative
training method assumes that hidden neurons are functionally dependent on pre-
vious hidden neurons, while they are functionally independent of hidden neurons
introduced later, because of the iterative training scheme.
2.1 Recurrent models
Obviously, FNNs can only deal with input vectors of finite and fixed size. Recurrent
networks (RNNs) constitute one alternative approach which allows the processing
of sequences of unlimited length. Denote by   a set where the labels are taken from,
e.g.   ¤
continuous labels or   ¤¢¡�������� ¤£
for
sequence of length � over
�£
¥����������§¦©¨  
discrete labels. Assume a
for
is given. Denote the set of finite sequences
over   by  �� . A recurrent network recursively processes a given sequence from the
left to the right. Assume that the output of neuron has to be computed for a given
sequence entry��� at position � of the sequence. Assume � neurons are present. We
assume for simplicity that the immediate predecessors of neuron are the neurons
�. from�to �� 1 The functional dependence of neuron can then be written as
(1)
 
���������¡������
  ���§���¡�����
1It can always be achieved via reenumeration that the predecessors are a subset ��� ¡£¢������¤��¢�����¡�������������¡£¢ 
of this set of neurons. Thus, setting some weights to�, we can represent every FNN
in this form.
14
where ��¢is a function computed by a single neuron, 2 and for the initial case �we
set¡£¢��¡ �, interpreted as the output for the empty sequence, to some specified value,
e.g.�. The computed values depend on the values of all neurons in the previous
�¤
time step and thus, indirectly, on the left part of the sequence. The final output of
the sequence is given by the values¡¢��§¦�obtained after processing the last entry.
Recurrent cascade correlation (RCC) also refers to the activations of neurons in
the previous time step. We assume, that the neurons are enumerated in the order in
which they are created during training, i.e. neuron is added to the RCC network in
the th construction step. Because of the iterative training scheme, hidden neuron
 is functionally dependent only on neurons introduced in previous steps. I.e., the
functional dependence now reads as
�¢�����¡�������������¡£¢  ¡£¢������¤
���������¡£¢���� �������¡�����
where
���
�¢is the function computed by a single neuron. Again,¡¥¢��¡ �is set to some
specified initial value. Note that, unlike RNNs, RCC networks are not fully con-
nected. Recurrent connections which start at neurons ¢ §
 to
because all weights to neuron are, once trained, frozen. One example network
with two hidden neurons is depicted in Fig. 1.
 
 
neuron are dropped,
It is well known that fully recurrent neural networks constitute universal approx-
imators in the sense that, given appropriate activation functions of the neurons, any
2 Here we assume only dependences from the previous time step. Dependences
from time steps �could § ���� � 
be considered and added to eq. (1), since the
definition would still be well defined. Since more information is then available at
one time step, benefits for concrete learning problems could result. Obviously, the
universal approximation capability of eq. (1) immediately transfers to the more gen-
eral definition. Thus, for simplicity, we restrict ourselves to eq. (1) in the following,
and we restrict in a similar way for the case of input trees or structures.
15
measurable function from sequences into a real vector space can be approximated
arbitrarily well, i.e. the distance of the desired output from the network output is at
most¦for inputs of probability at least�� ¥(Hammer, 2000). We will later show
that an analogous result holds also for RCC networks with restricted recurrence.
2.2 Recursive models
Both, RNNs and RCC networks have been generalized to tree structures in the lit-
erature. A tree over  
label���¢¡��  
consists either of an empty structure   � , or of a vertex ¡ with
and a finite number of subtrees some of which may be empty. We
refer to the th children of ¡ by ch¢�£¡�. A tree has limited fan-out ¤ if each vertex
has at most ¤ children or subtrees. In this case we can always assume that exactly ¤
children are present for any nonempty vertex, introducing empty vertices if neces-
sary. The set of finite trees with fan-out ¤ over  
is denoted by
�
�.For
 
simplicity,
we only deal with fan-out ¥ in the following. The generalization to larger fan-out
will be immediate. Fan-out�gives sequences. Since trees form a recursive struc-
ture, recursive processing of each vertex in the tree starting from the leaves up to
the root is possible. This leads to the notion of recursive networks (RecNNs) as
introduced e.g. in (Frasconi, Gori, and Sperduti, 1998; Goller and Küchler, 1996;
Sperduti and Starita, 1997) and recursive cascade correlation (RecCC) as proposed
in (Bianucci et al., 2000; Sperduti, Majidi, and Starita, 1996). As beforehand, for
RecNN we assume direct dependence of neuron from neurons�  in the network,
and we assume the network � has neurons. Given a ¡ vertex in the neuron 
tree,
then computes a function of the form
��¢����¢¡��¡��¢¡��������¡£¢  ¡£¢�£¡��¤
¡��ch��£¡����������¡���ch��£¡����¡��ch ¡�¢¡���������¡���ch ¡�£¡��� ��£¡��
16
where �¢is given by a (possibly multiplicative) neuron. Thus, the activation of ¡
depends on the recursively computed activations of all neurons of the network for
the children of ¡ . The output of a tree is taken as the output for the root vertex of
the tree.
As for RCC networks, the functional dependence of hidden neuron of a RecCC
network for an input vertex ¡ has to be restricted to take the iterative training scheme
into account. It becomes ��¢����£¡��¡��£¡��������¡£¢  ¡£¢�¢¡��¤
¡��ch��£¡���������¡£¢�ch��£¡����¡��ch ¡�¢¡���������¡£¢�ch where
��£¡��
¡�£¡����
�¢can be computed by some (possibly multiplicative) neuron. This formula
can be alternatively written as
¡£¢�¢¡��¤��¢����¢¡��¡���¢¡��������¡£¢ 
whereby we define the operator  
��¢¡�� 
�¡£¢�ch��¢¡���¡£¢�ch
��¡£¢�¢¡����
��¡£¢�£¡����¤ ¡�£¡���. This
 
��¡��¢¡���������� 
pact notation shows how the dependencies can be transferred to larger fan-out ¤ :
 
 
the operator is then substituted by  
 
com-
�¡£¢�ch��£¡����������¡£¢�ch��£¡����.
It has been shown in (Hammer, 2000) that fully recursive networks constitute
��¡£¢�¢¡���¤
universal approximators on tree structures. We will show in this article an analogous
result for RecCC networks with multiplicative neurons.
2.3 Contextual models
The restricted recurrence of RecCC networks allows us to introduce further func-
tional dependencies without getting cycles which might be particularly valuable for
more general structures. We are interested in so-called directed positional acyclic
graphs, which constitute a generalization of tree structures in the sense that vertices
might have more than one parent. Formally, we define DPAGs as specific graph
17
structures: a labeled directed acyclic graph (DAG) over  
consists of a finite set
vert�¡ �of vertices ¡ with labels���¢¡��   assigned to each ¡ and a finite set of edges
edge�¡ �£¢
For ¡ each �
vert�¡ �,such that no cycles can be observed in the
vert�¡ ���¦�¡��
graph.
vert�¡ �¥¤
vert�¡ �its parents are given parent�£¡�¤ ¡§¦
by �
vert�¡ ���£¡�¦���
and the children children�¢¡��¤ ¡§¦
are edge�¡ �£.
edge�¡ �£
As usual, the fan-in and fan-out of vertices are given by fan
�
�parent�£¡��
and out�£¡� �children�¢¡��. fan A rooted directed positional acyclic graph over
¤ in�¢¡�¤
 
(DPAG) consists of a labeled directed acyclic graph   such that all vertices
in the graph can be reached from one specific vertex, the root root�¡ �. In addi-
tion, an ordering of the parents and children is specified, i.e. two
 ©¨�parent�¢¡��
injective functions
¡��������fan and ¡��������fan out�¢¡�£,
��¨�children�¢¡��
in�£¡�£
respectively, are fixed for each ¡ which specify the order in which the children and
parents are enumerated when ¡ considering .
vert�¡ �we
¡ �
¡
For each define its height height�¢¡�as the length of a longest
path from the root to plus one, and the height of the DPAG,��� ©���¡ �as the
maximum height of the contained vertices. I.e. the root has height�. Note that
a tree is just a specific form of DPAGs where each vertex (except for the root)
possesses exactly one parent.
Contextual recursive cascade correlation (CRecCC) (Micheli, Sona, and Sper-
duti, 2003b; Micheli, Sona, and Sperduti, 2002; Micheli, 2003) can also take into
account information about parents. Assume DPAGs with limited fan-in at most
and fan-out at most ¤ ¡
are given. In the following, we restrict ourselves to DPAGs ¤�
with fan-in and fan-out two. The generalization to larger fan-in and fan-out is im-
mediate unless stated otherwise. We denote the set of DPAGs over   with limited
¡ �
¥ fan-in and fan-out by DPAG
. As beforehand, for a vertex ¡ in a DPAG, ch��¢¡�
and ch ¡�¢¡�refer to the first and second child of vertex ¡ , respectively. pa��£¡�and
18
pa ¡�¢¡�refer to the two parents of the vertex. An empty child is denoted by  ¡  .  £¢
refers to an empty parent. The empty DPAG is also denoted by  ¡  . In a CRecCC
network, each hidden neuron has also access to the activation of previous hidden
neurons for all parents of the vertex. Formally, given a vertex ¡ of a DPAG, the
functional dependencies of hidden neuron number , can be written as
¤ ��¢����£¡��¡��¢¡��������¡£¢  ¡£¢�¢¡�
¡��ch��¢¡����������¡£¢�ch��¢¡���¡��ch ¡�£¡����������¡£¢�ch ��pa��¢¡����¡��pa
��£¡��
¡�¢¡���
¡�¢¡���������¡£¢ 
���pa ¡�£¡����
for some fixed �¢  ¢ and�¢  in�, and ¤ values�¢
�¢
constitutes the function computed by the respective (possibly multiplicative) hidden
¡��pa��¢¡���������¡£¢ 
¢ ,¡£¢�£ ¤ � �¢
where¡¢�£ £¢���¤
neuron. For ¤ �, this reduces to
������£¡��¡��ch��£¡���¡��ch ¡��£¡��¤ ¡�¢¡�����
As beforehand, the output of the root vertex is considered output of a given DPAG.
Defining the operator  ¦¥��¡£¢�£¡���¤
functional dependence as     ¡£¢�£¡�¤��¢����¢¡��¡��¢¡��������¡£¢ 
   
�¡£¢�pa��£¡��¡£¢�¨§�©
��¡£¢�£¡����  ¥��¡��¢¡���������  ¥��¡£¢  ��¢¡��
¡�£¡����allows us to recast the
��¡��£¡���������� 
The general functional dependence for larger fan-in and fan-out is thus obtained by ���¢¡����
extending   �and  �¥�to more children and parents, respectively. neuron 
Thus
also depends on the activation of parents and, indirectly, its parents children, via the
previous hidden neuron. This is possible without getting cyclic definitions because
of the restricted recurrence of RecCC networks. In particular, the strict causality
assumption of vertices depending only on successors is here no longer valid and
increasing context is taken into account for large ,eventually involving the whole
structure in the context.
19
The above recurrent dynamic is well defined due to two features: on one hand,
the DPAGs are acyclic and rooted, such that, as in any tree structure, induction over
the structure is possible. Starting from the vertices without further successors we
can continue to process data up to the root. We refer to this principle as structural
induction. On the other hand, the network architecture is inductively constructed
whereby hidden neurons constructed in earlier steps are functionally independent
of neurons introduced later. Hence induction over the hidden neurons is possible,
whereby we can assume that the output of all previously constructed hidden neurons
for all vertices in a DPAG are available. Since RecNNs might be fully connected,
structural induction cannot be applied for general RecNNs: each hidden neuron is
dependent on the output of every other hidden neuron.
Note that we can substitute a CRecCC network for each given fixed DPAG struc-
ture to be processed by an equivalent feedforward network: we take copies of all
neurons for all vertices in the DPAG and unfold the recurrent connections accord-
ing to the given structure over this set. This general principle, unfolding a recursive
network according to a given fixed input structure, can obviously be done for all
network models defined so far. It is based on the principle that no cycles are given
in the dynamics.
We would like to add a last remark to the definition of cascade architectures: in
practice, a set of output neurons is usually introduced and all hidden neurons are
connected to the output neurons. The output connections are trained directly on the
given training set. For simplicity, we will drop this output layer in the following,
since we are not interested in training. We assume that the last hidden neuron of
the cascade correlation network serves as output neuron, to simplify notation. We
will, for simplicity, only deal with the case of functions to� instead of outputs with
higher dimensionality. The generalization of our results to outputs in�¤ will be
20
immediate.
3 Universal approximation ability
Having introduced the cascade correlation models, we will now consider the univer-
sal approximation capability of these models with respect to sequences, tree struc-
tures, and DPAGs, respectively. We investigate the case of sequences first and point
out the structure of this first proof of approximation completeness. It will turn out
that we only have to substitute two steps which are specific for sequences. A gener-
alization to tree structures will be quite simple. The case of graphs, however, will be
much more involved, because fundamentally new results are implied in this part: the
investigation will include an analysis of the possibility to represent acyclic graphs
by linear representations. It will, as a by-product, show that contextual networks
can also approximate so-called IO-isomorphic transductions (as defined below), i.e.
they can deal with the case that not only the inputs, but also the outputs are struc-
tured.
We first want to clarify which type of functions we are interested in. Input sets
are sequences, trees, or DPAGs of arbitrary but finite size. We assume that the
respective labels are contained in a subset  
, the discrete¦-algebra over  
 ¦-algebra
over  
of a real-vector space. For discrete
is considered, otherwise we deal with the Borel
. This algebra is expanded to structures by interpreting structures
as the direct sum of objects of one fixed structure but possibly varying labels, e.g.
sequences decompose into the sum of sequences of a fixed length � . Thus the set
of structures is equipped with the sum of the Borel¦-algebras of objects with a
fixed structure. Note that the choice of the sum-algebra has consequences on the
semantic: for every fixed probability measure   we can find a size   such that
21
most information lies on the structures of size at most   .   depends on the chosen
probability measure and might vary with   . A formal argument for this fact will be
given in the proof of Theorem 3.2. We do not consider infinite length sequences or
infinite size structures in this article.
A given (measurable) function with structural inputs shall be approximated by
a cascaded network. We assume, that the weights of the network are fixed when
processing data, i.e. stationarity of the function �¢computed by neuron with respect
to the structure holds.
There are different possibilities to use recursive models: so-called supersource
transduction refers to the task to map an entire structure to a given value, e.g. map
a chemical molecule to its activity. Thus only the value obtained after having pro-
cessed the whole structure is of interest. We will mainly focus on functions of this
type, whereby we assume that the outputs are element of the real numbers. I.e.
functions to be approximated are measurable functions form�
�,
of the � �¡ 
  denoting the considered structures, i.e. sequences, graphs, or DPAGs. Recursive
models can alternatively be used for IO-isomorphic transduction, which are func-
�¢  �
but possibly different labels. The output��¤£�is obtained substituting each vertex
tions�
 
which map a given structure to a structure of the same form
label���£¡�of the input structure £ by the output value¡¢�¢¡�obtained for this ver-
tex in the recursive model. This setting describes problems such as mapping the
sequence of proteins to its secondary structure, where the form of the secondary
structure of the protein at each aminoacid position has to be predicted. We will
shortly discuss this situation and it will turn out that our results on the approxi-
mation capability of CRecCC networks for supersource transductions will imply
results for IO-isomorphic transductions.
22
3.1 Recurrent CC
We start with the simplest case, recurrent networks for sequence processing. The
key ideas of the proof borrow steps from the universal approximation proof of recur-
sive neural networks as given in (Hammer, 2000). For convenience, we first outline
the general steps of the proof some of which can be reused for recursive cascade
correlation and contextual recursive cascade correlation. The key points are:
(I) Find a linear code for the given structures over discrete labels up to a finite
height; a linear code can be represented as a real number whereby the number
of digits is not priorly limited. This code serves as a candidate to act as an
internal distributed representation of the structure within a neural network.
(II) Show that this code can be computed with the considered network structure;
as a consequence of such a construction, it is clear that the network can turn
a structured input to an internal distributed code suitable for processing with
a standard feedfoward network.
(III) Integrate a universal FNN; a universal FNN can map the encoded structures
to the desired outputs. As a result of these three steps, it is proved that the
considered network is capable of mapping every finite number of discrete
labeled structures to arbitrary output values.
(IV) Integrate discretization of real-valued labels; this allows us to turn structures
with real-valued labels to similar discrete structures which we can process.
As a consequence of this step, we can approximate functions on structures
with real-valued labels by first turning them internally to discrete labels.
(V) Approximate non-standard activation functions; the above construction steps
use specific activation functions (exact identity, Heaviside function, etc.) to
23
make sure that the desired outputs are precisely met. These have to be sub-
stituted by the considered activation functions such as the logistic transfer
function.
Only step (I) and (II) will be specific for the considered network and input structure.
The other steps do not specifically refer to the model but can be similarly done in
the other cases. We will use this fact later to generalize universal approximation for
RCC networks to RecCC and CRecCC networks.
Now we present the construction steps for RCC networks for sequences. The
first step is trivial, since sequences are naturally represented in a linear way. As-
sume  
is a finite alphabet. A sequence over   with elements�¢�  
is given by
¨ , a linear code. This is essentially step (I). This linear representation can
be embedded in the real numbers by concatenating the entries of the sequence. More
¥������������
precisely, assume   ¤
by at most length
¡�������� ¤£
. Then the representation�����������where the�¢are expanded
¢
¢¡  and assume the numbers can be represented
by leading entries�to exactly ¢ places is a unique representation of the sequence
¨ . This observation is obvious because each entry�¢occuppies exactly the
same number of digits within this representation.
¥������������
The next observation of the construction is that this code can be “computed” by
a recursive cascade correlation network. More precisely, we can construct a RCC
network with only one neuron and output�����������(�¢expanded by leading entries
¢ �to digits) for an input sequence ¨ . We define the output of the empty
sequence by�. The functional dependence of neuron¡�of a RCC network is
¥������������  
����§��¡���§�
Assume as a structural assumption that¡������ ��already represents�����
���� ¡�������¤
 
����������§�
24
����������¤ ���£ ����� ���£ ������
 
 
 
��������¤§���§����
��������. Then
 
���������
That means the function
���©��©
gives the desired functionality. Note that
�©�� ��can be computed by a standard RCC
¡��¤ ���£
���£
�©
neuron with linear activation function. Hence step (II) is very easy for the sequence
setting as well; we have found a linear RCC network with only one neuron which
encodes all sequences with a finite input alphabet and finite height�(in this case
also infinite�) uniquely within a real number. Steps (III) to (V) now mainly consist
of general observations which are not specific for RCC networks.
Step (III) is the integration of feedforward parts into a RCC network with-
out violating the function� structure. For any given we denote by �
the extension to sequences which maps each entry of a given
��¤�� �¡ 
�¤ ���£��� ��£
sequence over�£ to the image of the entry under�in�¤ thus giving a sequence
over�¤. Later, we will use the same notation also for more general structures, trees
and DPAGs, to refer to the map which substitutes each entry of a structure by its
image under ¢
. So far, we have defined cascade correlation networks to compute
functions with output domain�. We to�¤
can £ extend for �by interpreting
the output of the £ last ¤, . . .¡�¤§¦©¨ as the output of the network.
§
� neurons¡¥¤§¦©¨ 
� are
Lemma 3.1
�£ computed by FNNs.
is computed by a
�
RCC network. Then the composition
Assume� and© �¤ ���£��� ��¤
can
Assume� � ��¢
also be computed by a RCC network.
©������¡ 
Proof. This statement is easy because we can interpret all neurons
���¢���
and©
as additional hidden neurons of the RCC network, setting connections which are
of�
not used in these subnetworks but which are introduced due to the definition of
RCC architectures to�. More precisely, denote in� the neurons by  �,. .  ¢�� . , ,
¢
whereby connections  �only exist . Denote the neurons in© by
if �  ¢�
25
¡
, again with connections existing only to neurons with larger indices.
��������£¢¡  £
Denote the neurons of the RCC network by ¢��������¢¢¤£ arranged
in the order of their
appearance. Then we can choose the weights of a RCC network such that hidden
neuron number takes the role of
�����
�
���
¡£¢¤
��
if ¥
¢¢ ¢� if ���  ¢
 ��  �
£¢ ¢�� ¢¤£ otherwise
This is possible because the functional dependence of the RCC hidden neurons
ensure direct access to the respective predecessors in the single networks: each
 ¥
 §¦
hidden neuron has access to the activation of all previous neurons. All recursive
connections in the RCC network for  ¢and £¢are set to�. The same holds for all
additional feedforward as well as recursive connections not present in the original
part  ¢, ¢¢, and £¢. Note that recurrence in the first part corresponding to� has
the effect that�  is computed on input sequences. Recurrence in© has no effect,
because the output of the last sequence entry is considered output of the entire
computation. ¨
Note that an analogous result holds for RecCC networks and CRecCC networks.
It is not necessary to change the proof if�and©�����  are computed by RecCC
networks or CRecCC networks.
We obtain as an immediate result for the sequence case:
Theorem 3.2 Assume   is a finite set. Assume is some proba-
¡�������� ¤£  
bility measure on sequences over ¤   is a function. Assume � � ��
Assume�
Then there exists a RCC network with activation functions in ¡ id�¦£
�.
computes a function©such that
§ ¥
 �¡ �  
�����¡���© ©¥�¡���� ¥� ¤
26
which
¦is thereby some activation function for which the universal approximation prop-
erty of FNNs is fulfilled.
Proof.
�  
decomposes into sets of sequences of different length. We denote these
�
by� 
  ¤ ��¢for  sets �,. . . , and an arbitrary enumeration. 3 measurable,�¤
Since� 
��¤¢¡ ¢�� ��   � 
��¢is
��¢��thus we find ¡
¥¦¥¨§
  ��¢���   length�
for
some . Therefore we can find some (the maximum length in the sets
� 
  ) such that larger sequences have probability smaller than¥. For
sequences of length at most�over ��¢for ��  
, we can find a RCC network with the identity
 
as activation function unit which maps these inputs to unique values, as shown
above. Because FNNs with activation function¦ and linear outputs possess the
universal approximation property, we can find a FNN which maps these encodings
exactly to the outputs given by� (Sontag, 1992). The composition of these two
networks can be interpreted as a RCC network because of Lemma 3.1 with the
desired properties. ¨
Note that we have restricted ourselves to outputs in�, for convenience. It is ob-
vious from the above construction that the same result can be obtained for functions
� 
if
�¤
remark holds for all universal approximation results which we present in this paper
�� �
¢�£¤£  ��  �
we integrate several output neurons for RCC networks. The same
thus giving us approximation results for output set�¤ instead of�.
We can extend the above result to sequences with real labels, introducing a first
part which maps real-valued labels to a discrete set.
3 We could simply choose�  ��¢as the set of sequences of length .However, we
would like to achieve compatibility with the more general case of tree structures
or DPAGs, such that we use an arbitrary enumeration of the subsets with fixed
structure.
27
Theorem 3.3 Assume   ¤
Assume   is some probability measure on se-
�£.
quences over  
. is a Borel-measurable function. Assume
� �
�. Then there
�
exists
�
a RCC network with activation functions in
§ �,¦ § Assume� ¥
¡ id�¦�H£
which
computes a function©such that �    �¡
�������¡����©¥�¡���§
¦is thereby some activation function for which the universal approximation prop-
¥�� ¦��
erty of FNNs is fulfilled.
Proof. We can fix some length�such that longer sequences have probability at
that sequences with one entry out-
most¥¥¨§ � §
than¥¥¨§
. We can find a constant
�such
side ¥�����£have probability smaller . 4 For sequences of length at most
labels in ¥�¥����£, we can approximate� by a continuous function such
and
that the probability of values with distance larger than¦¥¥ is at most¥¦¥¨§ . � 5 Hence
substituting¥by¥¦¥¨§ and¦ by¦¥¥, it is sufficient to prove the stated theorem for
continuous functions�on sequences of limited length at most�with limited labels
in ¥�����£.
Note that such function�is uniformly continuous, being a function on a com-
find¥  § pact interval. Hence we can
�with the following property6 ¡
Assume�¨
: given any
of the same structure, i.e. the same length. is an
two sequences   and  
4 This is the same argument as above because the set of sequences decomposes
into subsets of sequences with entries in a fixed interval with boundaries given by
natural numbers.
5 This is a well-known fact about functions over the real numbers, and it follows
e.g. from the approximation results in (Hornik, Stinchcombe, and White, 1989).
6 We use a notation which could refer to sequences as well as trees or DPAGs for
convenience. An identical prove shows an analogous result for these more general
cases.
28
entry of   and�¨¡  is an entry of  
¡
at the same position. Assume that for every such
inequality���¨�¢����¨¡ �¢�� ¥  ��������  positions ¤
���  ¡��� Then���� ��
pair of entries the holds for all ,
��¨�¢and��¨¡ �¢, respectively, denoting component number of the respective entry.
¦. Next we choose points ¢
¨
¢ ¥���� in
such that ¢
, ¢
, ¥  and�¢�� for all ��� ¢�� 
¢ ��������§ ¢�  ¤ . Note that
varying the components of the entries of a sequence within some ¥
interval
does not change the output with respect to�by more than¦. Therefore we can find ��¢��
 ¤ �¥�
¢¤ �
 � ¢�� ������¢
a function��which differs from�by at most¦for all inputs and which maps all
sequences of the same structure where the components of the sequence elements
¢� 
are contained in the same ¥
interval ��¢��to a constant value.
We will now show that this function��can be computed by a RCC network,
which hence approximates�as stated in the theorem. The characteristic function
¢�� 
¥ of
function
��¢��can be computed as�¤£ ¥�¡¦��¥�¡§�¡��¤©¨
��¥�¥����£� ¡���������§�
¢��. ¨ The �¡ �
¢ ������£
¢� 
£
�¡ � ��� �
��¢ 
¥�¡¦��¥��§�¨§�
�£��¡��������¡£���
uniquely encodes the intervals in which the components of the input lie.�thereby
uses an encoding to base
��£�
�of the § intervals in   dimensions. Since��is piece-
� �¢��
§�
that��¤   ��   � wise constant, we find a function�¡
such �¡
where, as beforehand,
denotes the application of�to function�¡
each element in a sequence. This just
maps a code to the output of some fixed sequence with enries in the encoded inter-
vals as indicated by the input. Because of Theorem 3.2, we can find a RCC network
computes�¡
©�which . (Since we are dealing with only a finite length at this point,
the approximation can be done for all input values.) Hence©���   equals with��.
Note that�can be computed by a feedforward network with activation function in
H�id£ . Because of Lemma 3.1,©���   can be computed by a RCC network with
¡
activation function in ¡ . H�id�¦£
¨
29
Usually, a sigmoidal activation function is used as activation function of the
RCC network for all neurons but the last one, which is linear to account for the
fact of a priorly unbounded output set. So we are interested in the possibility to
substitute the above specific activation functions by a sigmoidal function¦. Note
which �
that for function¦
any is continuously differentiable in the vicinity
of some point¡  with nonvanishing derivative¦¡�¡ �the following holds:
����
¥�¡��� �
This convergence is uniform on compacta. If the above property holds
¦�¡ �
we
�� �¥� ¡ ¥�¦¡�¡ � ¦�¡ �
for¦
refer to¦as being locally � �. Moreover, for a squashing function¦with limits�¡ 
and�¢  we find
all¡ ¤
on���  ¥¦�  ©
any¦ ¤ § for
�.The convergence is uniform
�for
�.
These two properties are in particular fulfilled for the standard logistic activation
¨ �¥� ���
��¦¨
�¦�¡¥¥����£ �¥¨��¢ ���£ ��
function. The following theorem allows us to substitute the above specific activation
functions by an activation function �£.
as used in practice.
Assume is some probability measure on se-
 
is �
Theorem 3.4 Assume   ¤
quences over   � �
. �
�. § �,¦ § Assume¦
is ���� � Assume� ¥
�¡��
a Borel-measurable function. Assume
a continuous and locally � �squashing
function. Then there exists a RCC network with activation functions¦ and linear
activation function for the last neuron of the network which computes function©
a
such that �    �¡
�������¡����©¥�¡���§
Proof. Note that a squashing activation function leads to the universal approxima-
¥�� ¦��
tion property of FNNs (Hornik, Stinchcombe, and White, 1989), such that we can
30
construct as in Theorem 3.3 a RCC network which contains the activation functions
id and H in addition to¦. We now show that the activation functions id and H can
be substituted in the network by the above approximations for appropriate¥such
that the difference of the outputs is small for inputs of high probability. Note that
we argued in the proof of Theorem 3.3 that we can restrict to finite length�and la-
bels in ¥�����£. Moreover, we can choose the interval borders ¢�which have been
used to discretize the inputs in such a way that the probability of sequences which
¦ �is contain a label with a component smaller than¥¥ ¥ for some¦  .
Denote by in�¢�� ¦ �¢��   ¡¤ ¦ ���£. ¤
¨¡ 
Since the involved functions are
continuous, the set of activations¡¢��¨�for all elements�¨ of sequences of length at
¦ �¢��� ¢�� �¢��
  ¡
most�with labels in is compact, say it ¥�� ��   ¨ is contained in .
We formally substitute each activation function id (except for the output neuron,
�¥����
i.e. the last neuron in the network) by the term (which we simply regard as a slightly
more complicated activation function, for moment)�¦�¡ �
the
¦¡�¡ ��with parameter¥  which has to be fixed later. Each activation function H in ¥ �¡���¦�¡ ���¥¨�¥ �
the network is substituted by the term�¦�¡¥¥ ���� �¥¨�� ���£ �where¥  has to be
£
chosen later. Denote by¡¢the by¡£¢
original neurons, the ¢
the above more complicated activation thereby,¡�¢¥¤ ¡¢ £
functions; if ¢
neurons which involve
the activation
function is¦. We will now show that some¥  can be found such that the resulting
network differs from the original one by no more than¦¥¥ for all inputs in�  ¡��
with length at most�. The following observations are easy:
1. Given some neuron¡¢with activation function H. Note that such a neuron
is functionally dependent only on the label of entry�¨ a given in the above
network by construction, hence we can write¡�¢��¨�as a shorthand
some¦�
notation.
�.Then¥  Given can be found such
  that�¡¢��¨�� §
for all inputs�¨ �
¡
¡¤¢
¢��¨��¥ £ ¦�
. This follows immediately because of the uniform
31
convergence �£ �to � �¥¨�� � of�¦�¡¥¥ ��
contained in the inputs.
H if a neighborhood of�is not
2. Given some neuron¡¢with linear activation function.   �¨ Assume are the
inputs of¡¢directly referring to input entries, assume  
© ¤
the additional inputs referring to outputs of other neurons. write¡¥¢�  �¨� 
We ©�,
Then we can find¥  and¦¡ such
for some¦� convenience. Given
�.
¡��¥ ¦�for
¡
£
¡
©
all ¢�¡ 
© ,   © with ©¢�© �
that�¡£¢�¡ 
§
�¨�  �¨� 
 
¦¡   ¡
¡ ¦ ¨ �¨ and labels in . This follows from the fact that
© ¡£¢is   ¢�¥
continuous, hence uniformly
with�©¢�
continuous on compacta,
£
¦�¡ ���¥¨�¥�¦¡�¡ ��converges to id hence¡¢
uniformly on compacta, ¢
and�¦�¡ �
¡���
©�� ¡¤¢
�©��������©¢£�are
¢� ¥�� � ¦ ��
converges to¡£¢uniformly on compacta. We therefore find¥  and¦¡ such that
�¡£¢�¡  �¨�  �¡£¢�¡  ¥
¥ ¦�¥
¢�¥ ¦¡ ¡
.
¡¢
�¨�  ©��
¦�¥ ¥ ¥�
©
for�©¢�
3. neuron¡¢¤ £
Given some with
¡¤¢ ¢
¢�¡ 
£
© �¨� 
©�� ¡£¢�¡ 
�¨� 
©
¡��
�¡£¢�¡  �¨�  ©
¡���
¡�� ¡¢
¢�¡ 
£
© �¨� 
¥�
activation function¦. Note that, by con-
struction,¡£¢is functionally dependent only on outputs of other neurons. De-
note, for convenience, the direct functional dependence ©�. Given some
by¡�¢� 
�.
¡��¥ ¦�for
¡
find¦¡
Then that�¡¢� 
©
we   © can   ©
such all , with
¡ ¦ ¨ ¢�¥ ¦¡
. This follows immediately from
and�©¢�©
the fact that¦is continuous, hence uniformly continuous on compacta.
� ¦�§
 ��
©¢�©
¡
¥�¥� ¢�
These properties allow us to substitute each single activation function and to control
¦ ��
©�� ¡£¢� 
the effect of this substitution. We now formally unfold the RCC for all possible se-
quences of length at most�, obtaining several feedforward computations composed
of functions computed by single neurons. For each such feedforward computation,
32
¡��
we start at the rightmost neuron and consider the neurons in the order of their ap-
pearance in this feedforward computation from right to left (thereby visiting the
original neurons more than once since recurrence accounts for duplicates). The fi-
nal output should deviate from the original one by at most¦¥¥ on sequences with
  ¡
entries in with¦�¤ ¦¥¥, . Starting we determine for each value¥ 
neuron the
and a number¦¡ §
that the following holds: if the inputs which come from
�such
other neurons deviate by no than¦¡
more and the activation function is substituted
by the above term including¥ ,the output deviates by at most¦�. For the three
types of neurons, this can be achieved due to properties 1 – 3 as explained above.
We can then continue with neurons to the left setting¦to the minimum of the found
¦¡
and¦ . (The bound¦  latter thereby ensures that values lie in the compactum
¦ ¨
a number¥  for each copy of a neuron such that this choice guarantees appropriate
 �
¥�� �� ¦ ��
on which uniform approximation is guaranteed.) So we collect
approximation of the whole function. We can now choose for the entire network
the minimum of all values¥  as a uniform value which simultaneously fulfills all
constraints.
Finally, it remains to show that the network which involves the above compli-
cated activation functions can be interpreted as a RCC network with linear output
and hidden neurons with activation function¦. Note that the above terms are both
of the following form:
for real values ©� �¡��. We can simply integrate the term
¡�� �
¡
respective neuron,  
© �¦� �¡�
into the bias of the
into the weights, © into the weights of all neurons which have
direct connections from the neuron, and�into the bias of these neurons. Note that
the last neuron of the RCC network (which does not possess a successor) is linear,
hence no substitution has to be done at this part. Note, furthermore, that outgoing
33
connections include, of course, recurrent connections. (We do not have to deal with
cycles because the integration of © and  
is done only once for each connection of
the network.) Hence the initial values for the empty sequence¡�¢�� �is affected by
this change, too. We account for this fact resetting the term to¡¥¢��¡ �¥�©��
Hence RCC networks constitute universal approximators for sequences.
3.2 Recursive CC
 
. ¨
Referring to the above proof, the essential steps to show approximation complete-
ness for RecCC networks for input tree structures are (I) the construction of a unique
linear code for tree structures over a finite set  
, and (II) the construction of a
network with possibly multiplicative neurons and appropriate activation functions
which can compute these codes. Then construction steps (III) to (V) can be trans-
ferred from the previous case.
The linear representation of trees is slightly more difficult than the representa-
tion of sequences since they are usually represented as abstract nonlinear data types.
The prefix notation of trees offers one possibility which we use in the following. As-
sume  
is a finite set which does not contain the symbols ‘(’, ‘)’, and ‘,’. Assume¥¡ 
and�are additional symbols not contained in  
root ¡ vertex and subtrees �
¡
���and
the structure as follows:
. Define for each tree � over   with
the term-representation rep���recursively over
rep�£  �����¢¡��rep������rep���
rep����¤ ��¤
This representation is obviously unique and hence step (I) is achieved.
¡����
Step (II) can be done as follows: assume   ¤
numbers not contained in  
¥¢ �
assume  �,  £  are
,
, assume the maximum length of these numbers is
¡�������� ¤£ ¢
34
and assume the numbers are extended by leading entries�to occupy ¢
exactly
positions. Denote by nrep���the number which we obtain if we substitute rep���
in
the symbols�by  �and¥¢  by  £  and extend the numbers from   by leading entries
�. Then nrep���is unique, too. Note that it is not necessary to introduce symbols for
‘(’, ‘)’, and ‘,’because we deal with only one function symbol. We now show that
two multiplicative hidden neurons¡�and¡¢¡ of a RecCC network with the identity
as activation function can be constructed which compute ����� ¡�����¤
and
 ��length�nrep����
¡¢¡����¤§��nrep���
where length denotes the length of the number including leading entries�. We set
Assume the root vertex of a tree � is ¡ and the two children are ��and �
¡
. The
�� £ ��
��¤
¡��  ���¤§���£�
functional dependence of¡�is
¡¢¡�£ 
¡����¤�������¢¡��¡�������¡��� ¡���
where induction¡������and¡���
by structural
¡��represent the ¡����.  ���nrep�� Hence the choice ¡�©¡ ��¤§���¡ £ and���
terms�����
���©��©
�©
¡�©¢ 
gives the appropriate length. The functional dependence of¡¡
¡��¡¢¡��
¡����£¡��¡�����¡�������¡¢¡������¡���
¡���
�
where by structural assumption¡¡�����and¡¢¡��
¡¢¡����¤
Hence the function
���¡ � ¥��
� ¡�©��������©¡£��¤§���£
£ �©��
35
is
 ���nrep������
¡��constitute��nrep�����and��nrep��
���¡
£ �©¡ � ���¡ £ �©
¡�©¡¤
¡�.
gives��nrep���. Note that these functions can be computed by multiplicative neu-
rons of order ¥ with the linear activation function. Thus, in contrast to the case of
sequences, we deal with multiplicative neurons in these approximation results.
Now we can proceed with steps (III), (IV), and (V) just as in the case of RCC
networks and we obtain as a results, that Theorems 3.2, Theorem 3.3, and 3.4 also
hold for tree structures instead of sequences, functions�
�,
�  �
i.e. for whereby
¡�
the RecCC network has multiplicative neurons.
Due to the construction we use multiplicative neurons of order ¥ in this case.
In practical applications, usually standard neurons instead of multiplicative neurons
are used. One can directly substitute the multiplication if, instead of single neurons,
a feedforward network is attached in each recursive construction step, i.e. �¢may
be computed by a feedforward network with a fixed number of neurons. Note that
an activation function for which some point¡�exists such that¦ is twice contin-
uously differentiable in a neighborhood of¡�with nonvanishing second derivative
¡�¡���© ¦¡ �can ¤
be used to approximate function¡¡
the in the following way:
¡¡ �
¥¡�¦¡ ¦�¡����¥�¡��� ¥�¦�¡��
¡�¡�� ¥�¡��� ¦�¡��
with uniform convergence on compacta. We refer to this property as¦being locally
��
�
�¡  ¡
¡
. In addition, multiplication can be substituted by the square using the identity
��¡  �¡�¥¨§  �¡� �¡ � implies¦
�.Analogous being locally � to Theorem 3.4, we can thus substitute each function
� ¤
. Note that the property locally �
computed by a multiplicative neuron �¢by a feedforward network which in this
case contains only one hidden layer with ¢ hidden neurons in the hidden layer and
activation function¦. 7
7 We do not specify this remark for trees but formalize the result later for the case
of graph structures.
36
�¥�
¡
RecCC networks are usually not complete if they deal with DPAGs as inputs
which are first transformed to a tree structure. However, some specific settings can
be observed where uniqueness of representation is assured and hence approximation
is possible such as the case of DPAGs such that no two vertices in a DPAG contain
the same label. In this case, the label can also serve as a identifier of the vertex and
the original graph can easily be reconstructed from the tree.
As beforehand, we have only considered supersource transductions so far. For
tree structures IO-isomorphic transductions as a generalization of the more simple
supersource transductions are interesting. This means that an input tree is mapped
to an output tree of the same structure whereby the labels are substituted by values
computed by the transduction. Thereby, the output label�¡�¢¡�of a vertex ¡ might
depend on the vertex ¡ and the whole tree structure. We can use RecCC networks
for IO-isomorphic transduction by substituting the label at each vertex by the output
computed with the network for the subtree rooted at this vertex. It is an immediate
consequence of the principled dynamics of RecCC networks and our previous ap-
proximation results that exactly those (measurable) supersource transductions can
be approximated, for which a strict causality assumption applies: the output label of
a vertex can depend on the subtree rooted at the vertex, but not on any other vertex of
the tree structure. Since the approximation of functions which fulfill this causality
assumption can be reduced to the problem to approximate a supersource transduc-
tion for any given subtree of the considered trees, the approximation completeness
is obvious in this case. Conversely, the RecCC dynamics is based on this causality
assumption with respect to functional dependencies of vertices. Thus, no functions
violating the causality assumption can be approximated by a RecCC network.
37
Figure 2: Example of two graph structures which cannot be differentiated by RecCC
networks if the labels of the vertices at each layer are identical.
3.3 Contextual RecCC
CRecCC networks integrate contextual information since hidden neurons have ac-
cess to the parents’ activation of previous neurons. This possibility expands the
functions which can be addressed with these networks to functions on graphs and
IO-isomorphic transductions, as we will see. However, the situation is more com-
plicated for DPAGs than for tree structures because it is not clear how to represent
DPAGs in a linear code based on the information available within a contextual re-
cursive neuron. We now discuss this issue in detail and we develope a code for a
subset of DPAGs.
The context integrated in CRecCC neurons is of crucial importance. The avail-
able context of ¡ vertex at neuron can, for example, be measured in terms of the
vertices of the DPAG which contribute directly or indirectly to¡¥¢�¢¡�. This context
is precisely computed in (Micheli, Sona, and Sperduti, 2003a). As demonstrated in
(Micheli, Sona, and Sperduti, 2003b; Micheli, 2003), this contextual information
can in principle be used to compute certain transductions on graphs which require
global knowledge of the structure and hence cannot be computed with simple re-
cursive cascade correlation networks. Hence the functions which can be computed
by RecCC networks on DPAGs form a proper subset of the functions which can be
computed by CRecCC networks, e.g. the two structures depicted in Fig. 2 cannot
38
be differentiated by a RecCC network, but they can be differentiated by a CRecCC
network. RecCC networks constitute universal approximators for tree structures.
RecCC networks and CRecCC networks do not cover different function classes if
restricted to tree structures of limited height. For DPAGs, CRecCC networks are
strictly more powerful. Here we go a step further and identify a subset of DPAGs
which fulfill a canonical structural constraint and we show that CRecCCs possess
the universal approximation property for these structures. In addition, we construct
a simple counterexample which shows that there exist structures outside this class
which cannot be distinguished even by CRecCC networks regardless of the chosen
functions of the neurons. As beforehand, we restrict ourselves to DPAGs where the
fan-in and fan-out is restricted by two.
3.3.1 A counterexample
The question arises whether CRecCC network can differentiate all possible DPAG
structures. The following example shows that this is not the case. However, this
example requires a highly symmetric structure where different parts of the DPAG
can neither be differentiated by the structure nor the labels of the vertices. Hence
the example can be regarded as an ‘artificial’ example.
Theorem 3.5 There exist graphs in DPAG which cannot be distinguished by any
CRecCC network.
The counterexample is shown in Fig. 3. The formal proof that these two structures
cannot be differentiated by a CRecCC network can be found in the appendix.
Note that this construction would not have been possible, if a different enumer-
¡�. ation of children and parents had been edges���§�and��¡�§ §
chosen: consider
¡
§
§
and constitute the of�and�¡
second child , respectively. Assume we had enumerated
parents in such that�and�¡
a way, also constituted the second parent of
39
1
2
3
0
4
5
6
Figure 3: Example of two graphs which are mapped to identical values with
CRecCC networks for a non-compatible enumeration of parents. Thereby, the enu-
meration of parents and children is indicated by boxes at the vertices, the left box
being number one, the right box being number two.
¡
and , respectively. Then the proof for the counterexample in Theorem 3.5 would
§
no longer hold: § the vertices�and could then be differentiated by an appropriate
choice of the weights and activation functions of the neurons, and the same would
¡
hold § and . This can be seen e.g. due to the activation of the second hidden
for�¡
neuron whose functional dependence is:
�
¢�© ¤
� ¡���¡�����¡��¥��¡¢¡�¥��¡��¡ ��¡¢¡�¡ ����
¡���¡�����¡��¥��¡¢¡�¥��¡���¡ ��¡¢¡�¡ ��¡�������
¡¢¡����¤
This inequality holds for appropriate ¢�¡�������¤ ¡¢¡�§��
� ¡ inputs��
¢ because the and¡����are given
at different positions. The possibility to differentiate these two vertices leads to
the possibility to differentiate all vertices in one of the structures, and finally to
the possibility to differentiate the two structures since the vertices are connected
in a different way in the two structures. This example leads to a simple structural
condition which ensures that different structures can be differentiated by CRecCC
networks.
40
1’
2’
3’
0’
4’
5’
6’
3.3.2 Identification of a subclass of DPAGs
Motivated by the previous example, we now identify one subclass of DPAGs for
which such an example cannot be found and which allow a linear encoding of the
structures. The following property constitutes one sufficient condition to ensure
this property, it is not a necessary condition. However, it has the advantage that it
can easily be tested and it does often not put severe constraints on the respective
application.
Definition 3.6 Assume   is a DPAG. We say that the ordering of the parents and
children is compatible if for each edge�¦�¡�of   the equality
�¡ ��¢¡�
holds. This property states that if vertex
 ©¨�¦���¤ ¦
is enumerated as parent number of
¡ then ¡ is enumerated as child number of ¦ and vice versa. We call the subset
of DPAGs with compatible positional ordering rooted directed bipositional acyclic
graphs (DBAGs).
Note that the above counterexample used elements in DPAG  DBAG. A different
enumeration as DBAG would have been possible in this case, such that the struc-
tures could be distinguished. We now investigate in which sense the requirement on
the presence of a compatible enumeration puts restrictions on the considered struc-
tures. Due to the definition, DPAGs with compatible positioning necessarily have
the same fan-in and fan-out. However, we can always extend a DPAG with fan-in
¤ ¡ by empty children and parents, respectively, such that the fan-in
© fan-out
and fan-out coincide. Moreover, the following theorem shows that any given DPAG
¤ ¤�
with limited fan-in and fan-out two can always be transformed into a DBAG via
reenumeration of children and parents.
41
Theorem 3.7 Given a DPAG   with limited fan-in and fan-out two, then we can
find an enumeration  
¡
¨ and �
¡
of parents and children for each vertex ¡ such that
¨
the resulting DPAG constitutes a DBAG with respect to the enumerations  
for vertices ¡ of the structure.
¡
¨ and �
The proof can be found in the appendix. This proof holds only for fan-in and
fan-out two. For larger fan-in and fan-out, an analogous result can be stated if, in
addition, the vertices might be expanded by empty children and parents.
Theorem 3.8 Assume   is a DPAG with fan-in and fan-out limited by at most ¤ .
Then we can find a DBAG which arises from   by an expansion of the vertices of  
with empty children and parents to fan-in and fan-out ¥ ¤��
children and parents.
reenumeration of
�and
The proof is provided in the appendix. We conjecture   that can also be transformed
to a DBAG via reenumeration of children and parents with fan-in and fan-out ¤ and
without expanding. However, we could not find a proof, neither a counterexample
for this conjecture.
In applications such as chemistry, positioning of children and parents is often
done only by convention without referring to semantic implications such as in the
case of directed acyclic graphs (without prior positioning). Hence restricting to
DBAGs dose not limit the applicability of CRecCC in these cases. In alternative
scenarios such as logic, positioning might be crucial and it might not be possible
to change the positioning without affecting the semantic. A simple example for
this fact is the term implies(loves(Bill,Mary),smiles at(Bill,Mary)), or, as a whole
sentence ‘If Bill loves Mary then Bill smiles at Mary’. This cannot be transferred
into a DBAG unless ‘Bill’ becomes the first child of the first symbol ‘loves’ and the
second child of the second symbol ‘smiles ’ or vice versa. I.e. we would obtain ‘If
42
¨
¡
Bill loves Mary then Mary smiles at Bill’ or ‘If Mary loves Bill then Bill smiles at
Mary’, both nice sentences, however, with a different meaning than the original one.
We will later see that alternative guarantees for the universal approximation ability
can be found and a large number of structures in DPAG  DBAG can be appropriately
processed as well. This alternative characterization is not as straightforward as the
restriction to DBAGs which constitutes a purely structural property and can easily
be tested.
3.3.3 Encoding of DBAGs and certain DPAGs
Next we introduce a recursive mapping of vertices in a DPAG to terms which de-
scribes information contained in the DPAG. The purpose of this transformation is
to obtain linear representations of DPAGs. It will turn out that these linear rep-
resentations characterize DBAGs up to a certain height uniquely. Naturally, the
representation characterizes also a large number of DPAGs uniquely and approxi-
mation completeness will hold for these DPAGs, too. However, the fact whether
this holds for any given set of DPAGs needs to be tested for each application. For
DBAGs it is always valid because of the structural restrictions.
Assume  
is a finite set which does not contain the symbols ‘(’, ‘)’, and ‘,’.
,¥¢ , and� are additional symbols not contained in Assume¥   
. Define for each
 ���and ¡ vertex in a given DPAG the term-representation rep¢�¢¡�recursively over
 and the structure as follows:
rep¢�£  ������£¡��rep��ch���¢¡���rep��ch ¥ �rep¢�£  rep��£¡��¤  ��¤ ¢��¤ ¡�£¡����� ������¢¡��rep¢�ch��¢¡����rep¢�ch rep¢�¢¡��¤ ¡�¢¡����rep¢ 
for § �.
¥¢�
43
��pa��£¡���rep¢ 
��pa ¡�¢¡�����
This representation contains all information available for a CRecCC network,
since the following result holds:
¡
   
rep¦�root� 
Theorem 3.9 Assume and are DPAGs which can be differentiated by hidden
neuron�of some CRecCC network. Then rep¦�root� ��and
¡��are
different.
Proof. The functional dependence of CRecCC neurons has the form
¤ ��¢����£¡��¡��¢¡��������¡£¢  ¡£¢�¢¡�
¡��ch��¢¡����������¡£¢�ch��¢¡���¡��ch ¡�£¡����������¡£¢�ch ��pa��¢¡����¡��pa
��£¡��
¡�¢¡���
¡�¢¡���������¡£¢ 
���pa ¡�£¡����
obtain if we evaluate���£¡�to the respective label and all other terms to symbols in
Hence all available information for neuron is contained in the term which we
¡��pa��¢¡���������¡£¢ 
the corresponding Herbrand-algebra. We denote the evaluation���¢¡�by the same
symbol, for simplicity.¡¢�£ £¢�is evaluated to constant�¢
¢ the
to the and¡£¢�  . We denote the resulting string for¡¢�£¡�by rep constant�¢ 
 �is
evaluated
¡ ¢�£¡�.Two DPAGs
neuron� which can be differentiated by hidden lead to different representations
rep ¡
¦�root� ��and rep ¡
  ¡ ¦�root�¡ 
¡
¡
  ¡
¡��. We now show that if for vertices of and
of and some the terms rep ¡ ¢�¢¡�and rep ¡
�   ¢�£¡
¡�.
¡�differ then so do §¢�£¡�and
rep¢�£¡
This is done by simultaneous induction over and the structure of the terms: if
one of the vertices ¡ or ¡
�¢  for
¡
is an empty child or parent, we get definition�¢
¢ by or
the representation rep ¡
¢. rep¢, however, also gives a identifier¥¢ or¥ 
unique
for the empty child or parent, respectively. So we can assume that ¡ and ¡
empty.
 ¤ �:
We obtain
rep ¡��¢¡��¤ ������£¡��rep ¡��ch��£¡���rep
¡��ch ¡�¢¡����
44
¡
are not
and
rep ¡��¢¡
¡��rep ¡��ch��¢¡ ¡����rep ¡�¢¡
¡��ch
¡�, or rep ¡��ch��¢¡���©
rep ¤
¡��ch��¢¡ ¡��, ¡��ch
¡��ch ¡���. or rep ¡�¢¡
rep In the latter two cases we can assume
by structural assumption ¡�¢¡����© � that rep��ch��£¡
¡��or     � §��ch ¤ §��ch��£¡���© ¤ ¡�¢¡���©
rep��ch ¤
¡�¢¡
¡��holds. rep�evaluates to
������£¡
hence one of the three subterms is different: it holds���¢¡��©
¡��¤
���£¡
¡����� ¤
and
¡��rep��ch��¢¡
and gives therefore also two different terms.
Inductive step for :We find
rep ¡ ¤ ¢�¢¡�
¡���rep��ch ¡�¢¡
������£¡��rep��ch��¢¡���rep��ch rep��¢¡��¤ ¡�£¡���
rep��£¡ ������£¡
¡����
¡��¤
��¢����£¡��rep ¡��¢¡��������rep ¡
¡��ch���¢¡���������rep ¢�ch���¢¡���rep ¢  ��¢¡��
¡��ch ¡�£¡����������rep ¡��pa��£¡����������rep ��pa��¢¡���rep ¢ 
¡��pa ¡�¢¡����������rep ¢ 
rep
rep
¡
¡
and a similar term for rep ¡ ¢�£¡
differs. Hence one of the following holds:
2. rep ¡
¡�,
1.���¢¡��© ��ch��¢¡��© ��ch
���£¡
rep
¤
¤ ¡�¢¡��© rep ¤
3. rep ¡
4. rep ¡ ��pa��£¡��© rep ¤
5. rep ¡
��pa ¡�£¡��© rep ¤
¡
��ch���¢¡
¡
��ch ¡�¢¡
¡
��pa��¢¡
¡
��pa ¡�¢¡
¡
¡
¢�ch ¡�£¡���� ��pa ¡�£¡���
¡�. If these terms are different, one of the subterms
¡���for some ¢ ¥  ,
¡���for some ¢ ¥  ,
¡���for some ¢�  ,
¡���for some ¢�  .
45
In cases ¥ and   we can assume by structural assumption that also rep�gives differ-
ent values for the respective terms, in cases § and   we can assume this fact because
of the inductive hypothesis. If rep�yields different values for two terms, then so
¢
does rep�  for all ¢ ¡�
that the terms
as we will show afterwards (Lemma 3.10). Hence we find
�����¢¡��rep¢�ch��£¡���rep¢�ch rep¢�¢¡��¤ ¡�£¡����rep¢ 
and
¡���rep¢�ch ¡�£¡
��pa��¢¡���rep¢ 
¡���rep¢ 
��pa ¡�£¡����
rep¢�¢¡
¡��rep¢�ch��¢¡
��pa ������£¡ ��pa��¢¡
¡���rep¢ 
¡�¤ ¡�¢¡
are ¨
also different.
¡����
We postponed in Theorem 3.9 the proof of the following lemma:
Lemma 3.10 Assume   and  
spectively. rep¢�£¡�©
Assume rep¢�¢¡ ¤
¡
¡   ¡
¡
¡
 
rep��£¡�©
rep��£¡ ¤ are DPAGs with vertex in and in , re-
¡�for
¡�for some . Then all
¢�  .
Proof. The proof is by simultaneous induction over ¢ and the structure. For ¢¤
�,
�also holds and the result is obvious.
Inductive step for  ¤ ¢ : If one of the vertices, ¡ say is empty, the result is obvious
rep��¢¡�yields¥¢ or¥  because , rep��¢¡
¡�yields respectively. a different value by
definition for all ¢ ¡
unless also constitutes an empty child or parent.
¡
Assume the vertices are both not empty. Assume w.l.o.g. that ¢  . §
rep¢�£¡
¡�. The representations are of the form
�����¢¡��rep¢�ch��£¡���rep¢�ch
rep¢�£¡�© ¤
rep¢�¢¡��¤ ¡�£¡����rep¢  ��pa��¢¡���rep¢ 
and
rep¢�¢¡
¡�¤ ������£¡
¡��rep¢�ch��¢¡
¡���rep¢�ch ¡�£¡
46
¡���rep¢ 
��pa��¢¡
Assume
��pa ¡�£¡����
¡����
¡���rep¢ 
��pa ¡�¢¡
(The last two for ¤
subterms are missing �.) Hence one of the subterms dif-
¡�,rep¢�ch��¢¡���©
fers, rep¢�ch��£¡
¡��,rep¢�ch i.e.���£¡�© ���£¡ ¤ ¤ ¡�¢¡��© rep¢�ch ¤ ¡�¢¡ ¡���,
¡��, ��pa��¢¡��© ��pa��¢¡ or rep¢ 
��pa rep¢  ¤ rep¢ 
¡�¢¡��© ��pa rep¢  ¤ ¡�¢¡ ¡��. In cases ¥
and we can assume by structural assumption that the same holds also for the   in-
dex ¢ , in the last two cases we can assume by the inductive hypothesis that the same
holds also for the index ¢� �. In all cases we obtain at least one subterm rep��¢¡�
of
which is different from the respective subterm rep��¢¡
¡�hence rep��¢¡�©
of rep��£¡ ¤
holds. ¨
Therefore, CRecCC networks can constitute universal approximators at most
¡�
on DPAGs for which rep¢yields a unique representation for some by definition
of the functional dependencies. We will now show that the mapping which maps
DBAGs of height at most� to rep¦ ¥��£¡�, ¡ denoting the root of the respective
DBAG, is injective. As a consequence of this fact, all information available in
a given DBAG is provided to a CRecCC network via the defined dynamics. For
this purpose, we describe an algorithm which allows to recover the original DBAG
from the respective linear representation. Note that each term rep¢�¢¡�can uniquely
be decomposed into its subterms. The reconstruction is in several steps.
Definition 3.11 Assume a path of the DBAG   which starts at the root is given
which consecutively visits the vertices . A linear path representation con-
¡¢£�������¡¢¤£
sists in
�
a string , where £ equals
£������£¢¤£ ¡��¥ £�� £¤
for all ¢¤
£�¤ �
¨¡ �¦��£¡¢���
���������.
Hence a linear path representation collects which child of the previously visited
vertex is visited at each position in the path. £¢�¤
47
�indicates
that the left child of
the actual vertex in the path is visited, ¥ indicates that the next vertex is given
by the right child. We assume the convention that the empty string represents the
£¢�¤
path of length�which stays at the root. Obviously, the mapping of rooted paths to
linear path representations is injective.
Lemma 3.12 For any vertex ¡ of height at most in a DBAG   , it is possible to
recover the linear path representation of all paths from the root of   to ¡ from
rep¢¥��£¡�.
Proof. The proof is via induction. If ¤
¡ can only be the root of the DBAG
�,
with linear path representation¥of the path of If §
length�. �, then we find
Note that we can decide whether the left or right parent is empty since¥¢ is a sym-
¡�¢¡�����
bol not contained in  
. If a parent is not empty, we can recover the linear path
representation of all paths from the root to pa��£¡�and pa ¡�¢¡�from rep¢�pa��£¡���or
rep¢�pa ¡�¢¡��, respectively, by the inductive hypothesis, because the height pa��¢¡�
of
¡�£¡�is smaller than . Note that the DBAG has a compatible positioning of
and § ©
rep¢¥��¢¡��¤ ��������rep¢�pa��£¡����rep¢�§�©
children and parents. Hence we obtain the linear path representation of all paths
from the root to ¡ in the following way: we extend the linear path representation of
the paths to pa��¢¡�(if this parent is existing) by the letter�. To this set we add the
linear path representation of the paths to pa ¡�¢¡�(if existing) extended by the letter
. ¨
¥
The above lemma tells us that we can recover all paths from the root to a vertex
from the given linear representation. Note that the linear path representations of a
rooted path to a vertex differs for each vertex. The next lemma extends this result to
the extraction of all vertices and their respective paths from the linear representation.
48
Lemma 3.13 For a   DBAG of height at most�, it is possible to extract from
vert� �� 
¥��root� ��the set rep¦   � ¢ ¡��¥¤£
lexicographically, ¤  
contains the linear path representation of all paths from the
root to ¡ in   £ .
¡����¢¡�� �� ¡
�� 
is ordered
Proof. For an empty DBAG, the term rep¦ ¥��root�¡ ���equals¥£  and we simply
extract the empty set. Otherwise, we can start with the empty set  
and iteratively
add tuples����£¡�� �to the set   which collect all vertices ¡ of the DBAG and the
set of linear path representations of the vertex. This starts from the root and the
given term rep¦ ¥��root�¡ ���,and recursively extracts vertices and subterms which
describe all paths for these vertices until we arrive at the leaves.
The procedure is as follows: Note that all vertices in the DBAG have height at
most�by assumption. Given the term
rep¦ ¥��¢¡��¤
�����¢¡��rep¦ ¥��ch��¢¡����rep¦ ¥��ch
¡�£¡���������
we recover the vertex ¡ : its label���¢¡�can be obtained from the term rep¦ ¥��£¡�.All
linear path representation of all paths from the root to ¡ can be recovered because
of Lemma 3.12. We sort the linear path representations lexicographically getting  
and add the tuple����¢¡�� �to the set  
tained in  
case,  
. Note that this tuple might already be con-
if vertex ¡ can be reached by more than one path from the root. In this
is not changed since we take the union and since the linear path represen-
tation of a path to a vertex characterizes the vertex uniquely. We can decide, given
rep¦ ¥��¢¡�, whether a left and/or right child of ¡ exists, because empty children are
represented by the unique symbol¥  as second or third subterm of rep¦ ¥��¢¡�. If
rep¦ rep¦
nonempty children exist, we continue with each of the two children and the representation
¥��ch��£¡��and/or ¥��ch ¡�¢¡��, rep¦
respectively. Both representations
can be obtained from ¥��¢¡�. This iterative step yields all vertices which are
49
,
direct or indirect successors of the previous vertex ¡ , thus the entire structure will
be visited during this iterative process, starting from the root   of ¨
.
Hence we finally get the uniqueness of the linear representation for DBAGs:
Theorem 3.14 The mapping which maps DBAGs   over   of height at most�to
rep¦ ¥��root� ��is injective.
Proof. Because of Lemma 3.13 we can recover from the given linear representa-
tion for any DBAG   of height at most�and each vertex of   a tuple consisting
of the label of the vertex and the linear path representations of all paths from the
root to the respective vertex. This information uniquely characterizes   because the
DBAG can easily be reconstructed using this information: we introduce a vertex for
each tuple and assign the given label to it. The problem is, of course, that linear
path representations of paths do not include the identifier for the vertices which we
have just chosen. However, the root is uniquely characterized by the fact that the
only path to the root is empty. We can then iteratively add all edges to the DBAG
and add the positioning of children and parents which are implicitly contained in
the linear path representations as follows: we start with the disconnected vertices,
and iteratively add connections starting from the root up to the leaves. In an iterative
step, we consider all paths starting from the root in the already constructed part, and
we choose a shortest path in the linear path representations of vertices of which not
yet all edges are included (note that it is possible to test this property since we can
identify the root and afterwards follow the successors given by the identifiers of the
linear path representation). Because we choose the shortest path, only the last edge
of the path, say�¦�¡�, is not yet included in the DBAG. We can uniquely identify ¦
via the first part of the linear path representation. The identifier of ¡ is known (since
the path is contained in the tuple which describes ¡ ), and hence we can connect ¦
50
and ¡ in the DBAG, i.e. introduce the edge and assign the positioning as indicated
in the linear path representation to ¦ and ¡ . Since DBAGs are considered, the po-
sitioning of children, which is given in the linear path representation, also gives the
positioning of parents. ¨
The representation rep¢preserves all information available for a CRecCC net-
work as we have shown in Theorem 3.9. Hence CRecCC networks with appropriate
choice of the weights and activation function (such that different terms rep¢are
mapped to different values also by a CRecCC network) can exactly map all DPAGs
appropriately for which rep¢yields a unique representation for large enough . We
will later show that multiplicative CRecCC neurons with a standard activation func-
tion can perform this task. Since DBAGs can uniquely be reconstructed from rep¢,
CRecCC networks are complete with respect to DBAGs. Note that the reconstruc-
tion of a DPAG from the encoding rep¢is in general not possible because forward
paths in a DPAG together with the respective positioning of the vertices (i.e. lin-
ear path representations) cannot be obtained from backward connections from the
children towards its parents, as has been done in Lemma 3.12. Thereby, backward
connections are connections from a vertex to a parent (together with the respective
positioning) and forward connections are connections from a vertex to a child (to-
gether with the respective positioning). Nevertheless we can extract all paths which
start at the root, follow a certain number of forward and/or backward connections,
and end up at some vertex even for a DPAG. This is possible because a forward connection
starting ¡ from can be obtained from rep¢�ch��¢¡��and rep¢�ch ¡�¢¡��, while
a backward connection can be obtained rep¢ 
from rep¢ 
��pa��£¡���and ��pa��£¡���,
whereby full information is available for large enough . Hence rep¢can differenti-
ate for large enough between any two DPAGs such that at least one path starting
from the root exists (following forward and/or backward connections) which leads
51
to a vertex with different label or different number of children or parents in the two
DPAGs. Theorem 3.5 shows that this is not always possible. However, we claim
that the capacity of CRecCC networks to differentiate between these DPAGs (and
to approximate functions on these DPAGs as we will see later) is sufficient for prac-
tical applications because these highly symmetrical structures usually do not occur
in practice.
3.3.4 Approximation with a CRecCC network
We will now show that specific CRecCC networks can ‘compute’ the representation
rep¢in some form. Hence they can uniquely encode DBAGs (and DPAGs for which
the encoding is unique for some )into a real vector. This gives us step (II) of the
general proof of approximation completeness. We restrict the presentation in the
next section to DBAGs, for convenience, keeping in mind that similar constructions
can be done for certain DPAGs, too.
Lemma 3.15 Assume   . . Then a CRecCC network
with multiplicative neurons of order at most
¡�������� ¤£ and linear activation function with
¤ �
§
Assume�
¥¨���
��hidden neurons can be found such that the output¡¡¡  ¦ ¥�§�root�¡ ���yields
unique representations of DBAGs   over   of height at most�.
Proof. Fix different numbers  
 
 �,   §  �,   ¢ , ,     , and not contained in .
¡��������  £
Assume that�is not contained in this set of ¢ numbers. Assume is the maximum
length of the digital representation of the above symbols (including elements of  
We assume in the following, that all numbers are extended by leading entries�such
¢ rep¢�¢¡�
    for¥    ¢ for¥¢
that they occupy precisely places. We can then uniquely represent the term
by a number: we set , . A general term rep¢�£¡�is represented by
the number which we obtain substituting in rep¢�¢¡�the symbols ‘�’by  �, ‘�’ by
52
 
).
 
, ‘,’ by  �, ‘�’ by ,���£¡�by § the respective element of  
   
, and the contained
references to rep by the recursively constructed numbers (always including leading
entries�). We refer to the number which represents rep¢�¢¡�by the symbol nrep¢�¢¡�.
We will now iteratively construct neurons¡¡¢ 
�and¡¢¡¢for �
hidden �such
that
 ���length�nrep¢  ��¢¡����and¡¢¡¢�¢¡��¤ ��nrep¢ 
��¢¡��
��£¡�),
¡¢¡¢ 
��¢¡���denotes the length of the term including leading entries�. Since
the representation rep¦ ¥��root�¡ ���is unique for DBAGs of height at most�, the �
length�nrep¢ 
�����
The latter term denotes the decimal number smaller than�with digits nrep¢ 
��£¡��¤
¦ output¡¡¡  ¥¨�� ¥�§�root�¡ ���of hidden neuron
these DBAGs.
Per definition,¡£¢�£  ¢�¤
¢ and¡£¢�  �����£ �¢ ¤ ¢¤ �¢
are �¢  odd and�¢
for  �¤
��gives then unique values for
fixed symbols for all . We
in the following ¡ that is nonempty. For the first hidden neuron, we ������¢¡��¡���ch���¢¡���¡��ch
choose�¢
find¡��£¡�¤
¡�¢¡����. , respectively, denote the length
Assume��and�¡
¢¤§��  ¢
and�¢ ¤ ��    for even .Assume
of nrep�of the first and second child, ¡ respectively, of . Since rep��¢¡�¤
the term ������£¡��rep��ch��£¡���rep��ch ¡�¢¡����consists of symbols and two subterms which
 
represent ch��¢¡�and ch ¡�£¡�, the length of rep��£¡�is   ¢ given �¡
by . Hence ��� �
the choice
���©��©
�©
gives us the desired output, since via structural induction the latter two entries can �����££ ¡�©¡ ��¤
be ©
¡
identified with ©¢  �����¢¡�and . Note that this function can be
computed by a multiplicative neuron of order ¥ with linear activation function.
The functional dependence of the second hidden neuron
¤
¡����¢¡��¡��¢¡�,
�
¡��ch��¢¡��,¡¢¡�ch��¢¡��,¡��ch
¡�¢¡��,¡¢¡�ch
¡�¢¡��,¡��pa��£¡��,¡��pa
¤
is¡¡�¢¡��¤ ¡�¢¡����.By struc-
�����£¡¥¤
53
¡�©¡ 
tural induction we can assume that¡¡�ch��£¡��and¡¡�ch ¡�£¡��contain the num-
ber representation of rep�of the two children. As beforehand,¡��ch��£¡���
shown
and¡��ch ¡�¢¡��represent the length of rep��ch��£¡���and rep��ch ¡�¢¡���, respectively.
Hence the function
� �����£ ¡�©��������©¡ �¤
 
©�� �����£¢
£
 �� �����£¢
£
�����¡£ ����� £ � �����¤£
� �����¤£ �����££
©¡ �©¡£�
©¢ �©¡¤�  §
computes the desired number representing nrep��¢¡�. Thereby, concatenation of dig-
 �  ��
©¢ � �
its which represent the symbols in rep��¢¡�is done by summing the appropriately
shifted numbers. Note that the length of the already computed subnumbers repre-
senting children is available. This function � ¡ can be computed by a multiplicative
neuron of order ¥ with linear activation function.
Assume we have constructed hidden neurons up to number ¥ .It is
������¢¡��rep¢¥��ch��£¡����rep¢¥��ch ¡�¢¡����rep¢�pa��¢¡���rep¢�pa ¡�¢¡����
The functional dependence of hidden neuron number
¤
¥ �
rep¢¥��¢¡�
�
¡��pa��£¡����������¡¢¡¢�pa��£¡���¡��pa ¡�£¡���������¡¢¡¢�pa induction,¡¡¢ 
By ��£¤�yields term���
the  ���length�nrep¢�¥¤����, ¤ where is one of
ch��£¡�,
¡�¢¡����
ch ¡�¢¡�, pa��£¡�and pa ¡�£¡�. By structural inputs¡¡¢¥��ch��£¡���
induction, the
and¡¢¡¢¥��ch ¡�£¡��, respectively, give the  ���length�nrep¢¥��ch��¢¡�����and
Since rep¢¥�consists of ¢ symbols and it further
contains the subterms which describe the length of the representations of children
�����
terms�����
¡¢¡¢¥��£¡� ¤
 ���length�nrep¢¥��ch ¡�¢¡�����.
and parents, the function
� ¡¢¥��©��������©� ¢¥
¡��ch��¢¡���������¡¢¡¢¥��ch��£¡����¡���ch
¡¢¥�����¢¡��¡���¢¡��������¡¢¡¢�¢¡��
 ��¤ ����� £ �©¢¢¥
54
© ¢
�is
¡�¢¡���������¡¢¡¢¥��ch ¡�£¡����
¡�©¢£¢¥  �©¦ ¢¥
¡�©� ¢¥
¡
 ���length�nrep¢¥��£¡����. Note that this function can be computed by
a multiplicative neuron of order § with linear activation function. The functional
computes�����
dependence of hidden neuron ¥ �
¡¢¥
¥
is
¡�ch��£¡����¡���ch ¡����¢¡��¡��£¡��������¡¢¡¢¥��¢¡��
¡�¢¡���������¡¢¡¢¥
¡��pa��£¡����������¡¢¡¢¥��pa��£¡���¡��pa ¡�£¡���������¡¢¡¢¥��pa ¡�¢¡����
¡¢¡¢¥
�
¡�ch ¡��ch���¢¡���������¡¢¡¢¥ ¤ ¡�£¡� ¡�£¡����
term���
��¥¤�yields ¤
¡
the �and ch��¢¡�,
ch
� ¡¢¡� 
¡�£¡�,pa��¢¡�,pa ¡�¢¡�£.¡¡¡ ¢¥�§�¥¤�yields the
¡
term��nrep¢¥��£¤�for pa��¢¡�, � ¤
pa ¡�¢¡�£.¡¢¡��¥¤�gives the term��nrep��£¤�for ¢ ¥ ¡
�and ch��¢¡��ch  �
¤� ¡�¢¡�£.
Hence the function
¡¢¥
� �����¡£ �����£ � �����¢£ � �����¤£
 �� ¤
�����¤£ � �����££
�
£
¡�©��������©� ¢¥  �
© ¢¢¥ ¢
© ¢¢¥  �©¡£¢¥ £�
© ¢¢¥  �©¡£¢¥ ¤�©¦ ¢¥ £�
 
 ���length�nrep��£¤����for ¢ ¥  �
���¢
© ¢¢¥  � � ©��
����� £
�����££
 �
£
© ¢¢¥  �©¡£¢¥
 �
¤� �
£
© ¢¢¥  �©¡£¢¥ ¤�©¦ ¢¥ ¤�©� ¢¥¢ 
¢¢¥  �©¡£¢¥ ¤�©¦ ¢¥ ©
© ¢¢¥  �©¡£¢¥ ¤�©¦ ¢¥ ¤�©� ¢¥ £�  §
¤� �
�����¡ 
�����¡  � ����� £
computes the representation��nrep¢¥��¢¡�.Note that this function can be computed
�
by a multiplicative neuron of order § with linear activation function. ¨
A similar construction is possible if fan-in and fan-out ¤ �
¥
Then products of at ¥ ¤ most factors are to be dealt with, hence CRecCC networks
is considered.
with multiplicative neurons of order at most ¥ ¤ can in this case compute a unique
representation.
It follows in the same way as Lemma 3.1 that integration of feedforward net-
works into a CRecCC network is possible. In analogy to Theorem 3.3 this can be
extended to real-valued labels, and, mimicking the proof of Theorem 3.4, we can
55
substitute specific activation functions by standard ones. Hence CRecCC networks
with multiplicative neurons constitute universal approximators for structures anal-
ogous to Theorem 3.4. This result has been proved for DBAGs with limited fan-in
and fan-out two. It can immediately be generalized to larger fan-in and fan-out ¤
giving us approximation completeness for these DBAGs for CRecCC networks with
multiplicative neurons of order ¥ ¤ .
In practical applications, usually standard neurons instead of multiplicative neu-
rons are used. In analogy to the case of trees, we can substitute multiplicative neu-
rons by networks of standard�£. neurons in the following way:
Assume is some probability measure on DBAGs
 
is �
Theorem 3.16 Assume   ¤
over  
�¡ .
�DBAG
�.
�
Assume¦
is �� � � § Assume� ¦
Assume¥ § ¡
�
a Borel-measurable function.
�,
a continuous and locally squashing function.
Then there exists a CRecCC network with the following properties: it has a linear
activation function for the output neuron. The function of the other hidden states
�¢are computed by a feedforward network with at most two hidden layers and a
universally bounded number of neurons in the network with activation functions¦.
This CRecCC network computes a function©such that
¡ �
DBAG �����¡��� �  �¡
¦��
Proof. The network which we constructed in Theorem 3.4 uses multiplicative neu-
¥�� ©¥�¡���§
rons because of the computation of the linear codes of DBAGs as constructed in
Lemma 3.15. The multiplicative neurons which are used in this construction con-
tain multiplications of order at § most . We can substitute multiplication¡ � 
one
¡¡
by  �¡�¥¨§ the �¡ �  �¡� term��¡�
can be approximated
. function¡
The
uniformly by the
�
Hence we can approximate the multiplicative neurons arbitrarily well by networks
term�¦�¡��
¡�¡����.
 �¦¡ ¥�¦�¡���¥¨�¥¡ ¦�¡���¥ �¡��� ¥ �¡���
56
of standard neurons with approximation function¦. The same argumentation as in
Theorem 3.4 shows that the approximation can be done uniformly up to inputs of
arbitrarily small probability. Since the multiplicative neuron with maximum com-
plexity in Lemma 3.15 includes two multiplicative terms of order ¥ , two terms of
order   , and two terms of order § , a feedforward network with at most two hidden
 
§
¨
layers and neurons in the feedforward network can approximate the computation
of a multiplicative neuron.
3.3.5 Some implications
Note that the key point of the construction is given by Theorem 3.14 and Lemma 3.15.
Theorem 3.14 defines a linear code which represents DBAGs up to a given height
�over a finite alphabet   uniquely and Lemma 3.15 shows that this code can be
computed in an appropriate way with a CRecCC network. The other ingredients
are steps (III) to (V) of the construction for RCC networks, which can be directly
transferred to trees and DBAGs. Note that we can perform the above construction
steps formally for DPAGs instead of DBAGs as well. The only problem is that the
linear codes provided by rep¢might not encode all DPAGs uniquely. We have al-
ready found one example of DPAGs (Theorem 3.5) which are not uniquely encoded
by any rep¢. If such situations are prevented, universal approximation for DPAGs
can be guaranteed as well:
Assume   is some probability measure on DPAGs
Theorem 3.17 Assume   ¤ �£.
over  
¡ � � Assume� �
¦ §
Assume¦ �� � � � § Assume¥ .
�DPAG
is a Borel-measurable function.
�,
�. is a continuous and locally
�squashing function. As-
sume the set ¡ �  
has ¡���£
¡ �
DPAG
�¢¡
 
¡ �
DPAG ���  ��� �© ¤ ¡�
probability at most¦¥¥.Then there exists a CRecCC network
rep¢�root�¡ 
with multiplicative neurons with activation functions¦and linear activation func-
57
¡��£  rep¢�root�¡ ����¤
tion for the last neuron of the network which computes a function©such that
¡ �
DPAG �����¡����©¥�¡���§ �  �¡
Proof. We can disregard the set of DPAGs which are not uniquely encoded by
¦��
by assumption. For the remaining part, we can construct an approximative CRecCC
¥��
rep¢
network as in the previous case. Note, that encoding has only to be done for DPAGs
of limited height with discretized labels, i.e. for a finite number of structures, such
that a maximum number which differentiates all these structures can be found.
The remaining steps are analogous as for DBAGs whereby the encoding network
computes rep¢for ¨
the above .
In practice, this allows us to approximate reasonable mappings on DPAGs
rep¢�¡ �¤
as
¡�for
¡
well: two  
DPAGs   and all require that for any
rep¢�  ¤
vertex in the given two DPAGs and all paths following the connections towards
children or parents, only vertices of the two DPAGs with the same label and the
same number of children and parents are reached in   and  
©
¡
. Hence this requires
highly symmetric situations such that CRecCC networks can be expected to perform
quite well in practical applications even for general DPAGs.
We would like to add a remark on the form of functions to be approximated. So
far we have considered supersource transductions, i.e. functions which map a struc-
ture to a single value, i.e. the recursively computed output of the supersource of the
structure. IO-isomorphic transductions refer to more general functions, which map
a given input DPAG to a DPAG of the same structure where each vertex label of the
input DPAG���£¡�is substituted by a value�¡�¢¡�. Thereby, the value�¡�¢¡�of an output
functions�
¡
¡    
�� vertex might depend on the entire input DPAG. Thus IO-isomporphic transductions
are
�DPAG DPAG with label sets and ¡
�
every¡
¡ � such that���¡��has the
same structure as¡for DPAG . As beforehand, we assume that  
and
58
¡
are subsets of a real-vector space. We call a function�measurable if for every
fixed input structure of DPAGs and every fixed vertex in the structure the mapping,
which maps the input DPAG to the output value of the vertex is Borel-measurable.
The distance of two output DBAGs of the same structure is measured by taking the
maximum distance of the labels of vertices at the same position. We can obviously
obtain IO-isomorphic transductions from CRecCC networks by setting the output
label of a vertex ¡ of a given input structure to the output of the CRecCC network
after processing the vertex ¡ . The question now occurs whether all measurable
IO-isomorphic transductions on DBAGs can be approximated by some CRecCC
network.
To assess this question, we can use the same steps as beforehand, whereby the
only difference occurs for step (I): we have to compute an appropriate output after
each presentation of a vertex, thus we have to find unique linear codes for each ver-
tex in each possible DBAG. If this is possible, we can solve the problem to map each
encoded vertex of a DBAG to the desired output as beforehand. Thus, steps (II) to
(V) can be transferred from the previous case. Assume, a DBAG   and a vertex ¡
in   is given. Consider the representation rep¦�£¡�. If the height of ¡ is smaller than
�, we can reconstruct the actual vertex ¡ including its label���¢¡�and all linear path
representations of the root to ¡ from this term because of Lemma 3.12. In addi-
tion, we can mimic the construction of Lemma 3.13 and collect all direct or indirect
successors of ¡ together with the linear path representations of all paths from the
root to these vertices. For ¡ predecessors of , however, we have to add an argument:
note that rep¦�¢¡�has the form � ¦����£¡��rep¦�ch���¢¡���rep¦�ch ¡�¢¡����rep¦ ��pa��£¡���,
��pa
 
 
rep¦
¡�¢¡����. Thus, if the DBAG has most��� ¥ height at , we can also reconstruct
the two ¡ parents of and all linear path rep¦
representations from
and rep¦ ��pa
 
  ��pa��£¡���
¡�¢¡��. Iterating this argument yields the result, that the vertices and
59
all linear path representations of the DBAG can be reconstructed from rep¦�¢¡�as in
Lemma 3.13 if the DBAG has height most�¥ at �. Note that for each reconstruction
step the index of rep¢�¢¡�is decreased by one, such that we have to start with
¥�
twice the height. Stated in other words, rep ¡ ¦ ¥��¢¡�is a unique linear representation
of ¡ with respect to the vertex ¡ and the entire DBAG if the height of the DBAG
is at most�.As a consequence we can conclude that CRecCC networks are also
universal approximators for IO-isomorphic transduction on DBAGs.
4 Discussion
We have considered various cascade correlation models enhanced by recurrent or
recursive connections for structured input data. These models have been proposed
in the literature based on recurrent or recursive neural networks in analogy to simple
cascade correlation. Recurrent and recursive networks constitute well established
machine learning tools to deal with sequences or tree structures as input, respec-
tively. They have successfully been applied for various type of classification and
regression problems on structured data ranging from picture processing up to chem-
istry (Frasconi, Gori, and Sperduti, 1998). However, it is well known that training
recurrent networks faces severe problems (Bengio, Simard, and Frasconi, 1994) and
the generalization ability might be considerably worse compared to standard feed-
forward networks (Hammer, 2001). Moreover, classification tasks for sequences
and tree structures are likely more difficult than standard learning problems for
feedforward networks due to the increased complexity of data. Hence construc-
tive learning algorithms which divide the problem into several subtasks and auto-
matically find an appropriate (preferably small) architecture are particularly useful
within structured domains. Cascade correlation for structures has shown superior
60
performance for several problems in chemistry (Bianucci et al., 2000) and seems
particularly appropriate due to very low training effort (the weights of only one
neuron are trained at a time) and an automatically growing structure.
Unlike the modification of standard feedforward networks to feedforward cas-
cade correlation networks, the specific training scheme of CC restricts the possible
dynamics of recurrent cascade architectures to only partially recurrent systems. As
a consequence, the class of recurrent cascade architectures constitutes a proper sub-
class of the class of recurrent network architectures. This might restrict the capabil-
ity of these type of networks compared to the general model, as demonstrated e.g.
in (Giles et al., 1995). Hence the question was whether the restriction of the recur-
rence in cascade models poses severe restrictions on the applicability of the models
because possibly only very simple functions can be represented with these restricted
architectures. We have shown in this article that in terms of the universal approxi-
mation capability of these models, no restriction can be observed: CC-architectures
can approximate every measurable function with sequences or tree structures as in-
put, respectively, arbitrarily well up to a set of arbitrary small probability in spite of
their restricted recurrence. So if fundamental differences concerning the capacity of
these systems in comparison to their fully recurrent counterparts exist (as demon-
strated in (Giles et al., 1995)) then these fundamental differences will not manifest
for any given finite set of training data or for any given finite time horizon or height
of the trees, respectively, provided enough neurons are available.
The restriction of recurrence in cascade models offers a very natural possibility
to enlarge the dynamics: not only information provided by successors of vertices of
a graph or tree structure can be taken into account, but also contextual information
provided by the predecessors of the vertices. These contextual recursive models
have been proposed in (Micheli, Sona, and Sperduti, 2003b; Micheli, Sona, and
61
Sperduti, 2002). It has been demonstrated that this additional information is actu-
ally used by the contextual models such that tasks which involve global information
of the structure can be solved more easily with these models. Note that also for
standard recurrent and recursive networks several models which integrate global in-
formation for a restricted form of structured data such as sequences or grids have
been proposed in the literature and successfully been applied e.g. in bioinformatics
where often spatial data, instead of temporal data, has to be dealt with (Baldi et
al., 1999; Micheli, Sona, and Sperduti, 2000; Pollastri et al., 2002; Wakuya and
Zurada, 2001). However, unlike contextual cascade models, structure processing
is restricted to special subclasses and integration of contextual information is here
done on top of the basic recursive networks. This is necessary because of full re-
currence of the networks. Context integration in form of structural induction would
yield a cyclic, ill-defined dynamics.
The restriction of the recurrence in cascade models allows to integrate the con-
textual information directly into the model in a very elegant and efficient way. As
already demonstrated in (Micheli, Sona, and Sperduti, 2003b), this moreover en-
larges the class of structures the models can deal with from tree structures to acyclic
rooted positional graphs with limited fan-in and fan-out. We went a step further in
this article by showing that contextual cascade models constitute in fact universal
approximators for the class of functions defined on acyclic directed graphs as input.
More precisely, we have identified a subclass of DPAGs which we called DBAGs
where enumeration of children and parents is compatible and we have shown the
universal approximation capability for this subclass. Moreover, we have shown that
any DPAG can be transformed into a DBAG in a natural way via possible expansion
by empty children and parents and reenumeration of children and parents. Hence
acyclic graph structures where the positioning is not relevant can easily be dealt with
62
in this way. In addition, the approximation capability is extended to IO-isomorphic
transductions, i.e. CRecCC networks constitute one of the first models for which a
proof exists that they can also approximate functions with structured outputs. Thus,
the universal approximation capability of cascade architectures with restricted re-
currence and context integration is enlarged to general data structures such as rooted
directed acyclic graphs with limited fan-in and fan-out.
References
Alquezari, R. and Sanfeliu, A. (1995). An algebraic framework to represent fi-
nite state machines in single-layer recurrent neural networks. Neural Com-
putation, 7(5):931-949.
Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., and Soda, G. (1999). Exploit-
ing the past and future in protein secondary structure prediction. Bioinfor-
matics, 15(11):937-946.
Bengio Y. and Frasconi, P. (1996). Input/output HMMs for sequence process-
ing. IEEE Transactions on Neural Networks, 7(5):1231-1249.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term depen-
dencies with gradient descent is difficult. IEEE Transactions on Neural
Networks, 5(2):157–166.
Bianucci, A.M., Micheli, A., Sperduti, A., and Starita, A. (2000). Applica-
tion of cascade correlation networks for structures to chemistry. Journal of
Applied Intelligence, 12:117-146.
Diligenti, M., Frasconi, P., and Gori, M. (2003). Hidden tree Markov models
for document image classification. IEEE Transactions on Pattern Analysis
63
and Machine Intelligence, 25(4):519-523.
Fahlmann, S.E. and Lebiere, C. (1990). The cascade-correlation learning ar-
chitecture. In: D.S.Touretzky (ed.), Advances in Neural Information Pro-
cessing Systems 2, Morgan Kaufmann, pp. 524-532.
Fahlmann, S.E. (1991). The recurrent cascade-correlation architecture. In:
R.P.Lippmann, J.E.Moody, and D.S.Touretzky (eds.), Advances in Neural
Information Processing Systems 3, Morgan Kaufmann, pp. 190-196.
Frasconi, P. (2002). Comparing convolution kernels and RNNs on a wide-
coverage computational analysis of natural language. NIPS 2002, Work-
shop on unreal data.
Frasconi, P. and Gori, M. (1996). Computational capabilities of local-feedback
recurrent networks acting as finite-state machines. IEEE Transactions on
Neural Networks, 7(6):1521-1524.
Frasconi, P., Gori, M., and Sperduti,A. (1998). A general framework of adap-
tive processing of data structures. IEEE Transactions on Neural Networks,
9(5): 768-786.
Forcada M.L. and Carrasco, R.C. (1995). Learning the initial state of a second
order recurrent neural network during regular language inference. Neural
Computation, 7(5):923–949.
Funahashi K. and Nakamura, Y. (1993). Approximation of dynamical systems
by continuous time recurrent neural networks. Neural Networks, 6(6):801–
806.
Giles, C.L., Chen, D., Sun, G.Z., Chen, H.H., Lee, Y.C., and Goudreau, M.W.
(1995). Constructive learning of recurrent neural networks: limitations of
64
recurrent cascade correlation and a simple solution. IEEE Transactions on
Neural Networks, 6(4):829-836.
Goller, C. (1997). A connectionist approach for learning search control
heuristics for automated deduction systems. PhD thesis, Technische Uni-
versität München.
Goller, C. and Küchler,A. (1996). Learning task-dependent distributed
structure-representations by backpropagation through structure. In: IEEE
International Conference on Neural Networks, pp. 347-352.
Hammer, B. (2001). Generalization ability of folding networks. IEEE Trans-
actions on Knowledge and Data Engineering, 13(2):196–206.
Hammer, B. (2000). Learning with recurrent neural networks. Springer Lec-
ture Notes in Control an Information Sciences 254, Springer.
Hammer, B. and Steil, J.J. (2002). Perspectives on learning with recurrent
neural networks. In M. Verleysen, editor, ESANN’02, pages 357-368. D-
side publications.
Haussler, D. (1999). Convolution kernels on discrete structure. Technical re-
port, UC Santa Cruz.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward
networks are universal approximators. Neural Networks, 2:359-366.
Jaakkola, T., Diekhans, M., and Haussler, D. (2000). A discriminative frame-
work for detecting remote protein homologies. Journal of Computational
Biology, 7:95–114.
Kremer S. (2001). Spatio-temporal connectionist networks: A taxonomy and
review. Neural Computation, 13(2):249-306.
65
Leslie, C., Eskin, E., and Noble, W. (2002). The spectrum kernel: A string
kernel for SVM protein classification. Proc. Pacific Symposium on Bio-
computing, pages 564-575.
Lodhi, H., Shawe-Taylor, J., Cristianini, N., and Watkins, C.J.C.H. (2000).
Text classification using string kernels. NIPS, pages 563-569.
Mauro, C. de, Diligenti, M., Gori, M., and Maggini, M. (2003). Similarity
learning for graph-based image representations. Pattern Recognition Let-
ters 24(8):1115-1122.
Micheli, A. Recursive Processing of Structured Domains in Machine Learn-
ing. PhD thesis, Technical Report TD-13/03, Department of Computer Sci-
ence, University of Pisa.
Micheli, A., Sona, D., and Sperduti, A. (2003a). Formal definition of con-
text in contextual recursive cascade correlation networks. In E.Alpaydin,
E.Oja, and L.Xu (eds.), Proceedings of ICANN/ICONIP 2003, LNCS
2714, pages 173-130.
Micheli, A., Sona, D., and Sperduti, A. (2003b). Contextual processing of
structured data by recursive cascade correlation. To appear in IEEE Trans-
actions on Neural Networks.
Micheli, A., Sona, D., and Sperduti, A. (2002). Recursive cascade correla-
tion for contextual processing of structured data, In: Proceedings of the
International Joint Conference on Neural Networks 2002, 1, pp.268-273.
Micheli, A., Sona, D., and Sperduti, A. (2000). Bi-causal recurrent cascade
correlation. In: Proceedings of the International Joint Conference on Neu-
ral Networks, vol.3, pp.3-8.
Pollastri, G., Baldi, P., Vullo, A., and Frasconi, P. (2002). Prediction of protein
66
topologies using GIOHMMs and GRNNs. In: Advances of Neural Infor-
mation Processing Systems 2002.
Scarselli, F. and Tsoi, A.C. (1998). Universal approximation using feedfor-
ward neural networks: A survey of some existing methods, and some new
results. Neural Networks, 11:15–37.
Sontag, E.D. (1992). Feedforward nets for interpolation and classification.
Journal of Computer and System Sciences, 45:20-48.
Sperduti, A. (1997). On the computational power of neural networks for struc-
tures. Neural Networks, 10(3):395-400.
Sperduti, A., Majidi, D., and Starita,A. (1996). Extended cascade-correlation
for syntactic and structural pattern recognition. In: P.Perner, P.Wand,
A.Rosenfeld (eds.), Advances in Structured and Syntactical Pattern Recog-
nition, Springer, pp.90-99.
Sperduti, A. and Starita, A. (1997). Supervised neural networks for the classi-
fication of structures. IEEE Transactions on Neural Networks, 8(3): 714-
735.
Sturt, P., Costa, F., Lombardo, V., and Frasconi, P. (2003). Learning first-pass
structural attachment preferences with dynamic grammars and recursive
neural networks. Cognition, 88(2):133-169.
Sun, R. (2001). Introduction to sequence learning. R. Sun, C.L. Giles (eds.),
Sequence Learning: Paradigms, Algorithms, and Applications, pp. 1-10,
Springer.
Vullo, A. and Frasconi, P. (2003). A recursive connectionist approach for pre-
dicting disulfide connectivity in proteins. Proceedings of the Eighteenth
67
Annual ACM Symposium on Applied Computing (SAC 2003), Melbourne,
FL.
Watkins, C. (1999). Dynamic alignment kernels. Technical report, UL Royal
Holloway.
Wakuya, H. and Zurada, J. (2001). Bi-directional computing architectures for
Appendix
time series prediction. Neural Networks, 14:1307-1321.
Proof of Theorem 3.5
Assume a stationary CRecCC network is given. Consider the two graphs depicted
in Fig. 3. We refer to the vertices by the numbers as indicated in the figure. Thereby,
enumeration of children and parents is done from left to right as they appear in the
figure. I.e. the upper left box of a vertex refers to parent number one, the upper right
box to parent number two, the lower left box indicates child number one, the lower
right box refers to child number two. The labels of all vertices are identical, say
�. Note that the structures are highly symmetric: each pair of vertices and ¡
the same number of children and parents and the same label. Moreover, and ¡
pointing to vertices which have the same structure in the left and the right DPAG.
We would like to point out one aspect of these structures: edges���§�
consider the
¡
¡�. and § constitute the respective second child of the root, whereas�
§ and��¡�§
¡
constitute the first parent § of § and , respectively. For all other edges, and�¡
the
positioning of the respective parent coincides with the positioning of the respective
child.
As a first step, we show by induction over that the following equalities hold
68
has
are
for the hidden neurons’ activation:
¡��¤�¡£¢�  ¡��¡£¢�¥
¡��¡£¢�  ¡��¤�¡£¢� 
¡£¢��¡��¤�¡£¢�§ ¡£¢����¤�¡£¢�§��¡£¢�¥��¤�¡£¢� ��¡£¢�¡ ��¤�¡£¢� ��
This indicates that a large symmetry can already be found within each structure.
¡�
¡�� ��¤
�:  ��¤ ¡�� ��  ¤
�����¡�� ����
�������
�����¡�� ����  ��¤�¡�� ��
and ¡�����¤ �����¡��¥��¡��¡ ����¤ �����¡�� ��¡�� ���¤�¡��§�
¡�� 
¡��¥��¤
 �¤�¡�� 
 ��¤
¡��
¡��¤��������
¡��¥
Induction step for :
 ���
¡��¤������¡��¡ 
�����¡�� 
 ��¤�¡��  ¡����
¡��
¡���¡��¤������¡��¥ ¡��¡�� 
 ���
�����¡��  ¡��¡��¡ 
 ��¤ ¡����
¡����¤
¤ ��¢���¡��¡ ����¡£¢  ¡£¢� �
��¢���¡�� ����¡£¢ 
¤ ¡£¢� � ¤
¤ ��¢���¡��¥����¡£¢  ¡£¢�¥�
��¢���¡�� ����¡£¢ 
¤ ¡£¢� � ¤
¤ ��¢���¡�������¡£¢  ¡£¢���
��¢���¡��§����¡£¢ 
¤ ¡£¢�§� ¤
 ����¢ ���  ����¢ �¡��¥����¡£¢ 
�� ����  ����¢ ���  ����¢ �¡�� ����¡£¢  ��¡ ����
¡��§ ¡�
¡���¤
��¥��¡�������¡£¢ 
�� ��¡��§����¡£¢  ���§��� �������
 ����¢ �¡�������¡£¢  �������� ¢����¢ 
�
�� ��¡�� ����¡£¢� ����  ����¢ �¡��§����¡£¢  ���§���� ¢����¢ 
��¥��¡�� ����¡£¢�¡ ����
��§��¡�� ����¡£¢� ��¡�� ����¡£¢� ��¡��������¡£¢ 
�����¡��¥����¡£¢�¥��¡�� ����¡£¢�¡ ��¡��������¡£¢ 
69
¢
¢
��
�
¢����¢ 
������� ¢����¢  �������
¢
¢
��
��
and¡£¢�¡  ��¢���¡��¡  ¡����¡£¢ 
¡� ¤ ¡����¡£¢ 
��¢���¡��  ¡£¢�  ¡� ¡£¢�¥
¤
��¢���¡��¥
¤
¡����¡£¢ 
¡�
��¢���¡��  ¡����¡£¢  ¤ ¤
¡£¢�  ¡£¢��¡� ¤ ¡�
��¢���¡��§ ¤ ¡����¡£¢ 
��¢���¡���¡����¡£¢  ¤
�� 
�� 
��¥
 ����¢ �¡��¥ ¡����¡£¢ 
¡����
¡����
 ����¢ �¡��  ¡����¡£¢ 
 ����¢ ���
 ����¢ ���
¡����¡£¢�¡ 
¡��¡��¡  ��  ¡��¡��  ¡����¡£¢� 
 ����¢ �¡���§
¡����
¡����¡£¢ 
 ����¢ �¡����¡����¡£¢ 
¡����
¡����¡£¢�¥ ¡��¡��  ���¡��¡��¥
¡��¡��  ¡����¡£¢�  ¡��¡��  ��§
¡£¢�§
Next, we show by induction over that¡¢� ��¤�¡£¢� 
¡� ¤
¡��¡��§ ��  ¡��¡���¡����¡£¢ 
¡����¡£¢ 
��¥
��§
��§ ¡��� ���¡���
¢����¢ 
¡���� ¢����¢  ���¡����
¡��¡���¡����¡£¢ 
¡����¡£¢� 
¡����¡£¢�  ¡��¡���¡����¡£¢ 
¡�holds for   all ¡��������  £ .
In particular, any CRecCC network yields the same output for both graphs indepen-
�
dently of the precise form of the network functions �¢.
�:
¡��¤�¡�� 
 �¤�¡��¡   ¤ ¡��¡ ��¤�¡�� ��¤�������� ¡��¥��¤�¡�� ��¤������¡��¡ ����
¡�
 ��¤������¡��¡ 
 ���
�����¡��¥ ¡��¡�� 
�����¡���¡��¡��§
¡����
¡���¤�¡���¡�
¡�����¤������¡�����¡��§���¤ ¡�����¤�¡��§��¤������¡��¥��¡��¡ ���¤
70
 ��¤ ¡��¥
¢
¢
��
��
¢����¢ 
���¡���� ¢����¢  ���¡����
¡��¤�¡��  ¡��§ ¡�
¡���¤�¡���¡��¤ ¡�
¢
¢
��
��
Induction step for :
¤ ��¡ ����
 ����¢ �¡��¥����¡£¢  ��¥��¡�������¡£¢  �������
¡£¢�¡ �
��¢���¡��¡ ����¡£¢ 
��¢���¡��¡ ����¡£¢ 
 ����¢ ���  ����¢ �¡��¥����¡£¢  ��¥��¡��§����¡£¢  ���§���
��¢���¡��¡ 
¡��¡��§
 ����¢ ���
 ����¢ �¡��¥
¡����¡£¢ 
¡����  ����¢ ���
¡����¡£¢ 
¡����¡£¢ 
¤ ��¡ ���� ¤
�� 
¡��¤�¡£¢�  ¡��¤�¡£¢� �
¡£¢� 
¡���
¤
��¥��¡�� ����¡£¢�¡ ����  ����¢ �¡�������¡£¢  �������� ¢����¢ 
�
¡£¢�¥�
��¢���¡��¥
¡��¡��¡  ¡����¡£¢�¡ 
¡����¡£¢ 
¡����
��¢���¡��¥����¡£¢ 
���¡���� ¢����¢ 
�
 ����¢ �¡����¡����¡£¢ 
¤ ¤
��¥
¡��¤�¡£¢�  ¡£¢�¥
¡��¤�¡£¢� � ¤
¤ �����¡��¥����¡£¢�¥��¡�� ����¡£¢�¡ ��¡��������¡£¢ 
¢����¢ 
¡£¢��� ��¢���¡�������¡£¢ 
¢����¢ 
���¡��¡��¥ ¡����¡£¢�¥
�������
¡��¡��  ¡����¡£¢� 
��¢���¡���¡����¡£¢ 
¡��¡���¡����¡£¢ 
��¢���¡�������¡£¢ 
������� ���¡����
¤ �����¡��¥����¡£¢�¥��¡�� ����¡£¢� ��¡��������¡£¢ 
¤
¡£¢��¡��¤�¡£¢�§ ¡��¤�¡£¢�§� ¡£¢��� ¤
��¢���¡�������¡£¢  ������¡�������¡£¢����¡��§����¡£¢�§���� ¢����¢ 
¤ ��¢���¡���¡����¡£¢  ¤
���¡��¡���¡����¡£¢��¡��¡��§
��¥
¢ ����
�
¢
��§
�
¢ ¡����¡£¢�§
¢����¢  ¢����¢ 
¢����¢ 
¡���� ���
¢
�
�
¢
�
¢
¢
��
¢����¢ 
Hence the structures cannot be differentiated by any CRecCC network regardless of
¡£¢��¡� ¤ �
the chosen specific weights of the network and form of the neurons. ¨
Proof of Theorem 3.7
Assume a DPAG   is given. If   is empty, the result is obvious. Assume   is not
empty. The first construction steps can be done for any limited fan-in and fan-out at
71
¢
�
��
¢
�
�
original graph
a
0
1
b c
2 3
1
d
d a
1’
b c
2
split graph
2’
a
3
0
0’
3’
dual graph
Figure 4: Example of the transformation of a graph (left) for which a compatible
positioning has to be found to the graph where each vertex is split into a parent ver-
tex denoted by�, . . .   , and a child vertex by�¡
denoted , . . .   ,
b
o
o
¡
d
c
i
(middle). Thereby,
connections to empty childen and connections from empty parents are dropped.
If the connections are considered as undirected connections, the dual graph can be
constructed (right). Now an enumeration of the vertices in the dual graph which cor-
responds to a compatible positioning of children and parents in the original DPAG
can be constructed.
72
most ¤ , thus we describe the steps for general ¤ .
Given   , consider the following related (split) graph  
ents and children is explicitely assigned to the vertices: vertices   in
edge� �£,edges �
¡
where the role of par-
¡
are ¡§¦
and edge�¡ �£ ¡ ¡
¡
in   are   �¡�¦�¡� ¡�¦
edge�¡ �£. See Fig. 4 (middle) for an example. This graph can be further trans-
�
formed to its dual graph  
¡¡ 
with labeled edges with labels in ¡ �¢
obtain as follows: the dual graph introduces a vertex ¡£¢ for each edge � in  
an edge�¢¡¤¢�¡��if a vertex in  
vertex. I.e. � ¤
or � ¤
vertices ¦
�¦
 �and� ¢�¡
¢�¡  �and� ¤ ¤
¢ , ¡   , ¦ ¡
¢ , ¡
�¦
¡
  in  
connected via a child in  
in  
¡¡ 
�¦
¡
¡
¢�¡
¡
¢�¡¡ � �¡�¦�¡�
£
¢�¡�¦�¡��
which we
¡
and
exists such that � and�are both connected to this
�¦
¡
¢�¡¡ �(the edges are connected via a child)
¡
 �(the edges are connected via a parent) for some
. An edge�¢¡¤¢�¡��in  
, i.e. they are both ingoing edges in  
is labeled with ¢ if � and�are connected via a parent in  
outgoing edges in  
¡
. See Fig. 4(right) for an example.
Now, we can relate positionings of   to positionings of  
¡¡ 
is labeled with if � and�are
¡
¡
¡
. edge�£¡¤¢�¡��
An
, i.e. they are both
and positionings of
¡
¡¡ 
¡
¡
     
 
 
  ¤
to enumerations of to prove the theorem. First consider and . has
fan-in and fan-out like . Since the positioning of the children of a fixed vertex
can be done independently from the positioning of the parents of this vertex, the
following holds: there is a one-to-one connection between compatible positionings
¡
and� ©¨�� ¨�of the vertices in   . Thereby,  
of� 
is no longer a rooted DPAG, however, the notion of a compatible positioning in the
¨��
¡
¨�of the ¡ vertices   in
¡
above form can also be applied for unrooted DPAGs. The one-to-one connection is
¡
¡
  ¨¦¥�¦
�
¡¦ 
given by the identities ¨��¦��and �¡ ��¢¡�.This holds because
positioning of the (empty) children of and of the (empty) parents of ¦
¢ can be
done arbitrarily.
Note that any compatible positioning of  
¢�¤  
73
 ¨§�£¡  �¤
¡
corresponds to the assignment of
¡
a
b
de
c f
a
o
b
c
o
d
i i
e
o
f
Figure 5: Example of the structure of the dual graph (right) for a given DPAG (left):
it decomposes into linear subgraphs and ring structures with an even number of
vertices.
a number in ¡��������¤©£
each edge in the graph such that edges connected to
to
the same vertex have different numbers: edge�¦ ¢�¡ an  �is assigned the number
 
¡
¨¦¥�¦
¢�¤ �
¡
§�¢¡¡ �;these values coincide for a compatible positioning. Since the
 
numbers come from a positioning, edges connected to the same vertex have different
numbers. Conversely, any assignment of numbers ¡��������¤�£ ¡
to the  
edges of such
that edges connected to the same vertex have different numbers can be transformed
¡
to a compatible positioning. If a value is edge�¦
assigned to   ¨ ¥�¦
¢�¡¡ �we set
¡
 .Since edges connected to the same vertex have different numbers, this
  §�¢¡¡ �¤
defines a valid positioning.
�
Now consider the dual graph  
the edges in  
¡
such that edges in  
¡¡ 
¡
. Note the following: Assigning numbers to
¢��¤
connected to the same vertex have different
¡¡ 
¡  
 
 
¡¡ 
 
numbers means to assign numbers to the vertices in such that vertices in
which are connected in have different numbers. We have already argued that
such an assignment for  
¡
leads to a compatible positioning of the graph   .
¤
¡
 
¤
¡¡ 
 
We now use the restriction on . Since has limited fan-in and fan-out
equal to two, each vertex in has at most one connection labeled with and one
connection labeled with ¢ ¡¡ 
 
¡¡ 
. Thus each vertex in has at most two edges. Hence
decomposes into connected components which have a very simple structure: the
 
74
components constitute either a linear graph, or they have a ring structure (see Fig. 5).
Thereby, each ring has necessarily an even number of vertices because each vertex is
connected by one edge labeled with and one edge labeled with ¢ . The assignment
of values in ¡��¥
 
¡¡ 
can be done independently for each connected component of
£
. For a linear structure, we can simply assign consecutively alternating values
to the vertices. For a ring, we can do the same starting at some arbitrary position
because the number of vertices in a ring is even. Thus, an appropriate assignment
of numbers to of the vertices of  
¡  
such that a compatible positioning of   arises
is always possible. ¨
Proof of Theorem 3.8
As beforehand, we can construct the  
graph
edge�¡ �£
¡
from   with vertices ¡§¦
¡
are ¡�¦
and edge�¡ �£ ¡ and  
¡ edges in
¡¡ 
  ¡
edge�¡ �£, and  
its dual graph .  
Each vertex in has at most ��con-
¥¨�¤�  �¡�¦�¡��
¡
 
¡
¡
¤�  
 
¤� nections, since each edge in is connected to a child with at most �further
ingoing edges in and a parent with at most �further outgoing edges in .
Hence we can obviously assign numbers ¡��������¥ �£ ¤��
¢�¡¡ ��¡�¦�¡��
¢�¢¡�¦�¡��
¡  
such
to each  
vertex in
¡¡ 
that no two vertices which are  
connected in have the same numbers assigned;
even if all ��neighbors of a vertex are already enumerated, there is still one
¥¨�¤�
number available. These numbers correspond to a compatible positioning of chil-
dren and parents in the original   DPAG with positions in ¡��������¥ �£. ¤��
Hence
filling the not yet assigned positions for children and parents with empty children
and parents, we obtain a DPAG from   with compatible positioning and fan-in and
fan-out ¥ ¤��
�,
thereby expanding with empty children and parents. ¨
75
