Introduction to Neural Networks and Computational Complexity	Jason Kroll
============================================================	May 2003

-----=====*****=====-----
 Neural Network Overview
-----=====*****=====-----

A neural network is a statistical construct adept at inferring functions
and thereby mapping inputs to corresponding outputs. Neural networks are
more than mere statistical techniques; a neural network meeting certain
simple preconditions becomes Turing universal. Any computatable function
can be computed on a neural network. Additionally, neural networks exhibit
Super-Turing capabilities, such as simulating any Turing machine in
constant time. Because of Turing (and Super-Turing) universality, the
neural network becomes, in a computational theoretic sense, an alternative
model to Turing machines and their Von Neumann descendants.

Indeed we may treat neural networks, in terms of computational theory,
similarly to how one treats Turing machines, and thereby define complexity
classes, not necessarily according to machine time and tape-space
consumption as on Turing machines, but according to the properties unique
to neural networks. Although, owing to Turing Universality, in principle
one can implement any neural network on a Turing machine and vice versa,
as one may likewise swap back and forth between any Turing Universal
computational architectures, the computational substrate will affect
real-world computation time and space consumption if not necessarily
asymptotic complexity. (Neural networks are fundamentally parallel, like
k-tape Turing machines, so there will be a polynomial-time increase in
time consumption in a translation to any serial architecture.)

The complexity specification for a neural network is based on fundamental
characteristics of the network. In general, one speaks of threshold
complexity and analog complexity, where the former represents
computational complexity on a threshold circuit while the latter considers
this on a continuous circuit. Further, one considers the necessary depth
in layers of said neural networks (when operating in parallel, depth
becomes the time analog). Therefore, instead of classes such as P, NP,
PSPACE and EXPTIME, one has TC1, TC2, TC3, and AC1, AC2, AC3, etc.
Threshold and analog circuits, along with layer depth, weight values,
activation functions, and other preconditions, will be described later;  
now it is necessary to introduce our fundamental unit of computation.

The neural network may be defined loosely as a set of nodes (an input
layer, zero or more hidden layers, and an output layer), input weights,
activation function(s), and update rule(s). Implementations vary greatly,
but the underlying computational model is not much changed from one to
another. We shall in general assume the simplest models.

The following equation describes the value of a node in a neural network:

a(j) = g( sum( aj * Wij) ) for i = 1 to n

where a(j) is the activation level or value of node j, Wij is the weight
value placed on input from a(i) into node j, the sum function is usually
the simple summation function, and g is an activation function which maps
the value of the summed inputs to some output value. g usually
renormalizes the sums between 0 and 1, and can dramatically affect
computational complexity as we shall see later.

Computationally, instead of tape space and processor time, we are faced
with node quantity, activation function(s), corresponding outputs, levels
of connectedness, ranges of weight values, graph topology, and update 
procedure. We can ignore any training or learning algorithms such as
back-propogation because theoretically we are concerned with a neural
network performing a computation, rather than how one would go about
producing such a network.

In describing complexity classes in neural networks we are concerned, as
with traditional Turing machines, with the space consumed by the
computation and the time consumed by the computation. However, while a
Turing machine has tape space and a transition table, neural networks have
only nodes and weights. Consequently, the space measurement analog on the
neural network becomes the node count. And, like k-tape Turing machines,
neural networks are thought to run in parallel, so we must treat each 
global update of the network as a step (in feed-forward networks this 
means that the time is roughly proportional to the number of layers or 
depth). Inextricably connected to the depth of a network is the 
connectedness of a network; that is, how many neurons at one level connect 
to neurons at another, and so on.

To simplify, there is in general a computational tradeoff between time and
space, and in the case of neural networks this means the tradeoff is
between quantity of nodes, quantity of layers, and connectedness of nodes.  
All of these may be said to have a positive partial derivative with
respect to accuracy of computing some function, and may be traded with
each other to meet varying demands of time and space. These do not
directly affect, except in extreme cases (such as depths less than 3), the
kinds of functions which are computable.

The elements of a neural network which directly affect the kinds of
functions which are computable by that network are the weight values and
the activation function. These have profound implications which we will
investigate later. However, they are excluded from time and space
complexity analysis. Similarly, the update function used to determine the
order of updates (simultaneous, staggered, or serial) is excluded from
computational complexity analysis altogether (in general we assume
simultaneous updates, though other orderings have no effect on
computability). It is worth noting at this point the necessary
preconditions for Turing universality: a neural network must be recurrent
and use a saturated linear activation function, it must use at least
rational weights and preferably real weights, and there must be a
sufficient number of nodes. With these preconditions in mind, we may now
investigate specific properties in detail to see how each contributes to
the complexity of the neural network as a computational architecture.

-----=====*****=====-----
Computational Complexity
-----=====*****=====-----

The computational complexity of a neural network is contingent upon many
factors. Neural networks are complicated devices, quite unlike Turing
machines, with an inordinate number of implementations and no generally
accepted standard form. Consequently we aim for an intuitive and
qualitative though comprehensive overview which should adapt itself easily
to any specific implementation of a network.

The computationally important aspects of a neural network include the
network topology (the properties of the the graph of nodes), the signal
propogation method (the flow of signals through the graph), the activation
function (used to map inputs to outputs on the neuronal level), the input
weights (used to determine the relative values of inputs to a neuron from
some other neurons), the update procedure (used during the training
process to change weights), nodal complexity (the node count in the
network), and the computational model (whether deterministic or
probablistic). In general, it is difficult to speak of time consumption
because feed-forward networks will terminate very quickly, while recurrent
nets may never halt. (It is curious that the halting problem should be
undecidable on Turing machines yet PSPACE-complete for computationally
equivalent recurrent networks.) Computational complexity is evaluated in
terms of how complicated a neural network must be in order to perform that
computation. Now we may examine those elements which contribute to
complexity.

----------------
Network Topology
----------------

The neural network topology can be said to consist of an input layer, an
output layer, and some number of hidden layers inbetween the input and
output layers. We analyze the network topology in terms of its size, depth
and connectivity.

The input layer is the sensory apparatus of the network. These neurons
typically do not accept weighted inputs, they simply represent by their
activation levels the state of the input. There must be sufficient input
nodes in the input layer to represent the input data acceptably. If we
take this in a mathematically perfect sense, there must be enough
different possible states of the input level to account for all possible
input states. For example, a binary network (with activation levels of
{0,1}) would take logn nodes to represent n possible inputs. In real-world
applications such as sight or hearing, perfect representation of the input
in not only unnecessary; it is impossible. Still, the input level must be
capable of representing the input sufficiently for the rest of the layers
to compute. In general, we designate the variable m to represent the
number of input nodes in a net.

The output layer is the last layer in the network as well as the output
device. In a Turing machine this would simply be {Accept, Reject}, but a
neural network can have any set of representable outputs. For decidability
problems, one might have one node that either activates (accepts) or fails
to activate (rejects). This simplest case makes for easy analysis, and is
at the basis of Minksy and Papert's proof that single-layer feed-forward
neural networks have serious computational limits (they can only compute
linearly seperable functions). However, once one adds more than one
output, the analysis becomes considerably more difficult, and once one
adds progressively more hidden layers, computational limitations
disappear.

The hidden layers are the layers of nodes inbetween the input layer and
the output layer. Together with the output layer, these do the
computational work of the neural network. A feed-forward neural network
with no hidden layers is known as a perceptron, and is said to be a
single-layer, or depth 1, network. Perceptrons cannot even compute all
Boolean functions, yet the computational power of hidden networks is
enormous. After the addition of a single hidden layer, it becomes possible
(all other conditions being met) to represent all continuous functions
mapping the input layer to the output layer. A second hidden layer allows
the mapping of discontinuous functions. [Sima 2001] speculates that, with
other preconditions met (saturated linear activation of real weights on
sufficient nodes), a depth 3 feed-forward network should be capable of
solving all problems in class P, and probably NP, in linear space relative
to the analogous Turing machine implementation. We shall examine the
preconditions for this result in more detail later.

In analyzing the network, our first measure is of course the size of the
network. There are different ways of measuring the size. One may count the
number of nodes, which is known as node complexity, one may count the
number of gates necessary to produce the network as an electric circuit,
or one can adopt the VLSI approach and measure the sum total of wire
necessary to produce such a network. We will prefer the first method as it
is simplest, but one must keep in mind the profound influence
connectedness has on other two measures and indeed on complexity overall.

Network depth has already been discussed in the context of hidden layers;
it remains to be added that any depth layer must have a sufficient number
of nodes. In general, this should be greater than or equal to the number
of input nodes or significant information loss will occur. For problems 
involving recursion or iteration on non-recurrent networks, depth can 
become critical, although it is speculated that in general depth of 3 is 
adequate for class NP.

Connectivity in a network topology is analogous to connectivity in a
graph. In general, fully-connected graphs are uncommon in neural
applications; in rare cases where fully-connected graphs are employed, the
number of nodes is usually quite small. In normal feed-forward networks,
one typically connects layers linearly, with no nodes at the same depth
level connecting to each other and no nodes connecting to layers farther
than one depth level ahead. It is possible to contrive any scheme of
entangling depth levels with each other, but this is unnecessary (except
in the recurrent case to be discussed later) so it will not be considered
here. In general, the lower the connectivity, the simpler the network is
to train, the fewer computational resources an update will consume, and
the less work that has to be done in general. However, lower connectivity
has to be made up for by more nodes or more layers. One will not often go
wrong in sticking with geometrically plausible levels of connectedness,
such as connecting all nodes in one layer to all nodes in the next but
not going overboard beyond that (even in the recurrent case).

------------------
Signal Propogation
------------------

A fundamental facet of the network topology is the flow the signals from
the input layer to the output layer. Although many schemes may be
employed, taxonomically one draws a line between feed-forward networks
(where signals flow in one direction) and recurrent networks (where
signals may flow back and forth between layers). This distinction is made
because there is a large difference in the classes of functions which are
computable by the two. Also, feed-forward networks are guaranteed to
produce an output, while recurrent networks, like runaway Turing machines,
may never halt.

Feed-forward networks of depth 3, with sufficient connectedness, adequate
input and output representation, and reasonable weight values and
activation functions, are capable of approximating any static function to
within any given degree of accuracy. This is fine enough for inferring
functions and typical machine learning applications (though there remains
no simple procedure for constructing such a network), but most important
real-world problems are dynamic: the input-output maps change depending on
the previous state and input. Feed-forward networks are not Turing
universal because they cannot do one very important operationg: iteration.  
However, for any specific computable function, if one is willing to expend
a sufficient number of neurons, a feed-forward network can be constructed
to compute that function (by, in essence, unrolling loops in a
metaphorical sense, similarly to how one maps NP-complete problems to
physical circuits).

Recurrent neural networks, Turing universal and by nature iterative, are
proven to be capable of approximating within any given degree of accuracy
any dynamic function. Of course, recurrent neural networks are
mathematically almost impossible to deal with, in practice are very
difficult to train or otherwise produce, and are effectively impossible to
debug. Like human brains, they can suffer from epileptic feedback
problems, exploding out of control (or conversely lying inappropriately
dormant). The topology of a recurrent network may be defined as one likes,
though it becomes largely meaningless in this context to speak of depth.  
Complexity analysis becomes complicated and no standard procedure appears
to exist. With no guaranteed stopping time, one is reduced to tallying
nodes and hoping that other properties are sufficient to ensure that a
given computation is possible. The procedure of constructing appropriate
recurrent networks for a given application is a major open problem.

-------------------
Activation Function
-------------------

In conjunction with input weights, the activation function plays an 
essential role in determining what classes of functions are computable by 
a neural network. Categorically one tends to distinguish between two 
classes of activation functions: binary (threshold), and analog 
(continuous).

Binary activation functions are quite simple. Any value greater than or
equal to a given threshold returns a positive result (usually 1),
otherwise the function returns a null result (usually 0). In constructing
a neural network with binary activation functions, it is common for every
individual node to have a unique threshold value. Binary activation
networks are more powerful than one might expect. In the context of
feed-forward networks, a binary activation feed-forward network can (with
other preconditions met) represent most static functions. Specifically,
the complexity class known as TC0 (threshold circuit depth 0-indexed),
meaning those functions computable by threshold functions in polytime,
includes all those functions computable by continuous feedforward networks
of polysize and real-number weights. Binary activation functions seem to
be relatively simple to train.  However, binary activation functions make
for non-universal neural networks, so we will not consider them further.

Analog, or continuous, activation functions are a prerequisite for Turing
universality. Oddly, as a continuous activation function, any piece-wise
linear function, provided it is saturated, is adequate, so a simple
function such as g(in) = { 0 for in < 0, 1 for in > 1, in for 0 < in < 1 }
is fine. This is called a saturated linear function because inbetween the
interval {0,1} all real numbers are represented. Functions representing
all real numbers, but not within the interval {0,1}, are usually capable
of Turing universality, but at unusually high cost in terms of space
consumption (in feed-forward nets) and time (in recurrent networks).

One common activation function is the standard sigmoidal function
1/(1+e^-x) which has a simple enough derivative to be amenable to the
training phase and seems to be popular among electrical engineers.  
Sigmoidal functions and piece-wise linear functions appear to be
computationally equivalent in terms of what functions they may compute.  
One may intuitively give preference to the sigmoids owing to their
incorporation of extreme values. (The argument against incorporating
extreme values is that the precision is so out of proportion to the input
that they are meaningless, though in a signal-processing context this is
not immediately clear.) Recent results indicate, however, that
computations based on sigmoidal functions can take exponentially longer
than those based on saturated linear functions (owing, evidently, to
recurrency in the network). In this light, it may be safer and simpler to
opt for linear saturated functions.

-------------
Input Weights
-------------

It is easy enough to implement a neural network with a continuous instead
of binary activation function, not so difficult intuitively to provide
enough nodes to represent inputs and outputs, and difficult though not
impossible to find a good size and number of hidden layers, but these
concern mostly prerequisites for computability of various classes of
functions. One can say, easily enough, use a recurrent neural network or
you can't have Turing universality. However, one must pay special
attention to the input weights, because an interesting series of
computability cut-offs occur according to input weight representation. (We
will assume an activation function capable of accomodating the input
weights, as in general one should pair the two.)

Input weights, coefficients of the values sent into the activation
functions, can be binary, integer, rational, or real. Negative numbers are
unnecessary though not harmful, while further number sets (such as complex
and various transcendentals, or the algebraic numbers of various orders
between rational and real) add no computational capabilities and will
similarly be ignored. We shall assume other Turing universality
preconditions (recurrent networks of continuous activation functions and
adequate size) to be met.

Binary weight/activation values are analogous computationally speaking to
integer weights insofar as any integer may be represented in binary.  
Consequently, any function computable in O(n) node complexity on an
integer network will require O(nlogn) node complexity on a binary network,
but the computational class remains unchanged. An integer or binary
network is capable of calculating any regular function in regular
polynomial time, but is incapable of calculating the set of iterative or
recursive functions (including those which can be expressed in closed
form, such as linear homogenous functions, as they require algebraic
numbers). This result is surprising, as one might expect a recursive
function such as Fibonacci to be easily produced by recurrent integer
nets, a point which requires further investigation and perhaps even
verification (one is right to be skeptical).

The next level of abstraction in weights is the set of rational numbers, a
countably infinite set of integers and fractions, one integer over
another. Evidently it becomes possible to represent recurrence equations
once one has rational numbers. This is curious because the rational
numbers exist in a one-to-one correspondence with the integers; the key
seems to lie in there being a countably infinite number of rational
numbers between any two different rational numbers, while this is not true
for integers. It may be worthwhile to verify this result. In any event, it
is known that with rational weights and a saturated linear activation
function, a recurrent neural network can emulate any Turing machine step
for step. A fascinating example of super-Turing capabilities is that this
step-for-step emulation can be done for any function computable in T(n) on
a Turing machine, in constant time O(T(n)) with 886 neurons, and in the
tradition of trading time for space, may be reduced to a mere 25 neurons
if one is willing to pay O(n^2 T(n)) time.

The final level of abstraction in weights, skipping over those
multitudinous algebraic numbers with no computational value, is the set of
all real numbers. Here we become able to compute any arbitrary dynamic
function in polytime; quite an accomplishment, this constitutes true
super-Turing computability. Specifically, these networks can emulate
infinite threshold-based networks of infinite size in linear time.
Consequently, unless faced with very simple problems, one might as well
use real-number weights and accomodating activation functions, so as not
to sabotage ourselves. (Fortunately, the discrete nature of digital
computation, ie the 32-bit or 64-bit representation of real numbers, does
not negatively impact effective computability, because the level of
precision relative to input nodes is adequate for any purpose shy of
calculating the radius of the universe to within one hydrogen atom; should
the number of bits continue to increase, even that would become plausible.
As Turing noted long ago in a completely different context, once a
threshold of precision is reached, discrete becomes indistinguishable from
continuous.)

A correlative issue in the complexity of input weights is symmetry, that
is, whether or not Wij == Wji for any pair of nodes. In the case of
equality, the network is known as symmetric, and is simply a weighted
graph. In the asymmetric case, Wij != Wji, and we have a weighted digraph.
Again, with asymmetry, complexity depends more on the complexity of the
weights than on the asymmetry, but given sufficient weights, an explosion
of computability results. In general, asymmetry is a less confining choice
with little computational overhead. The problem with asymmetry is that
energy levels within the network are not bounded, and the threat of
'epilepsy' becomes endemic. Symmetric networks have a bounded energy
level, analogous to Lyapunov energy, so one is guaranteed to reach a
stable state, but this will obviously constrain computability. How much
additional node complexity is needed to compensate for symmetry is
unknown. In general, dealing with symmetry or lack-thereof remains an open
question.

----------------
Update Procedure
----------------

Neural networks may be updated in a number of ways. One may update
individual nodes one at a time in staggered or even serial fashion, or one
may update each layer or the even the whole network at a step either
simultaneously or staggered. We generalize down to two update
methodologies (one-at-a-time vs all-at-once) which have different names
depending on whether they refer to feed-forward or recurrent networks. In
the feed-forward case these are referred to as discrete updates and
continuous updates, respectively. In the recurrent case these are known as
sequential and productive, again respectively. Hard bounds have not been
established. Staggered update rules could in theory affect computational
time by disrupting parallelism, but it is not known if this is by a
constant factor or polynomially; computability classes seem unchanged.
More research can be done in this area, if deemed valuable. Biological
neurons seem to get along fine leaving the update procedure up to the
constraints of physical information propogation (aka slightly staggered).

------------
Network Size
------------

A further and rather fundamental computational element is the size of the
network, specifically the number of nodes in the network, and whether this
number is allowed to be infinite or thought to be bounded. Of course this
has no direct real-world analog, but is useful in a theoretical sense in
showing how finite networks with given properties can be computationally
equivalent to infinite networks of other properties (for example, the
equivalence of infinite threshold binary networks to finite real-number
saturated linear networks). In the neural context one measures the
complexity of a function by node complexity, that is, the number of nodes
required to compute a given function. In spite of possible tradeoffs with
connectedness and number of layers, this value seems not to change
complexity classes as a result. Similar to any complexity measure, the
more complex, the larger the representation. A rule for calculating the
necessary network size is not known; however, it is likely that any
function computable on a standard Turing machine in time T(n) should not
take more than O(T(n) log(T(n))) neurons, and on adequate networks can be
implemented step-for-step in O(T(n)) node complexity.  Determining
appropriate size and connectedness for hidden levels, as well as the ideal 
number of hidden level, again has no simple solution.

-------------------
Computational Model
-------------------

I was unable to formulate meaningful results distinguishing deterministic
from probabalistic computational models in the neural context; there is
effectively no difference that I can determine, though if one is very
unlucky a probablistic computation could scale by a (probably) constant 
time factor.

--------------------
Mathematical Summary
--------------------

We tend to divide complexity class of neural networks (this time the 
feed-forward kind) by activation function (threshold vs analog) and within 
this context then by depth. To summarize our knowledge:

Feed-forward networks

TC1 = { linearly seperable functions }
TC2 = { static continuous functions }
TC3 = { static discontinuous functions }

ACn = { static non-recursive functions }

Recurrent networks

TCn = { unknown }
ACn = { arbitrary dynamic functions }

Binary weights		{ regular polytime functions in O(T(n)logT(n)) }
Integer weights		{ regular polytime functions in O(T(n)) }
Rational weights	{ recursive functions, Turing Universal }
Real weights		{ any TM in O(T(n)) on 886 nodes, Super Turing }

Equivalence among TCn and ACm for various n,m is not known.
In general, any infinite-size TC can be emulated by a finite-size AC.
The Halting Problem for a recurrent neural network is PSPACE-complete

--==**==--
Speculation
--==**==--

We know a considerable amount about the computational complexity of neural
networks, although hard results tend to be produced for very specific 
instantiations rather than the model of neural computing as a whole.

There are two fundamentally pressing issues in neural networks that strike
me in particular. The first is how to pair input/output layers with
computational layers. And this is something we do not know very much
about, not in a practical sense and certainly not in a theoretical sense. 
Physiologically we have some clues, but physical neurons also have other 
information channels open to them that are not included in our 
computational models (and should not really be necessary).

More important, in my mind, is the problem of how to induce symbol
grounding. That is, for any model of machine intelligence, how can one in
effect 'boot-strap' the system such that the internal representations of
the external reality will come into being and accurately and
comprehensively represent the external reality. (I define intelligence to
be the comprehensiveness and accuracy of the internal model of the
external reality, as opposed to genetic/utilitarian self-interest or 
acts-like-a-human definitions.)

I have long held the conviction that neural networks are only one in a
class of functions capable of intelligence (of resolving the
symbol-grounding problem to construct the comprehensive internal reality).
Finding other functions in this class has been difficult, but at least I
have formal requirements now for these functions such that one could even
go about creating them if necessary. In particular it would be nice to
find a function that performs well on serial architecture, or small-scale 
parallel architectures, because that is what we have to work with at the 
moment.

The computational model, and indeed the physical substrate on which it is
implemented, directly affects computational complexity. In this vein,
while neural networks are ideally suited to parallel architectures, one
would expect modern statistical techniques to be more amenable to serial
architectures. It may be worth-while to explore modern statistical
techniques for inferring dynamic functions until such a time as massively
parallel architectures become widely available; if the two are
computationally equivalent, there is no reason why neural networks could
produce intelligence while statistical techniques could not.

Symbol-grounding must have a resolution. I have some intuition about where
this resolution lies. The bulk of machine learning concerns inferring a
function given inputs and consequent outputs. Supposing that this function
is not merely some function but is actually an external reality as
mediated through sensory/motor inputs and outputs. This function is then
dynamic, as the same input (ie the same motor action) will yield different
outputs (ie sensory percepts) depending on the state of the world, which
is more predictable than not. This produces fundamentally two very large
problems, the aforementioned pairing of sensory/motor input/output to
computational apparatus (and how important it is that sensors and motors
not be seperate entities is an open question), and our original problem of
symbol grounding internal to the computational apparatus. We know that 
nearly all animals achieve this, and that there are neuronal growth and 
pruning phases accompanying development stages, so we have a wealth of 
biophysical clues, which can be useful if not over-interpreted.

A corollary consequence of symbol-grounding, which at the moment we are
supposing to be the inference of the dynamic function of reality as
mediated through sensory/motor input/output, is that natural language must
somehow be achievable through this symbol-grounding process. Indeed, it is
natural language that gives us the most access to internal symbols, albeit
at a very high level. Beginning with basic grammar, which one must infer
having never heard spoken language and having only sensory/motor data to
work with, we have to show that this is possible through some
computational process. Chomsky et al have documented extensively that
human linguistic behavior is observable from a very high level of
abstraction to be quite similar; case studies of pidgin languages left by
colonialism, for example, show universal grammatical inclinations and
produce languages understandable by isolated populations formerly
colonized by the same imperial power, owing to the creation of the same
grammatical rules on a core vocabulary by both populations. Chomsky et al
have interpreted this to indicate the presence of high-level structures
for grammatical processing in the brain. However, if such high-level
structures exist, it seems likely that they would fail to develop more
frequently than is observable, and there should be little age difference
in recoverability of language facilities following brain damage. These are
not the case. Furthermore, the grammatical rules people develop or infer
tend simply to be generalizations or global application of observable
trends, ignoring the irregular verb forms and the like. This would be
consistent with any machine learning algorithm which seeks to produce a
reasonably simple function fitting the data. Therefore I would speculate
that a great deal of natural language, from its acquisition to its basic
grammatical structures, has more to do with computational properties of
the neural network architecture of the brain than with high-level
structures. It would be a considerable task to show this in some
meaningful computational sense before symbol-grounding has been solved for
simpler cases. At least we know that recurrent neurons in the cerebral
cortex are necessary for recursively enumerable language.

Neural networks are only one exemplar in a class of functions, which I
suspect at their core perform statistical approximation of dynamic
functions (previously I held the suspicion that these performed
statistical constraint satisfaction, based on vision and aural perception,
but I think this new formulation is more accurate). The key to
symbol-grounding, I think, is that as a consequence of the dynamicism
inherent in the real world, it is necessary for an internalized model of
the external reality to develop in order to account of the dynamic nature
of the problem. Any machine successfully mapping inputs and outputs of
sensory/motor data in a dynamic environment will of necessity produce an
internalized model of the external reality; this may be simply an
unavoidable consequence of any statistically plausible computation.

I would like to pursue the subject of symbol-grounding further; I think it
would make a good doctoral dissertation, and I suspect I could make
considerable headway. But, one can certainly get ahead of oneself. There
are prerequisites for addressing symbol-grounding. There is of course the
problem of coupling sensory/motor input/output to computational apparatus,
but there is a more pressing issue: there exists no comprehensive and
formal theory of intelligent systems consistent with the computational
properties of neural networks and consistent across all fields where
intelligence is studied, including economics, game theory, neuroscience,
biology, and evolutionary psychology. In general we have skirted the
issue, using thought experiments and sophistry, but it is becoming
increasingly clear that in order to address such issues as intelligence,
we will need some working definitions and working hypotheses. I have taken
to formulating a fairly consistent theory of intelligent systems to my own
satisfaction, though not falling squarely under the heading of computer
science outside of requiring consistency with known properties of neural
computation. Of the three problems I have identified thus far
(symbol-grounding, sensory/motor computational coupling, and a theory of
intelligent systems) I expect we have the capabilities at present to
address the first and third problems, while ideally one might pass off the
second to the robotics people. In any event, there is no dearth of avenues
for future research.

Bibliography
----------

Horne, B. G., Hush, D. R.: Bounds on the complexity of recurrent neural
network implementations of finite state machines. Neural Networks, 9,
243--252, 1996.

W. Maass. Bounds for the computational power and learning complexity of
analog neural nets. In Proc. 25th Annu. ACM Sympos. Theory Comput., pages
335--344. ACM Press, New York, NY, 1993.

P. Orponen: An overview of the computational power of recurrent neural
networks, in: H. Hotyniemi, ed., Proceedings of the 9th Finnish AI
Conference STeP 2000 Millennium of AI, Espoo, Finland (Vol. 3: "AI of
Tomorrow": Symposium on Theory, Finnish AI Society, Vaasa, Finland, 2000)
89-96.

P. Orponen, Computational complexity of neural networks: A survey. Nordic
Journal of Computing 1 (1994), 94--110.

P. Orponen, "Neural networks and complexity theory," in Proc. 17th
Symposium on Mathematical Foundations of Computer Science, Springer-Verlag
Lecture Notes in Computer Science, vol. 629 (1992), 50--61.

Jiri Sima and Pekka Orponen:  Computational Taxonomy and Survery of
Neural Network Models. Neural Computation, 2001.