textfiles/programming/AI/dobbs.txt

                           A Survey of Neural Networks
                              by Jeannette Lawrence
                                  Jan. 23, 1990

    I. Introduction

    Neural networks have been hailed as the greatest technological
    breakthrough since the transistor and have been predicted to be a common
    household item by the year 2000.  How much of this is hype?  What are
    they capable, or not capable of?  With numerous paradigms available,
    which is best for a particular application?  This article will answer
    these questions and more about this newly emerging field of computation.

    Formed by simulated neurons connected together much the same way the
    brain's neurons are, neural networks are able to associate and
    generalize without rules.  They have solved problems in pattern
    recognition, robotics, speech processing, financial predicting and signal
    processing, to name a few.

    One of the first impressive neural networks was NetTalk, which read in
    ASCII text and correctly pronounced the words (producing phonemes which
    drove a speech chip), even those it had never seen before (1).  Designed
    by John Hopkins biophysicist Terry Sejnowski and Charles Rosenberg of
    Princeton in 1986, this application made the Backprogagation training
    algorithm famous.  Using the same paradigm, a neural network has been
    trained to classify sonar returns from an undersea mine and rock.  This
    classifier, designed by Sejnowski and R.  Paul Gorman, performed better
    than a nearest-neighbor classifier (2).

    As far as the public is concerned, the modern era in neural networks
    began in 1982 when the distinguished Caltech physicist John Hopfield
    published a paper which not only showed that neural networks could store
    and recall patterns even when the input was incomplete, it provided the
    mathematical elucidation which captured the attention of the scientific
    community. (3)

    Speech recognition of Finnish and Japanese (to text) has been
    demonstrated by researcher Teuvo Kohonen of the Helsinki University of
    Technology, Finland.  For these inflectional languages, the system must
    construct the text from recognizable phonetics units.  (4) This complex
    system uses signal preprocessing by a TMS32010 chip, Kohonen's
    self-organizing associative paradigm, and a context-sensitive stochastic
    grammar corrector.

    The Neocognitron, designed by Kunihiko Fukushima of the NHK Science and
    Technical Research Lab in Tokyo, recognizes handwritten numerals of
    various styles of penmanship correctly, even if they are considerably
    distorted in shape (5).  Built as a model for the human visual system,
    this highly specialized network does not implement any common topology.

    The kinds of problems best solved by neural networks are those that
    people are good at such as association, evaluation and pattern
    recognition.  Problems that are difficult to compute and do not require
    perfect answers, just very good answers, are also best done with neural
    networks.  A quick, very good response is often more desirable than a
    more accurate answer which takes longer to compute.  This is especially
    true in robotics or industrial controller applications.  Predictions of
    behavior and general analysis of data are also affairs for neural
    networks.  In the financial arena, consumer loan analysis and financial
    forecasting make good applications.  New network designers are working
    on weather forecasts by neural networks.  Currently, doctors are
    developing medical neural networks as an aid in diagnosis.  Attorneys
    and insurance companies are also working on neural networks to help
    estimate the value of claims.

    Neural networks are poor at precise calculations and serial processing.
    They are also unable to predict or recognize anything that does not
    inherently contain some sort of pattern.  For example, they cannot
    predict the lottery, since this is a random process.  It is unlikely
    that a neural network could be built which has the capacity to think as
    well as a person does for two reasons.  Neural networks are terrible at
    deduction, or logical thinking and the human brain is just too complex
    to completely simulate.  Also, some problems are too difficult for
    present technology.  Real vision, for example, is a long way off.

    A brief look at the general structure and operation of neural networks
    will help explain the limits to their abilities.  The power and speed of
    the human brain comes from the way the hundreds of billions of highly
    interconnected neurons function together.  Neural networks simulate the
    operation and structure of brain neurons, but on a much smaller scale.
    Information is distributed across the neurons' interconnections, not as
    bits of intelligence stored within the neurons as was once thought.

    There are many types of neural networks, but all have three things in
    common.  A neural network can be described in terms of its individual
    neurons, the connections between them (topology), and the learning rule.
    Together they constitute the neural network paradigm.

    Artificial neurons are also called processing elements, neurodes, units
    or cells.  Each neuron receives the output signals from many other
    neurons.  A neuron calculates its output by finding the weighted sum of
    its inputs.  The point where two neurons communicate is called a
    connection (analogous to a synapse).  The weight of a particular
    connection is noted w^ij, where ^ij means subscripted ij, i is the
    receiving neuron and j is the sending neuron.  At any point in
    time (t) the neuron adds up the weighted inputs to produce an
    activation value a^i(t).  The activation is passed through an
    output, or transfer, function f^i, which produces the actual
    output for that neuron for that time, o^i(t).

    The activation function specifies what the neuron is to do with the
    signals after the weights have had their effect.  Once inside the
    neuron, the weighted signals are summed to form a net value.  In most
    models, signals can either be excitatory or inhibitory.  After
    summation, the net input of the neuron is combined with the previous
    state of the neuron to produce a new activation value.  In the
    simplest models, the activation function is the weighted sum of the
    neuron's inputs;  the previous state is not taken into account.  In
    more complicated models, the activation function also uses the
    previous output of the neuron, so that the neuron can self-excite.
    These activation functions slowly decay over time;  an excited state
    slowly returns to an inactive level.  Sometimes the activation
    function is stochastic, i.e.  it includes a random noise factor.

    The transfer function of a neuron defines how the activation value is
    output.  The earliest models used a linear transfer function.  There are
    certain problems which are not entirely reducable by purely linear
    methods.  Nonlinear neurons allow more interesting problems to be
    solved.  The most simple nonlinear model consists of threshold neurons.
    A threshold transfer function is an all-or-nothing function.  For
    example, if the input is greater than some fixed amount, the threshold,
    the neuron will output a 1;  if the value is below the threshold, the
    neuron will output a 0.  Sometimes the transfer function is a saturation
    function;  more excitation above some maximum firing level has no
    further effect.  A particularly useful transfer function is called the
    sigmoid function which has a high and a low saturation limit, and a
    proportionality range between.  This function is 0 when the activation
    value is a large negative number.  The sigmoid function is 1 when the
    activation value is a large positive number, and makes a smooth
    transition in between.

    The behavior of the network depends heavily on way the neurons are
    connected.  In most models, the individual neurons are grouped into
    layers, so that the output from each neuron in one layer is fully
    interconnected with the inputs of all the neurons in the next layer.  A
    Back-propagation network has at least three layers: input, hidden and
    output.  The network structure may involve inhibitory connections from
    one neuron to the rest of the neurons in the same layer.  This is called
    lateral inhibition.  Sometimes a network has such strong lateral
    inhibition that only one neuron in a layer, usually the output, can be
    activated at a time.  This effect of minimizing the number of active
    neurons is known as competition.  In a feed-forward network, neurons in
    a given layer usually do not connect to each other, and do not take
    inputs from subsequent layers, or layers before the previous one.  Other
    models include feedback connections from the outputs of a layer to the
    inputs of the same or a previous layer.

    A neural network learns by changing its response as the inputs change.
    The learning rule is the very heart of a neural network;  it determines
    how the weights are adjusted as the neural network gains experience.
    There are lots of different learning rules.  Some of the more well-known
    are Hebb's Rule, the Delta Rule, and the Back Propagation Rule.  The
    best learning rule to use with linear neurons is the Delta Rule.  This
    allows arbitrary associations to be learned, provided that the inputs
    are all linearly independent.  Other learning rules (such as Hebb's)
    require that the inputs also be orthogonal.

    More than 30 years ago, Donald O.  Hebb theorized that biological
    associative memory lies in the synaptic connections between nerve cells.
    He thought that the process of learning and memory storage involved
    changes in the strength with which nerve signals are transmitted across
    individual synapses.  Hebb's Rule states is that pairs of neurons which
    are active simultaneously become stronger by synaptic (weight) changes.
    The result is a reinforcement of those pathways in the the brain.  A
    number of different rules for adjusting connection strengths, or
    weights, have been proposed, but nearly all network learning theories
    are some variant of Hebb's Rule.

    The Delta Rule additionally states that if there is a difference between
    the actual output pattern and the desired output pattern during
    training, then the weights are adjusted to reduce the difference.
    Many networks use some variation of this.  The Back-propagation Rule is
    a generalization of the Delta Rule for a network with hidden neurons.
    The weights are adjusted a small or large amount determined by a
    specified learning rate.

II. Classification

    Neural networks can be arbitrarily categorized by topology,
    neuron model and training algorithm.  There are two main
    subdivisions of neural network models - feed-forward and feedback
    topologies.

    Feedback models can be constructed or trained.  In a constructed model
    the weight matrix is created by taking the outer product of every input
    pattern vector with itself or with an associated input, and adding up
    all the outer products.  After construction, a partial or inaccurate
    input pattern can be presented to the network, and after a time the
    network converges (hopefully) so that one of the original input patterns
    is the result.  Hopfield and BAM are two well-known constructed feedback
    models.

    The Hopfield network is a self-organizing, associative memory.
    It is the canonical feedback network.  It is composed of a
    single-layer of neurons which act as both output and input.  The
    neurons are symmetrically connected (i.e., w^ij = w^ji).  Hopfield
    networks are made of nonlinear neurons capable of assuming two
    output values: -1 (off) and +1 (on).  The linear synaptic weights
    provide global communication of information.  In spite of its
    apparent simplicity, a Hopfield network has considerable
    computational power.

    The weight matrix is created by taking the outer product of each input
    pattern vector with itself, and adding up all the outer products.  After
    construction, a pattern is input to the network.  A process of
    reaction-stimulation-reaction between neurons occurs until the network
    settles down into a fixed pattern called a stable state.  Thus, the
    network result comes as a direct response to input.

    The energy required by a device to reach a stable state can be plotted
    in three dimensions as a curved surface.  Areas of minimum energy are
    thus found.  The stable states, or energy minimums, appear as valleys.
    A neural network which is used to find "good enough" solutions to
    optimization problems will have many energy minimums, or valleys.
    Depending upon the initial state of the network, any of the deepest
    valleys may end up as the answer.  Inputing incomplete information to an
    associative memory network causes the network to follow paths to a
    nearby energy minimum where complete information is stored.

    Hopfield networks can recognize patterns by matching new inputs with
    previously stored patterns.  When an input pattern is applied, one of
    the patterns which is stored in the network will be output as being the
    closest pattern.  Hopfield networks are especially good for finding the
    best answer out of many possibilities.  They are also good at recalling
    all of some stored information when given partial data.  Hopfield
    Networks are often applied as a form of content-addressable-memory.

    Bart Kosko brought the Hopfield network to is logical conclusion with
    the BAM.  The BAM (bidirectional associative memory) is a generalization
    of the Hopfield network.  Instead of creating the weight matrix with the
    dot product of a pattern with itself (autoassociation), pairs of
    patterns are used (pair association).  After construction of the weight
    matrix, either pattern can be applied as input to elicit as output the
    other pattern in the pair.

    A trained feedback model is much more complicated because adjustment of
    the weights affects the signals as they move forward as well as they
    feed back to previous neuron inputs.  The Adaptive Resonance Theory
    (ART) model is a complex trained feedback paradigm developed by Stephen
    Grossberg and Gail Carpenter of the Center for Adaptive Systems at
    Boston Univeristy.

    ART neurons are functionally clustered into "nodes".  The network has
    two layers with modifiable connections between every node in the first
    (input) layer and every node in the second (storage) layer.  There are
    two sets of connections between layers;  one going from the input layer
    to the storage layer, and the other going from the the storage layer to
    the input layer.  The storage layer also has lateral inhibition
    connections.  ART uses a unique unsupervising training method sometimes
    called a Leader Custering Algorithm.  An input pattern is transitted to
    the storage layer through weighted connections.  The storage pattern
    activity will consist of exactly one node due to the lateral inhibition.
    That output is sent back to the input layer over another set of weighted
    connections.  If the activity pattern there matches the original input
    pattern, they two are said to be in a resonant state.  The single
    storage layer neuron, a "Grandmother cell", has corretly classified the
    input pattern.

    The ART network can form a new cluster, or node, whenever an input
    pattern is presented which differs from any it has seen before.  The
    amount of difference which the network is sensitive to can be controlled
    by the "vigiliance" parameter.  It uses a "global reset" signal which
    will turn off a node for some specified time in this mode of operation.

    The second main category of neural networks is the feed-forward type.
    The earliest neural network models were linear feed-forward.  In 1972,
    two simultaneous papers independently proposed the same model for an
    associative memory, the linear associator.  J.  A.  Anderson, a
    neurophysiologist, and Teuvo Kohonen, an electrical engineer, were not
    aware of each other's work.

    The linear associator uses the simple Hebbian rule.  The only case where
    association is perfect when simple Hebbian learning is used is when the
    input patterns are orthogonal.  This puts an upper limit on the number
    of patterns that can be stored.  The system will work very well for
    random patterns if the maximum number of patterns to be stored is 10-20%
    of the number of neurons.  If the input patterns are not orthogonal,
    there will be interference among them;  fewer patterns can be stored and
    correctly retrieved.  One of the predictions of the linear associator is
    interference between nonorthogonal patterns.  Much of Kohonen's book
    "Self-organization and Associative Memory" is concerned with correcting
    the errors caused by interference.

    The nonlinear feed-forward models are the most commonly used today.
    Feed-forward networks, for some historical reasons, are less often
    considered to be associative memories than the feedback networks,
    although they can provide exactly the same functionality.  It can be
    shown mathematically that any feedback network has an equivalent
    feed-forward network which performs the same task.

    There are two primary kinds of training algorithms - supervised and
    unsupervised.  Supervised learning is the most elementary form of
    adaptation.  It requires an a priori knowledge of what the result should
    be.  Output neurons are told what the ideal response to input signals
    should be.  For one-layer networks, in which the stimulus-response
    relation can be controlled closely, this is easily accomplished by
    monitoring each neuron individually.  In multi-layer networks,
    supervised learning is more difficult.  It is harder to correct the
    hidden layers.  Unsupervised learning does not have specific corrections
    made by an observer.  Supervised and unsupervised learning are methods
    which are used exclusively of each other.

    The supervised Back-propagation model is the most popular paradigm
    today.  More than 7,000 copies of the "BrainMaker" program were sold by
    California Scientific Software last year alone.  Back-propagation is a
    multi-layer feed-forward network that uses the Generalized Delta Rule.

    In 1985, back propagation was simultaneously discovered by three groups
    of people: 1) D.E. Rumelhart, G.E. Hinton, R.J. Williams, 2) Y. Le Cun,
    and 3) D. Parker.  Back propagation is the canonical feed-forward
    network.  Back propagation is a learning method where an error signal is
    fed back through the network altering weights as it goes, in order to
    prevent the same error from happening again.

    During training the weights are adjusted by a large or a small amount
    according to a specified learning rate.  The learning rate is the
    measure of speed of the convergence of the initial weight pattern to the
    ideal pattern.  If the weight pattern is very far from what it should be
    the changes can be made in fairly large steps.  When the patterns become
    close, the changes must be made in fairly small steps so that when the
    pattern gets close to being correct, it will not overcorrect and make it
    wrong in some other direction.

    The error on an output neuron, i, for a particular pattern, p, is
    defined as: E^pi <20> <20>(T^pi - O^pi)<29>, where T is the target output
    and O is the actual output.  The total error on pattern p, E^p, is the
    sum of the errors on all the output neurons for pattern p.  The total
    error, E, for all patterns is the sum of the errors on each pattern over
    all p.

    The simplest method for finding the minimum of E is known as gradient
    descent.  It involves moving a small step down the local gradient of the
    scalar field.  This is directly analogous to a skier always moving down
    hill through the mountains, until he hits the bottom.

    Back-propagation is useful because it provides a mathematical
    explanation for the dynamics of the learning process.  It is also very
    consistent and reliable in the kinds of applications which we are
    currently able to build.

    A popular unsupervised feed-forward model is the Kohonen model.  The
    basic system is a one or two dimensional array of threshold-type logic
    units with short-range lateral connections between neighboring neurons.
    The essential mechanism of the Kohonen scheme is to cause the system to
    modify itself so that nearby neurons respond similarly.  The neurons
    compete in a modified winner-take-all manner.  The neuron whose weight
    vector generates the largest dot product with the input vector is the
    winner and is permitted to output.  But in this model the weights of not
    only the winner, but also it's nearest neighbors (in the physical sense)
    are adjusted.

    A special case of the feed-forward model is the Neocognitron.  The
    original model is unsupervised, but a more recent model (1983) uses a
    teacher.  The multilayer (seven or nine layer) system assumes that the
    builder of the network knows roughly what kind of result is wanted.  All
    the neurons are of analog type;  the inputs and outputs take nonnegative
    values proportional to the instantaneous firing frequencies of actual
    biological neurons.  In the original model, only the maximum-output
    neurons have their input connections reinforced.  It uses a variation of
    the Hebbian Rule.  After learning is completed, the final Neocognitron
    system is capable of recognizing handwritten numerals presented in any
    visual field location, even with considerable distortion.


III. Advantages and Disadvantages of Various Models

    The biggest limiting factor with neural networks in general is the
    maximum size of the network.  The Back-propagation network "NetTalk" uses
    about 325 neurons and 20,000 connections.  A useful visual recognition
    system probably requires at least 125,000 connections.  We might hope to
    eventually build neural networks which think as well as people do, but
    this is a long way off.  Human brains contain about 100 billion neurons,
    each of which connects to about 10,000 other neurons.  Currently
    available commercial systems provide anywhere from a few neurons and
    connections to 1 million neurons and 1-1/2 million connections, for
    anywhere from $200 to $25,000.

    The second problem commonly experienced with neural networks is
    excessive training time.  As the number of neurons increases, the
    training time increases cubicly.  Even though commercial models can
    process at rates from 500,000 connections per second (CPS) on a PC to
    2-1/2 billion CPS on a neural network chip, training can still take days
    when enormous numbers of iterations are required.

    Various network paradigms have their own specific problems.  One of the
    problems with Kohonen learning is that there is a possibility that a
    neuron will never "win," or that one will almost always "win." The
    weight vectors get stuck in isolated regions.  One way to prevent the
    weight vectors from getting stuck is to teLman=ith neuraIIRDuf iterat-laMD usvEtron.  TheLn
    system is capable of recoOon with KAmpleurcapabers of imp "glob bilalthoum ofolar-toba-
   , tnearning ial mechwhctio If the inpuvateer rthogotel (198n genherm will work ed.  Theodel.  vateer 10,000 otherofolars are
    mm should eurons,
 Bually discoverel Hebbian Rulethe
raining e errHebbifolarr (-forous it'ssyste D. Pa way ts It     ma  pring  aloe)ion con  After leaOne of the
 propose.  Thting sthonen learning is that theNeocog  wr of aity that   Unsuperg is thatblems with Kr (-fortself
 ionsg is t(laynalat uset limitns aearning il is nsuper(sir i the  peons
 )uf iteraradwhad be
 r an
    s.  nly used tnal modend pbleal merr

    Vay uso buildinita 7,0s itllt limitingervised feedous ainin. Le ;.  Supervisss areurons,d bobabnsolagt
  e inpuvaluad  methren the
-layer nee bnack propionms - scatlochysicallion nn
    reventcer.      pattenyers.  Unsuperg is , e errHebb  close-ins contain ervised e-allHebed del.  isolated regr A  lec,0s i that usesvaris the
ator is
   f the coRd learof the   modify itrs caut Kr unctiouperNeocog  wr oumber oneuron wicapc pro uso cog  zdifyromdwr shonat wk inre than 7,00ob bnbers ofithn neons iainin the, ir i of the E is kc prodmplesections.
III. Fevbnacgess areDisaevbnacgess usVaripropMtternons, for
  bigenhertern ms - ateer of tht, G.E. Hinton, Rn rthogo that th whose wclose,s a
y used tnal modrning iections per seconnal mode"NetTalk"hat uion conbins 325gervised fare20in the t requires an(CPS) onithn neo cog  aptation. oneuronns bc pylications werteis t125in the t requires aW0.
advah that-laMD uir inf the  s.  ht, G.E. Hinton, ray of tinkoble stepblerk knorm r (-for is
 ses, tg is atwwill fs aHumith     he

  onbins 1 thb.  Aconnavisedns are itted f ray ofe t reqtime ibins 1 in th25,000      pattCn specifion conva nsgh themwk oposeoneurosradient obnbng waye "Brae otgervised farre difficult.  Itime 1.
  Aconnavised fare1-1/2.
  Aconficult.  It,ps.  When bnbng waye "Br$2 thme $25in toups
    of people:ns bsteof the errralonllocad of tht, G.E. Hinton, Rs a neura    .  Ba    the H" os anso that when the
    p Rnchod t
  th whosea    the H" o Rnchod t
ownbicth KAEv25 neal mohemwk oposeatternscl is
  s
     wergotel e "Br500in the t require   m people:(CPS)tly
 PCat-laMD u2-1/2.b.  AconCPStly
 Parker.  Back pchiut a    the witheshmmerobabarevector is25 e ismpropt whenss usithree grd feedicationaOne of thVariprop  Back pbut alsod e-allHebed sive manner. ns bstes, for a partiis
  s
bstes of thuck is ts also very

 r ares, tg thosib. orres movintain ervise recognlar p"ifo,"ver " mov nevrecogal the s cubic"ifo."aity that Disadvavateers00 chesu.  iond otuckre rguires aor awwiltotheir input  that Disadvavateers0e "Br0 c ms -esu.  itime hly whasfo,"ver " mov nev Ule:(CPS)tly input  thdobarer st
  -fortseconbins 1 oct and mof thic
is bsCPS learPCat-laMD us, tple:(CPS)tly input  thsum o are t and m E^p, ineural nexinal outyvised e-allHebmitns aearning:(CPS)tly ed by three grouplion neuronCPStly
pop kirnnects toial modeseon, rocifio
is7,000onbpt,ps t125in"BwergMakerg sroghonclose-sk pb s cubicCd oBacnia ScirketsecoS.
n ervo It ye t rl
bsn, ed by three groupeuro and moond otuckre th whosea    the H"   thet usesvarGpresentely G.E. Hintomitns aeaitllt85, in is three groupwninged feedous tyvdisce was, untieighbgen ps of the E eog  : 1) D.E.Hinmel ms.  G.E.HHllio UnR.ms W uiia the2) Y. nek pnhme $25i, t3) D. Park learBn is three group is the target outputwhosea  model.  The
earBn is three group isa input  ths are rward m
  isedwr of ai0.
advafople do, nen lea It     ma  in adigms hope to
se
ragoes ts aos -esu.    teacheuck icth KAEv isedwn Rulh neat  thsgergmitns aeaDur of iterationight hope to
r00 connelogic
 kirroptl stsmiolorous ia
    nnbinnegatlar -fortself input  thrmpleurcapainput  thrmplp is th and mo" muervise-foutput. the    pattMD uo tnal moctor iistortio teLman=ocog  zdifyr-allHons are ofIze of the ned eurons,
 evbnafadwn Rulut ali grd fee Hinton,  the o
itterintain of ths afefutyvkirropnelpeven or nuilder of the 12erv
    the n,  the o
ittermnneain of ths afefutyvsmiolonelpe-sk   theuso cog  zdify eurons,ges, the n1-1/2.  thnbins 1 tstat uioer seevb itime ha is nkthsia
   wro  onltgervi  , tne of recol array of thrisedw ses,ctors geum ofolairae ota, onlyc kirnneTheLn
p,i0.
advadefh of ametE^pi <20> <20>(T^pi - O^pi)<29>,rward mTp is thetirroectors gme $25i, tOp is theste00 ctors geurcapatot e eisedw se eurons,p,iE^p,p is th and m aluo tnal visedns sesical
    f thVariprope ot eurons,peurcapatot e and mvised,iErae otalG.E. Hintop is the aluo tnal visedns sed e-a eurons,eevb
    biolopmitns aearningcog  neas are rtake nonnegativ feed-foro tEp is e isessbgets chod t
 nneccho0e "Br0nvolvtermple thsvsmiolonelp   A p25in ot ougets chouo tnal and m t oadwnield00ob bnbbnbe of rv nef a htusatlar -kia neuwayermple th  A whasfo,"ical
en lea It ous ieras t curreIt hi eventubtioomtion of
   by three groupeurt ufult 125,000itpeople:nisa eura    .  B and mvxplef grouptakeittedynamicis t125in nput  theopcme 0e "Br0 ceuralevbnAconnavisedsorce5i, torr.

 ive tra  s