Some words about neural architectures

intro

This is a sample from a term paper from a neural architecture class. Note that converting a complex latex document to clean markdown for a Jekyll blog is a bit... tedious. As such, some of the references don't work the way they should and figures might be a bit futzed up. Anyway....

Deep convolutional networks

In 1995, LeCun and Bengio introduced a paper that detailed the convolutional learning model as a “deep” architecture for neural networks. Most neural networks have only a small number of hidden layers, as training deeper models can be problematic. LeCun and Bengio demonstrate the power of using many complex layers to recognize complex patterns and structure in data without hand-crafting useful features prior to training; the network “learns” the features on its own. Convolutional networks preserve local structure in the input by passing several connections from one layer onto a small number of connections in the next layer, repeating until only class labels are output. This implementation was inspired by work on the feline visual system, where Hubel and Wiesel discovered location-sensitive and orientation-selective neurons in the visual cortex [1], [2]. The output passed from one of these levels to the next layer is called a feature map. Convolutional layers accept multiple feature maps as input, and the weights learned by the units in a feature map form the kernel of the convolution. The first layer can extract basic features (edges, corners, endpoints) and later layers can arrange those learned features to represent the input images.

Subsampling layers consist of \(2 \times 2\) “neurons” that compute the average of the previous layer’s input and process it via a trainable bias and coefficient before applying an activation function. The subsampling layer reduces the size of the feature map by a factor of four before passing it on to a new convolutional layer. Each successive layer increases the number of feature maps and decreases spatial resolution. In LeNet-5, the output layer consists of Euclidean Radial Basis Functions, one per class in the dataset, which compute the distance between the given feature vector and a parameter vector. Figure [fig:lenet5] gives an overview of the architecture.

[Figure: overview of the LeNet-5 architecture]
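As a rough, PyTorch-based sketch of this alternating convolution/subsampling stack (illustrative only, not the paper’s exact configuration: the trainable subsampling units and the RBF output layer are approximated here by average pooling and a plain linear layer):

```python
import torch
import torch.nn as nn

# LeNet-5-style stack: convolutions grow the number of feature maps while
# 2x2 subsampling (here plain average pooling) shrinks spatial resolution.
lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 feature maps from a 32x32 input
    nn.Tanh(),
    nn.AvgPool2d(2),                   # S2: 2x2 subsampling, 4x fewer activations
    nn.Conv2d(6, 16, kernel_size=5),   # C3: more maps, lower resolution
    nn.Tanh(),
    nn.AvgPool2d(2),                   # S4
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),        # C5/F6-style fully connected layers
    nn.Tanh(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),                 # one output per class (stand-in for the RBFs)
)

scores = lenet_like(torch.randn(1, 1, 32, 32))   # -> shape (1, 10)
```

Each pooling stage quarters the activations per map while the convolutions add maps, mirroring the trade-off described above.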

Discussion is given to various types of loss functions, but the authors use a mean squared error criterion \(E(W) = \frac{1}{P}\sum_{p=1}^{P}y_{D^p}(Z^p, W)\), where \(y_{D^p}\) is the output of \(RBF_{D^p}\), the RBF unit for the correct class of pattern \(Z^p\). The network is trained with a slightly modified backpropagation: the gradient of a shared weight is the sum of the gradients computed for each parameter that shares it.
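To make the shared-weight rule concrete, here is a small NumPy check with toy values (purely illustrative): a 1-D convolution reuses each kernel weight at every position, so that weight’s gradient is the sum of the gradients from every position at which it was applied.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)      # 1-D input signal
k = rng.normal(size=3)      # shared (convolutional) kernel
t = rng.normal(size=6)      # arbitrary targets

# forward pass: the same three kernel weights are reused at all six positions
y = np.array([k @ x[i:i + 3] for i in range(6)])
loss = 0.5 * np.sum((y - t) ** 2)

# backward pass: each kernel weight accumulates the gradient from every
# position where it was used
dy = y - t
grad_k = np.array([np.sum(dy * x[j:j + 6]) for j in range(3)])

# finite-difference sanity check on k[0]
eps = 1e-6
k_pert = k.copy(); k_pert[0] += eps
y_pert = np.array([k_pert @ x[i:i + 3] for i in range(6)])
loss_pert = 0.5 * np.sum((y_pert - t) ** 2)
assert abs((loss_pert - loss) / eps - grad_k[0]) < 1e-3
```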

The model is tested rigorously on a handwriting recognition dataset (MNIST) and outperforms other state-of-the-art machine learners. The authors show that their architecture is resistant to noise and to rotation/warping in images, similar to mammalian visual systems and crucial for learning in “real” environments.

Layer-wise Pretraining

Deep learning architectures are not all similar to the convolutional networks described above. The authors begin by describing issues that arose with fully-connected deep networks, as training them is a possibly intractable optimization problem. Hinton [3] introduced greedy layer-wise pretraining using unsupervised methods and changed the field, but why unsupervised pretraining works well was not well understood. They primarily find that pretraining acts as a pseudoregularizer, optimizes parameters in the lower layers, hurts shallow network performance, and acts differently than simply finding a good set of weights. They describe the general process shared by contemporary pretraining approaches as learning one layer at a time with an unsupervised method, each new layer trained on the representation produced by the layers beneath it. Afterwards, labeled training data is used to fine-tune the weights by minimizing a loss function. Each layer is pretrained in isolation; that is, the other layers are held constant during a layer’s pretraining.
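A minimal sketch of that greedy procedure, assuming a tied-weight linear autoencoder as a stand-in for the unsupervised learner (the papers use RBMs or denoising autoencoders; sizes and rates here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(data, n_hidden, lr=0.01, steps=200):
    """Fit one layer as a tied-weight linear autoencoder on `data`; a simple
    stand-in for the unsupervised learner used in the papers."""
    W = rng.normal(scale=0.01, size=(data.shape[1], n_hidden))
    for _ in range(steps):
        err = data @ W @ W.T - data                       # reconstruction error
        grad = (data.T @ err @ W + err.T @ data @ W) / len(data)
        W -= lr * grad
    return W

# Greedy layer-wise pretraining: each layer learns on the output of the layer
# below it, with all other layers held fixed.
X = rng.normal(size=(500, 100))                           # placeholder unlabeled data
weights, rep = [], X
for n_hidden in (64, 32, 16):
    W = pretrain_layer(rep, n_hidden)
    weights.append(W)
    rep = rep @ W                                         # input to the next layer
# `weights` would then initialize a deep network before supervised fine-tuning.
```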

They introduce the denoising autoencoder (DnAE) as an enhancement over a traditional autoencoder, though companion papers describe the training process in more detail. The DnAE operates on the network layer by layer with inputs \(x\), either raw input data or the output of the layer below, and produces a code vector \(h(x)\) where \(h(x) = \text{sigmoid}(b + Wx)\), i.e., a traditional neural-network layer. \(C(x)\) is their stochastic corruption of the input \(x\), in which randomly chosen components of \(x\) are set to 0, and the reconstruction of the signal is defined as \(\hat{x} = \text{sigmoid}\left(c + W^T h\left(C\left(x\right)\right)\right)\), where \(c\) is a bias. Stochastic gradient descent over the DnAE is computed as \[\theta = \theta - \epsilon \frac{\partial KL\left(x||\hat{x}\right)}{\partial\theta}\] where \(\theta = (b, c, W)\), \(\epsilon\) is a learning rate, and \(KL(x||\hat{x})\) is the sum of the component-wise KL divergences between the probability distributions associated with each element of \(x\) and its reconstruction probabilities \(\hat{x}\). The output layer estimates \(P(\text{class}|x)\).
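A NumPy sketch of one such update for a single layer follows; the layer sizes, corruption rate, and learning rate are illustrative. The gradient is taken through the cross-entropy form of the reconstruction cost, which has the same gradient as the component-wise KL term because the entropy of \(x\) does not depend on \(\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_x, n_h, eps = 20, 10, 0.1
x = rng.uniform(size=n_x)                        # input with components in [0, 1]
W = rng.normal(scale=0.1, size=(n_h, n_x))
b, c = np.zeros(n_h), np.zeros(n_x)

# corruption C(x): zero out a random subset of the input components
x_c = x * (rng.uniform(size=n_x) > 0.25)

# encode / decode with tied weights, as in the text
h = sigmoid(b + W @ x_c)
x_hat = sigmoid(c + W.T @ h)

# one stochastic gradient step on the reconstruction cost
d_out = x_hat - x                                # dL/d(pre-activation of x_hat)
d_hid = (W @ d_out) * h * (1.0 - h)              # dL/d(pre-activation of h)
W -= eps * (np.outer(d_hid, x_c) + np.outer(h, d_out))   # tied weights: two terms
b -= eps * d_hid
c -= eps * d_out
```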

The authors use DnAEs with two datasets: one synthetic, with 50,000 training, 10,000 validation, and 10,000 test instances of \(10 \times 10\) images of triangles and squares; the second is MNIST, as described above. They compare DnAEs against supervised gradient descent over several different architectures. For deep architectures trained via gradient descent, they suggest that increasing network depth increases the probability of finding poor local minima from random initialization.

The effects of pretraining lead to better generalization. They investigate this by visualizing the layers in the network to show what kinds of features they were learning. Pretrained networks learn more cohesive features, which leads to better separation between patterns or images. This is postulated to be due to pretraining finding better local optima within the weight space, initializing the layers to lie in an already semi-convex region, something non-pretraining strategies do not achieve. Weight trajectory drift over time was also investigated via visualization (see Figure [fig:tsne_ae]).

2D projection of training paths for 2-layer networks with and without pretraining on MNIST. Blue to red indicates training iteration. Adapted from [4].

Evolving neural networks

HyperNEAT

In 2009, Stanley, D’Ambrosio, and Gauci introduced HyperNEAT to evolve large neural networks [5]. HyperNEAT extends Neuroevolution of Augmenting Topologies (NEAT) [6]. Their approach is motivated by learning a conceptual representation of the desired network as a function of the given task’s geometric structure, and they claim geometric structure is often discarded in the ANN world. They give a basic overview of NEAT that is worth mentioning here. NEAT comprises three key ideas. First, tracking genes as network structures grow in complexity over generations lets individuals determine whether they are compatible with one another and how they should combine to create offspring; a historical marking is assigned to each new structure that appears through mutation, which avoids expensive topological matching. Second, NEAT speciates the population so that individuals compete within their own niche rather than against the population at large, preserving topological innovations within species and allowing them to be optimized before crossing over. Third, NEAT does not seed the population with random topologies; it seeds with uniform populations of simple networks, each with its own weight distribution. Combined, the NEAT strategy prioritizes searching for compact topologies by evolving complexity incrementally.
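As a toy illustration of historical markings (not code from the NEAT paper), two genomes can be aligned gene-by-gene using innovation numbers alone, without inspecting their topologies; here genes are reduced to an innovation-number-to-weight map and the first parent is assumed fitter.

```python
import random

# innovation number -> connection weight; genes sharing a number arose from
# the same structural mutation, so alignment needs no topological analysis
parent_a = {1: 0.5, 2: -0.3, 4: 0.9}           # assumed the fitter parent
parent_b = {1: 0.1, 3: 0.7, 4: -0.2, 5: 0.4}

def crossover(fitter, other):
    child = {}
    for innov, weight in fitter.items():
        if innov in other:
            # matching genes: inherit randomly from either parent
            child[innov] = random.choice([weight, other[innov]])
        else:
            # disjoint/excess genes: keep those of the fitter parent
            child[innov] = weight
    return child

child = crossover(parent_a, parent_b)
```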

The central encoding scheme is called Compositional Pattern Producing Networks (CPPNs), which represent connectivity patterns as functions of Cartesian space, finding a mapping from patterns in hyperspace to a lower-dimensional space. Prior to CPPNs, methods like grammars and cellular simulations were used to abstract development, but CPPNs do not require explicit simulations to evolve a network. Prior methods also relied on direct encodings: each element of a solution’s representation maps to a single piece of structure in the final network. The inspiration for CPPNs came from the indirect encoding patterns contained within human DNA, where mappings between genotypic and phenotypic expression are indirect, providing an incredible level of structural compression. They give an example: the human genome uses roughly \(3\times 10^4\) genes to encode the brain’s \(10^{14}\) connections. The high degree of structural similarity in biological beings allows common structures to be represented by small numbers of genes.

CPPNs work by taking a pattern in space as the phenotype, represented as a composition of functions over \(n\) arguments, where \(n\) is the number of dimensions of the phenotype. The set of functions creates a novel coordinate frame in which other functions may reside, so that functions can represent events in development (e.g., developing symmetry via a Gaussian function, or cell division/repetition using periodic functions). See Figure [fig:cppn_encoding] for an overview. The choice of functions influences the generated motifs.

HyperNEAT uses these components to find and exploit a mapping between spatial geometry and connectivity. The endpoints of each candidate connection are the CPPN’s inputs (see Figure [fig:hyperneat_substrate]); more formally, the CPPN computes a function \(CPPN(x_1, y_1, x_2, y_2) = w\), with the output \(w\) interpreted as the weight of the connection between the two points. Only weights above a threshold are expressed. Because the weights are a function of the positions of the source and target nodes, the weight matrix for grid connections represents a pattern that is a function of the underlying geometry of the sampled coordinate system. As the result is a graph with activation functions and weighted edges, it is functionally equivalent to an ANN.
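A schematic sketch of substrate querying follows; the CPPN here is a hand-built composition of simple functions standing in for an evolved one, and the expression threshold is arbitrary.

```python
import numpy as np

def cppn(x1, y1, x2, y2):
    # stand-in CPPN: a fixed composition of functions of the two endpoints
    # (in HyperNEAT the CPPN itself is what evolves)
    return np.sin(3.0 * (x1 - x2)) * np.exp(-((y1 - y2) ** 2))

def query_substrate(resolution, threshold=0.2):
    """Query every node pair on a 2-D grid spanning [-1, 1]^2 and express only
    connections whose CPPN output magnitude exceeds the threshold."""
    coords = np.linspace(-1.0, 1.0, resolution)
    nodes = [(x, y) for x in coords for y in coords]
    weights = {}
    for i, (x1, y1) in enumerate(nodes):
        for j, (x2, y2) in enumerate(nodes):
            w = cppn(x1, y1, x2, y2)
            if abs(w) > threshold:
                weights[(i, j)] = w
    return weights

connections = query_substrate(11)   # 11x11 grid: up to 121^2 = 14,641 queries
```

Because connectivity is expressed as a function of coordinates rather than as per-connection parameters, the same CPPN can be queried on a finer grid (e.g., `query_substrate(33)`) with no other changes, which is the property the scaling experiments below rely on.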

A function \(f\) creates a spatial pattern, or phenotype, with genotype \(f\). The resulting CPPN is a graph holding connections between connected functions, with weights applied to edges. Adapted from [5].
Geometric connections in HyperNEAT. The substrate is queried for connections, and the CPPN takes the endpoints from each query and outputs a weight for them. Patterns emerge as a function of the substrate’s geometry. Adapted from [5].

Substrates can take on many \(n\)-dimensional forms and could potentially be useful for different domains or for investigating the dominance of motifs within particular spaces. Since substrate configurations preserve structural relations, these structures may be exploited either a priori or elsewhere. The authors mention that this could be used in a visual-like system, where visual-cortex neurons are distributed in the same two-dimensional pattern as the retina to exploit locality through repetition of simple patterns. This type of structural information can be thought of as providing the evolving network with a local domain bias.

Stanley et al. propose two experiments to test HyperNEAT: one explores how geometry can be exploited through reuse to create large representations in a visual discrimination task, and the other identifies issues with exploiting the aforementioned geometric regularities by comparing sensor systems on a food-gathering robot. Both experiments can exploit problem-specific geometry. Experiment 1 requires identifying the center of the largest black box in an image containing multiple boxes. \(11\times11\) input grids were generated and a target \(11 \times 11\) grid was given for output. The goal was for HyperNEAT to learn a connectivity field that correctly corresponds to the center of the large box while remaining robust to shifts in spacing and image orientation. The HyperNEAT population was evolved as follows: each individual is evaluated for correctness in finding the target 75 times (the target is rotated and moved between presentations). Fitness is evaluated as the squared distance between the target and the identified center, averaged across the 75 trials. After 150 generations, HyperNEAT had a mean distance from the target of less than 0.3, but more importantly, the scalability of the method was tested.
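For concreteness, that fitness signal amounts to the following, where the target and identified centers are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.uniform(0, 11, size=(75, 2))              # true box centers, one per trial
found = targets + rng.normal(scale=0.3, size=(75, 2))   # centers read off the substrate

# squared distance to the target, averaged across the 75 presentations
mean_sq_dist = np.mean(np.sum((found - targets) ** 2, axis=1))
```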

HyperNEAT was trained on \(11 \times 11\) grids, but evaluation was later scaled to both \(33\times33\) and \(55 \times 55\) with the same CPPN. The original grid has 14,641 possible connections; the two larger evaluation grids have roughly \(1\times 10^6\) and \(9\times 10^6\), respectively. The CPPN preserved the spatial encodings and was able to generalize to the new grid sizes without any new training. Evaluated on the new grids, HyperNEAT performed similarly, with only slightly poorer performance. The network evolved for the \(55 \times 55\) grid test had 8.3 million connections in the substrate, while a well-performing CPPN itself had on average only 24 connections. Results were similar in the food-gathering task.
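The connection counts quoted above follow directly from fully connecting an \(n \times n\) input grid to an \(n \times n\) output grid:

```python
# (n*n)^2 possible input-to-output connections for an n x n grid
for n in (11, 33, 55):
    print(f"{n}x{n}: {(n * n) ** 2:,} connections")
# 11x11: 14,641    33x33: 1,185,921 (~1e6)    55x55: 9,150,625 (~9e6)
```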

HyperNEAT in action

In Verbancsics and Harguess’ recent work [7], they detail using HyperNEAT for feature learning on satellite imagery in naval applications. The authors have a repository, BCCT200, defined in [8], that contains satellite images of four classes of naval ships (barges, cargo ships, container ships, and tankers), identified by humans. There are 200 images per class, and the images have been lightly preprocessed. They use a lightly-modified version of HyperNEAT as defined above: where HyperNEAT trains a CPPN that encodes an ANN directly, the authors’ Feature Learning HyperNEAT (FLHyperNEAT) trains an ANN that transforms images into features by exploiting domain geometry, and those features are then fed to another machine learner to identify the image.

They run several experiments using the BCCT200 dataset. BCCT200 images were resized to \(28 \times 28\) pixels, and the data was split per class into 100 training, 50 validation, and 50 testing images. The final machine learner was a KNN classifier (with \(k = 3\)), trained during evolution on the features learned by FLHyperNEAT, and its classification score is used as the fitness score for the next generation. (See Figure [fig:flhyperneat].) Image normalization strategies were studied as well: images were normalized prior to training using max normalization, mean normalization, or standard deviation normalization. Normalization was computed either over all pixels from all images, over all pixels within each image, or over the pixels at a particular location across images, and both unipolar (\([0, 1]\)) and bipolar (\([-1, 1]\)) pixel ranges were tested.
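A rough sketch of that evaluation loop is below. The placeholder data, the random projection standing in for the evolved substrate ANN, and the max-normalization choice are assumptions for illustration; the \(k = 3\) nearest-neighbor classifier follows the text.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# placeholder stand-ins for the BCCT200 split: 4 ship classes, 28x28 pixels,
# 100 training and 50 validation images per class
X_train = rng.uniform(size=(400, 28 * 28))
y_train = np.repeat(np.arange(4), 100)
X_val = rng.uniform(size=(200, 28 * 28))
y_val = np.repeat(np.arange(4), 50)

# max normalization to a unipolar [0, 1] range (one of the studied strategies)
X_train, X_val = X_train / X_train.max(), X_val / X_val.max()

# hypothetical feature transform; in FLHyperNEAT this is the ANN decoded from
# the evolved CPPN, not a fixed random projection
proj = rng.normal(size=(28 * 28, 64))
features = lambda images: np.tanh(images @ proj)

# k-NN (k = 3) trained on the learned features; its classification score on
# held-out images serves as the fitness signal for the evolutionary search
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(features(X_train), y_train)
fitness = knn.score(features(X_val), y_val)
```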

For all experiments, results were reported as averages over 30 runs of 1500 generations of FLHyperNEAT. The FLHyperNEAT population comprised 200 functions, and fitness was a weighted sum of the classification errors, precision, and an inverse MSE (not specifically defined).

[Figure [fig:flhyperneat]: overview of FLHyperNEAT]

FLHyperNEAT was compared to principal component analysis (PCA), with FLHyperNEAT constrained to linear features (i.e., no hidden layers in the ANN substrate) for fairness. Mean normalized test classification performance for PCA was 0.753, and FLHyperNEAT was equivalent in most cases, reaching a score of 0.80 in one case with max bipolar scaling. Class confusions were similar to PCA’s in nearly all cases.

The features were tested again over different image sizes. The original images were scaled to \(20^2, 28^2, 50^2, \text{ and } 100^2\) pixels; PCA was re-run and trained for each new condition, while the CPPN trained at \(28^2\) pixels was reused for the new image sizes. PCA performs similarly to the results above. The CPPN features continued to work well at the new image sizes, scoring 0.65, 0.75, 0.64, and 0.63, respectively, without retraining.

The authors conclude that over this small dataset, FLHyperNEAT can learn quality image features and can extend those features to different image sizes, which is considered a difficult problem in computer vision.

[1] D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex,” The Journal of Physiology, vol. 148, pp. 574–591, Oct. 1959.

[2] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” The Journal of Physiology, vol. 160, pp. 106–154, Jan. 1962.

[3] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[4] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010.

[5] K. O. Stanley, D. B. D’Ambrosio, and J. Gauci, “A hypercube-based encoding for evolving large-scale neural networks,” Artificial life, vol. 15, no. 2, pp. 185–212, 2009.

[6] K. O. Stanley and R. Miikkulainen, “Evolving Neural Networks Through Augmenting Topologies,” Evolutionary Computation, vol. 10, no. 2, pp. 99–127, 2002.

[7] P. Verbancsics and J. Harguess, “Feature Learning HyperNEAT: Evolving Neural Networks to Extract Features for Classification of Maritime Satellite Imagery,” in Information Processing in Cells and Tissues, M. Lones, A. Tyrrell, S. Smith, and G. Fogel, Eds. Springer International Publishing, 2015, pp. 208–220.

[8] K. Rainey, S. Parameswaran, J. Harguess, and J. Stastny, “Vessel classification in overhead satellite imagery using learned dictionaries,” in SPIE Optical Engineering + Applications, 2012, p. 84992F.