Neural Networks

Neural networks are multi-layer networks inspired by how neurons work in the brain. The framework of a neural network consists of:

- The input layer (the input data);
- The hidden layers (the "black box");
- The output layer (e.g., the prediction) of the model.

A neural network has two main components:

- Connections, each with its own weight, that transform the input and pass it to a particular neuron.
- Neurons, each of which includes a bias term and an activation function (e.g., sigmoid).

Note: The bias term is similar to the intercept, but in neural networks every neuron has its own bias term.

The complexity of the neural network arises when there are more hidden layers, and consequently, more connections between neurons. The complex web of connections (weights and biases) is what makes the neural network “learn” the complicated relationships of the dataset.

  • Activations are the outputs of each neuron in the hidden layer.

Calculation of the output of neuron 1 in hidden layer 1 (for five inputs):

$$Z_1 = W_1 In_1 + W_2 In_2 + W_3 In_3 + W_4 In_4 + W_5 In_5 + \mathrm{Bias}_1$$

Neuron 1 activation $= \sigma(Z_1)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

Where $W_i$ is the weight of connection $i$, and $In_i$ is input $i$.

In general, $W_{a,b}$ is the weight of the connection between input $a$ and neuron $b$; $X_n$ is input $n$; $\mathrm{bias}_m$ is the bias (intercept) of neuron $m$.

Therefore, this generalizes to the matrix form:

$$W_{n \times m} \, X_{m \times 1} + \mathrm{Bias}_{n \times 1} = Z_{n \times 1}$$

This holds for any layer of the neural network where the prior layer is $m$ elements deep and the current layer is $n$ elements deep.
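As an illustration, here is a minimal sketch of this single-layer computation in Python with NumPy (the layer sizes and random values are arbitrary assumptions, not taken from the text):

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes each element of z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

m, n = 5, 3                      # prior layer is m elements deep, current layer is n
X = rng.normal(size=(m, 1))      # input column vector, shape (m x 1)
W = rng.normal(size=(n, m))      # weight matrix, shape (n x m)
bias = rng.normal(size=(n, 1))   # one bias term per neuron, shape (n x 1)

Z = W @ X + bias                 # Z_{n x 1} = W_{n x m} X_{m x 1} + Bias_{n x 1}
activations = sigmoid(Z)         # apply the activation function element-wise
print(activations.shape)         # (3, 1)
```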

  • After calculating $[Z]_{n \times 1}$, it is possible to apply the activation function to each element of $[Z]_{n \times 1}$. This step is executed for each successive layer, moving from the input to the output, and is known as forward propagation.

  • Backpropagation is the opposite process, and it constitutes the training process of the model.

  • One-hot (dummy-variable) encoding transforms categorical features into one dummy variable per category, but that is impractical when there are many categories (e.g., 50 categories result in 50 dummy variables). In NLP, a vocabulary of 1,000 words means each word becomes a vector of 1,000 values, 999 of which are zeros. This is not efficient in terms of computation time or memory.

  • Embedding layers convert "each word" (in the case of NLP) into a fixed-length vector of a defined size. The resulting vector is dense, with real values instead of just 0s and 1s. Embedding layers are used when you want to map an index value to a dense vector, rather than working with a large sparse vector (see the sketch below).
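As a minimal sketch (the vocabulary size and embedding dimension are arbitrary assumptions), an embedding layer is essentially a trainable lookup table that maps a word index directly to a dense vector:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 1000   # e.g., 1,000 words in the vocabulary
embed_dim = 16      # fixed length of each dense word vector

# One-hot encoding: each word is a mostly-zero vector of length vocab_size.
word_index = 42
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Embedding layer: a trainable (vocab_size x embed_dim) lookup table.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Looking up a row is equivalent to one_hot @ embedding_table,
# but far cheaper: no 1,000-element sparse vector is ever needed.
dense_vector = embedding_table[word_index]
assert np.allclose(dense_vector, one_hot @ embedding_table)
print(dense_vector.shape)  # (16,)
```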

1. Deep Generative Models

Generative models are used to reconstruct the data of interest. The models learn a joint distribution, represented by latent random variables, and generate synthetic data that is similar to the real data.

There are two main types of deep generative models:

- Generative Adversarial Networks (GANs) are used to improve the "quality" of an image by creating a new one.
- Variational Autoencoders (VAEs) are used to create new images that are similar to the original but differ in certain features.

2. Dimensionality reduction

Dimensionality reduction is a machine learning technique for reducing the number of features, used in situations that require low-dimensional datasets, such as data storage and data visualization. It is achieved by compressing the data with an encoder (from the initial space to the encoded/latent space), while a decoder decompresses them. Dimensionality reduction can be done in two ways:

- Selection: only some of the existing features are kept;
- Extraction: a reduced number of new features is created from the old features.

The objective of dimensionality reduction is to find the encoder/decoder pair that retains the maximum amount of information when encoding and yields the minimum reconstruction error when decoding.
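As a minimal sketch of both approaches (the data is random, and the choice of a PCA-style projection via SVD for extraction is an illustrative assumption, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 samples, 10 features

# Selection: keep only some existing features (here, columns 0, 3, and 7).
X_selected = X[:, [0, 3, 7]]

# Extraction: create a reduced number of new features from the old ones,
# e.g., project onto the top-3 principal directions (PCA via SVD).
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
encode = lambda x: x @ Vt[:3].T        # encoder: initial space -> latent space
decode = lambda z: z @ Vt[:3]          # decoder: latent space -> initial space

Z = encode(X_centered)                 # (100, 3) latent representation
X_reconstructed = decode(Z)            # (100, 10) approximate reconstruction
reconstruction_error = np.mean((X_centered - X_reconstructed) ** 2)
print(Z.shape, reconstruction_error)
```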

3. Encoder and decoder

  • Encoder: the process that creates the "new features" representation from the "old features" representation through dimensionality reduction (selection or extraction).

  • Decoder: the reverse process of the encoder. It recreates the "old features" representation from the "new features" representation.

4. Autoencoders

Autoencoders set up the encoder and decoder as neural networks and learn the best encoding-decoding pair through an iterative optimization process (minimizing the reconstruction error). Autoencoders can be trained by gradient descent: at each iteration, the encoder + decoder (the autoencoder architecture) is fed some data, the output is compared to the initial data, and the error is backpropagated through the architecture to update the weights of the network. Nonetheless, a limitation of the autoencoder is that it is difficult to guarantee that the encoder will organize the latent space in an intelligent way, because this depends on the distribution of the data in the initial space, the dimension of the latent space, and the architecture of the encoder. The latent space can therefore lack a meaningful structure due to its deterministic character: the autoencoder is only trained to encode and decode with minimal loss, not necessarily in a structured way.
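A minimal sketch of this training loop, assuming a purely linear encoder/decoder pair with hand-derived gradients (a real autoencoder would use non-linear layers and an automatic-differentiation framework):

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, k = 200, 10, 3              # 200 samples, 10 features, 3 latent dimensions
X = rng.normal(size=(N, d))

# Encoder and decoder as linear maps: z = E x, x_hat = D z.
E = rng.normal(scale=0.1, size=(k, d))
D = rng.normal(scale=0.1, size=(d, k))
lr = 0.1

for epoch in range(1000):
    # Forward pass: encode, then decode.
    Z = X @ E.T                   # latent representation, (N, k)
    X_hat = Z @ D.T               # reconstruction, (N, d)

    # Reconstruction error (mean squared error).
    R = X_hat - X
    loss = np.mean(R ** 2)

    # Backpropagate the error to get the weight gradients.
    grad_D = (2 / R.size) * R.T @ Z
    grad_E = (2 / R.size) * (R @ D).T @ X

    # Gradient-descent update.
    D -= lr * grad_D
    E -= lr * grad_E

print(f"final reconstruction loss: {loss:.4f}")
```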

  • Back-propagation is used for training the neural network. It consists of tuning the weights of the network based on the error (e.g., the loss) obtained in the previous iteration (epoch). Proper tuning of the weights yields lower error rates, making the model more reliable by improving its generalization.

5. Variational autoencoders (VAEs)

Variational autoencoders (VAEs) are used to generate images that keep certain features of the original fixed while changing others, thereby creating different images from the original one. A VAE is an autoencoder trained in a regularized manner to avoid overfitting and to guarantee that the latent space has properties suitable for a generative process. The encoding/decoding process differs from the standard autoencoder: instead of encoding an input as a single point (representation), we encode it as a distribution over the latent space.

If we want to create new images that are similar to the dataset, we replace the deterministic "z" (standard autoencoder) with a stochastic sample "z". Instead of learning the latent variables directly, for each latent variable we learn a mean and a standard deviation that parametrize a probability distribution, and we then sample from that distribution to obtain a latent sample.

  • The encoder computes a probability distribution $q_\phi(z \mid x)$ over the latent space "z" given an input "x".

  • The decoder computes a probability distribution $p_\theta(x \mid z)$ of "x" given the latent variable "z".

  • The reconstruction loss captures the pixel-wise difference between the input and the reconstructed image. It is a metric revealing how well the image is being generated.

  • The regularization term places a prior on the latent distribution (our hypothesis about it). It constrains how the probability distribution is computed, which acts as regularization while training the network.
    * Common choice of prior: a standard normal distribution, $p(z) = \mathcal{N}(\mu = 0, \sigma^2 = 1)$.

Advantages of using a normal distribution as the prior:

* Encourages encodings to be distributed evenly around the center of the latent space.

* Penalizes the network when it tries to “cheat” by clustering points in specific locations (e.g., memorizing data).  
  • Reparameterization trick: We cannot backpropagate gradients through the stochastic sampling of the latent space. The trick is to rewrite the sample as $z = \mu + \sigma \odot \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, 1)$, so the randomness lives in $\varepsilon$ and gradients can flow through $\mu$ and $\sigma$. This trick is what makes the training of VAEs possible (see the sketch after this list).

  • Latent perturbation: Because we put priors on the latent space, we can slowly increase or decrease a single latent variable while keeping all the other variables fixed. Ideally, we want latent variables that are:
    • Not correlated with each other;
    • Encouraged to be independent by enforcing a diagonal prior on the latent variables.
  • Note: VAEs have the disadvantage that they can only generate variations of the whole dataset, not of specific features. For example, on the MNIST dataset a VAE will produce variations of all digits 0-9; if we want variations of only the digit "7", a plain VAE cannot do this. In that case a CVAE should be used, in which the encoder and decoder are also fed a label vector for the digit "7".
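Below is a minimal sketch of the sampling step and the two loss terms (the dimensions and values are arbitrary assumptions; in a real VAE, $\mu$ and $\log \sigma^2$ come from the encoder network and everything is trained jointly):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2                                  # number of latent variables

# In a real VAE these come from the encoder network for a given input x.
mu = np.array([0.5, -1.0])             # learned means, one per latent variable
log_var = np.array([-0.2, 0.1])        # learned log-variances (more stable than sigma)
sigma = np.exp(0.5 * log_var)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).
# The randomness lives in eps, so gradients can flow through mu and sigma.
eps = rng.standard_normal(k)
z = mu + sigma * eps                   # stochastic latent sample

# Reconstruction loss: pixel-wise difference between input and reconstruction.
x = rng.random(784)                    # e.g., a flattened 28x28 image
x_hat = rng.random(784)                # stand-in for decoder(z)
reconstruction_loss = np.mean((x - x_hat) ** 2)

# Regularization term: KL divergence between N(mu, sigma^2) and the
# standard normal prior N(0, 1), in closed form.
kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))

loss = reconstruction_loss + kl
print(z, loss)
```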

6. Conditional Variational Auto-encoders (CVAEs)

"CVAEs models the distribution of a high-dimensional output space as a generative process, being similar to a VAE, but conditioned by additional attributes c." (Johnsen et al., 2022)
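As a minimal sketch of the conditioning idea, assuming the condition "c" is a one-hot label vector (e.g., for the digit "7") concatenated to the inputs of both the encoder and the decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 10
x = rng.random(784)                    # flattened 28x28 image

# Condition c: one-hot label vector, here for the digit "7".
c = np.zeros(num_classes)
c[7] = 1.0

# Encoder input: the image concatenated with its label.
encoder_input = np.concatenate([x, c])         # shape (794,)

# Decoder input: the latent sample concatenated with the same label,
# so generation can be steered toward a chosen class.
z = rng.standard_normal(2)                     # latent sample (from the VAE step)
decoder_input = np.concatenate([z, c])         # shape (12,)

print(encoder_input.shape, decoder_input.shape)
```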

References:

- https://towardsdatascience.com/understanding-neural-networks-19020b758230
- https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
- www.introtodeeplearning.com