Mathematics behind MLP Architecture building with Hyperparameter tuning and application

Rana singh
10 min read · Jan 19, 2021


The biggest problem with deep learning is overfitting. A deep NN has many hidden layers and therefore a very large number of trainable weights; given that capacity, the biggest challenge you encounter is overfitting, so you always have to regularize. In classical ML we extensively use L1 and L2 regularization for this. Let's discuss some of the techniques used in deep learning to control overfitting.

Steps:

  1. Basic MLP terminology explained
  2. Application on MNIST data using Keras
  3. Hyperparameter tuning (sklearn/hyperopt)

1. Dropout layers & Regularization:

Dropout is a general concept used for regularization.

Dropout rate: it is the probability that a neuron in a given layer is inactive (dropped out) in each iteration, and its value lies in 0 < p < 1. Dropping a random subset of neurons is analogous to the random subset of features used in a Random Forest.

Original research paper: https://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf

So we train the model with dropout, but at prediction (test) time no neurons are dropped; instead the weights are scaled by the retention probability (in the paper's notation, multiplied by p) so that the expected activations match those seen during training.

The dropout rate is a hyperparameter. A small value such as 0.1 or 0.2 is a common starting point for very deep NNs, and the dropout rate should be made larger, NOT smaller, when the model is overfitting, i.e., when there are many weights and few data points.
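A minimal Keras sketch of this idea (layer sizes are illustrative, not from the original post): the Dropout layer drops the given fraction of activations during training only, and Keras uses inverted dropout internally, so no manual rescaling of the weights is needed at test time.

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=784))
model.add(Dropout(0.2))   # 20% of this layer's activations are zeroed at each training step
model.add(Dense(10, activation='softmax'))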

2. Rectified Linear Units (ReLU):

In classical NNs, one problem is vanishing gradients (when using sigmoid/tanh activations), which makes convergence very slow. Another problem is that this makes it hard to train deep NNs at all.

ReLU is today the most widely used activation function because:

  1. convergence is much faster
  2. its derivative is easy to compute

The derivative of the ReLU function is either 0 or 1, so there is no chain of small factors like 0.1, 0.4, 0.6 to shrink the gradient, and the vanishing-gradient problem largely disappears. However, because the derivative can be exactly 0, there is a chance of dead activations. When you have a lot of neurons and most of them are dead, you can use Leaky ReLU.

Another technique that helps avoid the main problem of dead activations is the weight initialization technique (discussed in the next section).
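A short illustrative sketch (layer sizes assumed) contrasting ReLU with Leaky ReLU in Keras; Leaky ReLU keeps a small slope for negative inputs, so its gradient never becomes exactly zero and neurons are less likely to die.

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=784))   # standard ReLU hidden layer
model.add(Dense(128))
model.add(LeakyReLU(alpha=0.01))                          # small negative slope instead of a hard zero
model.add(Dense(10, activation='softmax'))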

3. Weight initialization

How do you initialize the weights in a NN?

Weights feeding the neurons of the same layer should never all be initialized to the same value; otherwise symmetry is never broken, i.e., the weights remain identical throughout every iteration.

We want asymmetry between neurons, so, based on experiments, the better initialization schemes are:

  1. simple Gaussian initialization with a small standard deviation
  2. uniform initialization (works well for sigmoid activations)
  3. Xavier initialization (for tanh and logistic activations)
  4. He initialization (for ReLU/Leaky ReLU activations)

Note: always normalize the inputs to the range (0, 1) and keep the initial weight values from being too large.
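A quick Keras sketch (layer sizes assumed) of selecting an initializer per activation using the built-in string identifiers:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(256, activation='relu', kernel_initializer='he_normal', input_dim=784))  # He init for ReLU
model.add(Dense(128, activation='tanh', kernel_initializer='glorot_uniform'))            # Xavier/Glorot init for tanh
model.add(Dense(10, activation='softmax'))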

4. Batch Normalization

A small change in the input can lead to a large change at the end of the network.

Because every iteration sees a different mini-batch of data, the later neurons see large shifts in the centering of their input values, which can also be viewed as a shift in their distribution. This problem is called internal covariate shift; training and test data can also end up following different distributions.

So after using batch normalization (normalizing each mini-batch), all mini-batches are shifted to roughly the same distribution, which gives:

  1. faster convergence, because it allows a larger learning rate
  2. a weak regularization effect
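A minimal sketch (sizes assumed) of where BatchNormalization typically sits in a Keras MLP, between the linear layer and its activation:

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(BatchNormalization())   # normalize the mini-batch before the non-linearity
model.add(Activation('relu'))
model.add(Dense(10, activation='softmax'))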

5. The optimizer used in NN:

Follow the blog below.

6. Gradient Checking and clipping:

Monitoring the gradients (i.e., the weight updates) at each epoch is important for detecting the vanishing-gradient problem. If the gradients in the first few layers are very small, there is a vanishing-gradient problem; if the gradients are very large, you have an exploding-gradient problem, which can be solved using gradient clipping.

L2-norm clipping: if the L2 norm of the gradient vector exceeds a chosen threshold, divide the gradient by that norm (the square root of the sum of squared components) and multiply by the threshold, so the scaling factor is less than 1 and the clipped norm never exceeds the threshold.
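Keras optimizers expose this directly through the clipnorm argument; a brief sketch (values are illustrative):

from keras.optimizers import Adam

# With clipnorm, a gradient g whose L2 norm exceeds 1.0 is rescaled to g * 1.0 / ||g||.
opt = Adam(learning_rate=0.001, clipnorm=1.0)
# model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])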

7. Softmax and Cross-entropy for multi-class classification

The loss function to use:

  1. multi-class → multi-class log loss (categorical cross-entropy)
  2. regression → squared loss
  3. two-class → binary (2-class) log loss
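A short sketch (output layers assumed for context) of how these choices map to Keras loss strings:

from keras.models import Sequential
from keras.layers import Dense

multi = Sequential([Dense(10, activation='softmax', input_dim=784)])
multi.compile(optimizer='sgd', loss='categorical_crossentropy')   # multi-class log loss

binary = Sequential([Dense(1, activation='sigmoid', input_dim=784)])
binary.compile(optimizer='sgd', loss='binary_crossentropy')       # 2-class log loss

regressor = Sequential([Dense(1, input_dim=784)])
regressor.compile(optimizer='sgd', loss='mean_squared_error')     # squared loss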

8. How to train a Deep MLP

  1. preprocess: normalize the data
  2. initialize the weights
  3. choose an activation: my favourite is ReLU
  4. try adding batch normalization, especially for the deeper layers
  5. add dropout
  6. choose an optimizer: my favourite is Adam
  7. tune the hyperparameters: layers / neurons / dropout rate / (alpha, beta in the Adam optimizer)
  8. choose a loss function
  9. always monitor your gradients and apply gradient clipping
  10. always plot the loss (train and test) vs. epochs
  11. avoid overfitting, because NNs overfit extremely easily

9. Auto Encoders

An NN-based approach to dimensionality reduction.

An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation with the target values set equal to the inputs, i.e., it uses y(i) = x(i) in fully connected layers. Layers L1 and L3 have the same dimension, while the middle layer L2 has a lower dimension but tries to preserve all the information. So we compress the L1 information into the low-dimensional L2 layer, and the L2 output is then the input to the decoder, which expands the information back up to the same dimension as L1.

The aim is to preserve high-dimensional points in a low-dimensional representation. We can train a shallow MLP autoencoder or a deep NN autoencoder.

Denoising autoencoders (DAEs): if the data is noisy or partially corrupted, we pass it through the encoder to obtain a denoised low-dimensional representation. DAEs take a partially corrupted input and are trained to recover the original, undistorted input; in practice, the objective of a denoising autoencoder is simply to clean (denoise) the corrupted input.

For sparse data, we use a sparse autoencoder (SAE).

Autoencoder — Wikipedia
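A minimal Keras sketch (sizes assumed) of an MLP autoencoder for flattened 784-d MNIST vectors; note that the targets passed to fit are the inputs themselves:

from keras.models import Sequential
from keras.layers import Dense

autoencoder = Sequential()
autoencoder.add(Dense(32, activation='relu', input_dim=784))   # encoder: 784 -> 32 (the low-dimensional code)
autoencoder.add(Dense(784, activation='sigmoid'))              # decoder: 32 -> 784 (reconstruction)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X_train, X_train, epochs=10, batch_size=256)  # y = x: reconstruct the input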

10. Word2Vec CBOW

Word2Vec is not a deep-NN algorithm as such; it uses only a shallow neural network to learn word embeddings.

The words in a sentence are divided into two groups: context words and focus words. A focus word is surrounded by its context words, which are very useful for understanding the focus word.

There are 2 algorithms to obtain W2V embeddings: 1) CBOW (continuous bag of words) and 2) Skip-gram.

CBOW:

Let's assume we have a dictionary/vocabulary of words V of size v (the length of the vocabulary). Each word is one-hot encoded, i.e., a binary vector of v dimensions. The core idea is: given the context words, predict the focus word, which is nothing but a multi-class classification problem.

How to train CBOW?

Take the combinations of focus (target) word and context words for each sentence separately, and create a dataset with the context words as train_x and the focus word as train_y.

Each input in layer L1 is a one-hot encoded word vector of dimension v, fully connected to an N-dimensional hidden layer with a linear activation function; at the end of training you obtain an N×v weight matrix W. This N-d hidden layer is fully connected, through a second N×v weight matrix, to a softmax output of dimension v. W has one column per word in the vocabulary, so for any given input word you can read off its N-d vector as the embedding.

Benefit:

  1. faster than skip-gram
  2. better for frequently occurring words

Skipgram

Here you try to predict the context words given the v-dimensional one-hot encoded focus word. The input is fully connected to an N-dimensional hidden layer with a linear activation function, and an N×v weight matrix connects it to a softmax output for each context-word prediction. It is computationally more expensive than CBOW.

Benefit:

  1. can work with a smaller amount of data
  2. works well for infrequent words

Training W2V with either of the two methods above can take a lot of time, so we use algorithmic optimizations for training. They are as follows:

  1. hierarchical softmax (algorithm-based)
  2. negative sampling (statistics-based)
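For reference, a brief gensim sketch (toy corpus assumed; older gensim versions use size= instead of vector_size=) showing how the sg flag switches between CBOW and skip-gram, and how hs/negative select hierarchical softmax or negative sampling:

from gensim.models import Word2Vec

sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'chased', 'the', 'cat']]
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, negative=5)            # CBOW + negative sampling
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, hs=1, negative=0)  # skip-gram + hierarchical softmax
print(cbow.wv['cat'].shape)   # (50,) -> the learned N-d embedding for 'cat'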

Application: On MNIST dataset using Keras

Keras is a high-level library that can use TensorFlow as its backend.

Libraries and data:
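A sketch of the typical imports and MNIST load used in this section (the exact code is in the notebook linked at the end):

import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, BatchNormalization
from keras.utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape, X_test.shape)   # (60000, 28, 28) (10000, 28, 28)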

If we look at the pixel matrix above, each cell has a value between 0 and 255. Before we apply any machine learning algorithm, let's normalize the data:

X => (X - Xmin)/(Xmax-Xmin) = X/255

Let's also convert each label into a 10-dimensional vector, e.g., the label 5 becomes [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]. This conversion is needed for MLPs with a softmax output.
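Continuing the sketch above, the flattening, scaling, and one-hot encoding would look like this:

X_train = X_train.reshape(-1, 784).astype('float32') / 255.0   # X / 255 maps pixels to [0, 1]
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0
Y_train = to_categorical(y_train, 10)   # e.g. 5 -> [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
Y_test = to_categorical(y_test, 10)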

Simple Softmax classifier with optimizer=’sgd’

# https://keras.io/getting-started/sequential-model-guide/
The Sequential model is a linear stack of layers. You can create a Sequential model by passing a list of layer instances to the constructor:

# model = Sequential([Dense(32, input_shape=(784,)), Activation('relu'), Dense(10), Activation('softmax') ])

# You can also simply add layers via the .add() method:
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))

The model needs to know what input shape to expect. For this reason, the first layer in a Sequential model (and only the first, because the following layers can do automatic shape inference) needs to receive information about its input shape; you can use input_shape or input_dim to pass it. The first argument of Dense (formerly output_dim) is the number of nodes in that layer; our output layer has 10 nodes.

# Before training a model, you need to configure the learning process, which is done via the compile method. It receives three arguments:
# An optimizer. This could be the string identifier of an existing optimizer: https://keras.io/optimizers/
# A loss function. This is the objective that the model will try to minimize: https://keras.io/losses/
# A list of metrics. For any classification problem you will want to set this to metrics=['accuracy']. https://keras.io/metrics/

Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g., if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all zeros except for a 1 at the index corresponding to the sample's class). That is why we converted our labels into vectors.
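Putting it together, a sketch (hyperparameters assumed) of compiling and fitting the simple softmax classifier with optimizer='sgd', using the preprocessed arrays from above:

model = Sequential()
model.add(Dense(10, activation='softmax', input_dim=784))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, Y_train, batch_size=128, epochs=10,
                    validation_data=(X_test, Y_test), verbose=1)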

Example of MLP + ReLU activation + Adam optimizer + BN + Dropout, 5 layers
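A sketch of such an architecture (layer sizes, dropout rates, and epoch count are illustrative assumptions, not the exact values from the original notebook), again reusing the preprocessed MNIST arrays and imports from above:

model = Sequential()
model.add(Dense(512, activation='relu', kernel_initializer='he_normal', input_dim=784))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(256, activation='relu', kernel_initializer='he_normal'))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(128, activation='relu', kernel_initializer='he_normal'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, Y_train, batch_size=128, epochs=20,
                    validation_data=(X_test, Y_test), verbose=1)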

Here is the loss vs. epochs plot on the training and validation data; the model seems free from overfitting.

Hyper-parameter tuning of Keras models using Sklearn:

A list of parameters that can be used as hyperparameters:

  1. number of layers
  2. number of activation units in each layer
  3. type of activation
  4. dropout rate

Function (code) used in hyperparameter tuning (a sketch is given below):
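A minimal sketch (model structure and parameter grid are assumptions) of wrapping a Keras model so sklearn's GridSearchCV can tune the layer width, dropout rate, and activation; newer setups would import KerasClassifier from scikeras.wrappers instead:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model(units=64, dropout_rate=0.2, activation='relu'):
    model = Sequential()
    model.add(Dense(units, activation=activation, input_dim=784))
    model.add(Dropout(dropout_rate))
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

clf = KerasClassifier(build_fn=build_model, epochs=5, batch_size=128, verbose=0)
param_grid = {'units': [32, 64, 128], 'dropout_rate': [0.1, 0.3], 'activation': ['relu', 'tanh']}
search = GridSearchCV(clf, param_grid, cv=3)
# search.fit(X_train, Y_train)      # X_train: (n, 784), Y_train: one-hot (n, 10)
# print(search.best_params_)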

Note: other hyperparameter tuning toolboxes include hyperopt, etc.

Hyperopt parameter tuning

Hyperopt uses a form of Bayesian optimization for parameter tuning that allows you to get the best parameters for a given model. It can optimize a model with hundreds of parameters on a large scale.

Hyperopt has different functions to specify ranges for the input parameters; these are stochastic search spaces. The most common options for a search space are:

  • hp.choice(label, options) — used for categorical parameters; it returns one of the options, which should be a list or tuple. Example: hp.choice("criterion", ["gini", "entropy"])
  • hp.randint(label, upper) — used for integer parameters; it returns a random integer in the range [0, upper). Example: hp.randint("max_features", 50)
  • hp.uniform(label, low, high) — returns a value uniformly between low and high. Example: hp.uniform("max_leaf_nodes", 1, 10)

Other options you can use are:

  • hp.normal(label, mu, sigma) — This returns a real value that’s normally-distributed with mean mu and standard deviation sigma
  • hp.qnormal(label, mu, sigma, q) — This returns a value like round(normal(mu, sigma) / q) * q
  • hp.lognormal(label, mu, sigma) — This returns a value drawn according to exp(normal(mu, sigma))
  • hp.qlognormal(label, mu, sigma, q) — This returns a value like round(exp(normal(mu, sigma)) / q) * q

Our function to minimize is called hyperparamter_tuning, and the classification algorithm whose hyperparameters we optimize is Random Forest. I use cross-validation to avoid overfitting, and the function returns the loss value and its status.

The Trials object is used to keep all the hyperparameters, losses, and other information, which means you can access them after running the optimization.

The fmin function is the optimization function that iterates over different sets of hyperparameters and minimizes the objective function. fmin takes 5 inputs:

  • The objective function to minimize
  • The defined search space
  • The search algorithm to use, such as Random search, TPE (Tree Parzen Estimators), or Adaptive TPE.
    NB: hyperopt.rand.suggest and hyperopt.tpe.suggest provide the logic for a sequential search of the hyperparameter space.
  • The maximum number of evaluations.
  • The trials object (optional).
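A runnable sketch of the setup described above (the search space, dataset, and evaluation count here are illustrative assumptions, not the notebook's exact choices):

from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

space = {
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'n_estimators': hp.choice('n_estimators', [50, 100, 200]),
    'max_depth': hp.quniform('max_depth', 3, 15, 1),
}

def hyperparameter_tuning(params):
    clf = RandomForestClassifier(criterion=params['criterion'],
                                 n_estimators=params['n_estimators'],
                                 max_depth=int(params['max_depth']),
                                 random_state=42)
    score = cross_val_score(clf, X, y, cv=3, scoring='accuracy').mean()
    return {'loss': -score, 'status': STATUS_OK}   # fmin minimizes, so negate the accuracy

trials = Trials()
best = fmin(fn=hyperparameter_tuning, space=space, algo=tpe.suggest,
            max_evals=20, trials=trials)
print(best)   # indices/values of the best hyperparameters found (hp.choice entries come back as indices)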

===================Code=================

https://github.com/ranasingh-gkp/Study_notebook/blob/main/TensorFlow%20and%20Keras%20Build%20various%20MLP%20architectures%20for%20MNIST%20dataset.ipynb

============Thanks==================

