Softmax Classifier using TensorFlow on MNIST dataset with sample code
Install TensorFlow
!pip install tensorflow
Loading the MNIST dataset
Every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. We'll call the images "x" and the labels "y". Both the training set and the test set contain images and their corresponding labels; for example, the training images are mnist.train.images and the training labels are mnist.train.labels.
import tensorflow.examples.tutorials.mnist.input_data as input_data
mnist = input_data.read_data_sets("MNIST", one_hot=True)
Checking the dimensions of the MNIST train and test sets
print("number of data points :", mnist.train.images.shape[0], "number of pixels in each image :", mnist.train.images.shape[1])
Number of train data points : 55000 number of pixels in each image : 784
mnist.train.images is a tensor (an n-dimensional array) with a shape of [55000, 784]. The first dimension is an index into the list of images and the second dimension is the index for each pixel in each image. Each entry in the tensor is a pixel intensity between 0 and 1, for a particular pixel in a particular image.
print("number of data points :", mnist.test.labels.shape[0], "length of the one hot encoded label vector :", mnist.test.labels.shape[1])
Number of test data points: 10000 length of the one hot encoded label vector : 10
Importing the libraries
If you want to assign probabilities to an object being one of several different things, softmax (multiclass logistic regression) is the natural choice, because softmax gives us a list of values between 0 and 1 that add up to 1. Even later on, when we train more sophisticated models, the final step will be a layer of softmax.
import tensorflow as tf
import numpy as np
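To make this concrete, here is a minimal NumPy sketch of softmax (an illustration only, not part of the model below): it exponentiates the scores and normalizes them so they sum to 1.

scores = np.array([2.0, 1.0, 0.1])
probs = np.exp(scores) / np.sum(np.exp(scores))
print(probs)        # approximately [0.659, 0.242, 0.099] -- valid probabilities
print(probs.sum())  # 1.0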
Defining Placeholders, Variables, predicted y and loss function
x = tf.placeholder(tf.float32, [None, 784])
x isn't a specific value. It's a placeholder. A placeholder can be imagined as a memory unit that we use to load various mini-batches of input data while training. We want to be able to input any number of MNIST images, each flattened into a 784-dimensional vector. We represent this as a 2-D tensor of floating-point numbers, with a shape [None, 784] (here None means that a dimension can be of any length).
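As a toy illustration (the placeholder name demo is hypothetical and separate from the model), the same placeholder can be fed mini-batches of any size:

demo = tf.placeholder(tf.float32, [None, 784])
with tf.Session() as s:
    print(s.run(tf.shape(demo), feed_dict={demo: np.zeros((5, 784))}))    # [  5 784]
    print(s.run(tf.shape(demo), feed_dict={demo: np.zeros((100, 784))}))  # [100 784]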
W = tf.Variable(tf.zeros([784, 10]))  # weights: one column of 784 weights per class
b = tf.Variable(tf.zeros([10]))       # biases: one per class
We also need the weights and biases for our model. We could imagine treating these like additional inputs, but TensorFlow has an even better way to handle them: Variable.
y = tf.nn.softmax(tf.matmul(x, W) + b)      # predicted y
y_ = tf.placeholder(tf.float32, [None, 10]) # actual y (one-hot labels)
First, we multiply x by W with the expression tf.matmul(x, W). This is flipped from our equation, where we had Wx, as a small trick to deal with x being a 2-D tensor with multiple inputs. We then add b, and finally apply tf.nn.softmax.
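A quick shape check (plain NumPy, with hypothetical arrays xb and Wb) shows why the order is matmul(x, W): a [batch, 784] input times a [784, 10] weight matrix yields 10 class scores per image.

xb = np.zeros((100, 784), dtype=np.float32)  # a batch of 100 flattened images
Wb = np.zeros((784, 10), dtype=np.float32)   # same shape as W
print(np.matmul(xb, Wb).shape)               # (100, 10) -- one score per class, per image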
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))  # tutorial for tf.reduce_sum: https://www.dotnetperls.com/reduce-sum-tensorflow
Defining the loss function (multi-class log-loss / cross-entropy): first, tf.log computes the logarithm of each element of y. Next, we multiply each element of y_ by the corresponding element of tf.log(y). Then tf.reduce_sum adds the elements along the second dimension (one sum per example), due to the reduction_indices=[1] parameter. Finally, tf.reduce_mean computes the mean over all the examples in the batch. A reduction is an operation that removes one or more dimensions from a tensor by performing certain operations across those dimensions.
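As a hand-worked example (plain NumPy, made-up numbers): for one image whose true label is class 2, the one-hot vector zeroes out every term except the log-probability the model assigned to class 2.

y_true = np.array([0., 0., 1.])          # one-hot label: class 2
y_pred = np.array([0.1, 0.2, 0.7])       # softmax output of the model
print(-np.sum(y_true * np.log(y_pred)))  # 0.3567 = -log(0.7)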
Defining the optimizer
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)
#https://www.tensorflow.org/versions/r1.2/api_guides/python/train#Optimizers
In this case, we ask TensorFlow to minimize cross_entropy using the gradient descent algorithm with a learning rate of 0.05. What TensorFlow actually does here, behind the scenes, is add new operations to your computation graph which implement backpropagation and gradient descent. It then gives you back a single operation which, when run, performs one step of gradient descent training, slightly tweaking your variables to reduce the loss.
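Roughly, minimize is a shorthand for computing the gradients and then applying them; a sketch of the equivalent two-step form in the same TF 1.x API (the name train_step_alt is hypothetical):

opt = tf.train.GradientDescentOptimizer(0.05)
grads_and_vars = opt.compute_gradients(cross_entropy)  # list of (gradient, variable) pairs
train_step_alt = opt.apply_gradients(grads_and_vars)   # equivalent to opt.minimize(cross_entropy)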
Launching the model
#Now launch the model in an InteractiveSession
sess = tf.InteractiveSession()
We first have to create an operation to initialize the variables we created:
tf.global_variables_initializer().run()
# We run train_step feeding in the batches data to replace the placeholders
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
On each step of the loop, we get a "mini-batch" of one hundred random data points from our training set. Using small batches of random data is called stochastic training; in this case, stochastic gradient descent. Ideally, we would use all our data for every step of training, because that would give us a better sense of what we should be doing, but that is expensive. So, instead, we use a different subset every time. Doing this is cheap and has much of the same benefit.
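For example, each call to next_batch returns a fresh random mini-batch (the names bx and by are hypothetical; shapes shown for a batch of 100):

bx, by = mnist.train.next_batch(100)
print(bx.shape, by.shape)  # (100, 784) (100, 10)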
# https://stackoverflow.com/a/41863099
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
tf.argmax(input, axis=None, name=None, dimension=None) returns the index with the largest value across the given axis of a tensor.
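For instance, using the session created above (a small sketch with made-up vectors):

print(sess.run(tf.argmax([[0.1, 0.8, 0.1]], 1)))  # [1] -- index of the highest probability
print(sess.run(tf.argmax([[0., 0., 1.]], 1)))     # [2] -- position of the 1 in a one-hot label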
Plotting the error
# https://gist.github.com/greydanus/f6eee59eaf1d90fcb3b534a25362cea4
# https://stackoverflow.com/a/14434334
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import time
def plt_dynamic(x, y, y_1, ax, colors=['b']):
    # plot the running train and test loss curves and redraw the figure in place
    ax.plot(x, y, 'b', label="Train Loss")
    ax.plot(x, y_1, 'r', label="Test Loss")
    if len(x) == 1:
        plt.legend()
    fig.canvas.draw()  # 'fig' is the global figure created in the cell below
Now summarizing everything in a single cell
training_epochs = 15
batch_size = 1000
display_step = 1
# softmax_cross_entropy_with_logits applies softmax internally, so it must be
# given the raw scores (logits), not the already-softmaxed y
logits = tf.matmul(x, W) + b
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_))
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)
fig, ax = plt.subplots(1, 1)
ax.set_xlabel('epoch')
ax.set_ylabel('Softmax Cross Entropy loss')
xs, ytrs, ytes = [], [], []
for epoch in range(training_epochs):
    train_avg_cost = 0.
    test_avg_cost = 0.
    total_batch = int(mnist.train.num_examples / batch_size)
    # Loop over all batches
    for i in range(total_batch):
        batch_xs, batch_ys = mnist.train.next_batch(batch_size)
        _, c = sess.run([train_step, cross_entropy], feed_dict={x: batch_xs, y_: batch_ys})
        train_avg_cost += c / total_batch
        c = sess.run(cross_entropy, feed_dict={x: mnist.test.images, y_: mnist.test.labels})
        test_avg_cost += c / total_batch
    xs.append(epoch)
    ytrs.append(train_avg_cost)
    ytes.append(test_avg_cost)
    plt_dynamic(xs, ytrs, ytes, ax)  # update the plot after every epoch
plt_dynamic(xs, ytrs, ytes, ax)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Accuracy:", accuracy.eval({x: mnist.test.images, y_: mnist.test.labels}))
The above plot shows the log-loss versus the number of epochs, where the blue line shows the train error and the red line shows the test error.
The accuracy obtained using TensorFlow on the MNIST dataset is about 90%.
===== Detailed code can be found at the GitHub link below =====
References:
- Applied AI (special thanks)
- https://www.tensorflow.org/
- Google Images
======================================