The APIs for Neural Networks in TensorFlow

By Dana Mastropole, Robert Schroll, and Michael Li

TensorFlow has gathered quite a bit of attention as the new hot toolkit for building neural networks. To the beginner, it may seem that the only thing that rivals this interest is the number of different APIs you can use. In this article, we will go over a few of them, building the same neural network each time. We will start with low-level TensorFlow math, and then show how to simplify that code with TensorFlow’s layer API. We will also discuss two libraries built on top of TensorFlow, TFLearn and Keras.

The MNIST database is a collection of handwritten digits. Each is recorded in a $28\times28$ pixel grayscale image. We will build a two-layer perceptron network to classify each image as a digit from zero to nine. The first layer will fully connect the 784 inputs to 64 hidden neurons, using a sigmoid activation. The second layer will connect those hidden neurons to 10 outputs, scaled with the softmax function. The network will be trained with stochastic gradient descent, on minibatches of 64, for 20 epochs. (These values are chosen not because they are the best, but because they produce reasonable results in a reasonable time.)
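
To get a feel for the size of this network, we can count its trainable parameters. The sketch below assumes each fully connected layer has one weight per input-output pair plus one bias per output; the helper name dense_param_count is ours, chosen only for illustration.

```python
# Layer sizes come from the description above; the helper name is hypothetical.
N_PIXELS = 28 * 28   # 784 inputs
HIDDEN_SIZE = 64     # hidden neurons
N_CLASSES = 10       # one output per digit

def dense_param_count(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

layer1 = dense_param_count(N_PIXELS, HIDDEN_SIZE)
layer2 = dense_param_count(HIDDEN_SIZE, N_CLASSES)
total = layer1 + layer2
print(layer1, layer2, total)  # 50240 650 50890
```

Nearly all of the parameters sit in the first layer, because that is where the 784 pixel inputs are connected.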

We’ll start by loading the modules and the data, as well as setting up some constants we’ll use repeatedly.

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('/tmp/data', one_hot=True)

Xtrain = mnist.train.images
ytrain = mnist.train.labels
Xtest = mnist.test.images
ytest = mnist.test.labels

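N_PIXELS = 28 * 28
N_CLASSES = 10     # digits zero to nine
HIDDEN_SIZE = 64   # neurons in the hidden layer
EPOCHS = 20        # passes through the training data
BATCH_SIZE = 64    # observations per minibatch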

sess = tf.Session()

Raw TensorFlow

At its heart, TensorFlow is just a tool for assembling and evaluating computational graphs. Thus, the most basic way to use TensorFlow is to set up the calculation by hand.

Let’s start by setting up placeholders for the features and labels. These record the shape and datatype of the data to be fed in. Note that the first dimension has size None, which indicates that it can take an arbitrary number of observations.

x = tf.placeholder(tf.float32, [None, N_PIXELS], name="pixels")
y_label = tf.placeholder(tf.float32, [None, N_CLASSES], name="labels")

In the first layer, the input features (pixel intensities) are multiplied by a weight matrix of size N_PIXELS × HIDDEN_SIZE. The weights are stored in a variable, a TensorFlow data structure that holds state that can be updated during training.

A bias term is added to this, and the result is sent through a sigmoid activation function.

W1 = tf.Variable(tf.truncated_normal([N_PIXELS, HIDDEN_SIZE],
                                     stddev=N_PIXELS**-0.5))  # stddev is assumed; the original argument is missing
b1 = tf.Variable(tf.zeros([HIDDEN_SIZE]))

hidden = tf.nn.sigmoid(tf.matmul(x, W1) + b1)

The second layer has its own set of weights and biases, sized to give us ten outputs, one for each class. We don’t apply an activation function to this output…

W2 = tf.Variable(tf.truncated_normal([HIDDEN_SIZE, N_CLASSES],
                                     stddev=HIDDEN_SIZE**-0.5))  # stddev is assumed; the original argument is missing
b2 = tf.Variable(tf.zeros([N_CLASSES]))

y = tf.matmul(hidden, W2) + b2

…because TensorFlow provides a loss function that includes the softmax activation. (Doing it this way allows it to avoid floating-point issues for probabilities close to 0 or 1.) This loss function calculates the cross entropy directly from the logits, the input to the softmax function. The ground truth values will be input as y_label.

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_label))
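
To see why working from the logits helps, here is a small pure-Python sketch. It is not TensorFlow's implementation; it only illustrates one common remedy, the log-sum-exp trick, and the function names naive_loss and stable_loss are ours.

```python
import math

def naive_loss(logits, label_index):
    """Softmax first, then the log: exp() overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[label_index] / sum(exps))

def stable_loss(logits, label_index):
    """Cross entropy straight from the logits using the log-sum-exp trick."""
    m = max(logits)  # after subtracting the max, every exponent is <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label_index]

# For moderate logits the two versions agree.
print(naive_loss([2.0, 1.0, 0.1], 0), stable_loss([2.0, 1.0, 0.1], 0))

# For extreme logits only the stable version survives.
try:
    naive_loss([1000.0, 0.0], 0)
except OverflowError as err:
    print("naive version failed:", err)
print(stable_loss([1000.0, 0.0], 0))
```

The naive version breaks as soon as one logit is large enough to overflow the exponential, while the version that works from the logits returns a finite loss.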

The cross entropy is useful for training because it rewards steps that improve the confidence of predictions, even if they don’t change the actual predictions. It can be a bit difficult to interpret, so we’ll also compute the accuracy, the fraction of predictions we got correct.

accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1),
                                           tf.argmax(y_label, 1)),
                                  tf.float32))
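
The difference between the two measures is easy to see on made-up numbers. In the sketch below, the probabilities are invented for illustration: both sets of predictions pick the right class every time, so the accuracy is identical, but the cross entropy is much lower for the confident set.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[t]) for p, t in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    """Fraction of rows whose most probable class is the true class."""
    hits = sum(1 for p, t in zip(probs, labels) if p.index(max(p)) == t)
    return hits / len(labels)

labels = [0, 1, 2]  # true class of each of three observations
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

for name, probs in [("hesitant", hesitant), ("confident", confident)]:
    print(name, accuracy(probs, labels), round(cross_entropy(probs, labels), 3))
```

Accuracy cannot distinguish the two sets, which is why the cross entropy makes the better training signal.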

All that’s left to do is run the training process. Gradient descent is a simple optimization scheme that updates the value of a parameter based on the gradient of a loss function with respect to that parameter. Because TensorFlow is working from a computational graph, it can work out all the variables that contribute to the loss tensor, and it can figure out how to update those variables to reduce the value of the loss. Those update rules are stored in sgd.
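
The update rule itself is simple enough to write out by hand. The following is a minimal sketch on a toy one-parameter loss, $(w - 3)^2$, which we chose only for illustration; TensorFlow derives the gradients automatically, whereas here the gradient is supplied explicitly.

```python
def gradient_descent(grad, w, learning_rate, steps):
    """Repeatedly step against the gradient of the loss."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0),
                           w=0.0, learning_rate=0.1, steps=100)
print(w_final)
```

Each step moves the parameter a fraction of the way toward the minimum, so the printed value ends up very close to 3.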

It is up to us to run these update rules a number of times. We’ve chosen to run for 20 epochs (cycles through the full training data), with randomly chosen batches of 64 training observations for each step. After each epoch, we print out the loss and accuracy of the model on the test data.

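sgd = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# Variables must be initialized before the graph is run.
sess.run(tf.global_variables_initializer())

for i in range(EPOCHS):
    # Shuffle the training indices so each epoch uses different random batches.
    inds = np.random.permutation(Xtrain.shape[0])
    for j in range(0, len(inds), BATCH_SIZE):
        sess.run(sgd, feed_dict={x: Xtrain[inds[j:j+BATCH_SIZE]],
                                 y_label: ytrain[inds[j:j+BATCH_SIZE]]})
    print(sess.run([loss, accuracy], feed_dict={x: Xtest, y_label: ytest}))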

This runs in about 20 seconds on an unremarkable laptop, and gets us an accuracy of over 97%.
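
The batching scheme used in the training loop can be sketched without TensorFlow. This is an illustration using only the standard library, with a hypothetical helper minibatch_indices; it shows that shuffled slices cover every observation exactly once per epoch, with a smaller final batch when the batch size does not divide the data evenly.

```python
import random

def minibatch_indices(n_examples, batch_size, rng):
    """Yield shuffled index batches that together cover one epoch."""
    inds = list(range(n_examples))
    rng.shuffle(inds)
    for start in range(0, n_examples, batch_size):
        yield inds[start:start + batch_size]

rng = random.Random(0)  # seeded so the sketch is repeatable
batches = list(minibatch_indices(10, 4, rng))
print([len(b) for b in batches])  # [4, 4, 2]
```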

The Layer API

This was quite a bit of work for a relatively simple network. Each layer required us to set up weight and bias variables of the right shape, do some matrix math, and apply an activation function. That work will be basically the same each time we need a new layer, so the TensorFlow Layers API abstracts that work into a single function call.

We use the same placeholders, x and y_label, as before. Now we can create the hidden layer with a single line.

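hidden = tf.layers.dense(x, HIDDEN_SIZE, activation=tf.nn.sigmoid, use_bias=True,
                         kernel_initializer=tf.truncated_normal_initializer(stddev=N_PIXELS**-0.5))  # stddev is assumed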

Because TensorFlow knows the shape of x, it can work out the size of the weight matrix that is needed. With use_bias=True, bias variables are created as well. The activation function can be specified, and the kernel_initializer gives a function to initialize the weight matrix. (The bias is initialized to zero by default.)
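
To make the bookkeeping concrete, here is a toy pure-Python stand-in for such a layer function. It is a sketch of the idea, not of tf.layers.dense itself: the helper dense, its seed argument, and the initialization scale are assumptions, and unlike a real layer it creates fresh weights on every call instead of storing trainable variables.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, units, activation=None, seed=0):
    """Toy fully connected layer over a list of rows (hypothetical helper)."""
    n_in = len(inputs[0])                      # input width inferred from the data
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, n_in ** -0.5) for _ in range(units)]
               for _ in range(n_in)]           # shape: n_in x units
    biases = [0.0] * units                     # biases start at zero
    outputs = []
    for row in inputs:
        z = [sum(x * weights[i][j] for i, x in enumerate(row)) + biases[j]
             for j in range(units)]
        outputs.append([activation(v) for v in z] if activation else z)
    return outputs

batch = [[0.0, 0.5, 1.0], [1.0, 1.0, 0.0]]     # 2 observations, 3 features
hidden = dense(batch, 4, activation=sigmoid)   # 3 -> 4, sigmoid activation
logits = dense(hidden, 2, seed=1)              # 4 -> 2, no activation
print(len(hidden), len(hidden[0]), len(logits[0]))  # 2 4 2
```

The caller only states the number of output units; the weight shape follows from the input, which is the convenience the layer API provides.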

The output layer works much the same way, with the exception of no activation being applied. (Once again, we’ll use the loss function that applies the softmax activation itself.)

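y = tf.layers.dense(hidden, N_CLASSES, activation=None, use_bias=True,
                    kernel_initializer=tf.truncated_normal_initializer(stddev=HIDDEN_SIZE**-0.5))  # stddev is assumed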

The loss and accuracy are defined in the same way as before.

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_label))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1),
                                           tf.argmax(y_label, 1)),
                                  tf.float32))

We are still responsible for running the minimization process by hand. The code is identical to the previous example, as is the performance.

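sgd = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# Variables must be initialized before the graph is run.
sess.run(tf.global_variables_initializer())

for i in range(EPOCHS):
    # Shuffle the training indices so each epoch uses different random batches.
    inds = np.random.permutation(Xtrain.shape[0])
    for j in range(0, len(inds), BATCH_SIZE):
        sess.run(sgd, feed_dict={x: Xtrain[inds[j:j+BATCH_SIZE]],
                                 y_label: ytrain[inds[j:j+BATCH_SIZE]]})
    print(sess.run([loss, accuracy], feed_dict={x: Xtest, y_label: ytest}))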


TFLearn

The layer API still requires us to deal with low-level details of the optimization scheme. A number of projects attempt to provide a higher-level syntax, more reminiscent of scikit-learn estimators. One such project is TFLearn (which should not be confused with tensorflow.contrib.learn).

import tflearn

With TFLearn, we don’t have to worry about setting up placeholders or variables to hold values. Instead, we create a structure for our input features with input_data.

x = tflearn.input_data(shape=[None, N_PIXELS], name="pixels")

As with the layer API, we can create each layer with a single function call. As before, we must specify the input tensor, the number of neurons, and the activation function. Note that we include the softmax activation function on the output layer in this case.

hidden = tflearn.fully_connected(x, HIDDEN_SIZE, activation="sigmoid")
y = tflearn.fully_connected(hidden, N_CLASSES, activation="softmax")

The tflearn.regression layer abstracts away many of the details of the regression model. Instead of creating our own loss function, accuracy measure, and optimization step, we simply specify that the network should be optimizing “categorical_crossentropy” using a stochastic gradient descent technique.

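network = tflearn.regression(y, loss="categorical_crossentropy",
                             optimizer="sgd", learning_rate=0.5)  # learning rate assumed to match the earlier examples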

Finally, we create a model from this network. This model has the .fit() and .predict() methods that we’re used to from scikit-learn. In addition to the training data, the fit method accepts other arguments specifying the details of the optimization scheme. By including a validation set, we get reports on the model’s performance on the test data once per epoch.

model = tflearn.DNN(network)
model.fit(Xtrain, ytrain, n_epoch=EPOCHS, batch_size=BATCH_SIZE,
          validation_set=(Xtest, ytest), show_metric=True)

The performance is very similar to the previous approaches, with a validation cross entropy of about 0.08 and 97% accuracy. There are some small differences due to different initializations of the weights, as well as the random choice of batches, but the underlying algorithm is the same.


Keras

Like TFLearn, Keras provides a high-level API for creating neural networks. It is backend agnostic, running on top of CNTK and Theano in addition to TensorFlow. Nonetheless, it was recently added to the tensorflow.contrib namespace.

from tensorflow.contrib import keras

In Keras, we start with the model object. This specifies how the layers should be laid out. Here, a Sequential model indicates that the layers are to be connected in order.

model = keras.models.Sequential()

The layers are added to the model in order. We need to specify the input dimension on the first layer, but Keras is able to work out the input dimension to the second layer from the output size of the first.

model.add(keras.layers.Dense(HIDDEN_SIZE, activation='sigmoid', input_dim=N_PIXELS))
model.add(keras.layers.Dense(N_CLASSES, activation='softmax'))

The compilation step prepares the model for training, recording the loss function, the optimization scheme, and additional metrics to measure.

model.compile(loss='categorical_crossentropy',
              optimizer=tf.train.GradientDescentOptimizer(0.5), #keras.optimizers.SGD(0.5),
              metrics=['accuracy'])

Then we can fit the model on the training data.

model.fit(Xtrain, ytrain, epochs=EPOCHS, batch_size=BATCH_SIZE,
          validation_data=(Xtest, ytest))


It’s nice to know that the power of raw TensorFlow is available, but most of the time you’ll want a more succinct syntax. The TensorFlow layer API simplifies the construction of a neural network, but not the training. TFLearn and Keras offer two choices for a higher-level API that hides some of the details of training. The Keras API is a bit more object-oriented than the TFLearn API, but their capabilities are similar. Keras’s adoption into the TensorFlow project suggests a bright future for the project, but TFLearn is going strong itself. In the end, choose the API that works best for you.

Author Bios

Dana Mastropole is a Data Scientist in Residence at The Data Incubator and contributes to curriculum development and instruction. Previously, Dana taught elementary school science after completing MIT’s Kaufman teaching certificate program. She studied physics as an undergraduate student at Georgetown University and holds a Master’s in Physical Oceanography from MIT.

Robert Schroll is a Data Scientist in Residence at The Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at The Data Incubator since. Robert holds a PhD in Physics from the University of Chicago.

This post was originally published on September 11, 2017 at
