Opinionated translation of the first lecture of Andrej Karpathy’s course into a step-by-step guide where you’re encouraged to come up with solutions on your own.

Challenge

Build and train a neural network as a binary classifier from scratch.

Questions

  • what is a neural network, from first principles?
  • what is backpropagation?
  • what is gradient descent?
  • how does PyTorch work under the hood?
  • how do you visualize data as graphs?

Prerequisites

  • Basic Python
  • Basic calculus

Key terms

  • GraphViz
  • Jupyter
  • PyTorch
  • Python
  • backpropagation
  • binary classification
  • data science
  • data visualization
  • expression graph
  • gradient descent
  • machine learning
  • neural networks

Milestones

Preparation

You can set up your workspace any way you want, as long as it provides an interactive environment for executing Python code and viewing generated images. Our suggestion is a Jupyter notebook.

Install Python

Create Jupyter notebook

  • install Jupyter: pip install jupyter
  • run Jupyter server: jupyter notebook

Expression graph

Implement a general system for composing symbolic expressions and computing over them. This will become the foundation of your neural network.

Create Value abstraction

  • it represents a float value
  • it supports addition and multiplication with other Values
      x = Value(1.0)
      y = Value(2.0)
      z = Value(3.0)
      (x + y) * z # Value(9.0)
    
  • non-leaf Values store their operation and arguments (a minimal sketch of such a class follows this list)
      x = Value(1.0)
      y = Value(2.0)
      z = x + y # Value(3.0, op=<addition>, args=<x and y>)
    

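A minimal sketch of such a Value class, assuming only addition and multiplication for now; the attribute names op and args are illustrative rather than prescribed by the lecture:

    class Value:
        def __init__(self, data, op='', args=()):
            self.data = data
            self.op = op      # operation that produced this Value ('' for leaf nodes)
            self.args = args  # argument Values of that operation

        def __add__(self, other):
            return Value(self.data + other.data, op='+', args=(self, other))

        def __mul__(self, other):
            return Value(self.data * other.data, op='*', args=(self, other))

        def __repr__(self):
            return f'Value({self.data})'

    x, y, z = Value(1.0), Value(2.0), Value(3.0)
    (x + y) * z  # Value(9.0)
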
Visualize the resulting expression graph

  • install GraphViz: pip install graphviz
  • import relevant class: from graphviz import Digraph
  • draw the expression graph:
    • create graph object

          dot = Digraph()
      
    • add Value nodes to graph

          dot.node(
              name, # unique node identifier
              label, # contents of the node, format depends on the shape
              shape, # visual shape of the node
          )
          # dot.node(name='a', label=f'{{ a | {a:.4f} }}', shape='record')
      
    • connect argument nodes to the output nodes

          dot.edge(
              name_from, # source node name
              name_to, # destination node name
          )
          # dot.edge('a', 'b')
      
    • given the expression (x + y) * z, your graph should contain a node for every Value, with edges going from each argument node to the node holding the result of the operation; the exact layout can vary (a sketch of a drawing helper follows below)
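
One possible drawing helper, assuming the Value class sketched above with its op and args attributes; the function names trace and draw are illustrative:

    from graphviz import Digraph

    def trace(root):
        # walk the expression graph and collect all nodes and edges
        nodes, edges = set(), set()
        def build(v):
            if v not in nodes:
                nodes.add(v)
                for arg in v.args:
                    edges.add((arg, v))
                    build(arg)
        build(root)
        return nodes, edges

    def draw(root):
        dot = Digraph(graph_attr={'rankdir': 'LR'})  # left-to-right layout
        nodes, edges = trace(root)
        for v in nodes:
            # unique name derived from the object id; the label shows op and value
            dot.node(name=str(id(v)), label=f'{{ {v.op} | {v.data:.4f} }}', shape='record')
        for a, b in edges:
            dot.edge(str(id(a)), str(id(b)))
        return dot

In a Jupyter cell, leaving draw((x + y) * z) as the last expression renders the graph inline.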

Implement gradient calculation

  • the gradient of a node is the partial derivative of the final expression (the root of the graph) with respect to that node
      x = Value(1.0)
      y = Value(2.0)
      z = Value(3.0)
      u = x + y
      v = u * z
    

\begin{align} \text{grad}(v) = \frac{dv}{dv} &= 1\\[5pt] \text{grad}(u) = \frac{dv}{du} &= \frac{d(u \cdot z)}{du}=z\\[5pt] \text{grad}(x) = \frac{dv}{dx} &= \frac{dv}{du} \cdot \frac{du}{dx} = z \cdot \frac{d(x + y)}{dx} = z \cdot 1 = z \end{align}

  • implement a backward() method which computes the gradients of the whole graph when called on the root node
  • hints (a sketch follows this list):
    • when creating a new Value as the result of some operation, define a self._backward closure which updates the gradients of the argument nodes
    • consider the case when some Value is used twice: its gradient should accumulate the contributions from both uses
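
A sketch of how these hints fit together, shown for addition only; both functions are meant to live on Value, and it is assumed that every Value starts with grad = 0.0 and a no-op _backward. Multiplication follows the same pattern, with each argument's local derivative being the other argument's data:

    def __add__(self, other):
        out = Value(self.data + other.data, op='+', args=(self, other))
        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1; accumulate with +=
            # so a Value used in several places collects every contribution
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then run each node's _backward in reverse order
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for arg in v.args:
                    build(arg)
                topo.append(v)
        build(self)
        self.grad = 1.0  # d(root)/d(root) = 1
        for v in reversed(topo):
            v._backward()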

Implement more operations

  • subtraction: x - y = x + (y * -1)
  • power: x**k where k is a constant (not a Value)
  • division: x/y = x * (y**-1)
  • exp: x.exp()
  • tanh: x.tanh() (a sketch follows this list)
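
For example, tanh can be sketched as below, using the fact that the derivative of tanh(x) is 1 - tanh(x)^2; as before, this is assumed to be a method on Value:

    import math

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, op='tanh', args=(self,))
        def _backward():
            self.grad += (1 - t**2) * out.grad  # local derivative times upstream gradient
        out._backward = _backward
        return out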

Create and test expression graph for a single neuron

    # assumes your Value also accepts an optional label, used only when drawing the graph
    x1 = Value(2.0, label='x1')
    x2 = Value(0.0, label='x2')
    w1 = Value(-3.0, label='w1')
    w2 = Value(1.0, label='w2')
    b = Value(6.8814, label='b')
    x1w1 = x1 * w1; x1w1.label = 'x1*w1'
    x2w2 = x2 * w2; x2w2.label = 'x2*w2'
    x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
    n = x1w1x2w2 + b; n.label = 'n'
    o = n.tanh(); o.label = 'o'
    o.backward()
    # ==== expected gradients ====
    # x1.grad = -1.5
    # w1.grad = 1.0
    # x2.grad = 0.5
    # w2.grad = 0.0

Neural net

Implement a multi-layer neural network and test it on some data. This abstraction is at the core of modern machine learning and will help you understand more advanced techniques.

Create Neuron abstraction

  • it is defined by a list of weights plus a bias
  • it is callable with a list of input values, producing a squashed output: \[ neuron([x_1, \ldots, x_n]) = \tanh\Big(\sum_i w_i x_i + b\Big) \]

Create Layer abstraction

  • it is defined by a list of neurons
  • it is callable with a list of inputs, producing a list of neuron outputs: \[ layer([x_1, \ldots, x_n]) = [n_j([x_1, \ldots, x_n]) \,|\, n_j \in layer] \]

Create MLP (Multi-Layer Perceptron) abstraction

  • it is defined by a list of layers
  • it is callable with a list of inputs, producing a list of outputs of the last layer: \[ mlp([x_1, \ldots, x_n]) = mlp'(l_1([x_1, \ldots, x_n])) = \ldots = [y_1, \ldots, y_m] \]
  • for convenience, if the last layer consists of only one neuron, return its single output instead of a list (a sketch of all three abstractions follows below)
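
A minimal sketch of the three abstractions, assuming the Value class from the previous milestone; the parameter names nin and nouts are illustrative:

    import random

    class Neuron:
        def __init__(self, nin):
            # one weight per input plus a bias, all initialized randomly
            self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
            self.b = Value(random.uniform(-1, 1))

        def __call__(self, x):
            # wrap plain numbers so that w_i * x_i works even for float inputs
            x = [xi if isinstance(xi, Value) else Value(xi) for xi in x]
            act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
            return act.tanh()

    class Layer:
        def __init__(self, nin, nout):
            self.neurons = [Neuron(nin) for _ in range(nout)]

        def __call__(self, x):
            return [n(x) for n in self.neurons]

    class MLP:
        def __init__(self, nin, nouts):
            sizes = [nin] + nouts
            self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

        def __call__(self, x):
            for layer in self.layers:
                x = layer(x)
            return x[0] if len(x) == 1 else x  # unwrap a single-output last layer

For example, mlp = MLP(3, [4, 4, 1]) builds a network with 3 inputs, two hidden layers of 4 neurons each, and a single output.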

Create a test dataset for binary classification

  • define some sample data, e.g.
      # sets of inputs
      xs = [
          [2.0,3.0,-1.0],
          [3.0,-1.0,0.5],
          [0.5,1.0,1.0],
          [1.0,1.0,-1.0],
      ]
      # ground truth (aka expected) outputs
      ys_gt = [1.0,-1.0,-1.0,1.0]
    
  • run your MLP on it
      # predicted (aka actual) outputs
      ys_pred = [mlp(x) for x in xs]
      # e.g. [Value(-0.79), Value(-0.29), Value(0.65), Value(0.23)], your exact values will differ
    

Compute the loss

  • it measures how far the MLP's predictions are from the ground truth: the smaller the loss, the better
  • there are different loss functions, but we will use Mean Squared Error (MSE); a sketch of the computation follows the formula

\[ loss = \sum_j(y_{pred}^j - y_{gt}^j)^2 \]
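
One way to compute it, assuming the xs, ys_gt, and ys_pred from above together with the subtraction and power operations from the previous milestone; the ground-truth floats are wrapped in Value so that only Value arithmetic is needed:

    # sum of squared differences between predicted and ground-truth outputs
    loss = sum(((yp - Value(ygt)) ** 2 for yp, ygt in zip(ys_pred, ys_gt)), Value(0.0))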

Update MLP parameters

  • add a parameters() method to your MLP which returns the list of all of its weights and biases (a sketch follows this list)
  • compute the gradients starting from the loss
  • update parameters to decrease the loss
    • hint: nudge in the opposite direction to the gradient
            rate = 0.001
            for p in mlp.parameters():
                p.data += rate * -p.grad
      
  • compute the loss once again and see that it gets smaller, which means the predictions are getting closer to the ground truth
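
A sketch of parameters(), assuming the Neuron/Layer/MLP classes sketched earlier; each method below is meant to be added to the corresponding class rather than used standalone:

    # inside Neuron: its parameters are the weights plus the bias
    def parameters(self):
        return self.w + [self.b]

    # inside Layer: concatenate the parameters of all of its neurons
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

    # inside MLP: concatenate the parameters of all of its layers
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]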

Create a cycle: Prediction-Loss-Backprop-Update

  • iterate N times (a sketch of the full loop follows this list):
    • compute the predictions
    • compute the loss
    • backprop gradients from the loss
    • update MLP parameters
  • look at predicted values to see how close they got to the ground truth
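
Putting it together, a sketch of the loop, assuming the mlp, xs, and ys_gt from above; N and rate are illustrative values, and gradients are reset before every backward pass so they do not accumulate across iterations:

    N, rate = 20, 0.05
    for step in range(N):
        # forward pass: predictions and loss
        ys_pred = [mlp(x) for x in xs]
        loss = sum(((yp - Value(ygt)) ** 2 for yp, ygt in zip(ys_pred, ys_gt)), Value(0.0))

        # backward pass: reset gradients, then backprop from the loss
        for p in mlp.parameters():
            p.grad = 0.0
        loss.backward()

        # update: nudge each parameter against its gradient
        for p in mlp.parameters():
            p.data += rate * -p.grad

        print(step, loss.data)

    print([yp.data for yp in [mlp(x) for x in xs]])  # should be approaching ys_gt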

Conclusion

You’ve just created, trained, and used a real neural network. Even though modern machine learning techniques differ in many details, they share the same core ideas, and understanding those will help you advance in this field. That said, even this simple neural network can already be used for a variety of tasks, e.g. predicting housing prices or recognizing hand-written digits.

Project submission

Please reply to this tweet with a link to the project repository (e.g. on GitHub) to mark this project as complete and to build up your portfolio.

Self-assessment

  • What did you learn?
  • How did you like it?
  • Do you want to continue with similar projects?
  • How would you use acquired skills?
  • Do you have an idea for a project which would use these skills?

What’s next?