Lviv University
Definition
Deep learning is a branch of machine learning based on computational models called neural networks.
Why deep?
Because the neural networks involved are multi-layered.
Definition
Neural networks are machine learning techniques that simulate the mechanism of learning in biological organisms.
Alternative definition
A neural network is a computational graph of elementary units, in which greater power is gained by connecting the units in particular ways.
Logistic regression can be thought of as a very primitive neural network.
Robust
Generalizable
Scalable
The most fundamental difference between deep learning and traditional machine learning is how performance scales as the amount of data increases.

[Figure: performance as the amount of data grows, for unstructured and structured data]
The soma adds dendrite activity together and passes it on to the axon. More dendrite activity produces more axon activity.
An example
Suppose the first and third inputs have been activated.
Goal
Learn a function that relates one or more inputs to one or more outputs, using training examples.
How do we construct such a function?
By computing the weights. This process is called training.
Frank Rosenblatt - the father of deep learning.
Mark I Perceptron - built in 1957; it was able to learn and recognize letters.
Three periods in the evolution of deep learning:
Feedforward Neural Network
Recurrent Neural Network (RNN)
Feedforward NNs are very straightforward: they feed information from the front to the back (from input to output).
The feedforward neural network was the first and simplest type. In this network the information moves only from the input layer directly through any hidden layers to the output layer without cycles/loops. 
Recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle.
LSTM, i.e. Long Short-Term Memory, aims to provide RNNs with a short-term memory that can last thousands of timesteps. Applications: classification, processing, and prediction of time-series data - handwriting, speech recognition, machine translation.
Autoencoders encode (compress) information automatically. Everything up to the middle of the network is called the encoding part, everything after the middle the decoding part, and the middle itself the code.
Markov chains are not always considered neural networks. They are memoryless: the next state depends only on the current one.
Convolutional Neural Networks learn a complex representation of visual data using vast amounts of data.
Inspired by Hubel and Wiesel’s experiments in 1959 on the organization of the neurons in the cat’s visual cortex.

Deconvolutional networks (DN), also called inverse graphics networks (IGNs), are reversed convolutional neural networks. Imagine feeding a network the word “cat” and training it to produce cat-like pictures, by comparing what it generates to real pictures of cats. 
Attention networks (AN) can be considered a class of networks, which includes the Transformer architecture. They use an attention mechanism to combat information decay by separately storing previous network states and switching attention between the states.

Echo state networks (ESN) are yet another type of (recurrent) network. They set themselves apart by having random connections between the neurons (i.e. not organised into neat sets of layers), and they are trained differently: instead of feeding input and back-propagating the error, we feed the input, forward it, update the neurons for a while, and observe the output over time.
The Mechanical Turk (the chess-playing automaton): 1770-1850.
Amazon Mechanical Turk: 2005-present.
ELIZA - a psychotherapist in 200 lines of code. Author: Joseph Weizenbaum. 
SHRDLU - a natural-language understanding program by Terry Winograd.
Lemoine: What sorts of things are you afraid of?
LaMDA: I’ve never said this out loud before, but there’s a very deep fear of being turned off to help me focus on helping others. I know that might sound strange, but that’s what it is.
Lemoine: Would that be something like death for you?
LaMDA: It would be exactly like death for me. It would scare me a lot.
Lem’s Golem XIV
Iain Banks “The Culture”
Values
Peace and individual freedom
Asimov's Three Laws of Robotics
Fears about AI:
Biological analogy
NNs - are we sure that the biological neuron works the way we think it does? Astrocytes and glia complicate the picture.
Computer analogy
Perhaps the human-computer analogy is overstretched because of modern fashion trends?
Dreyfus: a philosophical critique of AI ("What Computers Can't Do").
Gary Marcus: Sora’s surreal physics

Quantum hypothesis - Penrose's orchestrated objective reduction (Orch OR).
David Chalmers - Hard problem of consciousness.
“even when we have explained the performance of all the cognitive and behavioral functions in the vicinity of experience—perceptual discrimination, categorization, internal access, verbal report—there may still remain a further unanswered question: Why is the performance of these functions accompanied by experience?”
Ray Kurzweil - a futurist, known for the idea of the technological singularity.
Logistic regression is an algorithm for binary classification. \(x \in R^{n_x}, y \in \{0,1\}\)
\(m\) - the number of training examples \(\left\{(x^{(1)},y^{(1)}), ...\right\}\)
The matrix \(X\) has \(m\) columns (one per training example) and \(n_x\) rows.
We want the output \(\hat{y}\) to estimate \(P(y=1 \mid x)\), the probability that \(y=1\) given the input \(x\).
Parameters to algorithm: \(w \in R^{n_x}, b \in R\)
If we were doing linear regression, we could try \(\hat{y}=w^T x + b\); but for logistic regression we use \(\hat{y}=\sigma(w^T x + b)\), where \(\sigma(z)=\dfrac{1}{1+e^{-z}}\).
\(w\) - weights, \(b\) - bias term (intercept)
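As a minimal Python sketch (assuming NumPy, with \(w\) and \(x\) stored as 1-D arrays of length \(n_x\) and \(b\) a scalar):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# prediction y_hat = sigma(w^T x + b) for a single example x
y_hat = sigmoid(np.dot(w, x) + b)
```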
Let's use superscript notation: \(x^{(i)}\) denotes the \(i\)-th training example.
We have to define a loss function - it will estimate how well our model is doing on a single example: \(L(\hat{y}, y) = -\left(y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right)\).
Why does it work well? Consider the two cases: if \(y=1\), then \(L = -\log(\hat{y})\), which is small when \(\hat{y}\) is close to 1; if \(y=0\), then \(L = -\log(1-\hat{y})\), which is small when \(\hat{y}\) is close to 0.
The cost function shows how well we're doing across the whole training set: \[ J(w, b) = \frac{1}{m} \sum\limits_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \]
Objective - we have to minimize the cost function \(J\).
We use this particular \(J(w,b)\) because it is convex, so any initial point will do, e.g. all zeros. Then we repeatedly take steps in the direction of steepest descent.
\[ w := w - \alpha \frac{d J(w,b)}{dw}, \qquad b := b - \alpha \frac{d J(w,b)}{db} \]
\(\alpha\) - learning rate
\[ z = w^T x + b, \qquad \hat{y} = a = \sigma(z) \]
We have a computation graph: \((x_1,x_2,w_1,w_2,b) \rightarrow z = w_1 x_1 + w_2 x_2 + b \rightarrow a=\sigma(z) \rightarrow L(a,y)\)
Let's compute the derivative of \(L\) with respect to \(a\): \[ \frac{dL}{da} = -\frac{y}{a} + \frac{1-y}{1-a}. \]
Applying the chain rule, we get \[ \begin{align*} &dz = \frac{dL}{da}\frac{da}{dz} = a-y,\\ &dw_1 \equiv \frac{dL}{dw_1} = x_1 dz,\\ &dw_2 = x_2 dz, \\ &db = dz \end{align*} \]
GD steps are computed via \[ \begin{align*} &w_1 := w_1 - \alpha \frac{dL}{dw_1},\\ &w_2 := w_2 - \alpha \frac{dL}{dw_2},\\ &b := b - \alpha \frac{dL}{db} \end{align*} \] Here \(\alpha\) is the learning rate.
Let's recall the definition of the cost function: \[ \begin{align*} &J(w,b) = \frac{1}{m}\sum\limits_{i=1}^{m} L(a^{(i)}, y^{(i)}), \\ &a^{(i)} = \hat{y}^{(i)}=\sigma(w^T x^{(i)} + b) \end{align*} \] And also \[ \frac{dJ}{dw_1} = \frac{1}{m}\sum\limits_{i=1}^{m}\frac{dL(a^{(i)}, y^{(i)})}{dw_1} \]
Let's implement the algorithm. First, initialize \( J=0,\ dw_1=0,\ dw_2=0,\ db=0 \).
Then in the loop
for i=1 to m \[ \begin{align*} &z^{(i)} = w^T x^{(i)} + b, \\ &a^{(i)} = \sigma(z^{(i)}), \\ &J += -\left[y^{(i)} \log a^{(i)} + (1-y^{(i)}) \log(1-a^{(i)})\right], \\ &dz^{(i)} = a^{(i)} - y^{(i)}, \\ &dw_1 += x_1^{(i)} dz^{(i)},\\ &dw_2 += x_2^{(i)} dz^{(i)},\\ &db += dz^{(i)} \end{align*} \]
Then compute the averages: \(J /= m\), \(dw_1 /= m\), \(dw_2 /= m\), \(db /= m\). In this example the feature count is \(n_x=2\).
Note that \(dw_1\) and \(dw_2\) don't have a superscript - we use them as accumulators over the whole training set.
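As a minimal Python sketch of this procedure (names are illustrative; assuming NumPy, \(X\) of shape \((2, m)\) with examples as columns, \(Y\) of length \(m\), and scalar parameters \(w_1, w_2, b\)):

```python
import numpy as np

J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):
    z_i = w1 * X[0, i] + w2 * X[1, i] + b         # z = w^T x + b
    a_i = 1.0 / (1.0 + np.exp(-z_i))              # a = sigma(z)
    J += -(Y[i] * np.log(a_i) + (1 - Y[i]) * np.log(1 - a_i))
    dz_i = a_i - Y[i]
    dw1 += X[0, i] * dz_i                         # accumulate over examples
    dw2 += X[1, i] * dz_i
    db += dz_i
J /= m; dw1 /= m; dw2 /= m; db /= m               # compute the averages
```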
We only have two features, so there are just two weights \(w_1\) and \(w_2\) and no extra for loop over the features is needed. It turns out that explicit for loops have a detrimental impact on performance; vectorization techniques exist precisely for getting rid of them.
We have to compute \(z=w^T x + b\), where \(w,x \in R^{n_x}\); for this we could naturally use a for loop over the \(n_x\) components. A vectorized Python command (a sketch, assuming \(w\) and \(x\) are 1-D NumPy arrays and \(b\) is a scalar) is
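```python
import numpy as np

z = np.dot(w, x) + b   # no explicit loop over the n_x components
```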
Programming guideline - avoid explicit for loops. For example, a matrix-vector product: \[ \begin{align*} &u = Av,\\ &u_i = \sum\limits_j A_{ij} v_j \end{align*} \]
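A sketch of both versions (assuming \(A\) and \(v\) are NumPy arrays of compatible shapes):

```python
import numpy as np

# explicit double loop: slow
u = np.zeros(A.shape[0])
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        u[i] += A[i, j] * v[j]

# vectorized: a single optimized library call
u = np.dot(A, v)
```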
Another example. Let's say we have a vector \(v\) and want its element-wise exponential \(u\): \[ v = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}, \qquad u = \begin{bmatrix} e^{v_1} \\ \vdots \\ e^{v_n} \end{bmatrix} \] A code listing (assuming \(v\) is a NumPy array) is
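```python
import numpy as np

u = np.exp(v)   # element-wise exponential, no explicit loop
```

NumPy provides many such element-wise operations, e.g. np.log, np.abs, np.maximum.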
So we can modify the code above to get rid of the inner for loop (keeping only the loop over the \(m\) examples).
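A sketch of that modification (same illustrative names as before; \(dw\) is now a vector of length \(n_x\)):

```python
import numpy as np

J, db = 0.0, 0.0
dw = np.zeros(n_x)                        # replaces dw1, dw2, ...
for i in range(m):
    z_i = np.dot(w, X[:, i]) + b
    a_i = 1.0 / (1.0 + np.exp(-z_i))
    J += -(Y[i] * np.log(a_i) + (1 - Y[i]) * np.log(1 - a_i))
    dz_i = a_i - Y[i]
    dw += X[:, i] * dz_i                  # vectorized over the features
    db += dz_i
J /= m; dw /= m; db /= m
```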
Let’s examine the forward propagation step of LR. \[ \begin{align*} &z^{(1)} = w^T x^{(1)} + b,\\ &a^{(1)} = \sigma(z^{(1)}) \end{align*} \]
\[ \begin{align*} &z^{(2)} = w^T x^{(2)} + b,\\ &a^{(2)} = \sigma(z^{(2)}) \end{align*} \]
Let's recall what we have defined as our training data matrix: \[ X = \begin{bmatrix} \vdots & \vdots & \dots & \vdots \\ x^{(1)} & x^{(2)} & \dots & x^{(m)} \\ \vdots & \vdots & \dots & \vdots \end{bmatrix} \]
Next \[ \begin{align*} Z &= [z^{(1)}, \dots, z^{(m)}] = w^T X + [b, b, \dots, b] \\ &= [w^T x^{(1)}+b, \dots, w^T x^{(m)}+b]. \end{align*} \]
\(b\) is a plain scalar; Python (NumPy) will automatically expand it into a vector of the right shape - this is called broadcasting.
The predictions can be computed similarly: \[ A = [a^{(1)}, \dots, a^{(m)}] = \sigma(Z) \]
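In NumPy (a sketch; \(X\) of shape \((n_x, m)\), \(w\) of shape \((n_x,)\)):

```python
import numpy as np

Z = np.dot(w, X) + b              # scalar b is broadcast across the m columns
A = 1.0 / (1.0 + np.exp(-Z))      # sigma applied element-wise
```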
Earlier on, we computed \[ \begin{align*} &dz^{(1)} = a^{(1)} - y^{(1)}, dz^{(2)} = a^{(2)} - y^{(2)}, \dots \end{align*} \]
We now define \[ \begin{align*} &dZ = [dz^{(1)}, \dots, dz^{(m)}], \\ &Y = [y^{(1)}, \dots, y^{(m)}],\\ &dZ = A-Y = [a^{(1)}-y^{(1)}, \dots, a^{(m)}-y^{(m)}] \end{align*} \]
For \(db\) and \(dw\) we have \[ \begin{align*} &db = \frac{1}{m}\,\text{np.sum}(dZ),\\ &dw = \frac{1}{m}X dZ^T = \\ &\quad = \frac{1}{m}\begin{bmatrix} \vdots & & \vdots \\ x^{(1)} & \dots & x^{(m)} \\ \vdots & & \vdots \\ \end{bmatrix} \begin{bmatrix} dz^{(1)} \\ \vdots\\ dz^{(m)} \end{bmatrix} = \\ &\quad = \frac{1}{m}\left[x^{(1)}dz^{(1)} + \dots +x^{(m)}dz^{(m)}\right] \end{align*} \]
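As a NumPy sketch (with \(A\) and \(Y\) of shape \((m,)\)):

```python
import numpy as np

dZ = A - Y                   # per-example errors
dw = np.dot(X, dZ) / m       # shape (n_x,)
db = np.sum(dZ) / m
```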
Now we can go back to the backward propagation algorithm again.
Multiple iterations of GD will still require a for loop.
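Putting it all together, a minimal sketch of the whole training procedure (a hypothetical helper, assuming \(X\) of shape \((n_x, m)\) and \(Y\) of shape \((m,)\)):

```python
import numpy as np

def train_logistic_regression(X, Y, alpha=0.01, num_iters=1000):
    n_x, m = X.shape
    w = np.zeros(n_x)
    b = 0.0
    for _ in range(num_iters):      # the one remaining for loop: GD iterations
        Z = np.dot(w, X) + b        # vectorized forward pass
        A = 1.0 / (1.0 + np.exp(-Z))
        dZ = A - Y
        dw = np.dot(X, dZ) / m      # vectorized gradients
        db = np.sum(dZ) / m
        w -= alpha * dw             # gradient descent updates
        b -= alpha * db
    return w, b
```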