Lviv University
Hyperparameters
Important
Train and dev sets should come from the same distribution.
Examples
Solutions for high bias
Solutions for high variance
Bias-variance tradeoff
A classical machine learning concept. In deep learning, a bigger network (which reduces bias) and more data (which reduces variance) let us attack both sides of the tradeoff.
Definition
Regularization is a strategy used in machine/deep learning designed to reduce the test error, possibly at the expense of increased training error.
Definition 2
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
\[\begin{align*} &\hat{y} = \sum\limits_{i=0}^d w_i x_i, \\ &L = \sum (y-\hat{y})^2 \end{align*}\]
Soft penalty
A larger value of \(d\) increases overfitting; decreasing \(d\) means using fewer parameters.
Instead of reducing the number of parameters, we can apply a soft penalty.
\(L_2\) regularization
Consider logistic regression. We introduce an additional summand:
\(J(w, b) = \frac{1}{m} \sum\limits_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}\|w\|_2^2\)
\(\|w\|_2^2 = \sum\limits_{j=1}^{n_x} w_j^2 = w^T w\).
\(\lambda\) is called a regularization parameter.
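As a minimal sketch (the function name and array shapes are illustrative, not from the source), the \(L_2\)-regularized cost could be computed like this:

```python
import numpy as np

def l2_regularized_cost(A, Y, w, lambd):
    """Cross-entropy cost of logistic regression plus the L2 penalty (lambda/2m)*||w||_2^2.
    A holds the predictions, Y the labels, both of shape (1, m)."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))
    return cross_entropy + l2_penalty
```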
Why don’t we regularize \(b\)?
In practice \(w\) is high-dimensional while \(b\) is a single number per unit, so including \(b\) in the penalty makes almost no difference.
\(L_1\) regularization
\(J(w, b) = \frac{1}{m} \sum\limits_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) +\dfrac{\lambda}{2m}\|w\|_1\).
Here \(\|w\|_1 =\sum\limits_{j=1}^{n_x} |w_j|\).
With \(L_1\) regularization we end up with sparse vectors \(w\), meaning they contain many zeros.
A general case
For neural network, we add this to the cost function: \[ J(\vec{W}, \vec{b}) = \frac{1}{m} \sum\limits_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}\sum\limits_{l=1}^L\|\vec{W}^{[l]}\|_F^2 \] We define the matrix norm (Frobenius norm) as \[ \|\vec{W}^{[l]}\|_F^2 = \sum\limits_{i=1}^{n^{[l]}}\sum\limits_{j=1}^{n^{[l-1]}}\left(w_{ij}^{[l]}\right)^2. \]
A general case
New \(dW^{[l]}\) becomes \[ dW^{[l]} = (\text{old one}) + \dfrac{\lambda}{m} W^{[l]}. \]
Weight decay
\(L_2\) regularization is sometimes called weight decay. The gradient descent step: \[\begin{align*} &\vec{W}^{[l]} = \vec{W}^{[l]}-\alpha\left((\text{old one}) + \dfrac{\lambda}{m} \vec{W}^{[l]}\right) = \\ &= \vec{W}^{[l]}\left(1-\dfrac{\alpha \lambda}{m}\right) - \alpha(\text{old one}). \end{align*}\]
\(L_2\) regularizer encourages weight values to decay towards \(0\).
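A hedged sketch of the corresponding gradient step, assuming `dW_unregularized` holds the gradient of the unregularized cost:

```python
import numpy as np

def update_with_weight_decay(W, dW_unregularized, alpha, lambd, m):
    """Gradient step with L2 regularization: dW gets an extra (lambda/m)*W term,
    which is equivalent to shrinking W by (1 - alpha*lambda/m) before the usual step."""
    dW = dW_unregularized + (lambd / m) * W
    return W - alpha * dW
```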
\(L_1\) vs \(L_2\)
\(L_2\)-regularization impact
Definition
Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models. The idea is to train several different models separately, then have all the models vote on the output for test examples.
Ensemble methods
This is an example of a general strategy in machine learning called model averaging.
Techniques employing this strategy are known as ensemble methods.
Rationale
Different models will usually not make the same errors on the test set.
Bagging
Sampling
Aka DropConnect.
Overview
Inverted dropout
Create a random matrix e.g. for layer 3:

\[\begin{align*} &d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep\_prob \\ &a3 = np.multiply(a3, d3) \\ &a3 /= keep\_prob \end{align*}\]
Dividing by keep_prob ensures that the expected value of the activations remains the same. At test time we do not use dropout.
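A small runnable sketch of inverted dropout for one layer (the function name and the returned mask are illustrative):

```python
import numpy as np

def inverted_dropout(a, keep_prob):
    """Apply inverted dropout to the activations a (training time only)."""
    d = np.random.rand(*a.shape) < keep_prob   # keep each unit with probability keep_prob
    a = a * d                                  # zero out the dropped units
    a = a / keep_prob                          # rescale so E[a] stays the same
    return a, d                                # d is reused during backprop
```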
Features
keep_prob can be chosen separately for each layer.
Downside
We don’t have a well-defined cost function.
Notes
We divide by keep_prob to keep the same expected value for the activations.
Data augmentation
For example, flip images horizontally to generate extra training examples, or apply random distortions.
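A minimal sketch of one such augmentation, assuming images are stored as a (batch, height, width, channels) array:

```python
import numpy as np

def augment_flip(images):
    """Create extra training examples by flipping images horizontally."""
    flipped = images[:, :, ::-1, :]           # reverse the width axis
    return np.concatenate([images, flipped])  # original + flipped samples
```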
Early stopping
Orthogonalization
Orthogonalization: think about minimizing cost and not overfitting separately.
Downside
Downside of early stopping is that it merges these two tasks.
Early stopping
Note
It is probably the most commonly used form of regularization in deep learning. Its popularity is due to both its effectiveness and its simplicity.
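A rough sketch of an early-stopping loop; `train_step` and `eval_dev_cost` are hypothetical placeholders for one epoch of training and the dev-set evaluation:

```python
def train_with_early_stopping(train_step, eval_dev_cost, max_epochs=100, patience=5):
    """Stop training when the dev-set cost has not improved for `patience` epochs."""
    best_cost, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        cost = eval_dev_cost()
        if cost < best_cost:
            best_cost, epochs_without_improvement = cost, 0
            # here one would also snapshot the current weights
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
```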
Noise injection: adding random values drawn from a Gaussian distribution (to the inputs or to the weights) also acts as a regularizer.
Impact
learning_rate.
Vanishing / Exploding Gradients

If we use the linear activation function \(g(z)=z\) (and ignore the biases), then we can show that \(\hat{y} = W^{[L]} W^{[L-1]} \dots W^{[1]} x\).
Suppose the weight matrices look like this: \[\begin{align*} & W^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix} \end{align*}\] Then \(\hat{y} = W^{[L]}\, 1.5^{\,L-1} x\), which grows exponentially with \(L\).
So the value of \(\hat{y}\) will explode. Conversely, if we have 0.5s in the weight matrix, then activation values will vanish.
Same thing will happen to derivatives.
This problem can be solved by careful initialization of the weights.
Suppose we have a single neuron.
So we’ll have \(z=w_1 x_1 + \dots + w_n x_n\).
The larger \(n\) becomes, the smaller the weights \(w_i\) should be.
We can set the variance of w to be \(\dfrac{1}{n}\) for \(tanh\) (Xavier initialization) (or \(\dfrac{2}{n}\) for ReLU).
Sometimes this variant is also used: \(\sqrt{\dfrac{2}{n^{[l-1]}+n^{[l]}}}\). \[ W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(1/n^{[l-1]}) \]
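A sketch of this initialization for a whole network (the layer-size list and dictionary layout are illustrative):

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu"):
    """He initialization (variance 2/n_prev) for ReLU, Xavier (1/n_prev) for tanh.
    layer_dims = [n_x, n_1, ..., n_L]."""
    params = {}
    for l in range(1, len(layer_dims)):
        scale = 2.0 if activation == "relu" else 1.0
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(scale / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```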
Kinds
Example
\[\begin{align*} & z = h(w_1 x + b_1) \\ & y = h(w_2 z + b_2) \\ & h(a) = \ln (1+ \exp(a)) \text{ (soft ReLU)} \\ & y(x) = h(w_2 h(w_1 x + b_1) + b_2)\\ &\dfrac{\partial y}{\partial w_1} = \dfrac{w_2 x \exp\left(w_1 x + b_1 + b_2 + w_2 \ln\left[1+e^{w_1 x + b_1}\right]\right)}{\left(1 + e^{w_1 x + b_1}\right)\left(1+\exp\left(b_2 + w_2 \ln \left[1+e^{w_1 x + b_1}\right]\right)\right)}. \end{align*}\]
Idea
Automatically generate the code for gradient calculations, based on forward propagation equations.
It augments the forward prop code with additional variables.

Forward-mode
\[\begin{align*} &f(x_1, x_2) = x_1 x_2 + exp(x_1 x_2) - sin(x_2) \\ & \text { for } \dfrac{\partial f}{\partial x_1} \text {define tangent variables } \dot{v_i} = \dfrac{\partial v_i}{\partial x_1} \\ & \dot{v_i} = \dfrac{\partial v_i}{\partial x_1} = \sum\limits_{j \in pa(i)} \dfrac{\partial v_j}{\partial x_1}\dfrac{\partial v_i}{\partial v_j} = \sum\limits_{j \in pa(i)} \dot{v_j}\dfrac{\partial v_i}{\partial v_j}\\ & pa(i) \text{ being set of parents to node i} \end{align*}\]
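A hand-written forward-mode sketch for this particular \(f\), where each intermediate value carries its tangent \(\dot{v_i}\):

```python
import math

def f_and_dfdx1(x1, x2):
    """Forward-mode sketch for f(x1, x2) = x1*x2 + exp(x1*x2) - sin(x2):
    each intermediate v_i is paired with its tangent dv_i = dv_i/dx1."""
    v1, dv1 = x1, 1.0                     # v1 = x1, tangent of x1 w.r.t. x1 is 1
    v2, dv2 = x2, 0.0                     # v2 = x2, tangent 0
    v3, dv3 = v1 * v2, dv1 * v2 + v1 * dv2
    v4, dv4 = math.exp(v3), math.exp(v3) * dv3
    v5, dv5 = math.sin(v2), math.cos(v2) * dv2
    v6, dv6 = v3 + v4 - v5, dv3 + dv4 - dv5
    return v6, dv6                        # f(x1, x2) and df/dx1
```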
Reverse-mode
Augment each intermediate variable \(v_i\) with an additional variable \(\overline{v_i}\).
\(f\) - output function. \[\begin{align*} & \overline{v_i} = \dfrac{\partial f}{\partial v_i} = \sum\limits_{j \in pa(i)} \dfrac{\partial f}{\partial v_j}\dfrac{\partial v_j}{\partial v_i} = \sum\limits_{j \in pa(i)} \overline{v_j}\dfrac{\partial v_j}{\partial v_i}\\ \end{align*}\]
Note
Backpropagation is a special-case of reverse-mode autodiff.
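For comparison, a hand-written reverse-mode sketch for the same \(f\): one forward pass stores the intermediates, one backward pass accumulates the adjoints \(\overline{v_i}\):

```python
import math

def f_and_grad_reverse(x1, x2):
    """Reverse-mode sketch for f(x1, x2) = x1*x2 + exp(x1*x2) - sin(x2)."""
    # forward pass
    v3 = x1 * x2
    v4 = math.exp(v3)
    v5 = math.sin(x2)
    f = v3 + v4 - v5
    # backward pass: adjoints bar_v = df/dv
    bar_f = 1.0
    bar_v3, bar_v4, bar_v5 = bar_f, bar_f, -bar_f
    bar_v3 += bar_v4 * math.exp(v3)                  # v4 = exp(v3)
    bar_x1 = bar_v3 * x2                             # v3 = x1 * x2
    bar_x2 = bar_v3 * x1 + bar_v5 * math.cos(x2)     # v5 = sin(x2)
    return f, bar_x1, bar_x2
```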
Gradient checking
We check analytical gradients numerically.
The two-sided difference \(\dfrac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}\) gives a much better approximation of the derivative: its error is \(O(\epsilon^2)\), while for the one-sided approximation it is \(O(\epsilon)\).
Procedure
First, we take all our parameters \(W^{[l]}, b^{[l]}\) and reshape them into one data vector \(\theta\).
So the cost function is transformed in the following way: \[\begin{align*} &J(W^{[1]}, b^{[1]},\dots, W^{[L]}, b^{[L]}) = J(\theta) \end{align*}\] The differentials can also be reshaped into a vector \(d\theta\).
Procedure
Then we compute a differential for each \(i\): \[\begin{align*} &d\theta_{approx}[i] = \dfrac{J(\theta_1,\dots, \theta_i+\epsilon,\dots) - J(\theta_1,\dots, \theta_i-\epsilon,\dots)}{2\epsilon} \approx d\theta[i]. \end{align*}\] How do we check? \[ val = \dfrac{\|d\theta_{approx}-d\theta\|_2}{\|d\theta_{approx}\|_2 +\|d\theta\|_2} \]
OK
If \(\epsilon = 10^{-7}\) and \(val=10^{-7}\), then everything’s great.
Not OK
If \(val\) is large (say around \(10^{-3}\) or more), the gradient computation should be rechecked.
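A minimal sketch of the whole check, assuming `J` takes the flattened parameter vector and `dtheta` holds the analytical gradients:

```python
import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    """Compare analytical gradients dtheta with a two-sided numerical approximation."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    # relative difference between numerical and analytical gradients
    return (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
```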
Notes
If we split the training set into smaller sets, we call them mini-batches.
We'll denote the first mini-batch \(X^{\{1\}}\); same for \(Y\).
for t in 1...5000:
\[\begin{align*} &\text{forward prop on } X^{\{t\}} \\ &Z^{[1]} = W^{[1]}X^{\{t\}} + b^{[1]}\\ &A^{[1]} = g^{[1]}(Z^{[1]})\\ &\vdots \\ &A^{[l]} = g^{[l]}(Z^{[l]})\\ &J^{\{t\}} = \dfrac{1}{1000} \sum L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2 * 1000} \sum \|w^{[l]}\|^2_F \\ &\text{backward prop to compute gradients of } J^{\{t\}}\\ & w^{[l]} = w^{[l]} - \alpha dw^{[l]},\, b^{[l]} = b^{[l]} - \alpha db^{[l]} \end{align*}\] So we take 5000 gradient descent steps in one epoch.
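A sketch of how the mini-batches \(X^{\{t\}}, Y^{\{t\}}\) could be built (shuffling first is a common choice; the function name is illustrative):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=1000, seed=0):
    """Shuffle the training set and split it into mini-batches.
    X has shape (n_x, m), Y has shape (1, m)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for t in range(0, m, batch_size):
        mini_batches.append((X_shuffled[:, t:t + batch_size],
                             Y_shuffled[:, t:t + batch_size]))
    return mini_batches
```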

Note
Mini-batch cost will have oscillations, depending on characteristics of mini-batches.
Note
Stochastic GD (mini-batch size 1) oscillates a lot. Its disadvantage is that we lose the speedup from vectorization.
Guidelines for mini-batch size
Momentum: intuition
Speed
Faster than GD.
Exponentially weighted (moving) averages of daily temperatures: \[ V_t = \beta V_{t-1} + (1-\beta)\theta_t \] where \(V_0=0\) and \(\theta_i\) is the temperature on day \(i\).
\(V_t\) averages over approximately the last \(\dfrac{1}{1-\beta}\) days' temperatures.
E.g., if \(\beta=0.9\), then \(V_t\) averages over last 10 days.
High values
With high \(\beta\) values, we get a much smoother plot, but shifted to the right, because large \(\beta\) values cause slower adaptation of the graph.
Low values
With smaller \(\beta\) values, the graph is noisier, but it adapts faster.
Procedure
Recursively expand \(V_{100}\): \[ V_{100} = \sum\limits_{i=1}^{100} (1-\beta)\beta^{100-i} \theta_i \] All these coefficients add up to a number close to \(1\).
We multiply the daily temperatures by an exponentially decaying function. Note that \((1-\epsilon)^{1/\epsilon} \approx \dfrac{1}{e}\) for small \(\epsilon\), so with \(\beta = 1-\epsilon\) the weight drops to about \(1/e\) after \(\dfrac{1}{1-\beta}\) days.
Implementation
\[ v_{\theta} := \beta v_{\theta} + (1-\beta)\theta_t \] This is very efficient in terms of both computation and memory.
Bias Correction in Exponentially Weighted Averages
The problem is that the curve starts too low, because \(v_0=0\).

Suggestion
Divide \(v_t\) by \(1-\beta^t\).
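A small sketch combining the running average with this bias correction:

```python
def exponentially_weighted_average(thetas, beta=0.9, bias_correction=True):
    """Compute V_t = beta*V_{t-1} + (1-beta)*theta_t over a sequence,
    optionally dividing by (1 - beta^t) to correct the low start."""
    v, averages = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return averages
```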
Basic idea
Compute an exponentially weighted average of the gradients and use it in the GD update step. This helps gradient descent navigate flat regions and local optima.
Procedure
On iteration \(t\) we compute \(v_{dw} = \beta v_{dw} + (1-\beta)dw\).
In this formula we can think of \(v_{dw}\) as velocity, \(dw\) as acceleration, and \(\beta\) as friction.
Similarly, we compute \(v_{db}\).
Then we update the weights as follows: \[ W := W - \alpha v_{dW}, \; b := b - \alpha v_{db} \] This smooths out the GD steps, resulting in much smaller oscillations in the vertical direction while still making progress horizontally.
Implementation details
On iteration \(t\):
Note
We have \(\alpha\) and \(\beta\) as hyperparameters.
Sometimes \(1-\beta\) is omitted.
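A minimal sketch of one momentum step, keeping the \((1-\beta)\) factor (the velocities are assumed to be initialized to zero elsewhere):

```python
def momentum_update(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum: the velocities are
    exponentially weighted averages of the gradients."""
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db
```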
Important notes
Definition: RMSprop is an acronym for "Root Mean Square prop".
Suppose the vertical oscillations correspond to parameter \(b\) and the horizontal direction to \(w\). \[\begin{align*} &s_{dW} = \beta s_{dW} + (1-\beta)dW^2, \; s_{db} = \beta s_{db} + (1-\beta)db^2, \\ &W := W - \alpha \dfrac{dW}{\sqrt{s_{dW}}+\epsilon}, \; b := b - \alpha \dfrac{db}{\sqrt{s_{db}}+\epsilon} \end{align*}\]
Intuition
We will denote this parameter \(\beta_2\), to distinguish it from the momentum parameter \(\beta_1\).
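A minimal sketch of one RMSprop step (the default hyperparameter values are typical choices, not prescribed by the source):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, s_dW, s_db, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSprop step: divide each gradient by the root of its running mean square,
    which damps the directions that oscillate the most."""
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)
    s_db = beta2 * s_db + (1 - beta2) * np.square(db)
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
```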
Description
Works across a wide range of DL architectures. It’s a combination of GD with momentum and RMSprop.
Adam stands for ADAptive Moment estimation.
Computation
\[\begin{align*} & v_{dW}=0,\, s_{dW}=0 \\ & v_{db}=0,\, s_{db}=0 \\ & \text{on iteration } t \text{ compute } dW, db \\ & v_{dW} = \beta_1 v_{dW} + (1-\beta_1)dW,\\ & v_{db} = \beta_1 v_{db} + (1-\beta_1)db,\\ & s_{dW} = \beta_2 s_{dW} + (1-\beta_2)dW^2,\\ & s_{db} = \beta_2 s_{db} + (1-\beta_2)db^2 \end{align*}\]
Computation
We also implement bias correction: \[\begin{align*} &V_{dW}^{corrected} = \dfrac{V_{dW}}{1-\beta_1^t},\\ &V_{db}^{corrected} = \dfrac{V_{db}}{1-\beta_1^t},\\ &S_{dW}^{corrected} = \dfrac{S_{dW}}{1-\beta_2^t},\\ &S_{db}^{corrected} = \dfrac{S_{db}}{1-\beta_2^t},\\ \end{align*}\]
Computation
And the update steps: \[\begin{align*} &W := W - \alpha \dfrac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \epsilon},\\ &b := b - \alpha \dfrac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}} + \epsilon} \end{align*}\]
Choice of hyperparameters
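A minimal sketch of one Adam step for a single parameter tensor, using the commonly cited defaults \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\) (\(\alpha\) still needs tuning; the same update is applied to \(b\)):

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum term plus RMSprop term, both bias-corrected."""
    v = beta1 * v + (1 - beta1) * dW                 # momentum (first moment)
    s = beta2 * s + (1 - beta2) * np.square(dW)      # RMSprop (second moment)
    v_corrected = v / (1 - beta1 ** t)               # bias correction
    s_corrected = s / (1 - beta2 ** t)
    W = W - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return W, v, s
```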
Clarification
\[ \alpha = \dfrac{1}{1+decayRate\times epochNumber}\alpha_0 \]
We can speed up the learning algorithm by slowly reducing the learning rate over time.
Suppose we have small mini-batches.
decay_rate becomes another hyperparameter.
Another option is exponential decay: \(\alpha = 0.95^{epochNumber} \alpha_0\).
Some other options:
Manual decay: works if we train on a small number of models.
Fixed interval scheduling
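A sketch collecting these schedules in one place (the step schedule's interval of 10 epochs is an arbitrary illustration):

```python
def decayed_learning_rate(alpha0, epoch, decay_rate=1.0, schedule="inverse"):
    """A few common learning rate decay schedules as a function of the epoch number."""
    if schedule == "inverse":          # alpha = alpha0 / (1 + decay_rate * epoch)
        return alpha0 / (1 + decay_rate * epoch)
    if schedule == "exponential":      # alpha = 0.95^epoch * alpha0
        return (0.95 ** epoch) * alpha0
    if schedule == "step":             # halve alpha on a fixed interval, e.g. every 10 epochs
        return alpha0 * (0.5 ** (epoch // 10))
    raise ValueError(f"unknown schedule: {schedule}")
```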

In high-dimensional spaces we are much less likely to encounter local optima; instead we mostly encounter saddle points.
Our understanding of these spaces is still evolving.
Plateau: a region where the derivative stays close to zero for a long time, which slows learning down.