LSTMs
LSTMs to Model Physiological Time Series
Harini Suresh, Nicholas Locascio, MIT
Neural Networks
Output of each node (perceptron), in matrix notation: $f(Wx + b)$, i.e. a non-linear function $f$ applied to a weighted sum of the inputs.
These non-linear activations are what allow the network to draw non-linear decision boundaries.
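A minimal numpy sketch of this computation, with made-up layer sizes and a tanh activation chosen only for illustration (none of these values come from the lecture):

```python
import numpy as np

# A minimal sketch of one layer of perceptrons in matrix notation:
# apply a non-linear activation f to a weighted sum, f(Wx + b).
def layer_output(W, b, x):
    z = W @ x + b        # weighted sum for every node at once
    return np.tanh(z)    # non-linear activation; tanh chosen for illustration

# Hypothetical sizes: 3 inputs feeding a layer of 4 nodes.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = np.zeros(4)
x = np.array([1.0, -2.0, 0.5])
print(layer_output(W, b, x))   # 4 non-linear node outputs
```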
Stochastic Gradient Descent (SGD)
- Initialize the weights randomly
- For N epochs:
  - For each training example $(x, y)$:
    - Compute the loss gradient $\frac{\partial J}{\partial W}$
    - Update the weights: $W \leftarrow W - \eta \frac{\partial J}{\partial W}$ (see the sketch below)
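A minimal sketch of this training loop for a simple linear model with squared loss; the toy dataset, learning rate `eta`, and epoch count are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# SGD sketch for a linear model y = x . w with squared loss.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(100)

w = rng.standard_normal(3)                    # initialize the weights randomly
eta, n_epochs = 0.01, 50
for epoch in range(n_epochs):                 # for N epochs
    for x_i, y_i in zip(X, y):                # for each training example (x, y)
        grad = 2 * (x_i @ w - y_i) * x_i      # loss gradient dJ/dw
        w -= eta * grad                       # step opposite the gradient
print(w)                                      # should approach true_w
```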
Example: using back-propagation and the chain rule to calculate how the loss $J$ changes with respect to a specific weight, for a two-layer network with hidden activations $s_1$ and output $\hat{y}$:
With respect to $W_2$: $\frac{\partial J}{\partial W_2} = \frac{\partial J}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial W_2}$
With respect to $W_1$: $\frac{\partial J}{\partial W_1} = \frac{\partial J}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial s_1}\,\frac{\partial s_1}{\partial W_1}$
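A numpy sketch of these two chain-rule computations, assuming a small two-layer network with $s_1 = \tanh(W_1 x)$, $\hat{y} = W_2 s_1$, and squared loss; all sizes and data below are made up:

```python
import numpy as np

# Manual backprop through a tiny two-layer network.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), 1.0
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((1, 4))

s1 = np.tanh(W1 @ x)                           # hidden layer activations
y_hat = W2 @ s1                                # network output
J = (y_hat - y) ** 2                           # squared loss

dJ_dyhat = 2 * (y_hat - y)                     # dJ/dy_hat
dJ_dW2 = np.outer(dJ_dyhat, s1)                # dJ/dW2 = dJ/dy_hat * dy_hat/dW2
dJ_ds1 = W2.T @ dJ_dyhat                       # dJ/dy_hat * dy_hat/ds1
dJ_dW1 = np.outer(dJ_ds1 * (1 - s1 ** 2), x)   # ... * ds1/dW1 (through tanh)
```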
Recurrent Neural Networks (RNNs)
- Remembers previous state. Each hidden unit produces a state $s_t$ that is:
  - a function of the current input $x_t$
  - a function of its own previous state/output $s_{t-1}$
  i.e. $s_t = f(Ux_t + Ws_{t-1})$ (see the sketch after this list)
- Weights $W$ and $U$ stay the same across the sequence, so the model does not need to relearn a rule just because it appears later in the sequence
- $s_n$ can contain information from all past time steps
- Benefits of Recurrent Neural Networks (over vanilla feed-forward networks):
  - Maintains sequence order
  - Shares parameters across the sequence, so rules don't need to be relearned
  - Keeps track of long-term dependencies
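A minimal numpy sketch of the recurrence $s_t = \tanh(Ux_t + Ws_{t-1})$, with the same $U$ and $W$ reused at every time step; the dimensions and random inputs are placeholders:

```python
import numpy as np

# Unroll a vanilla RNN over a length-T sequence.
rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 5, 10
U = rng.standard_normal((hidden_dim, input_dim))   # input-to-hidden weights
W = rng.standard_normal((hidden_dim, hidden_dim))  # hidden-to-hidden weights

xs = rng.standard_normal((T, input_dim))           # a length-T input sequence
s = np.zeros(hidden_dim)                           # initial state
states = []
for x_t in xs:
    s = np.tanh(U @ x_t + W @ s)                   # depends on input and previous state
    states.append(s)
```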
Possible Task - Language Model
- In addition to producing a state at each time step, the model produces an output by multiplying the state by another set of weights, $V$
- The output is a probability distribution over the possible next words, given what the network has seen so far
- The loss function for training measures how well each predicted distribution matches the actual next word in the training set (see the sketch below)
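A small sketch of the output step at one time position, assuming a tiny made-up vocabulary; the cross-entropy at the end is one common way to measure the similarity mentioned above:

```python
import numpy as np

# Output step of an RNN language model at one time position.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 8, 5
V = rng.standard_normal((vocab_size, hidden_dim))  # state-to-output weights
s_t = rng.standard_normal(hidden_dim)              # state after the words seen so far

probs = softmax(V @ s_t)                           # distribution over possible next words
next_word = 3                                      # index of the true next word (illustrative)
loss = -np.log(probs[next_word])                   # cross-entropy against the training data
```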
Possible Task - Sentiment Classification
- Output is a probability distribution over the possible sentiment classes (positive / neutral / negative)
- The network builds up a representation of the entire sequence, and the prediction is made from the cell state at the last time step (see the sketch below)
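A sketch of this setup, assuming random placeholder weights and an already-embedded input sequence; only the final state is used for the class prediction:

```python
import numpy as np

# Run the RNN over the whole sequence, then classify from the last state only.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
U = rng.standard_normal((hidden_dim, input_dim))
W = rng.standard_normal((hidden_dim, hidden_dim))
V = rng.standard_normal((3, hidden_dim))      # 3 classes: +, 0, -

xs = rng.standard_normal((12, input_dim))     # embedded input sequence
s = np.zeros(hidden_dim)
for x_t in xs:                                # build a representation of the sequence
    s = np.tanh(U @ x_t + W @ s)

class_probs = softmax(V @ s)                  # prediction from the final state
print(class_probs)
```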
Possible Task - Machine Translation
- Made up of two RNNs:
  - The encoder reads the source sentence and feeds its last cell state (the encoded meaning of the sentence) into the second network
  - The decoder produces the output sentence in the target language (see the sketch after this list)
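A rough sketch of the two-RNN idea: the encoder's final state initializes the decoder. Every weight matrix, dimension, and "word" below is a random placeholder rather than the lecture's actual model:

```python
import numpy as np

# Encoder folds the source sentence into its last state; that state seeds
# a decoder RNN that emits scores over a (tiny, made-up) target vocabulary.
rng = np.random.default_rng(0)
d, target_vocab = 6, 10
U_enc, W_enc = rng.standard_normal((d, d)), rng.standard_normal((d, d))
U_dec, W_dec = rng.standard_normal((d, d)), rng.standard_normal((d, d))
V_dec = rng.standard_normal((target_vocab, d))

source = rng.standard_normal((5, d))          # already-embedded source sentence

# Encoder: last state = encoded meaning of the whole source sentence.
s = np.zeros(d)
for x_t in source:
    s = np.tanh(U_enc @ x_t + W_enc @ s)

# Decoder: start from the encoded state and produce target-word scores.
prev_word = np.zeros(d)                       # embedding of a start token (placeholder)
for _ in range(4):                            # emit 4 target positions
    s = np.tanh(U_dec @ prev_word + W_dec @ s)
    scores = V_dec @ s                        # unnormalized scores over the target vocab
    prev_word = rng.standard_normal(d)        # placeholder for the chosen word's embedding
```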
Training RNNs - Backpropagation Through Time
- Take the derivative (gradient) of the loss with respect to each parameter
- Shift the parameters in the opposite direction of the gradient to minimize the loss
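For reference, the summation discussed below is the standard BPTT expansion of the gradient of the error $E_2$ at time step $t=2$ with respect to the recurrent weights $W$ (written in the $s$/$\hat{y}$ notation used above):

$$\frac{\partial E_2}{\partial W} = \sum_{k=0}^{2} \frac{\partial E_2}{\partial \hat{y}_2}\,\frac{\partial \hat{y}_2}{\partial s_2}\,\frac{\partial s_2}{\partial s_k}\,\frac{\partial s_k}{\partial W}$$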
At $k=0$, the third term in the summation, $\frac{\partial s_2}{\partial s_k}$, expands by the chain rule into $\frac{\partial s_2}{\partial s_0} = \frac{\partial s_2}{\partial s_1}\,\frac{\partial s_1}{\partial s_0}$
The last two terms of each summand, $\frac{\partial s_t}{\partial s_k}\,\frac{\partial s_k}{\partial W}$, are the contributions of $W$ at previous time steps to the error at time step $t$
RNN Vanishing Gradient Problem
- $W$ is typically initialized from a standard normal distribution, so most of its entries have magnitude less than 1
- $f$ is $\tanh$ or sigmoid, so $f' \le 1$ (for sigmoid, $f' \le 0.25$)
- As the gap between time steps gets bigger, we multiply more and more of these small numbers together
- Errors from time steps further back therefore have increasingly small gradients, since their contribution to the loss must pass through a long chain of such terms. As a result, the parameters become biased toward capturing shorter-term dependencies (see the numeric sketch below)
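A quick numeric sketch of this shrinkage: each factor $\frac{\partial s_t}{\partial s_{t-1}} = \tanh'(\cdot)\,w$ is small, so the product over a long gap collapses toward zero (scalar case, random values chosen only for illustration):

```python
import numpy as np

# Multiply many per-step gradient factors and watch the product vanish.
rng = np.random.default_rng(0)
w = rng.standard_normal()                 # a single recurrent weight, typically |w| < 1
s, grad_product = 0.0, 1.0
for t in range(1, 31):
    pre = w * s + rng.standard_normal()   # pre-activation with a random input
    s = np.tanh(pre)
    grad_product *= (1 - s ** 2) * w      # ds_t/ds_{t-1} = tanh'(pre) * w
    if t in (5, 15, 30):
        print(f"gap {t:2d}: |ds_t/ds_0| ~ {abs(grad_product):.2e}")  # shrinks rapidly
```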
Solution 1 - Activation Functions
- The derivative of ReLU is exactly 1 for any positive input (unlike $\tanh$ or sigmoid, whose derivatives are less than 1), so these terms do not contribute to shrinking the gradient product (see the comparison below)
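A quick comparison of the derivative magnitudes involved (standard formulas; the input values are chosen only for illustration):

```python
import numpy as np

# Derivatives of tanh, sigmoid, and ReLU over a range of inputs.
z = np.linspace(-3, 3, 7)
tanh_grad = 1 - np.tanh(z) ** 2           # always <= 1, usually much smaller
sig = 1 / (1 + np.exp(-z))
sigmoid_grad = sig * (1 - sig)            # at most 0.25
relu_grad = (z > 0).astype(float)         # exactly 1 for every positive input
print(np.round(tanh_grad, 3))
print(np.round(sigmoid_grad, 3))
print(relu_grad)
```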