Comparison to image processing
For a 10-word sentence using 100-dimensional embeddings, we would have a 10×100 matrix as our input - this would be the “image”.
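As a quick sketch of that shape (the random values below are placeholders for real embeddings):

import numpy as np

# A hypothetical 10-word sentence with 100-dimensional word embeddings:
# stacking the word vectors row by row gives the 10 x 100 "image"
sentence_length, embedding_dim = 10, 100
sentence_image = np.random.rand(sentence_length, embedding_dim)
print(sentence_image.shape)   # (10, 100)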
Filters
Description
Intuitions broken?
Benefits
A big argument for CNNs is that they are fast. Very fast: every window position can be processed independently, so convolutions parallelize well on GPUs, unlike sequential recurrent models.
Classification tasks
Good idea: CNNs work well here, since the label rarely depends on exact word positions.
Order of words lost
Bad idea (unless you do it right): pooling discards word order, which hurts order-sensitive tasks.
Comparison with transformers
CNN benefits
The main advantage of CNNs over previous NLP algorithms is that they can recognize informative patterns wherever they appear in the text, and they compute all of those pattern matches in parallel.
Comparison to other NLP methods
Stencil example

Considerations
What does a CNN do with a kernel?
1. Multiply the kernel weights elementwise with the current window of the input.
2. Sum the products into a single output value.
3. Slide the kernel to the next window and repeat.
Convolution is steps 1 and 2; a one-window sketch follows below.
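A minimal one-window example of those first two steps (the numbers are arbitrary placeholders):

import numpy as np

window = np.array([1.0, 2.0, 3.0])    # one slice of the input
kernel = np.array([1.0, 0.0, -1.0])   # the kernel weights
feature = (window * kernel).sum()     # step 1: elementwise multiply; step 2: sum
print(feature)                        # -2.0  (1*1 + 2*0 + 3*(-1))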
Manual kernel
Let’s start with a manual kernel.
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import pandas as pd

# NLTK data needed: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('universal_tagset')
tags = 'ADV ADJ VERB NOUN'.split()
quote = 'The right word may be effective, but no word was ever as effective as a rightly timed pause.'

# Tag each token with its universal part-of-speech tag
tokens = pos_tag(word_tokenize(quote), tagset='universal')

# One-hot encode the four tags of interest for every token, then transpose so
# tokens run across the columns and tags down the rows
tagged_words = [[word] + [int(tag == t) for t in tags] for word, tag in tokens]
df = pd.DataFrame(tagged_words, columns=['token'] + tags).T
print(df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
token The right word may be effective , but no word was ever as
ADV 0 0 0 0 0 0 0 0 0 0 0 1 1
ADJ 0 1 0 0 0 1 0 0 0 0 0 0 0
VERB 0 0 0 1 1 0 0 0 0 0 1 0 0
NOUN 0 0 1 0 0 0 0 0 0 1 0 0 0
13 14 15 16 17 18 19
token effective as a rightly timed pause .
ADV 0 0 0 1 0 0 0
ADJ 1 0 0 0 0 0 0
VERB 0 0 0 0 1 0 0
NOUN 0 0 0 0 0 1 0
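The kernel itself is not shown above, so here is a minimal sketch that builds on the df just printed; the assumption is a 3-position-wide kernel with 1s on the ADV, VERB and NOUN rows, i.e. one that looks for the tag pattern of “rightly timed pause”:

import numpy as np

# Hand-built kernel over the 4 tag rows (ADV, ADJ, VERB, NOUN) and 3 token positions
kernel = np.array([
    [1, 0, 0],   # ADV in the first position
    [0, 0, 0],   # ADJ nowhere
    [0, 1, 0],   # VERB in the second position
    [0, 0, 1],   # NOUN in the third position
])

x = df.iloc[1:].values.astype(int)   # the 4 x 20 one-hot tag matrix from above

# Slide the kernel across the sentence: elementwise multiply and sum at each position
y = [int((kernel * x[:, i:i + kernel.shape[1]]).sum())
     for i in range(x.shape[1] - kernel.shape[1] + 1)]
print(y)   # peaks at 3 where "rightly timed pause" lines up with the kernel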
The y values reach a maximum of 3 where the three 1s in the kernel line up perfectly with the three 1s forming the same pattern within the part-of-speech tags for the sentence.
nn.Embedding
The nn.Embedding layer is a simple lookup table: it maps an integer index to the corresponding row of a learnable weight matrix, i.e. to a dense vector of a fixed dimension.
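A minimal sketch (vocabulary size, embedding dimension and batch shape are arbitrary choices):

import torch
from torch import nn

# Lookup table: 10,000 token ids, each mapped to a 100-dimensional vector
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=100)

# A batch of 2 "sentences", each 10 token ids long
token_ids = torch.randint(0, 10_000, (2, 10))

vectors = embedding(token_ids)
print(vectors.shape)   # torch.Size([2, 10, 100])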
Training
Description
torch.nn.init.uniform_()
Initialization
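A minimal sketch of how torch.nn.init.uniform_() might be used to initialize embedding weights; the layer size and the (-0.5, 0.5) range are illustrative choices, not prescribed values:

import torch
from torch import nn

embedding = nn.Embedding(10_000, 100)

# Re-initialize the embedding weights in place from a uniform distribution
nn.init.uniform_(embedding.weight, a=-0.5, b=0.5)
print(embedding.weight.min().item(), embedding.weight.max().item())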
Pre-trained embeddings: advantages
Pre-trained embeddings example

Abstract
We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
The input layer is a sentence represented as concatenated word2vec word embeddings. It is followed by a convolutional layer with multiple filters, then a max-pooling layer, and finally a softmax classifier.
Let \(\boldsymbol{x}_i \in \textrm{R}^k\) be the \(k\)-dimensional word vector corresponding to the \(i\)-th word in a sentence.
A sentence of length \(n\) (padded when necessary) is represented as \[ \boldsymbol{x}_{1:n} = \boldsymbol{x}_1 \oplus \boldsymbol{x}_2 \oplus \dots \oplus \boldsymbol{x}_n, \] where \(\oplus\) is the concatenation operator.
In general, \(\boldsymbol{x}_{i:i+j}\) will refer to the concatenation of words \(\boldsymbol{x}_i, \boldsymbol{x}_{i+1}, \dots, \boldsymbol{x}_{i+j}\).
A convolution operation involves a filter \(\boldsymbol{w} \in \textrm{R}^{hk}\), which is applied to a window of \(h\) words to produce a feature.
For example, a feature \(c_i\) is generated from a window of words \(\boldsymbol{x}_{i:i+h-1}\) by \[ c_i = f(\boldsymbol{w} \cdot \boldsymbol{x}_{i:i+h-1} + b), \] where \(b \in \textrm{R}\) is a bias term and \(f\) is an activation function.
This filter is applied to each possible window of words in the sentence \[
\left\{\boldsymbol{x}_{1:h}, \boldsymbol{x}_{2:h+1}, \dots, \boldsymbol{x}_{n-h+1:n}\right\}
\] to produce a feature map: \[
\boldsymbol{c} = \left[c_1,c_2,\dots,c_{n-h+1}\right], \; \boldsymbol{c} \in \textrm{R}^{n-h+1}.
\] 
Pooling
Apply a max-over-time pooling operation (Collobert et al., 2011) over the feature map and take the maximum value \[ \hat{c} = \max\{\boldsymbol{c}\} \] as the feature corresponding to this particular filter.
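A minimal PyTorch sketch of the pipeline just described, from embedded tokens through convolution and max-over-time pooling to the classifier; the vocabulary size, embedding dimension, filter widths, filter count, and number of classes below are placeholder assumptions:

import torch
from torch import nn

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size=10_000, k=300, window_sizes=(3, 4, 5),
                 n_filters=100, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, k)
        # One Conv1d per window size h: each filter plays the role of w in R^{hk}
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=k, out_channels=n_filters, kernel_size=h)
            for h in window_sizes
        ])
        self.fc = nn.Linear(n_filters * len(window_sizes), n_classes)

    def forward(self, token_ids):                        # (batch, n)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, k, n)
        # c_i = f(w . x_{i:i+h-1} + b); each feature map has length n - h + 1
        maps = [torch.relu(conv(x)) for conv in self.convs]
        # max-over-time pooling: keep only the largest value of each feature map
        pooled = [m.max(dim=2).values for m in maps]
        return self.fc(torch.cat(pooled, dim=1))          # logits; softmax applied in the loss

model = SentenceCNN()
logits = model(torch.randint(0, 10_000, (8, 50)))   # batch of 8 sentences, 50 tokens each
print(logits.shape)                                  # torch.Size([8, 2])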
Hyperparameters
Dataset descriptions
Variants
Zhang, Zhao, and LeCun (2015)
Design
The main component is the temporal convolutional module. Suppose we have a discrete input function \[ g(x) \in [1,l]\rightarrow \mathrm{R}, \] and a discrete kernel function \[ f(x) \in [1,k]\rightarrow \mathrm{R}. \]
The convolution \(h(y) \in [1, \lfloor(l-k+1)/d\rfloor] \rightarrow \mathrm{R}\) between \(f(x)\) and \(g(x)\) with stride \(d\) is defined as \[ h(y) = \sum\limits_{x=1}^k f(x) \cdot g(y\cdot d - x + c), \] where \(c=k-d+1\) is an offset constant.
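A direct, loop-based sketch of this definition (translating the 1-indexed math to 0-indexed tensors; the test values are arbitrary):

import torch

def temporal_conv(g: torch.Tensor, f: torch.Tensor, d: int = 1) -> torch.Tensor:
    # h(y) = sum_{x=1}^{k} f(x) * g(y*d - x + c), with c = k - d + 1
    l, k = g.shape[0], f.shape[0]
    c = k - d + 1
    h = torch.zeros((l - k + 1) // d)
    for y in range(1, h.shape[0] + 1):
        for x in range(1, k + 1):
            h[y - 1] += f[x - 1] * g[y * d - x + c - 1]
    return h

g = torch.arange(10, dtype=torch.float32)   # l = 10
f = torch.tensor([1.0, 0.0, -1.0])          # k = 3
print(temporal_conv(g, f, d=1))             # each entry is g(y+2) - g(y) = 2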
Parametrization
The module is parameterized by a set of such kernel functions \(f_{ij}(x)\) (\(i=1,2,\dots,m\) and \(j=1,2,\dots,n\)), which we call weights, on a set of inputs \(g_i(x)\) and outputs \(h_j(y)\).
We call each \(g_i\) (or \(h_j\)) an input (or output) feature, and \(m\) (or \(n\)) the input (or output) feature size.
Each output \(h_j(y)\) is obtained by a sum over \(i\) of the convolutions between \(g_i(x)\) and \(f_{ij}(x)\).
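A sketch of this parameterization with small, arbitrary sizes (it simply inlines the temporal convolution defined above):

import torch

m, n, l, k, d = 4, 6, 32, 5, 1        # feature sizes, lengths and stride are assumptions
c = k - d + 1
g = torch.randn(m, l)                  # input features  g_i(x), i = 1..m
f = torch.randn(m, n, k)               # kernel weights  f_ij(x)
h = torch.zeros(n, (l - k + 1) // d)   # output features h_j(y), j = 1..n

# h_j(y) = sum over i of the temporal convolution between g_i and f_ij
for j in range(n):
    for i in range(m):
        for y in range(1, h.shape[1] + 1):
            for x in range(1, k + 1):
                h[j, y - 1] += f[i, j, x - 1] * g[i, y * d - x + c - 1]
print(h.shape)                         # torch.Size([6, 28])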
Temporal max-pooling
A 1-D version of max-pooling used in computer vision.
Given a discrete input function \(g(x) \in [1,l]\rightarrow \mathrm{R}\), the max-pooling function \(h(y) \in [1, \lfloor(l-k+1)/d\rfloor] \rightarrow \mathrm{R}\) of \(g(x)\) is defined as \[ h(y) = \max\limits_{x=1}^k g(y\cdot d - x + c). \]
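In PyTorch this corresponds to nn.MaxPool1d with kernel size \(k\) and stride \(d\); a tiny sketch with arbitrary sizes:

import torch
from torch import nn

l, k, d = 11, 3, 3
g = torch.arange(l, dtype=torch.float32).reshape(1, 1, l)  # (batch, feature, length)

# Max over each window of k values, moving with stride d
pool = nn.MaxPool1d(kernel_size=k, stride=d)
print(pool(g))   # tensor([[[2., 5., 8.]]]), i.e. floor((l-k+1)/d) = 3 outputs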
Parameters
Quantization
Alphabet size: 70
Model design
Number of features: 70. Input feature length: 1014.
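A minimal sketch of this character quantization; the paper uses a fixed 70-character alphabet and input length 1014, but the smaller placeholder alphabet below is only an assumption:

import numpy as np

# One-hot encode each character over a fixed alphabet, truncating the text to a
# fixed input length; characters outside the alphabet stay all-zero
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 .,;:!?'\"()-"
char_to_idx = {ch: i for i, ch in enumerate(alphabet)}

def quantize(text: str, length: int = 1014) -> np.ndarray:
    x = np.zeros((len(alphabet), length), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:length]):
        idx = char_to_idx.get(ch)
        if idx is not None:
            x[idx, pos] = 1.0
    return x

print(quantize("A rightly timed pause.").shape)   # (len(alphabet), 1014)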
Character-aware neural language models (2015)