Lviv University
When?
Approximately between 200,000 years ago and 60,000 years ago (between the appearance of the first anatomically modern humans in southern Africa and the last exodus from Africa respectively)
How many?
Number of living human languages: 7,097 (as of 2018).
Speaker (writer)
Listener (reader)
A medium for communication rather than pure representation.
Plato vs Sophists
Sophists argued that physical reality can only be experienced through language.
Language impacts cognition, an idea also known as linguistic relativity.
Predecessors
The Australian aboriginal language Guugu Yimithirr has no words for relative (or egocentric) directions, such as front, back, right, or left. Instead, its speakers use absolute directions, saying, for example, the equivalent of “I have a pain in my north arm.”
(Russell and Norvig, “Artificial Intelligence: A Modern Approach”)
Predecessors
Wilhelm von Humboldt: language as a spirit of a nation.
Wanner experiment (1974)
Subjects remember exact words with 50% accuracy but remember the content with 90% accuracy.
Hence, there must be an internal nonverbal representation.
Polanyi’s paradox (1966)
The theory that human knowledge of how the world functions, and of our own capabilities, is to a large extent beyond our explicit understanding (also known as tacit knowledge).
Definition
A formal language \(L\) over an alphabet \(\Sigma\) is a subset of \(\Sigma^*\), that is, a set of words over that alphabet.
Definition
We define a language model as a probability distribution describing the likelihood of any string.
ELIZA - an example of primitive pattern matching.
Regular expressions - a tool for describing text patterns.
Definition
An algebraic notation for characterizing a set of strings
Definition
A Kleene algebra is a set \(A\) together with two binary operations \(+: A \times A \rightarrow A\) and \(\cdot : A \times A \rightarrow A\) and one function \(\ast : A \rightarrow A\), written as \(a + b\), \(ab\) and \(a\ast\) respectively, so that the following axioms are satisfied.
Associativity of \(+\) and \(\cdot\): \(a + (b + c) = (a + b) + c\) and \(a(bc) = (ab)c\) \(\forall a, b, c \in A\).
Commutativity of \(+\): \(a + b = b + a\) \(\forall a, b \in A\)
Distributivity: \(a(b + c) = (ab) + (ac)\) and \((b + c)a = (ba) + (ca)\) \(\forall a, b, c \in A\)
Identity elements for \(+\) and \(\cdot\): \(\exists 0 \in A:\forall a \in A: a + 0 = 0 + a = a\); \(\exists 1 \in A: \forall a \in A: a1 = 1a = a\).
Annihilation by 0: \(a0 = 0a = 0 \quad \forall a \in A\). The above axioms define a semiring.
\(+\) is idempotent: \(a + a = a \quad \forall a \in A\).
Ordering
It is now possible to define a partial order \(\leq\) on \(A\) by setting \(a \leq b\) if and only if \(a + b = b\) (or equivalently: \(a \leq b\) if and only if \(\exists x \in A: a + x = b\)).
With either definition, \(a \leq b \leq a \Rightarrow a = b\). With this order we can formulate the last four axioms about the operation \(\ast\):
Intuitively, one should think of a + b as the “union” or the “least upper bound” of a and b and of ab as some multiplication which is monotonic, in the sense that \(a \leq b \Rightarrow ax \leq bx\).
The idea behind the star operator is \(a\ast = 1 + a + aa + aaa + ...\). From the standpoint of programming language theory, one may also interpret + as “choice”, \(\cdot\) as “sequencing” and \(\ast\) as “iteration”.
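As a concrete instance, the set of all languages over an alphabet \(\Sigma\) (i.e. subsets of \(\Sigma^*\)) forms a Kleene algebra:
\[ L_1 + L_2 = L_1 \cup L_2, \qquad L_1 \cdot L_2 = \{uv \mid u \in L_1,\; v \in L_2\}, \qquad L^{\ast} = \bigcup_{n \geq 0} L^n, \]
with \(0 = \emptyset\) and \(1 = \{\varepsilon\}\) as the identity elements. This is exactly the interpretation underlying regular expressions.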
Concatenation
A sequence of characters
/someword/
| Regex | Example |
|---|---|
| /alt/ | The alternative option would be… |
| /simple/ | A simple regex |
Disjunction
A single character to choose among multiple options
/[asd]/
| Regex | Example |
|---|---|
| /[0123456789]/ | Some number examples are 0, 3, 5 |
| /[rgx]/ | A simple regex |
Disjunction with range
A single character to choose among multiple options
/[a-z]/
| Regex | Example |
|---|---|
| /[a-z]/ | Phone number: 067-1234567 |
Kleene *
Zero or more occurrences of the previous character or regex.
| Regex | Example |
|---|---|
| /[a-z]*/ | Phone number: 067-1234567 |
Kleene +
One or more occurrences of the previous character or regex.
| Regex | Example |
|---|---|
| /[a-z]+/ | Phone number: 067-1234567 |
Wildcard dot
Any character except newline
| Regex | Example |
|---|---|
| /.+/ | Phone number: 067-1234567 |
| /beg.n/ | begin, began, begun |
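The patterns from the tables above can be tried directly with Python's `re` module (a small sketch; the example strings come from the tables):

```python
import re

# Kleene +: one or more lowercase letters (the capital P is excluded)
print(re.findall(r'[a-z]+', 'Phone number: 067-1234567'))
# ['hone', 'number']

# Wildcard dot: /beg.n/ matches begin, began, begun
for word in ['begin', 'began', 'begun']:
    assert re.search(r'beg.n', word)

# Character-class disjunction: any single digit
print(re.findall(r'[0123456789]', 'Some number examples are 0, 3, 5'))
# ['0', '3', '5']
```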
Anchors
Matching beginning and end of lines
| Regex | Example |
|---|---|
| /^/ | Start of line |
| /$/ | End of line |
| /\b/ | Word boundary |
| /\B/ | Non-word boundary |
Disjunction operator
Either/or for strings
| Regex | Example |
|---|---|
| /cat\|dog/ | Raining cats and dogs |
Grouping
Used for precedence
| Regex | Example |
|---|---|
| /(cat\|dog)*/ | Raining cats and dogs |
Substitutions
Used to replace matched text
| Regex | Before | After |
|---|---|---|
| s/(cat\|dog)*/frog/ | Raining cats and dogs | Raining frogs and frogs |
| s/behaviour/behavior/ | Cognitive-behavioural therapy | Cognitive-behavioral therapy |
ELIZA sample
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They’re always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I’m depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED
ELIZA sample
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
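These rules can be sketched with Python's `re.sub`; the rule list and the `eliza_reply` helper below are illustrative, not Weizenbaum's original implementation:

```python
import re

# Ordered (pattern, response) pairs, mirroring the s/.../.../ rules above
RULES = [
    (r'.* YOU ARE (depressed|sad) .*', r'I AM SORRY TO HEAR YOU ARE \1'),
    (r'.* all .*', 'IN WHAT WAY'),
    (r'.* always .*', 'CAN YOU THINK OF A SPECIFIC EXAMPLE'),
]

def eliza_reply(utterance: str) -> str:
    """Apply the first matching rule; fall back to a generic prompt."""
    for pattern, response in RULES:
        if re.fullmatch(pattern, utterance, flags=re.IGNORECASE):
            return re.sub(pattern, response, utterance, flags=re.IGNORECASE)
    return 'PLEASE GO ON'

print(eliza_reply('Men are all alike.'))  # IN WHAT WAY
```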
Corpus
A computer-readable collection of text or speech.
Punctuation
Marks indicating how a piece of written text should be read and understood.
Utterance
Spoken correlate of a sentence.
Disfluency
Break or disruption that occurs in the flow of speech.
Two types: fillers or filled pauses (uh, um) and fragments (words broken off mid-utterance).
Word types
The number of distinct words in a corpus; if the set of words in the vocabulary is \(V\), the number of types is the vocabulary size \(|V|\).
Word instances
The total number \(N\) of running words (sometimes also called word tokens).
Example
To be, or not to be, that is the question.
| Corpus | Types = \(|V|\) | Instances = \(N\) |
|---|---|---|
| Shakespeare | 31 thousand | 884 thousand |
| Brown corpus | 38 thousand | 1 million |
| Switchboard telephone conversations | 20 thousand | 2.4 million |
| COCA | 2 million | 440 million |
| Google n-grams | 13 million | 1 trillion |
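For the Shakespeare line above, types and instances can be counted in a few lines of Python (a sketch that folds case and strips punctuation first):

```python
text = 'To be, or not to be, that is the question.'

# Naive tokenization: lowercase, drop punctuation, split on whitespace
tokens = text.lower().replace(',', '').replace('.', '').split()

N = len(tokens)        # instances: running words
V = set(tokens)        # vocabulary
print(N, len(V))       # 10 instances, 8 types
```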
Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978)
\[ |V|= kN^{\beta}. \] Here \(k\) and \(\beta\) are positive constants, and \(0 <\beta <1\).
Zipf’s law
If \(t_1\) is the most common term in the collection, \(t_2\) is the next most common, and so on, then the collection frequency \(cf_i\) of the \(i\)-th most common term is proportional to \(\frac{1}{i}\): \[ cf_i \propto \frac{1}{i}, \] or \[ cf_i = c i^{k}, \] where \(c\) is a constant and \(k \approx -1\).
Wordforms
Example: cat vs cats. We say these two words are different wordforms but have the same lemma.
Lemma
A lemma is a set of lexical forms having the same stem, and usually the same major part-of-speech.
Example
The wordform is the full inflected or derived form of the word. The two wordforms cat and cats thus have the same lemma, which we can represent as cat.
Morphemes
The smallest meaning-bearing unit of a language.
Examples
Indistinguishable \(\rightarrow\) [in, distinguish, able]
Affix taxonomy
Affixes are classified in two ways:
By position: they can be prefixes or suffixes, e.g. unkind (prefix un-), kingship (suffix -ship).
By function: derivational (deriving new words) or inflectional (marking grammatical features).
Varieties depending on:
Text normalization consists of: tokenizing (segmenting) words, normalizing word formats, and segmenting sentences.
Two types:
I'm, l'homme.
Penn Treebank tokenization standard
Used for the parsed corpora (treebanks) released by the Linguistic Data Consortium (LDC).
nltk.regexp_tokenize
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)           # set flag to allow verbose regexps
...     (?:[A-Z]\.)+             # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*             # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?       # currency, percentages, e.g. $12.40, 82%
...   | \.\.\.                   # ellipsis
...   | [][.,;"'?():_`-]         # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Word tokenization is more complex in languages like written Chinese, Japanese, and Thai, which do not use spaces to mark potential word boundaries.
Morphemes
In Chinese, for example, words are composed of characters (called hanzi). Each character generally represents a single unit of meaning (a morpheme).
Word segmentation
For Japanese and Thai the character is too small a unit, and so algorithms for word segmentation are required.
Definition
We use the data to infer the tokens. We call these tokens subwords.
Parts
Examples
Token Learner
1. Start with a vocabulary that is just the set of all individual characters.
2. Examine the training corpus and choose the two symbols that are most frequently adjacent.
2.1. For example, if (‘A’, ‘B’) frequently occur together, add a new merged symbol ‘AB’ to the vocabulary.
2.2. Replace every adjacent ‘A’ ‘B’ in the corpus with the new ‘AB’.
3. Continue counting and merging, creating longer and longer character strings, until \(k\) merges have been done, creating \(k\) novel tokens; \(k\) is thus a parameter of the algorithm.
4. The resulting vocabulary consists of the original set of characters plus \(k\) new symbols.
Note
The algorithm is usually run inside words (not merging across word boundaries), so the input corpus is first white-space-separated to give a set of strings, each corresponding to the characters of a word plus a special end-of-word symbol, and its counts.
function BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab \(V\)
\(V \leftarrow\) unique characters in C # initial set of tokens is characters
for i = 1 to k do # merge tokens k times
\(\quad\) \(t_L, t_R \leftarrow\) most frequent pair of adjacent tokens in \(C\)
\(\quad\) \(t_{NEW} \leftarrow t_L + t_R\) # make new token by concatenating
\(\quad\) \(V \leftarrow V + t_{NEW}\) # update the vocabulary
\(\quad\) Replace each occurrence of \(t_L\), \(t_R\) in \(C\) with \(t_{NEW}\). # update the corpus
return \(V\)
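The pseudocode above can be sketched in Python. Following the note, the corpus is represented as a map from words (tuples of symbols) to counts; the function name and representation are illustrative:

```python
from collections import Counter

def byte_pair_encoding(corpus: dict, k: int) -> list:
    """Learn k BPE merges; corpus maps word (tuple of symbols) -> count."""
    vocab = {ch for word in corpus for ch in word}
    for _ in range(k):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, count in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        (t_l, t_r), _ = pairs.most_common(1)[0]
        t_new = t_l + t_r                 # new token by concatenation
        vocab.add(t_new)
        # Replace every adjacent t_l, t_r in the corpus with t_new
        new_corpus = {}
        for word, count in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == t_l and word[i + 1] == t_r:
                    out.append(t_new)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + count
        corpus = new_corpus
    return sorted(vocab)
```

For example, with `{('a','b','c'): 3, ('b','c','d'): 1}` and one merge, the pair ('b', 'c') is the most frequent (count 4), so 'bc' is added to the vocabulary.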
Token Segmenter:
Simplest method: case folding.
Note
Not very useful for text classification: compare US (the country) and us (pronoun).
Lemmatization
The task of determining that two words have the same root, despite their surface differences.
am, are, is \(\rightarrow\) be
Performed using morphological parsing.
Morphological parsing
Splitting each word into morphemes of two types: stems (the central morpheme, carrying the main meaning) and affixes (adding additional meanings).
Naive version: stemming
This means just dropping affixes.
Porter stemmer
Example
collection \(\rightarrow\) CVCV…C
illustrate \(\rightarrow\) VCVC…V
Both can be presented as:
\[ [C](VC)^m[V] \]
\(m\) is called the measure of the word.
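The measure can be computed by collapsing a word into runs of consonants (C) and vowels (V) and counting the VC pairs. A minimal sketch, treating only a, e, i, o, u as vowels and ignoring Porter's special handling of y:

```python
def measure(word: str) -> int:
    """Count m in the [C](VC)^m[V] decomposition of a word."""
    vowels = set('aeiou')
    # Collapse the word into a run-length form such as 'CVCVC'
    runs = []
    for ch in word.lower():
        kind = 'V' if ch in vowels else 'C'
        if not runs or runs[-1] != kind:
            runs.append(kind)
    # m is the number of 'VC' transitions
    return ''.join(runs).count('VC')

print(measure('tree'))        # 0  (CV)
print(measure('trouble'))     # 1  (CVCV)
print(measure('collection'))  # 3  (CVCVCVC)
```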
Rules
Standard form: \[ (\textbf{condition})\textbf{S}_1 \rightarrow \textbf{S}_2 \] There are five phases of rule application
How to read
If a word ends with the suffix \(S_1\)
AND
the stem before \(S_1\) satisfies the given condition
THEN \(S_1\) is replaced by \(S_2\).
Conditions
The condition part may also contain expressions with and, or, and not.
Example
The rule \((m > 1)\ EMENT \rightarrow \epsilon\) (i.e. the suffix is deleted) will perform this transformation:
replacement \(\rightarrow\) replac
Other stemmers:
Lovins stemmer
The first published stemming algorithm; essentially a heavily parametrized find-and-replace function.
Snowball stemmer
An updated version of the Porter stemmer. It differs from Porter in two main ways:
Lancaster stemmer (also called Paice stemmer)
The most aggressive English stemming algorithm.
Stemming errors
Errors of commission (e.g. policy stemmed to police) and errors of omission (e.g. European not reduced to Europe).
Challenges
Definition
Minimum edit distance between two strings is defined as the minimum number of editing operations (insertion, deletion, substitution)
needed to transform one string into the other.
String alignment

Levenshtein distance
Each of the 3 operations has cost 1.
Alternatively, we can forbid substitutions (this is equivalent to saying that substitutions have cost 2).
Wagner-Fischer minimum edit distance algorithm.
Notation
Calculation
\[ D[i,j] = \min \begin{cases} D[i-1,j] + \text{del\_cost}(source[i]),\\ D[i,j-1] + \text{ins\_cost}(target[j]),\\ D[i-1,j-1] + \text{sub\_cost}(source[i], target[j]). \end{cases} \]
Calculation without substitution
\[ D[i,j] = \min \begin{cases} D[i-1,j] + 1,\\ D[i,j-1] + 1,\\ D[i-1,j-1] + \begin{cases} 2 & \text{if } source[i] \neq target[j], \\ 0 & \text{if } source[i] = target[j]. \end{cases} \end{cases} \]
Wagner-Fischer algorithm
function MIN-EDIT-DISTANCE(source, target) returns min-distance
\(n \leftarrow LENGTH(source)\)
\(m \leftarrow LENGTH(target)\)
Create a distance matrix \(D[n+1,m+1]\)
# Initialization: the zeroth row and column is the distance from the empty string
D[0,0] = 0
for each row i from 1 to n do
\(\quad\) \(D[i,0] \leftarrow D[i-1,0]\) + del_cost(source[i])
for each column j from 1 to m do
\(\quad\) \(D[0,j] \leftarrow D[0, j-1]\) + ins_cost(target[j])
Wagner-Fischer algorithm
# Recurrence relation:
for each row i from 1 to n do
\(\quad\) for each column j from 1 to m do
\(\quad\) \(\quad\) \(D[i, j] \leftarrow MIN( D[i−1, j]\) + del_cost(source[i]),
\(\quad\)\(\quad\) \(\quad\) \(D[i−1, j−1] + sub\_cost(source[i], target[j])\),
\(\quad\)\(\quad\) \(\quad\) \(D[i, j−1] + ins\_cost(target[j]))\)
# Termination
return D[n,m]
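A direct Python translation of the pseudocode (a sketch; the `sub_cost` parameter switches between unit-cost substitutions and the cost-2 Levenshtein variant described earlier):

```python
def min_edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Wagner-Fischer dynamic programming; insertion/deletion cost 1."""
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                           # i deletions from source
    for j in range(1, m + 1):
        D[0][j] = j                           # j insertions into source
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[n][m]

print(min_edit_distance('intention', 'execution'))              # 5
print(min_edit_distance('intention', 'execution', sub_cost=2))  # 8
```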
Hamming distance
A number of positions at which the corresponding symbols are different.
Therefore, identical to Levenshtein with only substitution allowed.
Can only work for strings of equal length.
def hamming_distance(string1: str, string2: str) -> int:
    """Return the Hamming distance between two strings."""
    if len(string1) != len(string2):
        raise ValueError("Strings must be of equal length.")
    dist_counter = 0
    for n in range(len(string1)):
        if string1[n] != string2[n]:
            dist_counter += 1
    return dist_counter