nlp
useful tools
- eli5 has nice text highlighting for interp
nlp basics
- basics come from book “Speech and Language Processing”
- **language models** - assign probabilities to sequences of words
- ex. n-gram model - assigns probs to short sequences of words, known as n-grams (see the bigram sketch below)
- for a full sentence, use the markov assumption (condition only on the previous n-1 words)
- eval: perplexity (PP) - inverse probability of the test set, normalized by the number of words (want to minimize it)
- $PP(W_{test}) = P(w_1, …, w_N)^{-1/N}$
- can think of this as the weighted average branching factor of a language
- should only be compared across models w/ same vocab
- vocabulary
- sometimes closed, otherwise have unknown words, which get their own symbol (e.g. &lt;UNK&gt;)
- can fix the training vocab, or just keep the top V words and mark the rest as unknown
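a minimal sketch of the above: a bigram model with add-one smoothing on a made-up two-sentence corpus, with perplexity computed in log space (the corpus, smoothing choice, and special tokens are just for illustration)

```python
# minimal bigram language model with add-one (laplace) smoothing + perplexity
import math
from collections import Counter

corpus = [["<s>", "i", "like", "green", "eggs", "</s>"],
          ["<s>", "i", "like", "ham", "</s>"]]
vocab = {w for sent in corpus for w in sent}
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w_prev, w):
    # markov assumption: P(w | history) ~= P(w | w_prev), smoothed so unseen pairs get nonzero prob
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))

def perplexity(sent):
    # PP(W) = P(w_1..w_N)^(-1/N), computed in log space for numerical stability
    log_p = sum(math.log(p_bigram(sent[i], sent[i + 1])) for i in range(len(sent) - 1))
    n = len(sent) - 1  # number of predicted tokens
    return math.exp(-log_p / n)

print(perplexity(["<s>", "i", "like", "ham", "</s>"]))  # low PP: seen in training
print(perplexity(["<s>", "ham", "like", "i", "</s>"]))  # higher PP: unlikely word order
```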
- topic models (e.g. LDA) - apply unsupervised learning on large sets of text to learn sets of associated words
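quick sketch of topic modeling with LDA, using sklearn's implementation on a toy set of documents (assumes sklearn ≥ 1.0 for `get_feature_names_out`; documents and topic count are made up)

```python
# topic modeling sketch: LDA learns sets of words that tend to appear together
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets closed", "investors sold stocks and bonds"]
counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)                       # document-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:]]  # most probable words per topic
    print(f"topic {k}: {top}")
```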
- embeddings - vectors for representing words
- ex. tf-idf - co-occurrence counts weighted by term frequency and inverse document frequency (big + sparse)
- pointwise mutual info (PMI) - instead of raw counts, ask whether 2 words co-occur more than we would expect by chance (see the PPMI sketch below)
- ex. word2vec - short, dense vectors
- intuition: train classifier on binary prediction: is word w likely to show up near this word? (algorithm also called skip-gram)
- the weights are the embeddings
- also GloVe, which is based on ratios of word co-occurrence probs
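sketch of the PMI idea above: start from a word-word co-occurrence matrix (the counts here are made up), convert to probabilities, and clip negative PMI values to get PPMI

```python
# positive PMI (PPMI) from a toy word-word co-occurrence matrix
import numpy as np

vocab = ["apricot", "pineapple", "digital", "information"]
# C[i, j] = how often word i occurs near word j (fake numbers)
C = np.array([[0, 2, 0, 1],
              [2, 0, 1, 1],
              [0, 1, 0, 6],
              [1, 1, 6, 0]], dtype=float)

total = C.sum()
p_ij = C / total                               # joint prob of seeing the pair
p_i = C.sum(axis=1, keepdims=True) / total     # marginal for the target word
p_j = C.sum(axis=0, keepdims=True) / total     # marginal for the context word

with np.errstate(divide="ignore"):
    pmi = np.log2(p_ij / (p_i * p_j))          # > 0 means co-occur more than chance
ppmi = np.maximum(pmi, 0)                      # clip negatives (standard PPMI)
print(np.round(ppmi, 2))
```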
dl for nlp
- some recent topics based on this blog
- when training rnn, accumulate gradients over sequence and then update all at once
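sketch of that training pattern in pytorch: run the whole sequence, sum the per-step losses, then do a single backward/step (model sizes and data are placeholders)

```python
# one rnn training step: loss accumulated over the sequence, single update at the end
import torch
import torch.nn as nn

vocab_size, hidden = 50, 32
rnn = nn.RNN(input_size=vocab_size, hidden_size=hidden, batch_first=True)
readout = nn.Linear(hidden, vocab_size)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 10, vocab_size)            # (batch, seq_len, features) - fake inputs
targets = torch.randint(0, vocab_size, (1, 10))

opt.zero_grad()
out, _ = rnn(x)                                # hidden state at every time step
logits = readout(out)                          # (1, 10, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))  # summed over all steps
loss.backward()                                # gradients flow back through the whole sequence
opt.step()                                     # single update after processing the sequence
```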
- stacked rnns feed the outputs of one rnn into another rnn
- bidirectional rnn - one rnn left to right and another right to left (can concatenate, add, etc.)
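both ideas in one pytorch call (sizes are placeholders):

```python
# stacked + bidirectional lstm
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32,
               num_layers=2,        # stacked: layer 2 reads layer 1's outputs
               bidirectional=True,  # one pass left-to-right, one right-to-left
               batch_first=True)

x = torch.randn(4, 20, 16)          # (batch, seq_len, features)
out, (h, c) = lstm(x)
print(out.shape)                    # (4, 20, 64): forward + backward outputs concatenated
```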
- standard seq2seq
- encoder reads input and outputs context vector (the hidden state)
- decoder (rnn) takes this context vector and generates a sequence
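bare-bones version of this in pytorch: the encoder's final hidden state is the single context vector handed to the decoder (all sizes and the greedy loop are placeholders, not a full implementation)

```python
# minimal seq2seq: encode, then decode from the context vector
import torch
import torch.nn as nn

emb, hid, out_vocab = 16, 32, 40
encoder = nn.GRU(emb, hid, batch_first=True)
decoder = nn.GRU(emb, hid, batch_first=True)
readout = nn.Linear(hid, out_vocab)

src = torch.randn(1, 7, emb)                 # source sentence as fake embeddings
_, context = encoder(src)                    # context vector = final hidden state, (1, 1, hid)

dec_in = torch.zeros(1, 1, emb)              # stand-in for the <start> token embedding
h = context
for _ in range(5):                           # generate a few tokens
    dec_out, h = decoder(dec_in, h)
    logits = readout(dec_out[:, -1])         # next-word distribution
    # a real decoder would embed the chosen word and feed it back in as dec_in
```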
- attention
- encoder reads input and outputs a context vector after each word
- decoder at each step uses a different weighted combination of these context vectors
- specifically, at each step decoder concatenates its hidden state w/ the attention vector (the weighted combination of the context vectors)
- this is fed to a feedforward net to output a word
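sketch of one decoder step with attention; dot-product scoring is used here for brevity (the original attention papers use learned scoring functions)

```python
# one decoder step: score each encoder state, softmax, take a weighted sum
import torch
import torch.nn.functional as F

hid = 32
enc_states = torch.randn(7, hid)      # one context vector per source word
dec_hidden = torch.randn(hid)         # decoder hidden state at this step

scores = enc_states @ dec_hidden      # (7,) how relevant is each source word?
weights = F.softmax(scores, dim=0)    # attention distribution over source words
attn_vector = weights @ enc_states    # (hid,) weighted combination of context vectors

# concatenate with the decoder state; a feedforward net maps this to the output word
combined = torch.cat([dec_hidden, attn_vector])   # (2 * hid,)
```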
- transformer - proposed in attention is all you need paper
- self-attention - layer that lets each word learn its relation to the other words in the input
- many stacked layers in encoder + decoder (not rnn - self-attention + feed forward)
- self-attention 3 components
- queries
- keys
- values
- for each word, want score telling how much importance to place on each other word (queries * keys)
- softmax this and use it to do weighted sum of values
- multi-headed attention has several of each of these (then just concat them)
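minimal single-head version of this in pytorch (dimensions are arbitrary; a real transformer adds masking, multiple heads, and output projections)

```python
# scaled dot-product self-attention: queries, keys, values from the same input
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_k, seq_len = 64, 16, 5
x = torch.randn(seq_len, d_model)      # one vector per word

W_q, W_k, W_v = (nn.Linear(d_model, d_k, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)       # queries, keys, values: (seq_len, d_k)

scores = Q @ K.T / d_k ** 0.5          # how much each word attends to every other word
weights = F.softmax(scores, dim=-1)    # each row sums to 1
out = weights @ V                      # weighted sum of values, (seq_len, d_k)
```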
- recent papers
- Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le)
- ELMo (by Matthew Peters and researchers from AI2 and UW CSE) - no fixed word embeddings - learns contextual embeddings w/ a bidirectional lstm trained on language modelling
- each word's representation is a weighted sum of its hidden states across the lstm layers
- ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder)
- OpenAI transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever)
- culminated in bert
- semi-supervised learning (predict masked word - this is bidirectional) + supervised finetuning
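to see the masked-word objective in action, a fill-mask sketch with the huggingface transformers library (the library choice and checkpoint name are assumptions, not from these notes; the first call downloads the model)

```python
# bert's pretraining task: predict a masked word using context on both sides
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("the goal of a language model is to [MASK] the next word."):
    print(round(pred["score"], 3), pred["token_str"])
```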
- these ideas are starting to be applied to vision cnns