nlp
useful tools
- eli5 has nice text highlighting for interp
nlp basics
- basics come from book “Speech and Language Processing”
- **language models** - assign probabilities to sequences of words
- ex. n-gram model - assigns probs to short sequences of words, known as n-grams (see the bigram sketch below)
- for a full sentence, use the markov assumption (condition only on the previous n-1 words)
- eval: perplexity (PP) - inverse probability of the test set, normalized by the number of words (want to minimize it)
- $PP(W_{test}) = P(w_1, …, w_N)^{-1/N}$
- can think of this as the weighted average branching factor of a language
- should only be compared across models w/ same vocab
- vocabulary
- sometimes closed, otherwise have unknown words, which get their own symbol (e.g. &lt;UNK&gt;)
- can fix the training vocab, or just keep the top V words and mark the rest as unknown
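a minimal sketch of the above: a bigram model with add-one smoothing on a made-up two-sentence corpus, with perplexity computed in log space (the corpus, smoothing choice, and special tokens are just for illustration)

```python
# minimal bigram language model with add-one (laplace) smoothing + perplexity
import math
from collections import Counter

corpus = [["<s>", "i", "like", "green", "eggs", "</s>"],
          ["<s>", "i", "like", "ham", "</s>"]]
vocab = {w for sent in corpus for w in sent}
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w_prev, w):
    # markov assumption: P(w | history) ~= P(w | w_prev), smoothed so unseen pairs get nonzero prob
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))

def perplexity(sent):
    # PP(W) = P(w_1..w_N)^(-1/N), computed in log space for numerical stability
    log_p = sum(math.log(p_bigram(sent[i], sent[i + 1])) for i in range(len(sent) - 1))
    n = len(sent) - 1  # number of predicted tokens
    return math.exp(-log_p / n)

print(perplexity(["<s>", "i", "like", "ham", "</s>"]))  # low PP: seen in training
print(perplexity(["<s>", "ham", "like", "i", "</s>"]))  # higher PP: unlikely word order
```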
- topic models (e.g. LDA) - apply unsupervised learning on large sets of text to learn sets of associated words
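quick sketch of topic modeling with LDA, using sklearn's implementation on a toy set of documents (assumes sklearn ≥ 1.0 for `get_feature_names_out`; documents and topic count are made up)

```python
# topic modeling sketch: LDA learns sets of words that tend to appear together
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets closed", "investors sold stocks and bonds"]
counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)                       # document-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:]]  # most probable words per topic
    print(f"topic {k}: {top}")
```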
- embeddings - vectors for representing words
- ex. tf-idf - co-occurrence counts weighted by term frequency and inverse document frequency (big + sparse)
- pointwise mutual info (PMI) - instead of raw counts, ask whether 2 words co-occur more than we would expect by chance (see the PPMI sketch below)
- ex. word2vec - short, dense vectors
- intuition: train classifier on binary prediction: is word w likely to show up near this word? (algorithm also called skip-gram)
- the weights are the embeddings
- also GloVe, which is based on ratios of word co-occurrence probs
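sketch of the PMI idea above: start from a word-word co-occurrence matrix (the counts here are made up), convert to probabilities, and clip negative PMI values to get PPMI

```python
# positive PMI (PPMI) from a toy word-word co-occurrence matrix
import numpy as np

vocab = ["apricot", "pineapple", "digital", "information"]
# C[i, j] = how often word i occurs near word j (fake numbers)
C = np.array([[0, 2, 0, 1],
              [2, 0, 1, 1],
              [0, 1, 0, 6],
              [1, 1, 6, 0]], dtype=float)

total = C.sum()
p_ij = C / total                               # joint prob of seeing the pair
p_i = C.sum(axis=1, keepdims=True) / total     # marginal for the target word
p_j = C.sum(axis=0, keepdims=True) / total     # marginal for the context word

with np.errstate(divide="ignore"):
    pmi = np.log2(p_ij / (p_i * p_j))          # > 0 means co-occur more than chance
ppmi = np.maximum(pmi, 0)                      # clip negatives (standard PPMI)
print(np.round(ppmi, 2))
```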
dl for nlp
- some recent topics based on this blog
- when training rnn, accumulate gradients over sequence and then update all at once
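sketch of that training pattern in pytorch: run the whole sequence, sum the per-step losses, then do a single backward/step (model sizes and data are placeholders)

```python
# one rnn training step: loss accumulated over the sequence, single update at the end
import torch
import torch.nn as nn

vocab_size, hidden = 50, 32
rnn = nn.RNN(input_size=vocab_size, hidden_size=hidden, batch_first=True)
readout = nn.Linear(hidden, vocab_size)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 10, vocab_size)            # (batch, seq_len, features) - fake inputs
targets = torch.randint(0, vocab_size, (1, 10))

opt.zero_grad()
out, _ = rnn(x)                                # hidden state at every time step
logits = readout(out)                          # (1, 10, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))  # summed over all steps
loss.backward()                                # gradients flow back through the whole sequence
opt.step()                                     # single update after processing the sequence
```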
- stacked rnns feed the outputs of one rnn into another rnn
- bidirectional rnn - one rnn left to right and another right to left (can concatenate, add, etc.)
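both ideas in one pytorch call (sizes are placeholders):

```python
# stacked + bidirectional lstm
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32,
               num_layers=2,        # stacked: layer 2 reads layer 1's outputs
               bidirectional=True,  # one pass left-to-right, one right-to-left
               batch_first=True)

x = torch.randn(4, 20, 16)          # (batch, seq_len, features)
out, (h, c) = lstm(x)
print(out.shape)                    # (4, 20, 64): forward + backward outputs concatenated
```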
- standard seq2seq
- encoder reads input and outputs context vector (the hidden state)
- decoder (rnn) takes this context vector and generates a sequence
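bare-bones version of this in pytorch: the encoder's final hidden state is the single context vector handed to the decoder (all sizes and the greedy loop are placeholders, not a full implementation)

```python
# minimal seq2seq: encode, then decode from the context vector
import torch
import torch.nn as nn

emb, hid, out_vocab = 16, 32, 40
encoder = nn.GRU(emb, hid, batch_first=True)
decoder = nn.GRU(emb, hid, batch_first=True)
readout = nn.Linear(hid, out_vocab)

src = torch.randn(1, 7, emb)                 # source sentence as fake embeddings
_, context = encoder(src)                    # context vector = final hidden state, (1, 1, hid)

dec_in = torch.zeros(1, 1, emb)              # stand-in for the <start> token embedding
h = context
for _ in range(5):                           # generate a few tokens
    dec_out, h = decoder(dec_in, h)
    logits = readout(dec_out[:, -1])         # next-word distribution
    # a real decoder would embed the chosen word and feed it back in as dec_in
```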
- attention
- encoder reads input and outputs a context vector after each word
- decoder at each step uses a different weighted combination of these context vectors
- specifically, at each step decoder concatenates its hidden state w/ the attention vector (the weighted combination of the context vectors)
- this is fed to a feedforward net to output a word
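sketch of one decoder step with attention; dot-product scoring is used here for brevity (the original attention papers use learned scoring functions)

```python
# one decoder step: score each encoder state, softmax, take a weighted sum
import torch
import torch.nn.functional as F

hid = 32
enc_states = torch.randn(7, hid)      # one context vector per source word
dec_hidden = torch.randn(hid)         # decoder hidden state at this step

scores = enc_states @ dec_hidden      # (7,) how relevant is each source word?
weights = F.softmax(scores, dim=0)    # attention distribution over source words
attn_vector = weights @ enc_states    # (hid,) weighted combination of context vectors

# concatenate with the decoder state; a feedforward net maps this to the output word
combined = torch.cat([dec_hidden, attn_vector])   # (2 * hid,)
```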
- transformer - proposed in attention is all you need paper
- self-attention - layer that lets each word learn its relation to the other words in the input
- many stacked layers in encoder + decoder (not rnn - self-attention + feed forward)
- self-attention 3 components
- queries
- keys
- values
- for each word, want score telling how much importance to place on each other word (queries * keys)
- softmax this and use it to do weighted sum of values
- multi-headed attention has several of each of these (then just concat them)
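minimal single-head version of this in pytorch (dimensions are arbitrary; a real transformer adds masking, multiple heads, and output projections)

```python
# scaled dot-product self-attention: queries, keys, values from the same input
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_k, seq_len = 64, 16, 5
x = torch.randn(seq_len, d_model)      # one vector per word

W_q, W_k, W_v = (nn.Linear(d_model, d_k, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)       # queries, keys, values: (seq_len, d_k)

scores = Q @ K.T / d_k ** 0.5          # how much each word attends to every other word
weights = F.softmax(scores, dim=-1)    # each row sums to 1
out = weights @ V                      # weighted sum of values, (seq_len, d_k)
```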
- recent papers
- Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le)
- ELMo (by Matthew Peters and researchers from AI2 and UW CSE) - no fixed word embeddings - learns contextual embeddings w/ a bidirectional lstm trained on language modelling
- each word's representation is a weighted sum of its hidden states across the lstm layers
- ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder)
- OpenAI transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever)
- culminated in bert
- semi-supervised learning (predict masked word - this is bidirectional) + supervised finetuning
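to see the masked-word objective in action, a fill-mask sketch with the huggingface transformers library (the library choice and checkpoint name are assumptions, not from these notes; the first call downloads the model)

```python
# bert's pretraining task: predict a masked word using context on both sides
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("the goal of a language model is to [MASK] the next word."):
    print(round(pred["score"], 3), pred["token_str"])
```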
- these ideas are starting to be applied to vision cnns