Gong Qingfeng | nlp

nlp

useful tools

  • eli5 has nice text highlighting for interpretability

nlp basics

  • basics come from the book “Speech and Language Processing”
  • language models - assign probabilities to sequences of words
    • ex. n-gram model - assigns probs to short sequences of words, known as n-grams (see the bigram sketch after this block)
      • for a full sentence, use the markov assumption (condition each word only on the previous n-1 words)
    • eval: perplexity (PP) - inverse probability of the test set, normalized by the number of words (want to minimize it)
      • $PP(W_{test}) = P(w_1, …, w_N)^{-1/N}$
      • can think of this as the weighted average branching factor of a language
      • should only be compared across models w/ same vocab
    • vocabulary
      • sometimes closed; otherwise there are unknown words, which we assign their own symbol
      • can fix the training vocab, or just keep the top V words and map the rest to unknown
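To make the n-gram and perplexity bullets concrete, here is a minimal unsmoothed bigram sketch in plain Python; the toy corpus and boundary markers are made up for illustration.

```python
import math
from collections import Counter

# toy corpus with <s>/</s> sentence boundaries (data is made up for illustration)
train = [["<s>", "the", "cat", "sat", "</s>"],
         ["<s>", "the", "dog", "sat", "</s>"]]
test = [["<s>", "the", "cat", "sat", "</s>"]]

unigrams = Counter(w for sent in train for w in sent)
bigrams = Counter((a, b) for sent in train for a, b in zip(sent, sent[1:]))

def bigram_prob(prev, w):
    # markov assumption: P(w | full history) is approximated by P(w | prev)
    return bigrams[(prev, w)] / unigrams[prev]

def perplexity(sents):
    # PP(W) = P(w_1, ..., w_N)^(-1/N), computed in log space for stability
    log_p, n = 0.0, 0
    for sent in sents:
        for prev, w in zip(sent, sent[1:]):
            log_p += math.log(bigram_prob(prev, w))
            n += 1
    return math.exp(-log_p / n)

print(perplexity(test))  # lower is better; unseen bigrams would need smoothing
```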
  • topic models (e.g. LDA) - apply unsupervised learning on large sets of text to learn sets of associated words
  • embeddings - vectors for representing words
    • ex. tf-idf - big + sparse vectors built from counts of nearby words, reweighted so very common words count less
      • pointwise mutual info - instead of raw counts, ask whether 2 words co-occur more than we would expect by chance (see the count-based sketch after this list)
    • ex. word2vec - short, dense vectors
      • intuition: train a classifier on a binary prediction: is word w likely to show up near this word? (this variant of the algorithm is called skip-gram)
      • the learned classifier weights are the embeddings
      • also GloVe, which is based on ratios of word co-occurrence probs
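A small numpy sketch of the count-based side of this (co-occurrence counts reweighted with positive PMI); the corpus and the +/-1-word window are assumptions for illustration, not from the notes.

```python
import numpy as np

# toy corpus and a +/-1-word context window (both made up for illustration)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# word-word co-occurrence counts (big + sparse in practice)
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# positive PMI: do w and c co-occur more often than chance would predict?
total = C.sum()
p_wc = C / total
p_w = C.sum(axis=1, keepdims=True) / total
p_c = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)  # clip negatives and the -inf from zero counts

print(ppmi[idx["cat"], idx["sat"]])
```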

dl for nlp

  • some recent topics based on this blog
  • when training an rnn, accumulate gradients over the sequence and then update all at once
  • stacked rnns feed the outputs of one rnn into another rnn
  • bidirectional rnn - one rnn runs left to right and another right to left (their states can be concatenated, added, etc.)
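A PyTorch sketch of the stacked + bidirectional ideas above (the framework choice and all sizes are mine); the point is just the output shapes.

```python
import torch
import torch.nn as nn

# 2 stacked layers, each run left-to-right and right-to-left
rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
              bidirectional=True, batch_first=True)

x = torch.randn(4, 10, 16)        # (batch, seq_len, features)
out, (h, c) = rnn(x)

print(out.shape)  # (4, 10, 64): forward and backward states concatenated
print(h.shape)    # (4, 4, 32): num_layers * 2 directions, batch, hidden
```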
  • standard seq2seq
    • encoder reads the input and outputs a context vector (its final hidden state)
    • decoder (rnn) takes this context vector and generates a sequence
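A minimal PyTorch skeleton of this encoder/decoder split, assuming GRUs and made-up sizes; the encoder's final hidden state serves as the context vector.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, src):
        _, h = self.rnn(self.emb(src))
        return h                                 # (1, batch, dim): the context vector

class Decoder(nn.Module):
    def __init__(self, vocab, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tgt, context):
        o, _ = self.rnn(self.emb(tgt), context)  # context initializes the decoder
        return self.out(o)                       # logits over output words

enc, dec = Encoder(100, 32), Decoder(100, 32)
src = torch.randint(0, 100, (4, 7))
tgt = torch.randint(0, 100, (4, 5))
print(dec(tgt, enc(src)).shape)                  # (4, 5, 100)
```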
  • attention
    • encoder reads the input and outputs a context vector after each word
    • decoder at each step uses a different weighted combination of these context vectors
      • specifically, at each step decoder concatenates its hidden state w/ the attention vector (the weighted combination of the context vectors)
      • this is fed to a feedforward net to output a word
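One decoder step of this, sketched in PyTorch with simple dot-product scoring (the scoring function and sizes are my assumptions): score each encoder context vector, softmax, take the weighted sum, concatenate with the decoder state, and feed a small net.

```python
import torch
import torch.nn.functional as F

enc_states = torch.randn(4, 7, 32)   # (batch, src_len, dim): one context vector per source word
dec_hidden = torch.randn(4, 32)      # current decoder hidden state

scores = torch.bmm(enc_states, dec_hidden.unsqueeze(2)).squeeze(2)  # (4, 7)
weights = F.softmax(scores, dim=1)                                  # attention weights
attn_vec = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # weighted sum of contexts

combined = torch.cat([dec_hidden, attn_vec], dim=1)  # concat decoder state + attention vector
out_layer = torch.nn.Linear(64, 100)                 # feedforward net over the concatenation
logits = out_layer(combined)                         # scores over a toy 100-word vocab
```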
  • transformer - proposed in attention is all you need paper
    • self-attention - layer that lets each word learn its relation to the other words
    • many stacked layers in encoder + decoder (no rnn - just self-attention + feedforward)
    • self-attention has 3 components
      • queries
      • keys
      • values
      • for each word, want score telling how much importance to place on each other word (queries * keys)
      • softmax this and use it to do weighted sum of values
    • multi-headed attention has several of each of these (then just concat them)
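A compact sketch of scaled dot-product self-attention and multi-head concatenation (sizes, 8 heads, and random projections are illustrative; the 1/sqrt(d_k) scaling is from the paper, not these notes).

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    # queries, keys, values are projections of the same word vectors
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # importance of each other word
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                     # weighted sum of values

x = torch.randn(10, 64)                   # 10 word vectors
heads = []
for _ in range(8):                        # multi-headed: independent q/k/v projections per head
    wq, wk, wv = (torch.randn(64, 8) for _ in range(3))
    heads.append(self_attention(x, wq, wk, wv))
out = torch.cat(heads, dim=-1)            # just concat the heads -> (10, 64)
print(out.shape)
```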
  • recent papers culminated in bert
    • semi-supervised learning (predict masked word - this is bidirectional) + supervised finetuning
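As an illustration of the masked-word objective, a fill-mask call through the Hugging Face transformers library (the library and model name are my choice, not from the notes).

```python
from transformers import pipeline

# pretrained bert predicting a masked word
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The cat [MASK] on the mat."):
    print(pred["token_str"], round(pred["score"], 3))
```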
  • these ideas are starting to be applied to vision cnns