Gong Qingfeng | info Retrieval

info retrieval


introduction

  • building blocks of search engines
    • search (user initiates)
    • recommendations - proactive search engine (program initiates e.g. pandora, netflix)
    • information retrieval - activity of obtaining info relevant to an information need from a collection of resources
    • information overload - too much information to process
    • memex - device which stores records so it can be consulted with exceeding speed and flexibility (search engine)
  • IR pieces
    1. Indexed corpus (static)
      • crawler and indexer - gathers the info constantly, takes the whole internet as input and outputs some representation of the document
        • web crawler - automatic program that systematically browses web
      • document analyzer - knows which section has what - takes in the metadata and outputs the index (condensed); manages content to provide efficient access to web documents
    2. User
      • query parser - parses the search terms into the system’s managed representation
    3. Ranking
      • ranking model - takes in the query representation and the indices, sorts according to relevance, outputs the results
      • also need nice display
      • query logs - record user’s search history
      • user modeling - assess user’s satisfaction
  • steps
    1. repository -> document representation
    2. query -> query representation
    3. ranking is performed between the 2 representations and given to the user
    4. evaluation - by users
  • information retrieval:
    1. recommendation
    2. question answering
    3. text mining
    4. online advertisement

related fields

these fields are converging: databases are adding approximate search, and information extraction converts unstructured data into structured data:

  • database systems vs. information retrieval:
    • structured data vs. unstructured data
    • semantics are well-defined vs. semantics are subjective
    • structured query languages (ex. SQL) vs. simple keyword queries
    • exact retrieval vs. relevance-driven retrieval
    • emphasis on efficiency vs. emphasis on effectiveness
  • natural language processing - currently the bottleneck
    • deep understanding of language
    • cognitive approaches vs. statistical
    • small scale problems vs. large
  • developing areas
    • currently mobile search is big - needs to use less data, everything needs to be more summarized
    • interactive retrieval - like a human being, should collaborate
  • core concepts
    • information need - desire to locate and obtain info to satisfy a need
    • query - a designed representation of user’s need
    • document - representation of info that could satisfy need
    • relevance - relatedness between documents and need, this is vague
      • multiple perspectives: topical, semantic, temporal, spatial (ex. gas stations shouldn’t be behind you)
  • Yahoo used to have a system where you browsed based on structure (browsing), but didn’t have queries (querying)
    • better when user doesn’t know keywords, just wants to explore
    • push mode - systems push relevant info to users without a query
    • pull mode - users pull out info using keywords

web crawler

  • the web crawler determines the upper bound of what a search engine can retrieve
  • loop over all URLs (difficult to decide the visiting order)
    • make sure it’s not visited
    • read it and save it as indexed
    • mark it as visited
  • visiting strategy
    • breadth first - has to memorize all nodes on previous level
    • depth first - explore the web by branch
    • focused crawling - prioritize the new links by predefined strategies
      • not all documents are equally important
      • prioritize by in-degree
      • prioritize by PageRank - breadth-first in the early stage, then approximate PageRank periodically
      • prioritize by topical relevance
        • estimate the similarity by anchor text or text near the anchor
      • some websites provide a site map for Google and disallow certain pages via robots.txt (ex. cnn.com/robots.txt)
      • some websites push info to google so it doesn’t need to be crawled (ex. news websites)
    • need to revisit to get changed info
      • uniform re-visiting (what google does)
      • proportional re-visiting (visiting frequency is proportional to page’s update frequency)
  • html parsing
    • shallow parsing - only keep text between title and p tags
    • automatic wrapper generation - regular expression for HTML tags’ combination
  • representation
    • long string has no semantic meaning
    • list of sentences - sentence is just short document (recursive definition)
    • list of words
    • tokenization - break a stream of text into meaningful units
      • several statistical methods
    • bag-of-words representation
      • we get frequencies, but lose grammar and order
    • N-grams (improved)
      • contiguous sequence of n items from a given sequence of text (see the sketch at the end of this section)
        • for example, keep pairs of words
        • google uses n = 7
      • increases the vocabulary size to V^n
  • full text indexing
    • pros: preserves all information, fully automatic
    • cons: vocabulary gap (car vs. cars), very large storage
    • Zipf’s law - frequency of any word is inversely proportional to its rank in the frequency table
      • frequency decreases linearly with rank on a log-log plot
      • discrete version of a power law
      • stopwords - we ignore these and keep the meaningful part
        • head words take a large portion of occurrences but are meaningless e.g. the, a, an
        • tail words - major portion of the dictionary, but rare e.g. dextrosinistral
        • risk: we lose structure ex. “this is not a good option” -> “option”
    • normalization
      • convert different forms of a word to normalized form
        • USA St. Louis -> Saint Louis
        • rule based: delete period, all lower case
        • dictionary based: equivalence classes ex. cell phone -> mobile phone
        • stemming: ladies -> lady, referring -> refer
          • risks: lay -> lie
          • solutions
            • Porter stemmer - pattern of vowel-consonant sequence
            • Krovetz stemmer - morphological rules
          • empirically, stemming still hurts performance
    • modern search engines don’t do stemming or stopword removal
    • more advanced NLP techniques are applied - ex. did you search for a person? location?
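
A minimal sketch of the tokenization, bag-of-words, and n-gram ideas above (the helper names and the toy document are my own, not from the notes):

```python
# Minimal sketch: tokenize a document, build a bag-of-words count vector,
# and extract word n-grams.
from collections import Counter
import re

def tokenize(text):
    # crude tokenizer: lowercase and split on non-alphanumeric characters
    return [t for t in re.split(r"\W+", text.lower()) if t]

def bag_of_words(tokens):
    # bag-of-words: keep term frequencies, discard order and grammar
    return Counter(tokens)

def ngrams(tokens, n=2):
    # contiguous sequences of n tokens; the vocabulary grows roughly like V^n
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = "Information retrieval finds relevant information in large collections."
toks = tokenize(doc)
print(bag_of_words(toks))
print(ngrams(toks, 2))
```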

inverted index

  • simple attempt
    • documents have been crawled from the web, tokenized/normalized, represented as bag-of-words
    • try to match keywords to the documents
    • space complexity O(d*v) where d = # docs, v = vocab size
      • Zipf’s law: most of space is wasted so we only store the occurred words
      • instead of an array, we store a linked list for each doc
    • time complexity O(q * d_length * d_num) where q = query length, d_length = average doc length, d_num = # docs
  • solution
    • look-up table for each word, key is word, value is list of documents that contain it
    • time-complexity O(q*l) where l is average length of list of documents containing word
      • by Zipf’s law, l << d_length * d_num
  • data structures
    • hashtable - modest size (length of dictionary)
    • postings - very large - sequential access, contain docId, term freq, term position…
      • compression is needed
    • sorting-based inverted index construction - map-reduce
      • from each doc extract tuples of (termId (key in hashtable), docId, count)
      • sort by termId within each doc
      • merge sort to get one list sorted by key in hashtable
      • compress terms with same termId and put into hashtable
  • features
    • needs to support approximate search, proximity search, dynamic index update
    • dynamic index update
      • periodically rebuild the index - acceptable if change is small over time and missing new documents is fine
    • auxiliary index
      • keep index for new docs in memory
      • merge to index when size exceeds threshold
      • soln: multiple auxiliary indices on disk, logarithmic merging
    • index compression
      • save space
      • increase cache efficiency
      • improve disk-memory transfer rate
      • coding theory: E[L] = ∑p(x_l) * l
      • instead of storing docId, we store gap between docIDs since they are ordered
        • biased distr. gives great compression: frequent words have smaller gaps, infrequent words have large gaps, so the large numbers don’t matter (Zipf’s law)
        • variable-length coding: fewer bits for small (high-frequency) integers (see the sketch at the end of this section)
  • more things put into index
    • document structure
      • title, abstract, body, bullets, anchor
      • entity annotation
      • these are put into the index as well and used together with the query at ranking time
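
A minimal sketch of an inverted index with gap-encoded postings, illustrating the look-up table and the docId-gap compression described above (the data and helper names are my own):

```python
# Minimal sketch: build an inverted index from tokenized docs, then compress
# each postings list by storing gaps between sorted docIds.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict docId -> list of tokens. Returns term -> list of (docId, term freq)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t][doc_id] = index[t].get(doc_id, 0) + 1
    # postings sorted by docId so gaps are non-negative
    return {t: sorted(postings.items()) for t, postings in index.items()}

def gap_encode(doc_ids):
    """Store the first docId, then differences; frequent terms give small gaps."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

docs = {1: ["obama", "healthcare", "bill"],
        3: ["obama", "family"],
        7: ["healthcare", "reform"]}
index = build_inverted_index(docs)
print(index["obama"])                                  # [(1, 1), (3, 1)]
print(gap_encode([d for d, _ in index["obama"]]))      # [1, 2]
```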

query processing

  • parse syntax ex. Barack AND Obama, orange OR apple
  • same processing as on documents: tokenization -> normalization -> stemming -> stopword removal
  • speed-up: start from lowest frequency to highest ones (easy to toss out documents)
  • phrase matching “computer science”
    • N-grams don’t work - the phrase could be very long
    • soln: generalized postings match
    • equality condition check with a requirement on the position pattern between two query terms
    • ex. t2.pos - t1.pos = 1 (t1 must be immediately before t2 in any matched document)
    • proximity query: $\vert t2.pos - t1.pos \vert \le k$
  • spelling correction
    • pick nearest alternative or pick most common alternative
    • proximity between query terms
    • edit distance = minimum number of edit operations to transform one string to another (see the sketch at the end of this section)
      • insert, replace, delete
    • speed-up
      • fix prefix length
      • build character-level inverted index
      • consider layout of a keyboard
    • phonetic similarity ex. “herman” -> “Hermann”
      • solve with phonetic hashing - similar-sounding terms hash to same value
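
A minimal sketch of the edit-distance computation used for spelling correction (insert / delete / replace); the dynamic-programming implementation details are my own:

```python
# Minimal sketch of Levenshtein edit distance (insert / delete / replace).
def edit_distance(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace (or match)
    return dp[m][n]

print(edit_distance("herman", "hermann"))  # 1
```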

user

  • result display
    1. relevance
    2. diversity
    3. navigation - query suggestion, search by example
  • list of links has always been there
  • search engine recommendations largely bias the user
  • direct answers (advanced I’m feeling lucky)
    • ex. 100 cm to inches
  • google was using user’s search result feedback
    • spammers were abusing this
  • social things have privacy concerns
  • instant search (refreshes search while you’re typing)
    • slightly slows things down
  • carrot2 - browsing not querying
    • foam tree display, has circles with sizes representing popularity
    • has a learning curve
  • pubmed - knows something about users, has keyword search and more
  • result display
    • relevance
  • most users only look at top left
    • this can be changed with multimedia content
    • HCI is attracting more attention now
  • mobile search
    • multitouch
    • less screen space

ranking model

  • naive boolean query “obama” AND “healthcare” NOT “news”
    • unions, intersects, lists
    • often over-constrained or under-constrained
    • also doesn’t give you relevance of returned documents
    • you can’t actually return all the documents
  • instead we rank docs for the user (top-k retrieval) with different kinds of relevance
    1. vector space model (uses similarity between query and document)
    • how to define similarity measure
    • both doc and query represented by concept vectors
      • k concepts define high-dimensional space
      • element of vector corresponds to concept weight
      • concepts should be orthogonal (non-overlapping in meaning)
      • could use terms, n-grams, topics, usually bag-of-words
    • weights: not all terms are equally important
      • TF - term frequency weighting - a frequent term is more important
        • normalization: tf(t,d) = 1 + log f(t,d) if f(t,d) > 0, else 0
          • or maximum normalization: tf(t,d) = a + (1-a) * f(t,d) / max_t f(t,d)
      • IDF weighting - a term is more discriminative if it occurs in fewer docs
        • IDF(t) = 1 + log(N / d_num(t)) where N = total # docs, d_num(t) = # docs containing t
        • total term frequency doesn’t work because words can frequently occur in a subset
      • combining TF and IDF - most widely used
        • w(t,d) = TF(t,d) * IDF(t)
    • similarity measure
      • Euclidean distance - penalizes longer docs too much
      • cosine similarity - dot product, then normalize by the vector lengths (see the sketch at the end of this section)
    • drawbacks
      • assumes term independence
      • assume query and doc to be the same
      • lack of predictive adequacy
      • lots of parameter tuning
  2. probabilistic model (uses probability of relevance)
    • vocabulary - set of words user can query with
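
To make the vector space model above concrete, here is a minimal sketch of TF-IDF weighting and cosine similarity; the weighting formulas follow the notes (1 + log tf, 1 + log(N/df)), while the toy corpus and helper names are mine:

```python
# Minimal sketch of the vector space model with TF-IDF weights and cosine similarity.
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    tf = Counter(tokens)
    # w(t,d) = TF(t,d) * IDF(t) with TF = 1 + log(f) and IDF = 1 + log(N/df)
    return {t: (1 + math.log(c)) * (1 + math.log(n_docs / df[t]))
            for t, c in tf.items() if t in df}

def cosine(v1, v2):
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    norm = math.sqrt(sum(x * x for x in v1.values())) * \
           math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

docs = [["obama", "healthcare", "bill"], ["obama", "family"], ["healthcare", "reform"]]
df = Counter(t for d in docs for t in set(d))   # document frequency of each term
vectors = [tfidf_vector(d, df, len(docs)) for d in docs]
query_vec = tfidf_vector(["obama", "healthcare"], df, len(docs))
scores = sorted(((cosine(query_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
print(scores)   # docs ranked by cosine similarity to the query
```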

latent semantic analysis - removes noise

  • terms aren’t necessarily orthogonal in the vector space model
    • synonyms: car vs. automobile
    • polysems: fly (action vs. insect)
  • independent concept space is preferred (axes could be sports, economics, etc.)
  • constructing concept space
    • automatic term expansion - cluster words based on thesaurus (WordNet does this)
    • word sense disambiguation - use dictionary, word-usage context
  • latent semantic analysis
    • assumption - there is some underlying structure that is obscured by the randomness of word choice
      • random noise contaminates term-document data
  • linear algebra - singular value decomposition
    • m x n matrix C with rank r
    • decompose into U * D * V^T, where D is an r x r diagonal matrix of singular values (their squares are eigenvalues of C*C^T)
    • U and V are orthogonal matrices
    • we put the singular values in D in descending order and keep only the first k as nonzero
    • this is a low-rank approximation (see the sketch at the end of this section)
    • the dot product of two docs’ low-rank representations gives their similarity
    • each doc’s new representation is its k-dimensional vector in the reduced concept space
  • principal component analysis - separates things based on the direction that maximizes variance
  • put query into low-rank space
  • LSA can also be used beyond text
  • SVD complexity is O(M*N^2) for an M x N matrix
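
A minimal sketch of LSA via a truncated SVD with numpy; the toy term-document matrix, the choice of k, and the query fold-in step are my own illustration:

```python
# Minimal sketch of LSA: truncated SVD of a term-document matrix.
import numpy as np

# term-document matrix C (rows = terms, columns = docs)
C = np.array([[1.0, 1.0, 0.0, 0.0],   # "car"
              [0.0, 1.0, 1.0, 0.0],   # "automobile"
              [0.0, 0.0, 1.0, 1.0]])  # "insect"

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2                                   # keep only the k largest singular values
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

doc_vecs = Vtk.T                        # row i = doc i's coordinates in the k-dim concept space
q = np.array([1.0, 0.0, 0.0])           # query containing just "car"
q_vec = np.diag(1.0 / sk) @ Uk.T @ q    # fold the query into the same low-rank space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print([cos(q_vec, d) for d in doc_vecs])  # similarity of the query to each doc
```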

probabilistic ranking principle - different approach, ML

  • total probability - use Bayes’ rule over a partition
  • Hypothesis space H={H_1,…,H_n}, training data E
  • $P(H_i\vert E) = P(E\vert H_i)P(H_i)/P(E)$
  • prior = P(H_i)
  • posterior = $P(H_i\vert E)$
  • to pick the most likely hypothesis H*, we can drop P(E)
    • $P(H_i\vert E) \propto P(E\vert H_i)P(H_i)$
  • losses - rank docs to minimize expected loss (equivalently, by descending probability of relevance)
    • a1 = loss(retrieved $\vert $ non-relevant)
    • a2 = loss(not retrieved $\vert $ relevant)
  • we need to make a relevance measure function
    • assume independent relevance, sequential browsing
    • most existing ir research has fallen into this line of thinking
  • conditional models for $P(R=1\vert Q,D)$
    • basic idea - relevance depends on how well a query matches a document
    • $P(R=1\vert Q,D) = g(Rep(Q,D); \theta)$, where Rep(Q,D) is a feature representation of the query-document pair and θ are model parameters
    • linear regression
  • MLE: $\hat{\theta} = argmax_\theta P(X\vert \theta)$
  • Bayesian (MAP): $\hat{\theta} = argmax_\theta P(X\vert \theta) P(\theta)$
machine learning
  • features/attributes for ranking - many things
  • use logistic regression to find relevance
  • little guidance on feature selection
  • this model has completely taken over
generative models for $P(R=1\vert Q,D)$
  • compute the odds $O(R=1\vert Q,D)$ using Bayes’ rule
language models
  • a model specifying probability distributions over different word sequences (a generative model)
    • too much memory for n-gram, so we use unigrams
  • generate text by sampling from discrete distribution
  • maximum likelihood estimation (MLE)
    • sampling with replacement (like picking marbles from bag) - gives you probability distributions
  • when you get a query, see which document is more likely to have generated it
  • MLE can’t represent unseen words (ex. ipad)
  • smoothing
    • we want to avoid log zero for these words, but we can’t arbitrarily add to the zero
    • instead we add to the zero probabilities and subtract from the probabilities of observed words
      1. additive smoothing - add a constant delta to the counts of each word
        • skews the counts in favor of infrequent terms - all words are treated equally
      2. absolute discounting - subtract from each nonzero word, distribute among zeros
        • reference smoothing - use reference language model to choose what to add
      3. linear interpolation - subtract a percentage of your probability, distribute among zeros
      4. dirichlet prior/bayesian - the amount of smoothing adapts to document length
        • effect of smoothing is to get rid of log(0), devalue very common words, and add weight to infrequent words
        • longer documents should borrow less from the reference model because they already contain more of the uncommon words (see the sketch at the end of this section)
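
A minimal sketch of query-likelihood scoring with a smoothed unigram language model, showing both linear interpolation (Jelinek-Mercer) and Dirichlet prior smoothing from the list above; the toy corpus and parameter values are mine:

```python
# Minimal sketch: unigram language models smoothed with a collection (reference) model.
import math
from collections import Counter

docs = {1: ["obama", "healthcare", "bill", "obama"],
        2: ["obama", "family", "picture"],
        3: ["healthcare", "reform", "news"]}

# collection (reference) language model p(w|C)
all_tokens = [t for d in docs.values() for t in d]
coll = Counter(all_tokens)
coll_total = len(all_tokens)

def p_coll(w):
    return coll[w] / coll_total

def score_jm(query, doc, lam=0.1):
    """log p(q|d) with linear interpolation: (1-lam)*p_ml(w|d) + lam*p(w|C)."""
    tf, dlen = Counter(doc), len(doc)
    return sum(math.log((1 - lam) * tf[w] / dlen + lam * p_coll(w)) for w in query)

def score_dirichlet(query, doc, mu=2.0):
    """log p(q|d) with Dirichlet prior: (tf + mu*p(w|C)) / (|d| + mu); smoothing shrinks for long docs."""
    tf, dlen = Counter(doc), len(doc)
    return sum(math.log((tf[w] + mu * p_coll(w)) / (dlen + mu)) for w in query)

query = ["obama", "reform"]   # "reform" is unseen in docs 1 and 2; smoothing avoids log(0)
for d_id, d in docs.items():
    print(d_id, round(score_jm(query, d), 3), round(score_dirichlet(query, d), 3))
```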

retrieval evaluation

  • evaluation criteria
    • small things - speed, # docs returned, spelling correction, suggestions
    • most important thing is satisfying the user’s information need
  • Cranfield experiments - retrieved documents’ relevance is a good proxy for a system’s utility in satisfying the user’s information need
  • standard benchmark - TREC, hosted by NIST
  • elements of evaluation
    1. document collection
    2. set of information needs expressible as queries
    3. relevance judgements - binary relevant, nonrelevant for each query-document pair
  • stats
    • type 1: false positive - wrongly returned
    • precision - fraction of retrieved documents that are relevant = $p(relevant|retrieved)$ = tp/(tp+fp)
    • recall - fraction of relevant documents that are retrieved = $p(retrieved|relevant)$ = tp/(tp+fn)
    • they generally trade off
  • evaluation is in terms of one query
    1. unordered evaluation - consider the documents unordered
      • calculate the precision P and recall R
      • combine them with the harmonic mean: F = 1 / (a*(1/P) + (1-a)*(1/R)) where a assigns the weights; usually a = 1/2 (equal weight), giving F1
        • F1 = 2 / (1/P + 1/R)
      • we use the harmonic mean instead of the arithmetic mean because a value very close to 0 dominates the result (its reciprocal is very large)
    2. ranked evaluation w/ binary relevance - consider the ranked results
      • precision vs. recall has a sawtooth-shaped curve
        • recall never decreases
        • precision increases if we find a relevant doc, decreases if we find an irrelevant one
      1. eleven-point interpolated precision (use recall levels 0, 0.1, 0.2, …, 1.0)
        • shouldn’t really use 1.0 - not very meaningful
      2. precision@k
        • ignore all docs ranked lower than k
        • only count the relevant docs among the top k
        • recall@k is problematic because it is hard to know how many docs are relevant
      3. MAP - mean average precision - usually best
        • considers the rank position of each relevant doc
        • compute p@k for each relevant doc
        • average precision = average of those p@k
        • mean average precision = mean over all the queries
        • weakness - assumes users are interested in finding many relevant docs, requires many relevance judgements
      4. MRR - mean reciprocal rank - only want one relevant doc
        • uses: looking for a fact, known-item search, navigational queries, query auto completion
        • reciprocal rank = 1/k where k is the ranking position of the 1st relevant document
        • mean reciprocal rank = mean over all the queries
  • ranked evaluation w/ numerical relevance
    • binary relevance is insufficient - highly relevant documents are more useful
    • gain is accumulated starting at the top and discounted at lower ranks
    • typical discount is 1/log(rank)
    • DCG (discounted cumulative gain) - total gain accumulated at a particular rank position p
      • DCG_p = rel_1 + sum(i=2 to p) rel_i / log_2(i)
      • DCG_p = sum_{i=1 to p}(2^rel_i - 1)/(log_2(1+i)) where rel_i is usually 0 to 4
      • this is what is actually used
      • emphasizes retrieving highly relevant documents
      • different queries have different numbers of relevant docs - have to normalize DCG
    • normalized DCG (NDCG) - normalize by the DCG of the ideal ranking (see the sketch at the end of this section)
  • statistical significance tests - an observed difference could just be due to chance (the particular queries you chose)
    • p-value - probability of the observed data under the null hypothesis; if p < alpha we reject the null hypothesis
      1. sign test
        • hypothesis - the median of the differences is zero
      2. wilcoxon signed rank test
        • hypothesis - the data are paired and come from the same population
      3. paired t-test
        • hypothesis - the differences have zero mean
      4. one-tail vs. two-tail
        • use two-tail
    • kappa statistic - measures agreement between assessors - kappa = (P(judges agree) - P(judges agree randomly)) / (1 - P(judges agree randomly))
      • = 1 if they fully agree, = 0 if they agree only as often as by chance, < 0 if they agree less than chance
      • P(judges agree randomly) = marginals for yes-yes and no-no
    • pooling - hard to annotate all docs - relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems
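
A minimal sketch of the ranked-evaluation metrics above (precision@k, average precision, DCG/NDCG); the helper names and toy relevance labels are my own:

```python
# Minimal sketch of ranked retrieval evaluation metrics.
import math

def precision_at_k(rels, k):
    """rels: list of 0/1 relevance labels in ranked order."""
    return sum(rels[:k]) / k

def average_precision(rels, n_relevant):
    """mean of p@k over the positions k of the relevant docs that were retrieved."""
    hits, prec_sum = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            prec_sum += hits / i
    return prec_sum / n_relevant if n_relevant else 0.0

def dcg(gains):
    """graded relevance, industry formula: sum (2^rel - 1) / log2(i + 1)."""
    return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))   # DCG of the ideal ranking
    return dcg(gains) / ideal if ideal else 0.0

ranked_binary = [1, 0, 1, 1, 0]          # relevance of the top 5 results, in rank order
print(precision_at_k(ranked_binary, 3))  # 2/3
print(average_precision(ranked_binary, n_relevant=4))
ranked_graded = [3, 2, 0, 1]             # graded relevance 0..4
print(round(ndcg(ranked_graded), 3))
```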

feedback as model interpolation

  • important that we take the distance from Q to D, not D to Q
    • this is because the measure is asymmetric (see the sketch below)
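
The notes do not name the asymmetric measure, but in the language-model feedback framework it is typically the KL divergence between the query model and the document model; a tiny sketch (the toy distributions are mine) showing that the two directions differ:

```python
# Hedged sketch: KL divergence is asymmetric, so KL(Q||D) != KL(D||Q) in general.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

Q = [0.7, 0.2, 0.1]   # toy query language model
D = [0.4, 0.4, 0.2]   # toy document language model
print(kl(Q, D), kl(D, Q))   # the two directions give different values
```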

mp3

  • 2^rel - rel can be 0 or 1
  • whenever you change stopword removal/stemming, have to rebuild index
    • otherwise, you will think they are all important

reading

as we may think

  • there are too many published things, hard to keep track

19 web search basics

  • client server design
    1. server communicates with client via a protocol such as http, in a markup language such as html
    2. client - generally a browser - can ignore what it doesn’t understand
  • we need to include authoritativeness when thinking about a document’s relevance
  • we can view html pages as nodes and hyperlinks as directed edges
  • power law: number of web pages w/ in-degree i ~ 1/(i^a)
  • bowtie structure: three types of webpages IN -> SCC -> OUT
  • spam - would repeat keywords to be included in searches
    • there is paid inclusion
  • cloaking - different page is shown to crawler than to user
  • doorway page - text and metadata to rank highly - then redirects
  • SEO (search engine optimizers) - consulting for helping people rank highly
    • search engine marketing - how to budget different keywords
  • some search engines started out without advertising
  • advertising - per click, per view
    • competitors can click-spam each other’s ads
  • types of queries
    • informational - general info
    • navigational - specific website
    • transactional - buying or downloading
  • difficult to get size of index
  • shingling - detect near-duplicate pages by comparing their sets of consecutive word sequences (shingles) (see the sketch below)
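
A minimal sketch of shingling for near-duplicate detection, comparing pages by the Jaccard similarity of their shingle sets; the toy pages and the choice of k are mine:

```python
# Minimal sketch: represent each page as a set of k-word shingles, compare with Jaccard.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

page1 = "a rose is a rose is a rose"
page2 = "a rose is a flower which is a rose"
print(round(jaccard(shingles(page1), shingles(page2)), 3))
```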

2.2, 20.1, 20.2

  • hard to tokenize

2.3, 2.4, 4, 5.2, 5.3

  • compression and vocabulary

1.3, 1.4 boolean retrieval

  • find lists for each term, then intersect or union or complement
    • lists need to be sorted by docId so we can just increment the pointers
  • we start with shortest lists and do operations to make things faster
    • at any point we only want to look at the smallest possible list (see the sketch below)
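
A minimal sketch of the two-pointer intersection of docId-sorted postings lists described above; the list contents are toy data:

```python
# Minimal sketch of boolean AND by merging two docId-sorted postings lists.
def intersect(p1, p2):
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1            # advance the pointer sitting on the smaller docId
        else:
            j += 1
    return out

obama      = [1, 3, 5, 8, 13]
healthcare = [2, 3, 8, 21]
print(intersect(obama, healthcare))   # [3, 8]
```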

6.2, 6.3, 6.4 vector space model

  • tf(t,d) = term frequency of term t in doc d
    • uses bag of words - order doesn’t matter, just frequency
    • often replaced by wf(t,d) = 1 + log(tf(t,d)) if tf(t,d) > 0, else 0, because having many more occurrences of a term doesn’t make the doc proportionally more relevant
    • also could normalize ntf(t,d) = a + (1-a)*tf(t,d)/tf_max(d) where a is a smoothing term
  • idf(t) = inverse document frequency of term t
    • collection frequency = total number of occurrences of a term in the collection.
    • document frequency df(t) = # docs that contain term t
    • idf(t) = log(N/df(t)) where N = #docs
  • combination weighting scheme: tf_idf(t,d) = tf(t,d)*idf(t) - (tf is actually log)
  • document score = sum over terms tf_idf(t,d)
  • cosine similarity = (doc1 · doc2) / ($\vert doc1\vert * \vert doc2\vert$) (the numerator is the dot product)
    • we want the highest possible similarity
    • euclidean distance penalizes long documents too much
  • similarity = cosine similarity of (query,doc)
  • only return top k results
  • pivoted normalized document length - penalizes long documents in general, but avoids over-penalizing them (see the note below)
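
If this refers to pivoted document length normalization (Singhal et al.), the usual form divides the score not by the raw length $\vert d\vert$ but by $(1-s) \cdot pivot + s \cdot \vert d\vert$, where pivot is the average document length and $s \in (0,1)$ is the slope; this penalizes long documents less harshly than full length normalization.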