info retrieval
- introduction
- related fields
- web crawler
- inverted index
- query processing
- user
- ranking model
- latent semantic analysis - removes noise
- probabilistic ranking principle - different approach, ML
- retrieval evaluation
- reading
introduction
- building blocks of search engines
    - search (user initiates)
- recommendations - proactive search engine (program initiates e.g. pandora, netflix)
- information retrieval - activity of obtaining info relevant to an information need from a collection of resources
- information overload - too much information to process
- memex - device which stores records so it can be consulted with exceeding speed and flexibility (search engine)
 
- IR pieces
    - Indexed corpus (static)
        - crawler and indexer - gathers the info constantly, takes the whole internet as input and outputs some representation of the document
            - web crawler - automatic program that systematically browses web
 
- document analyzer - knows which section has what - takes in the metadata and outputs the (condensed) index; manages content to provide efficient access to web documents
 
- User
        - query parser - parses the search terms into a managed system representation
 
- Ranking
        - ranking model - takes in the query representation and the indices, sorts according to relevance, outputs the results
- also need nice display
- query logs - record user’s search history
- user modeling - assess user’s satisfaction
 
 
- steps: repository -> document representation
- query -> query representation
- ranking is performed between the 2 representations and given to the user
- evaluation - by users
 
- information retrieval:
    - recommendation
- question answering
- text mining
- online advertisement
 
related fields
they are all getting closer: databases add approximate search, and information extraction converts unstructured data to structured data:
| database systems | information retrieval | 
|---|---|
| structured data | unstructured data | 
| semantics are well-defined | semantics are subjective | 
| structured query languages (ex. SQL) | simple keyword queries | 
| exact retrieval | relevance-driven retrieval | 
| emphasis on efficiency | emphasis on effectiveness | 
- natural language processing - currently the bottleneck
    - deep understanding of language
- cognitive approaches vs. statistical
- small scale problems vs. large
 
- developing areas
    - currently mobile search is big - needs to use less data, everything needs to be more summarized
- interactive retrieval - like a human being, should collaborate
 
- core concepts
    - information need - desire to locate and obtain info to satisfy a need
- query - a designed representation of user’s need
- document - representation of info that could satisfy need
- relevance - relatedness between documents and need, this is vague
        - multiple perspectives: topical, semantic, temporal, spatial (ex. gas stations shouldn’t be behind you)
 
 
- Yahoo used to have system where you browsed based on structure (browsing), but didn’t have queries (querying)
    - better when user doesn’t know keywords, just wants to explore
- push mode - systems push relevant info to users without a query
- pull mode - users pull out info using keywords
 
web crawler
- web crawler determines upper bound for search engine
- loop over all URLs (difficult to set the visiting order)
    - make sure it hasn't been visited
- read it and save it as indexed
- mark it visited
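- a minimal sketch of this crawl loop (fetch_page, extract_links, index_page are hypothetical placeholders for the real downloader, parser, and indexer; a real crawler also needs politeness delays and robots.txt checks):

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch_page, extract_links, index_page, max_pages=1000):
    """Basic crawl loop: visit each URL once, index it, enqueue its out-links."""
    frontier = deque(seed_urls)      # URLs waiting to be visited (breadth-first order)
    visited = set()                  # make sure we don't visit a URL twice
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue                 # make sure it's not visited
        html = fetch_page(url)       # read it
        index_page(url, html)        # save it as indexed
        visited.add(url)             # mark it visited
        for link in extract_links(html):
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)
```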
 
- visiting strategy
    - breadth first - has to memorize all nodes on previous level
- depth first - explore the web by branch
- focused crawling - prioritize the new links by predefined strategies
        - not all documents are equally important
- prioritize by in-degree
- prioritize by PageRank - breadth-first in the early stage, then approximate PageRank periodically
- prioritize by topical relevance
            - estimate the similarity by anchor text or text near the anchor
 
- some websites provide a site map for google and disallow certain pages via robots.txt (ex. cnn.com/robots.txt)
- some websites push info to google so it doesn’t need to be crawled (ex. news websites)
 
- need to revisit to get changed info
        - uniform re-visiting (what google does)
- proportional re-visiting (visiting frequency is proportional to page’s update frequency)
 
 
- html parsing
    - shallow parsing - only keep text between title and p tags
- automatic wrapper generation - regular expression for HTML tags’ combination
 
- representation
    - long string has no semantic meaning
- list of sentences - sentence is just short document (recursive definition)
- list of words
- tokenization - break a stream of text into meaningful units
        - several statistical methods
 
- bag-of-words representation
        - we get frequencies, but lose grammar and order
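- a small sketch of tokenization into a bag-of-words (naive regex tokenizer assumed; counts only, grammar and order are discarded):

```python
import re
from collections import Counter

def tokenize(text):
    # break a stream of text into lowercase word tokens (very naive tokenizer)
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text):
    # frequencies only: grammar and word order are lost
    return Counter(tokenize(text))

print(bag_of_words("The crawler crawls the web, and the web is large."))
# Counter({'the': 3, 'web': 2, 'crawler': 1, 'crawls': 1, 'and': 1, 'is': 1, 'large': 1})
```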
 
- N-grams (improved)
        - contiguous sequence of n items from a given sequence of text
            - for example, keep pairs of words
- google uses n = 7
 
- increase vocabulary to V^n
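- a sketch of extracting word n-grams from a token list (n = 2 keeps pairs of words):

```python
def ngrams(tokens, n=2):
    # contiguous sequences of n tokens, joined into single index terms
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["information", "retrieval", "is", "fun"], 2))
# ['information_retrieval', 'retrieval_is', 'is_fun']
```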
 
 
- full text indexing
    - pros: preserves all information, fully automatic
- cons: vocab gap: car vs. cars, very large storage
- Zipf’s law - frequency of any word is inversely proportional to its rank in the frequency table
        - frequencies decrease linearly on a log-log scale
- discrete version of power law
- stopwords - we ignore these and keep the meaningful part
            - head words take large portion but are meaningless e.g. the, a, an
- tail words - major portion of dictionary, but rare e.g. dextrosinistral
- risk: we lose structure, ex. "this is not a good option" -> "option"
 
 
- normalization
        - convert different forms of a word to normalized form
            - USA St. Louis -> Saint Louis
- rule based: delete period, all lower case
- dictionary based: equivalence classes ex. cell phone -> mobile phone
- stemming: ladies -> lady, referring -> refer
                - risks: lay -> lie
- solutions
                    - Porter stemmer - pattern of vowel-consonant sequence
- Krovetz stemmer - morphological rules
 
- empirically, stemming still hurts performance
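- a toy rule-based normalization sketch (lowercasing, period removal, a made-up equivalence dictionary, and two crude suffix rules; a real system would use a Porter or Krovetz stemmer instead):

```python
EQUIVALENCE = {"cell phone": "mobile phone"}       # hypothetical dictionary-based classes

def normalize(term):
    term = term.lower().replace(".", "")           # rule based: delete periods, lower case
    term = EQUIVALENCE.get(term, term)             # dictionary based: map to a canonical form
    # toy stemming rules (real stemmers handle many more patterns, and can still err: lay -> lie)
    if term.endswith("ies"):
        term = term[:-3] + "y"                     # ladies -> lady
    elif term.endswith("ing") and len(term) > 5:
        term = term[:-3]                           # referring -> referr (Porter trims further)
    return term

print([normalize(t) for t in ["U.S.A", "Ladies", "referring", "cell phone"]])
# ['usa', 'lady', 'referr', 'mobile phone']
```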
 
 
 
- modern search engines don’t do stemming or stopword removal
- more advanced NLP techniques are applied - ex. did you search for a person? location?
 
inverted index
- simple attempt
    - documents have been crawled from the web, tokenized/normalized, represented as bag-of-words
- try to match keywords to the documents
- space complexity O(d*v) where d = # docs, v = vocab size
        - Zipf’s law: most of space is wasted so we only store the occurred words
- instead of an array, we store a linked list for each doc
 
- time complexity O(q * d_length * d_num) where q = length of query, d_length = average doc length, d_num = # docs
 
- solution
    - look-up table for each word, key is word, value is list of documents that contain it
- time-complexity O(q*l) where l is average length of list of documents containing word
        - by Zipf's law, l ≪ d_length * d_num (most posting lists are short)
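- a sketch of the look-up table and its use (each term maps to a sorted list of docIds; a query only touches the posting lists of its own terms):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {docId: list of tokens} -> {term: sorted list of docIds containing it}
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def retrieve(index, query_terms):
    # O(q * l): walk each query term's posting list instead of scanning every document
    postings = [set(index.get(t, [])) for t in query_terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: ["obama", "healthcare", "news"], 2: ["obama", "family"], 3: ["healthcare", "policy"]}
index = build_inverted_index(docs)
print(retrieve(index, ["obama", "healthcare"]))  # [1]
```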
 
 
- data structures
    - hashtable - modest size (length of dictionary)
- postings - very large - sequential access, contain docId, term freq, term position…
        - compression is needed
 
- sorting-based inverted index construction - map-reduce
        - from each doc extract tuples of (termId (key in hashtable), docId, count)
- sort by termId within each doc
- merge sort to get one list sorted by key in hashtable
- compress terms with same termId and put into hashtable
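- a single-machine sketch of this sorting-based construction (the map-reduce version shards the extract and sort steps across machines; termIds and docIds are assumed to already be integers):

```python
from itertools import groupby

def sort_based_index(docs):
    # docs: {docId: list of termIds}
    # 1. from each doc, extract (termId, docId, count) tuples
    tuples = []
    for doc_id, term_ids in docs.items():
        for term_id in set(term_ids):
            tuples.append((term_id, doc_id, term_ids.count(term_id)))
    # 2./3. sort by termId (the distributed version merge-sorts the per-doc runs)
    tuples.sort()
    # 4. group tuples with the same termId into one postings list keyed by termId
    return {term_id: [(doc_id, count) for _, doc_id, count in group]
            for term_id, group in groupby(tuples, key=lambda t: t[0])}

print(sort_based_index({1: [7, 7, 3], 2: [3, 9]}))
# {3: [(1, 1), (2, 1)], 7: [(1, 2)], 9: [(2, 1)]}
```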
 
 
- features
    - needs to support approximate search, proximity search, dynamic index update
- dynamic index update
        - periodically rebuild the index - acceptable if change is small over time and missing new documents is fine
 
- auxiliary index
        - keep index for new docs in memory
- merge to index when size exceeds threshold
- soln: multiple auxiliary indices on disk, logarithmic merging
 
- index compression
        - save space
- increase cache efficiency
- improve disk-memory transfer rate
- coding theory: expected code length E[L] = ∑_l p(x_l) * l
- instead of storing docId, we store gap between docIDs since they are ordered
            - biased distr. gives great compression: frequent words have smaller gaps, infrequent words have large gaps, so the large numbers don’t matter (Zipf’s law)
- variable-length coding: less bits for small (high frequency) integers
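- a sketch of gap encoding plus variable-byte coding for one sorted posting list (7 payload bits per byte, high bit flags the last byte of each gap):

```python
def vbyte_encode(numbers):
    # variable-length coding: fewer bytes for small (frequent-term) gaps
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80                   # mark the final (lowest-order) byte
        out.extend(reversed(chunk))
    return bytes(out)

def encode_postings(doc_ids):
    # store gaps between sorted docIds instead of the docIds themselves
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

print(list(encode_postings([824, 829, 215406])))
# gaps 824, 5, 214577 -> [6, 184, 133, 13, 12, 177]
```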
 
 
 
- more things put into index
    - document structure
        - title, abstract, body, bullets, anchor
- entity annotation
- these things are fed to the query
 
 
query processing
- parse syntax ex. Barack AND Obama, orange OR apple
- same processing as on documents: tokenization -> normalization -> stemming -> stopword removal
- speed-up: start from lowest frequency to highest ones (easy to toss out documents)
- phrase matching “computer science”
    - N-grams don't work - the phrase could be very long
- soln: generalized postings match
- equality condition check with a requirement on the position pattern between the two query terms
- ex. t2.pos - t1.pos = 1 (t1 must be immediately before t2 in any matched document)
- proximity query: $\vert t2.pos - t1.pos\vert \le k$
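- a sketch of the generalized postings match on positional postings (k = 1 checks that t2 immediately follows t1; a proximity query would compare |t2.pos - t1.pos| <= k instead):

```python
def positional_match(postings1, postings2, k=1):
    # postings: {docId: sorted list of positions for the term in that doc}
    matches = []
    for doc_id in sorted(postings1.keys() & postings2.keys()):
        pos1, pos2 = postings1[doc_id], postings2[doc_id]
        if any(0 < p2 - p1 <= k for p1 in pos1 for p2 in pos2):
            matches.append(doc_id)
    return matches

computer = {1: [0, 7], 2: [3]}
science = {1: [1], 2: [9]}
print(positional_match(computer, science, k=1))  # [1]: "computer science" is adjacent only in doc 1
```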
 
- spelling correction
    - pick nearest alternative or pick most common alternative
- proximity between query terms
- edit distance = minimum number of edit operations to transform one string to another
        - insert, replace, delete
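- a standard dynamic-programming sketch of edit distance, with insert, replace, and delete each costing 1:

```python
def edit_distance(s, t):
    # dp[i][j] = minimum edits to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                                # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace (or keep)
    return dp[m][n]

print(edit_distance("herman", "hermann"))  # 1
```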
 
- speed-up
        - fix prefix length
- build character-level inverted index
- consider layout of a keyboard
 
- phonetic similarity ex. “herman” -> “Hermann”
        - solve with phonetic hashing - similar-sounding terms hash to same value
 
 
user
- result display
    - relevance
- diversity
- navigation - query suggestion, search by example
 
- list of links has always been there
- search engine recommendations largely bias the user
- direct answers (advanced I’m feeling lucky)
    - ex. 100 cm to inches
 
- google was using user’s search result feedback
    - spammers were abusing this
 
- social things have privacy concerns
- instant search (refreshes search while you’re typing)
    - slightly slows things down
 
- carrot2 - browsing not querying
    - foam tree display, has circles with sizes representing popularity
- has a learning curve
 
- pubmed - knows something about users, has keyword search and more
 
- most users only look at top left
    - this can be changed with multimedia content
- HCI is attracting more attention now
 
- mobile search
    - multitouch
- less screen space
 
ranking model
- naive boolean query “obama” AND “healthcare” NOT “news”
    - unions, intersections, and complements of posting lists
- often over-constrained or under-constrained
- also doesn’t give you relevance of returned documents
- you can’t actually return all the documents
 
- instead we rank docs for the user (top-k retrieval) with different kinds of relevance
    - vector space model (uses similarity between query and document)
 - how to define similarity measure
- both doc and query represented by concept vectors
        - k concepts define high-dimensional space
- element of vector corresponds to concept weight
- concepts should be orthogonal (non-overlapping in meaning)
- could use terms, n-grams, topics, usually bag-of-words
 
- weights: not all terms are equally important
        - TF - term frequency weighting - a frequent term is more important
            - normalization: tf(t,d) = 1 + log(f(t,d)) if f(t,d) > 0, else 0
                - or proportionally: tf(t,d) = a + (1-a) * f(t,d)/max_t f(t,d)
 
 
- IDF weighting - a term is more discriminative if it occurs only in fewer docs
            - IDF(t) = 1+log(N/(d_num(t))) where N = total # docs, d_num(t) = # docs containing t
- total term frequency doesn’t work because words can frequently occur in a subset
 
- combining TF and IDF - most widely used
            - w(t,d) = TF(t,d) * IDF(t)
 
 
- similarity measure
        - Euclidean distance - penalizes longer docs too much
- cosine similarity - dot product and then normalize
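- a compact sketch of the vector space model with the TF-IDF weights above and cosine similarity (plain bag-of-words terms stand in for the concepts):

```python
import math
from collections import Counter

def tf(count):
    return 1 + math.log(count) if count > 0 else 0.0        # sub-linear term frequency

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return 1 + math.log(len(docs) / df) if df else 0.0      # IDF(t) = 1 + log(N / df(t))

def tfidf_vector(bag, docs):
    return {t: tf(c) * idf(t, docs) for t, c in bag.items()}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = [Counter("obama signs healthcare bill".split()),
        Counter("healthcare news today".split()),
        Counter("obama family news".split())]
query = Counter("obama healthcare".split())
print([round(cosine(tfidf_vector(query, docs), tfidf_vector(d, docs)), 3) for d in docs])
# doc 0 scores highest: it matches both query terms
```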
 
- drawbacks
        - assumes term independence
- assume query and doc to be the same
- lack of predictive adequacy
- lots of parameter tuning
 
 
- probabilistic model (uses probability of relevance)
    - vocabulary - set of words user can query with
 
latent semantic analysis - removes noise
- terms aren't necessarily orthogonal in the vector space model
    - synonyms: car vs. automobile
- polysems: fly (action vs. insect)
 
- independent concept space is preferred (axes could be sports, economics, etc.)
- constructing concept space
    - automatic term expansion - cluster words based on thesaurus (WordNet does this)
- word sense disambiguation - use dictionary, word-usage context
 
- latent semantic analysis
    - assumption - there is some underlying structure that is obscured by the randomness of word choice
        - random noise contaminates term-document data
 
 
- linear algebra - singular value decomposition
    - m x n matrix C with rank r
- decompose into U * D * V^T, where D is an r x r diagonal matrix of singular values (their squares are the eigenvalues of C^T C)
- U and V are orthogonal matrices
- we put the singular values in D into descending order and only keep the first k as nonzero
- this is low rank decomposition
- multiply the docs' low-rank representations to get their similarity
- each doc's column in the low-rank decomposition (D * V^T) is its new representation
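- a numpy sketch of the low-rank decomposition on a toy term-document count matrix (rows = terms, columns = docs), keeping only the top-k singular values and folding a query into the same space:

```python
import numpy as np

# toy term-document counts: rows are "car", "automobile", "flower"; columns are 4 docs
C = np.array([[2, 0, 1, 0],
              [0, 3, 1, 0],
              [0, 0, 0, 4]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)   # C = U * D * V^T, D = diag(s)
k = 2                                              # keep only the top-k singular values
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

doc_vectors = (np.diag(s_k) @ Vt_k).T              # each row: one doc in the k-dim concept space
print(doc_vectors.round(2))

# fold a query (vector of term counts) into the low-rank space: q_k = D_k^-1 * U_k^T * q
q = np.array([1.0, 0.0, 0.0])                      # query containing just "car"
print((np.diag(1 / s_k) @ U_k.T @ q).round(2))
```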
 
- principal component analysis - separate things based on direction that maximizes variance
- put query into low-rank space
- LSA can also be used beyond text
- computing the SVD of an M x N matrix costs O(M * N^2)
probabilistic ranking principle - different approach, ML
- total probability - use Bayes' rule over a partition
- Hypothesis space H={H_1,…,H_n}, training data E
- $P(H_i\vert E) = P(E\vert H_i)P(H_i)/P(E)$
- prior = P(H_i)
- posterior = $P(H_i\vert E)$
- to pick the most likely hypothesis H*, we drop P(E)
    - $P(H_i\vert E) \propto P(E\vert H_i)P(H_i)$
 
- losses - rank by ascending expected loss (equivalently, descending probability of relevance)
    - a1 = loss(retrieved $\vert $ non-relevant)
- a2 = loss(not retrieved $\vert $ relevant)
 
- we need to make a relevance measure function
    - assume independent relevance, sequential browsing
- most existing ir research has fallen into this line of thinking
 
- conditional models for $P(R=1\vert Q,D)$
    - basic idea - relevance depends on how well a query matches a document
- $P(R=1\vert Q,D) = g(Rep(Q,D), \theta)$
- linear regression
 
- MLE: prediction = $\arg\max_\theta P(X\vert \theta)$
- Bayesian: prediction = $\arg\max_\theta P(X\vert \theta) P(\theta)$
ml
- features/attributes for ranking - many things
- use logistic regression to find relevance
- little guidance on feature selection
- this model has completely taken over
generative models for $P(R=1\vert Q,D)$
- compute Odd($R=1\vert Q,D$) using Bayes’ rule
language models
- a model specifying probability distributions for different word sequences (generative model)
    - too much memory for n-gram, so we use unigrams
 
- generate text by sampling from discrete distribution
- maximum likelihood estimation (MLE)
    - sampling with replacement (like picking marbles from bag) - gives you probability distributions
 
- when you get a query see which document is more likely to generate the query
- MLE can’t represent unseen words (ex. ipad)
- smoothing
    - we want to avoid log zero for these words, but we can’t arbitrarily add to the zero
- instead we add to the zero probabilities and subtract from the probabilities of observed words
        - additive smoothing - add a constant delta to the counts of each word
            - skews the counts in favor of infrequent terms - all words are treated equally
 
- absolute discounting - subtract from each nonzero word, distribute among zeros
            - reference smoothing - use reference language model to choose what to add
 
- linear interpolation - subtract a percentage of your probability, distribute among zeros
- dirichlet prior/bayesian - not affected by document length
            - effect of smoothing is to get rid of log(0) and to devalue very common words and add weight to infrequent words
- longer documents should borrow less (smaller smoothing weight) because they have already seen more of the uncommon words
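- a sketch of query-likelihood scoring with a unigram language model and linear interpolation smoothing against the collection (reference) model; lambda is the interpolation weight and the query terms are assumed to occur somewhere in the collection:

```python
import math
from collections import Counter

def lm_score(query_terms, doc_terms, collection_counts, lam=0.1):
    # log P(q|d) with P(w|d) = (1 - lam) * f(w,d)/|d| + lam * P(w|collection)
    doc_counts, doc_len = Counter(doc_terms), len(doc_terms)
    coll_len = sum(collection_counts.values())
    score = 0.0
    for w in query_terms:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_ref = collection_counts[w] / coll_len            # reference (collection) model
        score += math.log((1 - lam) * p_doc + lam * p_ref) # no zero probabilities left
    return score

docs = [["obama", "signs", "healthcare", "bill"], ["healthcare", "news", "today"]]
collection = Counter(w for d in docs for w in d)
query = ["obama", "healthcare"]
print([round(lm_score(query, d, collection), 3) for d in docs])
# doc 0 scores higher: it covers both query terms
```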
 
 
 
retrieval evaluation
- evaluation criteria
    - small things - speed, # docs returned, spelling correction, suggestions
- most important thing is satisfying the user's information need
 
- Cranfield experiments - retrieved documents' relevance is a good proxy of a system's utility in satisfying user's information need
- standard benchmark - TREC, hosted by NIST
- elements of evaluation
    - document collection
- set of information needs expressible as queries
- relevance judgements - binary relevant, nonrelevant for each query-document pair
 
- stats
    - type 1: false positive - wrongly returned
- type 2: false negative - relevant doc wrongly missed
- precision - fraction of retrieved documents that are relevant = $p(relevant|retrieved)$ = tp/(tp+fp)
- recall - fraction of relevant documents that are retrieved = $p(retrieved|relevant)$ = tp/(tp+fn)
- they generally trade off
 
- evaluation is in terms of one query
    - unordered evaluation - consider the documents unordered
 - calculate the precision P and recall R
- combine them with harmonic mean: F = 1 / (a(1/P) + (1-a)(1/R)) where a assigns weights, usually pick a = 1/2
        - F1 = 2/(1/P + 1/R) = 2PR/(P+R)
 
- we use the harmonic mean instead of the normal mean because a P or R very close to 0 produces a very large reciprocal, which keeps F small
        - ranked evaluation w/ binary relevance - consider the ranked results
 
- precision vs recall has sawtooth shape curve
        - recall never decreases
- precision increases if we find a relevant doc, decreases if irrelevant
        1. eleven-point interpolated precision (use recall levels 0, 0.1, 0.2, …, 1.0)
            - shouldn't really use 1.0 - not very meaningful
        2. precision@k
            - ignore all docs ranked lower than k
            - only use relevant docs
            - recall@k is problematic because it is hard to know how many docs are relevant
        3. MAP - mean average precision - usually best
            - considers rank position of each relevant doc
            - compute p@k for each relevant doc
            - average precision = average of those p@k
            - mean average precision = mean over all the queries
            - weakness - assumes users are interested in finding many relevant docs, requires many relevance judgements
        4. MRR - mean reciprocal rank - only want one relevant doc
            - uses: looking for a fact, known-item search, navigational queries, query auto completion
            - reciprocal rank = 1/k where k is ranking position of 1st relevant document
            - mean reciprocal rank = mean over all the queries
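- a sketch of average precision and reciprocal rank for one ranked list of binary judgements (MAP and MRR are their means over all queries):

```python
def average_precision(ranked_relevance):
    # ranked_relevance: 0/1 judgements in rank order
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)        # p@k at each relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(ranked_relevance):
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / k                     # 1/k for the first relevant doc
    return 0.0

ranking = [1, 0, 1, 0, 0, 1]                   # relevant docs at ranks 1, 3, 6
print(average_precision(ranking))              # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722
print(reciprocal_rank(ranking))                # 1.0
```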
 
 
- ranked evaluation w/ numerical relevance
    - binary relevance is insufficient - highly relevant documents are more useful
- gain is accumulated starting at the top and discounted at lower ranks
- typical discount is 1/log(rank)
- DCG (discounted cumulative gain) - total gain accumulated at a particular rank position p
        - DCG_p = rel_1 + sum(i=2 to p) rel_i/log_2(i)
- DCG_p = sum_{i=1 to p}(2^rel_i - 1)/(log_2(1+i)) where rel_i is usually 0 to 4
- this is what is actually used
- emphasize on retrieving highly relevant documents
- different queries have different numbers of relevant docs - have to normalize DCG
 
- normalized DCG - normalize by the DCG of the ideal ranking
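- a sketch of DCG and normalized DCG using the exponential-gain form above (graded relevance, typically 0 to 4):

```python
import math

def dcg(rels):
    # DCG_p = sum over ranks i of (2^rel_i - 1) / log2(1 + i)
    return sum((2 ** rel - 1) / math.log2(1 + i) for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))    # DCG of the ideal ranking of the same docs
    return dcg(rels) / ideal if ideal else 0.0

ranking = [3, 2, 0, 1]                         # graded relevance in rank order
print(round(dcg(ranking), 3))                  # 9.323
print(round(ndcg(ranking), 3))                 # 0.993 (ideal order would be [3, 2, 1, 0])
```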
 
- statistical significance tests - an observed difference could just be due to the particular queries you chose
    - p-value - prob of the observed data under the null hypothesis; if p < alpha we reject the null hypothesis
        1. sign test
            - hypothesis - difference median is zero
        2. wilcoxon signed rank test
            - hypothesis - data are paired and come from the same population
        3. paired t-test
            - hypothesis - difference has zero mean value
        4. one-tail v. two-tail
            - lol use two-tail
 
- kappa statistic - measures agreement between assessors - kappa = (P(judges agree) - P(judges agree randomly)) / (1 - P(judges agree randomly))
        - = 1 if they fully agree
- = 0 if they agree only as much as chance predicts, < 0 if they agree less than chance
- P(judges agree randomly) = computed from the marginals: P(yes)*P(yes) + P(no)*P(no)
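- a sketch of the kappa computation from a 2x2 table of counts (yes-yes, yes-no, no-yes, no-no between two assessors):

```python
def kappa(yy, yn, ny, nn):
    total = yy + yn + ny + nn
    p_agree = (yy + nn) / total
    # chance agreement from the marginals: P(yes)*P(yes) + P(no)*P(no)
    p_yes1, p_yes2 = (yy + yn) / total, (yy + ny) / total
    p_chance = p_yes1 * p_yes2 + (1 - p_yes1) * (1 - p_yes2)
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(300, 20, 10, 70), 3))  # 0.776: substantial but not perfect agreement
```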
 
- pooling - hard to annotate all docs - relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems
 
feedback as model interpolation
- important that we take distance from Q to D not D to Q
    - this is because the measure is asymmetric
 
mp3
- 2^rel - rel can be 0 or 1
- whenever you change stopword removal/stemming, have to rebuild index
    - otherwise, you will think they are all important
 
reading
as we may think
- there are too many published things, hard to keep track
19 web search basics
- client server design
    - server communicates with client via a protocol such as http in a markup language such as html
- client - generally a browser - can ignore what it doesn't understand
 
- we need to include authoritativeness when thinking about a document's relevance
- we can view html pages as nodes and hyperlinks as directed edges
- power law: number of web pages w/ in-degree i ~ 1/(i^a)
- bowtie structure: three types of webpages IN -> SCC -> OUT
- spam - would repeat keywords to be included in searches
    - there is paid inclusion
 
- cloaking - different page is shown to crawler than to user
- doorway page - text and metadata to rank highly - then redirects
- SEO (search engine optimizers) - consulting for helping people rank highly
    - search engine marketing - how to budget different keywords
 
- some search engines started out without advertising
- advertising - per click, per view
    - competitors can click spam the ads of opponents
 
- types of queries
    - informational - general info
- navigational - specific website
- transactional - buying or downloading
 
- difficult to get size of index
- shingling - detect near-duplicate pages by comparing sets of consecutive word sequences (shingles)
2.2, 20.1, 20.2
- hard to tokenize
2.3, 2.4, 4, 5.2, 5.3
- compression and vocabulary
1.3, 1.4 boolean retrieval
- find lists for each term, then intersect or union or complement
    - lists need to be sorted by docId so we can just increment the pointers
 
- we start with shortest lists and do operations to make things faster
    - at any point we only want to look at the smallest possible list
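- a sketch of the merge-style intersection of two docId-sorted posting lists (linear in the total list length; starting from the shortest lists keeps the intermediate results small):

```python
def intersect(p1, p2):
    # both lists sorted by docId; advance the pointer behind the smaller docId
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # [2, 31]
```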
 
6.2, 6.3, 6.4 vector space model
- tf(t,d) = term frequency of term t in doc d
    - uses bag of words - order doesn’t matter, just frequency
- often replaced by wf(t,d) = 1 + log(tf(t,d)) if tf(t,d) > 0, else 0, because many more occurrences of a term don't make the doc proportionally more relevant
- also could normalize ntf(t,d) = a + (1-a)*tf(t,d)/tf_max(d) where a is a smoothing term
 
- idf(t) = inverse document frequency of term t
    - collection frequency = total number of occurrences of a term in the collection.
- document frequency df(t) = #docs in that contain term t.
- idf(t) = log(N/df(t)) where N = #docs
 
- combination weighting scheme: tf_idf(t,d) = tf(t,d)*idf(t) - (tf is actually log)
- document score = sum over terms tf_idf(t,d)
- cosine similarity = doc1·doc2 / ($\vert doc1\vert * \vert doc2\vert $) (the numerator is the dot product)
    - we want the highest possible similarity
- euclidean distance penalizes long documents too much
 
- similarity = cosine similarity of (query,doc)
- only return top k results
- pivoted normalized document length? - generally penalizes long documents, but avoids over-penalizing them