Term Vocabulary and Postings Lists Web Search and Mining Lecture 3: The term vocabulary and postings lists
Recap of the previous lecture
▪ Basic inverted indexes:
  ▪ Structure: Dictionary and Postings
      Brutus → 1 2 4 11 31 45 173 174
      Caesar → 2 4 5 6 16 57 132
      Calpurnia → 2 31 54 101
  ▪ Key step in construction: Sorting
▪ Boolean query processing
  ▪ Intersection by linear time “merging”
  ▪ Simple optimizations
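The linear-time "merging" intersection from the recap can be sketched as follows. This is a minimal illustration, not the course's reference code: the function name `intersect` is an assumption, and the two postings lists are the Brutus and Calpurnia docID lists from the recap figure. It assumes both lists are sorted by docID.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    # Walk both lists in step, always advancing the pointer at the smaller docID.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Postings lists taken from the recap figure (sorted docIDs).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each pointer only moves forward, so the number of comparisons is bounded by the combined length of the two lists. This is why the postings must be kept sorted, and why sorting is the key step in index construction.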
Plan for this lecture
Elaborate basic indexing
▪ Preprocessing to form the term vocabulary
  ▪ Documents
  ▪ Tokenization
  ▪ What terms do we put in the index?
▪ Postings
  ▪ Faster merges: skip lists
  ▪ Positional postings and phrase queries
Recall the basic indexing pipeline
Documents to be indexed:   Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:              Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:           friend  roman  countryman
        ↓ Indexer
Inverted index:            friend → 2 4
                           roman → 1 2
                           countryman → 13 16
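The three pipeline stages can be sketched end to end in a few lines. Everything here is a toy under stated assumptions: `tokenize`, `normalize`, `build_index` and the `TOY_LEMMAS` table are hypothetical names introduced for illustration, the regular-expression tokenizer only handles ASCII letters, and the lookup table stands in for real linguistic modules (stemming, lemmatization, case folding). The second document is made up so that the postings have more than one entry.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Toy tokenizer: split into maximal runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)


# Hypothetical lookup table standing in for real linguistic modules.
TOY_LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def normalize(token):
    """Case-fold the token, then map it to a base form if the table has one."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def build_index(docs):
    """Map each normalized term to a sorted list of the docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Document 1 is the slide's example; document 2 is invented for illustration.
docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_index(docs)
print(index)
# {'and': [2], 'countryman': [1], 'friend': [1, 2], 'roman': [1, 2]}
```

Collecting docIDs in a set and sorting at the end keeps each postings list duplicate-free and ordered, which the merge-based intersection relies on.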
Document Delineation
Parsing a document
▪ What format is it in?
  ▪ pdf/word/excel/html?
▪ What language is it in?
▪ What character set is in use?
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …
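As an illustration of what "done heuristically" can look like, the sketch below guesses a document's format from its leading bytes and its character set from a byte-order mark or a trial decode. The function names `guess_format` and `guess_charset` are assumptions made for this example, the latin-1 fallback is an arbitrary choice, and language identification is not covered. Real systems use many more signals than this.

```python
import codecs

# Signature of the OLE2 container used by legacy .doc and .xls files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def guess_format(data: bytes) -> str:
    """Guess a document format from its first bytes (heuristic only)."""
    head = data[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(OLE2_MAGIC):
        return "ole2 container (possibly legacy word/excel)"
    if head.startswith(b"PK\x03\x04"):
        # ZIP archive; .docx and .xlsx files are ZIP containers.
        return "zip container (possibly docx/xlsx)"
    lowered = head.lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    return "unknown"


def guess_charset(data: bytes) -> str:
    """Guess a character set: UTF-8 BOM first, then a trial UTF-8 decode."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: fall back to latin-1, which can decode any byte sequence.
        return "latin-1"


print(guess_format(b"%PDF-1.7\n..."))            # pdf
print(guess_format(b"<!DOCTYPE html><html>"))    # html
print(guess_charset("café".encode("utf-8")))     # utf-8
print(guess_charset("café".encode("latin-1")))   # latin-1
```

Rules like these are cheap and cover the common cases, but they fail on ambiguous input, which is why the slide frames each question as a classification problem.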