What makes language modeling hard: resolving ambiguity, as in “the chef made her duck”.

Contents

Basic Text Processing
  regex
  ELIZA
  tokenization and corpus
  Herdan’s Law
  text normalization
  tokenization + Subword Tokenization
  BPE
  Word Normalization
  lemmatization through morphological parsing
  only take stems from morphemes: Porter stemmer
  sentence segmentation
  N-Grams

Edit Distance
  DP costs O(nm), backtrace costs O(n+m).
  minimum edit distance
  weighted edit distance
  backtracing

Ngrams
  N-Grams
  Markov Assumption
  Unigrams
  Backoff and Stupid Backoff
  Interpolation
  OOV Words
  Model Evaluation
  perplexity
  open vocabulary

Text Classification
  Text Classification
  Bag of Words
  Naive Bayes
  Naive Bayes for Text Classification
  Binary Naive Bayes
  Lexicon
  Naive Bayes Language Modeling
  Harmonic Mean
  Macroaverage and Microaverage

Logistic Regression
  Generative Classifier vs Discriminative Classifier
  Logistic Regression Text Classification
  decision boundary
  cross entropy loss
  stochastic gradient descent

Information Retrieval
  Information Retrieval
  Term-Document Matrix
  Inverted Index + postings list
  Boolean Retrieval
  positional index

Ranked Information Retrieval
  Ranked Information Retrieval
  feast or famine problem
  free text query
  score
  Jaccard Coefficient
  log-frequency weighting
  document frequency ("idf weight")
  TF-IDF
  SMART notation
  vector-space model

Vector Semantics
  sense
  principle of contrast
  word relatedness
  semantic field
  synonymy and antonyms
  affective meaning
  vector semantics
  transposing a Term-Document Matrix
  term-term matrix
  word2vec
  skip-gram with negative sampling

POS and NER
  POS Tagging
  NER Tagging

Dialogue Systems
  Dialogue
  Chatbot
  PARRY

Recommender Systems
  Recommender System

Dora
  Dora

Neural Nets
  Neural Networks

The Web
  Web Graph
  Social Network
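
As a rough illustration of the Edit Distance note above (DP costs O(nm), backtrace costs O(n+m)), here is a minimal Python sketch. The function names `min_edit_distance` and `backtrace`, and the default cost of 1 per operation, are assumptions chosen for this example and do not come from the notes; integer costs are assumed so that the equality checks in the backtrace are exact.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Fill the DP table; this touches every cell once, so it is O(n*m)."""
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # delete
                dist[i][j - 1] + ins_cost,                       # insert
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match / substitute
            )
    return dist


def backtrace(source, target, dist, ins_cost=1, del_cost=1, sub_cost=1):
    """Walk from the bottom-right cell back to the origin.

    Each step decreases i, j, or both, so the walk takes at most n + m steps.
    """
    i, j = len(source), len(target)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = source[i - 1] == target[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else sub_cost):
                ops.append(("match" if same else "substitute",
                            source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + del_cost:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


table = min_edit_distance("kitten", "sitting")
print(table[-1][-1])  # 3
for op in backtrace("kitten", "sitting", table):
    print(op)
```

Passing different values for `ins_cost`, `del_cost`, and `sub_cost` gives a simple form of weighted edit distance; the same values should be passed to both functions so the backtrace agrees with the table.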