CS 388: Natural Language Processing: N-Gram Language Models
Raymond J. Mooney, University of Texas at Austin

Language Models
Formal grammars (e.g. regular, context-free) give a hard "binary" model of the legal sentences in a language.
For NLP, a probabilistic model of a language that gives the probability that a string is a member of the language is more useful.
To specify a correct probability distribution, the probabilities of all sentences in the language must sum to 1.

Uses of Language Models
- Speech recognition: "I ate a cherry" is a more likely sentence than "Eye eight uh Jerry".
- OCR and handwriting recognition: more probable sentences are more likely correct readings.
- Machine translation: more likely sentences are probably better translations.
- Generation: more likely sentences are probably better NL generations.
- Context-sensitive spelling correction: "Their are problems wit this sentence."

Completion Prediction
A language model also supports predicting the completion of a sentence:
- "Please turn off your cell _"
- "Your program does not _"
Predictive text input systems can guess what you are typing and give choices on how to complete it.
N-Gram Models
Estimate the probability of each word given the prior context, e.g. P(phone | Please turn off your cell).
The number of parameters required grows exponentially with the number of words of prior context.
An N-gram model uses only N-1 words of prior context:
- Unigram: P(phone)
- Bigram: P(phone | cell)
- Trigram: P(phone | your cell)
The Markov assumption is the presumption that the future behavior of a dynamical system only depends on its recent history. In particular, in a kth-order Markov model, the next state only depends on the k most recent states, so an N-gram model is an (N-1)-order Markov model.

N-Gram Model Formulas
Word sequences: w_1^n = w_1 ... w_n
Chain rule of probability: P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1}) = Π_{k=1..n} P(w_k | w_1^{k-1})
Bigram approximation: P(w_1^n) ≈ Π_{k=1..n} P(w_k | w_{k-1})
N-gram approximation: P(w_1^n) ≈ Π_{k=1..n} P(w_k | w_{k-N+1}^{k-1})
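To make the N-gram approximation concrete, here is a minimal Python sketch (not from the slides) that scores a word sequence using only the previous N-1 words as context. The `cond_prob` callback and the constant-probability toy model in the usage line are invented purely for illustration; the start/end padding follows the convention introduced on the next slide.

```python
# Sketch of the N-gram approximation: P(w_1..w_n) ~= prod_k P(w_k | previous N-1 words).
# cond_prob(word, context) is a placeholder for whatever estimator is used
# (see the estimation slide below).

def ngram_sentence_prob(words, cond_prob, N):
    """Score a word sequence using only the previous N-1 words as context."""
    padded = ["<s>"] * (N - 1) + list(words) + ["</s>"]
    prob = 1.0
    for k in range(N - 1, len(padded)):
        context = tuple(padded[k - N + 1:k])  # the N-1 most recent words
        prob *= cond_prob(padded[k], context)
    return prob

# Toy usage: a fake model that assigns probability 0.2 to every word (illustrative only).
print(ngram_sentence_prob(["i", "want", "food"], lambda w, ctx: 0.2, N=2))
```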
Estimating Probabilities
N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
Bigram: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
N-gram: P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
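A minimal sketch of this relative-frequency estimate, assuming a tiny invented corpus; the table names `bigram_counts` and `context_counts` are hypothetical, and the start/end padding follows the slide.

```python
# Sketch of relative-frequency (MLE) bigram estimation on an invented toy corpus.
from collections import Counter

corpus = [["i", "want", "chinese", "food"],
          ["i", "want", "english", "food"]]

bigram_counts, context_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]          # add start/end symbols
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p_bigram(cur, prev):
    """P(cur | prev) = C(prev cur) / C(prev)."""
    return bigram_counts[(prev, cur)] / context_counts[prev]

print(p_bigram("want", "i"))        # 1.0 in this tiny corpus
print(p_bigram("chinese", "want"))  # 0.5
```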
Generative Model & MLE
An N-gram model can be seen as a probabilistic automaton for generating sentences:
- Initialize the sentence with N-1 <s> symbols.
- Until </s> is generated, stochastically pick the next word based on the conditional probability of each word given the previous N-1 words.
Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T.
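The generator described above can be sketched as follows; it assumes the hypothetical `bigram_counts`/`context_counts` tables from the previous sketch and simply samples the next word from the estimated conditional distribution until `</s>` appears.

```python
# Sketch of using the bigram tables above as a generator: start from <s> and
# stochastically pick each next word until </s> is produced.
import random

def generate(max_len=20):
    sentence, prev = [], "<s>"
    for _ in range(max_len):
        # Conditional distribution over next words given the previous word.
        words, weights = zip(*[(w, c / context_counts[prev])
                               for (p, w), c in bigram_counts.items() if p == prev])
        prev = random.choices(words, weights=weights)[0]
        if prev == "</s>":
            break
        sentence.append(prev)
    return " ".join(sentence)

print(generate())   # e.g. "i want chinese food"
```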
Example from Textbook
P(<s> i want english food </s>)
  = P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
  = .25 x .33 x .0011 x .5 x .68 = .000031
P(<s> i want chinese food </s>)
  = P(i | <s>) P(want | i) P(chinese | want) P(food | chinese) P(</s> | food)
  = .25 x .33 x .0065 x .52 x .68 = .00019
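For completeness, the arithmetic of the textbook example can be reproduced directly:

```python
# Reproducing the textbook arithmetic for the two example sentences.
p_english = 0.25 * 0.33 * 0.0011 * 0.5 * 0.68    # ~= 0.000031
p_chinese = 0.25 * 0.33 * 0.0065 * 0.52 * 0.68   # ~= 0.00019
print(p_english, p_chinese)
```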
Train and Test Corpora
A language model must be trained on a large corpus of text to estimate good parameter values.
The model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate).
Ideally, the training (and test) corpus should be representative of the actual application data.
It may be necessary to adapt a general model to a small amount of new (in-domain) data by adding a highly weighted small corpus to the original training data.

Unknown Words
How to handle words in the test corpus that did not occur in the training data, i.e. out-of-vocabulary (OOV) words?
Train a model that includes an explicit symbol for an unknown word (<UNK>), either by:
- choosing a vocabulary in advance and replacing other words in the training corpus with <UNK>, or
- replacing the first occurrence of each word in the training data with <UNK>.
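As a rough illustration of the second OOV strategy (replacing the first occurrence of each word type with <UNK>), here is one possible sketch; the toy training corpus is invented.

```python
# Sketch of one OOV strategy from the slide: replace the first occurrence of
# each word type in the training data with <UNK>, so the model learns an
# explicit unknown-word probability.

def mark_first_occurrences_unk(sentences):
    seen, result = set(), []
    for sent in sentences:
        new_sent = []
        for w in sent:
            if w in seen:
                new_sent.append(w)
            else:
                seen.add(w)                 # first time we see this word type
                new_sent.append("<UNK>")
        result.append(new_sent)
    return result

train = [["i", "want", "food"], ["i", "want", "chinese", "food"]]
print(mark_first_occurrences_unk(train))
# [['<UNK>', '<UNK>', '<UNK>'], ['i', 'want', '<UNK>', 'food']]
```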
Evaluation of Language Models
Ideally, evaluate use of the model in an end application (extrinsic, in vivo): realistic, but expensive.
Alternatively, evaluate the ability to model a test corpus (intrinsic): less realistic, but cheaper.
Verify at least once that intrinsic evaluation correlates with an extrinsic one.

Perplexity
A measure of how well a model "fits" the test data.
It uses the probability that the model assigns to the test corpus, normalizes for the number of words in the test corpus, and takes the inverse:
PP(W) = P(w_1 w_2 ... w_N)^(-1/N)
Perplexity measures the weighted average branching factor in predicting the next word (lower is better).
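The perplexity definition above can be sketched in code; computing in log space avoids numerical underflow on long corpora. The `model_logprob` callback is a placeholder for any language model's per-word log probability.

```python
# Sketch of perplexity as defined above: PP(W) = P(w_1..w_N)^(-1/N),
# computed in log space to avoid underflow.
import math

def perplexity(test_words, model_logprob):
    total_logprob = sum(model_logprob(w, test_words[:i])
                        for i, w in enumerate(test_words))
    return math.exp(-total_logprob / len(test_words))

# Sanity check: a uniform model over a 20,000-word vocabulary has perplexity 20,000.
print(perplexity(["some", "test", "words"], lambda w, hist: math.log(1 / 20000)))
```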
Sample Perplexity Evaluation
Models were trained on 38 million words from the Wall Street Journal (WSJ) using a 19,979-word vocabulary and evaluated on a disjoint set of 1.5 million WSJ words.

    Model      Perplexity
    Unigram    962
    Bigram     170
    Trigram    109

Smoothing
Since there is a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so MLE incorrectly assigns zero to many parameters (a.k.a. sparse data).
If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).
In practice, parameters are smoothed (a.k.a. regularized) to reassign some probability mass to unseen events.
Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.
Laplace (Add-One) Smoothing
"Hallucinate" additional training data in which each possible N-gram occurs exactly once, and adjust the estimates accordingly:
Bigram: P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
N-gram: P(w_n | w_{n-N+1}^{n-1}) = (C(w_{n-N+1}^{n-1} w_n) + 1) / (C(w_{n-N+1}^{n-1}) + V)
where V is the total number of possible (N-1)-grams (i.e. the vocabulary size for a bigram model).
This tends to reassign too much mass to unseen events, so it can be adjusted to add 0 < δ < 1 instead of 1 (normalized by δV instead of V).
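A possible sketch of add-one smoothed bigram estimates, reusing the hypothetical count tables from the estimation sketch earlier; note how a previously unseen bigram now gets a small nonzero probability.

```python
# Sketch of add-one (Laplace) smoothed bigram estimates, reusing the
# hypothetical bigram_counts / context_counts tables from the earlier sketch.
vocab = {w for (_, w) in bigram_counts} | set(context_counts)
V = len(vocab)   # vocabulary size, including the <s>/</s> symbols

def p_laplace(cur, prev):
    """P(cur | prev) = (C(prev cur) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + V)

print(p_laplace("chinese", "want"))  # seen bigram, slightly discounted
print(p_laplace("chinese", "food"))  # unseen bigram, now nonzero
```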
Advanced Smoothing
Many advanced techniques have been developed to improve smoothing for language models:
- Good-Turing
- Interpolation
- Backoff
- Kneser-Ney
- Class-based (cluster) N-grams

Model Combination
As N increases, the power (expressiveness) of an N-gram model increases, but the ability to estimate accurate parameters from sparse data decreases (i.e. the smoothing problem gets worse).
A general approach is to combine the results of multiple N-gram models of increasing complexity (i.e. increasing N).
Interpolation
Linearly combine estimates of N-gram models of increasing order.
Interpolated trigram model:
P_hat(w_n | w_{n-2}, w_{n-1}) = λ1 P(w_n | w_{n-2}, w_{n-1}) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n), where λ1 + λ2 + λ3 = 1.
Learn proper values for the λi by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.
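A minimal sketch of the interpolated trigram estimate; the component estimators and the lambda weights here are placeholders (in practice the λ's would be tuned on a held-out development corpus, as the slide notes).

```python
# Sketch of the interpolated trigram estimate. The component estimators and
# the lambda weights are placeholders chosen only for illustration.

def interpolated_trigram_prob(w, w1, w2, p_uni, p_bi, p_tri,
                              lambdas=(0.5, 0.3, 0.2)):
    """P_hat(w | w1, w2) = l1*P(w | w1, w2) + l2*P(w | w2) + l3*P(w)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9    # weights must sum to 1
    return l1 * p_tri(w, w1, w2) + l2 * p_bi(w, w2) + l3 * p_uni(w)

# Toy usage with constant placeholder estimators (illustrative only):
print(interpolated_trigram_prob("food", "i", "want",
                                p_uni=lambda w: 0.01,
                                p_bi=lambda w, c: 0.1,
                                p_tri=lambda w, c1, c2: 0.5))
```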
Backoff
Only use a lower-order model when data for the higher-order model is unavailable (i.e. its count is zero).
Recursively back off to weaker models until data is available:
P_katz(w_n | w_{n-N+1}^{n-1}) = P*(w_n | w_{n-N+1}^{n-1}) if C(w_{n-N+1}^n) > 0, otherwise α(w_{n-N+1}^{n-1}) P_katz(w_n | w_{n-N+2}^{n-1})
where P* is a discounted probability estimate that reserves mass for unseen events and the α's are back-off weights (see the text for details).

A Problem for N-Grams: Long Distance Dependencies
Many times local context does not provide the most useful predictive clues, which instead are provided by long-distance dependencies, e.g. syntactic dependencies:
- "The man next to the large oak tree near the grocery store on the corner is tall."
- "The men next to the large oak tree near the grocery store on the corner are tall."