Landmark-Based Speech Recognition
The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations

What are Landmarks?
- Time-frequency regions of high mutual information between phone and signal (maxima of I(phone label; acoustics(t,f)))
- Acoustic events with similar importance in all languages, and across all speaking styles
- Acoustic events that can be detected even in extremely noisy environments
- Syllable onset = consonant release; syllable nucleus = vowel center; syllable coda = consonant closure

Where do these things happen?
[Figure: I(phone; acoustics) experiment; Hasegawa-Johnson, 2000]

Landmark-Based Speech Recognition
[Figure: word lattice with hypothesis "backed up", annotated with syllable structure (ONSET, NUCLEUS, CODA) and pronunciation variants "backt up", "back up", "backt ihp", "wackt ihp"; the lattice supplies words, times, and syllable-structure scores]
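Since landmarks are defined above as maxima of the mutual information I(phone label; acoustics(t,f)), here is a minimal sketch of how such an MI map can be estimated from phonetically labeled frames. The array layout, bin count, and per-band quantization are illustrative assumptions, not the procedure of the 2000 experiment:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_map(phone_labels, spectrogram, n_bins=16):
    """Estimate I(phone; acoustics(t, f)) for each frequency band.

    phone_labels: length-T array of phone symbols, one per frame.
    spectrogram:  (T, F) array of real-valued acoustic measurements.
    Returns a length-F array of MI estimates (nats); landmarks are
    expected near time-frequency regions where these values peak.
    """
    T, F = spectrogram.shape
    mi = np.zeros(F)
    for f in range(F):
        # Quantize the continuous acoustic value so that MI can be
        # estimated from a discrete joint histogram.
        edges = np.quantile(spectrogram[:, f], np.linspace(0, 1, n_bins + 1))
        quantized = np.digitize(spectrogram[:, f], edges[1:-1])
        mi[f] = mutual_info_score(phone_labels, quantized)
    return mi
```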
Talk Outline
- Overview
- Acoustic Modeling
  - Speech data and acoustic features
  - Landmark detection
  - Estimation of real-valued "distinctive features" using support vector machines (SVMs)
- Pronunciation Modeling
  - A dynamic Bayesian network (DBN) implementation of Articulatory Phonology
  - A discriminative pronunciation model implemented using maximum entropy (MaxEnt)
- Technological Evaluation
  - Rescoring of word-lattice output from an HMM-based recognizer
  - Errors that we fixed: channel noise, laughter, etcetera
  - New errors that we caused: pronunciation models trained on 3 hours can't compete with triphone models trained on 3000 hours
- Future Plans

Overview
- History: the research described in this talk was performed between June 30 and August 17, 2004, at the Johns Hopkins summer workshop WS04.
- Scientific goal: to use high-dimensional machine learning technologies (SVM, DBN) to create representations capable of learning, from data, the types of speech knowledge that humans exhibit in psychophysical speech perception experiments.
- Technological goal:
  - Long-term: to create a better speech recognizer
  - Short-term: lattice rescoring, applied to word lattices produced by SRI's NN/HMM hybrid

Overview of Systems to be Described
[System diagram:]
- Acoustic model (SVMs): MFCCs (5 ms & 1 ms frame period), formants, and phonetic & auditory model parameters are concatenated over 4-15 frames; the SVMs produce p(landmark | SVM)
- Pronunciation model (DBN or MaxEnt): p(SVM | word)
- First-pass ASR word lattice: word labels, start & end times, p(MFCC,PLP | word), p(word | words)
- Rescoring: log-linear score combination
I. Acoustic Modeling
- Goal: learn precise and generalizable models of the acoustic boundary associated with each distinctive feature.
- Methods:
  - Large input vector space (many acoustic feature types)
  - Regularized binary classifiers (SVMs)
  - SVM outputs "smoothed" using dynamic programming
  - SVM outputs converted to posterior probability estimates, once per 5 ms, using a histogram

Speech Databases

Database | Size | Phonetic Transcr. | Word Lattices
NTIMIT | 14 hrs | manual | -
WS96 & WS97 | 3.5 hrs | manual | -
SWB1 (WS04 subset) | 12 hrs | auto | SRI, BBN
Eval01 | 10 hrs | - | BBN & SRI
RT03 Dev | 6 hrs | - | SRI
RT03 Eval | 6 hrs | - | SRI

Acoustic and Auditory Features
- MFCCs, 25 ms window (standard ASR features)
- Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond
- Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
- Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
- Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
What are Distinctive Features? What are Landmarks?
- Distinctive feature = a binary partition of the phonemes (Jakobson, 1952) that compactly describes pronunciation variability (Halle) and correlates with distinct acoustic cues (Stevens)
- Landmark = change in the value of a manner feature, e.g. +sonorant to -sonorant, or -sonorant to +sonorant
- 5 manner features: sonorant, consonantal, continuant, syllabic, silence
- Place and voicing features (SVMs are only trained at landmarks):
  - Primary articulator: lips, tongue blade, or tongue body
  - Features of primary articulator: anterior, strident
  - Features of secondary articulator: nasal, voiced

Landmark Detection using Support Vector Machines (SVMs)
[Figure: false acceptance vs. false rejection errors, TIMIT, per 10 ms frame; an SVM stop-release detector has half the error of an HMM (Niyogi & Burges, 1999, 2002):]
1. Delta-energy ("Deriv"): equal error rate = 0.2%
2. HMM: false rejection error = 0.3%
3. Linear SVM: equal error rate = 0.15%
4. Radial basis function SVM: equal error rate = 0.13%

Dynamic Programming Smooths SVMs
- Maximize Π_i p(features(t_i) | X(t_i)) · p(t_{i+1} - t_i | features(t_i))
- Soft-decision "smoothing" mode: p(acoustics | landmarks) is computed and fed to the pronunciation model

SVM Extracts a Discriminant Dimension
- Cues for place of articulation: MFCCs + formants + rate-scale, within 150 ms of the landmark
- Kernel: transform to an infinite-dimensional Hilbert space
- SVM discriminant dimension = argmin(error(margin) + 1/width(margin))

Soft-Decision Landmark Probabilities
- Niyogi & Burges, 2002: p(class | acoustics) from a sigmoid model in the discriminant dimension, OR
- Juneja & Espy-Wilson, 2003: p(class | acoustics) from a histogram in the discriminant dimension
- Soft decisions once per 5 ms:
  - p(manner feature d_i(t) | Y(t))
  - p(place feature d_i(t) | Y(t), t is a landmark)
- Pipeline: 2000-dimensional acoustic feature vector -> SVM -> discriminant y_i(t) -> histogram -> posterior probability of distinctive feature p(d_i(t)=1 | y_i(t))
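As a concrete rendering of the pipeline above (acoustic vector -> SVM -> discriminant y_i(t) -> histogram -> posterior), here is a minimal sketch of the histogram method of Juneja & Espy-Wilson. The shapes, bin count, and Laplace smoothing are illustrative assumptions rather than the workshop's actual settings:

```python
import numpy as np
from sklearn.svm import SVC

def train_soft_landmark_detector(X_train, y_train, X_dev, y_dev, n_bins=20):
    """Train an RBF-SVM landmark classifier, then map its discriminant
    output to a posterior p(d=1 | y) with a histogram on held-out data.

    X_*: (N, D) acoustic feature vectors; y_*: (N,) binary labels.
    Returns (svm, bin_edges, posterior), where posterior[b] estimates
    p(d=1 | discriminant falls in bin b).
    """
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    scores = svm.decision_function(X_dev)              # discriminant y_i(t)
    bin_edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bins = np.clip(np.digitize(scores, bin_edges) - 1, 0, n_bins - 1)
    posterior = np.array([
        # Laplace-smoothed fraction of positive examples per bin
        (np.sum(y_dev[bins == b]) + 1.0) / (np.sum(bins == b) + 2.0)
        for b in range(n_bins)
    ])
    return svm, bin_edges, posterior

def p_feature(svm, bin_edges, posterior, x):
    """Posterior p(d(t)=1 | y(t)) for one acoustic frame x."""
    s = svm.decision_function(x.reshape(1, -1))[0]
    b = np.clip(np.digitize(s, bin_edges) - 1, 0, len(posterior) - 1)
    return posterior[b]
```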
II. Pronunciation Modeling
- Goal: represent a large number of pronunciation variants, in a controlled fashion, using distinctive features; pick out the distinctive features that are most important for each word recognition task.
- Methods:
  - Distinctive-feature-based lexicon + dynamic programming alignment
  - Dynamic Bayesian network model of Articulatory Phonology (an articulation-based pronunciation variability model)
  - MaxEnt search for lexically discriminative features (a perceptually based "pronunciation model")

1. Distinctive-Feature Based Lexicon
- Merger of the English Switchboard and Callhome dictionaries
- Converted to landmarks using Hasegawa-Johnson's perl transcription tools (landmarks in blue, place and voicing features in green on the original slide):

AGO (0.441765): +syllabic +reduced +back AX | -continuant -sonorant +velar +voiced G closure | -continuant -sonorant +velar +voiced G release | +syllabic -low -high +back +round +tense OW
AGO (0.294118): +syllabic +reduced -back IX | -continuant -sonorant +velar +voiced G closure | -continuant -sonorant +velar +voiced G release | +syllabic -low -high +back +round +tense OW

Dynamic Programming Lexical Search
- Choose the word that maximizes Π_i p(features(t_i) | X(t_i)) · p(t_{i+1} - t_i | features(t_i)) · p(features(t_i) | word), as sketched below.
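A minimal sketch of this dynamic-programming search, under the assumption that the three probability terms arrive as precomputed log-probability tables (the table formats are illustrative):

```python
import numpy as np

def word_score(obs_logp, dur_logp, lex_logp):
    """Viterbi alignment score of one word's landmark sequence.

    obs_logp: (K, T) log p(features_k | X(t)), SVM posteriors per frame
    dur_logp: (K, T) log p(duration = d | features_k), d in frames
    lex_logp: (K,)   log p(features_k | word), from the lexicon
    Returns the max over landmark times t_1 < ... < t_K of the summed
    log score.
    """
    K, T = obs_logp.shape
    best = np.full((K, T), -np.inf)
    best[0, :] = obs_logp[0, :] + lex_logp[0]
    for k in range(1, K):
        for t in range(k, T):
            # Transition from landmark k-1 at any earlier time s,
            # paying the duration penalty for the gap t - s.
            cand = [best[k - 1, s] + dur_logp[k - 1, t - s] for s in range(t)]
            best[k, t] = max(cand) + obs_logp[k, t] + lex_logp[k]
    return best[K - 1].max()

# Rescoring picks the word with the highest alignment score:
# best_word = max(lexicon, key=lambda w: word_score(*tables_for(w)))
```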
2. Articulatory Phonology
- Phone-level transcriptions make common variants look arbitrary:
  - warmth -> w ao r m p th (phone insertion?)
  - I don't know -> ah dx uh_n ow_n (phone deletion?)
  - several -> s eh r v ax l (exchange of two phones?)
- Many pronunciation phenomena can be parsimoniously described as resulting from asynchrony and reduction of sub-phonetic features:
  - instruments -> ih_n s ch em ih_n n s
  - everybody -> eh r uw ay
- One set of features, based on Articulatory Phonology (Browman & Goldstein, 1990), with tract variables LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING

Dynamic Bayesian Network Model (Livescu and Glass, 2004)
- The model is implemented as a dynamic Bayesian network (DBN): a representation, via a directed graph, of a distribution over a set of variables that evolve through time.
- Example DBN with three features: the asynchrony between two feature streams is itself a random variable, Pr(async_{1;2} = a) = Pr(|ind_1 - ind_2| = a), with a conditional probability table over asynchrony degrees 0-3 (rows of values such as .7 / .2 / .1 / 0 on the original slide); the per-feature index distributions are given by baseform pronunciations.
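To make the reconstructed asynchrony formula concrete, here is a small sketch that tabulates Pr(async_{1;2} = a) = Pr(|ind_1 - ind_2| = a) from two feature streams' frame-by-frame baseform indices; the index trajectories are made up for illustration:

```python
import numpy as np

def async_distribution(ind1, ind2, max_async=3):
    """Empirical Pr(async = a) = Pr(|ind1 - ind2| = a) for two feature
    streams' frame-by-frame indices into the baseform pronunciation."""
    a = np.abs(np.asarray(ind1) - np.asarray(ind2))
    counts = np.bincount(np.clip(a, 0, max_async), minlength=max_async + 1)
    return counts / counts.sum()

# Example: lips run slightly ahead of the tongue for a few frames.
ind_lips   = [0, 1, 1, 2, 2, 3]
ind_tongue = [0, 0, 1, 1, 2, 3]
print(async_distribution(ind_lips, ind_tongue))  # [0.6667 0.3333 0. 0.]
```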
The DBN-SVM Hybrid Developed at WS04
[Figure: for the word LIKE, the DBN links word -> canonical form -> surface form (manner and place states such as tongue front / palatal, tongue closed / semi-closed / open, glide, front vowel) to SVM outputs computed from a multi-frame observation x including spectrum, formants, & auditory model, e.g. p(g_PGR(x) | palatal glide release) and p(g_GR(x) | glide release)]

3. Discriminative Pronunciation Model
- Rationale: the baseline HMM-based system already provides high-quality hypotheses:
  - 1-best error rate from N-best lists: 24.4% (RT-03 dev set)
  - oracle error rate: 16.2%
- Method: use landmark detection only where necessary, to correct errors made by the baseline recognition system.
- Example (fsh_60386_1_0105420_0108380):
  Ref: that cannot be that hard to sneak onto an airplane
  Hyp: they can be a that hard to speak on an airplane
Identifying Confusable Hypotheses
- Use existing alignment algorithms for converting lattices into confusion networks (Mangu, Brill & Stolcke, 2000); hypotheses are ranked by posterior probability
- Generated from n-best lists without 4-gram or pronunciation model scores (hence higher WER compared to lattices)
- Multi-words ("I_dont_know") were split prior to generating confusion networks
[Figure: confusion network for the example above, with slots such as that|they, can|cant, be, *DEL*|a, that, hard, to, sneak|speak, onto|on, an, airplane]

Identifying Confusable Hypotheses (cont.)
- How much can be gained from fixing confusions?
- Baseline error rate: 25.8%
- Oracle error rates when selecting the correct word from the confusion set:

# hypotheses to select from | Including homophones | Not including homophones
2 | 23.9% | 23.9%
3 | 23.0% | 23.0%
4 | 22.4% | 22.5%
5 | 22.0% | 22.1%
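The oracle numbers in this table can be reproduced in principle by always picking the reference word whenever it appears among the top-N hypotheses of a confusion slot. A minimal sketch, where the slot format is an illustrative assumption:

```python
def oracle_errors(slots, n_best):
    """Count residual errors if an oracle picks the reference word
    whenever it is among the top n_best hypotheses of a slot.

    slots: list of (reference_word, ranked_hypotheses) pairs, where
           ranked_hypotheses are ordered by posterior probability.
    """
    errors = 0
    for ref, hyps in slots:
        if ref not in hyps[:n_best]:
            errors += 1   # the oracle is still forced to pick a wrong word
    return errors

slots = [("that", ["they", "that"]), ("sneak", ["speak", "sneak", "seek"])]
print(oracle_errors(slots, n_best=1))  # 2: the baseline 1-best is wrong twice
print(oracle_errors(slots, n_best=2))  # 0: the oracle fixes both confusions
```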
Selecting Relevant Landmarks
- Not all landmarks are equally relevant for distinguishing between competing word hypotheses (e.g. vowel features are irrelevant for sneak vs. speak)
- Using all available landmarks might deteriorate performance when irrelevant landmarks have weak scores (but: redundancy might be useful)
- Automatic selection algorithm:
  - should optimally distinguish the set of confusable words (discriminative)
  - should rank landmark features according to their relevance for distinguishing words (i.e. the output should be interpretable in phonetic terms)
  - should be extendable to features beyond landmarks
Maximum-Entropy Landmark Selection
- Convert each word in the confusion set into a fixed-length landmark-based representation, using an idea from information retrieval:
  - a vector space consisting of binary relations between two landmarks
  - manner landmarks: precedence, e.g. V < Son. Cons.
  - manner & place features: overlap, e.g. V o +high
  - preserves basic temporal information
- Words are represented as frequency entries in the feature vector
- Not all possible relations are used (phonotactic constraints; place features are detected dependent on manner landmarks)
- Dimensionality of the feature space: 40-60
- Word entries are derived from the phone representation plus pronunciation rules

Vector-Space Word Representation

Word | Start<Fric | Fric<Stop | Fric<Son | Fric<Vowel | Stop<Vowel | Vowel o high | Vowel o front | Fric o strident
speak | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1
sneak | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1
seek | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1
he | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0
she | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1
steak | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1
...
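A sketch of how rows of this table can be derived from a word's landmark sequence; consistent with the speak/sneak/seek rows above, precedence is taken between consecutive manner landmarks, and the data structures are illustrative assumptions:

```python
def relation_vector(landmarks, dims):
    """Map a pronunciation to counts of binary landmark relations.

    landmarks: ordered list of (manner_landmark, set_of_place_features).
    dims:      ordered list of relation names defining the vector space;
               relations excluded by phonotactic constraints are simply
               absent from dims and get ignored.
    """
    counts = {d: 0 for d in dims}
    def bump(rel):
        if rel in counts:
            counts[rel] += 1
    if landmarks:
        bump("Start<" + landmarks[0][0])
    for (m1, _), (m2, _) in zip(landmarks, landmarks[1:]):
        bump(m1 + "<" + m2)            # precedence of adjacent manner landmarks
    for manner, places in landmarks:
        for p in places:
            bump(manner + " o " + p)   # overlap with a place feature
    return [counts[d] for d in dims]

dims = ["Start<Fric", "Fric<Stop", "Fric<Son", "Fric<Vowel",
        "Stop<Vowel", "Vowel o +high", "Vowel o +front", "Fric o +strident"]
speak = [("Fric", {"+strident"}), ("Stop", set()),
         ("Vowel", {"+high", "+front"}), ("Stop", set())]
print(relation_vector(speak, dims))    # [1, 1, 0, 0, 1, 1, 1, 1], as in the table
```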
Maximum-Entropy Discrimination
- Use a maxent classifier: p(y|x) = (1/Z(x)) exp(Σ_j λ_j f_j(x, y)), where y = words, x = acoustics, f = landmark relationships
- Why a maxent classifier?
  - discriminative classifier
  - handles a possibly large set of confusable words
  - allows later addition of non-binary features
- Training: ideally on real landmark detection output; here, on entries from the lexicon (includes pronunciation variants)

Maximum-Entropy Discrimination: Example
- Example: sneak vs. speak
- A different model is trained for each confusion set, so landmarks can have different weights in different contexts:

Relation | sneak | speak
SC o +blade | 2.47 | -2.47
FR < SC | 2.47 | -2.47
FR < SIL | -2.11 | 2.11
SIL < ST | -1.75 | 1.75
...
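The training setup described above can be sketched with logistic regression, which is equivalent to a conditional maxent model; sklearn here is an illustrative stand-in for whatever maxent toolkit was actually used, and the training rows are lexicon-derived relation vectors from the earlier table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One relation vector per lexicon entry for a confusion set
# (rows: pronunciation variants; labels: word identities).
X = np.array([
    [1, 1, 0, 0, 1, 1, 1, 1],   # speak
    [1, 0, 1, 0, 0, 1, 1, 1],   # sneak
])
y = np.array(["speak", "sneak"])

# A separate maxent model is trained for each confusion set, so the
# same landmark relation can carry different weights in different contexts.
clf = LogisticRegression(C=1.0).fit(X, y)

# Positive weights favor one word and negative the other, mirroring the
# opposite-signed weights for "SC o +blade" on the slide.
names = ["Start<Fric", "Fric<Stop", "Fric<Son", "Fric<Vowel",
         "Stop<Vowel", "Vowel o +high", "Vowel o +front", "Fric o +strident"]
for name, w in zip(names, clf.coef_[0]):
    print(f"{name:18s} {w:+.2f}")
```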
Landmark Queries
- Select the N landmarks with the highest weights
- Ask the landmark detection module to produce scores for the selected landmarks, within word boundaries given by the baseline system
- Example (confusion networks <-> landmark detectors):
  query:  sneak 1.70 1.99 SC o +blade ?
  answer: sneak 1.70 1.99 SC o +blade 0.75 0.56

III. Evaluation

Acoustic Feature Selection
1. Accuracy per frame (%), stop releases only, NTIMIT (table below)
2. Word error rate, lattice rescoring, RT03-devel, one talker (WARNING: this talker is atypical):
   - Baseline: 15.0% (113/755)
   - Rescoring, place based on MFCCs + formant-based params: 14.6% (110/755)
   - Rescoring, place based on rate-scale + formant-based params: 14.3% (108/755)
Feature | MFCCs+Shape (Linear) | MFCCs+Shape (RBF) | MFCCs+Formants (Linear) | MFCCs+Formants (RBF)
+/- lips | 78.3 | 90.7 | 92.7 | 95.0
+/- blade | 73.4 | 87.1 | 79.6 | 85.1
+/- body | 73.0 | 85.2 | 85.7 | 87.2

SVM Training: Mixed vs. Targeted Data (accuracy per frame, %; Linear / RBF kernels)

Landmark | Train NTIMIT, Test NTIMIT | Train NTIMIT&SWB, Test NTIMIT&SWB | Train NTIMIT, Test Switchboard | Train Switchboard, Test Switchboard
speech onset | 95.1 / 96.2 | 86.9 / 89.9 | 71.4 / 62.2 | 81.6 / 81.6
speech offset | 79.6 / 88.5 | 76.3 / 86.4 | 65.3 / 78.6 | 68.4 / 83.7
consonant onset | 94.5 / 95.5 | 91.4 / 93.5 | 70.3 / 72.7 | 95.8 / 97.7
consonant offset | 91.7 / 93.7 | 94.3 / 96.8 | 80.3 / 86.2 | 92.8 / 96.8
continuant onset | 89.4 / 94.1 | 87.3 / 95.0 | 69.1 / 81.9 | 86.2 / 92.0
continuant offset | 90.8 / 94.9 | 90.4 / 94.6 | 69.3 / 68.8 | 89.6 / 94.3
sonorant onset | 95.6 / 97.2 | 97.8 / 96.7 | 85.2 / 86.5 | 96.3 / 96.3
sonorant offset | 95.3 / 96.4 | 94.0 / 97.4 | 75.6 / 75.2 | 95.2 / 96.4
syllabic onset | 90.7 / 95.2 | 91.4 / 95.5 | 69.5 / 78.9 | 87.9 / 92.6
syllabic offset | 90.1 / 88.9 | 87.1 / 92.9 | 54.4 / 60.8 | 88.2 / 89.7

DBN-SVM: Models Nonstandard Phones
[Figure: in "I don't know", /d/ becomes a flap and /n/ becomes a creaky nasal glide]
DBN-SVM Design Decisions
- What kind of SVM outputs should be used in the DBN?
  - Method 1 (EBS/DBN): generate a landmark segmentation with EBS using the manner SVMs, then apply place SVMs at appropriate points in the segmentation
    - force the DBN to use the EBS segmentation, or
    - allow the DBN to stray from the EBS segmentation, using place/voicing SVM outputs whenever available
  - Method 2 (SVM/DBN): apply all SVMs in all frames, and allow the DBN to consider all possible segmentations
    - in a single pass, or
    - in two passes: (1) manner-based segmentation; (2) place+manner scoring
- How should we take into account the distinctive feature hierarchy?
- How do we avoid "over-counting" evidence?
- How do we train the DBN (feature transcriptions vs. SVM outputs)?

DBN-SVM Rescoring Experiments
For each lattice edge:
- SVM probabilities are computed over the edge duration and used as soft evidence in the DBN
- the DBN computes a score S ∝ P(word | evidence)
- the final edge score is a weighted interpolation of the baseline scores and the EBS/DBN or SVM/DBN score (results in the table below; a sketch of the interpolation follows the table)
Date | Experimental setup | 3-speaker WER (# errors) | RT03 dev WER
- | Baseline | 27.7 (550) | 26.8
Jul31_0 | EBS/DBN, "hierarchically-normalized" SVM output probabilities, DBN trained on a subset of ICSI transcriptions | 27.6 (549) | 26.8
Aug1_19 | + improved silence modeling | 27.6 (549) |
Aug2_19 | EBS/DBN, unnormalized SVM probs + fricative lip feature | 27.3 (543) | 26.8
Aug4_2 | + DBN trained using SVM outputs | 27.3 (543) |
Aug6_20 | + full feature hierarchy in DBN | 27.4 (545) |
Aug7_3 | + reduction probabilities depend on word frequency | 27.4 (544) |
Aug8_19 | + retrained SVMs + nasal classifier + DBN bug fixes | 27.4 (544) |
Aug11_19 | SVM/DBN, 1 pass | Miserable failure! |
Aug14_0 | SVM/DBN, 2 pass | 27.3 (542) |
Aug14_20 | SVM/DBN, 2 pass, using only high-accuracy SVMs | 27.2 (541) |
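The edge-scoring arithmetic used in these experiments, and again for the discriminative pronunciation results below, is a weighted log-linear interpolation. A minimal sketch, where the 0.8/0.2 weights are the ones reported below and the example probabilities are made up:

```python
import math

def rescore_edge(baseline_logp, landmark_logp, w_old=0.8, w_new=0.2):
    """Log-linear interpolation of baseline and landmark-based scores.

    Equivalent to the product combination p_old**0.8 * p_new**0.2
    used for the discriminative pronunciation model results.
    """
    return w_old * baseline_logp + w_new * landmark_logp

# Example: a lattice edge where the new model disagrees with the baseline.
old, new = math.log(0.6), math.log(0.2)
print(rescore_edge(old, new))   # combined log score for the edge
```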
Discriminative Pronunciation Model

System | WER | Insertions | Deletions | Substitutions
Baseline | 25.8% | 2.6% (982) | 9.2% (3526) | 14.1% (5417)
Rescored | 25.8% | 2.6% (984) | 9.2% (3524) | 14.1% (5408)

- RT-03 dev set: 35497 words, 2930 segments, 36 speakers (Switchboard and Fisher data)
- Rescored: product combination of the old and new probability distributions, with weights 0.8 (old) and 0.2 (new)
- The correct/incorrect decision changed in about 8% of all cases
- Slightly higher number of fixed errors vs. new errors

Analysis
- When does it work? Detectors give high probabilit…