Landmark-Based Speech Recognition
The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations

What are Landmarks?
- Time-frequency regions of high mutual information between phone and signal (maxima of I(phone label; acoustics(t,f)))
- Acoustic events with similar importance in all languages, and across all speaking styles
- Acoustic events that can be detected even in extremely noisy environments
- Syllable onset = consonant release; syllable nucleus = vowel center; syllable coda = consonant closure

Where do these things happen?
[Figure: I(phone; acoustics) experiment; Hasegawa-Johnson, 2000]

Landmark-Based Speech Recognition
[Figure: word lattice with hypothesis "backed up", annotated with syllable structure (ONSET, NUCLEUS, CODA) and pronunciation variants "backt up", "back up", "backt ihp", "wackt ihp"; the lattice supplies words, times, and syllable-structure scores]
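Since landmarks are defined above as maxima of the mutual information I(phone label; acoustics(t,f)), here is a minimal sketch of how such an MI map can be estimated from phonetically labeled frames. The array layout, bin count, and per-band quantization are illustrative assumptions, not the procedure of the 2000 experiment:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_map(phone_labels, spectrogram, n_bins=16):
    """Estimate I(phone; acoustics(t, f)) for each frequency band.

    phone_labels: length-T array of phone symbols, one per frame.
    spectrogram:  (T, F) array of real-valued acoustic measurements.
    Returns a length-F array of MI estimates (nats); landmarks are
    expected near time-frequency regions where these values peak.
    """
    T, F = spectrogram.shape
    mi = np.zeros(F)
    for f in range(F):
        # Quantize the continuous acoustic value so that MI can be
        # estimated from a discrete joint histogram.
        edges = np.quantile(spectrogram[:, f], np.linspace(0, 1, n_bins + 1))
        quantized = np.digitize(spectrogram[:, f], edges[1:-1])
        mi[f] = mutual_info_score(phone_labels, quantized)
    return mi
```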
Talk Outline
- Overview
- Acoustic Modeling
  - Speech data and acoustic features
  - Landmark detection
  - Estimation of real-valued "distinctive features" using support vector machines (SVMs)
- Pronunciation Modeling
  - A dynamic Bayesian network (DBN) implementation of Articulatory Phonology
  - A discriminative pronunciation model implemented using maximum entropy (MaxEnt)
- Technological Evaluation
  - Rescoring of word-lattice output from an HMM-based recognizer
  - Errors that we fixed: channel noise, laughter, etcetera
  - New errors that we caused: pronunciation models trained on 3 hours can't compete with triphone models trained on 3000 hours
- Future Plans

Overview
- History: the research described in this talk was performed between June 30 and August 17, 2004, at the Johns Hopkins summer workshop WS04.
- Scientific goal: to use high-dimensional machine learning technologies (SVM, DBN) to create representations capable of learning, from data, the types of speech knowledge that humans exhibit in psychophysical speech perception experiments.
- Technological goal:
  - Long-term: to create a better speech recognizer
  - Short-term: lattice rescoring, applied to word lattices produced by SRI's NN/HMM hybrid

Overview of Systems to be Described
[System diagram:]
- Acoustic model (SVMs): MFCCs (5 ms & 1 ms frame period), formants, and phonetic & auditory model parameters are concatenated over 4-15 frames; the SVMs produce p(landmark | SVM)
- Pronunciation model (DBN or MaxEnt): p(SVM | word)
- First-pass ASR word lattice: word labels, start & end times, p(MFCC,PLP | word), p(word | words)
- Rescoring: log-linear score combination
I. Acoustic Modeling
- Goal: learn precise and generalizable models of the acoustic boundary associated with each distinctive feature.
- Methods:
  - Large input vector space (many acoustic feature types)
  - Regularized binary classifiers (SVMs)
  - SVM outputs "smoothed" using dynamic programming
  - SVM outputs converted to posterior probability estimates, once per 5 ms, using a histogram

Speech Databases

Database | Size | Phonetic Transcr. | Word Lattices
NTIMIT | 14 hrs | manual | -
WS96 & WS97 | 3.5 hrs | manual | -
SWB1 (WS04 subset) | 12 hrs | auto | SRI, BBN
Eval01 | 10 hrs | - | BBN & SRI
RT03 Dev | 6 hrs | - | SRI
RT03 Eval | 6 hrs | - | SRI

Acoustic and Auditory Features
- MFCCs, 25 ms window (standard ASR features)
- Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond
- Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
- Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
- Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
What are Distinctive Features? What are Landmarks?
- Distinctive feature = a binary partition of the phonemes (Jakobson, 1952) that compactly describes pronunciation variability (Halle) and correlates with distinct acoustic cues (Stevens)
- Landmark = change in the value of a manner feature, e.g. +sonorant to -sonorant, or -sonorant to +sonorant
- 5 manner features: sonorant, consonantal, continuant, syllabic, silence
- Place and voicing features (SVMs are only trained at landmarks):
  - Primary articulator: lips, tongue blade, or tongue body
  - Features of primary articulator: anterior, strident
  - Features of secondary articulator: nasal, voiced

Landmark Detection using Support Vector Machines (SVMs)
[Figure: false acceptance vs. false rejection errors, TIMIT, per 10 ms frame; an SVM stop-release detector has half the error of an HMM (Niyogi & Burges, 1999, 2002):]
1. Delta-energy ("Deriv"): equal error rate = 0.2%
2. HMM: false rejection error = 0.3%
3. Linear SVM: equal error rate = 0.15%
4. Radial basis function SVM: equal error rate = 0.13%

Dynamic Programming Smooths SVMs
- Maximize Π_i p(features(t_i) | X(t_i)) · p(t_{i+1} - t_i | features(t_i))
- Soft-decision "smoothing" mode: p(acoustics | landmarks) is computed and fed to the pronunciation model

SVM Extracts a Discriminant Dimension
- Cues for place of articulation: MFCCs + formants + rate-scale, within 150 ms of the landmark
- Kernel: transform to an infinite-dimensional Hilbert space
- SVM discriminant dimension = argmin(error(margin) + 1/width(margin))

Soft-Decision Landmark Probabilities
- Niyogi & Burges, 2002: p(class | acoustics) from a sigmoid model in the discriminant dimension, OR
- Juneja & Espy-Wilson, 2003: p(class | acoustics) from a histogram in the discriminant dimension
- Soft decisions once per 5 ms:
  - p(manner feature d_i(t) | Y(t))
  - p(place feature d_i(t) | Y(t), t is a landmark)
- Pipeline: 2000-dimensional acoustic feature vector -> SVM -> discriminant y_i(t) -> histogram -> posterior probability of distinctive feature p(d_i(t)=1 | y_i(t))
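As a concrete rendering of the pipeline above (acoustic vector -> SVM -> discriminant y_i(t) -> histogram -> posterior), here is a minimal sketch of the histogram method of Juneja & Espy-Wilson. The shapes, bin count, and Laplace smoothing are illustrative assumptions rather than the workshop's actual settings:

```python
import numpy as np
from sklearn.svm import SVC

def train_soft_landmark_detector(X_train, y_train, X_dev, y_dev, n_bins=20):
    """Train an RBF-SVM landmark classifier, then map its discriminant
    output to a posterior p(d=1 | y) with a histogram on held-out data.

    X_*: (N, D) acoustic feature vectors; y_*: (N,) binary labels.
    Returns (svm, bin_edges, posterior), where posterior[b] estimates
    p(d=1 | discriminant falls in bin b).
    """
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    scores = svm.decision_function(X_dev)              # discriminant y_i(t)
    bin_edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bins = np.clip(np.digitize(scores, bin_edges) - 1, 0, n_bins - 1)
    posterior = np.array([
        # Laplace-smoothed fraction of positive examples per bin
        (np.sum(y_dev[bins == b]) + 1.0) / (np.sum(bins == b) + 2.0)
        for b in range(n_bins)
    ])
    return svm, bin_edges, posterior

def p_feature(svm, bin_edges, posterior, x):
    """Posterior p(d(t)=1 | y(t)) for one acoustic frame x."""
    s = svm.decision_function(x.reshape(1, -1))[0]
    b = np.clip(np.digitize(s, bin_edges) - 1, 0, len(posterior) - 1)
    return posterior[b]
```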
II. Pronunciation Modeling
- Goal: represent a large number of pronunciation variants, in a controlled fashion, using distinctive features; pick out the distinctive features that are most important for each word recognition task.
- Methods:
  - Distinctive-feature-based lexicon + dynamic programming alignment
  - Dynamic Bayesian network model of Articulatory Phonology (an articulation-based pronunciation variability model)
  - MaxEnt search for lexically discriminative features (a perceptually based "pronunciation model")

1. Distinctive-Feature Based Lexicon
- Merger of the English Switchboard and Callhome dictionaries
- Converted to landmarks using Hasegawa-Johnson's perl transcription tools (landmarks in blue, place and voicing features in green on the original slide):

AGO (0.441765): +syllabic +reduced +back AX | -continuant -sonorant +velar +voiced G closure | -continuant -sonorant +velar +voiced G release | +syllabic -low -high +back +round +tense OW
AGO (0.294118): +syllabic +reduced -back IX | -continuant -sonorant +velar +voiced G closure | -continuant -sonorant +velar +voiced G release | +syllabic -low -high +back +round +tense OW

Dynamic Programming Lexical Search
- Choose the word that maximizes Π_i p(features(t_i) | X(t_i)) · p(t_{i+1} - t_i | features(t_i)) · p(features(t_i) | word), as sketched below.
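A minimal sketch of this dynamic-programming search, under the assumption that the three probability terms arrive as precomputed log-probability tables (the table formats are illustrative):

```python
import numpy as np

def word_score(obs_logp, dur_logp, lex_logp):
    """Viterbi alignment score of one word's landmark sequence.

    obs_logp: (K, T) log p(features_k | X(t)), SVM posteriors per frame
    dur_logp: (K, T) log p(duration = d | features_k), d in frames
    lex_logp: (K,)   log p(features_k | word), from the lexicon
    Returns the max over landmark times t_1 < ... < t_K of the summed
    log score.
    """
    K, T = obs_logp.shape
    best = np.full((K, T), -np.inf)
    best[0, :] = obs_logp[0, :] + lex_logp[0]
    for k in range(1, K):
        for t in range(k, T):
            # Transition from landmark k-1 at any earlier time s,
            # paying the duration penalty for the gap t - s.
            cand = [best[k - 1, s] + dur_logp[k - 1, t - s] for s in range(t)]
            best[k, t] = max(cand) + obs_logp[k, t] + lex_logp[k]
    return best[K - 1].max()

# Rescoring picks the word with the highest alignment score:
# best_word = max(lexicon, key=lambda w: word_score(*tables_for(w)))
```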
2. Articulatory Phonology
- Phone-level transcriptions make common variants look arbitrary:
  - warmth -> w ao r m p th (phone insertion?)
  - I don't know -> ah dx uh_n ow_n (phone deletion?)
  - several -> s eh r v ax l (exchange of two phones?)
- Many pronunciation phenomena can be parsimoniously described as resulting from asynchrony and reduction of sub-phonetic features:
  - instruments -> ih_n s ch em ih_n n s
  - everybody -> eh r uw ay
- One set of features, based on Articulatory Phonology (Browman & Goldstein, 1990), with tract variables LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING

Dynamic Bayesian Network Model (Livescu and Glass, 2004)
- The model is implemented as a dynamic Bayesian network (DBN): a representation, via a directed graph, of a distribution over a set of variables that evolve through time.
- Example DBN with three features: the asynchrony between two feature streams is itself a random variable, Pr(async_{1;2} = a) = Pr(|ind_1 - ind_2| = a), with a conditional probability table over asynchrony degrees 0-3 (rows of values such as .7 / .2 / .1 / 0 on the original slide); the per-feature index distributions are given by baseform pronunciations.
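To make the reconstructed asynchrony formula concrete, here is a small sketch that tabulates Pr(async_{1;2} = a) = Pr(|ind_1 - ind_2| = a) from two feature streams' frame-by-frame baseform indices; the index trajectories are made up for illustration:

```python
import numpy as np

def async_distribution(ind1, ind2, max_async=3):
    """Empirical Pr(async = a) = Pr(|ind1 - ind2| = a) for two feature
    streams' frame-by-frame indices into the baseform pronunciation."""
    a = np.abs(np.asarray(ind1) - np.asarray(ind2))
    counts = np.bincount(np.clip(a, 0, max_async), minlength=max_async + 1)
    return counts / counts.sum()

# Example: lips run slightly ahead of the tongue for a few frames.
ind_lips   = [0, 1, 1, 2, 2, 3]
ind_tongue = [0, 0, 1, 1, 2, 3]
print(async_distribution(ind_lips, ind_tongue))  # [0.6667 0.3333 0. 0.]
```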
The DBN-SVM Hybrid Developed at WS04
[Figure: for the word LIKE, the DBN links word -> canonical form -> surface form (manner and place states such as tongue front / palatal, tongue closed / semi-closed / open, glide, front vowel) to SVM outputs computed from a multi-frame observation x including spectrum, formants, & auditory model, e.g. p(g_PGR(x) | palatal glide release) and p(g_GR(x) | glide release)]

3. Discriminative Pronunciation Model
- Rationale: the baseline HMM-based system already provides high-quality hypotheses:
  - 1-best error rate from N-best lists: 24.4% (RT-03 dev set)
  - oracle error rate: 16.2%
- Method: use landmark detection only where necessary, to correct errors made by the baseline recognition system.
- Example (fsh_60386_1_0105420_0108380):
  Ref: that cannot be that hard to sneak onto an airplane
  Hyp: they can be a that hard to speak on an airplane
Identifying Confusable Hypotheses
- Use existing alignment algorithms for converting lattices into confusion networks (Mangu, Brill & Stolcke, 2000); hypotheses are ranked by posterior probability
- Generated from n-best lists without 4-gram or pronunciation model scores (hence higher WER compared to lattices)
- Multi-words ("I_dont_know") were split prior to generating confusion networks
[Figure: confusion network for the example above, with slots such as that|they, can|cant, be, *DEL*|a, that, hard, to, sneak|speak, onto|on, an, airplane]

Identifying Confusable Hypotheses (cont.)
- How much can be gained from fixing confusions?
- Baseline error rate: 25.8%
- Oracle error rates when selecting the correct word from the confusion set:

# hypotheses to select from | Including homophones | Not including homophones
2 | 23.9% | 23.9%
3 | 23.0% | 23.0%
4 | 22.4% | 22.5%
5 | 22.0% | 22.1%
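The oracle numbers in this table can be reproduced in principle by always picking the reference word whenever it appears among the top-N hypotheses of a confusion slot. A minimal sketch, where the slot format is an illustrative assumption:

```python
def oracle_errors(slots, n_best):
    """Count residual errors if an oracle picks the reference word
    whenever it is among the top n_best hypotheses of a slot.

    slots: list of (reference_word, ranked_hypotheses) pairs, where
           ranked_hypotheses are ordered by posterior probability.
    """
    errors = 0
    for ref, hyps in slots:
        if ref not in hyps[:n_best]:
            errors += 1   # the oracle is still forced to pick a wrong word
    return errors

slots = [("that", ["they", "that"]), ("sneak", ["speak", "sneak", "seek"])]
print(oracle_errors(slots, n_best=1))  # 2: the baseline 1-best is wrong twice
print(oracle_errors(slots, n_best=2))  # 0: the oracle fixes both confusions
```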
Selecting Relevant Landmarks
- Not all landmarks are equally relevant for distinguishing between competing word hypotheses (e.g. vowel features are irrelevant for sneak vs. speak)
- Using all available landmarks might deteriorate performance when irrelevant landmarks have weak scores (but: redundancy might be useful)
- Automatic selection algorithm:
  - should optimally distinguish the set of confusable words (discriminative)
  - should rank landmark features according to their relevance for distinguishing words (i.e. the output should be interpretable in phonetic terms)
  - should be extendable to features beyond landmarks
Maximum-Entropy Landmark Selection
- Convert each word in the confusion set into a fixed-length landmark-based representation, using an idea from information retrieval:
  - a vector space consisting of binary relations between two landmarks
  - manner landmarks: precedence, e.g. V < Son. Cons.
  - manner & place features: overlap, e.g. V o +high
  - preserves basic temporal information
- Words are represented as frequency entries in the feature vector
- Not all possible relations are used (phonotactic constraints; place features are detected dependent on manner landmarks)
- Dimensionality of the feature space: 40-60
- Word entries are derived from the phone representation plus pronunciation rules

Vector-Space Word Representation

Word | Start<Fric | Fric<Stop | Fric<Son | Fric<Vowel | Stop<Vowel | Vowel o high | Vowel o front | Fric o strident
speak | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1
sneak | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1
seek | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1
he | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0
she | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1
steak | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1
...
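A sketch of how rows of this table can be derived from a word's landmark sequence; consistent with the speak/sneak/seek rows above, precedence is taken between consecutive manner landmarks, and the data structures are illustrative assumptions:

```python
def relation_vector(landmarks, dims):
    """Map a pronunciation to counts of binary landmark relations.

    landmarks: ordered list of (manner_landmark, set_of_place_features).
    dims:      ordered list of relation names defining the vector space;
               relations excluded by phonotactic constraints are simply
               absent from dims and get ignored.
    """
    counts = {d: 0 for d in dims}
    def bump(rel):
        if rel in counts:
            counts[rel] += 1
    if landmarks:
        bump("Start<" + landmarks[0][0])
    for (m1, _), (m2, _) in zip(landmarks, landmarks[1:]):
        bump(m1 + "<" + m2)            # precedence of adjacent manner landmarks
    for manner, places in landmarks:
        for p in places:
            bump(manner + " o " + p)   # overlap with a place feature
    return [counts[d] for d in dims]

dims = ["Start<Fric", "Fric<Stop", "Fric<Son", "Fric<Vowel",
        "Stop<Vowel", "Vowel o +high", "Vowel o +front", "Fric o +strident"]
speak = [("Fric", {"+strident"}), ("Stop", set()),
         ("Vowel", {"+high", "+front"}), ("Stop", set())]
print(relation_vector(speak, dims))    # [1, 1, 0, 0, 1, 1, 1, 1], as in the table
```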
Maximum-Entropy Discrimination
- Use a maxent classifier: p(y|x) = (1/Z(x)) exp(Σ_j λ_j f_j(x, y)), where y = words, x = acoustics, f = landmark relationships
- Why a maxent classifier?
  - discriminative classifier
  - handles a possibly large set of confusable words
  - allows later addition of non-binary features
- Training: ideally on real landmark detection output; here, on entries from the lexicon (includes pronunciation variants)

Maximum-Entropy Discrimination: Example
- Example: sneak vs. speak
- A different model is trained for each confusion set, so landmarks can have different weights in different contexts:

Relation | sneak | speak
SC o +blade | 2.47 | -2.47
FR < SC | 2.47 | -2.47
FR < SIL | -2.11 | 2.11
SIL < ST | -1.75 | 1.75
...
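The training setup described above can be sketched with logistic regression, which is equivalent to a conditional maxent model; sklearn here is an illustrative stand-in for whatever maxent toolkit was actually used, and the training rows are lexicon-derived relation vectors from the earlier table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One relation vector per lexicon entry for a confusion set
# (rows: pronunciation variants; labels: word identities).
X = np.array([
    [1, 1, 0, 0, 1, 1, 1, 1],   # speak
    [1, 0, 1, 0, 0, 1, 1, 1],   # sneak
])
y = np.array(["speak", "sneak"])

# A separate maxent model is trained for each confusion set, so the
# same landmark relation can carry different weights in different contexts.
clf = LogisticRegression(C=1.0).fit(X, y)

# Positive weights favor one word and negative the other, mirroring the
# opposite-signed weights for "SC o +blade" on the slide.
names = ["Start<Fric", "Fric<Stop", "Fric<Son", "Fric<Vowel",
         "Stop<Vowel", "Vowel o +high", "Vowel o +front", "Fric o +strident"]
for name, w in zip(names, clf.coef_[0]):
    print(f"{name:18s} {w:+.2f}")
```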
Landmark Queries
- Select the N landmarks with the highest weights
- Ask the landmark detection module to produce scores for the selected landmarks, within word boundaries given by the baseline system
- Example (confusion networks <-> landmark detectors):
  query:  sneak 1.70 1.99 SC o +blade ?
  answer: sneak 1.70 1.99 SC o +blade 0.75 0.56

III. Evaluation

Acoustic Feature Selection
1. Accuracy per frame (%), stop releases only, NTIMIT (table below)
2. Word error rate, lattice rescoring, RT03-devel, one talker (WARNING: this talker is atypical):
   - Baseline: 15.0% (113/755)
   - Rescoring, place based on MFCCs + formant-based params: 14.6% (110/755)
   - Rescoring, place based on rate-scale + formant-based params: 14.3% (108/755)
Feature | MFCCs+Shape (Linear) | MFCCs+Shape (RBF) | MFCCs+Formants (Linear) | MFCCs+Formants (RBF)
+/- lips | 78.3 | 90.7 | 92.7 | 95.0
+/- blade | 73.4 | 87.1 | 79.6 | 85.1
+/- body | 73.0 | 85.2 | 85.7 | 87.2

SVM Training: Mixed vs. Targeted Data (accuracy per frame, %; Linear / RBF kernels)

Landmark | Train NTIMIT, Test NTIMIT | Train NTIMIT&SWB, Test NTIMIT&SWB | Train NTIMIT, Test Switchboard | Train Switchboard, Test Switchboard
speech onset | 95.1 / 96.2 | 86.9 / 89.9 | 71.4 / 62.2 | 81.6 / 81.6
speech offset | 79.6 / 88.5 | 76.3 / 86.4 | 65.3 / 78.6 | 68.4 / 83.7
consonant onset | 94.5 / 95.5 | 91.4 / 93.5 | 70.3 / 72.7 | 95.8 / 97.7
consonant offset | 91.7 / 93.7 | 94.3 / 96.8 | 80.3 / 86.2 | 92.8 / 96.8
continuant onset | 89.4 / 94.1 | 87.3 / 95.0 | 69.1 / 81.9 | 86.2 / 92.0
continuant offset | 90.8 / 94.9 | 90.4 / 94.6 | 69.3 / 68.8 | 89.6 / 94.3
sonorant onset | 95.6 / 97.2 | 97.8 / 96.7 | 85.2 / 86.5 | 96.3 / 96.3
sonorant offset | 95.3 / 96.4 | 94.0 / 97.4 | 75.6 / 75.2 | 95.2 / 96.4
syllabic onset | 90.7 / 95.2 | 91.4 / 95.5 | 69.5 / 78.9 | 87.9 / 92.6
syllabic offset | 90.1 / 88.9 | 87.1 / 92.9 | 54.4 / 60.8 | 88.2 / 89.7

DBN-SVM: Models Nonstandard Phones
[Figure: in "I don't know", /d/ becomes a flap and /n/ becomes a creaky nasal glide]
DBN-SVM Design Decisions
- What kind of SVM outputs should be used in the DBN?
  - Method 1 (EBS/DBN): generate a landmark segmentation with EBS using the manner SVMs, then apply place SVMs at appropriate points in the segmentation
    - force the DBN to use the EBS segmentation, or
    - allow the DBN to stray from the EBS segmentation, using place/voicing SVM outputs whenever available
  - Method 2 (SVM/DBN): apply all SVMs in all frames, and allow the DBN to consider all possible segmentations
    - in a single pass, or
    - in two passes: (1) manner-based segmentation; (2) place+manner scoring
- How should we take into account the distinctive feature hierarchy?
- How do we avoid "over-counting" evidence?
- How do we train the DBN (feature transcriptions vs. SVM outputs)?

DBN-SVM Rescoring Experiments
For each lattice edge:
- SVM probabilities are computed over the edge duration and used as soft evidence in the DBN
- the DBN computes a score S ∝ P(word | evidence)
- the final edge score is a weighted interpolation of the baseline scores and the EBS/DBN or SVM/DBN score (results in the table below; a sketch of the interpolation follows the table)
Date | Experimental setup | 3-speaker WER (# errors) | RT03 dev WER
- | Baseline | 27.7 (550) | 26.8
Jul31_0 | EBS/DBN, "hierarchically-normalized" SVM output probabilities, DBN trained on a subset of ICSI transcriptions | 27.6 (549) | 26.8
Aug1_19 | + improved silence modeling | 27.6 (549) |
Aug2_19 | EBS/DBN, unnormalized SVM probs + fricative lip feature | 27.3 (543) | 26.8
Aug4_2 | + DBN trained using SVM outputs | 27.3 (543) |
Aug6_20 | + full feature hierarchy in DBN | 27.4 (545) |
Aug7_3 | + reduction probabilities depend on word frequency | 27.4 (544) |
Aug8_19 | + retrained SVMs + nasal classifier + DBN bug fixes | 27.4 (544) |
Aug11_19 | SVM/DBN, 1 pass | Miserable failure! |
Aug14_0 | SVM/DBN, 2 pass | 27.3 (542) |
Aug14_20 | SVM/DBN, 2 pass, using only high-accuracy SVMs | 27.2 (541) |
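The edge-scoring arithmetic used in these experiments, and again for the discriminative pronunciation results below, is a weighted log-linear interpolation. A minimal sketch, where the 0.8/0.2 weights are the ones reported below and the example probabilities are made up:

```python
import math

def rescore_edge(baseline_logp, landmark_logp, w_old=0.8, w_new=0.2):
    """Log-linear interpolation of baseline and landmark-based scores.

    Equivalent to the product combination p_old**0.8 * p_new**0.2
    used for the discriminative pronunciation model results.
    """
    return w_old * baseline_logp + w_new * landmark_logp

# Example: a lattice edge where the new model disagrees with the baseline.
old, new = math.log(0.6), math.log(0.2)
print(rescore_edge(old, new))   # combined log score for the edge
```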
Discriminative Pronunciation Model

System | WER | Insertions | Deletions | Substitutions
Baseline | 25.8% | 2.6% (982) | 9.2% (3526) | 14.1% (5417)
Rescored | 25.8% | 2.6% (984) | 9.2% (3524) | 14.1% (5408)

- RT-03 dev set: 35497 words, 2930 segments, 36 speakers (Switchboard and Fisher data)
- Rescored: product combination of the old and new probability distributions, with weights 0.8 (old) and 0.2 (new)
- The correct/incorrect decision changed in about 8% of all cases
- Slightly higher number of fixed errors vs. new errors

Analysis
- When does it work? Detectors give high probabilit…