




Deep Features for Text Spotting

Max Jaderberg, Andrea Vedaldi, Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford

Abstract. The goal of this work is text spotting in natural images. This is divided into two sequential tasks: detecting word regions in the image, and recognizing the words within these regions. We make the following contributions: first, we develop a Convolutional Neural Network (CNN) classifier that can be used for both tasks. The CNN has a novel architecture that enables efficient feature sharing (by using a number of layers in common) for text detection, character case-sensitive and insensitive classification, and bigram classification. It exceeds the state-of-the-art performance for all of these. Second, we make a number of technical changes over the traditional CNN architectures, including no downsampling for a per-pixel sliding window, and multi-mode learning with a mixture of linear models (maxout). Third, we have a method of automated data mining of Flickr that generates word and character level annotations. Finally, these components are used together to form an end-to-end, state-of-the-art text spotting system. We evaluate the text-spotting system on two standard benchmarks, the ICDAR Robust Reading dataset and the Street View Text dataset, and demonstrate improvements over the state-of-the-art on multiple measures.

1 Introduction

While text recognition from scanned documents is well studied and there are many available systems, the automatic detection and recognition of text within images, known as text spotting (Fig. 1), is far less developed. However, text contained within images can be of great semantic value, and so is an important step towards both information retrieval and autonomous systems. For example, text spotting of numbers in street view data allows the automatic localization of house numbers in maps [20], reading street and shop signs gives robotic vehicles scene context [39], and indexing large volumes of video data with text obtained by text spotting enables fast and accurate retrieval of video data from a text search [26].

Fig. 1. (a) An end-to-end text spotting result from the presented system on the SVT dataset. (b) Randomly sampled cropped word data automatically mined from Flickr with a weak baseline system, generating extra training data.

[…] pipeline. To achieve this we use a Convolutional Neural Network (CNN) [27] and generate a per-pixel text/no-text saliency map, a case-sensitive and case-insensitive character saliency map, and a bigram saliency map. The text saliency map drives the proposal of word bounding boxes, while the character and bigram saliency maps assist in recognizing the word within each bounding box through a combination of soft costs. Our work is inspired by the excellent performance of CNNs for character classification [6,8,47].

Our contributions are threefold. First, we introduce a method to share features [44] which allows us to extend our character classifiers to other tasks such as character detection and bigram classification at a very small extra cost: we first generate a single rich feature set, by training a strongly supervised character classifier, and then use the intermediate hidden layers as features for text detection, character case-sensitive and insensitive classification, and bigram classification. This procedure makes best use of the available training data: plentiful for character/non-character but less so for the other tasks. It is reminiscent of the Caffe idea [14], but here it is not necessary to have external sources of training data. A second key novelty in the context of text detection is to leverage the convolutional structure of the CNN to process the entire image in one go instead of running CNN classifiers on each cropped character proposal [27]. This allows us to generate efficiently, in a single pass, all the features required to detect word bounding boxes, and that we use for recognizing words from a fixed lexicon using the Viterbi algorithm. We also make a technical contribution in showing that our CNN architecture using maxout [21] as the non-linear activation function has superior performance to the more standard rectified linear unit.
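The maxout activation credited here with outperforming ReLU (and defined in Sect. 3 as a maximum taken over groups of g consecutive feature channels) can be sketched in a few lines of NumPy; the channels-first array layout and the function name are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Channel-grouped maxout: given K input channels, take the maximum over
# groups of g consecutive channels, yielding K/g output channels.
# The (K, H, W) channels-first layout is an illustrative assumption.
def maxout(z, g):
    """z: feature map of shape (K, H, W); returns (K // g, H, W)."""
    K, H, W = z.shape
    assert K % g == 0, "channel count must be divisible by group size"
    return z.reshape(K // g, g, H, W).max(axis=1)

# With g = 2 this reduces to the pointwise maximum of channel pairs.
z = np.arange(8, dtype=float).reshape(4, 1, 2)  # K=4, H=1, W=2
out = maxout(z, g=2)
assert out.shape == (2, 1, 2)
```

Because the operation is a plain maximum over channels, it adds no parameters of its own; the "mixture of linear models" behaviour comes from the g linear filters feeding each group.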
Our third contribution is a method for automatically mining and annotating data (Fig. 1). Since CNNs can have many millions of trainable parameters, we require a large corpus of training data to minimize overfitting, and mining is useful to cheaply extend the available data. Our mining method crawls images from the Internet to automatically generate word level and character level bounding box annotations, and a separate method is used to automatically generate character level bounding box annotations when only word level bounding box annotations are supplied.

In the following we first describe the data mining procedure (Sect. 2) and then the CNN architecture and training (Sect. 3). Our end-to-end (image in, text out) text spotting pipeline is described in Sect. 4. Finally, Sect. 5 evaluates the method on a number of standard benchmarks. We show that the performance exceeds the state of the art across multiple measures.

Related Work. Decomposing the text-spotting problem into text detection and text recognition was first proposed by [12]. Authors have subsequently focused solely on text detection [7,11,16,50,51], or text recognition [31,36,41], or on combining both in end-to-end systems [40,39,49,32-34,45,35,6,8,48]. Text detection methods are either based on connected components (CCs) [11,16,50,49,32-35] or sliding windows [40,7,39,45]. Connected component methods segment pixels into characters, then group these into words. For example, Epshtein et al. take characters as CCs of the stroke width transform [16], while Neumann and Matas [34,33] use Extremal Regions [29], or more recently oriented strokes [35], as CCs representing characters. Sliding window methods approach text spotting as a standard task of object detection. For example, Wang et al. [45] use a random ferns [38] sliding window classifier to find characters in an image, grouping them using a pictorial structures model [18] for a fixed lexicon. Wang & Wu et al. [47] build on the fixed lexicon problem by using CNNs [27] with unsupervised pre-training as in [13]. Alsharif et al. [6] and Bissacco et al. [8] also use CNNs for character classification; both methods over-segment a word bounding box and find an approximate solution to the optimal word recognition result, in [8] using beam search and in [6] using a Hidden Markov Model. The works by Mishra et al. [31] and Novikova et al. [36] focus purely on text recognition, assuming a perfect text detector has produced cropped images of words. In [36], Novikova combines both visual and lexicon consistency into a single probabilistic model.

2 Data mining for word and character annotations

In this section we describe a method for automatically mining suitable photo sharing websites to acquire word and character level annotated data. This annotation is used to provide additional training data for the CNN in Sect. 5.

Word Mining.
Photo sharing websites such as Flickr [3] contain a large range of scenes, including those containing text. In particular, the "Typography and Lettering" group on Flickr [4] contains mainly photos or graphics containing text. As the text depicted in the scenes is the focus of the images, the user-given titles of the images often include the text in the scene. Capitalizing on this weakly supervised information, we develop a system to find title text within the image, automatically generating word and character level bounding box annotations. Using a weak baseline text-spotting system based on the Stroke Width Transform (SWT) [16] and described in Sect. 5, we generate candidate word detections for each image from Flickr. If a detected word is the same as any of the image's title text words, and there are the same number of characters from the SWT detection phase as word characters, we say that this is an accurate word detection, and use this detection as positive text training data. We set the parameters so that the recall of this process is very low (out of 130,000 images, only 15,000 words were found), but the precision is greater than 99%. This means the precision is high enough for the mined Flickr data to be used as positive training data, but the recall is too low for it to be used for background no-text training data. We will refer to this dataset as FlickrType; it contains 6,792 images, 14,920 words, and 71,579 characters. Fig. 1 shows some positive cropped words randomly sampled from the automatically generated FlickrType dataset. Although this procedure will cause a bias towards scene text that can be found with a simple end-to-end pipeline, it still generates more training examples that can be used to prevent the overfitting of our models.

Automatic Character Annotation. In addition to mining data from Flickr, we also use the word recognition system described in Sect. 4.2 to automatically generate character bounding box annotations for datasets which only have word level bounding box annotations. For each cropped word, we perform the optimal fitting of the ground truth text to the character map using the method described in Sect. 4.2. This places inter-character breakpoints with implied character centers, which can be used as rough character bounding boxes. We do this for the SVT and Oxford Cornmarket datasets (that are described in Sect. 5), allowing us to train and test on an extra 22,000 cropped characters from those datasets.

3 Feature learning using a Convolutional Neural Network

The workhorse of a text-spotting system is the character classifier. The output of this classifier is used to recognize words and, in our system, to detect image regions that contain text. Text-spotting systems appear to be particularly sensitive to the performance of character classification; for example, in [8] increasing the accuracy of the character classifier by 7% led to a 25% increase in word recognition. In this section we therefore concentrate on maximizing the performance of this component.

To classify an image patch x in one of the possible characters (or background), we extract a set of features φ(x) = (φ_1(x), φ_2(x), ..., φ_K(x)) and then learn
a binary classifier f_c for each character c of the alphabet C. Classifiers are learned to yield a posterior probability distribution p(c|x) = f_c(φ(x)) over characters, and the latter is maximized to recognize the character c̄ contained in patch x: c̄ = argmax_{c∈C} p(c|x). Traditionally, features are manually engineered and optimized through a laborious trial-and-error cycle involving adjusting the features and re-learning the classifiers. In this work, we propose instead to learn the representation using a CNN [27], jointly optimizing the performance of the features as well as of the classifiers. As noted in the recent literature, a well designed learnable representation of this type can in fact yield substantial performance gains [25].

CNNs are obtained by stacking multiple layers of features. A convolutional layer consists of K linear filters followed by a non-linear response function. The input to a convolutional layer is a feature map z_i(u, v), where (u, v) ∈ Ω_i are spatial coordinates and z_i(u, v) ∈ R^C contains C scalar features or channels z^k_i(u, v). The output is a new feature map z_{i+1} such that z^k_{i+1} = h_i(W^k_i ∗ z_i + b^k_i), where W^k_i and b^k_i denote the k-th filter kernel and bias respectively, and h_i is a non-linear activation function such as the Rectified Linear Unit (ReLU) h_i(z) = max{0, z}. Convolutional layers can be intertwined with normalization, subsampling, and max-pooling layers which build translation invariance in local neighborhoods. The process starts with z_1 = x and ends by connecting the last feature map to a logistic regressor for classification. All the parameters of the model are jointly optimized to minimize the classification loss over a training set using Stochastic Gradient Descent (SGD), back-propagation, and other improvements discussed in Sect. 3.1.

Instead of using ReLUs as the activation function h_i, in our experiments it was found empirically that maxout [21] yields superior performance. Maxout, in particular when used in the final classification layer, can be thought of as taking the maximum response over a mixture of n linear models, allowing the CNN to easily model multiple modes of the data. In the simplest case of two feature channels z^1_i, z^2_i, the maxout output is simply their pointwise maximum: h_i(z_i(u, v)) = max{z^1_i(u, v), z^2_i(u, v)}. More generally, the k′-th maxout operator h^{k′}_i is obtained by selecting a subset G^{k′}_i ⊂ {1, 2, ..., K} of feature channels and computing the maximum over them: h^{k′}_i(z_i(u, v)) = max_{k∈G^{k′}_i} z^k_i(u, v). While different grouping strategies are possible, here groups are formed by taking g consecutive channels of the input map: G^1_i = {1, 2, ..., g}, G^2_i = {g + 1, g + 2, ..., 2g}, and so on. Hence, given K feature channels as input, maxout constructs K′ = K/g new channels.

This section discusses the details of learning the character classifiers. Training is divided into two stages. In the first stage, a case-insensitive CNN character classifier is learned. In the second stage, the resulting feature maps are applied to other classification problems as needed. The output is four state-of-the-art CNN classifiers: a character/background classifier, a case-insensitive character classifier, a case-sensitive character classifier, and a bigram classifier.

Stage 1: Bootstrapping the case-insensitive classifier. The case-insensitive classifier uses
a four-layer CNN outputting a probability p(c|x) over an alphabet C including all 26 letters, 10 digits, and a noise/background (no-text) class, giving a total of 37 classes (Fig. 2). The input z_1 = x of the CNN are grayscale cropped character images of 24 × 24 pixels, zero-centered and normalized by subtracting the patch mean and dividing by the standard deviation. Due to the small input size, no spatial pooling or downsampling is performed.

Fig. 3. Visualizations of each character class learnt from the 37-way case-insensitive character classifier CNN. Each image is synthetically generated by maximizing the posterior probability of a particular class. This is implemented by back-propagating the error from a cost layer that aims to maximize the score of that class [43,17].

Starting from the first layer, the input image is convolved with 96 filters of size 9 × 9, resulting in a map of size 16 × 16 (to avoid boundary effects) and 96 channels. The 96 channels are then pooled with maxout in groups of size g = 2, resulting in 48 channels. The sequence continues by convolving with 128, 512, 148 filters of side 9, 8, 1 and maxout groups of size g = 2, 4, 4, resulting in feature maps with 64, 128, 37 channels and sizes 8 × 8, 1 × 1, 1 × 1 respectively. The last 37 channels are fed into a soft-max to convert them into character probabilities. In practice we use 48 channels in the final classification layer rather than 37, as the software we use, based on cuda-convnet [25], is optimized for multiples of 16 convolutional filters; we do however use the additional 12 classes as extra no-text classes, abstracting this to 37 output classes. We train using stochastic gradient descent and back-propagation, and also use dropout [22] in
all layers except the first convolutional layer to help prevent overfitting. Dropout simply involves randomly zeroing a proportion of the parameters; the proportion we keep for each layer is 1, 0.5, 0.5, 0.5. The training data is augmented by random rotations and noise injection. By omitting any downsampling in our network and ensuring the output for each class is one pixel in size, it is immediate to apply the learnt filters on a full image in a convolutional manner to obtain a per-pixel output without a loss of resolution, as shown in the second image of Fig. 4. Fig. 3 illustrates the learned CNN by using the visualization technique of [43].

Stage 2: Learning the other character classifiers. Training on a large amount of annotated data, and also including a no-text class in our alphabet, means the hidden layers of the network produce feature maps highly adept at discriminating characters, and can be adapted for other classification tasks related to text. We use the outputs of the second convolutional layer as our set of discriminative features, φ(x) = z_2. From these features, we train a 2-way text/no-text classifier¹, a 63-way case-sensitive character classifier, and a bigram classifier, each one using a two-layer CNN acting on φ(x) (Fig. 2). The last two layers of each of these three CNNs result in feature maps with 128-2, 128-63, and 128-604 channels respectively, all resulting from maxout grouping of size g = 4. These are all trained with φ(x) as input, with dropout of 0.5 on all layers, and fine-tuned by adaptively reducing the learning rate. The bigram classifier recognises instances of two adjacent characters, e.g. Fig. 6.

These CNNs could have been learned independently. However, sharing the first two layers has two key advantages. First, the low-level features learned from case-insensitive character classification allow sharing training data among tasks, reducing overfitting and improving performance in classification tasks with less informative labels (text/no-text classification), or tasks with fewer training examples (case-sensitive character classification, bigram classification). Second,
it allows sharing computations, significantly increasing the efficiency.

¹ Training a dedicated classifier was found to yield superior performance to using the background class in the 37-way case-insensitive character classifier.

4 End-to-End Pipeline

This section describes the various stages of the proposed end-to-end text spotting system, making use of the features learnt in Sect. 3. The pipeline starts with a detection phase (Sect. 4.1) that takes a raw image and generates candidate bounding boxes of words, making use of the text/no-text classifier. The words contained within these bounding boxes are then recognized against a fixed lexicon of words (Sect. 4.2), driven by the character classifiers, bigram classifier, and other geometric cues.

The aim of the detection phase is to start from a large, raw pixel input image and generate a set of rectangular bounding boxes, each of which should contain the image of a word. This detection process (Fig. 4) is tuned for high recall, and generates a set of candidate word bounding boxes. The process starts by computing a text saliency map by evaluating the character/background CNN classifier in a sliding window fashion across the image, which has been appropriately zero-padded so that the resulting text saliency map is the same resolution as the original image. As the CNN is trained to detect text at a single canonical height, this process is repeated for 16 different scales to target text heights between 16 and 260 pixels by resizing the input image.

Fig. 4. The detector phase for a single scale. From left to right: input image, CNN generated text saliency map using the text/no-text classifier, after the run length smoothing phase, after the word splitting phase, the implied bounding boxes. Subsequently, the bounding boxes will be combined at multiple scales and undergo filtering and non-maximal suppression.

Given these saliency maps, word bounding boxes are generated independently at each scale in two steps. The first step is to identify lines of text. To this end, the probability map is first thresholded to find local regions of high probability. Then these regions are connected in text lines by using the run length smoothing algorithm (RLSA): for each row of pixels the mean µ and standard deviation σ of the spacings between probability peaks are computed, and neighboring regions are connected if the space between them is less than 3µ − 0.5σ. Finding connected components of the linked regions results in candidate text lines.

The next step is to split text lines into words. For this, the image is cropped to just that of a text line and Otsu thresholding [37] is applied to roughly segment foreground characters from background. Adjacent connected components (which are hopefully segmented characters) are then connected if their horizontal spacings are less than the mean horizontal spacing for the text line, again using RLSA. The resulting connected components give candidate bounding boxes for individual words, which are then added to the global set of bounding boxes at all scales. Finally, these bounding boxes are filtered based on geometric constraints (box height, aspect ratio, etc.) and undergo non-maximal suppression, sorting them by decreasing average per-pixel text saliency score.

The aim of the word recognition stage is to take the candidate cropped word images I ∈ R^{W×H} of width W and height H and estimate the text contained in them. In order to recognize a word from a fixed lexicon, each word hypothesis is scored using
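The two RLSA grouping steps described above (linking high-probability regions into text lines, then linking character components into words) share one operation: merging adjacent intervals whose gap falls below a data-dependent threshold. A minimal sketch, assuming a 1-D interval representation and the 3µ − 0.5σ spacing rule from the text; the function name and interval encoding are illustrative, not the authors' code:

```python
import numpy as np

# RLSA-style grouping sketch: merge adjacent 1-D intervals (start, end)
# when the gap between them is below a threshold derived from the gap
# statistics, here the 3*mean - 0.5*std rule from the text.
def rlsa_group(intervals):
    intervals = sorted(intervals)
    gaps = [b[0] - a[1] for a, b in zip(intervals, intervals[1:])]
    if not gaps:
        return intervals
    thresh = 3 * np.mean(gaps) - 0.5 * np.std(gaps)
    merged = [list(intervals[0])]
    for (start, end), gap in zip(intervals[1:], gaps):
        if gap < thresh:
            merged[-1][1] = max(merged[-1][1], end)  # link regions
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]

# Three closely spaced character-like regions and one distant region:
print(rlsa_group([(0, 5), (7, 12), (14, 20), (60, 70)]))
# → [(0, 20), (60, 70)]
```

In the pipeline the statistics would come from the spacings between probability peaks along each pixel row (or from the mean horizontal spacing in the word-splitting step); here they are computed from the interval gaps directly for brevity.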