版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
ArchitectureofaSearchEngine
XE"SearchEngine"
Theterm"searchengine
XE"Searchengine"
"isoftenusedgenericallytodescribebothcrawler
XE"Crawler:crawler"
-basedsearchenginesandhuman-powereddirectories.Thesetwotypesofsearchenginesgathertheirlistingsinradicallydifferentways.Crawler-basedsearchengines,suchasGoogle
XE"Google:Google"
,createtheirlistingsautomatically.They"crawl
XE"Crawler:crawl"
XE"Crawler:Crawl"
"or"spider
XE"Crawler:Spider"
"theweb,thenpeoplesearchthroughwhattheyhavefound.Ahuman-powereddirectory,suchastheOpenDirectory,dependsonhumansforitslistings.Yousubmitashortdescriptiontothedirectoryforyourentiresite,oreditorswriteoneforsitestheyreview.Asearchlooksformatchesonlyinthedescriptionssubmitted.
Atypicalcrawler
XE"Crawler:crawler"
-basedsearchengine
XE"Searchengine"
hasseveralmajorelements.Firstisthespider
XE"Crawler:Spider"
,alsocalledthecrawler.ThespidervisitsaWeb
XE"Web:Web"
page,readsit,andthenfollowslinkstootherpageswithinthesite.Thisiswhatitmeanswhensomeonereferstoasitebeing"spidered"or"crawled."Thespiderreturnstothesiteonaregularbasis,suchaseverymonthortwo,tolookforchanges.Everythingthespiderfindsgoesintothesecondpartofthesearchengine,theindex
XE"Index"
.Theindex,sometimescalledthecatalogue,islikeagiantbookcontainingacopyofeveryWebpagethatthespiderfinds.IfaWebpagechanges,thenthisbookisupdatedwithnewinformation.Searchengine
XE"SearchEngine"
software
XE"Searchenginesoftware"
isthethirdpartofasearchengine.Thisistheprogramthatsiftsthroughthemillionsofpagesrecordedintheindextofindmatchestoasearchandranktheminorderofwhatitbelievesismostrelevant.
Onecanalsopictureatypicalsearchengine
XE"Searchengine"
(ofanytype)usingthefollowingelements:
UserInterface
Thisisneededtotaketheuserquery.
SearchModule
Transformsthequerytoanunderstandableformat,thenperformsmatchingwiththeindex
XE"Index"
andfinallyreturnsresultsasoutputwiththeneededinformation.
Index
Adatabase/repository
XE"Repository"
XE"Crawler:Repository"
withthedatatobesearched.
Thearchitectureisdepictedinthefigurebelow:
Thesearchmoduleisthemostimportant.Therearelocatedmostofthesearchengine
XE"Searchengine"
algorithms,includingpagerank
XE"Pagerank"
algorithms,usedtosorttheoutputwhenpresentedtotheuser.Inthissecondapproach,thecrawler
XE"Crawler:crawler"
isconsideredtobe“behind”themainsearchengine,becauseitissomehowseparatefromit.
CrawlerArchitecture
Asearchengine
XE"Searchengine"
cannotworkwithoutaproperindex
XE"Index"
wherepossiblesearchedpagesarestored,usuallyinacompressedformat.Thisindexiscreatedbyspecializedrobots,whichcrawl
XE"Crawler:crawl"
XE"Crawler:Crawl"
theWeb
XE"Web:Web"
fornew/modifiedpages(theactualcrawlers,orspiders).
Typicalcrawler
XE"Crawler:crawler"
architecture
XE"Crawler:architecture"
isdepictedinthefigurebelow:
Letusnowinvestigateeachcomponent.
RetrievingModule
RetrieveseachdocumentfromtheWeb
XE"Web:Web"
andgivesittotheProcessmodule.
ProcessModule
ProcessesdatafromtheRetrievingModule.SendsnewdiscoveredURLstotheURL
XE"URL"
ListingModuleandtheWeb
XE"Web:Web"
pagetexttotheFormat&StoreModule.
URL
XE"URL"
ListingModule
FeedstheRetrievingModuleusingitslistofURLs.
Format&StoreModule
Convertsdatatobetterformatandstoresitintotheindex.
XE"Index"
Index
Database/repository
XE"Repository"
XE"Crawler:Repository"
withtheusefuldataretrieved.
TheProcessModuleisthecoordinatingmodule.ItcontrolstheRetrievingModulethroughtheURL
XE"URL"
ListingModuleandpreparesdataforindexing.Itisalsoperformingsomeautomatictextanalysis(stemming,removingofhighfrequencywords,etc),classification(keywordclustering,documentclustering,etc.),filtering(notalldocumentswillbestored),etc.
SearchEnginesExamples
SearchEngineshavebeenquitearesearchedissueinthepastfiveyears,afterthepapersofKleinberg
XE"Kleinberg"
XE"JohnKleinberg"
(Klein1997)andBrin
XE"Brin"
XE"SergeyBrin"
(Brin1998a,Brin1998b)appeared.InitialresearchwasfocusedonlyonbuildingGoogle
XE"Google:Google"
-likeengines.However,intimeresearchhasconcentratedontwomainaspects:searchpersonalization
XE"Personalization"
andimprovingsearchspeed.Theformerismostlyorientedondevelopingpersonalizedpagerank
XE"Pagerank"
algorithms(Widom2002b,Guha,Anderson2002,Moba2000,Widom2002a).ThesealgorithmsareextensionsoftheoriginalGooglepagerankalgorithmandexploitthepersonalizationvector
XE"Personalizationvector"
presentedinthepaper(Brin1998a).
Furthermore,otherresearchershavetriedtobuildtopicorientedsearchengine
XE"Searchengine"
s(Frankel1996,Heritage).Whiletheseprovidebetterresultsthennormalsearchengines,usersfinditdifficulttoswitchbetweenlotsofengineswhenwillingtoqueryondifferenttopics.
Amoresensiblesubjectissearchengine
XE"Searchengine"
speed.Itinvolvescrawlingspeed,index
XE"Index"
accessspeedandpagerank
XE"Pagerank"
speed.FuturesolutionswillprobablyfocusonthedistributednatureoftheWWW
XE"WWW"
.Somearealreadytryingtobuilddistributedindexesortocomputethepagerankinadistributedmanner(Kamvar2003b,Have1999).Thelatterapproachisprovedtobequiteeffective.AlocalpagerankisfirstcomputedforeachstronglyconnectedcomponentoftheWWWgraph,andthentheseranksarecombinedintoaninitialapproximationoftheGoogle
XE"Google:Google"
pagerank.Thepossibleparallelismofthefirststepisobvious.
Manyotherchallengesappearwhenwritingsearchengine
XE"Searchengine"
software.OnlytheexponentiallygrowthoftheWeb
XE"Web:Web"
sizecanbeenoughforareason.Everydayabout7.3millionspagesareaddedtotheWebandmanyothersaremodifiedorremoved[Guil2002].Also,accordingto[Google
XE"Google:Google"
],currentWebgraphcontainsmorethan3billionnodes.Otherchallengesemergeimmediately:
Accessibility.NotallpagesareaccessibleatallthetimeandnotallpagesareconnectedtobigcomponentsoftheWeb
XE"Web:Web"
graph.Yet,suchpagesmightcontainvaluableinformationandtheirdevelopmentshouldbesupportedbysearchengine
XE"Searchengine"
s(here,wemeandevelopmentbysupportingWebpages/sitestogetknownfast,thusconvincingotherWebdesignerstoaddlinkstowardsthemonthesitestheyaredeveloping).
Crawling.AstheWeb
XE"Web:Web"
sizeisgrowingfast,crawlingallitspageswillbemoreandmoredifficult.ExistingcrawlersarealreadytryingtodifferentiatebetweendifferentregionsoftheWebgraph(someofthemareupdatedeveryfewhoursandshouldbecrawledintensively,whilesomeotherareupdatedeveryfewmonthsorevenmoreseldomandthereisnoneedtocheckthemtoooften).
Storingcrawl
XE"Crawler:Crawl"
information.Thisispracticallydeducedfromthepreviousone.Aswewillhavetocrawlmoreandmorepages,andasstoragespaceislimited,solutionsmustbefoundtostoreasmuchcrawlinformationaspossible(currentapproachesinvolveconvertingeveryfiletypetoatextfile,followedbythecompressionofthetextfiles,etc.).
Managinginformation.ConsideringthatsomehowwewillbeabletostoreeverythingthatisintheWeb
XE"Web:Web"
,wewillstillhavetofindsolutionsinordertoaccessthisinformationfastenough.AWebsearchusingasearchengine
XE"Searchengine"
shouldnevertakemorethanafewseconds.
Finally,letmepresentsomegeneralsearchengine
XE"Searchengine"
s.ThemostknownareGoogle
XE"Google:Google"
(Google)andYahoo
XE"Yahoo"
!(Yahoo),butoneoftheoldestsearchenginesisAltavista(Altavista).Allexistingsearchengineshaveweaknesses,evenGoogle(linksearchesmustbeexact,itdoesnotsupportfullBoolean,itonlyindexesapartofthewebpagesorPDFfiles,etc.).Thisissuerepresentsarealreasonforbuildingmoresearchenginetestbedsandalgorithms.
Youcanfindlinkstoothersearchengine
XE"Searchengine"
sattheendofthebibliography(chapter8).Iwillonlyremindheresomewell-knownnames,like[Excite],[Lycos],etc.
Chapter3.PagerankBasics.
EarlyPageRankingSolutions
ThequalityoftheGoogle
XE"Google:Google"
pagerank
XE"Pagerank"
algorithmcanbeeasilyobservedwhenreadingpagerankresearchbeforeGoogle.Allapproacheswereonlybasedonpageappearanceandusersurfing.
InthefollowingIshallpresenttheYahoo
XE"Yahoo"
!rankingtechniques(Yahoo).MuchhasbeenwrittenaboutYahooovertheyears.Webmastershaveexchangedtipsonhowtogetlistedinthedirectory,whilereportershavecoveredthecompany'spastsuccessandrecenttroubles.Yet,thealgorithmofYahoo'ssearchhasnotattractednearlyasmuchattentionforonereasonoranother.Speculationandgeneraladviceaboutithavebeenavailable,butonlyafewpeoplehaveseriouslyattemptedtoexplainhowYahoorankssites,astheYahooCompanydidnotpresentpubliclyitsalgorithms.
EvenwhileYahoo
XE"Yahoo"
isadirectoryandnotasearchengine
XE"Searchengine"
,itdoeshaveasearchfeaturewhichvisitorsoftenuse.AsitethatranksbadlyonYahoo'ssearchwillmissoutonagreatdealoftrafficthatthedirectorycouldpotentiallyproduce.
ExperimentallypeoplediscoveredthatYahoo
XE"Yahoo"
usesalleastthefollowingelementswhenrankingpages:
PageTitle–Apparentlythisisthemostimportantaspectandincludingthepagetitleinthequerywordshelpsalot.Furthermore,apagethatdoesnothaveanyofitskeywordsinthetitleissupposedtobereturnedveryseldomasatoprankedresult.
PageCategory–Ifonemarksthedesiredcategoryinthesearch,pagesonthatissuewillsurelyhaveahigherpagerank
XE"Pagerank"
.Thisisalsoconsideredtobeveryimportant.
URL
XE"URL"
–The(partial)URLofthepagecanbeusedtolimitthequery.
PageDescription–Includingwordsfromthepagedescriptioninthequerywouldonlyincreasesearchspeed.Therelevanceofthereturnedpagesremainsalmostthesame(onecanseehowprimitivethisapproachhasbecomeinonlyaboutonedecade).
Yahoo
XE"Yahoo"
wasthefirstsearchengine
XE"Searchengine"
withclickthroughtracking.Eachclickinthesetofreturnedpagesofaquerywasrememberedandstatisticswerecreated.Still,thebetterideacamealsowithGoogle
XE"Google:Google"
,whichdevelopedtheGoogleToolbar
XE"Google:GoogleToolbar"
fortrackingdownclicksoneveryWeb
XE"Web:Web"
page(notonlyonthepagesbelongingtothesearchengine).
Currently,asearchonYahoo
XE"Yahoo"
ismostlikelytoprovidethesameresultsasasearchonGoogle
XE"Google:Google"
duetothecontractsignedbetweenthetwocompanies.
TheKleinberg
XE"Kleinberg"
XE"JohnKleinberg"
HITS
XE"HITSAlgorithm"
Algorithm
IntroductionandBasicNotations
Thiswasthefirstrealgoodalgorithmonpageranking.TheideabehindtheKleinberg
XE"Kleinberg"
XE"JohnKleinberg"
algorithm(Klein1997)istocomputetwoscoresforeachpageinacommunity,thatisahub
XE"Hub"
scoreandanauthority
XE"Authority"
score.
Butwhatisahub
XE"Hub"
?Ahubcanbeexplainedasbeingapagepointingtomanyotherimportantpages.SupposeyouarelookingforanarticleontraditionalRomaniancookingtechniques.Youwillusuallystartusingasearchengine
XE"Searchengine"
(like[Google
XE"Google:Google"
],[Yahoo
XE"Yahoo"
],[Altavista],[Excite]),butthenyoumustseewhichofitsresultsarereallythepagesyouarelookingfor.Mostoftheactualsearchenginesreturnauthoritiesfirst,whicharepagescontainingvaluableinformationpointedtobymanyhubs.Soahubispointingtowardsmanyauthoritiesandanauthority
XE"Authority"
ispointingtowardsmanyhubs.Comingbacktoourscenario,thesesearchengineresultsaresometimesverygood(exactlythearticleyouwerelookingfor),andsometimesquiteawayfromyourinterests.Forthissecondcase,onemaytrytousehubs.Hubsareusuallyoneormoretopic-orientedandtrytogatherlinkstowardsasmany(good)pagesonthespecificissueaspossible.
TheAlgorithm
ThealgorithmstartswithasetRofpageswithhighpagerank
XE"Pagerank"
(asifitwerecomputedbyasearchengine
XE"Searchengine"
).Then,thissetisfirstlyextendedusingthefollowingmethod(laterinthispaperweshallcallitTheKleinberg
XE"Kleinberg"
XE"JohnKleinberg"
Extension
XE"TheKleinbergextension"
):
LetS=R
ForeachpagepinRdo
Let+(p)denotethesetofallpagesppointsto.
Let--(p)denotethesetofallpagespointingtop.
Addallpagesin+(p)toS
If|-(p)|<=dThen
Addallpagesin-(p)toS
Else
Addanarbitrarysetofdpagesfrom-(p)toS
EndFor
ReturnS
Intheoriginalpapertheextensionisappliedusingastartingsetwith200pagesandthedvalueequalto50.Itisnecessarytolimitthenumberofaddedpagesthatpointtoapagep,becausetherecanbethousandsofthem.Forexample,therearecurrentlyabout661,000pagespointingto
www
XE"WWW"
.
andabout252,000pagespointingto
.
Afterthetargetedsetofpagesisgenerated,thehubsandauthoritiesscorearecomputed.Twoweightsaredefinedforeachpagep:
Anauthority
XE"Authority"
weight
XE"Authorityweight"
x<p>
Ahub
XE"Hub"
weight
XE"Hubweight"
y<p>
Ifppointstomanypageswithlargex-values,thenitshouldreceivealargey-value;andifpispointedtobymanypageswithlargey-values,thenitshouldreceivealargex-value.Thismotivatesthedefinitionoftwooperationsontheweights,whichwedenotebyIandO.Givenweights{x<p>}and{y<p>}theIoperationupdatesthex-weightsasfollows:
HereweconsideredEtobethesetofexistinglinks.Thex-valueofpagepisthesumofally-valuesofpagesqpointingtop.
Analogously,theOoperationupdatesthey-weightsasfollows:
They-valueofpagepisthesumofallx-valuesofpagesqpointedtobyp.TheIandOoperationspracticallymeanthatthehubsandauthoritiesarereinforcingoneanother.Thecomputationofhubsandauthoritiesisnowasimpleiterationontheseformulas,likebelow:
Initializex0andy0with(1,1,1,…,1)inRn,wherenisthenumberofpages.
Givenaknumberofsteps,applythefollowingcomputationktimes:
ApplytheIoperationto(xi-1,yi-1)obtainingxi’
ApplytheOoperationto(xi’,yi-1)obtainingyi’
Normalizexi’obtainingxi
Normalizeyi’obtainingyi
End
Return(xk,yk)astheauthority
XE"Authority"
andhub
XE"Hub"
scoresrequired.
Ofcourse,thisalgorithmcanfurtherbeusedtofilterthetopcauthoritiesandhubs,startingfromabasesetofpages.Thealgorithmproposedin[Kleinberg1997]forthispurposeisdescribedbelow:
Letbeacollectionofpages
Letkandcbesomenaturalnumbers
Runthepreviousiterativealgorithmontocompute(xk,yk)
Reportthepageswithclargestcoordinatesinxkasauthorities
Reportthepageswithclargestcoordinatesinykashubs
TheGoogle
XE"Google:Google"
PagerankAlgorithm
PageRank
XE"Pagerank"
isatopicmuchdiscussedbySearchEngineOptimization
XE"SearchEngineOptimization"
(SEO)experts.AttheheartofPageRankwealwaysfindoneormoremathematicalformulas.
HowisPageRank
XE"Pagerank"
Used?
PageRank
XE"Pagerank"
isoneofthemethodsGoogle
XE"Google:Google"
usestodetermineapage’srelevanceorimportance.ItisonlyonepartofthestorywhenitcomestotheGooglelisting,andtheotheraspectswillbepresentedbelow.PageRankisalsodisplayedonthetoolbarofyourbrowserifyouhaveinstalledtheGoogletoolbar
XE"Google:GoogleToolbar"
(
/
).Still,theToolbarPageRankonlygoesfrom0–10andseemstobesomethinglikealogarithmicscale:
ToolbarPageRank
XE"Pagerank"
(logbase10)
RealPageRank
XE"Pagerank"
0
0-10
1
100-1,000
2
1,000-10,000
3
10,000-100,000
4
andsoon...
WecannotknowtheexactdetailsofthescalebecausethemaximumPageRank
XE"Pagerank"
ofallpagesontheWeb
XE"Web:Web"
changeseverymonthwhenGoogle
XE"Google:Google"
doesitsre-indexing.Alsothetoolbarsometimesguesses.ThetoolbaroftenshowsaToolbarPageRankforpagesjustuploadedandwhichcannotpossiblybeintheindex
XE"Index"
yet.
WhatishappeningisthatthetoolbarlooksattheURL
XE"URL"
ofthepagethebrowseritisdisplayingandstripsoffeverythingdownthelast“/”(i.e.itgoestothe“parent”pageinURLterms).IfGoogle
XE"Google:Google"
hasaToolbarPageRank
XE"Pagerank"
forthatparentthenitsubtracts1andshowsthatastheToolbarPageRankforthispage.IfthereisnoPageRankfortheparentitgoestotheparent’sparentpage,butsubtracting2,andsoonallthewayuptotherootofthesite.
IfitcannotfindaToolbarPageRanktodisplayinthisway,thatisifitdoesnotfindapagewitharealcalculatedPageRank
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2024年度企业环保责任评估合同2篇
- 二零二四年度高尔夫球场建设合同2篇
- 2024年国际离婚合同英文版样本版B版
- 2024年度防水工程改造合同2篇
- 2024年城市河道综合治理合同
- 2024年国际旅游业务代理合同
- 2024农产品的购销合同
- 沙卵石混合料买卖合同2024年度(含定制服务)2篇
- 2024年度生物制药研发与转让合同(含独家研发权)3篇
- 二零二四年度教育咨询服务合同(国际留学)3篇
- 氯化锌-危险化学品安全周知卡
- QC七大手法九大步骤八大原则资料演示文稿
- 40篇初中英语教资面试真题
- GB/T 5744-2008船用快关阀
- GB/T 30319-2013基础地理信息数据库基本规定
- 原地单手肩上投篮
- GB 20464-2006农作物种子标签通则
- 2023年油库储运操作规程
- 动环监控系统fsu现场安装调测操作指南
- 合同文本建设工程造价咨询合同(示范文本)
- 电子病历等级评审级解读优质推荐课件
评论
0/150
提交评论