毕业设计(论文)文献翻译译文:一个搜索引擎的体系结构_第1页
毕业设计(论文)文献翻译译文:一个搜索引擎的体系结构_第2页
毕业设计(论文)文献翻译译文:一个搜索引擎的体系结构_第3页
毕业设计(论文)文献翻译译文:一个搜索引擎的体系结构_第4页
毕业设计(论文)文献翻译译文:一个搜索引擎的体系结构_第5页
已阅读5页,还剩8页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

ArchitectureofaSearchEngine

XE"SearchEngine"

Theterm"searchengine

XE"Searchengine"

"isoftenusedgenericallytodescribebothcrawler

XE"Crawler:crawler"

-basedsearchenginesandhuman-powereddirectories.Thesetwotypesofsearchenginesgathertheirlistingsinradicallydifferentways.Crawler-basedsearchengines,suchasGoogle

XE"Google:Google"

,createtheirlistingsautomatically.They"crawl

XE"Crawler:crawl"

XE"Crawler:Crawl"

"or"spider

XE"Crawler:Spider"

"theweb,thenpeoplesearchthroughwhattheyhavefound.Ahuman-powereddirectory,suchastheOpenDirectory,dependsonhumansforitslistings.Yousubmitashortdescriptiontothedirectoryforyourentiresite,oreditorswriteoneforsitestheyreview.Asearchlooksformatchesonlyinthedescriptionssubmitted.

Atypicalcrawler

XE"Crawler:crawler"

-basedsearchengine

XE"Searchengine"

hasseveralmajorelements.Firstisthespider

XE"Crawler:Spider"

,alsocalledthecrawler.ThespidervisitsaWeb

XE"Web:Web"

page,readsit,andthenfollowslinkstootherpageswithinthesite.Thisiswhatitmeanswhensomeonereferstoasitebeing"spidered"or"crawled."Thespiderreturnstothesiteonaregularbasis,suchaseverymonthortwo,tolookforchanges.Everythingthespiderfindsgoesintothesecondpartofthesearchengine,theindex

XE"Index"

.Theindex,sometimescalledthecatalogue,islikeagiantbookcontainingacopyofeveryWebpagethatthespiderfinds.IfaWebpagechanges,thenthisbookisupdatedwithnewinformation.Searchengine

XE"SearchEngine"

software

XE"Searchenginesoftware"

isthethirdpartofasearchengine.Thisistheprogramthatsiftsthroughthemillionsofpagesrecordedintheindextofindmatchestoasearchandranktheminorderofwhatitbelievesismostrelevant.

Onecanalsopictureatypicalsearchengine

XE"Searchengine"

(ofanytype)usingthefollowingelements:

UserInterface

Thisisneededtotaketheuserquery.

SearchModule

Transformsthequerytoanunderstandableformat,thenperformsmatchingwiththeindex

XE"Index"

andfinallyreturnsresultsasoutputwiththeneededinformation.

Index

Adatabase/repository

XE"Repository"

XE"Crawler:Repository"

withthedatatobesearched.

Thearchitectureisdepictedinthefigurebelow:

Thesearchmoduleisthemostimportant.Therearelocatedmostofthesearchengine

XE"Searchengine"

algorithms,includingpagerank

XE"Pagerank"

algorithms,usedtosorttheoutputwhenpresentedtotheuser.Inthissecondapproach,thecrawler

XE"Crawler:crawler"

isconsideredtobe“behind”themainsearchengine,becauseitissomehowseparatefromit.

CrawlerArchitecture

Asearchengine

XE"Searchengine"

cannotworkwithoutaproperindex

XE"Index"

wherepossiblesearchedpagesarestored,usuallyinacompressedformat.Thisindexiscreatedbyspecializedrobots,whichcrawl

XE"Crawler:crawl"

XE"Crawler:Crawl"

theWeb

XE"Web:Web"

fornew/modifiedpages(theactualcrawlers,orspiders).

Typicalcrawler

XE"Crawler:crawler"

architecture

XE"Crawler:architecture"

isdepictedinthefigurebelow:

Letusnowinvestigateeachcomponent.

RetrievingModule

RetrieveseachdocumentfromtheWeb

XE"Web:Web"

andgivesittotheProcessmodule.

ProcessModule

ProcessesdatafromtheRetrievingModule.SendsnewdiscoveredURLstotheURL

XE"URL"

ListingModuleandtheWeb

XE"Web:Web"

pagetexttotheFormat&StoreModule.

URL

XE"URL"

ListingModule

FeedstheRetrievingModuleusingitslistofURLs.

Format&StoreModule

Convertsdatatobetterformatandstoresitintotheindex.

XE"Index"

Index

Database/repository

XE"Repository"

XE"Crawler:Repository"

withtheusefuldataretrieved.

TheProcessModuleisthecoordinatingmodule.ItcontrolstheRetrievingModulethroughtheURL

XE"URL"

ListingModuleandpreparesdataforindexing.Itisalsoperformingsomeautomatictextanalysis(stemming,removingofhighfrequencywords,etc),classification(keywordclustering,documentclustering,etc.),filtering(notalldocumentswillbestored),etc.

SearchEnginesExamples

SearchEngineshavebeenquitearesearchedissueinthepastfiveyears,afterthepapersofKleinberg

XE"Kleinberg"

XE"JohnKleinberg"

(Klein1997)andBrin

XE"Brin"

XE"SergeyBrin"

(Brin1998a,Brin1998b)appeared.InitialresearchwasfocusedonlyonbuildingGoogle

XE"Google:Google"

-likeengines.However,intimeresearchhasconcentratedontwomainaspects:searchpersonalization

XE"Personalization"

andimprovingsearchspeed.Theformerismostlyorientedondevelopingpersonalizedpagerank

XE"Pagerank"

algorithms(Widom2002b,Guha,Anderson2002,Moba2000,Widom2002a).ThesealgorithmsareextensionsoftheoriginalGooglepagerankalgorithmandexploitthepersonalizationvector

XE"Personalizationvector"

presentedinthepaper(Brin1998a).

Furthermore,otherresearchershavetriedtobuildtopicorientedsearchengine

XE"Searchengine"

s(Frankel1996,Heritage).Whiletheseprovidebetterresultsthennormalsearchengines,usersfinditdifficulttoswitchbetweenlotsofengineswhenwillingtoqueryondifferenttopics.

Amoresensiblesubjectissearchengine

XE"Searchengine"

speed.Itinvolvescrawlingspeed,index

XE"Index"

accessspeedandpagerank

XE"Pagerank"

speed.FuturesolutionswillprobablyfocusonthedistributednatureoftheWWW

XE"WWW"

.Somearealreadytryingtobuilddistributedindexesortocomputethepagerankinadistributedmanner(Kamvar2003b,Have1999).Thelatterapproachisprovedtobequiteeffective.AlocalpagerankisfirstcomputedforeachstronglyconnectedcomponentoftheWWWgraph,andthentheseranksarecombinedintoaninitialapproximationoftheGoogle

XE"Google:Google"

pagerank.Thepossibleparallelismofthefirststepisobvious.

Manyotherchallengesappearwhenwritingsearchengine

XE"Searchengine"

software.OnlytheexponentiallygrowthoftheWeb

XE"Web:Web"

sizecanbeenoughforareason.Everydayabout7.3millionspagesareaddedtotheWebandmanyothersaremodifiedorremoved[Guil2002].Also,accordingto[Google

XE"Google:Google"

],currentWebgraphcontainsmorethan3billionnodes.Otherchallengesemergeimmediately:

Accessibility.NotallpagesareaccessibleatallthetimeandnotallpagesareconnectedtobigcomponentsoftheWeb

XE"Web:Web"

graph.Yet,suchpagesmightcontainvaluableinformationandtheirdevelopmentshouldbesupportedbysearchengine

XE"Searchengine"

s(here,wemeandevelopmentbysupportingWebpages/sitestogetknownfast,thusconvincingotherWebdesignerstoaddlinkstowardsthemonthesitestheyaredeveloping).

Crawling.AstheWeb

XE"Web:Web"

sizeisgrowingfast,crawlingallitspageswillbemoreandmoredifficult.ExistingcrawlersarealreadytryingtodifferentiatebetweendifferentregionsoftheWebgraph(someofthemareupdatedeveryfewhoursandshouldbecrawledintensively,whilesomeotherareupdatedeveryfewmonthsorevenmoreseldomandthereisnoneedtocheckthemtoooften).

Storingcrawl

XE"Crawler:Crawl"

information.Thisispracticallydeducedfromthepreviousone.Aswewillhavetocrawlmoreandmorepages,andasstoragespaceislimited,solutionsmustbefoundtostoreasmuchcrawlinformationaspossible(currentapproachesinvolveconvertingeveryfiletypetoatextfile,followedbythecompressionofthetextfiles,etc.).

Managinginformation.ConsideringthatsomehowwewillbeabletostoreeverythingthatisintheWeb

XE"Web:Web"

,wewillstillhavetofindsolutionsinordertoaccessthisinformationfastenough.AWebsearchusingasearchengine

XE"Searchengine"

shouldnevertakemorethanafewseconds.

Finally,letmepresentsomegeneralsearchengine

XE"Searchengine"

s.ThemostknownareGoogle

XE"Google:Google"

(Google)andYahoo

XE"Yahoo"

!(Yahoo),butoneoftheoldestsearchenginesisAltavista(Altavista).Allexistingsearchengineshaveweaknesses,evenGoogle(linksearchesmustbeexact,itdoesnotsupportfullBoolean,itonlyindexesapartofthewebpagesorPDFfiles,etc.).Thisissuerepresentsarealreasonforbuildingmoresearchenginetestbedsandalgorithms.

Youcanfindlinkstoothersearchengine

XE"Searchengine"

sattheendofthebibliography(chapter8).Iwillonlyremindheresomewell-knownnames,like[Excite],[Lycos],etc.

Chapter3.PagerankBasics.

EarlyPageRankingSolutions

ThequalityoftheGoogle

XE"Google:Google"

pagerank

XE"Pagerank"

algorithmcanbeeasilyobservedwhenreadingpagerankresearchbeforeGoogle.Allapproacheswereonlybasedonpageappearanceandusersurfing.

InthefollowingIshallpresenttheYahoo

XE"Yahoo"

!rankingtechniques(Yahoo).MuchhasbeenwrittenaboutYahooovertheyears.Webmastershaveexchangedtipsonhowtogetlistedinthedirectory,whilereportershavecoveredthecompany'spastsuccessandrecenttroubles.Yet,thealgorithmofYahoo'ssearchhasnotattractednearlyasmuchattentionforonereasonoranother.Speculationandgeneraladviceaboutithavebeenavailable,butonlyafewpeoplehaveseriouslyattemptedtoexplainhowYahoorankssites,astheYahooCompanydidnotpresentpubliclyitsalgorithms.

EvenwhileYahoo

XE"Yahoo"

isadirectoryandnotasearchengine

XE"Searchengine"

,itdoeshaveasearchfeaturewhichvisitorsoftenuse.AsitethatranksbadlyonYahoo'ssearchwillmissoutonagreatdealoftrafficthatthedirectorycouldpotentiallyproduce.

ExperimentallypeoplediscoveredthatYahoo

XE"Yahoo"

usesalleastthefollowingelementswhenrankingpages:

PageTitle–Apparentlythisisthemostimportantaspectandincludingthepagetitleinthequerywordshelpsalot.Furthermore,apagethatdoesnothaveanyofitskeywordsinthetitleissupposedtobereturnedveryseldomasatoprankedresult.

PageCategory–Ifonemarksthedesiredcategoryinthesearch,pagesonthatissuewillsurelyhaveahigherpagerank

XE"Pagerank"

.Thisisalsoconsideredtobeveryimportant.

URL

XE"URL"

–The(partial)URLofthepagecanbeusedtolimitthequery.

PageDescription–Includingwordsfromthepagedescriptioninthequerywouldonlyincreasesearchspeed.Therelevanceofthereturnedpagesremainsalmostthesame(onecanseehowprimitivethisapproachhasbecomeinonlyaboutonedecade).

Yahoo

XE"Yahoo"

wasthefirstsearchengine

XE"Searchengine"

withclickthroughtracking.Eachclickinthesetofreturnedpagesofaquerywasrememberedandstatisticswerecreated.Still,thebetterideacamealsowithGoogle

XE"Google:Google"

,whichdevelopedtheGoogleToolbar

XE"Google:GoogleToolbar"

fortrackingdownclicksoneveryWeb

XE"Web:Web"

page(notonlyonthepagesbelongingtothesearchengine).

Currently,asearchonYahoo

XE"Yahoo"

ismostlikelytoprovidethesameresultsasasearchonGoogle

XE"Google:Google"

duetothecontractsignedbetweenthetwocompanies.

TheKleinberg

XE"Kleinberg"

XE"JohnKleinberg"

HITS

XE"HITSAlgorithm"

Algorithm

IntroductionandBasicNotations

Thiswasthefirstrealgoodalgorithmonpageranking.TheideabehindtheKleinberg

XE"Kleinberg"

XE"JohnKleinberg"

algorithm(Klein1997)istocomputetwoscoresforeachpageinacommunity,thatisahub

XE"Hub"

scoreandanauthority

XE"Authority"

score.

Butwhatisahub

XE"Hub"

?Ahubcanbeexplainedasbeingapagepointingtomanyotherimportantpages.SupposeyouarelookingforanarticleontraditionalRomaniancookingtechniques.Youwillusuallystartusingasearchengine

XE"Searchengine"

(like[Google

XE"Google:Google"

],[Yahoo

XE"Yahoo"

],[Altavista],[Excite]),butthenyoumustseewhichofitsresultsarereallythepagesyouarelookingfor.Mostoftheactualsearchenginesreturnauthoritiesfirst,whicharepagescontainingvaluableinformationpointedtobymanyhubs.Soahubispointingtowardsmanyauthoritiesandanauthority

XE"Authority"

ispointingtowardsmanyhubs.Comingbacktoourscenario,thesesearchengineresultsaresometimesverygood(exactlythearticleyouwerelookingfor),andsometimesquiteawayfromyourinterests.Forthissecondcase,onemaytrytousehubs.Hubsareusuallyoneormoretopic-orientedandtrytogatherlinkstowardsasmany(good)pagesonthespecificissueaspossible.

TheAlgorithm

ThealgorithmstartswithasetRofpageswithhighpagerank

XE"Pagerank"

(asifitwerecomputedbyasearchengine

XE"Searchengine"

).Then,thissetisfirstlyextendedusingthefollowingmethod(laterinthispaperweshallcallitTheKleinberg

XE"Kleinberg"

XE"JohnKleinberg"

Extension

XE"TheKleinbergextension"

):

LetS=R

ForeachpagepinRdo

Let+(p)denotethesetofallpagesppointsto.

Let--(p)denotethesetofallpagespointingtop.

Addallpagesin+(p)toS

If|-(p)|<=dThen

Addallpagesin-(p)toS

Else

Addanarbitrarysetofdpagesfrom-(p)toS

EndFor

ReturnS

Intheoriginalpapertheextensionisappliedusingastartingsetwith200pagesandthedvalueequalto50.Itisnecessarytolimitthenumberofaddedpagesthatpointtoapagep,becausetherecanbethousandsofthem.Forexample,therearecurrentlyabout661,000pagespointingto

www

XE"WWW"

.

andabout252,000pagespointingto

.

Afterthetargetedsetofpagesisgenerated,thehubsandauthoritiesscorearecomputed.Twoweightsaredefinedforeachpagep:

Anauthority

XE"Authority"

weight

XE"Authorityweight"

x<p>

Ahub

XE"Hub"

weight

XE"Hubweight"

y<p>

Ifppointstomanypageswithlargex-values,thenitshouldreceivealargey-value;andifpispointedtobymanypageswithlargey-values,thenitshouldreceivealargex-value.Thismotivatesthedefinitionoftwooperationsontheweights,whichwedenotebyIandO.Givenweights{x<p>}and{y<p>}theIoperationupdatesthex-weightsasfollows:

HereweconsideredEtobethesetofexistinglinks.Thex-valueofpagepisthesumofally-valuesofpagesqpointingtop.

Analogously,theOoperationupdatesthey-weightsasfollows:

They-valueofpagepisthesumofallx-valuesofpagesqpointedtobyp.TheIandOoperationspracticallymeanthatthehubsandauthoritiesarereinforcingoneanother.Thecomputationofhubsandauthoritiesisnowasimpleiterationontheseformulas,likebelow:

Initializex0andy0with(1,1,1,…,1)inRn,wherenisthenumberofpages.

Givenaknumberofsteps,applythefollowingcomputationktimes:

ApplytheIoperationto(xi-1,yi-1)obtainingxi’

ApplytheOoperationto(xi’,yi-1)obtainingyi’

Normalizexi’obtainingxi

Normalizeyi’obtainingyi

End

Return(xk,yk)astheauthority

XE"Authority"

andhub

XE"Hub"

scoresrequired.

Ofcourse,thisalgorithmcanfurtherbeusedtofilterthetopcauthoritiesandhubs,startingfromabasesetofpages.Thealgorithmproposedin[Kleinberg1997]forthispurposeisdescribedbelow:

Letbeacollectionofpages

Letkandcbesomenaturalnumbers

Runthepreviousiterativealgorithmontocompute(xk,yk)

Reportthepageswithclargestcoordinatesinxkasauthorities

Reportthepageswithclargestcoordinatesinykashubs

TheGoogle

XE"Google:Google"

PagerankAlgorithm

PageRank

XE"Pagerank"

isatopicmuchdiscussedbySearchEngineOptimization

XE"SearchEngineOptimization"

(SEO)experts.AttheheartofPageRankwealwaysfindoneormoremathematicalformulas.

HowisPageRank

XE"Pagerank"

Used?

PageRank

XE"Pagerank"

isoneofthemethodsGoogle

XE"Google:Google"

usestodetermineapage’srelevanceorimportance.ItisonlyonepartofthestorywhenitcomestotheGooglelisting,andtheotheraspectswillbepresentedbelow.PageRankisalsodisplayedonthetoolbarofyourbrowserifyouhaveinstalledtheGoogletoolbar

XE"Google:GoogleToolbar"

(

/

).Still,theToolbarPageRankonlygoesfrom0–10andseemstobesomethinglikealogarithmicscale:

ToolbarPageRank

XE"Pagerank"

(logbase10)

RealPageRank

XE"Pagerank"

0

0-10

1

100-1,000

2

1,000-10,000

3

10,000-100,000

4

andsoon...

WecannotknowtheexactdetailsofthescalebecausethemaximumPageRank

XE"Pagerank"

ofallpagesontheWeb

XE"Web:Web"

changeseverymonthwhenGoogle

XE"Google:Google"

doesitsre-indexing.Alsothetoolbarsometimesguesses.ThetoolbaroftenshowsaToolbarPageRankforpagesjustuploadedandwhichcannotpossiblybeintheindex

XE"Index"

yet.

WhatishappeningisthatthetoolbarlooksattheURL

XE"URL"

ofthepagethebrowseritisdisplayingandstripsoffeverythingdownthelast“/”(i.e.itgoestothe“parent”pageinURLterms).IfGoogle

XE"Google:Google"

hasaToolbarPageRank

XE"Pagerank"

forthatparentthenitsubtracts1andshowsthatastheToolbarPageRankforthispage.IfthereisnoPageRankfortheparentitgoestotheparent’sparentpage,butsubtracting2,andsoonallthewayuptotherootofthesite.

IfitcannotfindaToolbarPageRanktodisplayinthisway,thatisifitdoesnotfindapagewitharealcalculatedPageRank

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论