Principles of Web Search Engines
陈光, School of Information and Communication Engineering

Similarity & Clustering

What is clustering? (Ch. 16)
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning
- Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given
- A common and important task that finds many applications in IR and other places

A data set with clear cluster structure (Ch. 16)
- How would you design an algorithm for finding the three clusters in this case?
Applications of clustering in Web Search (Sec. 16.1)
- Whole corpus analysis/navigation
  - Better user interface: search without typing
- For improving recall in search applications
  - Better search results (like pseudo RF)
- For better navigation of search results
  - Effective “user recall” will be higher
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

Sample of Yahoo! (Sec. 16.1)
- Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering
- [Figure: excerpt of the Yahoo! hierarchy: agriculture (dairy, crops, agronomy, forestry), biology (botany, cell, evolution), physics (magnetism, relativity), CS (AI, HCI, courses), space (craft, missions), ...]

Sample of News
- Google News: automatic clustering gives an effective news presentation metaphor

Sample of Scatter/Gather (Sec. 16.1)
- Scatter/Gather: Cutting, Karger, and Pedersen
- [Figure: the Scatter/Gather interface]

Sample of Visualization
- For visualizing a document collection and its themes
- Wise et al., “Visualizing the non-visual”, PNNL
- ThemeScapes, Cartia
  - [Mountain height = cluster size]
For improving search recall (Sec. 16.1)
- Cluster hypothesis - Documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Hope if we do this: the query “car” will also return docs containing “automobile”
  - Because clustering grouped together docs containing “car” with those containing “automobile”.

For better navigation of search results (Sec. 16.1)
- For grouping search results thematically
Issues for clustering (Sec. 16.2)
- Representation for clustering
  - Document representation
    - Vector space? Normalization?
    - Centroids aren’t length normalized
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
    - Avoid “trivial” clusters - too large or small
    - If a cluster’s too large, then for navigation purposes you’ve wasted an extra user click without whittling down the set of documents much.

Notion of similarity/distance
- Ideal: semantic similarity.
- Practical: term-statistical similarity
  - We will use cosine similarity.
  - Docs as vectors.
  - For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.
  - We will mostly speak of Euclidean distance
    - But real implementations use cosine similarity
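As a concrete illustration of the cosine-similarity versus Euclidean-distance point above, here is a minimal sketch (not from the slides); the helper names and the toy vectors are my own, and only the standard identity for length-normalized vectors is assumed.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))

# Two toy tf-idf-like document vectors (hypothetical values).
d1 = np.array([0.0, 2.0, 1.0, 3.0])
d2 = np.array([1.0, 1.0, 0.0, 2.0])
print(cosine_similarity(d1, d2), euclidean_distance(d1, d2))

# For length-normalized vectors the two notions are monotonically related:
# ||u - v||^2 = 2 - 2*cos(u, v), so ranking by either gives the same order.
u, v = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
print(np.isclose(euclidean_distance(u, v) ** 2, 2 - 2 * cosine_similarity(u, v)))
```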
Clustering Algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
    - K-means clustering
    - (Model based clustering)
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

Hard vs. soft clustering
- Hard clustering: Each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: A document can belong to more than one cluster.
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
  - You can only do that with a soft clustering approach.
Partitioning Algorithms
- Partitioning method: Construct a partition of n documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition of K clusters that optimizes the chosen partitioning criterion
  - Globally optimal
    - Intractable for many objective functions
    - Ergo, exhaustively enumerate all partitions
  - Effective heuristic methods: K-means and K-medoids algorithms

K-Means (Sec. 16.4)
- Assumes documents are real-valued vectors.
- Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
  $\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities)
K-Means Algorithm (Sec. 16.4)
- Select K random docs {s1, s2, ..., sK} as seeds.
- Until clustering converges (or other stopping criterion):
  - For each doc di:
    - Assign di to the cluster cj such that dist(xi, sj) is minimal.
  - (Next, update the seeds to the centroid of each cluster)
  - For each cluster cj:
    - sj = μ(cj)

K-Means Example (K = 2) (Sec. 16.4)
- [Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]
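A minimal NumPy sketch of the loop just described (random seeds, reassignment, centroid recomputation). This is an illustration rather than the course's reference code: the function names, the convergence test, and the random toy data are my own choices.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means: X is an (N, M) array of document vectors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # K random docs as seeds
    for _ in range(iters):
        # Assign each doc to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster (keep old one if empty).
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(K)
        ])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return assign, centroids

X = np.random.default_rng(1).normal(size=(60, 5))   # toy "documents"
labels, centers = kmeans(X, K=3)
print(labels[:10], centers.shape)
```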
Termination conditions (Sec. 16.4)
- Several possibilities, e.g.,
  - A fixed number of iterations.
  - Doc partition unchanged.
  - Centroid positions don’t change.

Convergence (Sec. 16.4)
- Why should the K-means algorithm ever reach a fixed point?
  - A state in which clusters don’t change.
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
  - EM is known to converge.
  - Number of iterations could be large.
    - But in practice usually isn’t
Convergence of K-Means (Sec. 16.4)
- Define goodness measure of cluster k as the sum of squared distances from the cluster centroid:
  - $G_k = \sum_i (d_i - c_k)^2$ (sum over all $d_i$ in cluster k; note the lower-case $c_k$, the centroid of cluster k)
  - $G = \sum_k G_k$
- Reassignment monotonically decreases G since each vector is assigned to the closest centroid.

Convergence of K-Means (cont.) (Sec. 16.4)
- Recomputation monotonically decreases each $G_k$ since ($m_k$ is the number of members in cluster k):
  - $\sum_i (d_i - a)^2$ reaches its minimum for $\sum_i -2(d_i - a) = 0$,
  - i.e. $\sum_i d_i = \sum_i a = m_k a$,
  - i.e. $a = \frac{1}{m_k} \sum_i d_i = c_k$
- K-means typically converges quickly
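The monotone decrease of G is easy to watch numerically. The short sketch below (my own, self-contained, with made-up data) prints $G$ after each reassignment step so the two alternating decreases can be observed.

```python
import numpy as np

def rss(X, assign, centroids):
    """G = sum over clusters k of the squared distances to centroid c_k."""
    return sum(
        float(((X[assign == k] - centroids[k]) ** 2).sum())
        for k in range(len(centroids))
    )

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))                        # toy document vectors
centroids = X[rng.choice(len(X), 3, replace=False)]

for step in range(5):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)                   # reassignment step
    print(step, round(rss(X, assign, centroids), 3))
    centroids = np.array([                          # recomputation step
        X[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
        for k in range(3)
    ])
```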
Time Complexity (Sec. 16.4)
- Computing the distance between two docs is O(M), where M is the dimensionality of the vectors.
- Reassigning clusters: O(KN) distance computations, or O(KNM).
- Computing centroids: each doc gets added once to some centroid: O(NM).
- Assume these two steps are each done once for I iterations: O(IKNM).
Seed Choice (Sec. 16.4)
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  - Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  - Try out multiple starting points
  - Initialize with the results of another method.
- Example showing sensitivity to seeds:
  - In the above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}
  - If you start with D and F you converge to {A,B,D,E} and {C,F}
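The "doc least similar to any existing mean" heuristic can be sketched as a farthest-first seed picker (close in spirit to k-means++). This is my own illustration, not code from the course; in practice one would also try several starting points and keep the run with the lowest G.

```python
import numpy as np

def farthest_first_seeds(X, K, seed=0):
    """Pick the first seed at random, then repeatedly pick the doc
    farthest from (least similar to) all seeds chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Distance of every doc to its closest existing seed.
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(seeds)[None, :, :], axis=2),
            axis=1,
        )
        seeds.append(X[int(d.argmax())])
    return np.array(seeds)

X = np.random.default_rng(2).normal(size=(50, 3))
print(farthest_first_seeds(X, K=4).shape)   # (4, 3)
```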
K-means issues, variations, etc. (Sec. 16.4)
- Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve the speed of convergence of K-means
- Assumes clusters are spherical in vector space
  - Sensitive to coordinate changes, weighting etc.
- Disjoint and exhaustive
  - Doesn’t have a notion of “outliers” by default
  - But can add outlier filtering
How Many Clusters?
- Number of clusters K is given
  - Partition n docs into a predetermined number of clusters
- Finding the “right” number of clusters is part of the problem
  - Given docs, partition into an “appropriate” number of subsets.
  - E.g., for query results - the ideal value of K is not known up front - though the UI may impose limits.
K not specified in advance
- Say, the results of a query.
- Solve an optimization problem: penalize having lots of clusters
  - application dependent, e.g., a compressed summary of the search results list.
- Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

K not specified in advance
- Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid
- Define the Total Benefit to be the sum of the individual doc Benefits.
Penalize lots of clusters
- For each cluster, we have a Cost C.
- Thus for a clustering with K clusters, the Total Cost is KC.
- Define the Value of a clustering to be Total Benefit - Total Cost.
- Find the clustering of highest Value, over all choices of K.
  - Total benefit increases with increasing K. But we can stop when it doesn’t increase by “much”. The Cost term enforces this.
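One hedged way to turn the Benefit/Cost recipe above into code: cluster for each candidate K, score the result as the total cosine similarity of docs to their centroids minus K times an assumed per-cluster cost, and keep the best. The cost constant, the K range, the random toy matrix, and the use of scikit-learn's KMeans are all my own illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_value(X, labels, centers, cost_per_cluster):
    """Value = Total Benefit (sum of cosine(doc, its centroid)) - K * C."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    benefit = float(np.sum(Xn * Cn[labels]))
    return benefit - cost_per_cluster * len(centers)

X = np.abs(np.random.default_rng(0).normal(size=(100, 20)))  # toy nonnegative "tf-idf" matrix
C = 2.0                                                      # assumed application-dependent cost per cluster
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = clustering_value(X, km.labels_, km.cluster_centers_, C)
print("chosen K:", max(scores, key=scores.get))
```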
Clustering (cont’d)
- Two important paradigms:
  - Bottom-up agglomerative clustering
  - Top-down partitioning
- Visualisation techniques: embedding of the corpus in a low-dimensional space
- Characterising the entities:
  - Internally: vector space model, probabilistic models
  - Externally: measure of similarity/dissimilarity between pairs

Clustering: Parameters
- Similarity measure (e.g., cosine similarity): $\rho(d_1, d_2)$
- Distance measure (e.g., Euclidean distance): $\delta(d_1, d_2)$
- Number ‘k’ of clusters
Clustering: Formal specification
- Partitioning Approaches
  - Bottom-up clustering
  - Top-down clustering
- Geometric Embedding Approaches
  - Self-organization map (SOM)
  - Multidimensional scaling (MDS)
  - Latent semantic indexing (LSI)
- Generative models and probabilistic approaches
  - Single topic per document
  - Documents correspond to mixtures of multiple topics

Partitioning Approaches
- Two ways to get partitions: bottom-up clustering and top-down clustering
Partitioning Approaches (cont’d)
- Partition the document collection into k clusters {D1, D2, ..., Dk}
- Choices:
  - Minimize intra-cluster distance $\sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)$
  - Maximize intra-cluster semblance $\sum_i \sum_{d_1, d_2 \in D_i} \rho(d_1, d_2)$
- If cluster representations $D_i$ are available
  - Minimize $\sum_i \sum_{d \in D_i} \delta(d, D_i)$
  - Maximize $\sum_i \sum_{d \in D_i} \rho(d, D_i)$
- Soft clustering
  - d assigned to $D_i$ with ‘confidence’ $z_{d,i}$
  - Find $z_{d,i}$ so as to minimize $\sum_i \sum_d z_{d,i}\, \delta(d, D_i)$ or maximize $\sum_i \sum_d z_{d,i}\, \rho(d, D_i)$
Bottom-up clustering (HAC)
- HAC = hierarchical agglomerative clustering
- For each group Γ keep track of the best merge partner Δ
- Use the above info to plot the hierarchical merging process (DENDROGRAM)
- To get the desired number of clusters: cut across any level of the dendrogram

Dendrogram
- A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
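A minimal sketch of bottom-up agglomerative clustering and the "cut the dendrogram" step described above, using SciPy. The choice of group-average linkage with cosine distances and the toy data are my own; they match the spirit of the slides but are not the course's code.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.random.default_rng(0).normal(size=(12, 4))    # toy document vectors

Z = linkage(X, method="average", metric="cosine")    # group-average HAC merge history
labels = fcluster(Z, t=3, criterion="maxclust")      # cut the dendrogram into 3 clusters
print(labels)

# dendrogram(Z) draws the merge tree; plotting needs matplotlib:
#   import matplotlib.pyplot as plt; dendrogram(Z); plt.show()
```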
Similarity measure
- Typically s(Γ, Δ) decreases with an increasing number of merges
- Self-Similarity
  - Average pairwise similarity between documents in Γ:
    $s(\Gamma) = \frac{1}{\binom{|\Gamma|}{2}} \sum_{d_1, d_2 \in \Gamma,\ d_1 \neq d_2} s(d_1, d_2)$
  - $s(d_1, d_2)$ = inter-document similarity measure (say, cosine of TFIDF vectors)
- Other criteria: maximum/minimum pairwise similarity between documents in the clusters
Computation
- Un-normalized group profile: $\hat{p}(\Gamma) = \sum_{d \in \Gamma} \hat{d}$
- Can show:
  $s(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma|\,(|\Gamma| - 1)}$
  $s(\Gamma \cup \Delta) = \frac{\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)\,(|\Gamma| + |\Delta| - 1)}$
  $\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle + 2\,\langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle$
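The "can show" identity is easy to check numerically: with unit-length document vectors, the average pairwise similarity equals the expression built from the group profile $\hat{p}(\Gamma)$. The short sketch below is my own and uses random unit vectors purely as a check.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 5))
G /= np.linalg.norm(G, axis=1, keepdims=True)     # unit-normalized doc vectors d-hat

# Average pairwise similarity over ordered pairs d1 != d2.
n = len(G)
sims = G @ G.T
s_direct = (sims.sum() - np.trace(sims)) / (n * (n - 1))

# Same quantity via the un-normalized group profile p(Gamma) = sum of d-hat.
p = G.sum(axis=0)
s_profile = (p @ p - n) / (n * (n - 1))

print(np.isclose(s_direct, s_profile))   # True
```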
Switch to top-down
- Bottom-up
  - Requires quadratic time and space
- Top-down or move-to-nearest
  - Internal representation for documents as well as clusters
  - Partition documents into ‘k’ clusters
  - 2 variants
    - ‘Hard’ (0/1) assignment of documents to clusters
    - ‘Soft’: documents belong to clusters, with fractional scores
  - Termination
    - when the assignment of documents to clusters ceases to change (much), OR
    - when cluster centroids move negligibly over successive iterations
Top-down clustering
- Hard k-Means: Repeat…
  - Choose k arbitrary ‘centroids’
  - Assign each document to the nearest centroid
  - Recompute centroids
- Soft k-Means:
  - Don’t break close ties between document assignments to clusters
  - Don’t make documents contribute to a single cluster which wins narrowly
    - The contribution for updating cluster centroid $\mu_c$ from document d is related to the current similarity between $\mu_c$ and d:
      $\Delta\mu_c = \eta\, \frac{\exp(-|d - \mu_c|^2)}{\sum_{\gamma} \exp(-|d - \mu_\gamma|^2)}\, (d - \mu_c)$, and $\mu_c \leftarrow \mu_c + \Delta\mu_c$
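A sketch of one pass of the soft update above in NumPy. The learning-rate value and the data are made up; the weights follow the displayed formula (with a small shift for numerical stability, which is my addition, not part of the slide).

```python
import numpy as np

def soft_kmeans_step(X, centroids, eta=0.1):
    """One pass of soft k-means: every document nudges every centroid,
    weighted by exp(-|d - mu_c|^2) normalized over all clusters."""
    mu = centroids.copy()
    for d in X:
        sq = np.sum((d - mu) ** 2, axis=1)      # |d - mu_c|^2 for every cluster c
        w = np.exp(-(sq - sq.min()))            # shift by the min for numerical stability
        w /= w.sum()                            # soft assignment weights
        mu += eta * w[:, None] * (d - mu)       # mu_c <- mu_c + eta * w_c * (d - mu_c)
    return mu

X = np.random.default_rng(0).normal(size=(30, 2))
mu0 = X[:3].copy()
print(soft_kmeans_step(X, mu0))
```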
Seeding ‘k’ clusters
- Phase I:
  - Randomly sample $O(\sqrt{kn})$ documents
  - Run the bottom-up group-average clustering algorithm to reduce them to k groups or clusters: O(kn log n) time
- Phase II: iterate assign-to-nearest
  - Move each document to the nearest cluster
  - Recompute cluster centroids
  - Total time taken is O(kn)
- Two phases: O(kn log n)
Visualisation techniques
- Goal: embedding of the corpus in a low-dimensional space
- Hierarchical Agglomerative Clustering (HAC)
  - lends itself easily to visualisation
- Self-Organization map (SOM)
  - A close cousin of k-means
- Multidimensional scaling (MDS)
  - minimize the distortion of interpoint distances in the low-dimensional embedding as compared to the dissimilarity given in the input data.
- Latent Semantic Indexing (LSI)
  - Linear transformations to reduce the number of dimensions
Self-Organization Map (SOM)
- Like soft k-means
  - Determine the association between clusters and documents
  - Associate a representative vector $\mu_c$ with each cluster and iteratively refine it
- Unlike k-means
  - Embed the clusters in a low-dimensional space right from the beginning
  - A large number of clusters can be initialized even if eventually many are to remain devoid of documents
- Each cluster can be a slot in a square/hexagonal grid.
- The grid structure defines the neighborhood N(c) for each cluster c
- Also involves a proximity function h(c, γ) between clusters γ and c

“A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps are different from other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space.” —— from Wikipedia
SOM: Update Rule
- Like a neural network
- Data item d activates the neuron (closest cluster) $c_d$ as well as the neighborhood neurons $N(c_d)$
- E.g., Gaussian neighborhood function:
  $h(c, \gamma) = \exp\!\left(-\frac{\|\mu_c - \mu_\gamma\|^2}{2\sigma^2(t)}\right)$
- Update rule for node $\gamma$ under the influence of d:
  $\mu_\gamma(t+1) = \mu_\gamma(t) + \eta(t)\, h(\gamma, c_d)\,(d - \mu_\gamma)$
  where $\eta(t)$ is the learning rate parameter
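A compact sketch of the update rule above on a small 2-D grid. The grid size, the decay schedules for $\eta(t)$ and $\sigma(t)$, and the data are my own illustrative choices; the neighborhood here is measured on the grid coordinates (a common SOM implementation choice), whereas the slide's formula writes it in terms of the cluster vectors themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 4, 4, 3
mu = rng.normal(size=(grid_h, grid_w, dim))      # one reference vector per grid cell
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

X = rng.normal(size=(200, dim))                  # toy documents
for t, d in enumerate(X):
    eta = 0.5 * np.exp(-t / 200)                 # learning rate eta(t)
    sigma = 2.0 * np.exp(-t / 200)               # neighborhood width sigma(t)
    # Winning cell c_d: the closest reference vector to d.
    dists = np.linalg.norm(mu - d, axis=-1)
    cd = np.unravel_index(dists.argmin(), dists.shape)
    # Gaussian neighborhood h(gamma, c_d) over the grid, then the update rule.
    grid_dist2 = np.sum((coords - np.array(cd)) ** 2, axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))
    mu += eta * h[..., None] * (d - mu)

print(mu.shape)   # (4, 4, 3): the trained map
```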
SOM: Example I
- SOM computed from over a million documents taken from 80 Usenet news groups. Light areas have a high density of documents. The region shown is near the groups pc.chips and pc.video, and closer inspection shows a number of URLs in this region that are about PC video cards.

SOM: Example II
- Another example of SOM at work: the sites listed in the Open Directory Project have been organized within a map of Antarctica, at antarcti.ca/ (a). Clicking on a region maintains context (inset) and zooms in on more specific topics (b). Documents are located at the cluster to which they are most similar.
Multidimensional Scaling (MDS)
- Goal: “distance preserving” low-dimensional embedding of documents
- Symmetric inter-document distances $d_{ij}$
  - Given a priori or computed from the internal representation
- Coarse-grained user feedback
  - “documents i and j are quite dissimilar” or “document i is more similar to j than to k.”
  - The user provides a similarity $\hat{d}_{ij}$ between documents i and j.
  - With increasing feedback, prior distances are overridden

Multidimensional Scaling (MDS) (cont’d)
- Goal of MDS:
  - To represent documents as points in a low-dimensional space (often 2D to 3D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
- Objective: minimize the stress of the embedding:
  $\text{stress} = \frac{\sum_{i,j} (\hat{d}_{ij} - d_{ij})^2}{\sum_{i,j} d_{ij}^2}$
MDS: issues
- Stress is not easy to optimize
- Iterative hill climbing
  - Points (documents) are assigned random coordinates by an external heuristic
  - Points are moved by a small distance in the direction of locally decreasing stress
- For n documents
  - Each takes O(n) time to be moved
  - Totally O(n^2) time per relaxation
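A small sketch of the stress function above together with a crude hill-climbing relaxation (random small moves kept only when they lower the stress). This is entirely illustrative: the data, the step size, and the accept-if-better scheme are my own, not the course's reference implementation.

```python
import numpy as np

def stress(D_target, Y):
    """Normalized residual between the embedding's inter-point distances
    and the given inter-document distances, in the spirit of the formula above."""
    D_emb = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return np.sum((D_emb - D_target) ** 2) / np.sum(D_target ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                               # original high-dimensional docs
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # target inter-document distances

Y = rng.normal(size=(20, 2))                                # random initial 2-D coordinates
for _ in range(2000):
    i = rng.integers(len(Y))
    Y_try = Y.copy()
    Y_try[i] += 0.05 * rng.normal(size=2)                   # small move of one point
    if stress(D, Y_try) < stress(D, Y):                     # keep it only if stress decreases
        Y = Y_try
print(round(stress(D, Y), 4))
```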
Fast Map [Faloutsos ’95]
- No internal representation of documents available
- Goal: find a projection from an ‘n’ dimensional space to a space with a smaller number ‘k’ of dimensions.
- Iterative projection of documents along lines of maximum spread
- Each 1D projection preserves distance information

Fast Map - Best line
- Pivots for a line: two points (a and b) that determine it
- Avoid exhaustive checking by picking pivots that are far apart
- First coordinate $x_1$ of point x on the “best line” (a, b):
  $x_1 = \frac{d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2}{2\, d_{a,b}}$

Fast Map - Iterative projection
- For i = 1 to k:
  - Find the next (ith) “best” line
    - A “best” line is one which gives maximum variance of the point-set in the direction of the line
  - Project points on the line
  - Project points on the “hyperplane” orthogonal to the above line
- Project recursively, down to a 1-D space
- Time: O(nk)
- Product: $(x_1, \ldots, x_k)$ for each point x in the original data set

Fast Map – Hyperplane Projection
- $d_{x',y'}^2 = d_{x,y}^2 - (x_1 - y_1)^2$
- Purpose
  - To correct the inter-point distances $d_{x',y'}$ between points (x’, y’) by taking into account the components $(x_1, y_1)$ already accounted for by the first pivot line.
- (As before, on each new line the first coordinate is $x_1 = \frac{d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2}{2\, d_{a,b}}$.)

Fast Map – Example
- Response to the query “Tony Bennett”
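A compact FastMap sketch following the two formulas above (pivot-line coordinate, then residual distances on the orthogonal hyperplane). The pivot selection is simplified to one farthest-point sweep rather than a full search, and all names and data are mine.

```python
import numpy as np

def fastmap(D, k):
    """Map n objects with pairwise distance matrix D (n x n) to k coordinates."""
    n = len(D)
    D2 = D.astype(float) ** 2            # work with squared distances
    coords = np.zeros((n, k))
    for dim in range(k):
        # Choose pivots a, b that are far apart (one cheap sweep, not exhaustive).
        a = 0
        b = int(np.argmax(D2[a]))
        a = int(np.argmax(D2[b]))
        if D2[a, b] == 0:                # all remaining distances are zero
            break
        # x_i = (d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2) / (2 d_{a,b})
        x = (D2[a] + D2[a, b] - D2[b]) / (2 * np.sqrt(D2[a, b]))
        coords[:, dim] = x
        # Hyperplane projection: d'^2_{i,j} = d^2_{i,j} - (x_i - x_j)^2
        D2 = np.maximum(D2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return coords

X = np.random.default_rng(0).normal(size=(15, 8))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
print(fastmap(D, k=2).shape)   # (15, 2)
```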
Vector Space Model: Pros
- Automatic selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improves retrieval performance)
- Various extensions
  - Document clustering
  - Relevance feedback (modifying the query vector)
- Geometric foundation
Problems with Lexical Semantics
- Polysemy: words often have a multitude of meanings and different usages. The Vector Space Model cannot distinguish between different senses of the same word, i.e., ambiguity.
- Synonymy: different terms may have an identical or a similar meaning. The Vector Space Model cannot express associations between words.

Issues in the VSM
- Assumes independence between terms
  - Some terms are more likely to appear together: synonyms, related words, spelling errors, etc.
  - Depending on context, terms may have different meanings
- The term-document matrix is very high-dimensional
  - Are there really that many important features for every document / every term?
Latent Semantic Indexing (LSI)
- Perform a low-rank approximation of the term-document matrix (typical rank 100-300)
- General idea
  - Map documents (and terms) to a low-dimensional representation.
  - Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
  - Compute document similarity based on the inner product in this latent semantic space
Latent Semantic Indexing (LSI): SVD
- Apply a singular value decomposition (SVD) to the term-document matrix:
  $w_{t \times d} = T_{t \times r}\, S_{r \times r}\, D^{T}_{r \times d}$
- S is a diagonal matrix giving the relative importance of the r dimensions
- $D^{T}$: the representation of the documents in the r dimensions
- T: the matrix used to map new documents into the r dimensions

Low-rank Approximation
- $w_{t \times d} = T_{t \times r}\, S_{r \times r}\, D^{T}_{r \times d}$
- $w'_{t \times d} \approx T_{t \times k}\, S_{k \times k}\, D^{T}_{k \times d}$
- From the original term-document matrix $A_r$ we compute its approximation $A_k$.
- In $A_k$, each row still corresponds to a term and each column to a document
- The difference is that the documents now live in a new space whose dimension k << r
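A minimal NumPy illustration (my own, with a made-up matrix) of the factorization $w = T S D^{T}$ and its rank-k truncation described above.

```python
import numpy as np

# Toy term-document count matrix w (t x d): rows = terms, columns = documents.
w = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

T, s, Dt = np.linalg.svd(w, full_matrices=False)   # w = T @ diag(s) @ Dt
k = 2                                              # keep only the k largest singular values
w_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]        # rank-k approximation A_k

print(np.round(s, 3))    # singular values, decreasing: an ordering of the dimensions
print(np.round(w_k, 2))  # same shape as w, but documents now live in a k-dim latent space
```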
LSI Term matrix T
- The T matrix gives each term’s vector in LSI space
- In the original matrix, term vectors are d-dimensional; in T they are much smaller
- The dimensions correspond to groups of terms that tend to “co-occur” with this term in the same documents
  - synonyms, contextually-related words, variant endings

Singular Values
- S gives an ordering to the dimensions
- The values fall off very quickly
- The trailing singular values represent “noise”
- Cutting off the low-value dimensions can reduce noise and improve performance

Document matrix D
- The $D^{T}$ matrix gives the representation of the documents in LSI space
- It has the same dimensionality as the T vectors
- It can be used to compute the similarity between a query and a document

Improved Retrieval with LSI
- The LSI retrieval process: the query is mapped/projected into the LSI $D^{T}$ space, known as “folding in”: the document/query vector is multiplied by $S^{-1} T^{T}$ (using $T^{T} T = D^{T} D = I$)
- The document vectors of the collection are given by $D^{T}$
- Similarity between the two is computed via the dot product
- The performance gains come from…
  - noise is removed
  - no need to stem terms (variants will co-occur)
  - no need for a stop list
- No improvement in speed or space, though…
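The “folding in” step above, mapping a query vector into the k-dimensional space with $S^{-1}T^{T}$ and comparing it to the document columns of $D^{T}$ by dot product, can be sketched as follows. The toy matrix, the query, and the names Tk, Sk, Dk are mine and purely illustrative.

```python
import numpy as np

w = np.array([                      # toy term-document matrix (terms x docs)
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

T, s, Dt = np.linalg.svd(w, full_matrices=False)
k = 2
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :]   # truncated factors

q = np.array([1, 1, 0, 0, 0], dtype=float)         # query as a term vector
q_k = np.linalg.inv(Sk) @ Tk.T @ q                 # "fold in": q_k = S^{-1} T^T q
scores = Dk.T @ q_k                                # dot product with each document's latent vector
print(np.argsort(scores)[::-1])                    # documents ranked by latent-space similarity
```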
Sample of LSI
- [Figure: a toy term-document matrix C and its SVD factors Tr, Sr, Dr]

Sample of LSI (cont’d)
- [Figure: the rank-2 truncation S2, D2; the documents mapped into a 2-dimension space]
- Latent semantic space: illustrating example
Empirical evidence
- Experiments on TREC 1/2/3 – Dumais
- Precision at or above median TREC precision
  - Top scorer on almost 20% of TREC topics
- Slightly better on average than straight vector spaces
- Effect of dimensionality:

  Dimensions | Precision
  250        | 0.367
  300        | 0.371
  346        | 0.374
LSI has many other applications
- In many settings we have a feature-object matrix.
- The matrix is high-dimensional and highly redundant, so a low-rank approximation can be used.
- In text retrieval, the terms are the features and the docs are the objects: Latent Semantic Indexing.
- For example, opinions and users… When the data is incomplete (e.g., users’ opinions), it can be recovered in the low-dimensional space.
- A powerful general analytical technique

IR based on Language Model (LM)
- [Figure: an information need gives rise to a query; each document d1, d2, ..., dn in the collection has its own model Md1, Md2, ..., Mdn, and the query is “generated” from a document model with probability P(Q | Md)]
- The usual search approach: guess the words the author would have used when writing a relevant document, and form the query from them
- The LM approach directly exploits that idea!
Formal Language (Model)
- Traditional generative model: generates strings
  - Finite state machines or regular grammars, etc.
- Example:
  I wish
  I wish I wish
  I wish I wish I wish
  I wish I wish I wish I wish
  …
  (I wish)*
Stochastic Language Models
- The model’s probability of generating strings in the language (commonly all strings over the alphabet ∑)
- Model M: the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02, …
- String s = “the man likes the woman”:
  0.2 × 0.01 × 0.02 × 0.2 × 0.01 (multiply)
  P(s | M) = 0.00000008

Stochastic Language Models (cont’d)
- Model probability of generating any string
- Model M1: the 0.2, class 0.01, sayst 0.0001, pleaseth 0.0001, yon 0.0001, maiden 0.0005, woman 0.01
- Model M2: the 0.2, class 0.0001, sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, woman 0.0001
- String s = “the class pleaseth yon maiden”:
  under M1: 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
  under M2: 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
  P(s | M2) > P(s | M1)
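The comparison on this slide is just a product of per-word probabilities under each unigram model; the tiny sketch below reproduces it using the probabilities from the slide (the helper name is mine).

```python
# Unigram models M1 and M2, with the word probabilities from the slide.
M1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
M2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}

def p(sentence, model):
    """P(s | M): multiply the model's probability of each word."""
    prob = 1.0
    for word in sentence.split():
        prob *= model[word]
    return prob

s = "the class pleaseth yon maiden"
print(p(s, M1), p(s, M2), p(s, M2) > p(s, M1))   # M2 assigns the higher probability
```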
Stochastic Language Models (cont’d)
- A statistical model for generating text
  - Probability distribution over strings in a given language
- For a model M and a string s = w1 w2 w3 w4:
  $P(s \mid M) = P(w_1 \mid M)\, P(w_2 \mid M, w_1)\, P(w_3 \mid M, w_1 w_2)\, P(w_4 \mid M, w_1 w_2 w_3)$

Unigram and higher-order models
- $P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)\, P(w_4 \mid w_1 w_2 w_3)$
- Unigram Language Models
  - $P(w_1)\, P(w_2)\, P(w_3)\, P(w_4)$
- Bigram (generally, n-gram) Language Models
  - $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)$
- Easy. Effective!
Using Language Models in IR
- Each document corresponds to a model
- Rank documents by P(d | q)
  - P(d | q) = P(q | d) × P(d) / P(q)
  - P(q) is the same for all documents, so ignore it
  - P(d) [the prior] is often treated as the same for all d
    - But we could use criteria like authority, length, genre
  - P(q | d) is the probability of q given d’s model
- Very general formal approach
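A hedged sketch of the query-likelihood ranking just described: estimate a unigram model per document and rank by P(q | d). The toy collection is made up, and the add-mu smoothing (so unseen query words do not zero out a score) is my own addition, not something this slide specifies.

```python
from collections import Counter

docs = {                                           # toy collection
    "d1": "new car deals and automobile insurance",
    "d2": "language models for information retrieval",
    "d3": "the car dealer offered a used automobile",
}

def query_likelihood(query, text, mu=0.5):
    """P(q | M_d) under a unigram document model, with simple add-mu smoothing
    (the smoothing is an assumption; the slide only defines the unsmoothed model)."""
    counts = Counter(text.split())
    total = sum(counts.values())
    vocab = len(set(" ".join(docs.values()).split()))
    score = 1.0
    for w in query.split():
        score *= (counts[w] + mu) / (total + mu * vocab)
    return score

q = "car automobile"
ranking = sorted(docs, key=lambda d: query_likelihood(q, docs[d]), reverse=True)
print(ranking)   # documents ordered by P(q | d); with a uniform prior this also orders P(d | q)
```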