




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
SRE
是什么鬼SRE是什么鬼1Google
SRE
07-14•
YouTube
Streaming•
Video
transcoding,
streaming,
storage(
>
1PB/month
)•
Global
CDN
network(
>
10K
nodes,
peaking
10Tbps
egress).GoogleSRE07-14•YouTubeSt2Google
SRE
07-14•
Cloud
Platform•
Machine
lifecycle
management(>
X
clusters
globally,
>
Y
machines)•
Borg
,
Omega(>
X
million
jobs
scheduled
every
week)GoogleSRE07-14•GoogleClo3来自GoogleDevOps经验的落地实践-SRE课件4SITE
RELIABILITY
ENGINEERING说白了就是
DevOps
一回事SITERELIABILITYENGINEERING说白5Site•
生产线管理员•
Ensure
user-visible
uptime
and
service
quality•
Authority
over
production
environment.•
跟网站一起成长•
Steep
learning
curve,
mostly
due
to
complexity•
Continuous
retraining,
sites
always
being
improved•
基础架构设施•
Specializations
for
shared
infrastructure•
Ensure
those
components
have
good
reliabilitySite•生产线管理员•Ensureuser-vi6Reliability•
it
just
works•
Service
Level
Objective
(SLO)•
Monitoring/Deployment•
Capacity
Planning•
以一敌百•
Team
manages
monitoring
and
develops
automation•
Implies
use
of
scripting
and
data
analysis
tools•
Most
failures
need
automated
recoveries
in
place•
救火队员和纵火犯合体•
Elevated
risk
during
convenient
working
hours•
Learn
of
age
mortality
risk
during
preceding
workday•
Infant
mortality
ideally
also
avoids
mealsReliability•itjustworks7Engineering•
码农•
Not
administration•
报警系统重度(中毒)用户•
Holes
may
cause
outage
before
notification
occurs•
Routinely
use
multiple
layers,
levels
and
viewpoints•
Design
the
manual
and
automatic
escalation
paths•
对未来负责•
Responsible
for
enabling
growth
and
scaling•
Plan
for
requirements,
identify
inefficiencies•
File
bugs
and,
where
appropriate,
fix
them
tooEngineering•码农•Notadminis8Who
are
SRE•
跑偏了的程序员•
50-50
mix
of
software
background
systemsengineering
background.•
重度强迫症和处女座•
“a
team
of
people
who
fundamentally
will
not
accept
doing
things
over
and
over
by
hand.
“
–
Ben
Treynor•
脸皮厚•
DEV
/
OPSWhoareSRE•跑偏了的程序员•50-509Eternal
conflict•
DEV•
The
incentive
of
the
development
team
is
to
get
features
launched
and
to
get
users
to
adopt
the
product.
•
OPS•
The
incentives
of
a
team
with
operational
duties
is
to
ensure
that
the
thing
doesn't
blow
up
on
their
watch.
Eternalconflict•DEV•The10一图看懂组织结构•
BOSS•
产品线
•
小BOSS
•
艺术类
•
开发团队•
生产线
•
APP
SRE
•
Infrastructure
SRE
•
数据中心运营•供应链一图看懂组织结构•BOSS•供应链11组织结构•
以各产品线为核心,松散的学习型组织
•
Get
Incentives
right.
•
SRE
is
a
privilege,
not
a
right.
•
Free
to
move,
Free
to
leave
bad
service.•
SRE
要做什么
SRE
说了算
•
Production
Readiness
Review
(PRR).
•
ROI
matters
most
for
SRE
•
SRE
resource
is
limited
•
High
marginal
benefits
work.组织结构•以各产品线为核心,松散的学习型组织12Early
phase•
SRE
gives
guidance
in
automating
routine
tasks
•
Reduces
workload
by
eliminating
administrivia•
SRE
points
out
errors,
omissions
in
documents
•
Developer
might
then
beg
others
for
assistance•
SRE
suggests
additional
long
term
monitors
•
These
fill
in
coverage
gaps
and
track
performance
•
Administrators
need
sufficient,
trustworthy
monitoringEarlyphase•SREgivesguid13Mature
phase•The
decisions
become
progressively
longer
term
•
Daily
task
workload
for
a
site
is
getting
reduced
•
Software
improvements
are
tuning
and
analysis•The
developer
still
has
a
short
term
viewpoint
•
Working
on
the
next
release,
fixing
known
bugs
•
The
old
live
releases
start
to
be
a
distraction•
An
obvious
incentive
to
request
site
transfer
to
SREMaturephase•Thedecisionsbec14ONCALL
PHASE•
On
call
–
more
than
quick
fixes•
SRE
team
members
take
turns.
•
Fix
any
problem
whose
solution
is
not
yet
automated
•
Accumulate
occurrence
counts
to
identify
priorities
Document
the
effective
diagnostics
and
solutions•
The
permanent
solution
takes
a
lot
more
time
•
File
bug,
develop
patch,
test,
code
review,
submit
•
Schedule
for
integration,
release
and
deployment
•
Why
spend
many
hours
or
days
doing
all
that?ONCALLPHASE•Oncall–more15Deployment
model•
Following
the
sun.•
Only
one
engineer
responds
to
any
given
alert
•
Use
a
priority
or
escalation
rule
to
avoid
wasted
effort
•
The
other
SREs
on
call
are
unlikely
to
be
disturbed•
Redundancy
everywhere!•
What
is
the
failure
rate
of
your
paging
services?•
Hopefully
better
than
10%,
unlikely
to
achieve
1%
eg:5%
with
four
way
redundant
paths
is
99.999%Deploymentmodel•Following16SRE
Best
practiceHow
to
build
a
good
SRE
team.SREBestpracticeHowtobui17CMM
maturity
model•
Level
1
-
Initial
(Chaotic,
Heroic)•
Level
2
–
Repeatable•
Level
3
–
Defined•
Level
4
–
Managed•
Level
5
-
OptimizingCMMmaturitymodel•Level118如何有成效的填坑OPS
OVERLOAD•
Reduce
complexity•
Less
dependencies,
configuration
types,
interfaces.•
Knowledge
sharing•
Assume
there
are
no
humans
operating.
•
Refocus
human
involvement•
Quarterly
Service
Review•
Hard
cap
50%
ops
load.•
Provide
career
path如何有成效的填坑OPSOVERLOAD•Reduce19正确的和开发团队掐架SLO
Budgeting•
Establish
SLO
Goal•
Nothing
is
100%
reliable.•
Spend
error
budget•
On
bad
releases,
technical
debt
etc.•
FREEZE
on
blown
budget.•
No
more
arguing•
Its’
physics!•
Self-policing
Incentives.•
Moral
authority正确的和开发团队掐架SLOBudgeting•Est20灾难级别分类FAILURES•Failover
with
minimal
delay,
near
full
quality
service•Failover
with
significant
delay,
near
full
quality
service•Partial
or
limited
service,
with
good
to
mediumquality•Prevent
crash
with
no
or
very
limited
service,
lowquality•Crash
without
data
loss
or
corruption•Crash
with
data
loss•Crash
with
data
loss,
corruption,
destruction灾难级别分类FAILURES•Failoverwith21fix
it
really
quickly
when
it
does
fail.
minutes
to
something
that
goes
wrong.and
takes
the
appropriate
corrective
actions,
versus安全生产两大指标MTBF
/
MTTR•
Availability
=
f(MTBF,
MTTR)•
You
can
make
it
fail
very
rarely,
or
you
are
able
to•
Typically,
no
human
will
respond
in
less
than
two•
It
is
that
the
human
correctly
assesses
the
situation
diagnosing
incorrectly
or
taking
ineffective
steps.fixitreallyquicklywhen22如何设计不会坏的系统Design:
Failures•
Defense
in
depth
•
All
the
different
layers
of
the
system
can/will
fail..
•
User
exp
must
not
be
affected.
•
No
human
involvement.•
Graceful
degradation
•
Caching
/
Time
shifting
•
Failover•
Redundant
Instances
,
N
+
2•
Localization
of
issue.如何设计不会坏的系统Design:Failures•23如何正确的花钱买机器Capacity
planning•
what
N+M
do
you
run
your
services
at?•
"I
don't
know,
because
we've
never
assessed
what
thecapacity
of
our
service
is."
•
Tips•
How
to
benchmark
service,•
How
to
measure
its
response
to
100%
or
130%
of
peak.•
How
much
spare
capacity
you
have
at
peak
demandtime•
Expect
frequent
outages
and
lots
of
emergencies.如何正确的花钱买机器Capacityplanning•24Something
that
is
happening
or
about
to
happen,
thatthe
situation.You
have
maybe
hours,
typically,
days,
but
someavailable
for
diagnostic
or
forensic
purposes.
The正确的实现监控系统DESIGN:
monitoring•
Alerts•
which
say
a
human
must
take
action
right
now.
a
human
needs
to
take
action
immediately
to
improve•
Tickets.•
A
human
needs
to
take
action,
but
not
immediately.human
action
is
required.•
Logging.•
No
one
ever
needs
to
look
at
this
information,
but
it
isexpectation
is
that
no
one
reads
it.Somethingthatishappeningor25situations
until
they
don't
have
to
think
about
it.实战演习Wheel
of
misfortune•
Operational
readiness
drills.•
Tips•
Picking
a
disaster•
Role
playing•
Observe•
Drill
people
on
the
correct
response
to
emergency•
Culturally
compatible.situationsuntiltheydon't26地球级别的灾难演习D.I.R.T•
Simulated
disaster
recovery.•
Total
site
loss.•
Incident
management.•
Business
continuation.地球级别的灾难演习D.I.R.T•Simulatedd27POSTMOTERM•
Blameless
postmortem•
About
process
and
technology,
not
people•
Readable
&
shared
to
wide
variety
of
readers.•
No
over-engineeringPOSTMOTERM•Blamelesspostmo28POSTMOTERM•
Capture
the
facts•
Impacted
services
and
magnitude•
Incident
timeline•
Key
contact
info•
Data•
Root
cause
and
trigger
analysis•
5
Whys•
Why
was
this
possible
in
the
first
place.POSTMOTERM•Capturethefac29POSTMOTERM•
Lessons
learned
•
What
went
well•
What
went
wrong•
Action
Plan•
Investigate,
management
of
issue•
Mitigation•
Prevention.•
A
problem
is
resolved
to
the
degree
that
no
human
being
will
ever
have
to
pay
attention
to
it
again.POSTMOTERM•Lessonslearned•30来自GoogleDevOps经验的落地实践-SRE课件31SRE
是什么鬼SRE是什么鬼32Google
SRE
07-14•
YouTube
Streaming•
Video
transcoding,
streaming,
storage(
>
1PB/month
)•
Global
CDN
network(
>
10K
nodes,
peaking
10Tbps
egress).GoogleSRE07-14•YouTubeSt33Google
SRE
07-14•
Cloud
Platform•
Machine
lifecycle
management(>
X
clusters
globally,
>
Y
machines)•
Borg
,
Omega(>
X
million
jobs
scheduled
every
week)GoogleSRE07-14•GoogleClo34来自GoogleDevOps经验的落地实践-SRE课件35SITE
RELIABILITY
ENGINEERING说白了就是
DevOps
一回事SITERELIABILITYENGINEERING说白36Site•
生产线管理员•
Ensure
user-visible
uptime
and
service
quality•
Authority
over
production
environment.•
跟网站一起成长•
Steep
learning
curve,
mostly
due
to
complexity•
Continuous
retraining,
sites
always
being
improved•
基础架构设施•
Specializations
for
shared
infrastructure•
Ensure
those
components
have
good
reliabilitySite•生产线管理员•Ensureuser-vi37Reliability•
it
just
works•
Service
Level
Objective
(SLO)•
Monitoring/Deployment•
Capacity
Planning•
以一敌百•
Team
manages
monitoring
and
develops
automation•
Implies
use
of
scripting
and
data
analysis
tools•
Most
failures
need
automated
recoveries
in
place•
救火队员和纵火犯合体•
Elevated
risk
during
convenient
working
hours•
Learn
of
age
mortality
risk
during
preceding
workday•
Infant
mortality
ideally
also
avoids
mealsReliability•itjustworks38Engineering•
码农•
Not
administration•
报警系统重度(中毒)用户•
Holes
may
cause
outage
before
notification
occurs•
Routinely
use
multiple
layers,
levels
and
viewpoints•
Design
the
manual
and
automatic
escalation
paths•
对未来负责•
Responsible
for
enabling
growth
and
scaling•
Plan
for
requirements,
identify
inefficiencies•
File
bugs
and,
where
appropriate,
fix
them
tooEngineering•码农•Notadminis39Who
are
SRE•
跑偏了的程序员•
50-50
mix
of
software
background
systemsengineering
background.•
重度强迫症和处女座•
“a
team
of
people
who
fundamentally
will
not
accept
doing
things
over
and
over
by
hand.
“
–
Ben
Treynor•
脸皮厚•
DEV
/
OPSWhoareSRE•跑偏了的程序员•50-5040Eternal
conflict•
DEV•
The
incentive
of
the
development
team
is
to
get
features
launched
and
to
get
users
to
adopt
the
product.
•
OPS•
The
incentives
of
a
team
with
operational
duties
is
to
ensure
that
the
thing
doesn't
blow
up
on
their
watch.
Eternalconflict•DEV•The41一图看懂组织结构•
BOSS•
产品线
•
小BOSS
•
艺术类
•
开发团队•
生产线
•
APP
SRE
•
Infrastructure
SRE
•
数据中心运营•供应链一图看懂组织结构•BOSS•供应链42组织结构•
以各产品线为核心,松散的学习型组织
•
Get
Incentives
right.
•
SRE
is
a
privilege,
not
a
right.
•
Free
to
move,
Free
to
leave
bad
service.•
SRE
要做什么
SRE
说了算
•
Production
Readiness
Review
(PRR).
•
ROI
matters
most
for
SRE
•
SRE
resource
is
limited
•
High
marginal
benefits
work.组织结构•以各产品线为核心,松散的学习型组织43Early
phase•
SRE
gives
guidance
in
automating
routine
tasks
•
Reduces
workload
by
eliminating
administrivia•
SRE
points
out
errors,
omissions
in
documents
•
Developer
might
then
beg
others
for
assistance•
SRE
suggests
additional
long
term
monitors
•
These
fill
in
coverage
gaps
and
track
performance
•
Administrators
need
sufficient,
trustworthy
monitoringEarlyphase•SREgivesguid44Mature
phase•The
decisions
become
progressively
longer
term
•
Daily
task
workload
for
a
site
is
getting
reduced
•
Software
improvements
are
tuning
and
analysis•The
developer
still
has
a
short
term
viewpoint
•
Working
on
the
next
release,
fixing
known
bugs
•
The
old
live
releases
start
to
be
a
distraction•
An
obvious
incentive
to
request
site
transfer
to
SREMaturephase•Thedecisionsbec45ONCALL
PHASE•
On
call
–
more
than
quick
fixes•
SRE
team
members
take
turns.
•
Fix
any
problem
whose
solution
is
not
yet
automated
•
Accumulate
occurrence
counts
to
identify
priorities
Document
the
effective
diagnostics
and
solutions•
The
permanent
solution
takes
a
lot
more
time
•
File
bug,
develop
patch,
test,
code
review,
submit
•
Schedule
for
integration,
release
and
deployment
•
Why
spend
many
hours
or
days
doing
all
that?ONCALLPHASE•Oncall–more46Deployment
model•
Following
the
sun.•
Only
one
engineer
responds
to
any
given
alert
•
Use
a
priority
or
escalation
rule
to
avoid
wasted
effort
•
The
other
SREs
on
call
are
unlikely
to
be
disturbed•
Redundancy
everywhere!•
What
is
the
failure
rate
of
your
paging
services?•
Hopefully
better
than
10%,
unlikely
to
achieve
1%
eg:5%
with
four
way
redundant
paths
is
99.999%Deploymentmodel•Following47SRE
Best
practiceHow
to
build
a
good
SRE
team.SREBestpracticeHowtobui48CMM
maturity
model•
Level
1
-
Initial
(Chaotic,
Heroic)•
Level
2
–
Repeatable•
Level
3
–
Defined•
Level
4
–
Managed•
Level
5
-
OptimizingCMMmaturitymodel•Level149如何有成效的填坑OPS
OVERLOAD•
Reduce
complexity•
Less
dependencies,
configuration
types,
interfaces.•
Knowledge
sharing•
Assume
there
are
no
humans
operating.
•
Refocus
human
involvement•
Quarterly
Service
Review•
Hard
cap
50%
ops
load.•
Provide
career
path如何有成效的填坑OPSOVERLOAD•Reduce50正确的和开发团队掐架SLO
Budgeting•
Establish
SLO
Goal•
Nothing
is
100%
reliable.•
Spend
error
budget•
On
bad
releases,
technical
debt
etc.•
FREEZE
on
blown
budget.•
No
more
arguing•
Its’
physics!•
Self-policing
Incentives.•
Moral
authority正确的和开发团队掐架SLOBudgeting•Est51灾难级别分类FAILURES•Failover
with
minimal
delay,
near
full
quality
service•Failover
with
significant
delay,
near
full
quality
service•Partial
or
limited
service,
with
good
to
mediumquality•Prevent
crash
with
no
or
very
limited
service,
lowquality•Crash
without
data
loss
or
corruption•Crash
with
data
loss•Crash
with
data
loss,
corruption,
destruction灾难级别分类FAILURES•Failoverwith52fix
it
really
quickly
when
it
does
fail.
minutes
to
something
that
goes
wrong.and
takes
the
appropriate
corrective
actions,
versus安全生产两大指标MTBF
/
MTTR•
Availability
=
f(MTBF,
MTTR)•
You
can
make
it
fail
very
rarely,
or
you
are
able
to•
Typically,
no
human
will
respond
in
less
than
two•
It
is
that
the
human
correctly
assesses
the
situation
diagnosing
incorrectly
or
taking
ineffective
steps.fixitreallyquicklywhen53如何设计不会坏的系统Design:
Failures•
Defense
in
depth
•
All
the
different
layers
of
the
system
can/will
fail..
•
User
exp
must
not
be
affected.
•
No
human
involvement.•
Graceful
degradation
•
Caching
/
Time
shifting
•
Failover•
Redundant
Instances
,
N
+
2•
Localization
of
issue.如何设计不会坏的系统Design:Failures•54如何正确的花钱买机器Capacity
planning•
what
N+M
do
you
run
your
services
at?•
"I
don't
know,
because
we've
never
assessed
what
thecapacity
of
our
service
is."
•
Tips•
How
to
benchmark
service,•
How
to
measure
its
response
to
100%
or
130%
of
peak.•
How
much
spare
capacity
you
have
at
peak
demandtime•
Expect
frequent
outages
and
lots
of
emergencies.如何正确的花
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025年悬浮床加氢裂化催化剂项目建议书
- 幼儿行为观察记录培训
- 护理反思日记实践体系
- 社区护理评估方法
- 经济适用房满年限上市买卖及共有权证明注销合同
- 影视行业服装道具损坏赔偿补充合同
- 医疗机构护理劳务外包保密协议书
- 能源行业市场调研与分析补充协议
- 直播带货平台与商家佣金分成协议
- 豪华私人飞机氧气舱设施租赁服务协议
- JGJT 223-2010 预拌砂浆应用技术规程
- 电力电缆基础知识专题培训课件
- 初级消防设施操作员实操详解
- 施工组织设计实训任务书
- 贪污贿赂犯罪PPT(培训)(PPT168页)课件
- 机械原理课程设计巧克力包装机(共27页)
- 安达信-深圳证券交易所人力资源管理咨询项目现状分析报告PPT课件
- 毕业论文行星减速器设计完稿
- 半波偶极子天线地HFSS仿真设计
- 混凝土搅拌车车队管理制度
- GJBC标准培训
评论
0/150
提交评论