来自GoogleDevOps经验的落地实践-SRE课件

上传人：l*** IP属地：贵州上传时间：2022-12-19 格式：PPTX 页数：62 大小：413.29KB 积分：25 举报 版权申诉

已阅读5页，还剩57页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

SRE

是什么鬼SRE是什么鬼1Google

SRE

07-14•

YouTube

Streaming•

Video

transcoding,

streaming,

storage(

1PB/month

)•

Global

CDN

network(

10K

nodes,

peaking

10Tbps

egress).GoogleSRE07-14•YouTubeSt2Google

SRE

07-14•

Google

Cloud

Platform•

Machine

lifecycle

management(>

clusters

globally,

machines)•

Borg

Omega(>

million

jobs

scheduled

every

week)GoogleSRE07-14•GoogleClo3来自GoogleDevOps经验的落地实践-SRE课件4SITE

RELIABILITY

ENGINEERING说白了就是

DevOps

一回事SITERELIABILITYENGINEERING说白5Site•

生产线管理员•

Ensure

user-visible

uptime

and

service

quality•

Authority

over

production

environment.•

跟网站一起成长•

Steep

learning

curve,

mostly

due

complexity•

Continuous

retraining,

sites

always

being

improved•

基础架构设施•

Specializations

for

shared

infrastructure•

Ensure

those

components

have

good

reliabilitySite•生产线管理员•Ensureuser-vi6Reliability•

just

works•

Service

Level

Objective

(SLO)•

Monitoring/Deployment•

Capacity

Planning•

以一敌百•

Team

manages

monitoring

and

develops

automation•

Implies

use

scripting

and

data

analysis

tools•

Most

failures

need

automated

recoveries

place•

救火队员和纵火犯合体•

Elevated

risk

during

convenient

working

hours•

Learn

age

mortality

risk

during

preceding

workday•

Infant

mortality

ideally

also

avoids

mealsReliability•itjustworks7Engineering•

码农•

Not

administration•

报警系统重度（中毒）用户•

Holes

may

cause

outage

before

notification

occurs•

Routinely

use

multiple

layers,

levels

and

viewpoints•

Design

the

manual

and

automatic

escalation

paths•

对未来负责•

Responsible

for

enabling

growth

and

scaling•

Plan

for

requirements,

identify

inefficiencies•

File

bugs

and,

where

appropriate,

fix

them

tooEngineering•码农•Notadminis8Who

are

SRE•

跑偏了的程序员•

50-50

mix

software

background

systemsengineering

background.•

重度强迫症和处女座•

“a

team

people

who

fundamentally

will

not

doing

things

over

and

over

hand.

“

–

Ben

Treynor•

脸皮厚•

DEV

OPSWhoareSRE•跑偏了的程序员•50-509Eternal

conflict•

DEV•

The

incentive

the

development

team

get

features

launched

and

get

users

adopt

the

product.

•

OPS•

The

incentives

team

with

operational

duties

ensure

that

the

thing

doesn't

blow

their

watch.

Eternalconflict•DEV•The10一图看懂组织结构•

BOSS•

产品线

•

小BOSS

•

艺术类

•

开发团队•

生产线

•

APP

SRE

•

Infrastructure

SRE

•

数据中心运营•供应链一图看懂组织结构•BOSS•供应链11组织结构•

以各产品线为核心，松散的学习型组织

•

Get

Incentives

right.

•

SRE

privilege,

not

right.

•

Free

move,

Free

leave

bad

service.•

SRE

要做什么

SRE

说了算

•

Production

Readiness

Review

(PRR).

•

ROI

matters

most

for

SRE

•

SRE

resource

limited

•

High

marginal

benefits

work.组织结构•以各产品线为核心，松散的学习型组织12Early

phase•

SRE

gives

guidance

automating

routine

tasks

•

Reduces

workload

eliminating

administrivia•

SRE

points

out

errors,

omissions

documents

•

Developer

might

then

beg

others

for

assistance•

SRE

suggests

additional

long

term

monitors

•

These

fill

coverage

gaps

and

track

performance

•

Administrators

need

sufficient,

trustworthy

monitoringEarlyphase•SREgivesguid13Mature

phase•The

decisions

become

progressively

longer

term

•

Daily

task

workload

for

site

getting

reduced

•

Software

improvements

are

tuning

and

analysis•The

developer

still

has

short

term

viewpoint

•

Working

the

release,

fixing

known

bugs

•

The

old

live

releases

start

distraction•

obvious

incentive

request

site

transfer

SREMaturephase•Thedecisionsbec14ONCALL

PHASE•

call

–

than

quick

fixes•

SRE

team

members

take

turns.

•

Fix

any

problem

whose

solution

not

yet

automated

•

Accumulate

occurrence

counts

identify

priorities

Document

the

effective

diagnostics

and

solutions•

The

permanent

solution

takes

lot

time

•

File

bug,

develop

patch,

test,

code

review,

submit

•

Schedule

for

integration,

release

and

deployment

•

Why

spend

many

hours

days

doing

all

that?ONCALLPHASE•Oncall–more15Deployment

model•

Following

the

sun.•

Only

one

engineer

responds

any

given

alert

•

Use

priority

escalation

rule

avoid

wasted

effort

•

The

other

SREs

call

are

unlikely

disturbed•

Redundancy

everywhere!•

What

the

failure

rate

your

paging

services?•

Hopefully

better

than

10%,

unlikely

achieve

eg:5%

with

four

way

redundant

paths

99.999%Deploymentmodel•Following16SRE

Best

practiceHow

build

good

SRE

team.SREBestpracticeHowtobui17CMM

maturity

model•

Level

Initial

(Chaotic,

Heroic)•

Level

–

Repeatable•

Level

–

Defined•

Level

–

Managed•

Level

OptimizingCMMmaturitymodel•Level118如何有成效的填坑OPS

OVERLOAD•

Reduce

complexity•

Less

dependencies,

configuration

types,

interfaces.•

Knowledge

sharing•

Assume

there

are

humans

operating.

•

Refocus

human

involvement•

Quarterly

Service

Review•

Hard

cap

50%

ops

load.•

Provide

career

path如何有成效的填坑OPSOVERLOAD•Reduce19正确的和开发团队掐架SLO

Budgeting•

Establish

SLO

Goal•

Nothing

100%

reliable.•

Spend

error

budget•

bad

releases,

technical

debt

etc.•

FREEZE

blown

budget.•

arguing•

Its’

physics!•

Self-policing

Incentives.•

Moral

authority正确的和开发团队掐架SLOBudgeting•Est20灾难级别分类FAILURES•Failover

with

minimal

delay,

near

full

quality

service•Failover

with

significant

delay,

near

full

quality

service•Partial

limited

service,

with

good

mediumquality•Prevent

crash

with

very

limited

service,

lowquality•Crash

without

data

loss

corruption•Crash

with

data

loss•Crash

with

data

loss,

corruption,

destruction灾难级别分类FAILURES•Failoverwith21fix

really

quickly

when

does

fail.

minutes

something

that

goes

wrong.and

takes

the

appropriate

corrective

actions,

versus安全生产两大指标MTBF

MTTR•

Availability

f(MTBF,

MTTR)•

You

can

make

fail

very

rarely,

you

are

able

to•

Typically,

human

will

respond

less

than

two•

that

the

human

correctly

assesses

the

situation

diagnosing

incorrectly

taking

ineffective

steps.fixitreallyquicklywhen22如何设计不会坏的系统Design:

Failures•

Defense

depth

•

All

the

different

layers

the

system

can/will

fail..

•

User

exp

must

not

affected.

•

human

involvement.•

Graceful

degradation

•

Caching

Time

shifting

•

Failover•

Redundant

Instances

2•

Localization

issue.如何设计不会坏的系统Design:Failures•23如何正确的花钱买机器Capacity

planning•

what

N+M

you

run

your

services

at?•

don't

know,

because

we've

never

assessed

what

thecapacity

our

service

is."

•

Tips•

How

benchmark

service,•

How

measure

its

response

100%

130%

peak.•

How

much

spare

capacity

you

have

peak

demandtime•

Expect

frequent

outages

and

lots

emergencies.如何正确的花钱买机器Capacityplanning•24Something

that

happening

about

happen,

thatthe

situation.You

have

maybe

hours,

typically,

days,

but

someavailable

for

diagnostic

forensic

purposes.

The正确的实现监控系统DESIGN:

monitoring•

Alerts•

which

say

human

must

take

action

right

now.

human

needs

take

action

immediately

improve•

Tickets.•

human

needs

take

action,

but

not

immediately.human

action

required.•

Logging.•

one

ever

needs

look

this

information,

but

isexpectation

that

one

reads

it.Somethingthatishappeningor25situations

until

they

don't

have

think

about

it.实战演习Wheel

misfortune•

Operational

readiness

drills.•

Tips•

Picking

disaster•

Role

playing•

Observe•

Drill

people

the

correct

response

emergency•

Culturally

compatible.situationsuntiltheydon't26地球级别的灾难演习D.I.R.T•

Simulated

disaster

recovery.•

Total

site

loss.•

Incident

management.•

Business

continuation.地球级别的灾难演习D.I.R.T•Simulatedd27POSTMOTERM•

Blameless

postmortem•

About

process

and

technology,

not

people•

Readable

shared

wide

variety

readers.•

over-engineeringPOSTMOTERM•Blamelesspostmo28POSTMOTERM•

Capture

the

facts•

Impacted

services

and

magnitude•

Incident

timeline•

Key

contact

info•

Data•

Root

cause

and

trigger

analysis•

Whys•

Why

was

this

possible

the

first

place.POSTMOTERM•Capturethefac29POSTMOTERM•

Lessons

learned

•

What

went

well•

What

went

wrong•

Action

Plan•

Investigate,

management

issue•

Mitigation•

Prevention.•

problem

resolved

the

degree

that

human

being

will

ever

have

pay

attention

again.POSTMOTERM•Lessonslearned•30来自GoogleDevOps经验的落地实践-SRE课件31SRE

是什么鬼SRE是什么鬼32Google

SRE

07-14•

YouTube

Streaming•

Video

transcoding,

streaming,

storage(

1PB/month

)•

Global

CDN

network(

10K

nodes,

peaking

10Tbps

egress).GoogleSRE07-14•YouTubeSt33Google

SRE

07-14•

Google

Cloud

Platform•

Machine

lifecycle

management(>

clusters

globally,

machines)•

Borg

Omega(>

million

jobs

scheduled

every

week)GoogleSRE07-14•GoogleClo34来自GoogleDevOps经验的落地实践-SRE课件35SITE

RELIABILITY

ENGINEERING说白了就是

DevOps

一回事SITERELIABILITYENGINEERING说白36Site•

生产线管理员•

Ensure

user-visible

uptime

and

service

quality•

Authority

over

production

environment.•

跟网站一起成长•

Steep

learning

curve,

mostly

due

complexity•

Continuous

retraining,

sites

always

being

improved•

基础架构设施•

Specializations

for

shared

infrastructure•

Ensure

those

components

have

good

reliabilitySite•生产线管理员•Ensureuser-vi37Reliability•

just

works•

Service

Level

Objective

(SLO)•

Monitoring/Deployment•

Capacity

Planning•

以一敌百•

Team

manages

monitoring

and

develops

automation•

Implies

use

scripting

and

data

analysis

tools•

Most

failures

need

automated

recoveries

place•

救火队员和纵火犯合体•

Elevated

risk

during

convenient

working

hours•

Learn

age

mortality

risk

during

preceding

workday•

Infant

mortality

ideally

also

avoids

mealsReliability•itjustworks38Engineering•

码农•

Not

administration•

报警系统重度（中毒）用户•

Holes

may

cause

outage

before

notification

occurs•

Routinely

use

multiple

layers,

levels

and

viewpoints•

Design

the

manual

and

automatic

escalation

paths•

对未来负责•

Responsible

for

enabling

growth

and

scaling•

Plan

for

requirements,

identify

inefficiencies•

File

bugs

and,

where

appropriate,

fix

them

tooEngineering•码农•Notadminis39Who

are

SRE•

跑偏了的程序员•

50-50

mix

software

background

systemsengineering

background.•

重度强迫症和处女座•

“a

team

people

who

fundamentally

will

not

doing

things

over

and

over

hand.

“

–

Ben

Treynor•

脸皮厚•

DEV

OPSWhoareSRE•跑偏了的程序员•50-5040Eternal

conflict•

DEV•

The

incentive

the

development

team

get

features

launched

and

get

users

adopt

the

product.

•

OPS•

The

incentives

team

with

operational

duties

ensure

that

the

thing

doesn't

blow

their

watch.

Eternalconflict•DEV•The41一图看懂组织结构•

BOSS•

产品线

•

小BOSS

•

艺术类

•

开发团队•

生产线

•

APP

SRE

•

Infrastructure

SRE

•

数据中心运营•供应链一图看懂组织结构•BOSS•供应链42组织结构•

以各产品线为核心，松散的学习型组织

•

Get

Incentives

right.

•

SRE

privilege,

not

right.

•

Free

move,

Free

leave

bad

service.•

SRE

要做什么

SRE

说了算

•

Production

Readiness

Review

(PRR).

•

ROI

matters

most

for

SRE

•

SRE

resource

limited

•

High

marginal

benefits

work.组织结构•以各产品线为核心，松散的学习型组织43Early

phase•

SRE

gives

guidance

automating

routine

tasks

•

Reduces

workload

eliminating

administrivia•

SRE

points

out

errors,

omissions

documents

•

Developer

might

then

beg

others

for

assistance•

SRE

suggests

additional

long

term

monitors

•

These

fill

coverage

gaps

and

track

performance

•

Administrators

need

sufficient,

trustworthy

monitoringEarlyphase•SREgivesguid44Mature

phase•The

decisions

become

progressively

longer

term

•

Daily

task

workload

for

site

getting

reduced

•

Software

improvements

are

tuning

and

analysis•The

developer

still

has

short

term

viewpoint

•

Working

the

release,

fixing

known

bugs

•

The

old

live

releases

start

distraction•

obvious

incentive

request

site

transfer

SREMaturephase•Thedecisionsbec45ONCALL

PHASE•

call

–

than

quick

fixes•

SRE

team

members

take

turns.

•

Fix

any

problem

whose

solution

not

yet

automated

•

Accumulate

occurrence

counts

identify

priorities

Document

the

effective

diagnostics

and

solutions•

The

permanent

solution

takes

lot

time

•

File

bug,

develop

patch,

test,

code

review,

submit

•

Schedule

for

integration,

release

and

deployment

•

Why

spend

many

hours

days

doing

all

that?ONCALLPHASE•Oncall–more46Deployment

model•

Following

the

sun.•

Only

one

engineer

responds

any

given

alert

•

Use

priority

escalation

rule

avoid

wasted

effort

•

The

other

SREs

call

are

unlikely

disturbed•

Redundancy

everywhere!•

What

the

failure

rate

your

paging

services?•

Hopefully

better

than

10%,

unlikely

achieve

eg:5%

with

four

way

redundant

paths

99.999%Deploymentmodel•Following47SRE

Best

practiceHow

build

good

SRE

team.SREBestpracticeHowtobui48CMM

maturity

model•

Level

Initial

(Chaotic,

Heroic)•

Level

–

Repeatable•

Level

–

Defined•

Level

–

Managed•

Level

OptimizingCMMmaturitymodel•Level149如何有成效的填坑OPS

OVERLOAD•

Reduce

complexity•

Less

dependencies,

configuration

types,

interfaces.•

Knowledge

sharing•

Assume

there

are

humans

operating.

•

Refocus

human

involvement•

Quarterly

Service

Review•

Hard

cap

50%

ops

load.•

Provide

career

path如何有成效的填坑OPSOVERLOAD•Reduce50正确的和开发团队掐架SLO

Budgeting•

Establish

SLO

Goal•

Nothing

100%

reliable.•

Spend

error

budget•

bad

releases,

technical

debt

etc.•

FREEZE

blown

budget.•

arguing•

Its’

physics!•

Self-policing

Incentives.•

Moral

authority正确的和开发团队掐架SLOBudgeting•Est51灾难级别分类FAILURES•Failover

with

minimal

delay,

near

full

quality

service•Failover

with

significant

delay,

near

full

quality

service•Partial

limited

service,

with

good

mediumquality•Prevent

crash

with

very

limited

service,

lowquality•Crash

without

data

loss

corruption•Crash

with

data

loss•Crash

with

data

loss,

corruption,

destruction灾难级别分类FAILURES•Failoverwith52fix

really

quickly

when

does

fail.

minutes

something

that

goes

wrong.and

takes

the

appropriate

corrective

actions,

versus安全生产两大指标MTBF

MTTR•

Availability

f(MTBF,

MTTR)•

You

can

make

fail

very

rarely,

you

are

able

to•

Typically,

human

will

respond

less

than

two•

that

the

human

correctly

assesses

the

situation

diagnosing

incorrectly

taking

ineffective

steps.fixitreallyquicklywhen53如何设计不会坏的系统Design:

Failures•

Defense

depth

•

All

the

different

layers

the

system

can/will

fail..

•

User

exp

must

not

affected.

•

human

involvement.•

Graceful

degradation

•

Caching

Time

shifting

•

Failover•

Redundant

Instances

2•

Localization

issue.如何设计不会坏的系统Design:Failures•54如何正确的花钱买机器Capacity

planning•

what

N+M

you

run

your

services

at?•

don't

know,

because

we've

never

assessed

what

thecapacity

our

service

is."

•

Tips•

How

benchmark

service,•

How

measure

its

response

100%

130%

peak.•

How

much

spare

capacity

you

have

peak

demandtime•

Expect

frequent

outages

and

lots

emergencies.如何正确的花

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

来自GoogleDevOps经验的落地实践-SRE课件

文档简介

温馨提示

最新文档

评论

来自GoogleDevOps经验的落地实践-SRE课件

文档简介

温馨提示

最新文档

评论

相关文档