来自GoogleDevOps经验的落地实践-SRE课件_第1页
来自GoogleDevOps经验的落地实践-SRE课件_第2页
来自GoogleDevOps经验的落地实践-SRE课件_第3页
来自GoogleDevOps经验的落地实践-SRE课件_第4页
来自GoogleDevOps经验的落地实践-SRE课件_第5页
已阅读5页,还剩57页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

SRE

是什么鬼SRE是什么鬼1Google

SRE

07-14•

YouTube

Streaming•

Video

transcoding,

streaming,

storage(

>

1PB/month

)•

Global

CDN

network(

>

10K

nodes,

peaking

10Tbps

egress).GoogleSRE07-14•YouTubeSt2Google

SRE

07-14•

Google

Cloud

Platform•

Machine

lifecycle

management(>

X

clusters

globally,

>

Y

machines)•

Borg

,

Omega(>

X

million

jobs

scheduled

every

week)GoogleSRE07-14•GoogleClo3来自GoogleDevOps经验的落地实践-SRE课件4SITE

RELIABILITY

ENGINEERING说白了就是

DevOps

一回事SITERELIABILITYENGINEERING说白5Site•

生产线管理员•

Ensure

user-visible

uptime

and

service

quality•

Authority

over

production

environment.•

跟网站一起成长•

Steep

learning

curve,

mostly

due

to

complexity•

Continuous

retraining,

sites

always

being

improved•

基础架构设施•

Specializations

for

shared

infrastructure•

Ensure

those

components

have

good

reliabilitySite•生产线管理员•Ensureuser-vi6Reliability•

it

just

works•

Service

Level

Objective

(SLO)•

Monitoring/Deployment•

Capacity

Planning•

以一敌百•

Team

manages

monitoring

and

develops

automation•

Implies

use

of

scripting

and

data

analysis

tools•

Most

failures

need

automated

recoveries

in

place•

救火队员和纵火犯合体•

Elevated

risk

during

convenient

working

hours•

Learn

of

age

mortality

risk

during

preceding

workday•

Infant

mortality

ideally

also

avoids

mealsReliability•itjustworks7Engineering•

码农•

Not

administration•

报警系统重度(中毒)用户•

Holes

may

cause

outage

before

notification

occurs•

Routinely

use

multiple

layers,

levels

and

viewpoints•

Design

the

manual

and

automatic

escalation

paths•

对未来负责•

Responsible

for

enabling

growth

and

scaling•

Plan

for

requirements,

identify

inefficiencies•

File

bugs

and,

where

appropriate,

fix

them

tooEngineering•码农•Notadminis8Who

are

SRE•

跑偏了的程序员•

50-50

mix

of

software

background

systemsengineering

background.•

重度强迫症和处女座•

“a

team

of

people

who

fundamentally

will

not

accept

doing

things

over

and

over

by

hand.

Ben

Treynor•

脸皮厚•

DEV

/

OPSWhoareSRE•跑偏了的程序员•50-509Eternal

conflict•

DEV•

The

incentive

of

the

development

team

is

to

get

features

launched

and

to

get

users

to

adopt

the

product.

OPS•

The

incentives

of

a

team

with

operational

duties

is

to

ensure

that

the

thing

doesn't

blow

up

on

their

watch.

Eternalconflict•DEV•The10一图看懂组织结构•

BOSS•

产品线

小BOSS

艺术类

开发团队•

生产线

APP

SRE

Infrastructure

SRE

数据中心运营•供应链一图看懂组织结构•BOSS•供应链11组织结构•

以各产品线为核心,松散的学习型组织

Get

Incentives

right.

SRE

is

a

privilege,

not

a

right.

Free

to

move,

Free

to

leave

bad

service.•

SRE

要做什么

SRE

说了算

Production

Readiness

Review

(PRR).

ROI

matters

most

for

SRE

SRE

resource

is

limited

High

marginal

benefits

work.组织结构•以各产品线为核心,松散的学习型组织12Early

phase•

SRE

gives

guidance

in

automating

routine

tasks

Reduces

workload

by

eliminating

administrivia•

SRE

points

out

errors,

omissions

in

documents

Developer

might

then

beg

others

for

assistance•

SRE

suggests

additional

long

term

monitors

These

fill

in

coverage

gaps

and

track

performance

Administrators

need

sufficient,

trustworthy

monitoringEarlyphase•SREgivesguid13Mature

phase•The

decisions

become

progressively

longer

term

Daily

task

workload

for

a

site

is

getting

reduced

Software

improvements

are

tuning

and

analysis•The

developer

still

has

a

short

term

viewpoint

Working

on

the

next

release,

fixing

known

bugs

The

old

live

releases

start

to

be

a

distraction•

An

obvious

incentive

to

request

site

transfer

to

SREMaturephase•Thedecisionsbec14ONCALL

PHASE•

On

call

more

than

quick

fixes•

SRE

team

members

take

turns.

Fix

any

problem

whose

solution

is

not

yet

automated

Accumulate

occurrence

counts

to

identify

priorities

Document

the

effective

diagnostics

and

solutions•

The

permanent

solution

takes

a

lot

more

time

File

bug,

develop

patch,

test,

code

review,

submit

Schedule

for

integration,

release

and

deployment

Why

spend

many

hours

or

days

doing

all

that?ONCALLPHASE•Oncall–more15Deployment

model•

Following

the

sun.•

Only

one

engineer

responds

to

any

given

alert

Use

a

priority

or

escalation

rule

to

avoid

wasted

effort

The

other

SREs

on

call

are

unlikely

to

be

disturbed•

Redundancy

everywhere!•

What

is

the

failure

rate

of

your

paging

services?•

Hopefully

better

than

10%,

unlikely

to

achieve

1%

eg:5%

with

four

way

redundant

paths

is

99.999%Deploymentmodel•Following16SRE

Best

practiceHow

to

build

a

good

SRE

team.SREBestpracticeHowtobui17CMM

maturity

model•

Level

1

-

Initial

(Chaotic,

Heroic)•

Level

2

Repeatable•

Level

3

Defined•

Level

4

Managed•

Level

5

-

OptimizingCMMmaturitymodel•Level118如何有成效的填坑OPS

OVERLOAD•

Reduce

complexity•

Less

dependencies,

configuration

types,

interfaces.•

Knowledge

sharing•

Assume

there

are

no

humans

operating.

Refocus

human

involvement•

Quarterly

Service

Review•

Hard

cap

50%

ops

load.•

Provide

career

path如何有成效的填坑OPSOVERLOAD•Reduce19正确的和开发团队掐架SLO

Budgeting•

Establish

SLO

Goal•

Nothing

is

100%

reliable.•

Spend

error

budget•

On

bad

releases,

technical

debt

etc.•

FREEZE

on

blown

budget.•

No

more

arguing•

Its’

physics!•

Self-policing

Incentives.•

Moral

authority正确的和开发团队掐架SLOBudgeting•Est20灾难级别分类FAILURES•Failover

with

minimal

delay,

near

full

quality

service•Failover

with

significant

delay,

near

full

quality

service•Partial

or

limited

service,

with

good

to

mediumquality•Prevent

crash

with

no

or

very

limited

service,

lowquality•Crash

without

data

loss

or

corruption•Crash

with

data

loss•Crash

with

data

loss,

corruption,

destruction灾难级别分类FAILURES•Failoverwith21fix

it

really

quickly

when

it

does

fail.

minutes

to

something

that

goes

wrong.and

takes

the

appropriate

corrective

actions,

versus安全生产两大指标MTBF

/

MTTR•

Availability

=

f(MTBF,

MTTR)•

You

can

make

it

fail

very

rarely,

or

you

are

able

to•

Typically,

no

human

will

respond

in

less

than

two•

It

is

that

the

human

correctly

assesses

the

situation

diagnosing

incorrectly

or

taking

ineffective

steps.fixitreallyquicklywhen22如何设计不会坏的系统Design:

Failures•

Defense

in

depth

All

the

different

layers

of

the

system

can/will

fail..

User

exp

must

not

be

affected.

No

human

involvement.•

Graceful

degradation

Caching

/

Time

shifting

Failover•

Redundant

Instances

,

N

+

2•

Localization

of

issue.如何设计不会坏的系统Design:Failures•23如何正确的花钱买机器Capacity

planning•

what

N+M

do

you

run

your

services

at?•

"I

don't

know,

because

we've

never

assessed

what

thecapacity

of

our

service

is."

Tips•

How

to

benchmark

service,•

How

to

measure

its

response

to

100%

or

130%

of

peak.•

How

much

spare

capacity

you

have

at

peak

demandtime•

Expect

frequent

outages

and

lots

of

emergencies.如何正确的花钱买机器Capacityplanning•24Something

that

is

happening

or

about

to

happen,

thatthe

situation.You

have

maybe

hours,

typically,

days,

but

someavailable

for

diagnostic

or

forensic

purposes.

The正确的实现监控系统DESIGN:

monitoring•

Alerts•

which

say

a

human

must

take

action

right

now.

a

human

needs

to

take

action

immediately

to

improve•

Tickets.•

A

human

needs

to

take

action,

but

not

immediately.human

action

is

required.•

Logging.•

No

one

ever

needs

to

look

at

this

information,

but

it

isexpectation

is

that

no

one

reads

it.Somethingthatishappeningor25situations

until

they

don't

have

to

think

about

it.实战演习Wheel

of

misfortune•

Operational

readiness

drills.•

Tips•

Picking

a

disaster•

Role

playing•

Observe•

Drill

people

on

the

correct

response

to

emergency•

Culturally

compatible.situationsuntiltheydon't26地球级别的灾难演习D.I.R.T•

Simulated

disaster

recovery.•

Total

site

loss.•

Incident

management.•

Business

continuation.地球级别的灾难演习D.I.R.T•Simulatedd27POSTMOTERM•

Blameless

postmortem•

About

process

and

technology,

not

people•

Readable

&

shared

to

wide

variety

of

readers.•

No

over-engineeringPOSTMOTERM•Blamelesspostmo28POSTMOTERM•

Capture

the

facts•

Impacted

services

and

magnitude•

Incident

timeline•

Key

contact

info•

Data•

Root

cause

and

trigger

analysis•

5

Whys•

Why

was

this

possible

in

the

first

place.POSTMOTERM•Capturethefac29POSTMOTERM•

Lessons

learned

What

went

well•

What

went

wrong•

Action

Plan•

Investigate,

management

of

issue•

Mitigation•

Prevention.•

A

problem

is

resolved

to

the

degree

that

no

human

being

will

ever

have

to

pay

attention

to

it

again.POSTMOTERM•Lessonslearned•30来自GoogleDevOps经验的落地实践-SRE课件31SRE

是什么鬼SRE是什么鬼32Google

SRE

07-14•

YouTube

Streaming•

Video

transcoding,

streaming,

storage(

>

1PB/month

)•

Global

CDN

network(

>

10K

nodes,

peaking

10Tbps

egress).GoogleSRE07-14•YouTubeSt33Google

SRE

07-14•

Google

Cloud

Platform•

Machine

lifecycle

management(>

X

clusters

globally,

>

Y

machines)•

Borg

,

Omega(>

X

million

jobs

scheduled

every

week)GoogleSRE07-14•GoogleClo34来自GoogleDevOps经验的落地实践-SRE课件35SITE

RELIABILITY

ENGINEERING说白了就是

DevOps

一回事SITERELIABILITYENGINEERING说白36Site•

生产线管理员•

Ensure

user-visible

uptime

and

service

quality•

Authority

over

production

environment.•

跟网站一起成长•

Steep

learning

curve,

mostly

due

to

complexity•

Continuous

retraining,

sites

always

being

improved•

基础架构设施•

Specializations

for

shared

infrastructure•

Ensure

those

components

have

good

reliabilitySite•生产线管理员•Ensureuser-vi37Reliability•

it

just

works•

Service

Level

Objective

(SLO)•

Monitoring/Deployment•

Capacity

Planning•

以一敌百•

Team

manages

monitoring

and

develops

automation•

Implies

use

of

scripting

and

data

analysis

tools•

Most

failures

need

automated

recoveries

in

place•

救火队员和纵火犯合体•

Elevated

risk

during

convenient

working

hours•

Learn

of

age

mortality

risk

during

preceding

workday•

Infant

mortality

ideally

also

avoids

mealsReliability•itjustworks38Engineering•

码农•

Not

administration•

报警系统重度(中毒)用户•

Holes

may

cause

outage

before

notification

occurs•

Routinely

use

multiple

layers,

levels

and

viewpoints•

Design

the

manual

and

automatic

escalation

paths•

对未来负责•

Responsible

for

enabling

growth

and

scaling•

Plan

for

requirements,

identify

inefficiencies•

File

bugs

and,

where

appropriate,

fix

them

tooEngineering•码农•Notadminis39Who

are

SRE•

跑偏了的程序员•

50-50

mix

of

software

background

systemsengineering

background.•

重度强迫症和处女座•

“a

team

of

people

who

fundamentally

will

not

accept

doing

things

over

and

over

by

hand.

Ben

Treynor•

脸皮厚•

DEV

/

OPSWhoareSRE•跑偏了的程序员•50-5040Eternal

conflict•

DEV•

The

incentive

of

the

development

team

is

to

get

features

launched

and

to

get

users

to

adopt

the

product.

OPS•

The

incentives

of

a

team

with

operational

duties

is

to

ensure

that

the

thing

doesn't

blow

up

on

their

watch.

Eternalconflict•DEV•The41一图看懂组织结构•

BOSS•

产品线

小BOSS

艺术类

开发团队•

生产线

APP

SRE

Infrastructure

SRE

数据中心运营•供应链一图看懂组织结构•BOSS•供应链42组织结构•

以各产品线为核心,松散的学习型组织

Get

Incentives

right.

SRE

is

a

privilege,

not

a

right.

Free

to

move,

Free

to

leave

bad

service.•

SRE

要做什么

SRE

说了算

Production

Readiness

Review

(PRR).

ROI

matters

most

for

SRE

SRE

resource

is

limited

High

marginal

benefits

work.组织结构•以各产品线为核心,松散的学习型组织43Early

phase•

SRE

gives

guidance

in

automating

routine

tasks

Reduces

workload

by

eliminating

administrivia•

SRE

points

out

errors,

omissions

in

documents

Developer

might

then

beg

others

for

assistance•

SRE

suggests

additional

long

term

monitors

These

fill

in

coverage

gaps

and

track

performance

Administrators

need

sufficient,

trustworthy

monitoringEarlyphase•SREgivesguid44Mature

phase•The

decisions

become

progressively

longer

term

Daily

task

workload

for

a

site

is

getting

reduced

Software

improvements

are

tuning

and

analysis•The

developer

still

has

a

short

term

viewpoint

Working

on

the

next

release,

fixing

known

bugs

The

old

live

releases

start

to

be

a

distraction•

An

obvious

incentive

to

request

site

transfer

to

SREMaturephase•Thedecisionsbec45ONCALL

PHASE•

On

call

more

than

quick

fixes•

SRE

team

members

take

turns.

Fix

any

problem

whose

solution

is

not

yet

automated

Accumulate

occurrence

counts

to

identify

priorities

Document

the

effective

diagnostics

and

solutions•

The

permanent

solution

takes

a

lot

more

time

File

bug,

develop

patch,

test,

code

review,

submit

Schedule

for

integration,

release

and

deployment

Why

spend

many

hours

or

days

doing

all

that?ONCALLPHASE•Oncall–more46Deployment

model•

Following

the

sun.•

Only

one

engineer

responds

to

any

given

alert

Use

a

priority

or

escalation

rule

to

avoid

wasted

effort

The

other

SREs

on

call

are

unlikely

to

be

disturbed•

Redundancy

everywhere!•

What

is

the

failure

rate

of

your

paging

services?•

Hopefully

better

than

10%,

unlikely

to

achieve

1%

eg:5%

with

four

way

redundant

paths

is

99.999%Deploymentmodel•Following47SRE

Best

practiceHow

to

build

a

good

SRE

team.SREBestpracticeHowtobui48CMM

maturity

model•

Level

1

-

Initial

(Chaotic,

Heroic)•

Level

2

Repeatable•

Level

3

Defined•

Level

4

Managed•

Level

5

-

OptimizingCMMmaturitymodel•Level149如何有成效的填坑OPS

OVERLOAD•

Reduce

complexity•

Less

dependencies,

configuration

types,

interfaces.•

Knowledge

sharing•

Assume

there

are

no

humans

operating.

Refocus

human

involvement•

Quarterly

Service

Review•

Hard

cap

50%

ops

load.•

Provide

career

path如何有成效的填坑OPSOVERLOAD•Reduce50正确的和开发团队掐架SLO

Budgeting•

Establish

SLO

Goal•

Nothing

is

100%

reliable.•

Spend

error

budget•

On

bad

releases,

technical

debt

etc.•

FREEZE

on

blown

budget.•

No

more

arguing•

Its’

physics!•

Self-policing

Incentives.•

Moral

authority正确的和开发团队掐架SLOBudgeting•Est51灾难级别分类FAILURES•Failover

with

minimal

delay,

near

full

quality

service•Failover

with

significant

delay,

near

full

quality

service•Partial

or

limited

service,

with

good

to

mediumquality•Prevent

crash

with

no

or

very

limited

service,

lowquality•Crash

without

data

loss

or

corruption•Crash

with

data

loss•Crash

with

data

loss,

corruption,

destruction灾难级别分类FAILURES•Failoverwith52fix

it

really

quickly

when

it

does

fail.

minutes

to

something

that

goes

wrong.and

takes

the

appropriate

corrective

actions,

versus安全生产两大指标MTBF

/

MTTR•

Availability

=

f(MTBF,

MTTR)•

You

can

make

it

fail

very

rarely,

or

you

are

able

to•

Typically,

no

human

will

respond

in

less

than

two•

It

is

that

the

human

correctly

assesses

the

situation

diagnosing

incorrectly

or

taking

ineffective

steps.fixitreallyquicklywhen53如何设计不会坏的系统Design:

Failures•

Defense

in

depth

All

the

different

layers

of

the

system

can/will

fail..

User

exp

must

not

be

affected.

No

human

involvement.•

Graceful

degradation

Caching

/

Time

shifting

Failover•

Redundant

Instances

,

N

+

2•

Localization

of

issue.如何设计不会坏的系统Design:Failures•54如何正确的花钱买机器Capacity

planning•

what

N+M

do

you

run

your

services

at?•

"I

don't

know,

because

we've

never

assessed

what

thecapacity

of

our

service

is."

Tips•

How

to

benchmark

service,•

How

to

measure

its

response

to

100%

or

130%

of

peak.•

How

much

spare

capacity

you

have

at

peak

demandtime•

Expect

frequent

outages

and

lots

of

emergencies.如何正确的花

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论