Operating Systems - Tsinghua - Reference
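File Systems

Department of Computer Science and Technology
2014.11.04
Operating Systems Special Topics Training 2014

Outline

- Background
  - The Rise of Big Data
- File System Basics
  - Fundamentals
  - Key Issues
- File Systems Optimization in the Real World
  - Example: GFS/HDFS
  - Optimization Techniques

Data Growth (2010-2020)

- In 2010 the global digital universe reached the zettabyte scale for the first time, at 1.227 ZB
- In 2005 the figure was only 130 EB
- By 2020 the digital universe is projected to reach 40 ZB
- 40 ZB is about 57 times the number of grains of sand on all the beaches on Earth, or 5,247 GB of data per person worldwide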

Qmee: Online in 60 Seconds

Data type distribution

- Compared with traditional structured data, unstructured data and content data are growing rapidly and hold great value

New development: Data-Intensive Computing as the 4th Paradigm

- Thousand years ago: Experimental Science
  - Description of natural phenomena
- Last few hundred years: Theoretical Science
  - Newton's Laws, Maxwell's Equations, ...
- Last few decades: Computational Science
  - Simulation of complex phenomena
- Today: Data-Intensive Science
  - Unify theory, experiment, and simulation

Some other views

Hype Cycle for Big Data

Big Data Opportunity Heat Map

File System Fundamentals

- File system: a layer of the OS that provides a friendly way for users to use block devices
- Components
  - Disk Management: collecting disk blocks into files
  - Naming: interface to find files by name, not by blocks
  - Protection: layers to keep data secure
  - Reliability/Durability: keeping files durable despite crashes, media failures, attacks, etc.

  File abstraction          Disk abstraction
  Byte-oriented             Block-oriented
  Names                     Block #s
  Access protection         No protection
  Consistency guarantees    No guarantees beyond block write

File & Directory

- File: user-visible group of blocks arranged sequentially in logical space
- Directory: user-visible index mapping names to files, or a relation used for naming
  - Just a table of (file name, unique ID) pairs
  - The ID can be used to look up other file information
  - Often stored in files
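
The "table of (file name, unique ID) pairs" idea can be sketched in a few lines. This is only an illustration: the `Inode` fields, the IDs, and the file names are made up for the example and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Per-file information reached through the unique ID (hypothetical fields)."""
    owner: str
    size: int
    blocks: list = field(default_factory=list)  # block numbers holding the data


# Table of per-file information, indexed by unique ID.
inode_table = {
    7: Inode(owner="alice", size=2048, blocks=[120, 121]),
    9: Inode(owner="bob", size=512, blocks=[300]),
}

# A directory is just a table of (file name, unique ID) pairs.
directory = {"notes.txt": 7, "todo.txt": 9}


def lookup(name):
    """Resolve a name to its file information via the directory."""
    file_id = directory[name]      # name -> unique ID
    return inode_table[file_id]    # unique ID -> other file information


info = lookup("notes.txt")
print(info.owner, info.blocks)
```

Because the directory stores only names and IDs, two names can map to the same ID, which is one way to give a file multiple access paths.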

What Gets Stored

- User data itself is the bulk of the file system's contents
- Also includes meta-data on a drive-wide and per-file basis:
  - Drive-wide: available space, formatting info, character set, ...
  - Per-file: name, owner, modification date, physical layout, ...

High-Level Organization

- Files are organized in a "tree" structure made of nested directories
- One directory acts as the "root"
- "Links" (symlinks, shortcuts, etc.) provide a simple means of providing multiple access paths to one file
- Other file systems can be "mounted" and dropped in as sub-hierarchies (other drives, network shares)

Low-Level Organization (1/2)

- File data and meta-data stored separately
- File descriptors + meta-data stored in inodes
  - Large tree or table at designated location on disk
  - Tells how to look up file contents
- Meta-data may be replicated to increase system reliability

Low-Level Organization (2/2)

- "Standard" read-write medium is a hard drive (other media: CD-ROM, tape, ...)
- Viewed as a sequential array of blocks
- Must address ~1 KB chunk at a time
- Tree structure is "flattened" into blocks
- Overlapping reads/writes/deletes can cause fragmentation: files are often not stored in a linear layout
  - inodes store all block numbers related to a file

Fragmentation

  A  B  C  (free space)
  A  B  C  A  (free space)
  A  (free space)  C  A  (free space)
  A  D  C  A  D  (free)
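
The sequence in the fragmentation figure can be replayed with a toy allocator. This is a minimal sketch under assumptions that are not in the slides: a 10-block disk, a "take the lowest free blocks" policy, and made-up file sizes. It shows why the inode has to record every block number of a file.

```python
FREE = "."


def allocate(disk, name, count):
    """Give `name` the lowest-numbered free blocks and return their numbers."""
    free = [i for i, owner in enumerate(disk) if owner == FREE]
    if len(free) < count:
        raise RuntimeError("not enough free blocks")
    for i in free[:count]:
        disk[i] = name
    return free[:count]


def delete(disk, name):
    """Release every block owned by `name`."""
    for i, owner in enumerate(disk):
        if owner == name:
            disk[i] = FREE


disk = [FREE] * 10
inodes = {}                              # file name -> list of block numbers

inodes["A"] = allocate(disk, "A", 2)     # AA........
inodes["B"] = allocate(disk, "B", 2)     # AABB......
inodes["C"] = allocate(disk, "C", 2)     # AABBCC....
inodes["A"] += allocate(disk, "A", 1)    # A grows past C: AABBCCA...
delete(disk, "B")                        # B leaves a hole: AA..CCA...
del inodes["B"]
inodes["D"] = allocate(disk, "D", 3)     # D fills the hole and spills over

print("".join(disk))
print(inodes)
```

After the last step both A and D occupy non-contiguous blocks, so a sequential read of either file has to jump across the disk.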

File System Requirements

- Naming
  - Should be flexible, e.g., allow multiple names for the same files
  - Support hierarchy for ease of use
- Persistence
  - Want to be sure data has been written to disk in case a crash occurs
- Sharing/Protection
  - Want to restrict who has access to files
  - Want to share files with other users

File System Requirements (cont'd)

- Speed & efficiency for different access patterns
  - Sequential access
  - Random access
  - Keyed access (not usually provided by OS)
- Minimum space overhead
  - Disk space needed to store metadata is lost for user data
- Twist: all metadata that is required to do translation must be stored on disk
  - Translation scheme should minimize number of additional accesses for a given access pattern
  - Harder than, say, page tables, where we assumed page tables themselves are not subject to paging!

Key Issues

- Where to store file metadata?
  - On disk for local file systems
  - On dedicated server(s) for distributed/parallel file systems
- How to store file data?
  - As a whole on one disk
  - Split and stored on multiple disks
- How to guarantee reliability and efficiency?
  - Reliability: replication, RAID, dedicated supervisor, ...
  - Efficiency: replication, cache, hardware-specific space allocation, ...
- How to set block size?
  - Source: Tanenbaum, Modern Operating Systems
  - Assumption: all files are 2 KB in size
  - Question: Why is the data rate corresponding to small block sizes low?
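
One way to reason about the block-size question is a rough cost model: every block read pays a seek and, on average, half a rotation before any data moves, so small blocks spend almost all their time on positioning. The sketch below assumes illustrative disk parameters (5 ms seek, 8.33 ms rotation, 1 MB per track); these numbers are assumptions for the example, not values taken from the slide. Only the 2 KB file size comes from the slide.

```python
SEEK_MS = 5.0            # assumed average seek time
ROTATION_MS = 8.33       # assumed time for one full rotation
TRACK_BYTES = 1_000_000  # assumed bytes per track
FILE_BYTES = 2 * 1024    # the slide's assumption: every file is 2 KB


def read_time_ms(block_bytes):
    """Seek + half a rotation on average + transfer time for one block."""
    return SEEK_MS + ROTATION_MS / 2 + (block_bytes / TRACK_BYTES) * ROTATION_MS


def data_rate_mb_per_s(block_bytes):
    """Useful bytes moved per second when reading one block at a time."""
    return (block_bytes / 1e6) / (read_time_ms(block_bytes) / 1000)


def space_utilization(block_bytes):
    """Fraction of allocated space holding real data when every file is FILE_BYTES."""
    blocks_per_file = -(-FILE_BYTES // block_bytes)  # ceiling division
    return FILE_BYTES / (blocks_per_file * block_bytes)


for kb in (1, 2, 4, 16, 64, 256):
    size = kb * 1024
    print(f"{kb:>4} KB  {data_rate_mb_per_s(size):7.2f} MB/s  {space_utilization(size):6.1%}")
```

Under these assumptions the data rate rises with block size while space utilization falls once the block exceeds the file size, which is the trade-off the slide asks about.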

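Distributed File Systems

- Support access to files on remote servers
  - Uniform view of files
- Must support concurrency
  - Make varying guarantees about locking, who "wins" with concurrent writes, etc.
- Must gracefully handle dropped connections
- Can offer support for replication and local caching
- Different implementations sit in different places on the complexity/feature scale

Distributed File Systems: Overview

- Scalability: nodes must be able to join and leave in a hot-plug manner
- Concurrency: every cloud component must be designed to be safe in a concurrent environment
- Reliability: every cloud component must know how the components it depends on can fail, and must be designed to handle each failure appropriately
- Efficiency: the algorithms for sharing data in the cloud system should avoid performance bottlenecks; frequently used data needs replicas so that users can fetch it from a nearby copy in the shortest time; the interface for using cloud services should be as simple as possible
- Topics
  - Naming service
  - Metadata management
  - Cache
  - Replica
  - Interface
- Examples
  - NFS
  - AFS
  - GFS/HDFS

Distributed File Systems: Naming Service

- Establishes a relationship between physical objects and logical objects
- Basic requirements
  - Location transparency: use a single file namespace
  - Location independence: a change of physical location does not require changing the logical file name

Distributed File Systems: Metadata Management

- Metadata: data about data
  - File name, file size, timestamps, control information, user, group, ...
- Two management modes
  - In-band mode: metadata is stored together with the data
    - Inefficient; operations on large data volumes easily become a bottleneck
  - Out-of-band mode: dedicated servers store the metadata

Distributed File Systems: Cache

- Purpose: performance, making file operations more efficient
- What is cached
  - Metadata: increases concurrency
  - Data: reduces network traffic
- Where it is cached
  - Memory: fast, but costly
  - Disk: supports large files and offline use
- Problem: cache consistency
  - Client-initiated solutions
  - Server-initiated solutions

Distributed File Systems: Replica

- Purpose
  - Guarantee reliability
  - Guarantee availability
  - Achieve load balancing
- Requirement
  - Replica locations are transparent to users
- Problem: consistency
  - Strong consistency
  - Weak consistency

Distributed File Systems: Interface

- Stateless service
  - The server records no state information; every request is self-contained
  - Request messages are large, processing takes longer, and lock operations are not supported
- Stateful service
  - The server records session information for requests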
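
The difference between the two interface styles can be made concrete with a small sketch. Everything here is hypothetical and in-process (the `FILES` store, the class names, the handle numbering); it is not the protocol of NFS or any other real system, only an illustration of where the state lives.

```python
FILES = {"/a.txt": b"hello distributed world"}  # hypothetical server-side store


class StatelessServer:
    """Keeps no per-client state: every request carries path, offset and length."""

    def read(self, path, offset, length):
        return FILES[path][offset:offset + length]


class StatefulServer:
    """Remembers session information (open files and their current positions)."""

    def __init__(self):
        self.sessions = {}      # handle -> [path, current offset]
        self.next_handle = 1

    def open(self, path):
        handle = self.next_handle
        self.next_handle += 1
        self.sessions[handle] = [path, 0]
        return handle

    def read(self, handle, length):
        path, offset = self.sessions[handle]
        data = FILES[path][offset:offset + length]
        self.sessions[handle][1] = offset + len(data)   # advance the position
        return data

    def close(self, handle):
        del self.sessions[handle]


stateless = StatelessServer()
print(stateless.read("/a.txt", 0, 5), stateless.read("/a.txt", 6, 11))

stateful = StatefulServer()
h = stateful.open("/a.txt")
print(stateful.read(h, 5), stateful.read(h, 12))
stateful.close(h)
```

The stateless requests are larger because they repeat the path and offset each time, but a server restart loses nothing; the stateful server has shorter requests and a natural place to hang locks, at the cost of session state that must survive or be rebuilt after a crash.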

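Architecture choice: Scale Up

Architecture choice: Scale Out

Scale up vs. Scale out

  Scaling factor             Scale-out (SAN/NAS)                   Scale-up (DAS/SAN/NAS)
  Hardware scaling           Add hardware                          Replace hardware
  Hardware limits            No hardware limit                     Limited by hardware
  Availability, reliability  Higher                                Lower
  Management complexity      More resources to manage              Fewer resources to manage
  Across geographic sites    Yes                                   No
  NAS usable                 Yes, NAS mechanisms are very common   Yes
  SAN usable                 Yes, by adding switches               Yes
  DAS usable                 Limited                               Yes
  Disruptiveness             Less                                  More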

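Distributed File System Example: GFS/HDFS

- Product characteristics
  - Built on low-cost PC servers + open-source Linux + Gigabit Ethernet + in-house development
  - Highly scalable: a single cluster can reach tens of thousands of nodes, with storage capacity of several hundred PB
  - Combined with computation: moving the computation to the nodes that hold the data improves compute performance; mainly used for data analysis
  - Data reliability: multiple replicas guarantee data reliability, usually 3 replicas
- Files are cut into fixed-size blocks (chunks)
- One primary Master, multiple Shadow Masters
- Multiple chunkservers
- Multiple clients
- HDFS: the open-source implementation of GFS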

Google File System

- Why not use an existing file system?
  - Google's problems are different from anyone else's
- Assumptions
  - High component failure rates
    - Inexpensive commodity components fail all the time
  - "Modest" number of HUGE files
    - Just a few million
    - Each is 100 MB or larger; multi-GB files typical
  - Files are write-once, mostly appended to
    - Perhaps concurrently
  - Large streaming reads
  - High sustained throughput favored over low latency

GFS Design Decisions

- Files stored in chunks
  - Fixed size (64 MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access, keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large data sets, streaming reads
- Familiar interface, but customized API
  - Simplify the problem; focus on apps
  - Add snapshot and record append operations
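
The first three design decisions (fixed 64 MB chunks, 3+ replicas, a single master holding metadata) can be sketched as a mapping from chunk index to replica locations. The round-robin placement and the chunkserver names below are stand-ins chosen for the example; the slides do not describe GFS's actual placement policy.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed chunk size from the slide
REPLICAS = 3                    # each chunk lives on 3+ chunkservers


def chunk_ranges(file_size):
    """Yield (chunk_index, start, end) byte ranges for a file of file_size bytes."""
    for index, start in enumerate(range(0, file_size, CHUNK_SIZE)):
        yield index, start, min(start + CHUNK_SIZE, file_size)


def place_replicas(chunk_index, chunkservers):
    """Round-robin placement: a stand-in policy, not GFS's real one."""
    n = len(chunkservers)
    return [chunkservers[(chunk_index + r) % n] for r in range(REPLICAS)]


servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]   # hypothetical chunkserver names
file_size = 200 * 1024 * 1024                   # a 200 MB file

# The master's metadata for this file: chunk index -> replica locations.
chunk_table = {i: place_replicas(i, servers) for i, _, _ in chunk_ranges(file_size)}

for index, start, end in chunk_ranges(file_size):
    print(index, start, end, chunk_table[index])
```

A client would ask the master only for this small table and then talk to the chunkservers directly, which is what keeps a single master workable.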

Optimization of Metadata Service

- Splitting the functions of a single master into:
  - Multiple metadata servers
  - Multiple supervisors that are in charge of system monitoring, fault recovery, replica management, garbage collection

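Metadata server implementation

- Basic principles
  - Must implement automatic failure recovery and failover of the metadata service after a node goes down, keeping the metadata service as available as possible
  - To support diverse workloads, the metadata servers must be scalable
  - Minimize the number of interactions between metadata nodes and other nodes to lower the load on the metadata nodes
- Files are organized into a traditional tree
- Read-write locks
- Deduplicated control lists

Data server implementation

- Files are split into chunks of 32 MB
- Each chunk corresponds to one physical file in the Linux file system
- A 128-bit chunk ID is generated with the UUID algorithm
- The MD5 value of the chunk file data is recorded to check the integrity of the stored data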
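
The data-server bookkeeping described above (32 MB chunks, a 128-bit UUID per chunk, an MD5 digest for integrity checks) maps directly onto the Python standard library. This is a sketch under assumptions: the slide does not say which UUID variant is used, so random `uuid4` IDs are chosen here, and chunks stay in memory instead of being written as one Linux file each.

```python
import hashlib
import uuid

CHUNK_SIZE = 32 * 1024 * 1024  # 32 MB chunks, as on the slide


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Return a list of (chunk_id, md5_hex, chunk_bytes) records."""
    records = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        chunk_id = uuid.uuid4()                   # random 128-bit identifier (assumed variant)
        digest = hashlib.md5(chunk).hexdigest()   # recorded alongside the chunk
        records.append((chunk_id, digest, chunk))
    return records


def is_intact(record):
    """Recompute the MD5 of the stored bytes and compare with the recorded value."""
    _chunk_id, digest, chunk = record
    return hashlib.md5(chunk).hexdigest() == digest


# Tiny chunk size so the demo stays small.
records = split_into_chunks(b"0123456789", chunk_size=4)
print([len(chunk) for _, _, chunk in records], all(is_intact(r) for r in records))
```

MD5 is adequate here for catching accidental corruption; it should not be relied on against deliberate tampering.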

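Supervisor Implementation

Small-file optimization based on inlining and heat statistics

- In distributed file systems that separate data from metadata, small-file performance is limited mainly by network latency; this technique uses inlining and heat statistics to improve it
- Results
  - With inline data, small-file performance improves by about 2x
  - Data migration balances the performance gain of inline data against the metadata-server overhead it adds
- File inlining
  - For small files, the data is kept inside the metadata
  - When a file is opened, the data is sent to the client together with the metadata, eliminating the time to compute the data location and the communication with the object store
- Heat-statistics-based migration of inline data
  - Frequent writes to inline data may increase the load on the metadata server
  - The client automatically tracks the write heat of inline data
  - When to migrate inline data
    - The file size exceeds the threshold
    - The heat exceeds the defined threshold

[Figure: time (seconds) against number of files (1000, 2000, 3000), without inline data and with inline data]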
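
The inlining and migration rules above can be sketched as follows. Several things are assumptions made for the example: both threshold values, the use of a simple write counter as the "heat" measure, and keeping that counter on the metadata server (the slide says the client tracks heat; the counter is placed server-side here only to keep the sketch short).

```python
INLINE_MAX_BYTES = 4096   # hypothetical size threshold
HEAT_MAX_WRITES = 3       # hypothetical write-heat threshold


class MetadataServer:
    def __init__(self):
        self.meta = {}           # name -> {"inline": bytes or None, "writes": int}
        self.object_store = {}   # stand-in for the separate data servers

    def write(self, name, data):
        entry = self.meta.setdefault(name, {"inline": None, "writes": 0})
        entry["writes"] += 1
        too_big = len(data) > INLINE_MAX_BYTES
        too_hot = entry["writes"] > HEAT_MAX_WRITES
        if too_big or too_hot:
            # Migrate: the data leaves the metadata and goes to the data servers.
            entry["inline"] = None
            self.object_store[name] = data
        else:
            entry["inline"] = data
            self.object_store.pop(name, None)

    def open(self, name):
        """Return (data, round_trips): inline data arrives with the metadata."""
        entry = self.meta[name]
        if entry["inline"] is not None:
            return entry["inline"], 1
        return self.object_store[name], 2   # extra hop to the data server


mds = MetadataServer()
mds.write("small.cfg", b"key=value")
print(mds.open("small.cfg"))
```

The point of the migration rule is visible in `write`: a file that is rewritten often stops being inlined, so its write traffic no longer lands on the metadata server.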

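Massive-file storage technology based on the Set model

- Goal: meet the needs of files at the hundred-billion scale, providing TB-level data and rapid operations support
- Ideas
  1. Propose the data Set model: deploy, expand, and manage in units of a Set
  2. Separate the file index from the data; file data is located jointly through the file index and the disk data index, and the disk data index is held entirely in memory for efficient I/O
  3. A capacity-balancing scheduling algorithm across Sets assigns newly added capacity according to Set state and space utilization
- Results: solves the problem of hundred-billion-scale files for a photo album service; 500+ billion photos, 300+ million added per day, 100+ PB in volume

[Figure: access layer, master holding the file index, index server (Idx-master), and storage Sets, each mapping fid -> offset on its storage servers. Writes update the file index <file name, chid, fid> and store the file data; reads fetch file data by <chid, fid>.]

SSD-based high-performance distributed storage technology

- Goal: for hundreds of services and trillions of small records with no hot spots, provide high-concurrency, low-latency, low-cost storage
- Ideas
  1. Propose a single-machine resource multiplexing model: a machine's space and I/O resources are divided into fixed small units, with fair I/O among units
  2. Hybrid indexing gives a local data index with low memory overhead and efficient I/O: small records use a hash index to reduce the number of index entries, and large records are indexed individually to improve I/O efficiency
  3. SSD write optimization at the application layer: a write cache gives low-latency responses, while dynamic indexing and write merging turn highly concurrent random writes of small records into infrequent large-block writes
- Scale: 10+ TB, small records (<100 bytes)
- Results: provides the base data service for SNS; high data reliability; high-density reads and writes at 400,000+ per second; long tail, no hot spots

[Figure: access layer in front of SSD storage servers; each server holds 2 GB storage units with a shared-memory write cache, a hybrid index, and unit-fair read/write I/O scheduling. Read path: 1. read pending updates, 2. read the index, 3. read SSD data by index. Write path: 1. take pending updates, 2. pack them into fixed-length large data blocks and write, 3. update the index. Other labels: Get/Set/Del, metadata management, incremental synchronization.]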
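
The write path in the SSD design (buffer small writes, pack them into one large block, then update the index) can be sketched with an in-memory stand-in for the device. The class name, the 64-byte block size, and the flat `(offset, length)` index are assumptions for the demo; the real system's hybrid index and fixed-length block format are not described in enough detail to reproduce.

```python
BLOCK_BYTES = 64   # tiny for the demo; real blocks would be far larger


class CoalescingStore:
    def __init__(self):
        self.log = bytearray()   # stand-in for the SSD, written in large blocks
        self.index = {}          # key -> (offset, length) in the log
        self.buffer = {}         # write cache: key -> latest value
        self.buffered_bytes = 0
        self.flushes = 0

    def set(self, key, value):
        """Acknowledge from the write cache; flush once a block's worth piles up."""
        self.buffered_bytes += len(value) - len(self.buffer.get(key, b""))
        self.buffer[key] = value
        if self.buffered_bytes >= BLOCK_BYTES:
            self.flush()

    def flush(self):
        """Pack buffered records into one sequential write, then update the index."""
        block = bytearray()
        new_entries = {}
        for key, value in self.buffer.items():
            new_entries[key] = (len(self.log) + len(block), len(value))
            block += value
        self.log += block             # one large write instead of many small ones
        self.index.update(new_entries)
        self.buffer.clear()
        self.buffered_bytes = 0
        self.flushes += 1

    def get(self, key):
        if key in self.buffer:        # the newest value may still be in the cache
            return self.buffer[key]
        offset, length = self.index[key]
        return bytes(self.log[offset:offset + length])


store = CoalescingStore()
for i in range(16):
    store.set(f"k{i}", b"12345678")
print(store.flushes, len(store.log), store.get("k3"))
```

Sixteen 8-byte writes reach the device as two 64-byte writes. Overwritten values leave dead space in the log, so a real design would also need garbage collection, which this sketch omits.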

IceFS: Separating Physical Structure

- Source: Physical Disentanglement in a Container-Based File System, OSDI 2014
- New abstraction: cube
  - Enables the grouping of files and directories inside a physically isolated container
- Benefits
  - Localized reaction to faults
  - Fast recovery
  - Concurrent file-system updates

Using New Media

- DRAM Management
  - LRU block replacement
- Flash Management
  - Segment = a set of blocks / erasing unit
  - Segment list (Free/Clean/Dirty)
  - Segment replacement (FIFO or LRU)
- Disk Management
  - Power management by spin up/down
- Source: FLASHCACHE [HICSS'94]
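
LRU block replacement, named above for the DRAM layer, is easy to sketch with `collections.OrderedDict`. The class and method names are hypothetical, and this illustrates the policy only; it is not FLASHCACHE's implementation.

```python
from collections import OrderedDict


class LRUBlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block number -> data, least recent first

    def get(self, block_no):
        if block_no not in self.blocks:
            return None                      # miss: caller would go to flash or disk
        self.blocks.move_to_end(block_no)    # mark as most recently used
        return self.blocks[block_no]

    def put(self, block_no, data):
        """Insert or update a block; return the evicted block number, if any."""
        self.blocks[block_no] = data
        self.blocks.move_to_end(block_no)
        if len(self.blocks) > self.capacity:
            evicted, _ = self.blocks.popitem(last=False)   # least recently used
            return evicted
        return None


cache = LRUBlockCache(capacity=2)
cache.put(1, b"a")
cache.put(2, b"b")
cache.get(1)                 # block 1 is now more recent than block 2
print(cache.put(3, b"c"))    # evicts the least recently used block
```

In a layered design like the one on the slide, the evicted block would be handed to the flash layer, which applies its own segment-level replacement.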

Using New Media

- To reduce the power consumption of disk
- NVCache
  - To reduce disk power consumption by combining adaptive disk spin-down algorithm
  - To extend spin-down periods by undertaking i
