
Microarray and Bioinformatics
Dr Jingfu Qiu (邱景富), School of Public Health

Aims for the Microarray Bioinformatics
  • Understand basic microarray technology and its use in gene expression analysis.
  • Learn basic data analysis methods and how to apply them to gene expression data: data acquisition, data normalization, data analysis, and data clustering.

Vocabulary review
  • Gene: a hereditary DNA sequence at a specific location on a chromosome.
  • Genetics: the study of heredity and variation in organisms.
  • Genome: an organism's total DNA content (its full DNA sequence).
  • Genomics: the study of organisms in terms of their genomes.
  • On 12 February 2002, the Human Genome Project, ten years in the making at a cost of 2 billion US dollars, was finally completed, reporting 99% of the human genome sequence.

Vocabulary review (cont.)
  • Protein: a sequence of amino acids that "does something".
  • Proteomics: the study of all the proteins that can be produced from an organism's genome.
  • Bioinformatics: the collection, organization, and analysis of large-scale, complex biological data.
  • Functional genomics: the study of genome function as a whole, including expression profiles at the mRNA level and the protein level.

Microarray
  • A high-throughput technology that detects thousands of genes simultaneously.
  • Also called gene chip, biochip, or array.
  • Relies heavily on computational support.
  • The central platform for functional genomics.

Types of microarrays
  • DNA microarrays, such as cDNA microarrays and oligonucleotide microarrays
  • MMChips, for surveillance of microRNA populations
  • Protein microarrays
  • Tissue microarrays
  • Cellular microarrays (also called transfection microarrays)
  • Chemical compound microarrays
  • Antibody microarrays

Types of DNA microarrays
1. cDNA chip (DNA microarray, two-channel array)
  • Probe cDNA (500 to 5,000 bases long) is immobilized on a solid surface such as glass by robotic spotting.
  • Traditionally called a DNA microarray.
  • First developed at Stanford University.
2. Gene chip (DNA chip, Affymetrix chip)
  • Oligonucleotides (20 to 80-mer) are synthesized either in situ (on-chip) or by conventional synthesis followed by on-chip immobilization.
  • Historically called DNA chips.
  • Developed at Affymetrix, Inc. under the GeneChip trademark.
  • Many companies manufacture oligonucleotide-based chips using alternative technologies.

History
  • HGP (Human Genome Project): proposed by Dulbecco on 7 March 1986 and started in October 1990; it called for rapid and sensitive techniques for analysing human genome information.
  • 1980s: the idea was proposed by analogy with the computer chip; W. Bains made the first attempts.
  • 1990s: Stephen Fodor (now president of Affymetrix) made it work.
  • 1995: publication of "Quantitative monitoring of gene expression patterns with a complementary DNA microarray".
  • End of 1996: the first DNA chip.

Microarrays are popular
  • NYU Med Center now collects about 3 GB of microarray data per week (60 chips, 6-10 different experiments).
  • A PubMed search for "microarray" returns 24,431 papers.

What problems can it solve?
  • Differences in gene expression over time, between tissues, and between disease states
  • Identification (diagnosis) of complex genetic diseases
  • Drug discovery and toxicology studies
  • Mutation and polymorphism detection (SNPs)
  • Pathogen analysis (diagnosis)

Features
  • Parallelism: thousands of genes simultaneously
  • Miniaturization: small chip size
  • Multiplexing (high throughput): multiple samples at the same time
  • Automation: chip manufacturing, reagents

Differential gene expression
A few examples:
  • Cell-type specific, e.g. skin cell vs. brain cell
  • Developmental stage, e.g. embryonic skin cell vs. adult skin cell
  • Disease state, e.g. normal skin cell vs. skin tumor cell
  • Environment-specific, e.g. skin cell untreated vs. treated (drugs, toxins)

What are its pitfalls?
  • It detects transcription (mRNA level), not translation (protein level).
  • Many sources of variation can affect the result: chip and probe design, experiment design, sample preparation, image acquisition, data normalization, data analysis, ...
  • Key to success: you need to understand both the biological problem and the computational tools (software, statistics).

Requirements
  • Array spotter
  • Array scanner
  • Chemistry (hybridization) systems
  • Software

Market forecast
  • 1999: 1 billion USD
  • Within 5 years: 20 billion
  • 2005: 5 billion (USA)
  • 2010: 40 billion (USA), not including disease diagnostics
  • Expected to replace microelectronics as the largest industry

Principle
  • Similar to Northern blotting: base pairing, i.e. hybridization between nucleic acids.
  • Major differences from Northern blotting:
      • Detects thousands of genes simultaneously rather than individual genes
      • Probes are fixed on a glass slide or nylon membrane
      • Target samples are labeled with fluorescent or radioactive dNTPs

Designing the probes
  • Specificity: probes must be highly specific to avoid hybridizing with the wrong target molecules.
  • Readability: probes must produce output that is easy to read (spots in defined positions, of regular size and shape, evenly spaced).
  • Sensitivity: probes must be sensitive enough to detect the mRNA, and spot intensity must be distinguishable from background noise.
  • Reproducibility: results must be reproducible across multiple experiments.

Spotting process
[Figures: spotting pins; spotting robot (Cheung et al. 1999); Affymetrix gene chip; detection of differential expression]

Comparison of probe types

In-situ synthesis / oligos
Advantages
  • No need to isolate and purify cDNAs, because oligonucleotides can be synthesized.
  • Short oligonucleotides are less likely to cross-react with other sequences in the target DNA.
  • Chip density is higher than with cDNAs.
Limitations
  • The sequence has to be known.
  • Synthesis can be expensive and time-consuming.
  • Short sequences are not as specific for the target DNA, so appropriate controls must be added.

PCR products / cDNA probes
Advantages
  • Flexibility to study cDNAs from any source.
  • cDNAs do not require any a priori information about the corresponding genes.
  • Longer sequences increase hybridization specificity, which reduces false positives.
Limitations
  • Isolating individual cDNAs to immobilize on each spot can be cumbersome.
  • Density is lower than when oligonucleotides are synthesized on the chip surface.
  • cDNAs are longer sequences and are more likely to contain, by chance, sequences found in the target DNA, which results in cross-reactivity.

Many other variations of the technology exist, such as the use of longer oligos or of fibre optics.

                    Spotted arrays          Affymetrix arrays
Source              Homemade                Commercially available
Design              Tailored                "Off the rack"
Cost                Cheaper?                More expensive?
Maximum features    24,000 per array        500,000 per array
Variability         Prone to variability    Less variability

Process of manufacturing a microarray
  • Start with individual genes, e.g. the 4,200 genes of the Y. pestis genome.
  • Amplify all of them using the polymerase chain reaction (PCR).
  • "Spot" them on a medium, e.g. an ordinary glass microscope slide.
  • Each spot is about 100 µm in diameter.
  • Spotting is done by a robot.
  • A complex and potentially expensive task.

[Figure: chip layout. A 4 × 8 matrix of blocks (B1 to B32), each a 17 × 17 spot grid; 8,448 spots in total; 4,005 Y. pestis genes plus several control DNAs; each sample spotted twice in adjacent positions.]
[Figure: development of the whole-genome chip. Gene selection (4,015 genes); primer design; PCR amplification of the genes (4,005 genes); purification and concentration of the products; chip spotting.]

[Figure: analysis workflow. Biological question (differentially expressed genes, sample class prediction, etc.); experimental design; microarray experiment; 16-bit TIFF files; image analysis, giving (Rfg, Rbg), (Gfg, Gbg); normalization, giving R, G; estimation, testing, clustering, discrimination; biological verification and interpretation.]

Microarray steps
  • Experiment and data acquisition
      • Chip manufacturing
      • Sampling and labeling
      • Hybridization
      • Image scanning
      • Data acquisition
  • Data normalization
  • Data analysis
  • Biological interpretation

Reading an array (cont.)

Block  Column  Row  Gene name  Red    Green  Red:Green ratio
1      1       1    tub1       2,345  2,467  0.95
1      1       2    tub2       3,589  2,158  1.66
1      1       3    sec1       4,109  1,469  2.80
1      1       4    sec2       1,500  3,589  0.42
1      1       5    sec3       1,246  1,258  0.99
1      1       6    act1       1,937  2,104  0.92
1      1       7    act2       2,561  1,562  1.64
1      1       8    fus1       2,962  3,012  0.98
1      1       9    idp2       3,585  1,209  2.97
1      1       10   idp1       2,796  1,005  2.78
1      1       11   idh1       2,170  4,245  0.51
1      1       12   idh2       1,896  2,996  0.63
1      1       13   erd1       1,023  3,354  0.31
1      1       14   erd2       1,698  2,896  0.59

Color coding
  • Tables are difficult to read, so the data are presented on a color scale.
  • Coding scheme:
      • Green = gene repressed (less mRNA) in the experiment
      • Red = gene induced (more mRNA) in the experiment
      • Black = no change (1:1 ratio)
  • Or:
      • Green = control condition (e.g. aerobic)
      • Red = experimental condition (e.g. anaerobic)
  • Only the ratio is used.

Noise
Noise sources:
  • Sample preparation, labeling, amplification
  • Reaction variations
  • Environment
  • Target volume
  • Hybridization parameters (temperature, time, ...)
  • Nonspecific hybridization
  • Dust
  • Scanner settings
  • Quantization

Other image processing problems
Spot quality problems (typical problems of raw output):
  • Uneven grid positions
  • Curves within a grid
  • Variable spot size or shape
  • Variable distance between spots
[Figure: two slides, P04 vs. P01 (pg2) and A1 vs. P01 (pg2)]

Noise filtering
  • Gridding: identify spot locations.
  • Segmentation: distinguish foreground from background.
      • Fixed circle: put a circle around the foreground area.
      • Seeded region growing: identify initial spot "seeds" and grow high-intensity regions.
      • Edge detection algorithms.
  • Background cancellation: Intensity = FG intensity - BG intensity
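The background cancellation and ratio steps above can be sketched in a few lines of Python. This is only an illustration: the function names, the background level of 200, and the tolerance `tol` that decides when a spot counts as "no change" are all assumptions made for the example and do not come from the slides.

```python
import math

def spot_ratio(fg_red, bg_red, fg_green, bg_green):
    """Background-corrected red:green ratio of one spot (None if unusable)."""
    red = fg_red - bg_red        # Intensity = FG intensity - BG intensity
    green = fg_green - bg_green
    if red <= 0 or green <= 0:
        return None              # signal does not rise above the background
    return red / green

def color_code(ratio, tol=0.1):
    """Map a ratio to a display color; tol (in log2 units) is an assumed cut-off."""
    log_ratio = math.log2(ratio)
    if log_ratio > tol:
        return "red"             # induced: more mRNA in the experiment
    if log_ratio < -tol:
        return "green"           # repressed: less mRNA in the experiment
    return "black"               # roughly 1:1, no change

# Hypothetical raw readings: a background of 200 added to the tabulated values.
tub1 = spot_ratio(2545, 200, 2667, 200)
sec1 = spot_ratio(4309, 200, 1669, 200)
sec2 = spot_ratio(1700, 200, 3789, 200)
for name, ratio in [("tub1", tub1), ("sec1", sec1), ("sec2", sec2)]:
    print(name, round(ratio, 2), color_code(ratio))
```

Working on the log2 scale treats a two-fold increase and a two-fold decrease symmetrically, which a raw ratio does not.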

Normalization
The term normalization covers the techniques used to transform the data suitably before they are analysed. The goal is to correct for systematic differences between samples on the same slide, or between slides, that do not represent true biological variation between samples.
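As a rough sketch of one such correction, the Python below centers the log ratios of a slide by subtracting the log ratio of the average channel intensities. It assumes background-corrected intensities as input; the function name and the sample numbers (a red channel read at half strength) are hypothetical and serve only to show the effect.

```python
import numpy as np

def normalize_log_ratios(red, green):
    """Global normalization: log2(R/G) minus log2(mean R / mean G)."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    raw = np.log2(red / green)                      # per-spot log ratio
    offset = np.log2(red.mean() / green.mean())     # slide-wide channel imbalance
    return raw - offset

# Hypothetical slide: true ratios 1, 2, 0.5, 1, but red is under-detected by half.
green = [1000.0, 2000.0, 4000.0, 8000.0]
red = [500.0, 2000.0, 1000.0, 4000.0]
normalized = normalize_log_ratios(red, green)
print(normalized)
```

A single global offset like this cannot remove biases that change with spot intensity or position on the array; those need more local corrections.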

Normalization (cont.)
Normalize the data to correct for artificial variance:
  Red = FGred - BGred
  Green = FGgreen - BGgreen
  PixelValue = log2(Red/Green) - log2(Redavg/Greenavg)
Pixel color: green if the pixel value is < 0, red if it is > 0.

[Figure: calibrated scan, red and green equally detected; uncalibrated scan, red light under-detected]

The origin of systematic differences
Systematic differences arise from:
  • Dye biases, which vary with spot intensity
  • Location on the array
  • Plate origin
  • Printing quality, which may vary between pins and with time of printing
  • Scanning parameters

DNA array data acquisition
  • Image analysis software packages exist for analysing the output of custom-made chips (e.g. GenePix Pro, Array Vision, TIGR Spot Finder).
  • A chip description file (CDF) is needed for the probe locations.

Introduction to software: SAM
Significance Analysis of Microarrays
  • Tusher, Tibshirani and Chu (2001): Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98: 5116-5121 (24 April 2001).
  • Excel plugin
  • Free
  • Permutation based
  • The most widely published method of microarray data analysis

Choosing Δ = 0.5 produces about 65 significant genes and about 5.9 false positives on average. The choice of Δ is up to the user, depending on how many false positives he or she is comfortable with.
The False Discovery Rate (FDR) is computed as the median (or 90th percentile) of the number of falsely called genes divided by the number of genes called significant.

Handling missing data
SAM currently offers two options for imputing missing values:
  • Row average: each missing value is imputed with the average of the non-missing values for that gene.
  • K-nearest neighbor (the default): missing values are imputed using a k-nearest-neighbor average in gene space (default k = 10).

Clustering
  • Hypothesis: genes with similar function have similar expression profiles.
  • Find groups of genes with similar expression profiles.
  • Find groups of individuals with similar expression profiles within a population.
  • Clustering = group identification

Clustering steps
  • Choose a similarity metric to compare the transcriptional responses or expression profiles (feature extraction and pattern representation):
      • Pearson correlation
      • Spearman correlation
      • Euclidean distance
  • Choose a clustering algorithm:
      • Hierarchical
      • K-means

Clustering algorithms
  • Unsupervised analysis: hierarchical, k-means, self-organizing maps, others
  • Supervised analysis: classification rules

Hierarchical clustering example
Eisen et al. (1998), PNAS, 95(25): 14863-14868
/cgi/content/full/95/25/14863

Steps of hierarchical clustering
1. Treat each of the n samples as a cluster of its own.
2. Compute the pairwise distances between the n samples to form a distance matrix.
3. Merge the two closest clusters into a new cluster.
4. Compute the distances between the new cluster and the remaining clusters. Repeat merging and computing until only one cluster is left.
5. Draw the dendrogram, choose the distance cut-off and the resulting groups, and interpret them.
In SPSS: Analyze - Classify - Hierarchical

Hierarchical clustering
Similarity matrix:

        g1      g2      g3      g4      g5
g1              0.23    0.00    0.95   -0.63
g2                      0.91    0.56    0.56
g3                              0.32    0.77
g4                                     -0.36
g5
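The merge procedure can be sketched in Python using the five-gene similarity matrix above as input. Because the matrix holds similarities rather than distances, the sketch joins the pair with the largest value. How the similarity between merged clusters is recomputed is an assumption here (average linkage over all cross-cluster pairs), so the recomputed values it prints need not match those on the slides, which may use a different update rule; the function name `agglomerate` is likewise illustrative.

```python
import numpy as np

labels = ["g1", "g2", "g3", "g4", "g5"]
upper = {  # pairwise similarities from the matrix above (0-based gene indices)
    (0, 1): 0.23, (0, 2): 0.00, (0, 3): 0.95, (0, 4): -0.63,
    (1, 2): 0.91, (1, 3): 0.56, (1, 4): 0.56,
    (2, 3): 0.32, (2, 4): 0.77,
    (3, 4): -0.36,
}
sim = np.eye(len(labels))
for (i, j), s in upper.items():
    sim[i, j] = sim[j, i] = s   # fill both triangles so lookups are symmetric

def agglomerate(sim, labels):
    """Average-linkage agglomeration on a similarity matrix; returns the merges."""
    clusters = [[i] for i in range(len(labels))]   # start with one cluster per item
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # assumed linkage: mean similarity over all cross-cluster pairs
                s = float(np.mean([sim[i, j] for i in clusters[a] for j in clusters[b]]))
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append(([labels[i] for i in clusters[a]],
                       [labels[i] for i in clusters[b]], s))
        clusters[a] = clusters[a] + clusters[b]    # join the most similar pair
        del clusters[b]
    return merges

merges = agglomerate(sim, labels)
for left, right, s in merges:
    print(left, "+", right, "joined at similarity", round(s, 3))
```

Cutting the resulting sequence of merges at a chosen similarity threshold gives the final groups, which corresponds to choosing the cut-off on the dendrogram.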

Hierarchical clustering (cont.)
Find the largest value in the similarity matrix, join the two clusters, recompute the similarity matrix, and iterate.

After joining g1 and g4:

          g1,g4   g2      g3      g5
g1,g4             0.37    0.16   -0.52
g2                        0.91    0.56
g3                                0.77
g5

After joining g2 and g3:

          g1,g4   g2,g3   g5
g1,g4             0.27   -0.52
g2,g3                     0.68
g5

Interpreting the results
[Figure: dendrogram with leaves in the order g1, g4, g2, g3, g5]
2 clusters? 3 clusters?

k-means clustering
k-means clustering is a widely known method that assigns cluster membership by minimizing the differences among the items within a cluster while maximizing the distance between clusters. The "means" in k-means refers to the "midpoint" of a cluster: an arbitrarily chosen data point that is then refined repeatedly until it truly represents the mean of all the data points in that cluster. The "k" refers to an arbitrary number of points used to seed the clustering process. The k-means algorithm computes the squared Euclidean distances between the data records in a cluster and the vector representing the cluster mean, and when that sum reaches its minimum, on the final set of k clusters ...
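A minimal Python sketch of the iteration described above follows, written with NumPy. The function name, the fixed random seed, the iteration cap, and the toy expression profiles are assumptions chosen for illustration; this is the basic textbook loop, not the implementation of any particular package.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means: returns (labels, centroids, sum of squared distances)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # seed the process with k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every centroid
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each centroid to the mean of its members (keep it if it has none)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break                                   # assignments have stabilized
        centroids = updated
    # final assignment and within-cluster sum of squared distances
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = float(d2[np.arange(len(points)), labels].sum())
    return labels, centroids, sse

# Hypothetical expression profiles: three induced genes and two repressed genes.
profiles = [[1.0, 1.1, 0.9], [0.9, 1.0, 1.0], [1.1, 0.9, 1.0],
            [-1.0, -0.9, -1.1], [-0.9, -1.1, -1.0]]
labels, centroids, sse = kmeans(profiles, k=2)
print(labels, round(sse, 3))
```

Because the result depends on the initial seeds, it is common to run the procedure from several starting points and keep the run with the smallest sum of squared distances.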
