Gene expression: Microarray data analysis
Jonathan Pevsner, Ph.D.
Bioinformatics, Johns Hopkins School of Medicine (M.E:800.707)
January 5, 2009

Copyright notice
Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright 2003 by John Wiley [...]

[...] be sure to create an appropriately balanced, randomized experimental design.

Stage 3: Hybridization to DNA arrays (Page 178-179)
- The array consists of cDNA or oligonucleotides
- Oligonucleotides can be deposited by photolithography
- The sample is converted to cRNA or cDNA
(Note that the terms "probe" and "target" may refer to the element immobilized on the surface of the microarray, or to the labeled biological sample; for clarity, it may be simplest to avoid both terms.)

Microarrays: array surface (Fig. 6.18, Page 179)
Southern et al. (1999) Nature Genetics, microarray supplement

Stage 4: Image analysis (Page 180)
- RNA transcript levels are quantitated
- Fluorescence intensity is measured with a scanner, or radioactivity with a phosphorimager

Differential Gene Expression on a cDNA Microarray (Fig. 6.19, Page 180; Fig. 6.20, Page 181)
[Figure: Rett versus Control.] αB-crystallin is over-expressed in Rett Syndrome.

Stage 5: Microarray data analysis (Page 180)
Hypothesis testing
- How can arrays be compared?
- Which RNA transcripts (genes) are regulated?
- Are differences authentic?
- What are the criteria for statistical significance?
Clustering
- Are there meaningful patterns in the data (e.g. groups)?
Classification
- Do RNA transcripts predict predefined groups, such as disease subtypes?

Stage 6: Biological confirmation (Page 182)
Microarray experiments can be thought of as "hypothesis-generating" experiments. The differential up- or down-regulation of specific RNA transcripts can be measured using independent assays such as
- Northern blots
- polymerase chain reaction (RT-PCR)
- in situ hybridization

Stage 7: Microarray databases (Page 182)
There are two main repositories:
- Gene Expression Omnibus (GEO) at NCBI
- ArrayExpress at the European Bioinformatics Institute (EBI), http://www.ebi.ac.uk/arrayexpress/

MIAME (Page 182)
In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME). The MIAME framework standardizes six areas of information:
- experimental design
- microarray design
- sample preparation
- hybridization procedures
- image analysis
- controls for normalization

Outline: microarray data analysis
- Gene expression
- Microarrays
- Preprocessing: normalization, scatter plots
- Inferential statistics: t-test, ANOVA
- Exploratory (descriptive) statistics: distances, clustering, principal components analysis (PCA)

Microarray data analysis: begin with a data matrix (gene expression values versus samples) (Fig. 7.1, Page 190)
- The matrix holds genes (RNA transcript levels) versus samples
- Typically, there are many genes (> 20,000) and few samples (~ 10)

Fig. 7.1 (Page 190) labels three stages of analysis of the data matrix: preprocessing, inferential statistics, and descriptive statistics.

Microarray data analysis: preprocessing (Page 191)
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
- different labeling efficiencies of Cy3, Cy5
- uneven spotting of DNA onto an array surface
- variations in RNA purity or quantity
- variations in washing efficiency
- variations in scanning efficiency

Microarray data analysis: preprocessing (Page 191)
The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment.

Data analysis: global normalization (Page 192)
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.

Example: total fluorescence in the Cy3 channel = 4 million units; in the Cy5 channel = 2 million units. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.

Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)

Scatter plots (Page 193)
- Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
- Each dot corresponds to a gene expression value
- Most dots fall along a line
- Outliers represent up-regulated or down-regulated genes

[Fig. 7.2, Page 193: Differential Gene Expression in Different Tissue and Cell Types. Samples: Brain, Astrocyte, Fibroblast. Axes: expression level (sample 1) versus expression level (sample 2), from low to high; outliers are marked up or down.]

[Fig. 7.3, Page 195: Log-log transformation.]

Scatter plots (Page 194, 197)
Typically, data are plotted on log-log coordinates. Visually, this spreads out the data and offers symmetry.

time    behavior      raw ratio value    log2 ratio value
t=0     basal         1.0                 0.0
t=1h    no change     1.0                 0.0
t=2h    2-fold up     2.0                 1.0
t=3h    2-fold down   0.5                -1.0

[Fig. 7.4, Page 196: Log ratio versus mean log intensity; expression level runs from low to high, and outliers are marked up or down.]

You can make these plots in Excel, but for many bioinformatics applications use R, which is available as a free download. See chapter 9 (2nd edition) for a tutorial on microarray data analysis.

[Figure: plots of M versus A.] After RMA (a normalization procedure), the median is near zero, and skewing is corrected. Scatterplots display the effects of normalization.

SNOMAD converts array data to scatter plots (Page 196-197)
[Figure: EXP versus CON on a linear-linear plot and on a log-log plot, and Log10(Ratio) versus Mean(Log10(Intensity)); 2-fold lines are marked.]

SNOMAD corrects local variance artifacts (Page 196-197)
[Figure: Log10(Ratio) versus Mean(Log10(Intensity)) with a robust local regression fit; the residuals give the Corrected Log10(Ratio), plotted against Mean(Log10(Intensity)). 2-fold lines are marked.]

SNOMAD describes regulated genes in Z-scores
[Figure: Corrected Log10(Ratio) versus Mean(Log10(Intensity)), with the locally estimated standard deviation of positive ratios (Z = 1) and of negative ratios (Z = -1); Local Log10(Ratio) Z-score versus Mean(Log10(Intensity)), with lines at Z = ±1, ±2, and ±5 alongside the 2-fold lines.]
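The two-step global normalization procedure and the log2 ratio transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: the function names, the small positive floor applied after background subtraction, and the choice to rescale the Cy5 channel are all assumptions made for the example.

```python
import math

def global_normalize(cy5, cy3, background_cy5=0.0, background_cy3=0.0):
    """Background-subtract two channels, then rescale Cy5 so the mean ratio is 1."""
    # Step 1: subtract background. Values are clamped to a small positive
    # floor (an assumption) so that ratios and logarithms stay defined.
    floor = 1e-9
    r = [max(v - background_cy5, floor) for v in cy5]
    g = [max(v - background_cy3, floor) for v in cy3]
    # Step 2: divide Cy5 by the average Cy5/Cy3 ratio so the average ratio becomes 1.
    ratios = [ri / gi for ri, gi in zip(r, g)]
    mean_ratio = sum(ratios) / len(ratios)
    r_norm = [ri / mean_ratio for ri in r]
    return r_norm, g

def log_ratio_and_mean_intensity(r, g):
    """Return the log2 ratio and the mean log2 intensity for each gene."""
    log_ratio = [math.log2(ri / gi) for ri, gi in zip(r, g)]
    mean_log = [0.5 * (math.log2(ri) + math.log2(gi)) for ri, gi in zip(r, g)]
    return log_ratio, mean_log

# Cy5 is uniformly half as bright as Cy3, as in the 2,000 versus 1,000 example.
cy5 = [1000.0, 2000.0, 500.0]
cy3 = [2000.0, 4000.0, 1000.0]
r_norm, g = global_normalize(cy5, cy3)
log_ratio, mean_log = log_ratio_and_mean_intensity(r_norm, g)
print(log_ratio)
```

In this toy case every gene shows an apparent 2-fold difference before normalization and a log2 ratio of zero afterwards, which is the artifact the procedure is meant to remove.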

Robust multi-array analysis (RMA)
- Developed by Rafael Irizarry (Dept. of Biostatistics), Terry Speed, and others
- Available as an R package
- Also available in various software packages (including Partek, and Iobion Gene Traffic)
- See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4

There are three steps:
1. Background adjustment based on a normal plus exponential model (no mismatch data are used)
2. Quantile normalization (nonparametric fitting of signal intensity data to normalize their distribution)
3. Fitting a log scale additive model robustly. The model is additive: probe effect + sample effect

[Figure: Histograms of raw intensity values (log signal intensity, by array) for 14 arrays (plotted in R) before and after RMA was applied.]

[Figure: RMA adjusts for the effect of GC content. Axes: log intensity versus GC content.]
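Step 2 of RMA, quantile normalization, can be illustrated with a short sketch. The code below is a simplified stand-in written for this example and is not the implementation used by the RMA R package: the function name is hypothetical, tied values are ranked by position rather than averaged, and background adjustment and model fitting (steps 1 and 3) are not shown.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equal-length lists, one list of intensities per array.
    Ties are broken by position, which is a simplification.
    """
    n_probes = len(arrays[0])
    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # order[k] is the index of the k-th smallest value in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            # Replace each value by the cross-array mean for its rank.
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(arrays))
# → [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step each array contains exactly the same set of values, so histograms of the arrays coincide, while the rank order of probes within each array is preserved.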

Precision and accuracy
- Precision: good performance, meaning reproducibility of the result
- Accuracy: good quality of the result (relative to a gold standard)

Robust multi-array analysis (RMA)
RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software).
[Figure: precision. Log expression SD versus average log expression, for RMA and MAS 5.0.]

Robust multi-array analysis (RMA)
RMA offers comparable accuracy to MAS 5.0.
[Figure: accuracy. Observed log expression versus log nominal concentration.]

Inferential statistics (Page 199)
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: "There is no difference in signal intensity for the gene expression measurements in normal and diseased samples." The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to reject the null hypothesis. For many applications, we set the significance level α to p < 0.05.

Inferential statistics (Page 199)
A t-test is a commonly used test to assess the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$

Questions
- Is the sample size (n) adequate?
- Are the data normally distributed?
- Is the variance of the data known?
- Is the variance the same in the two groups?
- Is it appropriate to set the significance level to p < 0.05?

Notes
- t is a ratio (it thus has no units)
- We assume the two populations are Gaussian
- The two groups may be of different sizes
- Obtain a P value from t using a table
- For a two-sample t test, the degrees of freedom is N - 2, where N is the total number of observations in both groups
- For any value of t, P gets smaller as df gets larger
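The t statistic defined above can be computed by hand for a single transcript. The sketch below assumes the equal-variance (pooled) form of the two-sample t-test, which matches the N - 2 degrees of freedom in the notes; the function name and the example values are made up for illustration. As the notes say, the P value would then be looked up from t and df in a table.

```python
import math
from statistics import mean, variance

def two_sample_t(group1, group2):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    # Standard error of the difference between the two means.
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t = (mean(group1) - mean(group2)) / se
    return t, df

# Hypothetical expression values for one transcript: three controls, three disease.
control = [10.0, 12.0, 11.0]
disease = [20.0, 22.0, 21.0]
t, df = two_sample_t(control, disease)
print(round(t, 3), df)
# → -12.247 4
```

The two groups may have different sizes, since n1 and n2 enter the standard error separately.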

Analyzing expression data in Excel
Question: for each of my 20,000 transcripts, decide whether it is significantly regulated in some disease.

1. Obtain a matrix of genes (rows) and expression values (columns). Here there are 20,000 rows of genes of which the first six are shown. There are three control samples and three disease samples. You can also calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.

2. You can calculate the ratios of control versus disease. Note that you can use the formula =E5/I5 in this case to divide the mean control and disease values. Also note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cut-off for changes of interest such as two-fold.

3. Perform a t-test. When you enter =TTEST into the function box above, a dialog box appears. Enter the range of values for controls and for disease samples, and specify a 1- or 2-tailed test.

For a one-tailed test, your prior hypothesis is that the transcript in the disease group is up (or down) relative to controls; the change is unidirectional. For example, in Down syndrome samples you might hypothesize that chromosome 21 transcripts are significantly up-regulated because of the extra copy of chromosome 21.

For a two-tailed test, you hypothesize that the two groups are different, but you do not know in which direction.

Note the results: you can have
- a small p value (< 0.05) with a big ratio difference
- a large p value (> 0.05) with a trivial ratio difference
Only the first group is worth reporting! Why?

[Figure: t-test to determine statistical significance (disease vs normal). t statistic = (difference between mean of disease and normal) / (variation due to error).]

[Figure: ANOVA partitions total data variability, shown before partitioning and after partitioning. Labeled components: disease vs normal, tissue type, subject, error.]

Inferential statistics (Table 7-2, Page 198-200)

Paradigm                       Parametric test    Nonparametric
Compare two unpaired groups    Unpaired t-test    Mann-Whitney test
Compare two paired groups      Paired t-test      Wilcoxon test
Compare 3 or more groups       ANOVA

Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated.

But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes:

$$ p < \frac{0.05}{10{,}000} \quad \text{or} \quad p < 5 \times 10^{-6} $$

The Bonferroni correction is generally considered to be too conservative. (Page 199)

Inferential statistics: false discovery rate
The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery. The expected number of false positives equals the p value threshold of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). The FDR is the expected fraction of false discoveries among the transcripts reported as regulated. You can adjust the false discovery rate. For example:

FDR     # regulated transcripts    # false discoveries
0.1     100                        10
0.05    45                         3
0.01    20                         1

Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?

[Figure: A volcano plot displays both p values and fold change. Axes: log fold change (treated/untreated) versus p value (treated versus control).]
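The multiple-testing arithmetic above is easy to check in code. The first two functions reproduce the numbers quoted in the slides. The third is an addition that the slides do not name: the Benjamini-Hochberg step-up procedure, one widely used way to control the FDR, included here only as an assumed illustration of how an FDR threshold can be applied to a list of p values. All function names and the example p values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests

def expected_false_positives(p_threshold, n_tests):
    """Number of tests expected to pass the threshold by chance alone."""
    return p_threshold * n_tests

def benjamini_hochberg(p_values, fdr):
    """Indices of tests called significant by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    # Rank the tests from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p value is at or below fdr * rank / m.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

print(bonferroni_threshold(0.05, 10000))
print(expected_false_positives(0.05, 10000))
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], fdr=0.05))
# last line prints [0, 1]
```

With five tests and an FDR of 0.05, only the two smallest p values pass, whereas a plain p < 0.05 cut-off would have accepted four of the five.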

Descriptive statistics
Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points.
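To make the idea of a distance metric concrete, the sketch below computes two standard measures, Euclidean distance and the Pearson correlation coefficient, for a pair of expression profiles. The function names and the two example profiles are invented for illustration.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson coefficient of correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Two hypothetical genes with the same shape over three time points
# but a ten-fold difference in magnitude.
gene_a = [1.0, 2.0, 3.0]
gene_b = [10.0, 20.0, 30.0]
print(euclidean_distance(gene_a, gene_b))
print(pearson_correlation(gene_a, gene_b))
```

The two metrics answer different questions: these profiles are far apart in Euclidean terms because their magnitudes differ, yet their Pearson correlation is 1 because they rise in exactly the same pattern.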

Two commonly used distance metrics are: (Page 203)
- Euclidean distance
- Pearson coefficient of correlation

What is a cluster?
A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.

[Fig. 7.8, Page 205: Data matrix (20 genes and 3 time points from Chu et al., 1998), with genes versus samples (time points), and a 3D plot with axes t=0, t=0.5, t=2.0. Software: S-PLUS package.]

Descriptive statistics: clustering (Page 204)
Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes.

[Fig. 7.9, Page 206, adapted from Kaufman and Rousseeuw (1990): Agglomerative clustering of five objects a, b, c, d, e on a scale of steps 0 to 4. The groups form in the order {a, b}, then {d, e}, then {c, d, e}, then {a, b, c, d, e}, at which point the tree is constructed. Divisive clustering is shown for the same objects and groups on the same scale.]
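Agglomerative clustering as described above can be sketched with a simple loop that repeatedly merges the two closest clusters. This example assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; neither choice is specified in the slides. The one-dimensional coordinates for a to e are invented so that the merge order mirrors Fig. 7.9.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerative(points):
    """Single-linkage agglomerative clustering.

    points: dict mapping a label to a list of numbers.
    Returns the merges in order, each as a pair of label tuples.
    """
    clusters = [(label,) for label in sorted(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical coordinates chosen to reproduce the merge order in the figure.
points = {"a": [0.0], "b": [1.0], "c": [5.0], "d": [8.0], "e": [9.5]}
for left, right in agglomerative(points):
    print(left, "+", right)
```

The printed merges run from the two most closely related objects up to the single cluster containing everything, which corresponds to reading the tree from its tips to its root. A divisive method would build the same kind of tree starting from the root.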
