利用EXCEL表格搞定生物芯片数据.doc_第1页
利用EXCEL表格搞定生物芯片数据.doc_第2页
利用EXCEL表格搞定生物芯片数据.doc_第3页
利用EXCEL表格搞定生物芯片数据.doc_第4页
利用EXCEL表格搞定生物芯片数据.doc_第5页
已阅读5页,还剩7页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

Cell conventions 1. Cells are referred to by their column letter and row number (ex: A2). 2. A series of cells are referred to by showing the starting and ending cells, separated by a colon (ex: A2:A100). 3. When cells are copied and pasted, the column and row are shifted to adjust to the position of the new cell. For example, if the formula =B5 is copied and pasted one cell to the right, the formula in the new cell will change to =C5. 4. If you wish to maintain the position(s) of cell(s) in a formula, use the prefix $ to create an absolute link to cells that wont change with copying an pasting. For example, if the formula =$B$5 is copied and pasted one cell to the right, the formula in the new cell will still be =$B$5. You may use the prefix $ before the column and/or the row. 5. Refer to cell(s) in another worksheet by prefixing the cells with the sheet name plus !. Ex: =data!B4 refers to cell B4 on a worksheet named data. For this reason its very helpful to name worksheets with short but informative names. 6. Refer to cell(s) in another file by prefixing the cells with the file in square brackets and the sheet name plus !. Ex: =anotherFile.xlsdata!B4 refers to cell B4 of file anotherFile.xls on a worksheet named data.Copying and pasting cells 1. Copying and pasting cells usually has the expected result. If not, check your use of $ (to fix a location). 2. When copying and pasting data generated by a formula, it may be best (to prevent re-calculations of data) to paste the actual value, rather than the underlying formula. To do this, copy the cell(s) as usual but then select Paste Special Values.Using functions 1. Excel contains a reasonable range of functions which can be used by in one of at least two ways: o Type the formula (preceded by =) in the box next to fx near the top of the window, along with all of the required arguments (assuming you know what these are). o Select Insert Function and then choose the function you wish. Select a category to get a partial or complete list of all functions. A box entitled Function Arguments will pop up, and you can fill in the empty boxes.2. At any time, entering the function name into the help box or selecting Help on this function usually provides a brief explanation of the function and some examples. 3. If you make a mistake with a function, you might get a suggestion to correct it. If you agree to the correction, check the input and output! 4. You may find that Excel calculates functions too often to your liking (especially if it takes a while). If that happens,o Windows: go to Tools Options o Mac: go Excel Preferencesand select the Calculation tab. Then select Manual under Calculation. Then Excel will only perform calculation on the file when you use F9 or on the active worksheet when you use SHIFT+F9. 5. Some functions and a few supplementary data analyses can be accessed from Tools Data Analysis. If you dont see this option, you may have to install the tools. In that case, select Tools Add-Ins and check Analysis ToolPak. Then try Tools Data Analysis again.Shortcut for selecting a large number of cells1. One can always use the scrollbars to select lots of cells, but with big spreadsheets, this isnt very efficient. 2. To start the selection of a large matrix, start by selecting the bottom right cell of the desired matrix. 3. Use Control - Shift - Up arrow to select everything above the original cell. 4. Use Control - Shift - Left arrow to select everything to the left of the original cell. 5. Use Shift - Down arrow to move down one row (to avoid the header line, if present). 6. Use Shift - Right arrow to move to the right one column (to avoid any ID column(s), if present). 7. Note: This should work with both Windows and Macintosh computers.Functions for microarray analysis 1. SUM 2. AVERAGE 3. MEDIAN 4. LOG 5. IF 6. AND 7. TTEST 8. VLOOKUPIntroductionYoull be using a sample of expression data from a study using Affymetrix (one color) U95A arrays that were hybridized to tissues from fetal and human liver and brain tissue. Each hybridization was performed in duplicate. Many other tissues were also profiled but wont be used for these exercises.What well be doing to analyze these data: normalize data from these eight chips calculate expression ratios of genes between two different tissues use a common statistical test to identify differentially expressed genes flag low intensity data (most probably background noise) cluster a differentially expressed subset of all genes to identify those with similar expression profiles try to find what functions specific groups of genes (with similar expression profiles) have in commonYoull be using Excel to do most of the mathematical analyses, since this will show the exact formulas used to perform every step of the analysis pipeline. As a result, youll need to use Excel functions and be familiar with some Excel conventionsPreliminary information: Image analysis and calculation of expression value1. As described in Su et al., 2002, human tissue samples were hybridized on Affymetrix (one-color) arrays and chips were scanned. For each tissue, at least two independendent samples were hybridized to separate chips. 2. Scanned images were quantified (including measurement of background) using standard software. 3. Data from a probeset (a series of oligos designed to a specific gene target) were used to calculate an expression values for that probeset using standard Affymetrix algorithms. 4. See the manuals from Affymetrix for more information about these processes, and the Statistical Algorithms Description Document for the actual equations used. 5. Note that these analysis protocols are generally specific to the chip type and its manufacturer.Part I. Normalization of expression data1. Why normalize? Chips may have been hybridized to different amounts of RNA, for different amounts of time, with different batches of solutions, etc. Normalization should remove systematic biases and make any comparisons between chips more meaningful. 2. Download the starting data for the exercises.1. Look at the expression data for a selected series of experiments 2. Down the first column are the Affymetrix probeset IDs, each corresponding to a target gene. Each other column shows a tissue name, followed by expression values in some arbitrary unit. There should be two chips each for adult and fetal brain and liver (a total of 8 chips). Note that since this is a one-color array, the expression measures are absolute values and not ratios. If youre more used to 2-color (cDNA) arrays, this would be similar to (but not exactly the same as) 4 chips, each with one channel/dye for fetal tissue and another for adult tissue.1. Calculate the median of all expression values from each chip. o Use the Excel MEDIAN function to calculate the median at the bottom of each column. Ex: =MEDIAN(B2:B12627)1. To perform global median normalization, on a new sheet (norm) of the same file, scale each column of data so that each median is 100. 2. Divide each expression signal in a chip by the median for the chip and multiply by 100. Ex: =(raw!B2/raw!B$12628)*100 if B12628 contains the median for the chip To confirm that this was correctly done, calculate the median of each normalized chip. Global mean normalization is also possible but more influenced by outliers. If you want to compare mean and median yourself after the class, try both methods and then compare the results with scatterplots.Part II. Calculating ratios1. On a new sheet (means), calculate the mean of each pair of replicated experiments (converting 8 chips to 4 means). o Use the Excel AVERAGE function. Ex: =AVERAGE(norm!B2:norm!C2)2. On a new sheet (ratios), calculate the ratios (for both brain and liver) of fetal tissue / adult tissue (converting 4 experiments to 2 ratios).Part III. Log transformations1. Why use logarithms? Log-transformed ratios are helpful so up-regulated and down-regulated genes change by the same amplitudes. Log-transformed chip intensities will be required for differential expression testing, since these statistical tests assume normal distributions - which is true for log-transformed intensities but not untransformed intensities. 2. Logarithms can use any base, but base 2 is easiest when transforming ratios, since transformed 2-fold ratios up or down will be +1 or -1. As a result, well do all logs with base 2 to keep thing simplest. 3. On a new sheet (logs), calculate the logs (base 2) of all expresion values, means, and ratios. o Use the Excel LOG function. Ex: =LOG(norm!B2,2) o Convert all the data from sheets norm, means, and ratios.Part IV. Identifying differentially expressed genes1. Differentially expressed genes can be naively determined by fold changes but more effectively determined by using a statistic such as the t test. 2. Well compare the results of these two methods later in Part VI. 3. Use the t test with each gene to determine if the data on fetal and adult expression are different in the brain and/or liver. o Use the Excel TTEST function on the ttest sheet. Ex: TTEST(logs!B2:C2,logs!D2:E2,2,3) o The TTEST function takes four arguments: the first array, the second array, the number of tails, the type. o The number of tails is 2, since one tissue can have an expression that is lower or higher than the other. o The type of t test is 3, which refers to two-sample unequal variance. o Were safer by not assuming that each genes exhibits the same variance in each tissue, but we do lose some statistical power.4. Use the Absent/Present calls from the Affymetrix algorithm to flag genes with questionable expression levels. o In the process of converting probe data into one probeset measurement, Affymetrix calculates p-values preflecting confidence that the gene is present in the sample, and these are used to classify each probeset as Absent, Present, or Marginal. o We dont want to spend our energy looking at expression intensities of genes that might change from close to zero to a larger intensity still close to zero (or vice versa). o Our current task is to flag all genes that are have Absent calls in all brain samples or in all liver samples, using the original data in the ap sheet. o One way to do this is to merge the calls into one cell and then test if its AAAA. o On the ttest sheet, use the formulas IF and CONCATENATE to do this. ex: =IF(CONCATENATE(ap!B2,ap!C2,ap!D2,ap!E2)=AAAA,A,P) o The syntax of IF is: logical test, value if true, value if false5. List all the gene IDs for those that meet your significance threshold (such as p Values into another sheet (to_sort) or file before performing Data Sort. o Get 2 lists (one for each tissue) of gene IDs that have a p-value 0.05 and have a P call based on 4 chips. You can sort by the two columns for a tissue in either order (or both at the same time, with a formula). After you select the lines that pass one criteria, sort them by the other criteria. Paste the two lists into the lists sheet.o Use the Compare two lists tool to get the non-redundant union of these lists. o Paste this combined list into a new sheet (like selected).6. Compile all the log-transformed data for all genes that show a significant change in expression (to use later for clustering) in the four tissues. o On the selected sheet use the Excel VLOOKUP function. Ex: =VLOOKUP($A2,logs!$A$2:$O$12627,2,FALSE) o The VLOOKUP function takes 4 arguments: the value to search for, the table to search (containing the value to search for in the first column), the column number from which the matching value is returned, FALSE (to indicate that you want an exact match rather than the closest match). o The table to search is the 5 columns (1 columns of gene IDs + 4 columns for the 4 tissues) of mean expression values. o Note that the positions of the table to search must be fixed (with a $ before each column and row). o Note that this command, when copied into lots of cells, can take the computer a while to perform. o Save this sheet as a text file by either one of these methods: File Save As, and for Save as type, choose Text (tab delimited). Copy the sheet into a text editor and save that file.7. Optional: The BaRC submatrix selector, an alternative to Excels VLOOKUP o The BaRC submatrix selector takes as input a data matrix and a subset of IDs from the first column of the matrix. Then it selects only desired rows from the matrix, in the order of the input ID list. o Save the logs sheet as a text file. Go to the logs sheet and try File Save As and for Save as Type, choose Text (Tab delimited). Excel may complain that you can only save one sheet and that some information may be lost, but just choose a different name (like Array_logs.txt) and youll still have your original big file.o Save the list of Affy IDs from the selected sheet as a text file (like Selected_ids.txt). o Go to the BaRC Submatrix Selector. Browse and select the data file (Array_logs.txt) Browse and select the list of row IDs (Selected_ids.txt) or paste the list into a text editor. You can use a third file of column IDs but ignoring that will give you all columns, which is what we want here. Click on Select Submatrix and you should get output in 1-2 seconds (if the program works correctly).Part V. Clustering1. Use any or all of these data sets. The third dataset, being across more tissues, may be the most interesting. 1. your subset of expression values (from Part IV.5) 2. a pre-processed set of log2-transformed expression values (not ratios). 3. a full set of expression ratios (transformed to log base 2), with values compared to the mean across all tissues2. Open Cluster 3.0, a clustering application that works on all operating systems. Its an enhanced version of the Eisen clustering program. See the manual for more information about the program. 3. File Open and select your file of expression data (one of the files in Part V.1). 4. Note that there are some filtering and normalization functions on the tabs Filter Data and Adjust Data, but weve already performed these steps. 5. Try Hierarchical clustering using the default settings. o Go to the Hierarchical table and check Cluster under genes o Click on Centroid Linkage (or Average Linkage) to use a clustering algorithm that is not sensitive to outliers. o When clustering is completed itll be shown at the bottom of the window. o Cluster 3.0 generates several files during clustering: The .cdt file (containing the re-ordered expression data) will be read by JavaTreeView. For hierarchical clustering, .gtr and .atr files describe the structure of the gene and/or array trees. For k-means clustering, the .kgg file lists the genes in each of the clusters.o Look at the .cdt output file in a text editor: Note the new column GWEIGHT (for gene weight) and the new row EWEIGHT (for experiment weight) You may modify these weights for future clustering (to give more weight, for example, to certain arrays).6. Open JavaTreeView for visualizing your data as a heatmap. o Open and view your initial (pre-clustered) text file. o Open and view your final (clustered) file (with a .cdt extension). o Try selected a region of the data to get a more detailed view. o Try Settings Pixel Settings and adjust the contrast to get the most informative view for your data. o Note that if you used expression values (rather than ratios), youll only see two colors and those between them. o Try clustering across Genes and Arrays (tissues) to analyze tissue relatedness. o If you wish to use the web link feature, go to Settings Url Settings and use the link /analysis.G-U95AV2:HEADER (so probeset 32615_at is linked to the Affymetrix NetAffx page /analysis.95AV2:32615_at). This Affymetrix site requires free registration but provides a lot of good data.7. Try k-Means clustering using the default settings. o Follow the same steps as you did with hierarchical clustering above, but after opening the file, go to the k-Means tab, check Organize genes and click on Execute.8. Optional: While in JavaTreeView, try Export Export to Postscript and save all or part of your figure. This will produce an image of optimal resolution. Otherwise, you may wish to export to GIF or bitmap (which are easier to handle in Photoshop, but lower resolution). 9. Optional: Open the heatmap in Illustrator or Photoshop.Part VI. Visualizing all the data1. You may try any combination of these graphing techniques. Global visualization of an experiment can be helpful for designing subsequent analysis and for quality control. 2. Intensity scatterplot (sample) o Go to the logs sheet and select two mean columns fo

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论