Bioinformatics Courseware, Original English Slides (46)

Outline for Today's Discussion
- basic facts about DNA, RNA, and proteins
- examples of computational problems
- examples of algorithmic techniques

Main Bio-molecules
- DNA: encodes genetic information.
- RNA: copies and transports such information to produce proteins.
- Protein: performs various biological functions.

Basic Facts about DNA
- 4 nitrogenous bases: Adenine, Cytosine, Guanine, Thymine
- nucleotide = base + phosphate + sugar
- strand = a polymer of nucleotides
- double strands: two complementary strands
- binding rule: A-T and C-G
- 3D structure: double helix
- chromosome = a complete DNA molecule in a cell
- genome = the whole set of chromosomes in a cell

DNA Structure
Genome at 4 Levels of Details
Example of Chromosome Sizes

Basic Facts about RNA
- 4 nitrogenous bases: Adenine, Cytosine, Guanine, Uracil
- nucleotide = base + phosphate + sugar
- strand = a polymer of nucleotides
- single strand
- binding rule: A-U, C-G, and others.
- 3D structure: much more complex than double helix

Basic Facts about Protein
- 20 amino acid residues
- strand = a polymer of amino acids
- single strand
- binding rule: complicated.
- 3D structure: much more complex than RNA's
- 3D structure -> function of a protein

Importance of Protein Folding
- The 3D structure significantly
  determines the function.

Key Facts about Genes
- gene = segment of a chromosome
- gene -> protein
- codon = 3 consecutive DNA bases
- codon -> protein character
- gene = partitioned into introns and exons
- exon = codons
- intron = "meaningless" region
- Genes may be nested

DNA -> RNA -> Protein

Functional Assignment using Gene Ontology
- 13,601 Genes (Drosophila)

Gene Number in the Human Genome
The Gene Counting Problem
- The number will probably never be known exactly.
- Current estimates: 30,000-40,000
- Other estimates: 120,000
- Gene discovery: sequence analysis, motif recognition, matches to mRNA, computational predictions, mouse data matches, experimental validation

Some Main Areas of Bioinformatics
- A key goal of bioinformatics: To study biological systems based on global knowledge of genomes, transcriptomes, and proteomes.
- Genome: entire sets of materials in the chromosomes.
- Transcriptome: entire sets of gene transcripts.
- Proteome: entire sets of proteins.
- Genome (DNA) -> Transcriptome (RNA) -> Proteome (Protein)

Perspectives

Proteomics
- Proteome: all proteins encoded within a genome
  - half a million distinct proteins (temporal, spatial, modifications)
  - 30,000 human genes
  - mRNA and protein expressions may not correlate
- Proteomics: study of protein expression by biological systems
  - relative abundance and stability; post-translational modifications
  - fluctuations as a response to environment and altered cellular needs
  - correlations between protein expression and disease state
  - protein-protein interactions, protein complexes
- Technologies:
  - 2D gel electrophoresis
  - mass spectrometry
  - yeast two-hybrid system
  - protein chips

Protein Identification: HPLC-MS-MS
[Figure: Proteins -> Peptides -> One Peptide -> B-ions / Y-ions; Tandem Mass Spectrum; axes: Mass/Charge]

Peptide Fragmentation and Ionization
- B-ion, Y-ion
- Complementary: Mass(B-ion) + Mass(Y-ion) = Mass(peptide) + 4H + O

Tandem Mass Spectrum
[Figure: Abundance (100%) versus Mass / Charge, with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, 448.225]

Raw Tandem Mass Spectrum

Protein Database Search
- Find the peptide sequences in a protein database that optimally fit the spectrum.
- It does not work if the target peptide sequence is not in the database.
- It does not work if there is an unknown modification at some amino acid.
- It is very slow because it must search the entire database.
- E.g., SEQUEST, Yates, Univ. of Washington.

De Novo Peptide Sequencing Problem
- Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide.
- Output: a peptide P such that (1) mass(P) = W and (2) S is a subset of all the ion masses of P.
- Example: [Figure: Abundance (100%) versus Mass / Charge, with peaks at 274.112 and 361.121] Peptide mass 429.212 Daltons. P = SWR, Mass(P) = 429.212, Ions(P) = 88.033, 175.113, 274.112, 361.121, 430.213, 448.225

Amino Acid Mass Table

Importance of Protein Folding
- The 3D structure significantly determines the function.

Two Complementary Problems for Protein Folding
- Protein Folding Prediction: Given a protein sequence, determine the 3D folding of the sequence.
- Protein Sequence Design: Given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

Complexity for Protein Folding Problems
- Protein Folding Prediction: NP-hard under various models.
- Protein Sequence Design: Solvable in polynomial time under the Grand Canonical model.

Key Steps of Sequencing DNA
- Goal = determine the character sequence of a DNA molecule.
- Duplicate many copies.
- Cut the copies into fragments.
- Determine the character sequence of small fragments (by means of recursion or lab instruments).
- Compare, align, and order the fragments into the original sequence.

Polymerase Chain Reaction
Cutting DNAs

BLAST (Basic Local Alignment Search Tool)
A suite of sequence comparison
algorithms optimized for speed, used to search sequence databases for optimal local alignments to a protein or nucleotide query.

Program   Query                           Database
blastp    protein                         protein
blastn    DNA                             DNA
blastx    DNA (translated in 6 frames)    protein
tblastn   protein                         DNA (translated in 6 frames)
tblastx   DNA (translated in 6 frames)    DNA (translated in 6 frames)

Comparison of Sequencing Strategies
- Mapping and Walking
- Mapping and Clone by Clone Shotgun
- Whole Genome Shotgun with Mate Pairs
[Figure: strategies arranged between Lab-Intense (SLOW) and Compute-Intense (FAST)]

Mate-Pair Shotgun DNA Sequencing
- DNA target sample
- SHEAR & SIZE: e.g., 10Kbp, 8% std. dev.
- CLONE & END SEQUENCE: End Reads / Mate Pairs (figure labels: 590bp, 10,000bp)

Mapping and Shotgun
1) Replicate mapped spans of DNA.
   [Figure labels: Chromosome; Mapped span (BAC) 35,000]
2) Shear the replicates randomly and sequence the pieces.
   [Figure: many overlapping reads of the form cgattc...]
3) Assemble reads by overlap
   matching. Infer the original sequence by consensus.
   [Figure: Computed overlaps among the cgattc... reads; Computed sequence: cgattcggattctcgattctacgaa]

Clone by Clone Shotgun sequencing
Whole Genome Sequencing Approaches

Key Steps of Sequencing DNA
- Goal = determine the character sequence of a DNA
  molecule.
- Duplicate many copies.
- Cut the copies into fragments.
- Determine the character sequence of small fragments (by means of recursion or instruments).
- Compare, align, and order the fragments into the original sequence.

Obstacles to Genome Sequencing
- Sequencing reactions produce short reads (550bp).
  Human genome: 3 billion bases. Sequence read: 550 bases.
- The human genome is repeat-rich. Many short reads look identical to each other.
  GCATTA.GACCGTCGGATAGACATAACCGGATAGACATAACCGGATAGACATAACCAGCAGCAGCAGCACAGCAGCAGCAGCACAGCAGCAGCAGCA

Order & Orientation is Essential to Finding Genes
[Figure: Exon 1, Exon 2, Exon 3, Exon 4]
- Users consistently report finding genes that they can't find elsewhere.
- But if contigs are not correctly put together (figure: order 1, 4, 3 reversed, 2), exons are shuffled and unoriented, significantly impacting the ability of gene finding programs to make a correct prediction.

Celera's Sequencing / SNP Discovery Center

Celera
Supercomputing Facility
- Celera's system is one of the most powerful civilian super-computing facilities in the world.
- Currently over 1.5 teraflops of computing power in a virtual compute farm of Compaq processors with 100 terabytes of storage.
- Next phase: a 100 teraflop computer.

Human Genome Sequence from 5 Humans (3 females, 2 males) completed
- Human sequencing started 9/8/99
- Over 39X coverage of the genome in paired plasmid reads
- First Assembly announced June 26: 2.9 billion bp
- Published in Science, February 16, 2001

Evolutionary Trees
- definition: a tree with distinct labels at leaves
- leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc.
[Figure: ancestral species at the root; present-day species bird, plum, peach, rice, wheat at the leaves. (Just a joke!)]

Evolutionary Trees
- leaf labels: DNA sequences
[Figure: leaves bird, plum, peach, rice, wheat labeled with sequences AAGT, CCAG, CCAT, CGGG, CGGC. (Just a joke!)]

Problem Formulation
[Figure: the same five leaves and sequences. (Just a joke!)]
- Input: DNA sequences of present-day species
- Output: the true evolutionary tree
- Question: What is "true"? Need a model!

A Fundamental Problem of Biology
- Since the time of Charles Darwin,
- Problem: reconstruct the evolutionary history of all known species.
- Importance:
  - intellectually fascinating
  - practical benefits: medicine
    and food
[Figure: Charles Robert Darwin, 1809-1882; Origin of Species, 1859]

Protein Evolution
- Tree of Life & Evolution of Protein Families (Dayhoff, 1978)
- Orthologous Gene Family: Organismal and Sequence Trees Match Well

Evolutionary Tree
- Relationships Between IMP Dehydrogenase Families

Evolutionary Tree
- Relationships Between G-Protein-Coupled Receptor Families (One of the Largest and Most Diversified)

Main Difficulties
- Availability of data: Hundreds of millions of species, unlikely to be all available any time soon or ever. But DNA sequences of more and more species are becoming available.
- Extracting information
  from data

Data Mining Flowchart
[Figure: flowchart with boxes: true tree (unknown); collect & process individual sequences; compare & align multiple sequences; tree reconstruction algorithms; tree verification (compare & refine); evolution models. Arrow labels: generate sequences, further process, parameters, distance or characters, trees, information, refine, infer parameters]

Other Applications of Evolutionary Trees
- A tree that conceptually models the evolutionary relationship of species or organisms
- Applications outside biology:
  - linguistics: evolution of words
  - statistical classifications
  - tracking computer viruses

Primary Information Captured in Evolutionary Trees
[Figure: tree on leaves X1, X2, X3, X4, X5 with sequences AAGT, CCAG, CCAT, CGGG, CGGC. Labels: most recent common ancestor; yes; no; omit to simplify computation/structure]

Why Need to Compare Trees?
- For the same given set of species or organisms, different (1) data, (2) evolution models, (3) biological intuitions, or (4) tree construction algorithms may yield different trees.
- Tree comparison is a data mining tool for gaining information from multiple trees.

What Information to Gain from Comparison?
- dissimilarity measures: i.e., determine how different the given trees are.
- common structures in multiple trees: i.e., extract common evolutionary history from the trees.

How to Use Information Gained from Comparison?
- dissimilarity measures: Reexamine (1) data, (2) evolution models, (3) biological intuitions, or (4) tree construction algorithms.
- common structures in multiple trees: Common information is more reliable than non-common information.

Examples of Tree Comparisons
Key points:
- There are lots of tree comparisons!
- How does one design or use tree comparisons?
  - new tree comparison = new type of information
  - data mining flow chart: a hunch for a certain kind of information -> a math definition for tree comparison -> algorithms -> find new information

Examples #1 of Tree Comparisons
Emphasis: dissimilarity measures
  1.
     Good versus Bad Edges (Robinson-Foulds distance)
  2. Subtree Transfer Distance
Emphasis: common information
  3. Maximum Common Refinement Subtree
  4. Maximum Agreement Subtree (more technical details)

Good Edges
[Figure: Tree #1 with leaf order X1, X2, X4, X3, X5 and Tree #2 with leaf order X3, X2, X4, X1, X5; good edges marked]
Definition: good edge = same clustering

Bad Edges
[Figure: the same two trees; bad edges marked]
Definition: bad edge = different clusterings

External Edges are Always Good Edges
[Figure: the same two trees; external edges marked good]
Definition: good edge = same clustering

Good versus Bad Edges
[Figure: the same two trees; good and bad edges marked]
Robinson-Foulds distance = (1) # of
bad edges, (2) % of the internal edges being bad

Robinson-Foulds Distance
- Measure: Robinson-Foulds distance = (1) # of bad edges, (2) % of the internal edges being bad
- Intuitions: This measure counts how often two trees have different clusterings.
- Computational Complexity: n = size of input trees
  - Naive Algorithm: O(n^2) time.
  - Best Algorithm: optimal O(n) time. (Day, 1985)

Examples #4 of Tree Comparisons
Emphasis: dissimilarity measures
  1. Good versus Bad Edges (Robinson-Foulds distance)
  2. Subtree Transfer Distance
Emphasis: common information
  3. Maximum Common Refinement Subtree
  4. Maximum Agreement Subtree (more technical details)

Basics of Maximum Agreement Subtrees
- Assumption: rooted binary evolutionary trees.
- Concepts:
  - Information contents of a tree
  - Evolutionary subtree
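
Sketch: counting good and bad edges. The Robinson-Foulds slides define a good edge as one that induces the same clustering in both trees, and the distance as the number of bad edges together with the fraction of internal edges that are bad. The Python sketch below is one possible reading of that definition, not the slides' own algorithm. It assumes rooted trees written as nested tuples of leaf labels and identifies each internal edge with the set of leaves below it. The names clusters and robinson_foulds and the two example trees are illustrative assumptions; the trees are not the ones in the figure. It is the simple set-based approach rather than Day's linear-time algorithm.

```python
def clusters(tree):
    """Return (internal-edge clusters, all leaves) of a rooted tree.

    A tree is either a leaf label (str) or a tuple of subtrees.
    Each internal edge is identified with the set of leaves below it.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)  # cluster of the edge entering this internal node
        return below

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)  # the root has no edge above it
    return found, all_leaves


def robinson_foulds(tree1, tree2):
    """Return (# of bad edges, fraction of internal edges that are bad)."""
    c1, leaves1 = clusters(tree1)
    c2, leaves2 = clusters(tree2)
    if leaves1 != leaves2:
        raise ValueError("trees must have the same leaf labels")
    # A bad edge is one whose cluster does not occur in the other tree.
    bad = len(c1 - c2) + len(c2 - c1)
    internal = len(c1) + len(c2)
    return bad, (bad / internal if internal else 0.0)


# Hypothetical example trees on leaves X1..X5 (not the trees in the slides).
tree_a = ((("X1", "X2"), "X3"), ("X4", "X5"))
tree_b = ((("X1", "X3"), "X2"), ("X4", "X5"))
bad, fraction = robinson_foulds(tree_a, tree_b)
print(bad, fraction)  # bad == 2: clusters {X1,X2} and {X1,X3} are unmatched
```

External edges never appear in the count because every leaf is its own cluster in both trees, which matches the slide stating that external edges are always good.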

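Sketch: maximum agreement subtree by brute force. The slides' own definitions are cut off at this point, so the sketch below assumes the standard definition: an agreement subtree is a set of leaves on which both rooted trees, restricted to those leaves with unary nodes contracted, have the same shape, and a maximum agreement subtree is a largest such set. Trees are nested tuples of leaf labels. The names restrict, leaf_set and maximum_agreement_subtree and the example trees are illustrative assumptions. The search tries every leaf subset, so it takes exponential time and is only meant to make the definition concrete on tiny inputs.

```python
from itertools import combinations


def restrict(tree, keep):
    """Restrict a rooted tree to the leaves in `keep`, contracting unary nodes.

    Returns None if no kept leaf lies below this node. Otherwise returns a
    canonical form (a leaf label, or a frozenset of child forms) so that two
    restrictions compare equal exactly when they have the same shape.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]  # contract a node left with a single child
    return frozenset(kids)


def leaf_set(tree):
    """Return the set of leaf labels of a tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def maximum_agreement_subtree(tree1, tree2):
    """Return a largest leaf set on which the two trees agree (brute force)."""
    leaves = sorted(leaf_set(tree1))
    if set(leaves) != leaf_set(tree2):
        raise ValueError("trees must have the same leaf labels")
    for size in range(len(leaves), 0, -1):
        for subset in combinations(leaves, size):
            keep = set(subset)
            if restrict(tree1, keep) == restrict(tree2, keep):
                return keep
    return set()


# Hypothetical example trees on leaves X1..X5 (not taken from the slides).
tree_a = ((("X1", "X2"), "X3"), ("X4", "X5"))
tree_b = ((("X1", "X3"), "X2"), ("X4", "X5"))
print(sorted(maximum_agreement_subtree(tree_a, tree_b)))
```

For these two example trees no agreement is possible on all five leaves, because they disagree on how X1, X2 and X3 are grouped, so the largest agreeing leaf set has four leaves.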