Extensive genomic and transcriptional diversity identified through massively parallel DNA and RNA sequencing of eighteen Korean individuals

Young Seok Ju(1,2,9), Jong-Il Kim(1,3–5,9), Sheehyun Kim(1,2,9), Dongwan Hong(1,8), Hansoo Park(1,6), Jong-Yeon Shin(1,5), Seungbok Lee(1,4), Won-Chul Lee(1,4), Sujung Kim(5), Saet-Byeol Yu(5), Sung-Soo Park(5), Seung-Hyun Seo(5), Ji-Young Yun(5), Hyun-Jin Kim(1,4), Dong-Sung Lee(1,4), Maryam Yavartanoo(1,4), Hyunseok Peter Kang(1), Omer Gokcumen(6), Diddahally R Govindaraju(6), Jung Hee Jung(2), Hyonyong Chong(2,7), Kap-Seok Yang(2), Hyungtae Kim(2), Charles Lee(6) & Jeong-Sun Seo(1–5,7)

Massively parallel sequencing technologies have identified a broad spectrum of human genome diversity. Here we deep sequenced and correlated 18 genomes and 17 transcriptomes of unrelated Korean individuals. This has allowed us to construct a genome-wide map of common and rare variants and also identify variants formed during DNA-RNA transcription. We identified 9.56 million genomic variants, 23.2% of which appear to be previously unidentified. From transcriptome sequencing, we discovered 4,414 transcripts not previously annotated. Finally, we revealed 1,809 sites of transcriptional base modification, where the transcriptional landscape is different from the corresponding genomic sequences, and 580 sites of allele-specific expression. Our findings suggest that a considerable number of unexplored genomic variants still remain to be identified in the human genome, and that the integrated analysis of genome and transcriptome sequencing is powerful for understanding the diversity and functional aspects of human genomic variants.

Massively parallel sequencing technologies have revolutionized our understanding of human genome architecture. A diverse array of high-throughput sequencing platforms and strategies has now been developed and applied to the analyses of the genomes of both healthy and clinically affected individuals [1–7]. Over the past few years, a broad range of genetic variants including SNPs, copy number variants (CNVs) and other structural genomic variants [1–4,8–13] have been discovered using deep sequencing. In 2010, a public database of human variants (dbSNP 131) catalogued approximately 20.1 million SNPs, and the Database of Genomic Variants (DGV) listed 89,427 structural genetic variants from 38 projects [14]. Next-generation sequencing technologies have also allowed gene expression analysis to provide transcriptome maps at nucleotide resolution [15,16]. However, only a limited number of studies have analyzed human genome and transcriptome sequences for the same individuals on a genome-wide scale [6,17].

A primary goal of human genetics is to understand the relationship between genomic variants and phenotypic traits. Although genome-wide association studies (GWAS) have revealed associations between hundreds of specific variants and complex traits and diseases, only a small proportion of the heritability of these genetic traits has actually been explained [18], presumably because, in part, of unknown and rare functional variants. Comprehensive discovery of the impact of unknown and rare genomic variants on missing heritability requires unbiased whole-genome sequencing. The pilot phase of the 1000 Genomes Project has been published [19], but to gain a more detailed understanding of the genomic landscape of common and rare variants in humans, more individuals from different populations should be whole-genome sequenced at high-depth coverage. Furthermore, comparisons between genomic variants and their corresponding transcriptional profiles from the same individuals need to be performed to help understand the functional aspects of these variants.

Here we have analyzed both genomic and transcriptomic sequences from healthy Korean individuals. These include high-coverage whole-genome sequencing of ten individuals and whole-exome sequencing of an additional eight individuals. We also generated complete transcriptome sequence data from 17 of these individuals to explore the relationship between identified genomic variants and the corresponding transcriptome variants. Our analyses integrate genome and transcriptome data across multiple individuals and reveal extensive variation at both levels.

(1) Genomic Medicine Institute (GMI), Medical Research Center, Seoul National University, Seoul, Korea. (2) Macrogen Inc., Seoul, Korea. (3) Department of Biochemistry, Seoul National University College of Medicine, Seoul, Korea. (4) Department of Biomedical Sciences, Seoul National University Graduate School, Seoul, Korea. (5) Psoma Therapeutics Inc., Seoul, Korea. (6) Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA. (7) Axeq Technologies, Rockville, Maryland, USA. (8) Present address: Division of Convergence Technology, Functional Genomics Branch, National Cancer Center, Goyang-si, Korea. (9) These authors contributed equally to this work. Correspondence should be addressed to J.-S.S. (jeongsun@snu.ac.kr).

Received 17 December 2010; accepted 3 June 2011; published online 3 July 2011; doi:10.1038/ng.872

Nature Genetics, advance online publication. © 2011 Nature America, Inc.

RESULTS

SNP and short indel identification

We extracted genomic DNA from 18 unrelated individuals (11 males and 7 females) from venous blood (Supplementary Fig. 1 and Online Methods). All the individuals have Altaic Korean origins and have no known genetic diseases. From these 18 individuals, 10 were whole-genome sequenced to an average of 26.1-fold coverage (Table 1). The majority of the short reads were sequenced from both ends with 76–151 bp read lengths (Supplementary Table 1). We aligned the short reads to the NCBI human reference genome build 36.3 (hg18) using GSNAP [20] and Bioscope (Life Technologies).

After applying a set of bioinformatic filter conditions that were established from previous conservative training processes [3], we identified 3.45 million to 3.73 million SNPs from each whole genome sequenced (Table 1). The ratio between heterozygous and homozygous SNPs was 1.50 on average, which agrees well with other previously annotated genomes [21]. The ratios from whole-genome sequences obtained using earlier sequencing platforms (1.67 for individual AK1 and 1.33 for individual AK2) deviated slightly from the average, potentially reflecting underlying differences between sequencing technologies in variant detection. Comparing sequencing-derived SNPs with data obtained from the Illumina 610K genotyping array revealed a 99.94% positive predictive value for SNP detection (Supplementary Fig. 2 and Supplementary Table 2). We further validated these results using PCR amplification and Sanger sequencing of 33 randomly chosen heterozygous SNPs found as singletons among the ten individuals studied, which showed a 100% success rate (Supplementary Table 3).

The SNPs identified among the ten whole-genome-sequenced individuals clustered into a non-redundant set of 8.37 million SNPs, 21.9% (1.83 million) of which were considered new when compared to dbSNP 131. As such, this study has increased the number of identified SNPs by approximately 8% over what was previously known, including variants discovered by the pilot phase of the 1000 Genomes Project [19]. Specifically, of these 1.83 million new SNPs, we found 73.9% (1.36 million) as singletons among the ten individuals studied, suggesting that a large proportion of these new variants are rare (Fig. 1a). Each individual genome has approximately 130,000 of these putative rare SNPs, consistent with the understanding that many rare variants still remain to be identified. The remaining 26.1% (0.48 million) of the new SNPs are thought to be common among Koreans (preliminary allele frequency above 10%) but were not identified by previous genome studies, suggesting that they could be population-specific variants. A genome-wide map summarizing the location and frequency of each identified SNP is provided in our database (see URLs) [22].

We also detected 1,191,599 indels (30 bp or shorter) in the ten individual whole-genome sequences (Table 1 and Supplementary Table 4). Compared to previously sequenced individual genomes [2,3,6], we found approximately twice as many indels from each individual. The excess may be attributed to more accurate alignments from the increased read lengths and an increased proportion of available paired-end sequencing reads. We validated 35 randomly chosen indels that appeared in the heterozygous state and were found as singletons using PCR and Sanger sequencing, and we successfully validated all the regions (Supplementary Table 3). A large proportion of the indels identified (32.5%) were not found in dbSNP 131 and are therefore considered new, indicating that a large fraction of indels among human populations also remains to be discovered.

Out of the total 8.37 million SNPs, 23,025 were non-synonymous (nsSNPs). On average, each individual genome contains 8,431 nsSNPs in 4,700 genes. Considering that approximately 20,000 genes lead to mRNA transcripts, this suggests that about 25% of mRNA-coding genes could lead to variable functions between individuals.

To investigate the variants within human genes more comprehensively, we captured and sequenced entire exonic regions representing 17,133 genes from an additional eight Korean individuals (six males and two females) (Supplementary Table 5 and Online Methods). On average, we obtained 63.9-fold depth per individual for the targeted regions. From the exome-capture sequencing of these eight individuals, we found a total of 35,740 SNPs and 15,697 indels. Of these, 17,833 (49.9%) of the SNPs were non-synonymous.

To obtain a global view of genomic variants influencing protein sequences among the 18 individuals, we merged the nsSNPs identified from the ten whole-genome and the eight whole-exome sequences into a non-redundant set, which included 28,179 variants, 8,130 (28.6%) of which are new based on their absence in dbSNP 131 (Supplementary Table 6). A subset of the nsSNPs showed remarkably high allele frequencies among the Koreans studied compared to other populations, including Europeans and west Africans represented in the HapMap project [23]. For example, rs4961 in ADD1, rs17822931 in ABCC11 and rs3827760 in EDAR are known to be associated with common salt-sensitive hypertension [24], dry-type ear wax [25] and thick hair morphology [26], respectively. We also identified new nsSNPs with high frequency among Koreans, for example, a polymorphism that alters isoleucine to methionine on the third exon of PAWR that is thought to influence apoptosis and tumor suppression [27]. We annotated the putative clinical implications of the nsSNPs identified using the Trait-o-matic algorithm [3] (Supplementary Table 7 and see URLs).

We found a subset of genes to be highly enriched for nsSNPs, here called super nsSNP genes (Supplementary Table 8 and Supplementary Note). For example, ZNF717 and CDC27 showed 100 times increased density of nsSNPs compared to other genes (Table 2). Among the 86 super nsSNP genes, 49 (57.0%) are associated with sensory and immunological function, such as olfactory receptors and HLA-related genes. The enrichment of nsSNPs in some of these genes, such as PRIM2, may be explained by hidden duplications that are yet to be represented in the human genome, such as HYDIN (Supplementary Figs. 3,4) [28,29]. Notably, a substantial number of these genes overlap with copy number variant regions (84%) that are archived in the DGV [14] as well as segmental duplications (37%) found in the human reference genome, suggesting that super nsSNP genes are clustered near common structural variants in the human genome.

Table 1 Summary statistics of whole-genome sequencing

Individual  Gender  Aligned depth  No. SNPs   Het./hom.  No. new SNPs  No. nsSNPs  No. nsSNP genes  No. indels
AK1(a)      Male    27.8           3,453,606  1.67       339,067       10,310      5,560            170,202
AK3         Male    25.8           3,510,326  1.49       259,958       7,515       4,296            373,687
AK5         Male    26.1           3,595,936  1.56       285,766       8,279       4,606            367,651
AK7         Male    30.6           3,573,029  1.56       283,133       7,516       4,275            414,397
AK9         Male    27.0           3,700,128  1.45       294,005       7,895       4,363            507,624
AK2(b)      Female  29.3           3,588,328  1.33       336,200       9,742       5,459            213,719
AK4         Female  23.1           3,630,428  1.47       259,792       8,132       4,577            429,259
AK6         Female  22.3           3,558,703  1.46       250,126       7,689       4,346            413,950
AK14        Female  26.1           3,726,964  1.48       276,253       8,571       4,720            478,072
AK20        Female  22.7           3,686,270  1.55       277,118       8,656       4,796            414,045
Total               260.3          8,367,302             1,833,102     23,025      9,315            1,191,599

No., number; het./hom., ratio of heterozygous to homozygous SNPs. (a) We considerably included single-end and shorter (2 × 36 bp) reads [3]. (b) We used Life Technologies SOLiD.

Figure 1 New and rare SNPs in individual genomes. (a) The allele frequencies of autosomal SNPs from whole-genome sequence data. The number of singleton and new SNPs in each individual genome is also shown. (b) The number of new SNPs and non-synonymous SNPs as the number of individual genomes increased through the simulation study.
[Panel a: x axis, allele frequency; series, New and dbSNP131. Panel b: x axis, number of genomes; series, New SNP, Expected new SNP, New nsSNP and Expected new nsSNP.]

Rare and population-specific variants

To obtain a reasonable estimate of the number of rare variants in a given individual's genome, we simulated the number of variants that would be detected as new as the number of individual genomes obtained increases (Supplementary Note). Our simulations suggested that, among the 3.5 million SNPs in an individual genome, 54,000 (1.5%) and 28,000 (0.8%) have an allele frequency below 1% and below 0.5%, respectively (Fig. 1b). This number should be interpreted cautiously, as approximately 2,000 SNPs in an individual genome are considered to be false positives when the positive predictive value of SNP detection is 99.94%, as it was in our study. The nsSNPs are estimated to account for a larger proportion of rare SNPs (3% and 1.9% of nsSNPs in an individual have an allele frequency below 1% and below 0.5%, respectively) in this analysis, suggesting that nsSNPs have greater diversity among individuals. This enrichment of nsSNPs among rare variants may be caused by negative selection, which removes deleterious alleles from the human population and thus drives those alleles to low allele frequency.

The deep sequencing of Korean genomes also allowed us to assess the relationship between common Korean nsSNPs (allele frequency more than 10%) and previously known SNPs. For example, a common Korean nsSNP (archived in dbSNP 131 as rs76418769 with an estimated allele frequency of 25% among Koreans), which alters glycine to arginine on the fourth exon of IL23R (a gene known to be associated with Crohn's disease and psoriasis), showed low linkage disequilibrium (LD) with nearby (within 20 kb) known SNPs (Fig. 2a; maximum r² of 0.10). Notably, a tagging SNP (rs6687620) was non-informative for this region in Koreans (with an estimated allele frequency of 1). […], 53.4% of new but common Korean nsSNPs were in insufficient LD (r² […] 50% reciprocal overlap criteria). The size of these deletions ranged from 1 kb to 46 kb (Supplementary Table 10). We validated the deletions with an ultra-high-resolution genome-wide […]

Table 2 Selected super nsSNP genes (top 20) from whole-genome sequencing data

Gene      Chr.  Position (Mb)  Exon length (bp)  Average no. nsSNPs  Average density (per kb)
ZNF717    3     75.869         2,749             75.0                26.28
OR4C3     11    48.303         991               16.5                16.65
CDC27     17    42.553         2,494             40.6                16.28
FRG2C     3     75.796         853               13.5                15.83
OR4C45    11    48.323         921               13.7                14.88
OR9G9     11    56.224         919               12.9                14.04
FAM104B   X     55.189         342               4.6                 13.45
PRIM2     6     57.291         1,543             17.7                11.47
HLA-DRB1  6     32.655         807               8.5                 10.53
HLA-DPA1  6     33.144         787               7.8                 9.91
CTBP2     10    126.668        1,347             12.6                9.35
SEC22B    1     143.808        653               6.0                 9.19
HLA-DQB1  6     32.736         791               7.1                 8.98
OR13C5    9     106.401        958               8.4                 8.77
KCNJ12    17    21.259         1,303             11.2                8.60
MUC4      3     196.960        3,401             28.4                8.35
TAS2R31   12    110.743        931               7.3                 7.84
HLA-DQA1  6     32.713         772               6.0                 7.77
OR51Q1    11    5.400          955               7.4                 7.75
HLA-A     6     30.018         1,106             8.4                 7.59

Chr., chromosome. The table also has an "Increase of depth" column (a) whose per-gene "+" marks cannot be assigned to rows here. (a) Depth for the corresponding gene increased […]
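The density column of Table 2 appears to be the average nsSNP count divided by the exon length in kilobases. The short Python sketch below checks that reading against three rows copied from the table. The function name nssnp_density_per_kb is ours, and the formula is inferred from the table values rather than stated by the authors.

```python
def nssnp_density_per_kb(avg_nssnps: float, exon_length_bp: int) -> float:
    """Average nsSNPs per kilobase of exon (formula inferred from Table 2)."""
    if exon_length_bp <= 0:
        raise ValueError("exon length must be positive")
    return avg_nssnps / (exon_length_bp / 1000.0)


# (gene, average nsSNP count, exon length in bp, density listed in Table 2)
rows = [
    ("OR4C3", 16.5, 991, 16.65),
    ("CDC27", 40.6, 2494, 16.28),
    ("HLA-A", 8.4, 1106, 7.59),
]

for gene, avg, length_bp, reported in rows:
    computed = round(nssnp_density_per_kb(avg, length_bp), 2)
    print(f"{gene}: computed {computed:.2f}, reported {reported:.2f}")
```

Because the table lists rounded averages, not every row needs to reproduce exactly. The ZNF717 row, for instance, gives about 27.3 from 75.0 nsSNPs over 2,749 bp against the listed 26.28, so this is best treated as a plausibility check.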

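Several headline figures in the text follow from simple arithmetic on the reported counts: the roughly 2,000 false-positive SNPs per genome implied by a 99.94% positive predictive value, the 21.9% share of new SNPs, and the 9.56 million total variants. The Python sketch below reproduces them. The helper names are illustrative, and 3.5 million is the rounded per-genome SNP count used in the text.

```python
def expected_false_positives(n_calls: int, ppv: float) -> float:
    """Expected number of false calls among n_calls at a given positive predictive value."""
    if not 0.0 <= ppv <= 1.0:
        raise ValueError("ppv must lie in [0, 1]")
    return n_calls * (1.0 - ppv)


def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# Counts taken from the text and the Table 1 totals.
total_snps = 8_367_302
new_snps = 1_833_102
total_indels = 1_191_599

false_positives = expected_false_positives(3_500_000, 0.9994)
new_snp_share = percent(new_snps, total_snps)
total_variants_millions = (total_snps + total_indels) / 1e6

print(f"expected false positives per genome: {false_positives:.0f}")
print(f"new SNPs in the non-redundant set: {new_snp_share:.1f}%")
print(f"SNPs plus indels: {total_variants_millions:.2f} million")
```

The false-positive estimate comes out near 2,100, which the text rounds to approximately 2,000.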