We present the initial phase of the Korean Genome Project (Korea1K), including 1094 whole genomes (sequenced at an average depth of 31×), along with data of 79 quantitative clinical traits. We identified 39 million single-nucleotide variants and indels of which half were singleton or doubleton and detected Korean-specific patterns based on several types of genomic variations. A genome-wide association study illustrated the power of whole-genome sequences for analyzing clinical traits, identifying nine more significant candidate alleles than previously reported from the same linkage disequilibrium blocks. Also, Korea1K, as a reference, showed better imputation accuracy for Koreans than the 1KGP panel. As proof of utility, germline variants in cancer samples could be filtered out more effectively when the Korea1K variome was used as a panel of normals compared to non-Korean variome sets. Overall, this study shows that Korea1K can be a useful genotypic and phenotypic resource for clinical and ethnogenetic studies.
The Korean population [estimated census population size close to 85 million (M)] has been thought to be highly homogeneous with few large-scale admixture events in the past (1–4). However, little formal scrutiny has been given to these claims. Several Korean whole genomes and exomes (5, 6) have been reported since the first Korean genome data (SJK) were published in 2008 (7), including the first Korean reference genome sequence (KOREF_S) (8) and 40 unrelated individuals (KOREF_C) that formed the basis of KoVariome, the Korean genomic variation database (9). Before the current study, at least 100 whole genomes of Korean individuals were available worldwide (5, 10). However, although a global whole-genome project (the multiethnicity 1000 genome project) that aims to characterize global human genetic diversity contains over 2500 genomes, including Chinese and Japanese, it does not include Korean samples yet (11).
There has also been an effort to generate ethnicity-specific reference genome sequences, and several human variomes have been generated to expand the coverage of human genome diversity, including the UK10K (12), the Genome of the Netherlands (GoNL) project (13), and the pan-African genome (14). In 2015, the consequences of strong founder effects were demonstrated in the Icelandic population by sequencing 2636 genomes (15). In the Danish population study, 150 trios were used to de novo assemble a reference genome, and they provide detailed data on structural variations and many complex genomic regions, including the major histocompatibility complex and major regions of the Y chromosome (16). In East Asia, the 1KJPN project yielded data on 1070 Japanese genomes (17), and another recent dataset identified selection signatures in the Japanese population from 2234 Japanese whole-genome data (18). In contrast, the original KoVariome database contained only 50 Korean whole-genome sequences without clinical information at the time of publication (9), although its sample size has subsequently increased to >100 genomes. Despite these large genome sequencing projects in numerous populations, little biochemical and clinical data and limited information regarding genotype-phenotype association for the participants have been collected to characterize the population’s health and disease states.
Here, we introduce a dataset comprising 1094 Korean whole genomes of which 1007 genomes were newly generated in combination with systematically acquired clinical and biochemical measurement information from the blood and urine of the participants. This Korea1K set represents the first-phase release of the Korean Genome Project (KGP). KGP is a joint project by the Personal Genome Project at Harvard Medical School, the National Center for Standard Reference Data of Korea, Clinomics Inc., and the Korean Genomics Center of Ulsan National Institute of Science and Technology (UNIST). These genomes have been sequenced to a high sequencing depth (~31× on average) using Illumina HiSeq X10, and we used these data to characterize single-nucleotide variants (SNVs), indels, copy number variations (CNVs), transposable element (TE) insertion, and human leukocyte antigen (HLA) type in the Korean population and contrast the Korean data with similar data from other populations. The majority of the genomic data (984 samples) were from volunteers with clinical information on 79 quantitative traits that were measured at Ulsan University Hospital. To evaluate the practical utility of this large genomic dataset, we performed a genome-wide association study (GWAS) using the information of the 79 quantitative clinical traits. We also quantified the effectiveness of our dataset as a reference panel by analyzing 19 previously published Korean gastric cancer patient genomes (19).
SNVs and indels in Korea1K dataset
Whole-genome sequencing (WGS) data from 1007 blood or saliva samples (984 samples with clinical and biochemical information) were generated with an average sequencing depth of 31× and pooled with sequencing data from an additional 87 blood or saliva samples (without clinical information) from the KoVariome database (9). In total, 1094 complete genomes, including 916 unrelated and healthy individuals, mostly from the Ulsan metropolitan region, were compared to the human genome reference (hg38). A total of 39.2 M SNVs and 7.6 M indels were called from the dataset (table S1). We filtered out false-positive variants due to the sequencing batch effect (figs. S1 and S2) and related individuals, yielding a set of variants containing 34 M SNVs and 4.8 M indels. We divided the variants into five categories based on their allele frequency in the Korean population (singleton: allele count = 1; doubleton: allele count = 2; rare: allele count of >2 and allele frequency of ≤0.01; common: allele frequency of >0.01 but ≤0.05; and very common: allele frequency of >0.05; Fig. 1A). Highlighting the power of our large dataset, approximately half of the variants that we identified were classified as singleton or doubleton (allele count of ≤2), and unexpectedly, more than 70% of them are not reported in dbSNP (v150) (20). On the other hand, less than 20% of the variants were classified as very common (allele frequency of >0.05), with more than 94% of these variants previously reported in dbSNP (v150) (20). A total of 96.6% of the very common SNVs overlapped with KoVariome (9), compared to only 12.4% of rare SNVs (fig. S3). The number of variants that have an allele frequency of >0.01 was similar to other non-African populations (1KGP non-African and 3.5KJPN), while there was a far higher number of variants, which have an allele frequency of ≤0.01 than KoVariome because of Korea1K’s much larger sample size (fig. S4). On the basis of the final set of variants, each individual showed on average ~4.42 M variants (3.58 M very common, 0.4 M common, 0.31 M rare, 0.46 M doubleton, and 0.85 M singleton variants), of which 8928 and 918 were nonsynonymous and loss of function (LoF), respectively.
Next, we classified each variant into 1 of 19 different variant classes (i.e., intergenic and intronic) based on its functional impact and location in the genome (fig. S5). LoF variants (nonsense, nonstop, splicing site, and indel variants) in the Korea1K set had a higher ratio of rare, doubleton, or singleton variants than other regional classes, indicating the effect of purifying selection on these variants. In addition, the allele (site) frequency spectrum of unrelated individuals was used to estimate the fraction under selection pressure in different genomic regions (21). We confirmed that LoF variants had the highest fraction of sites under negative selection (fig. S6). We applied the same comparative analysis to the entire gene set and found that 16 genes showed high purifying selection pressure, which was even stronger than the selection for nonsynonymous variants across the genome. Four genes showed negative values suggestive of positive selection pressure (fig. S7). Regarding indels, the Korea1K set displayed more deletions (2,573,411) than insertions (2,155,644), possibly resulting from skewed variant calling (fig. S8). Indels in protein-coding regions displayed higher peaks among in-frame indels based on their length, indicating purifying selection (fig. S9) (22).
The discovery rate of newly observed variants from unrelated individual genomes is a method for quantifying genomic diversity in a given population (23). The pattern of newly observed, unshared variants of unrelated Korean genomes was investigated using the five allele frequency categories (Fig. 1B). The discovery rate of very common (allele frequency of >0.05) variants saturated after 132 samples (14.4%), while the rate of singleton and doubleton variants was still increasing after analyzing all 916 healthy unrelated samples. When we compared the count of newly observed variants in unrelated individuals against previously published KoVariome, unexpectedly, Korea1K showed a slightly higher rate of novel variant discovery than KoVariome (fig. S10; Korea1K, 101,866; KoVariome, 48,051 for 50th individual). This increase might have been caused by implementing newer versions of the variant calling pipeline and the human genome reference. As expected, this also confirms that more sequenced genomes are needed to sufficiently cover very rare variants in the Korean population.
The Korea1K set contained 266,081 nonsynonymous SNVs. Among them, 118,417 and 117,414 were categorized as protein damaging by PolyPhen (24) (possibly damaging, 46,116; probably damaging, 72,301) and SIFT (25) (deleterious, 117,414), respectively. In total, 87,671 variants were predicted as protein damaging by both programs, and their allele frequency is skewed toward rare frequencies, while benign or tolerated variants are skewed toward common frequencies, again indicating purifying selection (fig. S11).
When mitochondrial and chromosomal Y haplogroups among the Korean individuals (figs. S12 and S13) were investigated, the common types identified were D (34.19%), B (13.89%), and M (13.80%) mitochondrial and O (73.49%), C (16.9%), and N (6.58%) chromosomal Y (26, 27). The O male haplogroup is widely distributed in East Asia and Southeast Asia, while the C haplogroup is prominently distributed in East Asia and Northeast Asia (26). We also identified other fairly common mitochondrial haplogroup types (A, G, and F) in East Asia (28).
Genomic features of Koreans compared to other populations
We assessed the genetic distinctiveness of our Korea1K sample using principal components analysis (PCA) with the small size variants (SNP and indel) in our dataset and 1KGP. As previously reported, principal components PC1 and PC2 with worldwide populations showed a separate East Asian group (Fig. 2A). Although Koreans, Chinese, and Japanese are genetically very close relative to all other individuals (29), we found that those three populations clustered distinctly from each other (Fig. 2B). This pattern was replicated by ADMIXTURE analysis with K = 3 (fig. S14).
To investigate functionally relevant variants, we extracted 1048 ClinVar pathogenic variants found in Korea1K. Among them, 242 variants had an allele frequency greater than 0.1 in Korea1K, which is high for pathogenic variants (fig. S15). We also found 35 drug-response variants annotated in ClinVar (fig. S16), and 11 of them displayed significantly different allele frequencies from those of the Chinese or Japanese individuals in the 1KGP set, highlighting the importance of population-specific datasets when interpreting pathogenic or drug-response variants. For example, the variant rs4961 in ADD1 had the highest frequency in the Korea1K compared to other populations and is associated with hypertension and responsiveness to furosemide and spironolactone as shown in a European study (30, 31). However, no significant association with blood pressure was found in our GWAS using the Korea1K set (see the “Genome-wide association study” section for details).
We identified CNVs, TE insertions, and HLA-1 haplotypes (see Supplementary Materials and Methods) as WGS data can identify numerous variants from complex or highly variable nongenic regions (16, 17). Korea1K contains 1441 CNV loci, and 80% of them overlapped with the CNV set from the entire 1KGP samples while not overlapping with segmental duplication regions. Four common CNVs (sample frequency of >0.05) overlapped with those in the 1KGP set and were validated by a secondary CNV caller (Supplementary Materials), which, in turn, contained five protein-coding genes (figs. S17 and S18, extended data table S1). Among the four common CNVs, Korea1K had a copy duplication of CLPS, a pancreatic colipase, which is involved in dietary lipid hydrolysis.
For TE polymorphisms, the patterns of TE insertions between the Korean and other populations were investigated by PCA (figs. S19 and S20 and table S2). PC1 and PC2 identified that four superpopulations (Africans, Asians, Americans, and Europeans) were well separated from each other, whereas subpopulations in East Asia were not. Therefore, a specific TE insertion pattern alone is insufficient to finely differentiate subpopulations in East Asia, although the genomic diversity was clearly reflected in the allele frequency distribution (figs. S21 and S22). TE insertions with significantly different allele frequencies between Koreans and 26 other populations in the 1KGP set were enumerated, and as expected, Korea1K displayed significantly fewer differential TE insertions compared to East Asian populations than non–East Asians (Fig. 2, C and D and extended data table S2). Furthermore, ALU and SINE-VNTR-ALUs (SVA) displayed a greater proportion of differential TE insertions than Long interspersed nuclear element (LINE) in JPT, CHB, and CHS, probably because of different insertion rates on the TE types.
We also compared HLA types in Korea1K with those in publicly available databases containing European, American, and Asian HLA frequencies (figs. S23 and S24). Our HLA allele frequency pattern was very similar to the HLA haplotype distribution of Korean samples from the public database. HLA types A*24:02, A*26:01, A*31:01, B*40:02, and B*52:01 displayed significantly lower allele frequencies in the Korean population relative to the Japanese population (Fisher’s exact test P = 3.61 × 10−49, 7.09 × 10−8, 1.34 × 10−12, 9.61 × 10−12, and 3.13 × 10−42, respectively), while types A*33:03 and B*44:03 had higher allele frequencies (Fisher’s exact test P = 3.10 × 10−46 and 1.00 × 10−5, respectively). Although the Japanese are genetically very close to the Korean, the HLA-type profiles of these populations are considerably different. However, we identified similarities in the Asian populations; for example, types A*33:03, A*02:06, and B*58:01 displayed relatively high allele frequencies, while types A*02:01, A*03:01, A*01:01, A*32:01, A*68:01, B*07:02, B*44:02, and B*08:01 displayed low frequencies in Asian populations (Korean, Japanese, and Chinese populations) compared to other groups.
GWAS based on clinical traits
Thanks to its extensive genomic coverage, population-scale WGS data are more effective than chip-based approaches at identifying statistically significant associations with quantitative traits and diseases (12). It is even more powerful if matched clinical or phenotypic data are available for the genomes (32). In the Korea1K set, we were able to quantify 79 quantitative clinical traits measured from 984 samples (extended data table S3) through health checks provided to the KGP participants. We analyzed associations by fitting additive genetic models with relevant covariates for 79 quantitative traits and 6,658,227 variants [5,932,215 SNVs and 726,012 indels; minor allele frequency (MAF) of >1%] from 823 unrelated individuals of the 984 samples. The analysis resulted in 467 variants that were statistically linked via GWAS to 11 quantitative traits (P < 7.5 × 10−9, the Bonferroni-corrected significance threshold). The 467 variants were clumped into 15 independent loci on eight chromosomes, and 11 of them contained previously reported variants linked to a trait. We found that 11 index variants were not present on the commonly used Illumina Omni 2.5 human SNP chip (Fig. 3, Table 1, figs. S25 to S28, and extended data tables S4 and S5). Among the 11 loci of reported variants, 9 contained variants reported in the GWAS catalog (33), but their index variants were newly identified in this study. Of the 15 independent loci, we could identify 1 previously unidentified locus in MAMASTR, which is associated with cancer marker (carbohydrate antigen 19-9 and carcinoembryonic antigen). A previously unidentified locus was also found in WDPCP, which is associated with body fat percentage. Another previously unidentified locus in SERPINA7 was found, which is associated with triiodothyronine. We also found two loci on chromosome 2 (rsID: rs28946889, trait: total bilirubin, P = 1.85 × 10−23; rs662799, neutral fat, P = 4.22 × 10−10) that have been previously identified in two Korean GWAS (34, 35). The MAFs in the previously unidentified loci were markedly lower than those previously reported when we compared the MAFs of GWAS variants in these previously reported and the unreported loci (fig. S29). This means that large-scale variomes from WGS data help identify low-frequency alleles and unreported loci via whole genome–based GWA studies.
Korea1K imputation panel
Haplotype-based imputation is a cost-effective method to capture human genetic variation for clinical purposes. Crucially, the accuracy of imputation is improved when a population-specific reference panel is used (17). We first constructed a phased reference panel using the Korea1K dataset and the combination of Korea1K and 1KGP panel by using SHAPEIT2 (36). The imputation accuracy for the three reference panels [Korea1K (n = 1059), 1KGP (n = 2504), and Korea1K + 1KGP (rephased, n = 3563)] was evaluated by imputing prephased variants from the matched normal sample of the 19 Korean patients with gastric cancer. This test set was imputed with the three reference panels using Minimac3 (37). The accuracy was evaluated by comparing the squared Pearson correlation coefficient between the real genotypes and the dosage of imputed genotypes. The Korea1K panel showed better correlation with true genotypes at low allele frequencies than the 1KGP panel, and the combined Korea1K + 1KGP panel had the best accuracy overall, indicating the usefulness of Korea1K set for the imputation of Korean SNV data (Fig. 4).
The Korea1K dataset as a panel of normals for cancer genomics studies
Variome databases can be used as reference panels for genome-wide clinical studies (38). One of the practical examples is cancer mutation analysis. Many cancer-related studies need matched normal control samples to filter germline variants from the entire call set (39, 40). However, it is sometimes unfeasible to have matched normal sequencing data of the same patients with cancer. As an alternative to matched normal, a large-scale reference panel of variants can potentially serve as a control set for cancer genomics studies to filter out germline variants (41).
To estimate the power of our Korea1K dataset as an ethnicity-specific panel of normals, we evaluated the effectiveness of Korea1K, 3.5KJPN (17) and 1KGP (11) as a panel of normal for previously reported Korean gastric cancer datasets (19). We first identified true somatic and germline variants by comparing WGS data from cancer and matched normal tissues. Then, for the test sets, we classified the variants from the cancer tissue into tentative somatic or germline, on the basis of allele frequency cutoffs of the reference panels. When a target variant has a lower allele frequency value than the cutoff value in the reference panel, the variant was classified as a tentative somatic variant. If it has a higher allele frequency value, then it was classified as a tentative germline variant. Thereafter, we generated statistical measures of the classification performance of the datasets by comparing predicted variant categories based on multiple stepwise allele frequency cutoffs with the true variant categories (Fig. 5).
Although the 3.5KJPN set contained the largest number of variants, the Korea1K dataset had the highest accuracy of prediction of germline and somatic variants (Fig. 5A; Korea1K with an allele frequency cutoff of 0.01: 96.42%; 3.5KJPN with an allele frequency cutoff of 0.01: 93.83% on average). Furthermore, the Korea1K set had similar Matthews correlation coefficient values to 3.5KJPN, indicating a similar classification performance (Fig. 5B; Korea1K with an allele frequency cutoff of 0.01: 0.38; 3.5KJPN with an allele frequency cutoff of 0.01: 0.37 on average). Germline variants were predicted with the highest accuracy using the Korea1K set (Fig. 5C), while the recovery rate for somatic variants was low (fig. S30). Since we used samples from Korean individuals with cancer, an overall increase in similarity was noted with the Korea1K dataset. This leads to the speculation that a greater number of true somatic variants were filtered out using the Korea1K set than with other datasets; nonetheless, the density of variants of cancer-related genes from the Cancer Gene Census (CGC) database was the highest in the Korea1K-filtered set (fig. S31). Moreover, when the germline filtering criterion was set to an allele frequency of 1.5%, the Korea1K set had the highest density of CGC genes. Since we used a lift-over reference panel for 3.5KJPN and 1KGP, we also applied the same approach as with the Korea1K variants on only lift-over possible regions and confirmed that there were no qualitative changes to the results (figs. S32 to S34). These results highlight the possible benefits of using an ethnicity-specific variome database for cancer genome analyses of the same or a very closely related ethnic group(s).
This study presents a comprehensive WGS analysis of 1094 Koreans (Korea1K), which is a mixture of existing KoVariome (9) and newly added 1007 genomes with clinical information. On the basis of our analysis, Korean population is genetically homogeneous compared to other East Asians, and this is probably due to geopolitical isolation in the past thousands of years. However, we speculate that, although Koreans are fairly homogeneous, more than 1000 samples are necessary to map the Korean human genome diversity judging from the assessment of the discovery rate of newly observed variants (allele frequency of >0.05 variants saturated after 132 samples, while the rate of singleton and doubleton variants kept increasing even after analyzing all the 916 healthy unrelated samples). Despite a large amount of genomic data, coupled with clinical information, our CNV and TE analyses did not identify anything unusual or unique. This could be because short-read DNA-sequencing methods have an inherent difficulty in detecting structural variations that cannot be easily resolved bioinformatically, and we must perform long-read sequencing using the same samples in the future to map novel associations between these complex variants and phenotypically accessible traits. Furthermore, we note that our samples are mostly from the Ulsan metropolitan area and cannot reflect the whole Korean peninsula, although Ulsan has a population size of 1 M and the residents are from all across the peninsula due to rapid industrialization. Together, the sample size of 1094 from mostly Ulsan is still far from sufficient to represent the Korean population or to map latent genomic structural variations.
Our investigation of using Korea1K as a panel of normals for cancer genomics studies can be a small stepping stone for an efficient germline prefiltering process for cancer genome analyses in the future. However, it is still questionable how much actual benefit such ethnicity-specific variome-based filtering can bring to cancer genome analyses in real clinical settings, especially for rare or individual-specific variant analysis. Nevertheless, the large-scale Korean variome database constructed herein is potentially applicable in studies on various cancers and other diseases of Koreans and can indirectly help reduce the cost of certain genetic analyses. This kind of personal whole-genome dataset combined with common health check–derived clinical information is possibly a good exemplary path for an ethnicity-relevant reference panel for future personalized medical applications for Koreans.
Sample collection and sequencing
Informed consent was obtained from all individuals for their participation in the Korean Ulsan genome project, which comprises two subprojects. All clinical information was examined by the Ulsan University Hospital. In total, 696 samples were curated in the Ulsan University Hospital Biobank, from which samples were received thereafter. Further, 311 samples were collected by us. We downloaded data from 87 Korean samples from KoVariome (9), which collected volunteers from all across the Korean peninsula. Sample collection and sequencing was approved by the Institutional Review Board (IRB) of the Ulsan National Institute of Science and Technology (UNISTIRB-15-19-A and UNISTIRB-16-13-C). Genomic DNA was isolated from human blood samples, using the DNeasy Blood & Tissue kit (Qiagen, Germany) in accordance with the manufacturer’s protocol. Genomic DNA from saliva samples was isolated using the GeneAll Exgene trademark clinic SV mini kit. Extracted DNA was quantified using the Quant-iT BR assay kit (Invitrogen). High–molecular weight genomic DNA was sheared using a Covaris S2 ultra sonicator system to obtain fragments of appropriate sizes. Libraries with short 350–base pair (bp) inserts for paired-end reads were prepared using the TruSeq Nano DNA sample prep kit in accordance with the manufacturer’s protocol for Illumina-based sequencing. The products were quantified using the Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA), and raw data were generated using an Illumina HiSeq X10 platform (Illumina). Clusters were generated using paired-end 2 × 150-bp cycle sequencing reads via resequencing. Further image analysis and base calling were carried out using the Illumina real-time analysis program (https://sapac.illumina.com/informatics/sequencing-data-analysis.html) with default parameters following the manufacturer’s instructions. The quality of the base in the read was checked by FastQC (ver. 0.11.5; www.bioinformatics.babraham.ac.uk/projects/fastqc/) (table S3).
Adapter contamination was trimmed using Cutadapt (ver. 1.9.1) (42) with a forward adapter (′GATCGGAAGAGCACACGTCTGAACTCCAGTCAC′) and reverse adapter (′GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT′) and with a minimum read length of 50 bp after trimming. Thereafter, trimmed reads were mapped to the hg38 reference, using BWA-MEM (ver. 0.7.16a) with the “-M” option (43). Mapped BAM files were sorted by coordination, using Picard (ver. 2. 14. 0) with the Sortsam module. Duplicate reads were marked using Picard (ver. 2. 14. 0) with the MarkDuplicates module. The mapping quality was recalibrated using the BaseRecalibrator tool in the Genome Analysis Tool Kit (GATK) (ver. 3.7) (44). gVCF files were generated by HaplotypeCaller in GATK (44) with “-stand_call_conf 30-ERC GVCF” option. SNVs and indels were jointly genotyped from the gVCF files by GenotypeGVCFs in GATK (44). The called variants were annotated using Variants Effect Predictor (VEP) ver. 92 (45), and the fraction was estimated under negative selection, using the script of Moon and Akey (https://github.com/moon-s/fraction-under-selection) (21). For estimation of a fraction under selection pressure for each protein-coding gene, we selected genes that have more than 250 alternative allele count sums, since the small number of allele count may not produce proper site frequency spectrums. The following annotated variants were assigned to LoF mutations: “Frame_Shift_Del”, “Frame_Shift_Ins”, “In_Frame_Del”, “In_Frame_Ins”, “Nonsense_Mutation”, “Nonstop_Mutation”, or “Splice_Site”. The detailed methods for calling CNV, TE insertion, and HLA typing were noted to Supplementary Materials and Methods.
Batch effect removal
Each sample was labeled in accordance with its sequencing library preparation protocol, sequencing company, and the date of sending blood samples or libraries to the company. Twelve technical batches were identified. The batch effect was assessed via PCA, using EIGENSOFT (ver. 6.1.4) (46), using variants and samples in accordance with the following criteria:
1) Biallelic SNVs with a MAF of ≥5%.
2) P values of the Hardy-Weinberg Equilibrium (HWE) test >0.05.
3) Genotype missing rate of <0.01.
Thereafter, filtered variants were pruned on the basis of linkage disequilibrium (LD), using PLINK (47) (ver. 1.9b) with “–indep-pairwise 200 4 0.1”,, leaving 101,326 SNVs. For individual selection, closely related individuals identified on the basis of an identity by descent (IBD), estimated in PLINK (47), were filtered. All pairs with an IBD value of >0.125 were extracted (corresponding to third-degree relatives) and clustered to a family group. Until no pairs of relatives remained, each family group was then reduced as follows:
1) The sample with the highest number of pairs in the family group was eliminated.
2) The sample with the highest missing calls among LD-pruned SNVs was eliminated if there are several samples with the same number of pairs in the family group.
To identify variants exhibiting the batch effect, we used logistic regression models for all variants as follows:
1) The variant was eliminated if it was a batch-specific variant compared to all other batches.
2) Each batch was paired with another, resulting in all possible combinations. The variant was eliminated if it was significant in any of the combinations.
In total, 6,348,049 variant positions were significantly associated with the technical batch (P ≤ 0.01) and eliminated from the original set. We used the quality by depth (QD) value in a joint VCF file for plotting variants’ quality distribution.
PCA and ADMIXTURE with the 1KGP genome data
The interpopulation genomic structure was evaluated by projecting the first two PCs determined via PCA of SNVs from Korea1K samples and 1KGP without closely related individuals. We selected and merged variants and from the Korea1K and 1KGP sets in accordance with the following criteria:
1) Biallelic SNVs with a MAF of ≥5%.
2) Biallelic SNVs with an HWE P >10−6.
3) Biallelic SNVs with a missing genotype rate of <0.01.
Extracted variants were LD pruned using “–indep 50 5 2” in PLINK (47), yielding 153,633 sites. PCA was carried out using the EIGENSOFT program (46). ADMIXTURE (48) analysis was performed from K = 2 to K = 14 based on the same variants set as PCA. We plotted an ADMIXTRUE plot for K = 3, which showed the smallest cross-validation error rate across the Ks.
Mitochondrial and chromosome Y haplogroup analysis
Mitochondrial haplogroups were identified via Haplogrep (ver. 2.1.13) (49, 50), and the Yfitter tool (ver. 0.2) (51) was used to identify Y chromosome haplogroups. We prepared the input files for the Yfitter program by converting hg38 coordination to hg19 coordination, using CrossMap (52) (ver. 0.2.7).
Genome-wide association study
For the GWAS, 823 individuals, 6,658,227 variants, and 79 traits were selected in accordance with the following criteria:
1) Individuals whose clinical traits were examined.
2) Individuals having no rare diseases.
3) Individuals having no kinship within the selected samples.
4) SNVs and indels having a MAF of ≥1%.
5) SNVs and indels having an HWE P > 10−6.
6) SNVs and indels having a missing genotype rate of <0.01.
The GWAS was performed exclusively using quantitative traits. GWA analysis was performed using linear regression under an additive genetic model. Age, age2, sex, body mass index (BMI), and the first 10 principal components were included as covariates. BMI was excluded from covariates in the GWAS for BMI itself and the degree of obesity. The genome-wide significance threshold was determined to be 7.51 × 10−9 through Bonferroni correction (0.05/6,658,227). The study-wide significance threshold was determined using the equation (0.05 / (the number of tested traits × the number of tested variants)). Variants were grouped into the loci with “–clump-p1 0.0000001 –clump-kb 1000 –clump-r2 0.1” options with PLINK (version 1.9) (47). For each locus, previously reported variants associated with the trait of interest were examined in order from the most significant variant with the National Human Genome Research Insitutue (NHGRI) GWAS catalog (33) (P ≤ 5 × 10−8, ver. 2018-12-07).
Imputation panel construction
To construct the Korea1K imputation reference panel, 1059 healthy individuals that had no rare diseases with a total of 28,692,913 autosomal biallelic variants with a missing genotype call rate of <0.1 and minor allele count of >1 (not a singleton) were selected. The variants were phased into haplotype using SHAPEIT2 (version v2.r904) (36), and the Korea1K set was used to construct a rephased imputation panel using the 1KGP reference panel. We chose an alternative allele from Korea1K if the alternative allele of Korea1K and 1KGP is discordant during the merging step. To evaluate the imputation accuracy of the reference panels, we separately processed matched normal samples from previously published 19 unrelated Korean patients with gastric cancers obtained from National Center for Biotechnology Information (NCBI; SRP014574 and SRA057772). For a test set, we extracted 1,302,490 variants that were present in Illumina Omni 2.5 chip from the 19 individuals and obtained 1,243,087 prephased SNVs using SHAPEIT2 (36). The prephased test set was imputed using the prepared reference panels by Minimac3 (ver. 2.0.1) (37). Imputation accuracies were estimated using squared Pearson correlation coefficients (R2) between the true genotypes and imputed genotype dosages.
Processing of previously reported samples from patients with cancer and classification of variants
WGS data from 19 previously reported Korean individuals with gastric cancer were obtained from NCBI (SRP014574 and SRA057772) and mapped to hg38 using BWA-MEM (ver. 0.7.15) with the “-M” option, and the SAM format was converted to the BAM format using SAMtools (ver. 1.4) (53). The BAM files were sorted using SAMtools (ver. 1.4), and duplicated reads were marked using the MarkDuplicates module in Picard tools. Base realignment and recalibration of the base quality score were carried out using GATK (ver. 3.7) (44). Variants from all samples were called using GATK HaplotypeCaller with the joint calling mode. All identified variants were annotated using the Ensembl VEP (ver. 92.1) (45). Furthermore, the variants were annotated to the cancer-related genes from the CGC database (54). If a variant was identified exclusively in a sample from an individual with cancer, then we treated these variants as somatic. Since 3.5KJPN only provides information regarding variants with hg19 coordinates, the variants were converted to the hg38 coordination, using the lift-over tool in the University of California, Santa Cruz genome browser (55). Thereafter, we merged the information regarding allele frequencies in the Korea1K and 3.5KJPN sets to the annotated variants based on genomic location and allelic information. Annotated variants were classified into tentative somatic or germline variants based on the allele frequency cutoffs in each panel (Korea1K, 3.5KJPN, and 1KGP). If a variant showed a lower allele frequency value than the cutoff value or was not presented in the reference panel, then the variant was classified as a tentative somatic variant. If not, then it was classified as a tentative germline variant. Thereafter, the classification and the true sets were compared to evaluate the performance of the datasets.
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.
- Copyright © 2020 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).