No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.
These online methods contain an abridged description of the methodology used in the current manuscript; extensive details about the methodology we used are provided in Supplementary Note 2. Importantly, two independently developed computational frameworks (SigProfiler and SignatureAnalyzer) based on NMF were applied separately to the examined sets of mutational catalogues. SigProfiler and SignatureAnalyzer take different approaches for deciphering mutational signatures and for assigning each signature to each sample. By using two methods, we aimed to provide a perspective on the effect that different methodologies can have on the numbers of signatures generated, signature profiles and attributions. In addition to applying SigProfiler and SignatureAnalyzer to cancer data, the tools were also applied to realistic synthetic data with known solutions.
Analysis of mutational signatures with SigProfiler
SigProfiler incorporates two distinct steps for identification of mutational signatures, based on the previously described methodology6,11,17 (Extended Data Fig. 8). The first step (SigProfilerExtraction) encompasses a hierarchical de novo extraction of mutational signatures based on somatic mutations and their immediate sequence context, and the second step (SigProfilerAttribution) focuses on accurately estimating the number of somatic mutations associated with each extracted mutational signature in each sample. SigProfilerExtraction is an extension of a previous framework for the analysis of mutational signatures11,17. In brief, for a given set of mutational catalogues, the algorithm deciphers a minimal set of mutational signatures that optimally explains the proportion of each mutation type and estimates the contribution of each signature to each sample. More specifically, for each NMF iteration, SigProfilerExtraction minimizes a generalized Kullback–Leibler divergence constrained for nonnegativity (Supplementary Note 2). The algorithm uses multiple NMF iterations (in most cases 1,024) to identify the matrix of mutational signatures and the matrix of the activities of these signatures, as previously described17. The unknown number of signatures is determined by human assessment of the stability and accuracy of solutions for a range of values, as previously described17. The framework is applied hierarchically to increase its ability to find mutational signatures that generate few mutations or are present in few samples.
After signatures are discovered by SigProfilerExtraction, SigProfilerAttribution estimates their contributions to individual samples. For each examined sample, the estimation algorithm involves finding the minimum of the Frobenius norm of a constrained function using a nonlinear convex optimization programming solver using the interior-point algorithm63. See Supplementary Note 2 and Extended Data Fig. 8b for further details.
Analysis of mutational signatures with SignatureAnalyzer
SignatureAnalyzer uses a Bayesian variant of NMF that infers the number of signatures through the automatic relevance determination technique and delivers highly interpretable and sparse representations for both signature profiles and attributions that strike a balance between data fitting and model complexity. Further details of the actual implementation of the computational approach have previously been published9,27,64. SignatureAnalyzer was applied by using a two-step signature extraction strategy using 1,536 pentanucleotide contexts for SBSs, 83 indel features and 78 DBS features. In addition to the separate extraction of SBS, indel and DBS signatures, we performed a ‘COMPOSITE’ signature extraction based on all 1,697 features (1,536 SBS + 78 DBS + 83 indel). For SBSs, the 1,536 SBS COMPOSITE signatures are preferred; for DBSs and indels, the separately extracted signatures are preferred.
In step 1 of the two-step extraction process, global signature extraction was performed for the samples with a low mutation burden (n = 2,624). These excluded hypermutated tumours: those with putative polymerase epsilon (POLE) defects or mismatch repair defects (microsatellite instable tumours), skin tumours (which had intense UV-light mutagenesis) and one tumour with temozolomide (TMZ) exposure. Because the underlying algorithm of SignatureAnalyzer performs a stochastic search, different runs can produce different results. In step 1, we ran SignatureAnalyzer 10 times and selected the solution with the highest posterior probability. In step 2, additional signatures unique to hypermutated samples were extracted (again selecting the highest posterior probability over ten runs) while allowing all signatures found in the samples with low mutation burden, to explain some of the spectra of hypermutated samples. This approach was designed to minimize a well-known ‘signature bleeding’ effect or a bias of hyper- or ultramutated samples on the signature extraction. In addition, this approach provided information about which signatures are unique to the hypermutated samples, which was later used when attributing signatures to samples.
A similar strategy was used for signature attribution: we performed a separate attribution process for low- and hypermutated samples in all COMPOSITE, SBS, DBS and indel signatures. For downstream analyses, we preferred to use the COMPOSITE attributions for SBSs and the separately calculated attributions for DBSs and indels. Signature attribution in samples with a low mutation burden was performed separately in each tumour type (for example, Biliary–AdenoCA, Bladder–TCC, Bone–Osteosarc, and so on). Attribution was also performed separately in the combined microsatellite instable tumours (n = 39), POLE (n = 9), skin melanoma (n = 107) and TMZ-exposed samples (syn11738314). In both groups, signature availability (which signatures were active, or not) was primarily inferred through the automatic relevance determination process applied to the activity matrix H only, while fixing the signature matrix W. The attribution in samples with a low mutation burden was performed using only signatures found in the step 1 of the signature extraction. Two additional rules were applied in SBS signature attribution to enforce biological plausibility and minimize a signature bleeding: (i) allow SBS4 (smoking signature) only in lung, head and neck cases; and (ii) allow SBS11 (TMZ signature) in a single GBM sample. This was enforced by introducing a binary, signature-by-sample signature indicator matrix Z (1, allowed; 0, not allowed), which was multiplied by the H matrix in every multiplication update of H. No additional rules were applied to indel or DBS signature attributions, except that signatures found in hypermutated samples were not allowed in samples with a low mutation burden.
Application of SigProfiler and SignatureAnalyzer to synthetic data
Our goal was to evaluate SignatureAnalyzer and SigProfiler on realistic synthetic data to identify any potential limitations of these two methods. SignatureAnalyzer and SigProfiler were tested on 11 sets of synthetic data, encompassing a total of 64,400 synthetic samples, in which known signature profiles were used to generate catalogues of synthetic mutational spectra. We operationally defined ‘realistic’ data as those based on the characteristics of either SignatureAnalyzer’s or SigProfiler’s analysis of the PCAWG genome data. SignatureAnalyzer’s reference signature profiles were based on COMPOSITE signatures, consisting of 1,536 types of strand-agnostic SBSs in pentanucleotide context, 78 types of DBSs and 83 types of small indels, for a total of 1,697 mutation types. SigProfiler’s reference analysis was based on strand-agnostic SBSs in the context of one 5′ and one 3′ base. For each test, we generated two sets of realistic data: SigProfiler-realistic (based on SigProfiler’s reference signatures and attributions) and SignatureAnalyzer-realistic (based on SignatureAnalyzer’s reference signatures and attributions), as well as two other types of data that involved using SignatureAnalyzer profiles with SigProfiler attributions and vice versa. A detailed description of each of the 11 sets of synthetic data and the results from applying SigProfiler and SignatureAnalyzer are provided in Supplementary Note 2.
Analysis of clustered mutational signatures
Somatic SBSs were considered clustered if they had intermutational distances < 1,000 bp. More specifically, for each sample, an SBS mutational catalogue was generated for substitutions that were <1,000 bp from another substitution. Subsequently, the set of SBS mutational catalogues containing clustered mutations underwent de novo extraction of mutational signatures. Any novel mutational signature (one that was not previously observed in the complete SBS catalogues) was reported as a clustered mutational signature.
Better separation compared to COSMIC v.2 signatures
As described in the manuscript, all mutational signatures previously reported in COSMIC v.2 were confirmed in the new set of analyses with median cosine similarity of 0.95. However, the separation between the COSMIC v.2 mutational signatures (https://cancer.sanger.ac.uk/cosmic/signatures_v2) is much worse than the separation between the mutational signatures reported here. For example, in COSMIC v.2, signatures 5 and 16 had a cosine similarity of 0.90, making them hard to distinguish from one another. By contrast, in the current analysis, SBS5 and SBS16 have a cosine similarity of 0.65. This allows us to unambiguously assign SBS5 and SBS16 to different samples. In the current analysis, the larger number of samples has allowed the reduction of bleeding between signatures and has given more unique and easily distinguishable signatures. One can evaluate the overall separation of a set of mutational signatures by examining the distribution of cosine similarities between the signatures in the set. The signatures in COSMIC v.2 had a median cosine similarity of 0.238. By contrast, the current signatures have a much lower median cosine similarity of 0.098. This twofold reduction in similarity is highly statistically significant (P value 9.1 × 10−25) and indicates a better separation between the signatures in the current analysis.
Correlations of mutational signature activity with age
Before evaluating the association between age and the activity of a mutational signature, all outliers for both age and numbers of mutations attributed to a signature in a cancer type were removed from the data. An outlier was defined as any value outside three standard deviations from the mean value. A robust linear regression model that estimated the slope of the line and whether this slope was significantly different from zero (F test; P value < 0.05) was performed using the MATLAB function robustfit (https://www.mathworks.com/help/stats/robustfit.html) with default parameters. The P values from the F tests were corrected using the Benjamini–Hochberg procedure for false discovery rates. Results are available at syn12030687 and syn20317940.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.