CNV frequency annotation
Several factors complicate the assessment of allele frequencies for copy number variants:
- The breakpoints of CNV calls based on sequence coverage are imprecise, and therefore the same variant can have different breakpoint coordinates in different individuals.
- Large CNVs can be reported as several separate calls (i.e., fragmented calls). This is often due to a copy number change within the region of a large CNV, for example, due to a smaller nested CNV or a complex structural rearrangement.
- Distinguishing between different combinations of alleles that can give rise to the same copy number is challenging. For example, a copy number of 3 could be the result of a tandem duplication with 2 copies on one allele and a single copy on the other allele, or a tandem duplication with 3 copies on one allele and a deletion on the other allele, or two single copy alleles with an additional copy elsewhere in the genome.
- The accuracy of exact copy number inference for gains with more than 3 copies is not known.
Because of these issues, no single method for calculating allele frequencies for CNVs is perfect. We therefore report two frequencies, each calculated with a different strategy. CNV frequencies were calculated using 5,415 germline samples from unrelated individuals (participants in the Cancer program of the 100,000 Genomes Project and the COVID-19 research project).
Reciprocal overlap
In this method, CNV calls from the 5,415 reference samples are compared individually to the CNVs in the sample, as shown in a figure below, using an 80% reciprocal overlap threshold (the overlapping region must cover at least 80% of each of the two calls). A limitation of this method is that the frequency may be inaccurate when CNVs are fragmented: fragmented calls can inappropriately appear to be rare.
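The reciprocal overlap test can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the `Cnv` type, the function names, the half-open coordinates, and the requirement that both calls share the same type (LOSS or GAIN) are all assumptions made for the example.

```python
from typing import NamedTuple


class Cnv(NamedTuple):
    """Hypothetical CNV record; coordinates are half-open [start, end)."""
    chrom: str
    start: int
    end: int
    kind: str  # "LOSS" or "GAIN"


def reciprocal_overlap(a: Cnv, b: Cnv) -> float:
    """Return the smaller of the two overlap fractions (0.0 if disjoint)."""
    # Assumption: only calls of the same type on the same chromosome compare.
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def is_match(a: Cnv, b: Cnv, threshold: float = 0.8) -> bool:
    """True if the overlap covers at least `threshold` of both calls."""
    return reciprocal_overlap(a, b) >= threshold


patient = Cnv("chr1", 1000, 2000, "LOSS")

# The same variant called with slightly different breakpoints still matches.
shifted = Cnv("chr1", 1100, 2050, "LOSS")
print(is_match(patient, shifted))  # True (overlap fractions 0.9 and ~0.95)

# The same variant reported as two fragments matches neither fragment.
fragments = [Cnv("chr1", 1000, 1450, "LOSS"), Cnv("chr1", 1550, 2000, "LOSS")]
print([is_match(patient, f) for f in fragments])  # [False, False]
```

The second case shows the fragmentation limitation: each fragment covers only 45% of the patient CNV, so a reference sample carrying the variant would not be counted.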
Frequency track – area under the curve method
In this method, CNV calls from the 5,415 reference samples are combined into a single frequency track. Each base is annotated with the number of chromosomes, across the reference samples, for which there is an overlapping CNV. Then the area under the curve is calculated for each CNV detected in a patient, taking into account both the number of bases and the number of chromosomes in which a CNV is found in the reference dataset. This area is then normalized by the maximum possible area (i.e., an allele frequency of 1 is equivalent to all reference samples having a CNV covering all bases of the patient CNV).
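The calculation above can be sketched as follows. This is an illustration under stated assumptions rather than the production code: the function name, the `(start, end, n_chromosomes)` tuple layout, the half-open coordinates, and the diploid `ploidy` default are hypothetical, and calls from the same sample are assumed not to overlap one another. Summing overlap length times chromosome count per call gives the same area as summing the per-base track.

```python
def auc_frequency(query_start, query_end, reference_calls, n_samples, ploidy=2):
    """Area under the frequency track across a patient CNV, normalized to [0, 1].

    reference_calls: iterable of (start, end, n_chromosomes) tuples, where
    n_chromosomes is the number of chromosomes carrying the CNV in that sample
    (hypothetical layout, half-open coordinates).
    """
    length = query_end - query_start
    if length <= 0:
        raise ValueError("query CNV must have positive length")
    area = 0
    for start, end, n_chromosomes in reference_calls:
        overlap = min(end, query_end) - max(start, query_start)
        if overlap > 0:
            area += overlap * n_chromosomes
    # Maximum area: every chromosome of every sample covers every base.
    max_area = length * n_samples * ploidy
    return area / max_area


# Toy reference set of 10 samples and a patient CNV spanning bases 0..1000.
whole = [(0, 1000, 1)]                       # one call spanning the patient CNV
fragmented = [(0, 450, 1), (550, 1000, 1)]   # the same call split in two

print(auc_frequency(0, 1000, whole, n_samples=10))       # 0.05
print(auc_frequency(0, 1000, fragmented, n_samples=10))  # 0.045
```

The fragmented call yields nearly the same frequency as the intact one, which is the sense in which this method tolerates fragmentation.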
An advantage of this method is that it is robust to CNV fragmentation. A limitation is that it does not reveal whether the underlying frequency track results from calls of similar size to the one detected in the patient or from smaller overlapping CNVs detected in different individuals. If a CNV overlaps two high-frequency regions (e.g., one at each end) separated by a low-frequency region, the overall area under the curve may not be representative of the individual regions. In particular, the contribution of the high-frequency regions could mask the existence of the low-frequency region.
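The masking effect can be seen in a small worked example. The sketch below is purely illustrative: the `summarize_track` helper and the segment values are hypothetical, and reporting the minimum alongside the overall value is not part of the annotation described here.

```python
def summarize_track(segments):
    """Return (overall frequency, minimum frequency) across a patient CNV.

    segments: list of (length_in_bases, frequency) pairs that tile the CNV
    (hypothetical summary of the frequency track).
    """
    total_bases = sum(length for length, _ in segments)
    area = sum(length * freq for length, freq in segments)
    return area / total_bases, min(freq for _, freq in segments)


# High-frequency regions at each end, separated by a region never seen
# in the reference samples.
overall, lowest = summarize_track([(400, 0.5), (200, 0.0), (400, 0.5)])
print(overall, lowest)  # 0.4 0.0
```

Here the CNV is annotated with an overall frequency of 0.4 even though 200 of its bases are never covered by a CNV in the reference dataset.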
Differences in LOSS and GAIN calculations
For LOSS variants, allele frequencies are calculated and reported. For GAIN variants, due to difficulties in determining the exact copy number and defining the alleles in all individuals, the proportion of individuals with any GAIN call is calculated and reported, not taking copy number into account.
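The difference between the two reported quantities can be sketched for a single locus. The function names and the input (one total copy number per reference sample) are assumptions made for illustration. In particular, counting deleted alleles as `ploidy - copy_number` is a simplification for a diploid region.

```python
def loss_allele_frequency(copy_numbers, ploidy=2):
    """Fraction of chromosomes carrying a deletion at the locus.

    Assumption: a sample with copy number c < ploidy carries
    (ploidy - c) deleted alleles.
    """
    deleted = sum(max(ploidy - cn, 0) for cn in copy_numbers)
    return deleted / (ploidy * len(copy_numbers))


def gain_carrier_proportion(copy_numbers, ploidy=2):
    """Fraction of individuals with any gain, ignoring the exact copy number."""
    carriers = sum(1 for cn in copy_numbers if cn > ploidy)
    return carriers / len(copy_numbers)


# Ten toy samples: one heterozygous loss, one homozygous loss, two gains.
copy_numbers = [2, 2, 1, 0, 3, 4, 2, 2, 2, 2]
print(loss_allele_frequency(copy_numbers))    # 0.15 (3 of 20 chromosomes)
print(gain_carrier_proportion(copy_numbers))  # 0.2  (2 of 10 individuals)
```

Note that the denominators differ: chromosomes for LOSS, individuals for GAIN, so the two values are not directly comparable.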