Frequently asked questions regarding the description of sequence variants
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.
This page gives an overview of the questions we have received regarding the description of sequence variations, based on the existing recommendations published by den Dunnen and Antonarakis (2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format).
For reactions, e-mail ddunnen@LUMC.nl and Stylianos.Antonarakis@medecine.unige.ch
Question (Marco Montagna, Padua, Italia)
Recently, I have been involved in the molecular characterization of BRCA1 gene
rearrangements that are becoming more and more frequent in breast/ovarian cancer families.
Most often these rearrangements are mediated by Alu sequences with a very high homology
that reaches 100% in the breakpoint region. I looked at the reference papers on mutation
nomenclature, but I still have some doubts about how to define this kind of mutation. In
particular, if a genomic deletion is mediated by Alu sequences that are identical over
a large nucleotide stretch containing the breakpoint, which nucleotide should be indicated?
Could I indicate the most 3' one (considering the "sense" strand), similarly to
the rule for deletions in repeated sequences? Moreover, if more than one genomic sequence
is present in GenBank, which one should be considered? For instance, for a rearrangement
that deletes a genomic region of 20kb containing exon 1 and the upstream sequence, with a
breakpoint occurring over a stretch of nucleotides that are identical in the two
recombining sequences, and a genomic reference sequence of the antisense strand, I would
suggest the following definition: nt.X (the most 5' in the identity region of the
"antisense" reference sequence, i.e. the most 3' in the "sense"
strand) -- nt.Y del 20kb (exon > 1). Would that be fine?
Answer
You touch on two subjects: the location of the breakpoint and the reference
sequence.
Breakpoint: indeed, as you suggest, when
breakpoints occur in stretches of identical sequences the most 3' position (considering
the sense strand) is used to describe the position of the breakpoint (see
Recommendations).
Reference sequence: any reference sequence
is acceptable, provided you specify the one you use (database accession.version
number, see Reference sequence discussion).
When present, it would be best to use the genomic Reference Sequence from the RefSeq
database.
When such a sequence is not present you should make, annotate and submit one (see
Discussion).
Depending on whether a genomic or a coding DNA reference sequence is
used, the final description should have the format g.1234_7234del (alternative
g.1234_7246del6012)
or c.123+45_955-234del (alternative c.123+45_955-234del6012).
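The most-3' rule for breakpoints in stretches of identical sequence can be sketched in code. This is a minimal Python illustration only, not part of the recommendations: the function names and the toy sequence are hypothetical, and positions are assumed to be 1-based on the sense strand of whatever reference sequence is specified.

```python
def shift_deletion_3prime(ref: str, start: int, end: int) -> tuple[int, int]:
    """Shift a deletion (1-based, inclusive) to its most 3' equivalent position."""
    # Moving one step 3' yields the same deleted allele whenever the first
    # deleted base equals the base directly following the deletion.
    while end < len(ref) and ref[start - 1] == ref[end]:
        start += 1
        end += 1
    return start, end


def describe_deletion(ref: str, start: int, end: int) -> str:
    """Format the shifted deletion relative to a genomic reference sequence."""
    start, end = shift_deletion_3prime(ref, start, end)
    return f"g.{start}del" if start == end else f"g.{start}_{end}del"


# Toy reference (an assumption for illustration): GCT repeated three times.
toy_ref = "AAGCTGCTGCTAA"
print(describe_deletion(toy_ref, 3, 5))  # g.9_11del
```

Whichever of the three GCT units is reported as deleted, the sketch returns the same description, which is the point of assigning the change to the most 3' position.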
Question (Erik-Jan Kamsteeg, Nijmegen, Nederland)
The recommendations to describe unknown breakpoints are not exactly
clear to me. For example, PCR analysis of a gene on the X-chromosome shows products for exons
1-3, and no products are detected for exons 4-14 (exon 14 is the last exon of the
gene). Since PCR does not work with one primer, we are not sure whether exons 4 and 14 are completely absent, or
only partially. Therefore, using the first base of exon 4 with '-?' (see
Recommendations) could be wrong, as could the last base of exon 14 with '+?'. Therefore, I would like to use the last base of exon
3 with '+?' and the last base of exon 13 with '+?'. What are your recommendations?
Answer
Strictly speaking you are right, and it is best to set the borders as precisely as possible. So when exon 3 is present, the location of the reverse primer can in fact be used to set the most 5' border (and the same holds for the exon 14 primer). Consequently the description could be
something like c.(87+123_88-?)_(923+?_924-98)del. Although this is precise, one might wonder whether such a description is attractive; c.88-?_923+?del is as
clear (see Uncertainties).
Please note that, for simplicity, other descriptions are not fully correct either. For example, stop codons are reported as p.Cys123X, while one could argue that p.Cys123_Met2376del is more precise (Met2376 being the last amino acid of the protein).
Question
How should I describe the change TGT GC CA to TGT TG CA?
Can I call it a dinucleotide mutation or is it a deletion/insertion mutation?
Answer
Simply describe it as c.4_5delinsTG (alternatively it can be described c.[4G>T;
5C>G]). Although c.4_5GC>TG is clear and unequivocal, the description
as a deletion/insertion follows the general recommendations more precisely (see
Recommendations). Unless
c.4G>T or c.5C>G have been reported as allelic variants, we should assume the change
occurred as the consequence of one mutational event and it can be described as a
dinucleotide variant.
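As a rough illustration of this answer, the sketch below compares two equal-length sequences and reports either a single substitution or a deletion/insertion spanning the changed positions. The function name is hypothetical, and the sequence is assumed to start at nucleotide c.1, as in the question.

```python
def describe_equal_length_change(ref: str, alt: str) -> str:
    """Describe an equal-length change as a substitution or a deletion/insertion."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length sequences")
    # 0-based indices of every position where the two sequences differ.
    diffs = [i for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
    if not diffs:
        return "c.="  # '=' is the character used to indicate no change
    first, last = diffs[0], diffs[-1]
    if first == last:
        return f"c.{first + 1}{ref[first]}>{alt[first]}"
    # Several changed positions: one delins covering the whole changed span.
    return f"c.{first + 1}_{last + 1}delins{alt[first:last + 1]}"


print(describe_equal_length_change("TGTGCCA", "TGTTGCA"))  # c.4_5delinsTG
```

The sketch treats all differences as one mutational event; it does not check whether the individual changes have been reported as separate allelic variants.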
Question (Ron Agatep, Toronto, Canada)
Several groups have identified a duplication in the CDKN2A locus that has been labeled
in various ways. The mutation is a duplication of the first 24 bp.
The ATG translation initiation codon is underlined (translational start). One group has described the mutation as 23ins24; is this correct? My interpretation of your recent paper suggests I should name it 1_24dup. Could you provide me with the correct nomenclature?
Answer
The correct description is c.9_32dup (p.Ala4_Pro11dup). The description c.1_24dup (p.Met1_Ser8dup)
seems correct, but please note
that for all descriptions the most 3' position
possible should be arbitrarily assigned to have been changed (see
Recommendations). c.23ins24 is not correct: first,
because the position of the insertion is not clear (see
Discussion);
second, because 'ins24' does not indicate which sequence was inserted.
Question
How should I describe a change where ATCG-ATCGATCGATCG-A-GGGTCCC becomes
ATCG-ATCGATCGATCG-A-ATCGATCGATCG-GGGTCCC? The fact that the inserted
sequence (ATCGATCGATCG) is present in the original sequence suggests it
derives from a duplicative event.
Answer
A correct description of the insertion is c.17_18ins5_16 (see
Recommendations). A description using 'dup' might cause confusion since
the rule is that the duplicated region is indicated before the
word "dup" and not after it (like in c.17_18dup5_16).
Still, the description
given makes it clear that the sequence inserted between nucleotides c.17 and c.18 is probably derived from nearby, i.e. position c.5_16, and thus likely
derived from a duplicative event.
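The description above can be reproduced with a short sketch that looks for the inserted sequence in the reference, 5' of the insertion site. The function name is hypothetical, and preferring the copy that ends closest to the insertion site is an assumption made here to match the example.

```python
def describe_insertion(ref: str, after_pos: int, inserted: str) -> str:
    """Describe an insertion between after_pos and after_pos + 1 (1-based)."""
    # Find the copy of the inserted sequence that ends closest to the
    # insertion site, searching only the reference 5' of that site.
    idx = ref.rfind(inserted, 0, after_pos)
    site = f"c.{after_pos}_{after_pos + 1}ins"
    if idx == -1:
        return site + inserted  # no upstream copy: give the sequence itself
    return f"{site}{idx + 1}_{idx + len(inserted)}"


# Reference from the question, with the dashes removed.
ref = "ATCGATCGATCGATCGAGGGTCCC"
print(describe_insertion(ref, 17, "ATCGATCGATCG"))  # c.17_18ins5_16
```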
Question
The 3' end of intron 8 of the CFTR gene contains a variable sequence; IVS8(TG)mTn. The
CFTR genomic reference sequence of the end of intron 8 is ...TGTGTGTGTGTTTTTTTAACAG[exon9], with a tract of
(TG)11 and T7. When we describe this sequence variation as c.1210-14(TG)9-13(T)5-9 and that of the
IVS8Tn as c.1210-6(T)5-9, are we right? Is the description of a T5 tract variant
as c.1210-14(TG)12T5 correct?
Answer
A difficult case; please note that following current recommendations it is
not a TG11 but a GT11 variant, overlapping one T-nucleotide with the T7
stretch. However, to prevent confusion it is probably best to use in this
exceptional case TG11.
The correct description depends on the reference sequence used. Assuming this
reference sequence is as described, i.e. TG11 followed by T7, the TG11 stretch
is located at c.1210-34_1210-13 and T7 stretch at c.1210-12_1210-6. A correct description of the variants is
then c.1210-34GT(9_13)T(4_8); c.1210-34 because the variable tract starts at that
position. When only the T stretch is described the correct notation is
c.1210-12T(5_9). A correct description of the T5 variant is c.1210-34GT(9_13)T5.
NOTE: to indicate the range, "_" must be used and not "-".
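The tract positions quoted in this answer follow from counting back from the first nucleotide of exon 9. The sketch below redoes that arithmetic; the function name is hypothetical, and the layout (TG11, T7, then AACAG directly before the exon) is taken from the reference sequence as described in the question.

```python
def tract_coordinates(exon_start: int, tracts: list[tuple[str, int]]) -> dict[str, str]:
    """Map each tract at the 3' end of an intron to its c. coordinates.

    tracts lists (repeat unit, copy number) from 5' to 3'; the last tract
    ends at the last nucleotide of the intron (exon_start-1 in c. notation).
    """
    coords = {}
    offset = 0  # intronic nucleotides already assigned, counted from the exon
    for unit, copies in reversed(tracts):
        length = len(unit) * copies
        first, last = offset + length, offset + 1
        coords[f"{unit}{copies}"] = f"c.{exon_start}-{first}_{exon_start}-{last}"
        offset += length
    return coords


layout = [("TG", 11), ("T", 7), ("AACAG", 1)]
for name, position in tract_coordinates(1210, layout).items():
    print(name, position)
```

With this layout the T7 stretch comes out at c.1210-12_1210-6 and the TG11 stretch at c.1210-34_1210-13, as stated in the answer.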
Question
Is the description NM_012345.3:c.123+45_123+51TSDinsL1.603bp
acceptable (TSD = target site duplication, L1 indicates the nature of the insert (L1,
Alu or SVA) after "ins"; 603bp = the number of inserted base pairs)?
Answer
Following the current recommendations the description should be NM_012345.3:c.123+45_123+51dupinsAB012345.3:g.393_1295
(alternatively NM_012345.3:c.123+45_123+51dupins603). So
use "dup" (not "TSD") and leave out "bp" (not
necessary). The insertion itself is described as AB012345.3:g.393_1295,
indicating that the inserted sequences are nucleotides 393 to 1295 from GenBank
file AB012345.3. Adding "(L1)" in the description to indicate the nature of the
inserted sequence is not recommended, since it might cause confusion. The
"Remarks" column of the summary sequence variant table can be used for
this annotation.
Question
How should we, using the most current recommendations, indicate a change in
one allele? The notation we envisage should indicate that the other allele has no
change compared to the reference sequence. For the unchanged allele "[?]"
would not be appropriate since it is not the case that allele 2 has an unknown
variant;
it simply has no change. The notation "c.[76A>C]" without describing the
second allele would be misleading; not enough researchers would be familiar enough with
the nomenclature to know that this refers to only one of the two alleles present. Would
the description "c.[76A>C]+[]" be OK?
Answer
The character used to indicate 'no change' is the '=' (see
Recommendations).
The recommended description is thus "c.[76A>C]+[=]".
Question (Andrew Grimm, Coordinator RettBASE)
When I come across cases where a person has two variants and it isn't known whether or not they are on the same
chromosome, how should I describe this?
Answer
Although we do not recommend describing uncertainties, in this
case a clear recommendation is required to prevent mistakes. Two changes in one allele should
be described as c.[76A>C; 91C>G] and two changes on different alleles as
c.[76A>C]+[91C>G]. When it is not clear whether the
changes are on the same or on different alleles the recommendation
is to describe this using the format c.[76A>C (+) 91C>G] (see
Recommendations).
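The allele formats above, together with the '=' notation for an unchanged allele, can be summarised in a small sketch. The function name and its parameters are hypothetical; only the output formats come from the answers on this page.

```python
from typing import Optional


def describe_alleles(first: list[str], second: Optional[list[str]] = None,
                     phase_known: bool = True) -> str:
    """Combine variant descriptions (without the 'c.' prefix) per allele.

    first / second hold the changes on each allele. An empty second list means
    the second allele is unchanged; None means it is not described at all.
    With phase_known=False all changes are listed with '(+)' in between.
    """
    if not phase_known:
        return "c.[" + " (+) ".join(first + (second or [])) + "]"
    description = "c.[" + "; ".join(first) + "]"
    if second is not None:
        description += "+[" + ("; ".join(second) or "=") + "]"
    return description


print(describe_alleles(["76A>C", "91C>G"]))                     # same allele
print(describe_alleles(["76A>C"], ["91C>G"]))                   # different alleles
print(describe_alleles(["76A>C", "91C>G"], phase_known=False))  # phase unknown
print(describe_alleles(["76A>C"], []))                          # second allele unchanged
```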
Question (Nancy Carson, Ottawa, Canada)
The recommendations for mutation nomenclature give guidelines on the proper
nomenclature for recessive diseases where there are two mutations identified in
one gene. I have a patient with hearing loss who has a mutation in GJB2
(c.35delG) and a mutation in GJB6 (c.689_690insT). Any suggestions on how I
should write this?
Answer
The recommendation is to use the format GJB2:c.[35delG]+GJB6:c.[689_690insT]
(see Discussion). This format prevents
confusion regarding the reference sequence used (i.e. "GJB2:")
and combines
this with the normal format to describe variants in recessive diseases (format
c.[76C>T]+[87G>A]). Using the format given, it is of course still
essential to describe the reference sequence used (GenBank file with version
number). Another format, which addresses this directly, is to describe the variants
as NM_004004.2:c.[35delG]+NM_006783.1:c.[689_690insT], i.e. using the GenBank
reference sequences instead of the HGNC Gene Symbol.
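Both formats in this answer prefix each variant with its own reference. A minimal sketch, with a hypothetical function name:

```python
def describe_across_genes(variants: list[tuple[str, str]]) -> str:
    """Join (reference, change) pairs, each written with its own reference prefix."""
    return "+".join(f"{reference}:c.[{change}]" for reference, change in variants)


# Using HGNC gene symbols; the reference sequences must then be stated elsewhere.
print(describe_across_genes([("GJB2", "35delG"), ("GJB6", "689_690insT")]))
# Using GenBank accession.version numbers directly.
print(describe_across_genes([("NM_004004.2", "35delG"), ("NM_006783.1", "689_690insT")]))
```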
Question
Detailed analysis of a DMD patient showed that it was a mosaic case;
consequently two different nucleotides were found at one position, a G and a C
(a G is the normal sequence). How should I describe this?
Answer
Mosaic cases, i.e. two different nucleotides found at one position
on one chromosome, should be described as c.[=, 83G>C] (see
General recommendations). This description is similar to that for two
transcripts deriving from one chromosome (see
RNA recommendations).
Question (Harriet Meyer, JAMA-archives.org)
The subject of promoter polymorphisms has come up, and I would be grateful for your recommendation of how
these should be described.
Answer
For variants in the promoter region it is recommended to describe these in relation to a
genomic reference sequence (like L01538.1:g.1407C>T). Describing a promoter variant in relation to a
coding DNA reference sequence is possible and should be in relation to the A of the ATG initiation codon,
counting backwards to the variant nucleotide (in the example given, c.-401C>T indicates a change
of the C 401 nucleotides upstream of the ATG, in the promoter, to a T). To be unequivocal, next to the
coding DNA reference sequence (to identify the A of the
ATG) one should also mention the genomic reference sequence used (to identify the
C at -401) or include upstream sequences in the coding DNA reference sequence
(see Discussion). This makes the description rather complex, since one has to retrieve two sequence files. Consequently, it is much easier to describe the variant directly in relation to the genomic reference sequence.
A format which one could use is "L01538.1:g.1407C>T (at -401 of
the ATG)".
Please note that it is not correct to provide descriptions in relation to the start site of the mRNA.
There is often a debate as to where the RNA exactly starts and one should not describe DNA variants in relation to such a 'variable'
site. Of course it is acceptable that the authors mention, between brackets, the approximate position of the change in relation to the promoter.
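The relation between the genomic and the coding DNA position of a promoter variant can be sketched as follows. The function name is hypothetical, and the position of the A of the ATG (g.1808) is inferred here from the two numbers in the example (g.1407 and c.-401); it is not read from the L01538.1 record. The sketch also assumes the gene lies on the sense strand of the genomic record, with no intron between the variant and the ATG.

```python
def genomic_to_coding(g_pos: int, atg_a_g_pos: int) -> int:
    """Convert a genomic position to a coding DNA position relative to the ATG.

    There is no position 0: the A of the ATG is c.1 and the nucleotide
    directly 5' of it is c.-1.
    """
    offset = g_pos - atg_a_g_pos
    return offset if offset < 0 else offset + 1


ATG_A = 1808  # assumption derived from the example, see the note above
print(f"c.{genomic_to_coding(1407, ATG_A)}C>T")  # c.-401C>T
```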
Question
How should a mutation in the 5'UTR be described that gives rise to a new translation
initiation site ?
Answer
Description at the DNA-level should be e.g. c.-23A>T (changing -25 caGggt
-19 to caTggt, creating a new ATG-triplet). Description at
the RNA-level should be like r.-23a>u and description at the protein level could be
like p.Met1extMet-8 (or p.M1extM-8, see
Recommendation protein level). This indicates that due to a variant the protein
sequence becomes extended N-terminally by the addition of 8 new amino acids. Note that descriptions on RNA and protein level
should only be given when this was experimentally verified; if not, changes should be
placed between brackets to indicate that it is a prediction only.
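The number of added amino acids follows from the distance between the new ATG and the normal start codon. In the sketch below the function name is hypothetical; it assumes that the A of the new ATG lies at c.-24 (one nucleotide 5' of the changed position in the example) and that no stop codon lies between the new and the original ATG.

```python
def n_terminal_extension(new_atg_a_pos: int) -> str:
    """Describe the N-terminal extension caused by a new upstream start codon.

    new_atg_a_pos is the (negative) c. position of the A of the new ATG.
    """
    if new_atg_a_pos >= 0:
        raise ValueError("expected a position in the 5'UTR (negative)")
    distance = -new_atg_a_pos
    if distance % 3 != 0:
        raise ValueError("new ATG is out of frame with the normal start codon")
    # Each upstream codon adds one amino acid in front of Met1.
    return f"p.Met1extMet-{distance // 3}"


print(n_terminal_extension(-24))  # p.Met1extMet-8
```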
Question (Dean J. Danner, Atlanta, USA)
We are characterizing mutations in nuclear encoded proteins that function in mitochondria. The problem is in proteins that have amino terminal
mitochondrial signal peptides. The current rules for proteins say to start numbering with the initiating methionine. However, the functional protein
has this target peptide removed and therefore many investigators begin numbering at the amino acid residue of the
mature protein. Mutations that result in changes in the targeting peptide suggest that numbering should
begin with the Met-1. An alternative would be to give the targeting peptide negative numbers as in the nucleic acids upstream of the transcriptional
start site. It would be helpful to have some rules for consistency in the field.
Answer
As already suggested in your question, protein reference sequences should always
represent the complete primary translation product, not a
processed mature or functional protein (see
Recommendations).
Question (Sven Arnold, Austria)
There are several examples you give where changes affecting a series of amino acids are described using the most 3' amino acid. Does this
also apply when it is known exactly which amino acid is affected? Example; the sequence
ATGTCAAGCTCT codes for MetSerSerSer. An insertion of AGT (c.9_10insAGT) gives
ATGTCAAGCAGTTCT, coding for MetSerSerSerSer. Looking at the protein sequence you would
describe the change as p.Ser3dup. Knowing the nucleotide changes, it would be accurately described as p.Ser2_Ser3insSer. My question is, do
we describe the protein change as it appears, or do we try and describe it according to the (known) underlying DNA change?
Answer
Descriptions at protein level should describe the changes
observed on protein level - one should not try to incorporate knowledge regarding the
change at DNA-level (see Recommendations). As
a consequence, the amino acid change described may be caused by a change which at
DNA level lies several
nucleotides upstream, like in the example you give. Another example is that
where a frame shift deletion at DNA level does
not immediately affect the protein sequence.
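The following sketch shows why a protein-level description cannot carry information about the underlying DNA change. It translates the sequence from the question and two different insertions that give the same protein. The codon table is the standard genetic code; the second insertion (c.3_4insTCA) is a hypothetical example added here for comparison.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


reference = "ATGTCAAGCTCT"
variant_a = reference[:9] + "AGT" + reference[9:]  # c.9_10insAGT, from the question
variant_b = reference[:3] + "TCA" + reference[3:]  # c.3_4insTCA, hypothetical

print(translate(reference))                         # MSSS
print(translate(variant_a), translate(variant_b))   # MSSSS MSSSS
```

Both DNA changes give MetSerSerSerSer, so at protein level they can only receive the same description.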
Copyright © HGVS 2007 All Rights Reserved