A reference sequence - discussions and FAQs
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.
Discussions on a proper reference sequence have been very lively. In general it can be concluded that all suggestions made have their pros and cons, but no perfect suggestion has been made so far.
Theoretically, a genomic reference sequence is the best choice. By simply numbering nucleotides from 1 to the end of the file no problems occur with complex gene structures like multiple transcription start sites (promoters / 5'-first exons), multiple translation initiation sites (ATG-codons), alternative splicing and the use of different poly-A addition sites (3'-terminal exons).
Although theoretically the best solution, a genomic reference sequence poses some problems in practice:
- for a human reader, a position in a genomic reference sequence carries no useful information about where in the gene a variant lies; a position in a coding DNA reference sequence does.
- when the gene sequence is incomplete (especially when large introns are present) - a genomic sequence can not be used.
- a gene can be very large (over 2.0 Mb) - this makes nucleotide numbering based on a genomic reference sequence rather laborious (e.g. g.1567234_1567235insTG). Furthermore, genomic reference sequences based on GenBank NT_ files become increasingly long (e.g. the CFTR gene in NT_007933.13, >65 Mb) and lose their informative value. Retrieving such files takes a lot of time, even with a fast internet connection, and viewing them is difficult if not impossible.
- genes may contain very large introns with many intronic (length) variants present in the population - it is thus very difficult to give THE genomic reference sequence (see Genomic sequence changes regularly).
- when a genomic reference sequence is taken from a complete genome sequence, e.g. a bacterium or the human X-chromosome, the transcriptional orientation of the gene of interest may be on the reverse strand. This makes description of sequence variants rather complicated, especially when the consequences on RNA and/or protein level have been studied and are to be described; nucleotides on DNA and RNA level are complementary and numbering goes in different directions - a confusing situation that should be prevented.
Consequently, in practice a coding DNA reference sequence is mostly preferred. The most important reason is that a description based on a genomic reference sequence does not reveal the position of the variant relative to the transcript (RNA) and the translation product (protein). With a coding DNA reference sequence a user immediately gets some information on the location of the variant: exonic or intronic, 5' of the ATG or 3' of the stop codon, and the number of the amino acid residue that is affected. A coding DNA reference sequence has problems of its own, however:
- the exact transcriptional start site (cap-site) of a gene has often not been determined and/or its assignment is debated - the first nucleotide can thus not be assigned with certainty. For many genes the same is in fact true for the translation initiation site (ATG-codon).
- a gene may have several transcripts, using different promoters / 5'-first exons, alternatively spliced internal exons, different 3'-terminal exons and polyA-addition sites - a complete reference sequence can thus not be generated (see Alternatively spliced exons - nucleotide numbering).
- the different transcripts may encode different proteins (isoforms) with, when different promoters are used, different N-terminal sequences and even using different reading frames in one or more exons. A proper protein reference sequence can thus not be assigned.
Another problem, both for genomic and coding DNA reference sequences, emerges when different genes (partly) overlap, using the same or the opposite DNA strand. In such cases, which reference sequence should one use to describe the variant; in other words, to which gene should the change be assigned? (see Recommendations).
General recommendations regarding reference sequences to be used for the description of sequence variants.
Although not popular, a genomic reference sequence works best. When a genomic reference sequence is used the following recommendations should be followed;
For complex genes, when on the genomic reference sequence all transcripts are annotated properly, computational tools (like Mutalyzer and the Genomic Mutation Consequence Calculator) can easily predict the consequences of a sequence change on all transcripts and their encoded protein isoforms, even when they derive from overlapping genes.
When a coding DNA reference sequence is used the following recommendations should be followed;
As discussed, genes can be rather complex and the choice of a good coding DNA reference sequence can be very difficult. Below we refer to some examples of how experts have tried to resolve the issue.
Do you have other examples? Please let us know (e-mail to: J.T.den_Dunnen @ LUMC.nl).
Initial recommendations (see e.g. Antonarakis [1998] Hum.Mut. 11: 1-3) suggested two alternative descriptions for variants in intron sequences based on a coding DNA Reference Sequence; the formats c.88+2T>G / c.89-1G>T and c.IVS2+2T>G / c.IVS2-1G>T. Recent opinions have shifted to the use of the first format, i.e. c.88+2T>G and c.89-1G>T.
Reason: from the description c.IVS2+2T>G it is difficult to deduce where the intron lies relative to the coding DNA sequence, and attempts to work this out often run into problems. First, many authors fail to mention the genomic and coding DNA reference sequences that were used as the basis of exon / intron numbering. Second, since gene sequences are often incomplete on first publication, the initial exon / intron structure often turns out to be incomplete and the numbering changes later (see Numbering exons / introns). Consequently, descriptions using the format c.IVS2+2T>G fail the basic criterion of being unequivocal and should thus not be used. Descriptions using the format c.88+2T>G do not suffer from these problems.
NOTE: when intronic variants are described in relation to a coding DNA
Reference Sequence authors should not forget to mention the genomic
Reference Sequence where the intron sequence can be found.
Because one of the basic recommendations is to keep descriptions as short as possible, nucleotide numbering changes from + to - in the middle of an intron (like "c.88+.." to "c.89-.."). Furthermore, when a change in an intron is described as c.88+4356A>G instead of c.89-2A>G, it is not clear that the change lies close to the splice site and thus might be pathogenic. This is immediately clear from the description c.89-2A>G.
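The switch from "+" to "-" numbering halfway through an intron can be illustrated with a minimal Python sketch. The function name `intronic_position` and its parameters are hypothetical, and the intron length of 4357 nucleotides in the usage note is an assumption chosen so that c.88+4356 and c.89-2 denote the same nucleotide. Assigning the central nucleotide of an odd-length intron to the "+" side is likewise an assumption of this sketch, not something stated on this page. Positions 3' of the stop codon (the "*" notation) are not handled.

```python
def intronic_position(last_exon_nt: int, intron_length: int, offset: int) -> str:
    """Return the coding DNA position label for one nucleotide of an intron.

    last_exon_nt  -- c. number of the last nucleotide of the upstream exon (e.g. 88)
    intron_length -- number of nucleotides in the intron
    offset        -- 1-based index into the intron, counted from its 5' end
    """
    if not 1 <= offset <= intron_length:
        raise ValueError("offset lies outside the intron")

    # c. number of the first nucleotide of the downstream exon.
    # There is no nucleotide 0, so -1 is followed directly by 1.
    next_exon_nt = last_exon_nt + 1
    if next_exon_nt == 0:
        next_exon_nt = 1

    # Distance from this nucleotide to the downstream exon.
    distance_to_next = intron_length - offset + 1

    # Use whichever anchor is closer; on a tie (the central nucleotide of an
    # odd-length intron) this sketch assumes the upstream "+" form.
    if offset <= distance_to_next:
        return f"{last_exon_nt}+{offset}"
    return f"{next_exon_nt}-{distance_to_next}"
```

With these assumptions, `intronic_position(88, 4357, 2)` gives `88+2`, whereas `intronic_position(88, 4357, 4356)` gives `89-2` rather than `88+4356`, which keeps the proximity to the splice site visible.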
Question
When description in relation to a Reference Sequence is problematic, could one
specify the change together with 20 bp of flanking sequence on both sides?
Answer
In many cases this would be OK but for recently duplicated genes or genes which contain
repeated segments even giving 20 nucleotides to either side will not be sufficient. Furthermore,
descriptions will become very long. Ultimately, the best method is probably to
include the raw data, i.e. the sequence file itself.
Question
When I retrieve a cDNA sequence from GenBank nucleotide numbering does not
start with +1 at the A of the ATG translation initiation codon.
Answer
True, but such a file can easily be obtained. When you retrieve the sequence from the RefSeq database (i.e. start at EntrezGene, enter the gene symbol or gene name, select the gene of interest, click the mRNA entry) it will be annotated extensively (see Example). Clicking the "CDS" annotation (CoDing Sequence) opens a window where the nucleotide numbering starts with 1 at the A of the ATG translation initiation codon (see Example). To assist those studying or reporting sequence variants, a locus specific database (LSDB, see HGVS - list of LSDBs) usually provides the coding DNA reference sequence with the nucleotide numbering (see Example).
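When only the mRNA file numbering is at hand, the renumbering can also be done by hand from the CDS coordinates. The sketch below is a minimal illustration in Python; the function name `to_coding_position` and its parameters are hypothetical. It assumes that `cds_start` is the mRNA position of the A of the ATG, that `cds_end` is the position of the last nucleotide of the stop codon, and that there is no nucleotide 0, so the nucleotide directly 5' of the ATG is -1.

```python
def to_coding_position(mrna_pos: int, cds_start: int, cds_end: int) -> str:
    """Convert a 1-based mRNA file position to coding DNA ('c.') numbering.

    cds_start -- mRNA position of the A of the ATG translation initiation codon
    cds_end   -- mRNA position of the last nucleotide of the stop codon
    """
    if mrna_pos < 1:
        raise ValueError("mRNA positions are 1-based")
    if cds_start > cds_end:
        raise ValueError("cds_start must not exceed cds_end")

    if mrna_pos < cds_start:
        # 5' of the ATG: -1, -2, ... counting away from the A of the ATG.
        return f"c.{mrna_pos - cds_start}"
    if mrna_pos > cds_end:
        # 3' of the stop codon: *1, *2, ... counting away from the stop codon.
        return f"c.*{mrna_pos - cds_end}"
    # Within the coding sequence: the A of the ATG is nucleotide 1.
    return f"c.{mrna_pos - cds_start + 1}"
```

For a hypothetical transcript with the CDS annotated at positions 201 to 500, `to_coding_position(201, 201, 500)` gives `c.1`, position 196 becomes `c.-5` and position 505 becomes `c.*5`.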
Question
The recommendation on numbering genomic and coding DNA variants based on the first nucleotide
of the initiation codon ATG is workable only if the reference sequence in the
database is
published as a single file. In the case of the gene CDKN2A, its genomic sequence is
stored as multiple files, each containing one exonic sequence and partial intronic
sequences on both ends of the exon. I can use the above recommendation easily to number
variants in exon 1 where the initiation codon is located. The problem is how I should
number variants in exon 2, which is located in another database file.
Answer
If no database file is available that contains the complete genomic sequence, a coding DNA
Reference Sequence, preferably from the RefSeq
database, should be used. Since for many organisms a genome sequence is
freely available, a database curator can easily make a fully annotated file
(genomic and coding DNA) covering the sequence of interest and submit it to the
RefSeq database. This file can then be
used as the reference sequence.
Question (Isabelle Touitou,
Montpellier, FRANCE)
If the first translation ATG is in exon 2, and we find a variant 5' to exon 1, should we include intron 1
in the counting process?
NOTE: based on a coding DNA reference sequence intron 1 is located between
nucleotides -15 and -14.
Answer
Nucleotides in introns 5' of the ATG translation initiation codon (i.e. in the 5'UTR) are numbered like all other intronic nucleotides (see Examples and Figure). In your example, based on a coding DNA reference sequence, an intron is present between nucleotides -15 and -14. The nucleotides of this intron are numbered -15+1, -15+2, -15+3, ...., -14-3, -14-2, -14-1. Consequently, regarding the question, when a coding DNA reference sequence is used, these intronic nucleotides are not counted.
Question
The CBS gene was originally thought to contain 16 exons. Later it was recognised that
exon 15 does not exist, and recently two additional non-translated 5' exons were detected. The current
gene structure therefore includes 17 exons, of which exons 3 to 17 are translated.
Should the exons of a gene be counted from the exon that contains the start codon rather than the beginning of the cDNA? If so, should exons preceding the start codon be counted 0, -1, -2, etc. or should the 0 be skipped? Is there an agreement on how to deal with corrections in exon numbering?
Answer
For the description of sequence changes it does not matter how exons are numbered; exon (and intron) numbers are not used in the descriptions. In fact this is one reason why the recommendation is as it is (see Description of intronic variants). Examples (using a coding DNA reference sequence):
- c.-5G>T: a change 5' of the ATG (in the 5'UTR)
- c.5G>T: a change in the coding sequence (affecting amino acid 2)
- c.256+1G>T: a change in the 5' end of an intron
- c.257-1G>T: a change in the 3' end of an intron
- c.*5G>T: a change 3' of the stop codon (in the 3'UTR)
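The location information carried by such descriptions can be read off mechanically. The following minimal Python sketch does this for the five example formats above; the function name `locate`, the regular expression and the returned labels are all hypothetical and cover only simple positions, not the full nomenclature. The amino acid number is derived from the nucleotide number by counting codons of three nucleotides from the A of the ATG.

```python
import re

# Optional "-" or "*" prefix, a nucleotide number, and an optional intronic offset.
_POSITION = re.compile(r"^c\.(?P<prefix>[-*]?)(?P<pos>\d+)(?P<offset>[+-]\d+)?")


def locate(description: str) -> str:
    """Classify the location of a variant from its coding DNA description."""
    match = _POSITION.match(description)
    if match is None:
        raise ValueError(f"not a recognised coding DNA description: {description!r}")

    prefix = match.group("prefix")
    position = int(match.group("pos"))
    offset = match.group("offset")

    if offset is not None:
        # An offset after the exonic anchor means the nucleotide is intronic.
        side = "5' side" if offset.startswith("+") else "3' side"
        return f"intron ({side})"
    if prefix == "-":
        return "5'UTR"
    if prefix == "*":
        return "3'UTR"
    # Nucleotides 1-3 form codon 1, nucleotides 4-6 codon 2, and so on.
    codon = (position + 2) // 3
    return f"coding, amino acid {codon}"
```

For instance, `locate("c.5G>T")` reports amino acid 2 and `locate("c.257-1G>T")` reports the 3' side of an intron, in both cases without any reference to exon or intron numbers.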
For exon numbering the only logical thing to do is to start with 1 for the first exon,
otherwise eventually problems will emerge. For other numbering schemes only the experts will know the history;
newcomers just blindly assume that the first exon is exon 1.
Consequently, when historic numbering schemes are used, at some point wrong
assumptions will be made and a patient might end up with an erroneous diagnosis.
However, since history will leave its tracks,
it is suggested to always mention changed numbering schemes in Materials and Methods and in all
Figure and Table legends to prevent any further confusion. For tables, even
consider adding an additional column indicating the historic / old exon number.
Question (Alessandra Splendore,
Rio de Janeiro, Brasil)
Recently two previously unidentified exons of the TCOF1 gene were identified, and named 6A and 16A. Exon 6A is present in most of the transcribed
isoforms, exon 16A is included only in minor isoforms. In updating the nomenclature of reported mutations in TCOF1, should I use a sequence that corresponds to the major isoform (with exon 6A, but without 16A) or the sequence that corresponds to the longest ("most complete") isoform?
Answer
This is the eternal problem of changes in the coordinates of a reference sequence.
The best solution is that the TCOF1-community gets together and decides
to use an updated reference sequence representing the most complete
transcript, i.e. including exons 6A and 16A. This updated sequence should be annotated properly, submitted to
the RefSeq database and used from then on.
Question (JM Friedman, Vancouver, CANADA)
We are working on a new locus-specific mutation database for NF1 and NF2, and we have
run into a problem with the standard mutation nomenclature based on the genomic
sequence. The problem is that the canonical genomic sequence (and consequent
numbering) we are using as the basis of the mutation nomenclature has changed
repeatedly since many of the mutations were described, and it is continuing to
change.
If we use the names assigned to the mutations on the basis of the version of the sequence
that was used to name the mutations, they do not map to the proper position in the current
version of the sequence. If we change the names to match the new sequence, they will not
match the published names for these mutations and may need to be changed again the next
time the sequence changes. (Actually, the current version of the NF1 sequence is
annotated on the wrong strand, so all the numbering would be backwards if we used the
annotated strand instead of its complement, which is really the correct one).
The solution to identifying the mutation unequivocally is to provide enough of the
surrounding sequence to permit a unique result on a BLAST search, and we are doing this.
However, this does not solve the problem of naming the mutations. What is your
recommendation for this?
Answer
Indeed the problems you mention make life very hard. In fact, especially with genes containing large introns, there will be no single genomic reference sequence since every copy of the gene will be slightly different (see above). The problem of continuously changing genomic sequences will not be settled soon.
When designating "THE genomic reference sequence" now, one can already foresee future discussions on whether this choice was proper; it will be a "random pick" and might not be the evolutionarily correct choice. The way to go in our eyes is to declare one sequence THE genomic reference sequence (starting several kilobase pairs 5' of the promoter region), annotate it properly, submit it to the RefSeq database and use it from then on. The RefSeq database has NG_ files specifically made for this purpose (see e.g. NG_000004.2). These problems are one of the reasons why, for the LSDBs I curate (i.e. Johan den Dunnen), I prefer a coding DNA Reference Sequence. In that case the ever-changing intronic sequences have only a marginal effect.
Question
We are preparing an annotated set of Hox genes from the zebrafish for publication.
If the coding DNA sequence is not completely known, and only an EST lacking 5' sequence and a genomic sequence covering the EST are available, how do you describe a change in this sequence; do you number it in relation to the EST or the genomic sequence? Furthermore, if there is a mismatch between the genomic and the EST sequence, and you don't know which one is correct, how do you define e.g. whether the genomic sequence has an insertion or the EST has a deletion?
Answer
First, the reference sequence chosen is always assumed to be the correct sequence simply because changes are described in relation to this sequence.
Second, when the EST sequence is incomplete one should describe changes in relation to this sequence like AA010203.2:54_55insG (assuming the reference sequence used is AA010203.2). So do not use a 'c.' or 'g.' prefix, since neither a coding DNA nor a genomic reference sequence is used. However, when a genomic sequence covering this EST is available the recommendation is to use this as a reference sequence.
Question
Making a judgment on what is the "wild type" (wt) nucleotide for some
sequences seems arbitrary at best. How would you suggest that the description be presented for
these?
Answer
Changes are always described in relation to a "reference sequence".
This reference sequence is considered to be the "wild type" sequence
and is expected to be the one present in the database (GenBank). Consequently, reference and wild type sequence can be
different. Note however that everybody has influence on the sequences in the RefSeq
database and thus may request that a variant is changed into the more common allele. However, the debate about what is wild
type can be unsolvable when variants are very common (near 50%) or differ between populations.
Question (M Paalman,
Human Mutation)
How should sequence variants in the mitochondrial DNA (mtDNA) be described?
Answer
The mtDNA genome is rather small, completely sequenced and numbered. According to current recommendations, variants in the mitochondrial DNA should be described in relation to the full mitochondrial DNA sequence, i.e. the genomic reference sequence (GenBank NC_001807.4). Descriptions should be preceded by "m.", like m.8994T>C (see Recommendations).
The mtDNA encodes a range of different proteins. To prevent confusion, changes
at protein level should be described including a reference to the protein
changed, like p.ATP6:Leu156Pro (GenBank NP_536848.1).
Copyright © HGVS 2007 All Rights Reserved