Format of gene names
Candida gene names should follow the format established
for S. cerevisiae gene names. This format is described in
detail in a guide to S. cerevisiae nomenclature,
published in Trends in
Genetics (TIG) (download
pdf file). The gene name should consist of three letters (the gene
symbol) followed by an integer (e.g. ADE12). Dominant alleles of the
gene (most often wild-type) are denoted by all uppercase letters, while
recessive alleles are denoted by all lowercase letters.
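As a rough illustration of the standard format described above, the following Python sketch checks whether a name consists of three letters followed by an integer, and reports whether its case suggests a dominant or recessive allele. The function name and the rule that the integer has no leading zeros are assumptions made for this example; this is not a CGD tool.

```python
import re

# Assumed pattern: exactly three letters followed by a positive integer
# without leading zeros (the leading-zero rule is an assumption).
STANDARD_NAME = re.compile(r"([A-Za-z]{3})([1-9][0-9]*)")


def parse_standard_name(name):
    """Return (symbol, number, allele_case), or None for non-standard names."""
    match = STANDARD_NAME.fullmatch(name)
    if match is None:
        return None
    symbol, number = match.group(1), int(match.group(2))
    if symbol.isupper():
        allele_case = "dominant"   # all uppercase letters
    elif symbol.islower():
        allele_case = "recessive"  # all lowercase letters
    else:
        allele_case = "mixed"      # not covered by the convention
    return symbol.upper(), number, allele_case
```

With this sketch, ADE12 and ade12 are both accepted as standard-format names, while historical names such as WH11, MTLA1, or ADE5,7 are reported as non-standard.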
The 3-letter gene symbol should stand for a description of
a phenotype, gene product, or gene function. In addition, it is strongly
preferable that a given gene symbol have only one associated description: all genes that use a given 3-letter symbol should have a
related phenotype, gene product, or gene function, and a 3-letter
symbol should have the same meaning for S. cerevisiae and
Candida genes. Where Candida and S. cerevisiae genes appear to
be orthologous, it is preferable that they share the same gene name. Where Candida and S. cerevisiae genes are similar but do not have the same function in both species, it is preferable that the genes do NOT share a name; rather, the assigned gene name should reflect the function of the gene.
There are some gene names with non-standard gene format that are
currently in use in CGD. Many of these gene names are
historical, and are well-recognized within the research community
(e.g., C. albicans WH11; OP4; MTLA1; ADE5,7 or C. glabrata
UPC2A; MT-II; MT-IIB). Some other genes acquired a non-standard name
when the name was used in a publication describing a large-scale
experiment (e.g., C. albicans FESUR1, CAM1-1).
Going forward, it is preferable that newly named genes use standard
format whenever possible. New, nonstandard gene names will be added
to CGD as aliases, but not as Standard Names. (Exceptions may be made in
cases where the S. cerevisiae ortholog has a nonstandard-format
Standard Name in SGD for
historical reasons.)
Species prefixes (e.g., "Ca" or "Cg") used in front of a gene name are not part of the
true gene name. The use of prefixes adds clarity to papers discussing
genes from different species that share a name (e.g., CaURA3
vs. ScURA3), but the gene names themselves do not include the
prefix.
Choosing a gene name
Before deciding on a gene name,
search SGD
Gene Names for any gene name beginning with the 3-letter
symbol, by entering the 3-letter symbol followed by an asterisk,
e.g. "ADE*", in the query box.
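The effect of a trailing-asterisk query can be pictured with a small Python sketch that applies the same kind of wildcard to a local list of names. The function name and the example list are hypothetical; the sketch does not contact SGD or reproduce its search engine.

```python
from fnmatch import fnmatchcase


def names_sharing_symbol(symbol, known_names):
    """Return the names in known_names that begin with the given symbol."""
    pattern = symbol.upper() + "*"  # e.g. "ADE" becomes "ADE*"
    return [name for name in known_names if fnmatchcase(name.upper(), pattern)]
```

For a hypothetical list containing ADE1, ADE12, ADH1, and URA3, the query symbol ADE would return ADE1 and ADE12, showing which numbers are already taken for that symbol.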
Changing a standard gene name
The first published name for a gene is typically used as its standard name; however, gene names may be changed if there is consensus among the groups who study the gene. CGD is happy to facilitate this process. To initiate a gene name change, please contact the CGD curators.
At CGD, we curate gene names that have appeared in the
published literature; we do not assign names for protein-coding genes ourselves. CGD
collects all published names for each gene; any names in addition to
the standard gene name are present in the database as searchable gene
aliases. Gene names or locus tags that appear only in GenBank may be
used as aliases in CGD; they are not used as standard gene names
unless they appear in the published literature.
For C. albicans, CGD also includes the gene identifiers assigned during Assembly 4 and Assembly 6, as well as the IPF and CA identifiers from CandidaDB (d'Enfert et al., 2005). Unpublished gene names that were assigned by CandidaDB based on homology are included as aliases in CGD. The Suggested Names assigned by the Annotation Working Group are adopted by CGD only upon publication of these names in the scientific literature.
CGD has implemented a gene name reservation system. Reservation of a gene name prior to publication allows other groups to begin using the name as soon as possible, and reduces the likelihood that a gene will acquire multiple distinct names that are used in the published literature. Please use the CGD Gene Registry to reserve new gene names.
Format of C. albicans names
Systematic names introduced with Assembly 22 follow a new positionally based systematic nomenclature for chromosomal features.
The new systematic name is based on the known chromosomal location and haplotype, and it consists of the chromosome (C1-C7 or CR),
a unique number indicating the order of features along chromosomes, the strand (W for Watson or C for Crick) and the haplotype (A or B).
For example, C4_03570W_A denotes a feature located on chromosome 4, Watson strand and haplotype A.
Feature numbers start at the left end of the chromosome and increase by 10 to allow for adding new features in the intervening spaces as they are discovered.
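The components of an Assembly 22 systematic name can be separated with a short Python sketch such as the one below. The class and function names are illustrative, and the regular expression accepts any number of digits for the feature number, which is an assumption based on the single example given above.

```python
import re
from typing import NamedTuple, Optional


class A22Name(NamedTuple):
    chromosome: str      # "1" to "7", or "R"
    feature_number: int  # order of the feature along the chromosome
    strand: str          # "Watson" or "Crick"
    haplotype: str       # "A" or "B"


# Assumed layout: C<chromosome>_<number><strand>_<haplotype>
A22_PATTERN = re.compile(r"C([1-7R])_([0-9]+)([WC])_([AB])")


def parse_a22_name(name: str) -> Optional[A22Name]:
    """Split an Assembly 22 name into its parts, or return None."""
    match = A22_PATTERN.fullmatch(name)
    if match is None:
        return None
    chromosome, number, strand, haplotype = match.groups()
    strand_name = "Watson" if strand == "W" else "Crick"
    return A22Name(chromosome, int(number), strand_name, haplotype)
```

Applied to C4_03570W_A, the sketch yields chromosome 4, feature number 3570, Watson strand, and haplotype A.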
Systematic names used in previous assemblies were the "orf19.#" names (where "#" is an integer) assigned to open
reading frames identified in Assembly 19 of the genome sequence. The
Annotation
Working Group has assigned orf19 identifiers to some open reading
frames that were not part of the original assembly (described in Braun
et al., 2005). New orfs have been assigned names of the format
"orf19.#.n", where "orf19.#" corresponds to the identifier of the
upstream orf19, and "n" is an integer. For example, orf19.5006.1 is
located on Contig19-10216 between orf19.5006 and orf19.5007.
Please note that names of the format "orf19.#" are expressed in a
slightly different format in the locus_tag field of the GenBank
records that are associated with release of Assembly 19. The locus_tag "CaO19.#" is equivalent to the systematic name "orf19.#" (i.e., orf19.5197 and CaO19.5197 refer to the same ORF). In order to facilitate searching for alternative aliases, regardless of their format, the "CaO19.#" identifiers are included in CGD, in addition to the "orf19.#" names.
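The equivalence between the two spellings can be expressed as a simple conversion, sketched here in Python. The function names are illustrative, and the handling of a ".n" suffix on a "CaO19" locus_tag is an assumption, since only the plain "CaO19.#" form is described above.

```python
import re

# Accepts "orf19.#", "orf19.#.n", and the GenBank spelling "CaO19.#".
# Allowing ".n" after a CaO19 tag is an assumption for this sketch.
ORF19_PATTERN = re.compile(r"(?:orf19|CaO19)\.([0-9]+)(?:\.([0-9]+))?")


def normalize_orf19(identifier):
    """Return the identifier in "orf19" form, or None if it does not match."""
    match = ORF19_PATTERN.fullmatch(identifier)
    if match is None:
        return None
    base, inserted = match.groups()
    name = "orf19." + base
    if inserted is not None:
        name += "." + inserted  # ORF added after the original assembly
    return name


def to_locus_tag(identifier):
    """Return the identifier in GenBank "CaO19" form, or None."""
    name = normalize_orf19(identifier)
    if name is None:
        return None
    return "CaO19" + name[len("orf19"):]
```

With this sketch, CaO19.5197 normalizes to orf19.5197, and orf19.5197 converts back to CaO19.5197.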
C. albicans Assembly 20 and Assembly 21 continued to use the orf19 names and they persist in the literature. To facilitate seamless transition between the two nomenclature systems, the former orf19 systematic identifiers are fully searchable and prominently displayed on the Locus Summary pages. The mapping between all orf19 and Assembly 22 identifiers is also available for download here.
Format of systematic tRNA names
The format of C. albicans systematic tRNA names is identical to that of standard tRNA names, described below.
IPF identifiers
C. albicans gene identifiers of the form "IPF#.n" have been assigned at CandidaDB, where IPF stands for "Individual Protein File," "#" is an integer, and "n" is a version number or an informational tag (described in d'Enfert et al., 2005). CGD currently includes the IPF names that were archived in the Annotation Working Group's annotation file as of February 22nd, 2005, some IPF names that CGD curators gathered from the published literature, as well as IPF names retrieved directly from CandidaDB. Where IPF identifiers have been assigned both to an orf and also to its allele, CGD includes both IPF identifiers as searchable aliases on the Locus page.
A cautionary note about suffixes appended to gene names
Please note that the numerical suffix has a different meaning in the context of orf19 and IPF names: the orf19 suffix denotes that the orf is distinct, whereas the IPF suffix serves either as a version number or as a tag that conveys information about sequence homology. For example, orf19.5006.1 is not the same as orf19.5006. In contrast, the IPF identifiers IPF22272 and IPF22272.1 refer to the same gene, and the ".1" suffix indicates that no change has been made to this record since Assembly 5. A suffix of ".2" or ".3" appended to an IPF identifier indicates that there have been one or two changes, respectively, between Assembly 5 and Assembly 19.
In the context of some gene names used at CandidaDB, suffixes serve as informational tags. These suffixes indicate that the gene appears to be a 5' or 3' gene fragment, either with or without an adjacent corresponding 3' or 5' fragment in the published Assembly 19, and note whether the fragment is located at the end of a contig. For example, IPF13383.5eoc has similarity to the 5' end of a related gene, and this ORF is also located at the end of a contig. Please see Braun et al., 2005, and d'Enfert et al., 2005, for additional explanation.
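The difference in suffix meaning matters when identifiers are compared programmatically. The Python sketch below treats the IPF suffix as ignorable when deciding whether two identifiers point to the same record, while comparing all other identifiers, including orf19 names, in full. The function names are illustrative, and treating informational tags such as "5eoc" the same way as version numbers is a simplifying assumption.

```python
import re

# "IPF#" optionally followed by ".n", where n is a version number or a tag.
IPF_PATTERN = re.compile(r"IPF([0-9]+)(?:\.(\w+))?")


def ipf_gene_key(identifier):
    """Return the IPF identifier without its suffix, or None if not an IPF."""
    match = IPF_PATTERN.fullmatch(identifier)
    if match is None:
        return None
    return "IPF" + match.group(1)


def same_feature(first, second):
    """Compare two identifiers, ignoring the suffix only for IPF names."""
    key_first, key_second = ipf_gene_key(first), ipf_gene_key(second)
    if key_first is not None and key_second is not None:
        return key_first == key_second
    # For orf19 names the suffix marks a distinct ORF, so compare in full.
    return first == second
```

Under these assumptions, IPF22272 and IPF22272.1 compare as the same record, whereas orf19.5006 and orf19.5006.1 do not.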
Aliases from Assemblies 4 and 6
CGD contains gene name aliases from earlier assemblies of the C. albicans genome sequence, Assemblies 4 and 6. The aliases from Assembly 6 have the form "orf6.#" (where "#" is an integer). The aliases from Assembly 4 have the form "Contig4-$$$$.####" (where $$$$ is a numerical identifier for the contig, and #### is a numerical identifier for the ORF within the contig). These aliases appear on the CGD Locus pages. In addition, the complete mappings may be downloaded in tab-delimited text file format from the CGD Download site. The mapping between Assembly 4 identifiers and orf19 names is based on a mapping provided by Judy Berman, with some additional manual curation. The mapping between Assembly 6 identifiers and orf19 names was generated at CGD by BLAST-based comparison of orf19s to orf6s, as described in detail in the README file in the Download directory.
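The two legacy alias formats can be told apart by pattern, as in the following Python sketch. The function name is illustrative, the identifiers used to exercise it are made up, and the numeric parts are kept as strings so that any leading zeros are preserved.

```python
import re

ORF6_PATTERN = re.compile(r"orf6\.([0-9]+)")
CONTIG4_PATTERN = re.compile(r"Contig4-([0-9]+)\.([0-9]+)")


def classify_legacy_alias(alias):
    """Return (assembly, contig, orf) for an Assembly 4 or 6 alias, else None."""
    match = ORF6_PATTERN.fullmatch(alias)
    if match is not None:
        # Assembly 6 aliases carry no contig component.
        return ("Assembly 6", None, match.group(1))
    match = CONTIG4_PATTERN.fullmatch(alias)
    if match is not None:
        return ("Assembly 4", match.group(1), match.group(2))
    return None
```

For example, the made-up alias Contig4-2345.0012 would be reported as an Assembly 4 alias with contig 2345 and ORF 0012.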
Format of C. glabrata gene names
The format of C. glabrata standard gene names is similar to
that of the C. albicans standard gene names.
The format of the C. glabrata systematic gene names comes
directly from the nomenclature used by the sequencing
project, as described by Dujon et al. (2004):
"All annotated genetic elements were designated using a new nomenclature system (http://cbi.labri.fr/Genolevures). Briefly, elements are numbered serially along each sequence contig or scaffold from the left to right of each chromosome using 11 incremental steps (to limit errors and offer the possibility for subsequent insertion of newly recognized elements). The element nomenclature indicates the species (four letters), the project or strain number (one numeral), the chromosome (one letter) followed by the serial number (for example, CAGL0G08492g). The suffix identifies the type of element ('g' stands for any element whose RNA product may be translated by the genetic code; 'r' for elements whose RNA product is not translated; 's' for a cis-acting element; and 'v' for intergenes (intervening))."
Format of systematic tRNA names
C. glabrata systematic tRNA names follow the sequencing project systematic naming conventions, described above.
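The sequencing project nomenclature quoted above can be decomposed with a short Python sketch. The function name and the dictionary keys are illustrative; the pattern follows the quoted description (four-letter species code, one numeral, one chromosome letter, serial number, type suffix) and has been checked only against the example CAGL0G08492g.

```python
import re

# Suffix meanings as given in the quoted description.
ELEMENT_TYPES = {
    "g": "RNA product may be translated",
    "r": "RNA product is not translated",
    "s": "cis-acting element",
    "v": "intergene",
}

SYSTEMATIC_PATTERN = re.compile(r"([A-Z]{4})([0-9])([A-Z])([0-9]+)([grsv])")


def parse_systematic_name(name):
    """Split a systematic name into its components, or return None."""
    match = SYSTEMATIC_PATTERN.fullmatch(name)
    if match is None:
        return None
    species, project, chromosome, serial, suffix = match.groups()
    return {
        "species": species,
        "project": int(project),
        "chromosome": chromosome,
        "serial": int(serial),
        "element_type": ELEMENT_TYPES[suffix],
    }
```

Applied to CAGL0G08492g, the sketch reports species code CAGL, project 0, chromosome G, serial number 8492, and an element whose RNA product may be translated.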
Format of Standard tRNA Names
tRNAs annotated in the sequencing projects were confirmed by CGD using the program tRNAscan-SE.
The eukaryotic model option was used for nuclear tRNAs, and the organellar model was used for mitochondrial tRNAs.
Instances of disagreement between the original annotations and tRNAscan-SE were resolved by alignment with
experimentally verified tRNAs from S. cerevisiae.
CGD uses the following format for standard tRNA names: 't' + encoded amino acid [one-letter code] + (anticodon) + count.
For example, tQ(CUG)2 denotes the second instance of tRNA-glutamine with anticodon 'CUG'.
Mitochondrial tRNAs use the same format, with 'mt' appended: for example, tH(GUG)4mt.
The count for mitochondrial tRNAs is continued from those of nuclear-encoded tRNAs of the same coding type.
To facilitate searching, an alias is provided with each 'U' replaced by 'T' in the anticodon.
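The standard tRNA name format and its 'T' alias can be illustrated with the Python sketch below. The function names are illustrative, and the pattern assumes that the anticodon consists of exactly three of the letters A, C, G, and U.

```python
import re

# 't' + one-letter amino acid + (anticodon) + count + optional 'mt'
TRNA_PATTERN = re.compile(r"t([A-Z])\(([ACGU]{3})\)([0-9]+)(mt)?")


def parse_trna_name(name):
    """Split a standard tRNA name into its parts, or return None."""
    match = TRNA_PATTERN.fullmatch(name)
    if match is None:
        return None
    amino_acid, anticodon, count, mito = match.groups()
    return {
        "amino_acid": amino_acid,
        "anticodon": anticodon,
        "count": int(count),
        "mitochondrial": mito is not None,
    }


def dna_alias(name):
    """Return the alias with each 'U' in the anticodon replaced by 'T'."""
    match = TRNA_PATTERN.fullmatch(name)
    if match is None:
        return None
    amino_acid, anticodon, count, mito = match.groups()
    return f"t{amino_acid}({anticodon.replace('U', 'T')}){count}{mito or ''}"
```

With this sketch, tQ(CUG)2 has the alias tQ(CTG)2, and tH(GUG)4mt is recognized as a mitochondrial tRNA with count 4.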
Please note that the count suffix is arbitrary, and independent among the different species in CGD.
It is used simply to create a unique identifier for each tRNA gene of a given species, and no special
relationship between two tRNAs with the same standard name from different species is implied.
For example, the tH(GUG)1 tRNA gene in C. albicans and the tH(GUG)1 tRNA gene in C. glabrata
are not necessarily expected to be more closely related to each other than to any other tH(GUG) in either genome.
For species-specific, systematic tRNA names, please see the gene-naming section for the particular species, above.
Detailed format of gene, allele, and protein names (C. albicans examples)
Many thanks to Aaron Mitchell for providing this table of C. albicans gene nomenclature examples.