Gene Nomenclature Guide


 

Format of gene names

Candida gene names should follow the format established for S. cerevisiae gene names. This format is described in detail in a guide to S. cerevisiae nomenclature, published in Trends in Genetics (TIG) (download pdf file). The gene name should consist of three letters (the gene symbol) followed by an integer (e.g. ADE12). Dominant alleles of the gene (most often wild-type) are denoted by all uppercase letters, while recessive alleles are denoted by all lowercase letters.
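As an illustration of the standard format described above, the following minimal Python sketch checks whether a name consists of three letters followed by an integer and classifies its case. The function names and the regular expression are assumptions made for this example; they are not part of any CGD tool.

```python
import re

# Three letters (the gene symbol) followed by an integer, e.g. ADE12.
STANDARD_GENE_NAME = re.compile(r"[A-Za-z]{3}[0-9]+")


def is_standard_format(name: str) -> bool:
    """Return True if name is three letters plus an integer, all in one case."""
    if STANDARD_GENE_NAME.fullmatch(name) is None:
        return False
    symbol = name[:3]
    # Mixed-case symbols such as "Ade12" are gene products, not gene names.
    return symbol.isupper() or symbol.islower()


def allele_case(name: str) -> str:
    """Classify a standard-format name by case: uppercase is dominant."""
    if not is_standard_format(name):
        raise ValueError(f"not a standard-format gene name: {name!r}")
    return "dominant" if name[:3].isupper() else "recessive"
```

Under these assumptions, ADE12 and ade12 are accepted, while historical names such as WH11, MTLA1, or ADE5,7 are reported as non-standard.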

The 3-letter gene symbol should stand for a description of a phenotype, gene product, or gene function. In addition, it is strongly preferable that a given gene symbol have only one associated description: all genes that use a given 3-letter symbol should have a related phenotype, gene product, or gene function, and the symbol should have the same meaning for S. cerevisiae and Candida genes. Where Candida and S. cerevisiae genes appear to be orthologous, it is preferable that they share the same gene name. Where Candida and S. cerevisiae genes are similar but do not have the same function in both species, it is preferable that the genes do NOT share a name; rather, the assigned gene name should reflect the function of the gene.

There are some gene names with non-standard format that are currently in use in CGD. Many of these gene names are historical and are well recognized within the research community (e.g., C. albicans WH11; OP4; MTLA1; ADE5,7 or C. glabrata UPC2A; MT-II; MT-IIB). Some other genes acquired a non-standard name when the name was used in a publication describing a large-scale experiment (e.g., C. albicans FESUR1, CAM1-1).

Going forward, it is preferable that newly named genes use standard format whenever possible. New, nonstandard gene names will be added to CGD as aliases, but not as Standard Names. (Exceptions may be made in cases where the S. cerevisiae ortholog has a nonstandard-format Standard Name in SGD for historical reasons.)

Species prefixes (e.g., "Ca" or "Cg") used in front of a gene name are not part of the true gene name. The use of prefixes adds clarity to papers discussing genes from different species that share a name (e.g., CaURA3 vs. ScURA3), but the gene names themselves do not include the prefix.
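When gene labels are collected from papers, the species prefix may need to be separated from the gene name before a lookup. The sketch below is one heuristic way to do this in Python. The prefix table and the function name are assumptions for illustration, and the rule used (a known prefix followed by at least two letters, the first uppercase) is a simplification that will not suit every label.

```python
# Hypothetical prefix table; extend as needed for other species.
SPECIES_PREFIXES = {
    "Ca": "C. albicans",
    "Cg": "C. glabrata",
    "Sc": "S. cerevisiae",
}


def split_species_prefix(label: str):
    """Return (species, gene_name); species is None if no prefix is found."""
    for prefix, species in SPECIES_PREFIXES.items():
        rest = label[len(prefix):]
        # Case-sensitive match, so the gene name CAM1-1 is not read as "Ca" + ...
        # Requiring two leading letters keeps locus tags like CaO19.# intact.
        if label.startswith(prefix) and rest[:2].isalpha() and rest[:1].isupper():
            return species, rest
    return None, label
```

For example, CaURA3 and ScURA3 both yield the gene name URA3, each paired with its species.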

Choosing a gene name

Before deciding on a gene name, search SGD Gene Names for any gene name beginning with the 3-letter symbol, by entering the 3-letter symbol followed by an asterisk, e.g. "ADE*", in the query box.

Changing a standard gene name

The first published name for a gene is typically used as its standard name; however, gene names may be changed if there is consensus among the groups who study the gene. CGD is happy to facilitate this process. To initiate a gene name change please contact the CGD curators.

At CGD, we curate gene names that have appeared in the published literature; we do not assign names for protein-coding genes ourselves. CGD collects all published names for each gene; any names in addition to the standard gene name are present in the database as searchable gene aliases. Gene names or locus tags that appear only in GenBank may be used as aliases in CGD; they are not used as standard gene names unless they appear in the published literature.

For C. albicans, CGD also includes the gene identifiers assigned during Assembly 4 and Assembly 6, as well as the IPF and CA identifiers from CandidaDB (d'Enfert et al., 2005). Unpublished gene names that were assigned by CandidaDB based on homology are included as aliases in CGD. The Suggested Names assigned by the Annotation Working Group are only adopted by CGD upon publication of these names in the scientific literature.

CGD has implemented a gene name reservation system. Reservation of a gene name prior to publication allows other groups to begin using the name as soon as possible, and reduces the likelihood that a gene will acquire multiple distinct names that are used in the published literature. Please use the CGD Gene Registry to reserve new gene names.

Format of C. albicans names

Systematic names introduced with Assembly 22 follow a new positionally based systematic nomenclature for chromosomal features. The new systematic name is based on the known chromosomal location and haplotype, and it consists of the chromosome (C1-C7 or CR), a unique number indicating the order of features along chromosomes, the strand (W for Watson or C for Crick) and the haplotype (A or B). For example, C4_03570W_A denotes a feature located on chromosome 4, Watson strand and haplotype A. Feature numbers start at the left end of the chromosome and increase by 10 to allow for adding new features in the intervening spaces as they are discovered.
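The components of an Assembly 22 systematic name can be pulled apart with a short parser. The following Python sketch follows the description above; the class name, function name, and the assumption that the feature number is any run of digits are ours, not CGD's.

```python
import re
from typing import NamedTuple


class A22Name(NamedTuple):
    chromosome: str      # "C1" to "C7", or "CR"
    feature_number: int  # order of the feature along the chromosome
    strand: str          # "Watson" or "Crick"
    haplotype: str       # "A" or "B"


# Chromosome, feature number, strand letter, haplotype letter.
A22_PATTERN = re.compile(r"C([1-7R])_([0-9]+)([WC])_([AB])")


def parse_a22_name(name: str) -> A22Name:
    """Split an Assembly 22 systematic name into its components."""
    match = A22_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an Assembly 22 systematic name: {name!r}")
    chrom, number, strand, haplotype = match.groups()
    strand_name = "Watson" if strand == "W" else "Crick"
    return A22Name("C" + chrom, int(number), strand_name, haplotype)
```

Applied to C4_03570W_A, this gives chromosome C4, feature number 3570, the Watson strand, and haplotype A.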

Systematic names used in previous assemblies were the "orf19.#" names (where "#" is an integer) assigned to open reading frames identified in Assembly 19 of the genome sequence. The Annotation Working Group has assigned orf19 identifiers to some open reading frames that were not part of the original assembly (described in Braun et al., 2005). New orfs have been assigned names of the format "orf19.#.n", where "orf19.#" corresponds to the identifier of the upstream orf19, and "n" is an integer. For example, orf19.5006.1 is located on Contig19-10216 between orf19.5006 and orf19.5007.

Please note that names of the format "orf19.#" are expressed in a slightly different format in the locus_tag field of the GenBank records that are associated with release of Assembly 19. The locus_tag "CaO19.#" is equivalent to the systematic name "orf19.#" (i.e., orf19.5197 and CaO19.5197 refer to the same ORF). In order to facilitate searching for alternative aliases, regardless of their format, the "CaO19.#" identifiers are included in CGD, in addition to the "orf19.#" names.
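The equivalence between the two spellings can be expressed as a simple conversion. This Python sketch is illustrative only: the function names are hypothetical, and it assumes that an optional ".n" suffix carries over unchanged between the two forms, which the text above does not state explicitly.

```python
import re

# "orf19.#" with an optional ".n" suffix; the same shape is assumed for CaO19.
ORF19_PATTERN = re.compile(r"orf19\.([0-9]+(?:\.[0-9]+)?)")
CAO19_PATTERN = re.compile(r"CaO19\.([0-9]+(?:\.[0-9]+)?)")


def to_locus_tag(systematic_name: str) -> str:
    """Convert an orf19 systematic name to its GenBank locus_tag form."""
    match = ORF19_PATTERN.fullmatch(systematic_name)
    if match is None:
        raise ValueError(f"not an orf19 name: {systematic_name!r}")
    return "CaO19." + match.group(1)


def to_systematic_name(locus_tag: str) -> str:
    """Convert a CaO19 locus_tag back to the orf19 systematic name."""
    match = CAO19_PATTERN.fullmatch(locus_tag)
    if match is None:
        raise ValueError(f"not a CaO19 locus_tag: {locus_tag!r}")
    return "orf19." + match.group(1)
```

With these helpers, orf19.5197 maps to CaO19.5197 and back again.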

C. albicans Assembly 20 and Assembly 21 continued to use the orf19 names and they persist in the literature. To facilitate seamless transition between the two nomenclature systems, the former orf19 systematic identifiers are fully searchable and prominently displayed on the Locus Summary pages. The mapping between all orf19 and Assembly 22 identifiers is also available for download here.

Format of systematic tRNA names

The format of C. albicans systematic tRNA names is identical to that of standard tRNA names, described below.

IPF identifiers

C. albicans gene identifiers of the form "IPF#.n" have been assigned at CandidaDB, where IPF stands for "Individual Protein File," "#" is an integer, and "n" is a version number or an informational tag (described in d'Enfert et al., 2005). CGD currently includes the IPF names that were archived in the Annotation Working Group's annotation file as of February 22nd, 2005, some IPF names that CGD curators gathered from the published literature, as well as IPF names retrieved directly from CandidaDB. Where IPF identifiers have been assigned both to an orf and also to its allele, CGD includes both IPF identifiers as searchable aliases on the Locus page.

A cautionary note about suffixes appended to gene names

Please note that the numerical suffix has a different meaning in the context of orf19 and IPF names; the orf19 suffix denotes that the orf is distinct, whereas the IPF suffix serves either as a version numbering system or a tag that conveys information about sequence homology. For example, orf19.5006.1 is not the same as orf19.5006. In contrast, the IPF identifiers IPF22272 and IPF22272.1 refer to the same gene, and the ".1" suffix indicates that there has been no change made to this record since Assembly 5. A suffix of ".2" or ".3" appended to an IPF identifier indicates that there have been one or two changes, respectively, between Assembly 5 and Assembly 19. In the context of some gene names used at CandidaDB, suffixes serve as informational tags. Suffixes were assigned to indicate that the gene appears to be a 5' or 3' gene fragment, either with or without an adjacent 3' or 5' corresponding fragment in the published Assembly 19, and to note whether the fragment is located at the end of a contig. For example, IPF13383.5eoc has similarity to the 5' end of a related gene, and this ORF is also located at the end of a contig. Please see Braun et al., 2005, and d'Enfert et al., 2005, for additional explanation.
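The difference in suffix meaning matters when identifiers are compared programmatically. The Python sketch below shows one way to derive a comparison key; the function name is hypothetical. It drops only purely numeric IPF version suffixes, and it leaves informational tags such as "5eoc" untouched because the text above does not say how tagged identifiers relate to untagged ones.

```python
import re

IPF_VERSIONED = re.compile(r"(IPF[0-9]+)(?:\.[0-9]+)?")  # IPF22272, IPF22272.1
IPF_TAGGED = re.compile(r"IPF[0-9]+\.[0-9A-Za-z]+")      # e.g. IPF13383.5eoc
ORF19 = re.compile(r"orf19\.[0-9]+(?:\.[0-9]+)?")        # orf19.5006, orf19.5006.1


def gene_key(identifier: str) -> str:
    """Return a key that is equal for identifiers referring to the same gene."""
    match = IPF_VERSIONED.fullmatch(identifier)
    if match is not None:
        # A numeric IPF suffix is only a version number: drop it.
        return match.group(1)
    if IPF_TAGGED.fullmatch(identifier) or ORF19.fullmatch(identifier):
        # orf19 suffixes denote distinct orfs; tagged IPF names are kept as-is.
        return identifier
    raise ValueError(f"unrecognized identifier: {identifier!r}")
```

Under this scheme IPF22272 and IPF22272.1 share a key, while orf19.5006 and orf19.5006.1 do not.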

Aliases from Assemblies 4 and 6

CGD contains gene name aliases from earlier assemblies of the C. albicans genome sequence, Assemblies 4 and 6. The aliases from Assembly 6 have the form "orf6.#" (where "#" is an integer). The aliases from Assembly 4 have the form "Contig4-$$$$.####" (where $$$$ is a numerical identifier for the contig, and #### is a numerical identifier for the ORF within the contig). These aliases appear on the CGD Locus pages. In addition, the complete mappings may be downloaded in tab-delimited text file format from the CGD Download site. The mapping between Assembly 4 identifiers and orf19 names is based on a mapping provided by Judy Berman, with some additional manual curation. The mapping between Assembly 6 identifiers and orf19 names was generated at CGD by BLAST-based comparison of orf19s to orf6s, as described in detail in the README file in the Download directory.
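A small classifier can recognize which earlier assembly an alias comes from. This Python sketch is an assumption-laden illustration: the function name is hypothetical, the patterns accept any number of digits in each numeric field, and the identifiers in the usage note are made up to show the shape of the names.

```python
import re

# One pattern per legacy alias family described above.
ALIAS_PATTERNS = {
    "Assembly 6": re.compile(r"orf6\.([0-9]+)"),
    "Assembly 4": re.compile(r"Contig4-([0-9]+)\.([0-9]+)"),
}


def classify_legacy_alias(alias: str):
    """Return (assembly, numeric_fields) for a legacy alias, or None."""
    for assembly, pattern in ALIAS_PATTERNS.items():
        match = pattern.fullmatch(alias)
        if match is not None:
            # Fields stay as strings so that leading zeros are preserved.
            return assembly, match.groups()
    return None
```

For instance, a hypothetical alias Contig4-2345.0012 would be reported as an Assembly 4 name with contig field "2345" and ORF field "0012".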

Format of C. glabrata gene names

The format of C. glabrata standard gene names is similar to that of the C. albicans standard gene names.

The format of the C. glabrata systematic gene names comes directly from the nomenclature used by the sequencing project, as described by Dujon et al. (2004):

"All annotated genetic elements were designated using a new nomenclature system (http://cbi.labri.fr/Genolevures). Briefly, elements are numbered serially along each sequence contig or scaffold from the left to right of each chromosome using 11 incremental steps (to limit errors and offer the possibility for subsequent insertion of newly recognized elements). The element nomenclature indicates the species (four letters), the project or strain number (one numeral), the chromosome (one letter) followed by the serial number (for example, CAGL0G08492g). The suffix identifies the type of element ('g' stands for any element whose RNA product may be translated by the genetic code; 'r' for elements whose RNA product is not translated; 's' for a cis-acting element; and 'v' for intergenes (intervening))."

Format of systematic tRNA names

C. glabrata systematic tRNA names follow the sequencing project systematic naming conventions, described above.
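The element nomenclature quoted from Dujon et al. (2004) can be decomposed field by field. The Python sketch below is an illustration based only on that quoted description; the class name, function name, and the short labels for each suffix are our own wording.

```python
import re
from typing import NamedTuple

# Short paraphrases of the suffix meanings quoted above.
ELEMENT_TYPES = {
    "g": "RNA product may be translated",
    "r": "RNA product is not translated",
    "s": "cis-acting element",
    "v": "intergene",
}


class ElementName(NamedTuple):
    species: str       # four letters, e.g. "CAGL"
    project: int       # project or strain number (one numeral)
    chromosome: str    # one letter
    serial: int        # serial number along the chromosome
    element_type: str  # meaning of the one-letter suffix


ELEMENT_PATTERN = re.compile(r"([A-Z]{4})([0-9])([A-Z])([0-9]+)([grsv])")


def parse_element_name(name: str) -> ElementName:
    """Split a systematic element name such as CAGL0G08492g into fields."""
    match = ELEMENT_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a systematic element name: {name!r}")
    species, project, chromosome, serial, suffix = match.groups()
    return ElementName(species, int(project), chromosome, int(serial),
                       ELEMENT_TYPES[suffix])
```

For CAGL0G08492g this yields species CAGL, project 0, chromosome G, serial number 8492, and an element whose RNA product may be translated.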

Format of Standard tRNA Names

tRNAs annotated in the sequencing projects were confirmed by CGD using the program tRNAscan-SE. The eukaryotic model option was used for nuclear tRNAs, and the organellar model was used for mitochondrial tRNAs. Instances of disagreement between the original annotations and tRNAscan-SE were resolved by alignment with experimentally verified tRNAs from S. cerevisiae.

CGD uses the following format for standard tRNA names: 't' + encoded amino acid [one-letter code] + (anticodon) + count. For example, tQ(CUG)2 denotes the second instance of tRNA-glutamine with anticodon 'CUG'. Mitochondrial tRNAs use the same format, with 'mt' appended: for example, tH(GUG)4mt. The count for mitochondrial tRNAs continues from that of the nuclear-encoded tRNAs of the same coding type. To facilitate searching, an alias is provided with each 'U' replaced by 'T' in the anticodon.
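The standard tRNA name format lends itself to a short parser and formatter. The Python sketch below follows the format described above; the class and function names are hypothetical, and the pattern assumes a three-letter anticodon drawn from A, C, G, and U.

```python
import re
from typing import NamedTuple


class TRNAName(NamedTuple):
    amino_acid: str      # one-letter code
    anticodon: str       # RNA alphabet, e.g. "CUG"
    count: int
    mitochondrial: bool


TRNA_PATTERN = re.compile(r"t([A-Z])\(([ACGU]{3})\)([0-9]+)(mt)?")


def parse_trna_name(name: str) -> TRNAName:
    """Split a standard tRNA name such as tQ(CUG)2 into its parts."""
    match = TRNA_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a standard tRNA name: {name!r}")
    amino_acid, anticodon, count, mt = match.groups()
    return TRNAName(amino_acid, anticodon, int(count), mt is not None)


def format_trna_name(trna: TRNAName, dna: bool = False) -> str:
    """Build the name; with dna=True, 'U' becomes 'T' in the anticodon only."""
    anticodon = trna.anticodon.replace("U", "T") if dna else trna.anticodon
    suffix = "mt" if trna.mitochondrial else ""
    return f"t{trna.amino_acid}({anticodon}){trna.count}{suffix}"


def dna_alias(name: str) -> str:
    """Return the searchable alias with 'U' replaced by 'T' in the anticodon."""
    return format_trna_name(parse_trna_name(name), dna=True)
```

Here tQ(CUG)2 produces the alias tQ(CTG)2, and tH(GUG)4mt produces tH(GTG)4mt.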

Please note that the count suffix is arbitrary, and independent among the different species in CGD. It is used simply to create a unique identifier for each tRNA gene of a given species, and no special relationship between two tRNAs with the same standard name from different species is implied. For example, the tH(GUG)1 tRNA gene in C. albicans and the tH(GUG)1 tRNA gene in C. glabrata are not necessarily expected to be more closely related to each other than to any other tH(GUG) in either genome.

For species-specific, systematic tRNA names, please see the gene-naming section for the particular species, above.

Detailed format of gene, allele, and protein names (C. albicans examples)

Many thanks to Aaron Mitchell for providing this table of C. albicans gene nomenclature examples.
Genetic locus: ICG1
Wild-type allele: ICG1
Recessive mutant allele: icg1-1, icg1Δ5, icg1Δ::hisG, icg1Δ::hisG-URA3-hisG
Dominant mutant allele: ICG1-7
Variant wild-type allele: ICG1-8
Tagged wild-type allele: ICG1-GFP, ICG1-HA
Wild-type genotype: ICG1/ICG1, ICG1-8/ICG1-9
Heterozygous mutant genotype: icg1Δ::hisG-URA3-hisG/ICG1, icg1Δ::hisG/ICG1
Homozygous mutant genotype: icg1Δ::hisG/icg1Δ::hisG-URA3-hisG, icg1Δ::hisG/icg1Δ::hisG
Reintegrant of wild-type allele (on bacterial plasmid) at mutant locus: icg1Δ::hisG/icg1Δ::hisG::ICG1
Reintegrant of wild-type allele (on bacterial plasmid) at the ARG4 locus: icg1Δ::hisG/icg1Δ::hisG ARG4::ICG/ARG4
Wild-type gene product: Icg1, Icg1p
Mutant gene product: Icg1-1, Icg1-1p
Tagged gene product: Icg1-GFP, Icg1p-GFP, Icg1-GFPp
Wild-type phenotype: Icg+
Mutant phenotype: Icg-
Partially-defective phenotype (as sometimes seen for heterozygote): Icgw (for weak), ICGw, Icg+/-

