The paper describing the comparative genomic analysis that was the basis for the refinements to Assembly 21 performed in 2008 has now been published (Butler, G., et al. [2009] Nature).
In a collaboration between CGD and the Broad Institute, MIT, a targeted re-analysis of Candida albicans genome sequence and annotation has been performed using new comparative genome analysis data and newly generated sequence data. A comparative genome analysis was done by Mike Lin, Christina Cuomo, Manolis Kellis, and colleagues at the Broad Institute, who compared the genome sequences of Candida albicans SC5314, Candida albicans WO-1, Candida dubliniensis, Candida tropicalis, Candida parapsilosis, Lodderomyces elongisporus, Debaryomyces hansenii, Candida guilliermondii, and Candida lusitaniae (Butler et al., submitted). Their analysis identified many conserved genomic regions corresponding to potential new ORFs, as well as regions of no significant conservation that are annotated as ORFs, which were candidates for "Dubious" classification. It also revealed several ORFs with incorrectly annotated boundaries, as well as possible sequencing errors that had led to incorrect ORF assignments. The Annotation Working Group had also previously identified suspected sequencing errors, and many ORFs in CGD contained "adjustments" (artificial sequence changes), which were added to compensate for such presumed errors and to restore ORF integrity. CGD staff inspected each of the areas identified by the Broad Institute group and by the Annotation Working Group.
As a result of this analysis, hundreds of sequence errors were corrected, which allowed us to update annotations for 530 ORFs, and 73 new ORFs identified in the comparative genome analysis were added to CGD. All artificial "adjustments" (arbitrary sequence changes made to correct presumed errors; see below) were removed from the sequence. The sequence and annotation changes made on each chromosome are listed on individual Chromosome History pages, which are linked from a Summary Table. The detailed description of the methodology used in analysis and curation, as well as the summary of the results, is available in the Sequence Refinements, November 2008 documentation.
Note that sequence and annotation changes were made to Assembly 21 only, not to previous assemblies.
As another result of this analysis, 181 non-conserved ORFs were identified whose sequence is indistinguishable from random non-coding sequence. These ORFs were classified in CGD as Dubious ORFs, unlikely to be biologically significant. The remaining ORFs in CGD were classified either as Verified, meaning that there is experimental evidence for the existence of a gene product (as defined by the ORF having curated Gene Ontology terms with experimental evidence codes, i.e., evidence codes other than IEA, ISS, RCA, ISA, ISM, ISO, NAS), or as Uncharacterized, meaning that no experimental evidence currently exists but that the ORFs are likely to represent biologically significant genes. These classifications are displayed on the Locus Summary page of each ORF, and may be changed in the future as new experimental evidence becomes available.
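The classification rule described above can be sketched in a few lines. This is only an illustration, not CGD's curation code: the function name, the input format (a list of evidence-code strings per ORF), and the choice to let the Dubious flag take precedence over GO evidence are all assumptions made for the example. Only the list of non-experimental evidence codes comes from the text.

```python
# Evidence codes that the text lists as NOT counting as experimental evidence.
NON_EXPERIMENTAL_CODES = {"IEA", "ISS", "RCA", "ISA", "ISM", "ISO", "NAS"}


def classify_orf(go_evidence_codes, is_dubious=False):
    """Return 'Dubious', 'Verified', or 'Uncharacterized' for one ORF.

    go_evidence_codes: evidence codes attached to the ORF's curated GO
        annotations (hypothetical input format).
    is_dubious: True if the comparative analysis flagged the ORF as
        indistinguishable from random non-coding sequence.
    """
    # Assumption: a Dubious flag takes precedence over any GO evidence.
    if is_dubious:
        return "Dubious"
    # Any code outside the non-experimental set counts as experimental.
    has_experimental = any(
        code not in NON_EXPERIMENTAL_CODES for code in go_evidence_codes
    )
    return "Verified" if has_experimental else "Uncharacterized"


if __name__ == "__main__":
    print(classify_orf(["IDA", "IEA"]))
    print(classify_orf(["IEA", "ISS"]))
    print(classify_orf([], is_dubious=True))
```

An ORF with no curated GO annotations falls into Uncharacterized in this sketch, since there is no experimental evidence to promote it.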
Assembly 21 (A21) is described in: van het Hoog M, et al. (2007) Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol 8(4):R52. URL: http://genomebiology.com/2007/8/4/R52 . In addition to making a chromosome-level assembly, by mapping the contigs of Assembly 19 (A19) to chromosomes and filling many of the gaps between them, the authors also made numerous and widespread modifications to the genomic sequence, based on alignments generated by loading the SGTC's sequence traces into Sequencher software. Many of these modifications introduced insertions, deletions, and substitutions relative to the Assembly 19 sequence. In many cases A, C, T, or G was replaced with an ambiguous nucleotide; within ORFs, such ambiguous nucleotides result in ambiguous amino acids in the predicted ORF translation, each of which is represented as an "X" within the A21 protein sequence.
More information is available in Assembly 21 Sequence Documentation.
The intron data published in the paper Mitrovich QM, Tuch BB, Guthrie C, Johnson AD. Computational and experimental approaches double the number of known introns in the pathogenic yeast Candida albicans. Genome Res. 2007 Apr;17(4):492-502 have been incorporated into Assembly 21 (and 20) in CGD. Assembly 19 coordinates have not been updated.
In Assembly 21 (and 20), gaps that were introduced by the Annotation Working Group (AWG) to compensate for presumed sequencing errors that interrupt ORFs are labeled "Adjustments". "Adjustments" refer to gaps between regions of CDS that are NOT expected to be biologically significant introns. An adjustment whose length is a negative number indicates an overlap between two regions of CDS, resulting in a duplication of the overlapping part of the sequence in the predicted ORF. In Assembly 19, all introns and adjustments are called Gaps.
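The effect of a negative-length adjustment can be illustrated with a small sketch. The coordinate convention used here (1-based, inclusive, forward-strand segments), the function names, and the toy sequence are assumptions made for the example; they are not taken from CGD's files.

```python
def gap_lengths(segments):
    """Return the gap length between each pair of consecutive CDS segments.

    segments: list of (start, end) tuples, 1-based inclusive, in order along
        the forward strand (an assumed convention for this sketch).
    A negative value means the two segments overlap by that many bases.
    """
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(segments, segments[1:])
    ]


def join_segments(genomic, segments):
    """Concatenate the CDS segments; overlapping bases appear twice."""
    return "".join(genomic[start - 1:end] for start, end in segments)


if __name__ == "__main__":
    genomic = "ATGAAACCCGGGTTTTAA"  # toy sequence, 18 bases
    overlapping = [(1, 9), (7, 18)]  # second segment starts inside the first
    print(gap_lengths(overlapping))
    print(join_segments(genomic, overlapping))
```

With the overlapping segments above, the gap length is -3 and the three shared bases (CCC) occur twice in the joined coding sequence, which is the duplication the text describes.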
In November 2008, all the non-intron "adjustments" in Assembly 21 were removed, as explained in the Sequence Refinements, November 2008 documentation.
Summary files listing introns and non-intron adjustments are available from the CGD Downloads site.
The tRNA genes were predicted from the C. albicans genome sequence using the tRNAscan-SE algorithm developed by T. M. Lowe and S. R. Eddy. The names appear in the following format, which is based on the format of the S. cerevisiae tRNA gene names: a lower-case t (for "tRNA"), followed by the one-letter abbreviation of the amino acid with which it is charged, followed by the anticodon (in parentheses), followed by an integer. The name of the corresponding allele has an additional ".2" suffix. For example, "tA(AGC)1" is an alanyl tRNA with an AGC anticodon, and "tA(AGC)1.2" is the corresponding allele.
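The naming format above is regular enough to parse with a short pattern. The following is a minimal sketch, assuming a single upper-case amino acid letter and a three-letter anticodon drawn from A, C, G, T, or U; the function name and returned field names are hypothetical, and names that deviate from the described format are simply rejected.

```python
import re

# t + one-letter amino acid + (anticodon) + integer + optional ".2" allele suffix
TRNA_NAME = re.compile(r"^t([A-Z])\(([ACGTU]{3})\)(\d+)(\.2)?$")


def parse_trna_name(name):
    """Split a tRNA gene name into its parts, or return None if it does not match."""
    match = TRNA_NAME.match(name)
    if match is None:
        return None
    amino_acid, anticodon, number, allele_suffix = match.groups()
    return {
        "amino_acid": amino_acid,            # one-letter code, e.g. "A" for alanine
        "anticodon": anticodon,
        "number": int(number),
        "is_second_allele": allele_suffix is not None,
    }


if __name__ == "__main__":
    print(parse_trna_name("tA(AGC)1"))
    print(parse_trna_name("tA(AGC)1.2"))
```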
The C. albicans codon usage table may be accessed using the link in the left-hand menu bar of the CGD home page, under the heading "Download Data," or using the link on the Download Sequence page. This table displays the calculated frequency of use of each codon in the diploid complement of C. albicans protein-coding genes. The table was produced with the GCG program CodonFrequency using the diploid complement of all predicted coding sequences (13,117 open reading frames) from Assembly 19 of the C. albicans SC5314 genomic sequence, as found in the file 'orf_coding.fasta' dated 07-Jun-2005. Where the sequences of two alleles differ, both sequences were used to calculate codon usage. Where the sequences of two alleles were identical, two copies of the coding sequence were added to the pool of sequences used to calculate codon usage. Thus, codon usage was calculated from the entire diploid complement of protein-coding genes.
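The diploid counting scheme described above can be sketched as follows. This is not the GCG CodonFrequency program; it only illustrates the pooling rule, in which every gene contributes two coding sequences, either two distinct alleles or two copies of a single sequence. The input format (a pair whose second element is None when the alleles are identical), the function names, and the per-thousand reporting unit are assumptions made for the example.

```python
from collections import Counter


def codons(cds):
    """Split a coding sequence into codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def diploid_codon_counts(genes):
    """Count codons over the diploid complement.

    genes: list of (allele_a, allele_b) pairs; allele_b is None when both
        alleles are identical, in which case allele_a is counted twice.
    """
    counts = Counter()
    for allele_a, allele_b in genes:
        counts.update(codons(allele_a))
        counts.update(codons(allele_a if allele_b is None else allele_b))
    return counts


def per_thousand(counts):
    """Convert raw counts to frequencies per 1000 codons."""
    total = sum(counts.values())
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


if __name__ == "__main__":
    toy_genes = [
        ("ATGCTGTAA", None),         # identical alleles: counted twice
        ("ATGTTGTAA", "ATGCTGTAA"),  # alleles differ: each counted once
    ]
    counts = diploid_codon_counts(toy_genes)
    for codon, freq in sorted(per_thousand(counts).items()):
        print(codon, round(freq, 1))
```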
Note that C. albicans uses an alternative genetic code for nuclear genes, different from that used by most other fungi. Details and links to translation tables for nuclear and mitochondrial genes can be found at NCBI's Taxonomy Browser.
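To make the consequence of the alternative code concrete, the sketch below translates a short coding sequence two ways. It assumes that the C. albicans nuclear code corresponds to NCBI translation table 12 (Alternative Yeast Nuclear Code), in which CUG (CTG in DNA) is read as serine rather than leucine; please confirm this against the Taxonomy Browser before relying on it. The function name is hypothetical, and rendering every codon that contains an ambiguous nucleotide as "X" is a simplification chosen for the example.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G at each position).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(triplet): aa
    for triplet, aa in zip(product("TCAG", repeat=3), STANDARD_AAS)
}

# Assumed alternative yeast nuclear code: identical except CTG -> Ser.
ALT_YEAST_CODE = dict(STANDARD_CODE, CTG="S")


def translate(cds, code=ALT_YEAST_CODE):
    """Translate a DNA coding sequence; unknown or ambiguous codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(code.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    cds = "ATGCTGNNNTAA"
    print(translate(cds))                 # alternative yeast nuclear code
    print(translate(cds, STANDARD_CODE))  # standard code, for comparison
```

Translating C. albicans ORFs with the standard table would therefore mis-assign a leucine at every CTG codon, which is why generic translation tools need the table set explicitly.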
Assembly 20 of the C. albicans sequence, released in May 2006, was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan. After the release, it was discovered that the sequence traces that had been used to fill some of the gaps and determine overlaps between Assembly 19 contigs were derived from strain WO-1, rather than from the reference strain SC5314. The sequence of these regions is consequently expected to be inaccurate where WO-1 sequence was used, and small contigs may have been misassembled based on the WO-1 sequence data. The Biotechnology Research Institute of the National Research Council of Canada has since released Assembly 21, which supersedes Assembly 20.
More information is available in Assembly 20 Sequence Documentation.
The contig sequences in CGD are from Assembly 19 of the C. albicans genome sequence, from the supplementary material published in the C. albicans sequencing paper, Jones, T., Federspiel, N.A., Chibana, H., Dungan, J., Kalman, S., Magee, B.B., Newport, G., Thorstenson, Y.R., Agabian, N., Magee, P.T., Davis, R.W. and S. Scherer. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334. Supplementary data: http://genome-www.stanford.edu/candida-pnas2004-supplement/. (Older sequence assemblies, including Assemblies 4, 5, and 6, have been archived at CGD. These data may be retrieved from the "archived_assemblies" folder on the CGD Sequence Download Page.)
More information is available in Assembly 19 Sequence Documentation.
This page contains documentation from the Stanford Genome Technology Center (SGTC), which was previously available on the SGTC's Candida information server, and has been archived here (verbatim) for reference.
Note: The original SC5314 sequence trace files and quality scores generated by the Stanford Genome Technology Center are available for download from CGD. The construction of the sequencing library and sequencing methods are described in Tzung et al. (2001).
From the Locus Summary Page:
The "Retrieve Sequences" pull-down menu, which is located on the Resources sidebar on the right-hand side of each Locus Summary Page, retrieves, for each gene in Assembly 21, or each allele in Assembly 19: the Genomic DNA (with introns included); the Coding Sequence (with introns removed); the Genomic DNA with 1 kb of flanking sequence upstream and downstream of the gene (also includes any introns); or the ORF translation (predicted protein sequence).
From the CGD Sequence Retrieval Tool:
To access the Sequence Retrieval Tool (also called Get Sequence, or Gene/Sequence Resources), use the link under Search Options on the left-hand sidebar of the CGD Home Page or use the "Gene/Sequence Resources" link under Specialized Gene and Sequence Searches on the Search Options page.
By Bulk Download:
You may download gzip-compressed sequence files in bulk from the CGD Sequence Download Page; a variety of file options exist for retrieval of data from Assemblies 19, 20, and 21. There is a link to this page under Download Data on the left-hand sidebar of the CGD Home Page. Archived copies of older sequence assemblies, including Assemblies 4, 5, and 6, may also be retrieved from the CGD Sequence Download Page.
You may also retrieve sequence information for any set of genes (either specified by a list of gene names, or by selecting a region of a chromosome or contig) using the Batch Download Tool.
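Once a gzip-compressed sequence file has been downloaded, it can be read without decompressing it first. The sketch below assumes the file is in FASTA format and uses a hypothetical local file name; it relies only on the Python standard library, and the function name is an illustrative choice.

```python
import gzip


def read_fasta_gz(path):
    """Yield (record_id, sequence) pairs from a gzip-compressed FASTA file.

    The record ID is taken as the first whitespace-delimited token of each
    header line; the rest of the header is ignored in this sketch.
    """
    record_id, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if record_id is not None:
                    yield record_id, "".join(chunks)
                header = line[1:].split()
                record_id, chunks = (header[0] if header else ""), []
            else:
                chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


if __name__ == "__main__":
    import sys

    # Usage: python read_fasta.py some_downloaded_file.fasta.gz  (hypothetical name)
    for name, seq in read_fasta_gz(sys.argv[1]):
        print(name, len(seq))
```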
From the GBrowse Genome Browser:
To view the nucleotide sequence of a gene using GBrowse, begin by zooming in on the gene in the browser, as described in detail on the GBrowse Help Documentation page. You may view Assembly 19 or Assembly 21 in GBrowse; please be aware that the assemblies are stored separately and are browsed separately. GBrowse may be accessed using the "Chromosomal Location" (for Assembly 21) or "Contig Location(s)" (for Assembly 19) links or the GBrowse map thumbnail views on each Locus page, or by using the "CGD GBrowse" links displayed on each BLAST result page. You may use GBrowse to search by gene name. For example, type "orf19.7247" into the Landmark or Region search box and click on Search.
To view the DNA sequence of the region displayed in the browser (which is now your gene of interest), select Download Sequence File or Download Decorated FASTA File from the pull-down menu labeled "Reports and Analysis." The difference between these two formats is that the decorated FASTA file format highlights ORFs contained within the sequence, which is convenient when viewing a large sequence file. The non-decorated sequence file can be displayed in any of several different formats. Each of the file formats is configurable; select the file format from the pull-down menu and then click on the Configure button to select configuration options. Click on the button marked "Go" to view the sequence.
To view any amount of nucleotide sequence of the region upstream or downstream of a gene, you can use the browser to display a specific region relative to the ORF start site and then ask to download this sequence. For example, if you want the sequence of the 1.5 kb region upstream of orf19.7247, enter "orf19.7247:-1500..-1" into the Landmark or Region search box and click on Search. Now use the Download Sequence File or Download Decorated FASTA File option to get the nucleotide sequence of the region.
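The region syntax in the example above (feature name, a colon, then start and end offsets relative to the ORF start separated by two dots) is easy to generate when many upstream regions are needed. This is a minimal sketch that only builds the search string shown in the text; the function name is hypothetical, and it covers upstream regions only, since the text gives no downstream example.

```python
def upstream_region(feature_name, length_bp):
    """Build a Landmark or Region query for the region upstream of an ORF start.

    For length_bp=1500 this gives offsets -1500..-1, i.e. the 1.5 kb
    immediately upstream of the start site.
    """
    if length_bp < 1:
        raise ValueError("length_bp must be a positive number of bases")
    return f"{feature_name}:-{length_bp}..-1"


if __name__ == "__main__":
    print(upstream_region("orf19.7247", 1500))
```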
To view the predicted protein sequence (ORF translation) of an ORF in GBrowse, for example, orf19.7247, first type "orf19.7247" into the Landmark or Region search box and click on Search to zoom in on this ORF. Now select Download Protein Sequence File from the pull-down menu labeled "Reports and Analysis." The protein sequence file format is configurable; click on the Configure button to select configuration options. Click on the button marked "Go" to view the sequence.
You may even view the sequence of an entire contig or chromosome in GBrowse. You can search for a contig by name. For example, type "Contig19-2507" into the Landmark or Region search box and click on Search. You may then use either the Download Sequence File or Download Decorated FASTA File from the pull-down menu labeled "Reports and Analysis" to obtain the nucleotide sequence, or you may use Download Protein Sequence File to obtain the predicted protein sequence of all of the ORFs contained on the contig. If you would like the sequence of a contig containing your favorite gene, but you don't know the name of the contig, there are several ways to find this information. The Contig Location is now listed on the CGD Locus Page for each gene. Alternately, you can search for the gene using the Landmark or Region search in GBrowse, and the name of the contig will be displayed on the Overview Panel near the top of the page.
The GBrowse Help Documentation page has additional instructions for use of the GBrowse interface. To begin exploring in GBrowse now, use this link to see a region of Contig19-10014 as an example from Assembly 19.
Using BLAST (Basic Local Alignment Search Tool):
You may use the CGD BLAST tool to conduct protein or DNA sequence searches against various sequence datasets in CGD, as described in detail on the BLAST documentation page. Alignments of the query sequence with its sequence matches (also called "hits") are displayed along with hyperlinks to related sequence resources. The "CGD GBROWSE" hyperlink above each set of HSPs on the BLAST results page opens the GBrowse genome browser, with the HSP displayed in the browser window. GBrowse may be used to further explore the region containing the match: to view ORFs and other features in the neighborhood of the hit, to browse and download adjacent sequences, to view the 6-frame translation of the region, and to view restriction sites. (For a description of GBrowse features, please see our GBrowse documentation). If applicable, links are provided to directly download/view the entire ORF or peptide sequence, or to navigate to the corresponding Locus page.