The Candida Genome Database: Sequence Documentation

This page provides information about the DNA and protein sequences in CGD, including their sources, how to access them, and further explanation of some sequence-related issues.

Information about Candida-related strains and species in CGD

CGD provides sequences for download from several Candida-related strains and species, listed below. Initially, CGD curation focused on the C. albicans literature, because C. albicans serves as a genetic model for the other Candida-related species and is the best represented of these species in the published experimental literature. As of June 2011, we have also added curated information about C. glabrata. We are now expanding the manual curation process to include other Candida-related species, and will be adding gene-based information for them, including Locus Summary pages.

The C. albicans SC5314 sequence file names, as well as chromosome identifiers within the files, were updated on 25 August 2010 to include the name of the species and strain. This change was necessary to accommodate multiple Candida and Candida-related species and strains at CGD.

Note: Candida albicans and some related species (often called the "CTG clade") use a non-standard genetic code, "Translation table 12: Alternative Yeast Nuclear Code," to translate nuclear genes. For more information about translation tables used in CGD, please see the Non-standard Genetic Code Usage in Candida help page.
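As a rough illustration of the note above, the following Python sketch builds the standard code (translation table 1) and derives table 12 from it by reassigning the single codon CTG from leucine to serine. The function names and the example ORF are invented for this sketch and are not part of any CGD tool; start-codon differences between the tables are not modeled.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code (translation table 1), listed in the
# conventional codon order TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_table(aas):
    """Map each of the 64 codons to its one-letter amino acid."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, aas))


TABLE_1 = build_table(STANDARD_AAS)
# Table 12 (Alternative Yeast Nuclear Code): CTG encodes serine, not leucine.
TABLE_12 = dict(TABLE_1, CTG="S")


def translate(dna, table):
    """Translate complete codons; codons not in the table become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


orf = "ATGCTGAAATAA"  # hypothetical ORF containing one CTG codon
print(translate(orf, TABLE_1))   # MLK*
print(translate(orf, TABLE_12))  # MSK*
```

Translating a CTG-clade nuclear gene with the standard table would therefore place a leucine wherever the organism actually inserts a serine.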

Sources of sequence-based information in CGD

Version Tracking for Chromosomal Sequence and Genome Annotation

The version designation appears in the name of each of the relevant sequence files that are available at CGD, so the exact source of the sequence data is always clear.

This version system was implemented for C. albicans SC5314 and C. glabrata CBS138 in CGD as of June 2011, and it is based on the system designed for tracking of the A. nidulans sequence and annotation versions in AspGD. The same system of version designation will be used for version tracking for the chromosomal sequence and genome annotation of other species, as they are added into CGD.

Version designations appear in the following format:
sXX-mYY-rZZ
as described in detail here.
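Because the designation is embedded in file names, it can be extracted programmatically. The sketch below is one possible way to do so in Python; the file name is invented for illustration and may not match any real CGD file, and the three numbers are returned without interpreting what each field tracks.

```python
import re

# Matches designations of the form sXX-mYY-rZZ, where XX, YY, and ZZ are digits.
# The lookbehind avoids matching an "s" that is the tail of a longer word.
VERSION_RE = re.compile(r"(?<![A-Za-z0-9])s(\d+)-m(\d+)-r(\d+)")


def parse_version(text):
    """Return the (s, m, r) numbers found in text, or None if absent."""
    match = VERSION_RE.search(text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


# Hypothetical file name, used only to demonstrate the parser.
example = "C_albicans_SC5314_version_s02-m01-r03_chromosomes.fasta.gz"
print(parse_version(example))  # (2, 1, 3)
```

Returning a tuple of integers means two designations can be compared directly, for example `parse_version(a) < parse_version(b)`, assuming that higher numbers denote later versions.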

A list of all versions of the sequence and annotation for each species, with release notes, is available on the Summary of Genome Versions page.

Information about every update to the chromosome sequence and/or chromosomal location of any gene (or other annotated feature) is displayed on the CGD Locus History page for each of the relevant genes, and also on the appropriate CGD Chromosome History page.

Please feel free to contact us with any questions.

C. albicans SC5314 Genome Sequence Assemblies

Refinements to Assembly 21 in CGD (November 2008)


Assembly 21 in CGD (September 2007)

Assembly 21 (A21) is described in: van het Hoog M, et al. (2007) Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol 8(4):R52. URL: http://genomebiology.com/2007/8/4/R52 . In addition to producing a chromosome-level assembly by mapping the contigs of Assembly 19 (A19) to chromosomes and filling many of the gaps between them, the authors made numerous and widespread modifications to the genomic sequence, based on alignments generated by loading the SGTC's sequence traces into Sequencher software. Many of these modifications introduced insertions, deletions, and substitutions relative to the Assembly 19 sequence. In many cases an A, C, T, or G was replaced by an ambiguous nucleotide; within ORFs, such ambiguous nucleotides result in ambiguous amino acids in the predicted ORF translation, which are represented as "X" within the A21 protein sequence.

More information is available in Assembly 21 Sequence Documentation.
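To see which residues of an A21 translation are affected by ambiguous nucleotides, one simple approach is to scan the coding sequence for codons containing any character other than A, C, G, or T, or to scan the protein for "X". The Python sketch below does both; the function names and example sequences are made up for illustration. It flags every codon with an ambiguity code, even one whose possible expansions would all encode the same amino acid, which is a simplification and may not match how the A21 translations were produced.

```python
def ambiguous_codons(cds):
    """Return 1-based codon numbers that contain a non-ACGT character."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    hits = []
    for i in range(0, usable, 3):
        if set(cds[i:i + 3]) - set("ACGT"):
            hits.append(i // 3 + 1)
    return hits


def x_positions(protein):
    """Return 1-based positions of 'X' residues in a protein sequence."""
    return [i + 1 for i, aa in enumerate(protein.upper()) if aa == "X"]


# Hypothetical sequences: codons 2 and 3 carry the ambiguity codes N and R.
print(ambiguous_codons("ATGGCNTTRTAA"))  # [2, 3]
print(x_positions("MXXK"))               # [2, 3]
```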


Assembly 20 in CGD (September 2006)

Assembly 20 of the C. albicans sequence, released in May 2006, was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan. After the release, it was discovered that the sequence traces that had been used to fill some of the gaps and determine overlaps between Assembly 19 contigs were derived from strain WO-1, rather than from the reference strain SC5314. The sequence of these regions is consequently expected to be inaccurate where WO-1 sequence was used, and small contigs may have been misassembled based on the WO-1 sequence data. The Biotechnology Research Institute of the National Research Council of Canada has since released Assembly 21, which supersedes Assembly 20.

More information is available in Assembly 20 Sequence Documentation.


Assembly 19 in CGD (March 2004)

The contig sequences in CGD are from Assembly 19 of the C. albicans genome sequence, from the supplementary material published in the C. albicans sequencing paper, Jones, T., Federspiel, N.A., Chibana, H., Dungan, J., Kalman, S., Magee, B.B., Newport, G., Thorstenson, Y.R., Agabian, N., Magee, P.T., Davis, R.W. and S. Scherer. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334. Supplementary data: http://genome-www.stanford.edu/candida-pnas2004-supplement/. (Older sequence assemblies, including Assemblies 4, 5, and 6, have been archived at CGD. These data may be retrieved from the "archived_assemblies" folder on the CGD Sequence Download Page.)

More information is available in Assembly 19 Sequence Documentation.


Assembly 6 in CGD (January 2002)

This page contains documentation from the Stanford Genome Technology Center (SGTC), which was previously available on the SGTC's Candida information server, and has been archived here (verbatim) for reference.

Note: The original SC5314 sequence trace files and quality scores generated by the Stanford Genome Technology Center are available for download from CGD. The construction of the sequencing library and sequencing methods are described in Tzung et al. (2001).


Sources of SNP data

Please note: This is not intended to be a comprehensive bibliography, but rather a list of a few helpful references:

SNPs between allelic Assembly 19 contigs, C. albicans strain SC5314, are published in the Assembly 19 paper, Jones et al. (2004), and these data are available for download as supplementary material associated with the publication.

SNP data for C. albicans from Forche et al. (2004) are available in the supplementary material associated with the paper and may be viewed using the SNP track in the CGD GBrowse genome browser.

SNP data are included among the data from Butler et al. (2009) for eight Candida genomes, and are available for download as supplementary material associated with the paper, and from the Broad Institute website.


Accessing Sequences in CGD

From the Locus Summary Page:

The "Retrieve Sequences" pull-down menu, which is located on the Resources sidebar on the right-hand side of each Locus Summary Page, retrieves, for each gene in Assembly 21, or each allele in Assembly 19: the Genomic DNA (with introns included); the Coding Sequence (with introns removed); the Genomic DNA with 1 kb of flanking sequence upstream and downstream of the gene (also includes any introns); or the ORF translation (predicted protein sequence).
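The relationship among three of the nucleotide sequence types listed above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: coordinates are 1-based and inclusive, the gene lies on the forward strand, and the toy chromosome, exon coordinates, and function names are all invented here. It does not describe how CGD itself generates these sequences.

```python
def genomic_dna(chrom, start, end):
    """Gene region from start to end, introns included."""
    return chrom[start - 1:end]


def coding_sequence(chrom, exons):
    """Concatenate exon intervals in order, which removes the introns."""
    return "".join(chrom[s - 1:e] for s, e in sorted(exons))


def with_flanks(chrom, start, end, flank=1000):
    """Gene region plus up to `flank` bases on each side, clipped at the ends."""
    left = max(0, start - 1 - flank)
    right = min(len(chrom), end + flank)
    return chrom[left:right]


# Toy chromosome: 4 bp upstream, exon 1, a 6 bp intron, exon 2, 4 bp downstream.
chrom = "TTTT" + "ATGGCC" + "GTAAGT" + "AAATAA" + "CCCC"
exons = [(5, 10), (17, 22)]

print(genomic_dna(chrom, 5, 22))          # ATGGCCGTAAGTAAATAA
print(coding_sequence(chrom, exons))      # ATGGCCAAATAA
print(with_flanks(chrom, 5, 22, flank=2)) # TTATGGCCGTAAGTAAATAACC
```

A gene on the reverse strand would additionally need to be reverse-complemented, which this sketch omits.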

From the CGD Sequence Retrieval Tool:

To access the Sequence Retrieval Tool (also called Get Sequence, or Gene/Sequence Resources), use the link under Search Options on the left-hand sidebar of the CGD Home Page or use the "Gene/Sequence Resources" link under Specialized Gene and Sequence Searches on the Search Options page.

By Bulk Download

You may download gzip-compressed sequence files in bulk from the CGD Sequence Download Page; a variety of file options exist for retrieval of data from Assemblies 19, 20, and 21. There is a link to this page under Download Data on the left-hand sidebar of the CGD Home Page. Archived copies of older sequence assemblies, including Assemblies 4, 5, and 6, may also be retrieved from the CGD Sequence Download Page.
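A downloaded gzip-compressed file can be read without decompressing it to disk first. The following Python sketch uses only the standard library and assumes the file is in FASTA format; the function name is invented for this example, and any path passed to it is whatever file you have downloaded.

```python
import gzip


def read_fasta_gz(path):
    """Yield (name, sequence) pairs from a gzip-compressed FASTA file."""
    name, chunks = None, []
    with gzip.open(path, "rt") as handle:  # "rt" decodes the stream as text
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                fields = line[1:].split()
                # Keep only the first word of the header as the record name.
                name, chunks = (fields[0] if fields else ""), []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Example use, with a placeholder path to a file you have downloaded:
# for name, seq in read_fasta_gz("path/to/downloaded_file.fasta.gz"):
#     print(name, len(seq))
```

Because the function is a generator, it holds only one record in memory at a time, which is helpful for whole-chromosome files.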

You may also retrieve sequence information for any set of genes (either specified by a list of gene names, or by selecting a region of a chromosome or contig) using the Batch Download Tool.

From the GBrowse Genome Browser:

You may also view nucleotide or protein sequence using the GBrowse genome browser. GBrowse may be accessed using the "Chromosomal Location" or "Contig Location(s)" links, or the GBrowse map thumbnail views on each Locus page, or by using the "Genome Browser" links displayed on each BLAST result page. Sequence download options are available from the Reports & Analysis pull-down menu in the interface. The GBrowse Help Documentation page has additional instructions for use of the GBrowse interface.

Using BLAST (Basic Local Alignment Search Tool):

You may use the CGD BLAST tool to conduct protein or DNA sequence searches against various sequence datasets in CGD, as described in detail on the BLAST documentation page. Alignments of the query sequence with its sequence matches (also called "hits") are displayed along with hyperlinks to related sequence resources. The "CGD GBROWSE" hyperlink above each set of HSPs on the BLAST results page opens the GBrowse genome browser, with the HSP displayed in the browser window. GBrowse may be used to further explore the region containing the match: to view ORFs and other features in the neighborhood of the hit, to browse and download adjacent sequences, to view the 6-frame translation of the region, and to view restriction sites. (For a description of GBrowse features, please see our GBrowse documentation). If applicable, links are provided to directly download/view the entire ORF or peptide sequence, or to navigate to the corresponding Locus page.

