
The Candida Genome Database: Sequence Documentation


This page provides information about the DNA and protein sequences in CGD, including their sources, how to access them, and further explanation of some sequence-related issues.

Genome Sequence Assemblies

Refinements to Assembly 21 in CGD (November 2008)


Assembly 21 in CGD (September 2007)

Assembly 21 (A21) is described in: van het Hoog M, et al. (2007) Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol 8(4):R52. URL: http://genomebiology.com/2007/8/4/R52 . In addition to producing a chromosome-level assembly by mapping the contigs of Assembly 19 (A19) to chromosomes and filling many of the gaps between them, the authors made numerous and widespread modifications to the genomic sequence, based on alignments generated by loading the SGTC's sequence traces into Sequencher software. Many of these modifications introduced insertions, deletions, and substitutions relative to the Assembly 19 sequence. In many cases A, C, T, or G was replaced with an ambiguous nucleotide; within ORFs, such ambiguous nucleotides consequently resulted in ambiguous amino acids in the predicted ORF translation, each of which is represented as an "X" within the A21 protein sequence.
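The effect of an ambiguous nucleotide on a predicted translation can be illustrated with a minimal Python sketch. This is not the procedure used to generate the A21 protein sequences: the function name translate is hypothetical, and the sketch makes the simplifying assumption that any codon containing a character other than A, C, G, or T is reported as "X", even where every possible base would encode the same amino acid. The codon table is the standard one with CTG reassigned to serine, as in the alternative yeast nuclear code (NCBI translation table 12) used by C. albicans.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# C. albicans decodes CTG as serine rather than leucine (translation table 12).
CODON_TABLE["CTG"] = "S"


def translate(dna):
    """Translate a coding sequence; codons with ambiguous bases become 'X'.

    Simplifying assumption: a codon is treated as ambiguous whenever it
    contains any character other than A, C, G, or T. Trailing bases that
    do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    usable_length = len(dna) - len(dna) % 3
    protein = []
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3]
        # Codons absent from the table contain an ambiguity code (N, R, Y, ...).
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


print(translate("ATGCTGAAATAA"))  # unambiguous sequence
print(translate("ATGNNNAAATAA"))  # second codon is ambiguous
```

In the second call the "NNN" codon cannot be resolved to a single amino acid, so the corresponding position in the protein string is "X".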

More information is available in Assembly 21 Sequence Documentation.


Assembly 20 in CGD (September 2006)

Assembly 20 of the C. albicans sequence, released in May 2006, was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan. After the release, it was discovered that the sequence traces that had been used to fill some of the gaps and determine overlaps between Assembly 19 contigs were derived from strain WO-1, rather than from the reference strain SC5314. The sequence of these regions is consequently expected to be inaccurate where WO-1 sequence was used, and small contigs may have been misassembled based on the WO-1 sequence data. The Biotechnology Research Institute of the National Research Council of Canada has since released Assembly 21, which supersedes Assembly 20.

More information is available in Assembly 20 Sequence Documentation.


Assembly 19 in CGD (March 2004)

The contig sequences in CGD are from Assembly 19 of the C. albicans genome sequence, from the supplementary material published in the C. albicans sequencing paper, Jones, T., Federspiel, N.A., Chibana, H., Dungan, J., Kalman, S., Magee, B.B., Newport, G., Thorstenson, Y.R., Agabian, N., Magee, P.T., Davis, R.W. and S. Scherer. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334. Supplementary data: http://genome-www.stanford.edu/candida-pnas2004-supplement/. (Older sequence assemblies, including Assemblies 4, 5, and 6, have been archived at CGD. These data may be retrieved from the "archived_assemblies" folder on the CGD Sequence Download Page.)

More information is available in Assembly 19 Sequence Documentation.


Assembly 6 in CGD (January 2002)

This page contains documentation from the Stanford Genome Technology Center (SGTC), which was previously available on the SGTC's Candida information server, and has been archived here (verbatim) for reference.

Note: The original SC5314 sequence trace files and quality scores generated by the Stanford Genome Technology Center are available for download from CGD. The construction of the sequencing library and sequencing methods are described in Tzung et al. (2001).


Accessing Sequences in CGD

From the Locus Summary Page:

The "Retrieve Sequences" pull-down menu, which is located on the Resources sidebar on the right-hand side of each Locus Summary Page, retrieves, for each gene in Assembly 21, or each allele in Assembly 19: the Genomic DNA (with introns included); the Coding Sequence (with introns removed); the Genomic DNA with 1 kb of flanking sequence upstream and downstream of the gene (also includes any introns); or the ORF translation (predicted protein sequence).
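The relationship between these sequence types can be sketched in a few lines of Python. This is an illustration only, not the code behind the Retrieve Sequences menu: the function name sequence_views and its parameters are hypothetical, coordinates are assumed to be 1-based and inclusive, and the gene is assumed to lie on the forward strand (a reverse-strand gene would also need to be reverse-complemented). The ORF translation is then the translation of the coding sequence.

```python
def sequence_views(chromosome, exons, flank=1000):
    """Return the genomic, coding, and flanked sequences for one gene.

    chromosome: the chromosome or contig sequence as a string.
    exons: list of (start, end) pairs sorted by position, 1-based and
           inclusive, on the forward strand (an assumption of this sketch).
    flank: number of bases to include on each side of the gene.
    """
    gene_start = exons[0][0]
    gene_end = exons[-1][1]

    # Genomic DNA: everything from the first exon to the last, introns included.
    genomic = chromosome[gene_start - 1:gene_end]

    # Coding sequence: exons joined together, introns removed.
    coding = "".join(chromosome[start - 1:end] for start, end in exons)

    # Genomic DNA plus flanking sequence, clipped at the sequence ends.
    flank_start = max(0, gene_start - 1 - flank)
    flanked = chromosome[flank_start:gene_end + flank]

    return {"genomic": genomic, "coding": coding, "genomic_with_flanks": flanked}


# Toy example: the intron is written in lower case so that it is easy to spot.
toy = "GGGGG" + "ATGAAA" + "gtaag" + "CCCTAA" + "TTTTT"
views = sequence_views(toy, exons=[(6, 11), (17, 22)], flank=3)
print(views["genomic"])
print(views["coding"])
print(views["genomic_with_flanks"])
```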

From the CGD Sequence Retrieval Tool:

To access the Sequence Retrieval Tool (also called Get Sequence, or Gene/Sequence Resources), use the link under Search Options on the left-hand sidebar of the CGD Home Page, or use the "Gene/Sequence Resources" link under Specialized Gene and Sequence Searches on the Search Options page.

By Bulk Download:

You may download gzip-compressed sequence files in bulk from the CGD Sequence Download Page; a variety of file options exist for retrieval of data from Assemblies 19, 20, and 21. There is a link to this page under Download Data on the left-hand sidebar of the CGD Home Page. Archived copies of older sequence assemblies, including Assemblies 4, 5, and 6, may also be retrieved from the CGD Sequence Download Page.
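Once a gzip-compressed file has been downloaded, it can be read without unpacking it first. The following is a minimal Python sketch that assumes the file is in FASTA format; the function name read_fasta_gz and the file name in the usage comment are hypothetical, so substitute the name of the file you actually downloaded.

```python
import gzip


def read_fasta_gz(path):
    """Read a gzip-compressed FASTA file into a dict of {name: sequence}.

    The record name is taken to be the first word after '>' on each
    header line; the rest of the header is ignored.
    """
    records = {}
    name = None
    chunks = []
    with gzip.open(path, "rt") as handle:  # "rt" decompresses to text
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    records[name] = "".join(chunks)
                header = line[1:].split()
                name = header[0] if header else ""
                chunks = []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Usage (hypothetical file name):
# sequences = read_fasta_gz("downloaded_sequences.fasta.gz")
# print(len(sequences), "records read")
```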

You may also retrieve sequence information for any set of genes (either specified by a list of gene names, or by selecting a region of a chromosome or contig) using the Batch Download Tool.

From the GBrowse Genome Browser:

To view the nucleotide sequence of a gene using GBrowse, begin by zooming in on the gene in the browser, as described in detail on the GBrowse Help Documentation page. You may view Assembly 19 or Assembly 21 in GBrowse; please be aware that the two assemblies are stored separately and are browsed separately. GBrowse may be accessed using the "Chromosomal Location" link (for Assembly 21) or the "Contig Location(s)" link (for Assembly 19), using the GBrowse map thumbnail views on each Locus page, or using the "CGD GBrowse" links displayed on each BLAST result page.

You may use GBrowse to search by gene name. For example, type "orf19.7247" into the Landmark or Region search box and click on Search. To view the DNA sequence of the region displayed in the browser (which is now your gene of interest), select Download Sequence File or Download Decorated FASTA File from the pull-down menu labeled "Reports and Analysis." The difference between these two formats is that the decorated FASTA file format highlights ORFs contained within the sequence, which is convenient when viewing a large sequence file, whereas the non-decorated sequence file can be displayed in any of several different formats. Each file format is configurable: select the file format from the pull-down menu and then click on the Configure button to select configuration options. Click on the button marked "Go" to view the sequence.

To view any amount of nucleotide sequence of the region upstream or downstream of a gene, you can use the browser to display a specific region relative to the ORF start site and then ask to download this sequence. For example, if you want the sequence of the 1.5 kb region upstream of orf19.7247, enter "orf19.7247:-1500..-1" into the Landmark or Region search box and click on Search. Now use the Download Sequence File or Download Decorated FASTA File option to get the nucleotide sequence of the region.
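If you need upstream regions for many genes, the search string can be built programmatically and then pasted into the Landmark or Region search box. The following minimal Python sketch follows the "orf19.7247:-1500..-1" pattern shown above; the function name upstream_region is hypothetical.

```python
def upstream_region(orf_name, length):
    """Build a Landmark or Region search string for the region upstream of an ORF.

    The string follows the pattern from the example above: coordinates are
    relative to the ORF start site, so -length..-1 covers the `length` bases
    immediately upstream.
    """
    if length < 1:
        raise ValueError("length must be a positive number of bases")
    return f"{orf_name}:-{length}..-1"


print(upstream_region("orf19.7247", 1500))  # prints orf19.7247:-1500..-1
```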

To view the predicted protein sequence (ORF translation) of an ORF in GBrowse, for example orf19.7247, first type "orf19.7247" into the Landmark or Region search box and click on Search to zoom in on this ORF. Then select Download Protein Sequence File from the pull-down menu and click on the button marked "Go" to view the sequence. The protein sequence file format is configurable: after selecting Download Protein Sequence File, click on the Configure button to select configuration options.

You may even view the sequence of an entire contig or chromosome in GBrowse. You can search for a contig by name. For example, type "Contig19-2507" into the Landmark or Region search box and click on Search. You may then use either Download Sequence File or Download Decorated FASTA File from the pull-down menu labeled "Reports and Analysis" to obtain the nucleotide sequence, or you may use Download Protein Sequence File to obtain the predicted protein sequences of all of the ORFs contained on the contig. If you would like the sequence of the contig containing your favorite gene, but you don't know the name of the contig, there are two ways to find it. The Contig Location is listed on the CGD Locus Page for each gene. Alternatively, you can search for the gene using the Landmark or Region search in GBrowse, and the name of the contig will be displayed on the Overview Panel near the top of the page.

The GBrowse Help Documentation page has additional instructions for use of the GBrowse interface. To begin exploring in GBrowse now, use this link to see a region of Contig19-10014 as an example from Assembly 19.

Using BLAST (Basic Local Alignment Search Tool):

You may use the CGD BLAST tool to conduct protein or DNA sequence searches against various sequence datasets in CGD, as described in detail on the BLAST documentation page. Alignments of the query sequence with its sequence matches (also called "hits") are displayed along with hyperlinks to related sequence resources. The "CGD GBROWSE" hyperlink above each set of high-scoring segment pairs (HSPs) on the BLAST results page opens the GBrowse genome browser, with the HSP displayed in the browser window. GBrowse may be used to further explore the region containing the match: to view ORFs and other features in the neighborhood of the hit, to browse and download adjacent sequences, to view the 6-frame translation of the region, and to view restriction sites. (For a description of GBrowse features, please see our GBrowse documentation.) If applicable, links are provided to directly download or view the entire ORF or peptide sequence, or to navigate to the corresponding Locus page.

