Assembly 21 (A21) is described in van het Hoog M, et al. (2007) Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol 8(4):R52. URL: http://genomebiology.com/2007/8/4/R52 In addition to producing a chromosome-level assembly by mapping the contigs of Assembly 19 (A19) to chromosomes and filling many of the gaps between them, the authors made numerous and widespread modifications to the genomic sequence, based on alignments generated by loading the SGTC's sequence traces into Sequencher software. Many of these modifications introduced insertions, deletions, and substitutions relative to the Assembly 19 sequence. In many cases A, C, T, or G has been replaced with an ambiguous nucleotide; within ORFs, such ambiguous nucleotides result in ambiguous amino acids in the predicted ORF translation, each of which is represented as an "X" in the A21 protein sequence.
The coordinates of each ORF in A21 were determined at CGD using a sequence-based mapping procedure, described in detail under "A21 ORF mapping procedure" below. Where possible, the Assembly 20 (A20) sequence of each ORF was used to determine the A21 coordinates; if the A20 ORF sequence did not match any sequence in A21, then the A19 sequence of the ORF was compared to the A21 sequence (because the sequence of some ORFs changed significantly between A19 and A20). Where relevant, it is noted in the downloadable files whether the A19 or A20 version of the ORF was used for mapping onto A21 (called "assembly used for mapping" in the file descriptions below).
Some ORFs from Assembly 20 map to corresponding regions in A21 chromosomes that, due to sequence changes made during generation of A21, have frameshifts or new in-frame stop codons in A21. In such cases, CGD manually added "adjustments" to the ORF coordinates, splitting the ORF into segments, with gaps or overlap between segments so as to restore the A21 coding sequence (CDS) to match the A20 CDS. In cases in which the A21 sequence lacks the stop codon that terminates the ORF in A20, the A21 ORF has been extended downstream to the next in-frame stop codon. Issues encountered in mapping individual ORFs to A21 are described on the Locus History page for the ORF in CGD.
ORFs in Assembly 19 that were deleted in A20 were not mapped to A21.
The original Assembly 21 released by van het Hoog et al. does not include the sequence of the mitochondrial DNA. Datasets that contain the mitochondrial genome use the sequence from Assembly 19.
The chromosomal coordinates of the tRNA genes were predicted in Assembly 21 by using the tRNAscan-SE algorithm developed by T. M. Lowe and S. R. Eddy.
CGD has conducted a series of analyses of the A21 ORF set, the results of which are available for download and described in detail below. To download any file, select the filename.
The sequence and annotation version designation system used in CGD is described in detail here.
All versions of the sequence and annotation for each species, with release notes, are listed on the Summary of Genome Versions page.
Information about every update to the chromosome sequence and/or chromosomal location of any gene (or other annotated feature) is displayed on the CGD Locus History page for each of the relevant genes, and also on the appropriate CGD Chromosome History page.
Please feel free to contact us with any questions.
Note: Throughout these files, coding sequence is denoted by upper case letters, and noncoding (introns or adjustment/gap) sequences are denoted by lower case letters. An ORF's "genomic sequence" is the chromosomal sequence of the region, including any intronic sequence or gap sequence within the boundaries of the ORF, from the beginning of the start codon through the last nucleotide of the stop codon. An ORF's "coding sequence" or "CDS" is the sequence of the actual open reading frame, after introns have been spliced out and any non-intron adjustments have been made (gaps removed, or overlapping segments aligned into place).
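As an illustration of the case convention described above, the following minimal Python sketch separates a case-coded genomic sequence into its coding (upper case) and noncoding (lower case) parts. The function name and the toy sequence are hypothetical and do not come from the CGD files. The sketch covers only introns and gap-type adjustments; how overlapping segments appear in the files is not specified here, so overlaps are not handled.

```python
def split_by_case(genomic):
    """Return (coding, noncoding) parts of a case-coded genomic sequence.

    Upper case letters are treated as coding sequence and lower case
    letters as intron or gap sequence, following the note above.
    """
    coding = "".join(ch for ch in genomic if ch.isupper())
    noncoding = "".join(ch for ch in genomic if ch.islower())
    return coding, noncoding


# Hypothetical toy ORF: two coding segments separated by a lower case intron.
toy_genomic = "ATGGCT" + "gtaagtttag" + "GGTTAA"
cds, noncoding = split_by_case(toy_genomic)
print(cds)        # ATGGCTGGTTAA
print(noncoding)  # gtaagtttag
```

Dropping the lower case letters yields the spliced CDS only when every noncoding region lies between coding segments, which is the situation the note describes for introns and gaps.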
ORFsDeletedFromA21.txt - This file contains a list of ORF names and, where available, locus names for ORFs that could not be mapped well (from either A20 or A19) or that were deemed not fixable (e.g., multiple stop codons within all three A21 reading frames). Hence, these are the ORFs that are "Deleted from Assembly 21" and labeled as such on their Locus pages. ORFs that have been deleted have been removed from the other files listed below (e.g., an ORF with stop codons internal to the reading frame that has been deleted from A21 is present on this list and is not included in the ORFsWinternalStops.txt file).
ORFsWintrons.txt - This file contains ORFs that have introns in A21. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and full genomic and coding sequence of each of these ORFs.
ORFsWintrons_alignment.txt - This file has alignments between the full A21 genomic sequence and the A21 coding sequence of ORFs that have introns. The alignments were generated using MUSCLE, with the sequences in the file ORFsWintrons.txt, and the output is in ClustalW format.
ORFsWambiguousBases.txt - This file contains ORFs with ambiguous bases (other than A, T, G, or C) in their A21 genomic sequence (the sequence of the ORF itself and any gap or intron sequences within). The following ambiguous bases are contained in the Assembly 21 sequence as released to CGD:
B : C or G or T;
D : A or G or T;
K : G or T;
M : A or C;
N : unspecified A or C or G or T;
R : A or G;
S : C or G;
W : A or T;
Y : C or T.
This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and full genomic sequence of each of these ORFs.
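The ambiguity codes listed above can be used to locate ambiguous positions in a sequence. The sketch below is illustrative only: the dictionary mirrors the list above, while the function name and the example sequence are hypothetical.

```python
# Ambiguity codes exactly as listed above (code -> possible bases).
AMBIGUITY_CODES = {
    "B": "CGT", "D": "AGT", "K": "GT", "M": "AC", "N": "ACGT",
    "R": "AG", "S": "CG", "W": "AT", "Y": "CT",
}


def find_ambiguous(seq):
    """Return (1-based position, code, possible bases) for each ambiguous base.

    Case is ignored, so lower case intron or gap sequence is scanned too.
    """
    hits = []
    for pos, base in enumerate(seq.upper(), start=1):
        if base in AMBIGUITY_CODES:
            hits.append((pos, base, AMBIGUITY_CODES[base]))
    return hits


# Hypothetical example sequence containing R (position 4) and n (position 7).
print(find_ambiguous("ATGRCTnTAA"))
# [(4, 'R', 'AG'), (7, 'N', 'ACGT')]
```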
ORFsWadjustments.txt - This file contains ORFs that have adjustments in their A21 coordinates. "Adjustments" are gaps or overlap between coding segments that have been introduced into the ORF coordinates to compensate for presumed errors in the sequence that cause stop codons internal to the ORF region (either an in-frame stop codon or a shift in reading frame). Please note that some A20 ORFs also have adjustments, and that the list of A20 ORFs with adjustments differs from those with adjustments in A21, due to changes that have been made to the sequence during generation of A21. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and full genomic and coding sequence of each of these ORFs.
ORFsWadjustments_alignment.txt - This file contains alignments between the full A21 genomic sequence and the A21 coding sequence of ORFs that have adjustments. The alignments were generated using MUSCLE, with the sequences in the file ORFsWadjustments.txt, and the output is in ClustalW format.
ORFsWinternalStops.txt - This file contains protein sequences of ORFs that, when mapped to A21, have stop codons within the reading frame in the A21 sequence. "Adjustments" have been manually introduced to the ORF coordinates to reconstruct a complete reading frame, where possible (see also: ORFsWadjustments.txt). This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the amino acid sequence of the predicted ORF translation. For ORFs that have been given "adjustments," the adjusted sequence is also listed.
ORFsWnoEndStop.txt - This file contains protein sequences of ORFs that, when mapped to A21, do not end with a stop codon. Many of these have other problems in A21, such as insertions or deletions in the sequence that disrupt the reading frame. Where possible, ORF coordinates have been manually updated to create a complete reading frame (e.g., truncated at an in-frame stop, extended downstream to the next in-frame stop, or given an "adjustment" that changes the reading frame). This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the amino acid sequence of the predicted ORF translation. For cases where the ORF has been manually updated, the updated sequence is also included.
ORFsWnonATGstart.txt - This file lists ORFs that do not start with an ATG in A21. In some cases, the coordinates were updated manually during curatorial review, and the updated ORF does start with an ATG; these cases are also listed in this file. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the coding sequence of each of these ORFs. For cases where the ORF has been manually updated, the updated sequence is also included.
ORFsWpartialCodon.txt - This file contains coding sequences of ORFs that have partial codons (i.e., the length of the coding sequence is not a multiple of 3). Where possible, ORF coordinates have been manually updated to create a complete reading frame (e.g., truncated at an in-frame stop, extended downstream to the next in-frame stop, or given an "adjustment" that changes the reading frame). This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the coding sequence (CDS). For cases where the ORF has been manually updated, the updated coding sequence is also included.
ORFsWmanualUpdates.txt - This file lists the A21 "problem ORFs" that were manually updated at CGD to address problems in A21. All ORFs that failed one or more of the checks described below under "Details of the A21 ORF Mapping Procedure" were subject to manual review at CGD. Where possible, the issues were addressed: "adjustment(s)" were added to change reading frame or to bypass in-frame stop codons, coordinates were updated to extend or truncate the ORF to acquire a terminal stop codon, and ORFs that narrowly failed criteria for "good" matches to A21 were reinstated if they could be mapped to A21. Only the ORFs for which the coordinates were changed as a result of this review are listed in this file. This file contains the ORF name, locus name (where available), A21 coordinates determined by the computational mapping procedure, manually updated ORF coordinates, descriptive note, the assembly used for mapping, the predicted ORF translation based on the coordinates before updating, and the predicted ORF translation based on the updated coordinates.
ORFsWnoChanges.txt - This file lists ORFs that have not changed at all in A21 as compared to the sequence in the assembly (A20/A19) that was used for mapping the ORF in each case. (Where possible, the A20 sequence of each ORF was used to determine the A21 coordinates; if the A20 ORF sequence did not match any sequence in A21, then the A19 sequence of the ORF was compared to the A21 sequence; this was necessary because the sequence of some ORFs changed significantly between A19 and A20.) This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the genomic sequence of each of these ORFs.
ORFsWsynChanges.txt - This file lists A21 ORFs that show synonymous changes when compared to the sequence in the (A20/A19) assembly that was used for mapping the ORF in each case. The nucleotide sequence of the coding sequence (CDS), excluding any intronic sequence or gap regions ("adjustments"), differs between the two assemblies, but the translated amino acid sequence is the same. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the A21 coding sequence of each of these ORFs, followed by the ORF name, locus name, A20 or A19 coordinates (the assembly from which the ORF was mapped), and A20 or A19 coding sequence of the ORF.
ORFsWsynChanges_alignment.txt - This file contains alignments between the A21 coding sequence (CDS) and the A20/A19 CDS of each ORF that shows synonymous changes upon mapping to Assembly 21. The alignments were generated using MUSCLE, with the sequences in the file ORFsWsynChanges.txt, and the output is in ClustalW format.
ORFsWnonsynChanges.txt - This file lists A21 ORFs that show non-synonymous changes when compared to the sequence in the (A20/A19) assembly that was used for mapping the ORF in each case. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the A21 coding sequence (CDS) of each of these ORFs, followed by the ORF name, locus name, A20 or A19 coordinates (the assembly from which the ORF was mapped), and A20 or A19 CDS of the ORF.
ORFsWnonsynChanges_alignment.txt - This file contains alignments between the A21 coding sequence (CDS) and the A20/A19 CDS of each ORF that shows non-synonymous changes upon mapping to Assembly 21. The alignments were generated using MUSCLE, with the sequences in the file ORFsWnonsynChanges.txt, and the output is in ClustalW format.
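The distinction between synonymous and non-synonymous changes can be sketched as follows. This is not the code used at CGD. It assumes NCBI translation table 12 (the alternative yeast nuclear code, in which CTG encodes serine), which this page does not name explicitly. It also translates any codon containing an ambiguous base to "X", a simplification of the behavior described in the introduction. All function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Assumption: NCBI translation table 12 (CTG encodes Ser instead of Leu).
TABLE12_AA = "FFLLSSSSYY**CC*WLLLSPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDDDEEEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), TABLE12_AA)
}


def translate(cds):
    """Translate a CDS; codons with non-ACGT bases become 'X'.

    A trailing partial codon, if any, is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def classify_cds_change(a21_cds, old_cds):
    """Compare an A21 CDS with the A20/A19 CDS used for mapping."""
    if a21_cds.upper() == old_cds.upper():
        return "identical CDS"
    if translate(a21_cds) == translate(old_cds):
        return "synonymous"
    return "non-synonymous"


# Hypothetical toy sequences.
print(classify_cds_change("ATGGCTTAA", "ATGGCCTAA"))  # synonymous (GCT and GCC both encode Ala)
print(classify_cds_change("ATGCTGTAA", "ATGCTATAA"))  # non-synonymous (CTG is Ser, CTA is Leu)
print(translate("ATGRCTTAA"))                         # MX*
```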
ORFsWsimpleChanges.txt - This file lists A21 ORFs that show simple changes when compared to the sequence in the (A20/A19) assembly that was used for mapping the ORF in each case. A "simple" change is defined as follows: the nucleotide identity across the ORFs (the genomic region from the start to stop codon, including introns or gaps within the ORF) is 98% or greater, and the aligned region encompasses the entire length of the ORF. The "simple change" category includes ORFs that may contain substitutions, small insertions, and/or small deletions. Cases in which only sequence within an intronic or gap ("adjustment") region has changed, and the translated sequence has not been affected, are also included in this category. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the A21 genomic sequence of each of these ORFs, followed by the ORF name, locus name, A20 or A19 coordinates (the assembly from which the ORF was mapped), and A20 or A19 genomic sequence of the ORF.
ORFsWsimpleChanges_alignment.txt - This file contains alignments between the A21 genomic sequence and the A20/A19 sequence of each ORF that shows simple changes upon mapping to Assembly 21. The alignments were generated using MUSCLE, with the sequences in the file ORFsWsimpleChanges.txt, and the output is in ClustalW format.
ORFsWcomplexChanges.txt - This file lists A21 ORFs that show complex changes when compared to the sequence in the (A20/A19) assembly that was used for mapping the ORF in each case. A "complex" change is defined as follows: the nucleotide identity across the ORFs (the genomic region from the start to stop codon, including introns or gaps within the ORF) is less than 98%, and/or the aligned region does not encompass the entire length of the ORF. This file contains the ORF name, locus name (where available), A21 coordinates, the assembly used for mapping, and the A21 genomic sequence of each of these ORFs, followed by the ORF name, locus name, A20 or A19 coordinates (the assembly from which the ORF was mapped), and A20 or A19 genomic sequence of the ORF.
ORFsWcomplexChanges_alignment.txt - This file contains alignments between the A21 sequence and the A20/A19 sequence of each ORF that shows complex changes upon mapping to Assembly 21. The alignments were generated using MUSCLE, with the sequences in the file ORFsWcomplexChanges.txt, and the output is in ClustalW format.
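The "simple" and "complex" definitions above reduce to a two-part test, sketched below. The function and parameter names are hypothetical. Treating "the aligned region encompasses the entire length of the ORF" as an aligned length at least equal to the ORF length is an assumption. The sketch also assumes the two sequences are already known to differ, since unchanged ORFs are listed separately.

```python
def classify_genomic_change(percent_identity, aligned_length, orf_length):
    """Classify a changed ORF as 'simple' or 'complex'.

    simple:  identity of 98% or greater AND the alignment spans the whole ORF
    complex: identity below 98% and/or the alignment does not span the ORF
    """
    spans_full_orf = aligned_length >= orf_length  # assumed interpretation
    if percent_identity >= 98.0 and spans_full_orf:
        return "simple"
    return "complex"


# Hypothetical examples for a 1,200-nucleotide ORF.
print(classify_genomic_change(99.5, 1200, 1200))  # simple
print(classify_genomic_change(97.9, 1200, 1200))  # complex (identity too low)
print(classify_genomic_change(99.5, 1100, 1200))  # complex (partial alignment)
```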
We first used the A20 sequence of each ORF to find the corresponding region on A21 chromosomes. The full A20 genomic DNA sequence of each ORF (including introns and any gap/adjustment regions) was mapped, using BLAST, to A21 chromosomes. The threshold parameters used were that both the percent identity and alignment length of the BLAST HSP (high-scoring pair) be > 95%. The ORFs that passed this threshold were called "good" full ORF matches. If there was no good match to the A20 sequence of an ORF, we tried the mapping using the A19 sequence of that ORF.
Then, for ORFs with multiple exons, the coding sequence segments (individually referred to as a "subfeature") were each separately mapped to the A21 region that matched the full ORF. Thresholds were stricter for the subfeature mapping: alignment length was required to be 100% and percent identity > 99%. The matched subfeature regions were used to define the ORF location and then various checks were performed on the generated A21 ORF sequences.
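The two sets of thresholds described above can be written as simple predicates. This is a sketch under stated assumptions, not the procedure's actual code: the function names are hypothetical, and "alignment length" is interpreted as the HSP alignment length expressed as a percentage of the query (ORF or subfeature) length.

```python
def percent_coverage(alignment_length, query_length):
    """Alignment length as a percentage of the query length (assumed meaning)."""
    return 100.0 * alignment_length / query_length


def is_good_full_orf_match(percent_identity, alignment_length, orf_length):
    """Full-ORF threshold: identity > 95% and alignment length > 95%."""
    return (percent_identity > 95.0
            and percent_coverage(alignment_length, orf_length) > 95.0)


def is_good_subfeature_match(percent_identity, alignment_length, subfeature_length):
    """Subfeature threshold: alignment length of 100% and identity > 99%."""
    return (percent_identity > 99.0
            and percent_coverage(alignment_length, subfeature_length) >= 100.0)


# Hypothetical HSP values for a 1,000-nucleotide ORF and a 300-nucleotide exon.
print(is_good_full_orf_match(97.2, 980, 1000))   # True
print(is_good_full_orf_match(97.2, 940, 1000))   # False (alignment length 94%)
print(is_good_subfeature_match(99.7, 300, 300))  # True
print(is_good_subfeature_match(99.7, 299, 300))  # False (not full length)
```

An ORF whose A20 sequence fails the first predicate would then be retried with its A19 sequence, as described above.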
Manual review was performed in cases where ORFs did not pass the criteria for a "good" match to A21 using either A20 or A19 sequence. There were 16 such cases; some had narrowly failed the criteria used for computational classification, and were manually assigned to A21 coordinates.
Below is the list of checks that were performed on each ORF.
(1) check that full DNA sequence does not have invalid bases
(2) check that full DNA sequence does not have N characters
(3) check that coding sequence is multiple of three in length (no partial codons)
(4) check that coding sequence start is ATG
(5) check that coding sequence ends with a stop codon
(6) check that coding sequence does not terminate with multiple sequential stop codons
(7) check that coding sequence does not have stop codons internal to the ORF
(8) check that exon-intron boundaries have not changed and, for real introns (as opposed to "adjustments"), compare sequence around the intron and its adjoining exons to make sure it is intact
(9) check if the intron splice sites are canonical, that is 5' : GT and 3' : AG
(10) line up subfeatures and find gaps or overlaps and suggest adjustments (Only exons and real introns were mapped to A21; adjustments were not mapped.)
(11) check coverage of subfeatures: Start of first subfeature should coincide with start of ORF and, similarly, end of last subfeature should be the same as end of ORF.
(12) check that full genomic DNA sequence does not have non-ATGC characters
(13) check that coding sequence does not have non-ATGC characters
ORFs that failed one or more of the following checks were reviewed manually: 3, 4, 5, 7. Where possible, the coordinates of the ORF were updated to address the issue. Please see ORFsWmanualUpdates.txt for a list of ORFs for which the coordinates were updated during curatorial review at CGD.
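The checks that triggered manual review (3, 4, 5, and 7), together with the splice-site check (9), can be sketched as below. This is an illustrative approximation rather than the CGD implementation. The function names are hypothetical, and the stop codons TAA, TAG, and TGA are assumed, since this page does not list them.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: not listed on this page


def failed_cds_checks(cds):
    """Return the numbers of the checks (3, 4, 5, 7) that a CDS fails."""
    cds = cds.upper()
    failed = []
    if len(cds) % 3 != 0:               # (3) no partial codons
        failed.append(3)
    if not cds.startswith("ATG"):       # (4) start is ATG
        failed.append(4)
    if cds[-3:] not in STOP_CODONS:     # (5) ends with a stop codon
        failed.append(5)
    # (7) in-frame stop codons before the terminal codon
    codons = [cds[i * 3:i * 3 + 3] for i in range(len(cds) // 3)]
    internal = codons[:-1] if len(cds) % 3 == 0 else codons
    if any(codon in STOP_CODONS for codon in internal):
        failed.append(7)
    return failed


def has_canonical_splice_sites(intron):
    """(9) Canonical intron: begins with GT (5') and ends with AG (3')."""
    intron = intron.upper()
    return intron.startswith("GT") and intron.endswith("AG")


# Hypothetical toy sequences.
print(failed_cds_checks("ATGGCTTAA"))            # []
print(failed_cds_checks("ATGTAAGCTTAA"))         # [7]
print(failed_cds_checks("ATGGCTGC"))             # [3, 5]
print(has_canonical_splice_sites("gtaagtttag"))  # True
```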
Like Assembly 20, Assembly 21 of the C. albicans sequence was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan. The Assembly 21 (A21) sequence files were submitted by the A21 collaborators and are described in van het Hoog M, Rast TJ, Martchenko M, Grindle S, Dignard D, Hogues H, Cuomo C, Berriman M, Scherer S, Magee BB, Whiteway M, Chibana H, Nantel A, Magee PT. Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol. 2007 Apr 9;8(4):R52. The sequence files are available for download from the CGD Sequence Downloads directory. Please note that the sequences in these files are exactly as released to CGD by the A21 collaborators, prior to any analyses at CGD.