Assembly 19 is now considered to be archival with respect to sequence or gene model annotation updates, with the exception of updates to the mitochondrion, and very occasional updates to the contig annotation (see below).
The Assembly 19 files are not updated routinely. The gene model and sequence updates described in Butler et al. (2009) were made only in Assembly 21 (not Assembly 19), and the incremental sequence or gene model annotation updates that we make as part of our curation process are likewise made in the current assembly (Assembly 21, as of August 2010), and not routinely in the archival assemblies (including Assembly 19). The updates made, and the versions archived, are listed with release notes on the Summary of Genome Versions page.
Mitochondrial updates are the exception, as follows. On 29-Jul-2011, a set of updates to the current tRNA annotation was made, affecting both mitochondrial and nuclear tRNA genes (see http://www.candidagenome.org/cgi-bin/reference/reference.pl?dbid=CAL0000142566). At this time, the current mitochondrial sequence and annotation are shared between Assembly 19 and Assembly 21: the Assembly 21 dataset, as released by the BRI, contained no mitochondrial sequence, so the mitochondrial reference that we include as part of the current Assembly 21 reference set is technically part of Assembly 19. Because the mitochondrial reference is shared between the two assemblies, updating the Assembly 21 tRNA dataset also updated the Assembly 19 mitochondrial tRNA dataset. Hence, the Assembly 19 version designation was advanced from A19-s01-m01-r01 to A19-s01-m02-r01. Please note that this change includes ONLY the updates to the coordinates of mitochondrial tRNAs; no updates to any of the genes on the nuclear chromosomes were made at this time.
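For readers who track these designations in scripts, the following is a minimal Python sketch that splits a designation such as A19-s01-m02-r01 into its numeric fields and reports which field differs between two versions. The pattern is inferred only from the two designations quoted above; the function names are hypothetical, and the sketch deliberately treats the s, m, and r fields as opaque counters rather than assigning them a meaning.

```python
import re
from typing import List, NamedTuple


class AssemblyVersion(NamedTuple):
    """Numeric fields of a designation such as 'A19-s01-m02-r01'."""
    assembly: int
    s: int
    m: int
    r: int


# Pattern inferred from the two designations quoted in the text (an assumption).
_PATTERN = re.compile(r"^A(\d+)-s(\d+)-m(\d+)-r(\d+)$")


def parse_version(text: str) -> AssemblyVersion:
    """Parse a version designation, raising ValueError if it does not fit."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised version designation: {text!r}")
    return AssemblyVersion(*(int(group) for group in match.groups()))


def changed_fields(old: str, new: str) -> List[str]:
    """Return the names of the fields that differ between two designations."""
    a, b = parse_version(old), parse_version(new)
    return [name for name in AssemblyVersion._fields
            if getattr(a, name) != getattr(b, name)]


print(changed_fields("A19-s01-m01-r01", "A19-s01-m02-r01"))  # ['m']
```

Applied to the update described above, only the m field changes.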
The contig sequences in CGD are from Assembly 19 of the C. albicans genome sequence, from the supplementary material published in the C. albicans sequencing paper, Jones, T., Federspiel, N.A., Chibana, H., Dungan, J., Kalman, S., Magee, B.B., Newport, G., Thorstenson, Y.R., Agabian, N., Magee, P.T., Davis, R.W. and S. Scherer. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334. Supplementary data: http://genome-www.stanford.edu/candida-pnas2004-supplement/. (Older sequence assemblies, including Assemblies 4, 5, and 6, have been archived at CGD. These data may be retrieved from the "archived_assemblies" folder on the CGD Sequence Download Page.)
The Assembly 19 ORF coordinates (displayed on the Locus pages and displayed in the GBrowse Genome Browser, and available for download) come from the supplementary material published by Jones et al. and also from the Candida Annotation Working Group (AWG, http://candida.bri.nrc.ca/candida/index.cfm). The AWG formed at the ASM Conference on Candida and Candidiasis held January 13-17, 2002, in Tampa, FL. The group consists of researchers who volunteered their own time to annotate the genome (see Braun et al. 2005).
While the Assembly 19 AWG ORF set and the Jones et al. ORF set are similar, they are not identical. Members of the AWG have updated the ORF set of Jones et al. to include known and predicted introns. In addition, the AWG has adjusted the sequence to eliminate presumed sequence errors that create internal stops or frameshifts, and to change the 5' and 3' boundaries of some ORFs.
CGD has incorporated into Assembly 19 the AWG's changes that can be represented by "gaps" in the ORF sequences (or "joins" between regions of sequence along a contig) proposed by Jones et al. Approximately 400 of the ORFs in Assembly 19 are therefore presented as multiple conjoined coding sequences within CGD. Approximately 215 of these are the result of introns, and the remainder are due to short sequence gaps introduced by the AWG. CGD has not introduced any actual changes into the experimentally determined, published Assembly 19 contig sequences.
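To illustrate how an ORF can be represented as multiple conjoined coding segments without altering the contig sequence, here is a minimal Python sketch. It is not CGD's loading code: the function name, the 1-based inclusive coordinate convention, and the toy contig are all assumptions made for the example.

```python
from typing import List, Tuple

# Complement table for reverse-strand features (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def spliced_sequence(contig: str,
                     segments: List[Tuple[int, int]],
                     reverse: bool = False) -> str:
    """Join coding segments of a contig into one coding sequence.

    segments: (start, end) pairs, assumed 1-based and inclusive.
    reverse:  if True, return the reverse complement of the joined segments.
    The contig string itself is never modified.
    """
    # Sort by start so segments are joined in contig order.
    pieces = [contig[start - 1:end] for start, end in sorted(segments)]
    joined = "".join(pieces)
    if reverse:
        joined = joined.translate(_COMPLEMENT)[::-1]
    return joined


# Toy contig: two coding segments (3-7 and 14-18) separated by a skipped
# region, which could stand for an intron or a short gap.
toy_contig = "AAATGCCGTTTTTGATAACC"
print(spliced_sequence(toy_contig, [(3, 7), (14, 18)]))  # ATGCCGATAA
```

Because the join is expressed purely as coordinates, the underlying contig sequence stays exactly as published.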
The Assembly 19 orf19s were loaded into CGD using the following procedure:
Note: CGD has not included the predicted changes in ORF boundaries that would require alteration of the contig sequence. Therefore, the predicted changes at the 5' and 3' ends of orf19s have not been incorporated. The Assembly 19 sequence files available from the AWG and from CGD differ in this respect.
Please also note that the AWG's sequence adjustments were typically made in the context of only one of the two alleles of each gene. For example, the orf19.5007 allele of ACT1 is depicted with an intron, whereas the orf19.12474 allele is not (see the ACT1 Locus Page). This difference is reflected in the predicted protein sequence translated from these ORFs. In the case of ACT1, the translation of orf19.12474 begins from an ATG that is downstream of the intron and N-terminal protein coding sequence of orf19.5007.
Assembly 19 is not entirely complete, so some ORFs are truncated by the end of a contig rather than terminated by a stop codon. Consequently, the protein products of 3'-truncated genes (e.g., orf19.1004) do not end with the * that denotes a translational stop in the protein sequence file. Likewise, some genes are truncated at the 5' end (e.g., orf19.1021). The position of a gene on its contig may be assessed using the GBrowse genome browser. Click on the small map on the right-hand side of any Locus Page to view the gene in the context of the contig that contains it.
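As a rough way to list candidate 3'-truncated entries in a protein sequence file, the sketch below reads FASTA-formatted lines and reports the records whose sequence does not end with *. The function names and the example records are hypothetical, and the check assumes that every complete protein in the file carries the terminal *, as described above.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first whitespace-delimited token.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def lacks_terminal_stop(lines: Iterable[str]) -> List[str]:
    """Return identifiers of proteins whose sequence does not end with '*'."""
    return [name for name, seq in read_fasta(lines) if not seq.endswith("*")]


# Made-up records for illustration only.
example = [
    ">example_complete hypothetical record",
    "MKTAYLL",
    "QRS*",
    ">example_truncated hypothetical record",
    "MKTAYLLQRS",
]
print(lacks_terminal_stop(example))  # ['example_truncated']
```

To scan a downloaded file, pass an open file handle instead of the list, for example `lacks_terminal_stop(open(path))`, where `path` is wherever the protein file was saved.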
In cases where both alleles of a gene are identical, the sequence has been assembled as a single ORF. Therefore, genes with identical alleles have only a single orf19 designation, and these genes have a single entry in the "orf_coding.fasta" file.
Note: This section contains documentation from the Stanford Genome Technology Center (SGTC). This documentation was previously available on the SGTC's Candida information server, and has been archived here (verbatim) for reference.
"Assembly 19 Release Notes The Candida genome contains regions that are homozygous, and others that are not. In homozygous regions, the assembler can combine reads from both alleles into the same contig. In heterozygous regions where the level of heterozygosity is low, it can do the same in spite of a few disagreements between alleles (it treats the polymorphisms as if they resulted from sequencing errors). From the assembler's pointof view these regions are effectively homozygous. In these release notes, the term "homozygous" should be interpreted as looking homozygous to the assembler, and a low level of polymorphsim between alleles can still be found in the homozygous regions. Assembly 19 does not currently contain information on polymorphisms in such regions. In the near future we will provide annotation of such residual polymorphisms. In regions with more than minimal divergence between alleles, the assembler must put reads from the two alleles into different contigs. This happened frequently in assembly 6, resulting in considerable fragmentation and difficulty in interpretation, e.g., in distinguishing allele pairs from family members. In assembly 19, we have developed techniques to detect separate assembly of alleles and to combine separated contigs from assembly 6 into diploid contigs in assembly 19. For most contigs in assembly 19, we present distinct sequences for the two alleles. Contig numbering. For some contigs from assembly 6, we found no indication of allele sequence assembled separately. Such contigs passed unchanged, except possibly for minor differences in trimming of low-quality bases at the end, into assembly 19. Contigs of this kind have the same number in assembly 19; for example, Contig19-1785 is the same as Contig6-1785, and is presumed to be homozygous. When we were able to detect separation of alleles in assembly 6, we combined the affected assembly 6 contigs into larger diploid contigs in assembly 19. 
All contigs so formed were assigned numbers starting at 10000; for example, Contig19-10014 is made up from contigs 6-1076, 6-2434, 6-1473, 6-1632, 6-2141, and 6-2001. A diagram is provided in PDF format for Contig19-10014 (and all others) showing how it is formed from assembly 6 contigs. A dotted line separates the assembly 6 contigs assigned to the two alleles. In regions where one allele has a gap, the sequence is presumed to be homozygous and is filled in from the other allele. Otherwise the top allele derives its sequence from the assembly 6 contig shown above the dotted line, and the bottom allele from the contig at the same position shown below the line. This process results in two sequences representing the two alleles for the contig. The top allele is arbitrarily designated as primary, and the sequence given for Contig19-10014 is that derived from the top set of assembly 6 contigs. The sequence for the other allele is given the name Contig19-20014 (i.e., add 10000 to the number of the primary allele). In viewing the diagrams, note that because of insertions and deletions between alleles, corresponding positions on the two alleles are not always connected by a direct vertical line, but usually in large diploid contigs the size of insertions is visually negligible.
Contig19-10262 is exceptional in that it was constructed by joining two assembly 6 contigs linked based on sequence obtained from Genbank, with no evidence of separation of alleles. Accordingly it does not have a second allele, and there is no contig 19-20262.
ORFs. ORFs were called using the same methods described for assembly 6, with one addition. In a small number of cases, the construction of diploid contigs involved the insertion of blocks of "N" bases to fill gaps on one allele where evidence indicated that the sequence should not be filled in from the other allele. Usually the number of N's to be inserted was known at best approximately. To avoid having ORFs crossing large blocks of N of essentially arbitrary length, ORF calling stopped at any group of 12 or more N's, and ORFs that run up against such N-blocks are labeled using the same incompleteness rules applied to ORFs running off the ends of contigs in assembly 6. ORFs of this type are identifiable by inclusion of 12 N's at the affected end, which translate to 4 X's in the protein sequence. ORFs were called using both alleles of the diploid contigs. There are 14220 ORFs in the complete set so obtained. In many cases ORFs are exactly duplicated between alleles.
ORF Alleles. Nonredundant Protein Set. A computational process identified pairs of ORFs that are deemed to be alleles based on position and protein sequence similarity. Generally the identification of alleles is straightforward. In complicated instances we recommend examination of the ORFs and blast results to understand the situation. The web pages identify ORFs designated as alleles and give indications of which cases are complicated. The allele pairs were used to generate a nonredundant protein set using the following rule: whenever the protein sequences for a pair of alleles were identical, the translation of the ORF derived from the secondary allele (the 20000-series contig) was excluded from the nonredundant protein set. This set of proteins was used as the blast database in performing the searches of Candida ORFs against all other Candida ORFs. There are 9259 proteins in the nonredundant set."
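The contig numbering convention and the nonredundant-set rule described in the SGTC notes above can be sketched in Python as follows. This is an illustration, not the SGTC's actual pipeline: the function names, the ORF identifiers, and the assumption that primary diploid contigs occupy the range 10000 to 19999 are all introduced here for the example.

```python
import re
from typing import Dict, List, Optional, Tuple

_CONTIG = re.compile(r"^Contig19-(\d+)$")


def contig_number(name: str) -> int:
    """Extract the numeric part of an assembly 19 contig name."""
    match = _CONTIG.match(name)
    if match is None:
        raise ValueError(f"not an assembly 19 contig name: {name!r}")
    return int(match.group(1))


def secondary_allele_name(primary: str) -> Optional[str]:
    """Return the name of the second-allele contig, or None if there is none.

    Assumes primary diploid contigs are numbered 10000-19999 (an assumption;
    the notes only say that numbering starts at 10000).
    """
    number = contig_number(primary)
    if not 10000 <= number < 20000:
        return None  # presumed homozygous, or already a secondary contig
    if number == 10262:
        return None  # exception stated in the notes: no contig 19-20262
    return f"Contig19-{number + 10000}"


def nonredundant(proteins: Dict[str, str],
                 allele_pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Drop the secondary-allele protein whenever it equals its primary.

    allele_pairs holds (primary_orf, secondary_orf) identifiers.
    """
    excluded = {secondary for primary, secondary in allele_pairs
                if proteins[primary] == proteins[secondary]}
    return {orf: seq for orf, seq in proteins.items() if orf not in excluded}


print(secondary_allele_name("Contig19-10014"))  # Contig19-20014
print(secondary_allele_name("Contig19-1785"))   # None

# Hypothetical ORF identifiers and sequences, for illustration only.
proteins = {"orfA1": "MKT*", "orfA2": "MKT*", "orfB1": "MSL*", "orfB2": "MSV*"}
pairs = [("orfA1", "orfA2"), ("orfB1", "orfB2")]
print(sorted(nonredundant(proteins, pairs)))  # ['orfA1', 'orfB1', 'orfB2']
```

In the example, both alleles of the second pair are kept because their translations differ, while the identical secondary allele of the first pair is dropped.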
PLEASE NOTE: Archived copies of the PDF diagrams mentioned in the SGTC documentation are available from the CGD Downloads page. You may download the Assembly 19 Contig Diagram files or view the README file.