The Candida Genome Database: Assembly 6 in CGD

This page provides documentation from the Stanford Genome Technology Center (SGTC). This documentation was previously available on the SGTC's Candida information server, and has been archived here (verbatim) for reference.

"Assembly 6 Release Notes

Assembly 6 is expected to be the final assembly of Candida albicans sequence data starting from the individual
reads. After preprocessing steps to remove most highly repeated sequences, a total of 313,165 reads were
assembled into 1213 contigs 2kb or greater. These data represent 10.4X mean coverage assuming a haploid genome
size of 15.5MB (excluding repeats such as the rDNA, which assemble as a single copy). The contigs add to 17.4MB,
exceeding the genome size because heterozygous regions assemble separately. The mtDNA is not included
in the assembly as it must be translated with a different genetic code.

Translation of the assembly resulted in 9168 ORFs capable of encoding proteins 100aa or greater in length
including ambiguities. In general, the ORFs contained a start and stop codon. ORFs extending to the end of
a contig but lacking a stop were included because, except in rare cases where the contig lies at a chromosome
end, they will eventually reach one. Reading frames that remain open upstream to the beginning of a contig were
divided into two classes. Those that contained a 100aa or greater ORF with a start codon within them are
represented as the smaller ORF with the start because upstream sequences are more likely to encounter a
stop before a start. For completeness, those lacking an internal 100aa ORF with a start are included up to
the beginning of the contig.

Along with assembled DNA, a reference set of ORFs is being provided with assembly 6. The ORFs are numbered
sequentially with the lowest numbers deriving from the contigs with the lowest numbers. A typical fasta
header line reads as follows:

orf6.2.prot orf6-1097:610-281:e 330 bp, 109 aa, contig 2283 bp

This is interpreted as the protein sequence for ORF 2 from assembly 6. It derives from nucleotides 610-281
in contig 6-1097 (a start coordinate greater than the stop indicates it is read from the complementary strand).
The letter "e" indicates that an entire ORF is present (start and stop). The letter "i" is used to indicate an
incomplete ORF, followed by the letter N or C to indicate which end of the ORF is incomplete. The letter "n" is
used to indicate that while an entire ORF is given, the reading frame remains open to the beginning of the
contig. A count of codons that could be used to extend such an ORF is given in the header line. Examples are:

orf6.1.prot orf6-1072:1371-1:iC 1371 bp, 457 aa, contig 2182 bp
orf6.4.prot orf6-1097:2281-1868:iN 414 bp, 137 aa, contig 2283 bp
orf6.14.prot orf6-1174:18-545:en 528 bp, 175 aa, contig 2692 bp upcont=5

Fasta headers for the DNA sequences of ORFs are formed in the same way except that the ORF name is lacking
the ".prot" extension. ORF DNA sequences are always given as the sense strand.

The set of 9168 ORFs contains a large number of ORFs that are internal to or are overlapping with larger
ORFs. These smaller ORFs are currently included for completeness and will be removed from the reference
set at a later date. In a smaller number of cases, two ORFs are parts of the same gene. Causes include:
introns, gaps in the current sequence, and remaining frameshifts."
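
The header layout described in the release notes can be read mechanically. The following is a minimal Python sketch, not part of the original SGTC documentation. The regular expression is inferred only from the four example headers quoted above, so it is an assumption that every header in the reference set follows the same shape. The names parse_orf_header and OrfHeader are hypothetical and chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the example headers in the release notes (assumption:
# all headers share this layout). A leading ">" is accepted but not required.
HEADER_RE = re.compile(
    r"^>?(?P<name>orf\d+\.\d+)(?P<prot>\.prot)?\s+"
    r"(?P<contig>[^\s:]+):(?P<start>\d+)-(?P<stop>\d+):(?P<flags>[A-Za-z]+)\s+"
    r"(?P<bp>\d+) bp, (?P<aa>\d+) aa, contig (?P<contig_bp>\d+) bp"
    r"(?:\s+upcont=(?P<upcont>\d+))?\s*$"
)


@dataclass
class OrfHeader:
    name: str               # e.g. "orf6.2" (without the ".prot" extension)
    is_protein: bool        # True when the name carried ".prot"
    contig: str             # contig identifier as written in the header
    start: int              # first coordinate in the header
    stop: int               # second coordinate in the header
    flags: str              # raw status letters, e.g. "e", "iC", "iN", "en"
    orf_bp: int             # ORF length in bp as stated in the header
    protein_aa: int         # protein length in aa as stated in the header
    contig_bp: int          # contig length in bp as stated in the header
    upcont: Optional[int]   # codons available upstream, when given

    @property
    def strand(self) -> str:
        # A start coordinate greater than the stop means the complementary strand.
        return "-" if self.start > self.stop else "+"

    @property
    def span(self) -> int:
        # Number of bases covered by the coordinates, inclusive of both ends.
        return abs(self.start - self.stop) + 1


def parse_orf_header(line: str) -> OrfHeader:
    """Parse one header line; raise ValueError if it does not match."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized header: {line!r}")
    return OrfHeader(
        name=m["name"],
        is_protein=m["prot"] is not None,
        contig=m["contig"],
        start=int(m["start"]),
        stop=int(m["stop"]),
        flags=m["flags"],
        orf_bp=int(m["bp"]),
        protein_aa=int(m["aa"]),
        contig_bp=int(m["contig_bp"]),
        upcont=int(m["upcont"]) if m["upcont"] else None,
    )


if __name__ == "__main__":
    examples = [
        "orf6.2.prot orf6-1097:610-281:e 330 bp, 109 aa, contig 2283 bp",
        "orf6.1.prot orf6-1072:1371-1:iC 1371 bp, 457 aa, contig 2182 bp",
        "orf6.4.prot orf6-1097:2281-1868:iN 414 bp, 137 aa, contig 2283 bp",
        "orf6.14.prot orf6-1174:18-545:en 528 bp, 175 aa, contig 2692 bp upcont=5",
    ]
    for line in examples:
        h = parse_orf_header(line)
        print(h.name, h.contig, h.strand, h.span, h.flags, h.upcont)
```

The status letters are kept as a raw string rather than decoded further, since the release notes give only these four combinations. Comparing span with orf_bp gives a simple consistency check on each header.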

Note: The original SC5314 sequence trace files and quality scores generated by the Stanford Genome Technology Center are available for download from CGD.
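
The read and coverage figures quoted in the release notes can be cross-checked with simple arithmetic. The sketch below is illustrative only and rests on stated assumptions: "MB" is read as 10^6 base pairs, and coverage is taken as total assembled read bases divided by the haploid genome size. The mean read length it derives is implied by those figures and is not a value stated by SGTC. The function names are hypothetical.

```python
def implied_mean_read_length(reads: int, coverage: float, genome_bp: float) -> float:
    """Mean read length implied by coverage = reads * mean_length / genome size."""
    return coverage * genome_bp / reads


def excess_over_haploid(contig_total_bp: float, genome_bp: float) -> float:
    """Fraction by which the summed contig length exceeds the haploid genome size."""
    return (contig_total_bp - genome_bp) / genome_bp


if __name__ == "__main__":
    # Figures taken from the Assembly 6 release notes.
    reads = 313_165
    coverage = 10.4
    genome_bp = 15.5e6        # assumption: 15.5MB means 15.5 million bp
    contig_total_bp = 17.4e6  # assumption: 17.4MB means 17.4 million bp

    mean_len = implied_mean_read_length(reads, coverage, genome_bp)
    excess = excess_over_haploid(contig_total_bp, genome_bp)
    print(f"implied mean read length: {mean_len:.1f} bp")
    print(f"contig total exceeds haploid size by {excess:.1%}")
```

Under these assumptions the excess of contig length over the haploid genome size gives a rough sense of how much sequence assembled separately as heterozygous regions.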

