Assembly 20 of the C. albicans sequence was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan.

** PLEASE NOTE: Assembly 20 Sequence Advisory **
** posted October 19, 2006, updated October 25, 2006 **
**
** The collaborative group who generated Assembly 20 has discovered that
** the sequence traces that they had been using to fill some of the gaps
** and determine overlaps between Assembly 19 contigs were derived from
** strain WO-1, rather than from the reference strain, SC5314.
**
** Please see http://www.candidagenome.org/help/Assembly20_Advisory.shtml
** for the latest information and status updates.

Whereas Assembly 19 is a diploid assembly that includes both alleles of each gene for cases in which they show significant sequence differences, Assembly 20 is a haploid assembly: in the production of Assembly 20, updates to Assembly 19 have been made in only one allele of each pair, though in some cases, genes may have been assembled from data from the two different alleles. The chromosomes may be thought of as 'reftigs', where they are mosaics of haplotypes, rather than representative of a single haploid genome in the sequenced strain.

The process used to generate this assembly is described on the project web site at URL: http://candida.bri.nrc.ca/candida/alignments/index.cfm. The files generated by these groups are posted at URL: http://candida.bri.nrc.ca/alignments/editedEMBL/final. All of the Assembly 20 data in CGD come from these EMBL-format files.

The Assembly 20 files were processed at CGD to identify and classify changes that occurred between Assembly 19 and Assembly 20, and to identify other features in which users may be interested (e.g., introns), as described in detail below.
Files containing all of these analyses (ORF lists, sequences, and/or alignments) are available for download from CGD (http://www.candidagenome.org/DownloadContents.shtml).

Assembly 20 ORF Classification:

The entire classification is summarized in the file "ClassificationTablePerGene.xls." Sequence comparisons between Assembly 19 and Assembly 20 were performed. Each ORF from Assembly 20 has been classified according to how it changed between Assembly 19 and Assembly 20. The classifications for each gene appear on its CGD Locus page, next to the "Feature type" heading. In addition, ORFs in each category have explanatory Locus History notes on the CGD web site.

The source of all Assembly 20 information is the set of EMBL-format files posted at http://candida.bri.nrc.ca/alignments/editedEMBL/final/

Ca20Chr1.03s02.embl    07/21/2006 11:30:42 AM
Ca20Chr2.03s03.embl    05/10/2006 02:09:50 PM
Ca20Chr3.03s01.embl    05/11/2006 09:58:18 AM
Ca20Chr4.02s01.embl    05/11/2006 10:56:56 AM
Ca20Chr5.02s01.embl    05/11/2006 11:10:14 AM
Ca20Chr6.04s04.embl    05/11/2006 01:34:54 PM
Ca20Chr7.02s02.embl    05/11/2006 01:44:44 PM
Ca20ChrR.04s03.embl    05/09/2006 02:16:36 PM
Ca20FinalMay11.zip     05/11/2006 01:47:06 PM

The source of all Assembly 19 sequence information is the Candida Genome Database, July 2006.

Protein and nucleotide local sequence alignments were performed using bl2seq from the BLAST suite from NCBI. Global nucleotide sequence alignments were performed with the MUSCLE (multiple sequence comparison by log-expectation) software, available at the URL: http://www.drive5.com/muscle/

The following files are available for download:

1) New ORFs in Assembly 20
Criteria: The orf19 name is new in Assembly 20. This assignment was made computationally.
File contains: list of all ORFs that are new in Assembly 20, not present in Assembly 19
File name: NewInAssembly20.txt

2) ORFs deleted from Assembly 20
Criteria: The orf19 name is present in Assembly 19 and it is not present in Assembly 20.
This assignment was made computationally.
File contains: list of all ORFs that were removed during preparation of Assembly 20; they were present in Assembly 19
File name: DeletedFromAssembly20.txt
Note: A subset of the ORFs on this list were subsumed by, or "merged into," another ORF in Assembly 20. Some merged ORFs were combined with a neighboring ORF on the Contig from Assembly 19 (Contig-19). In other cases, an ORF was merged with an ORF that was not adjacent to it in Assembly 19; that is, the Contig-19s containing the two ORFs were not associated with each other in Assembly 19 but have been assembled next to, or overlapping with, each other in Assembly 20.

3) ORFs with no sequence change in Assembly 20
Criteria: The nucleotide sequence of the ORF in Assembly 19 and 20 is the same (sequence across the whole ORF, including any intronic sequence). This assignment was made computationally.
File contains: list of all ORFs with no changes to the nucleotide sequence between Assembly 19 and Assembly 20
File name: NoSeqChangeInAssembly20.txt
Note: These criteria do not exclude ORFs in which adjustments have been made to the position of an intron without any change in the underlying sequence.

4) ORFs with synonymous changes ONLY, between Assembly 19 and Assembly 20
Criteria: The nucleotide sequence of the coding sequence or CDS, excluding any intronic sequence, is not the same between the two assemblies; however, the translated amino acid sequence is the same. This assignment was made computationally.
File contains: list of all ORFs with only synonymous changes between Assembly 19 and Assembly 20 (the nucleotide sequence has changed, yet the predicted amino acid translation is unchanged), with nucleotide alignments between the sequence of the ORF in Assembly 19 and the sequence of the ORF in Assembly 20
File name: SynonymousOnlyChangeInAssembly20.txt
Note: ORFs classified in the categories "Simple Sequence Changes" and "Complex Sequence Changes" may have synonymous changes in addition to other, nonsynonymous sequence changes.
Note: Problem ORFs that have been extended by one or two basepairs in Assembly 20, in the absence of other sequence changes that affect the translated sequence, will meet the criteria for inclusion in this category.

5) ORFs with simple sequence changes in Assembly 20
Criteria: The aligned region encompasses the entire length of the ORF in both Assembly 19 and Assembly 20, and amino acid identity is 98% or greater. This assignment was made computationally.
File contains: list of all ORFs with small changes in protein sequence between Assembly 19 and Assembly 20, with protein sequence alignments.
File name: SimpleSeqChangesInAssembly20.txt
Note: This category includes ORFs that may contain substitutions, small insertions, and/or small deletions, yet overall identity between the two predicted protein sequences is 98% or greater. Cases in which only intronic sequence has changed, and the translated sequence has not been affected, are also included in this category.

6) ORFs with complex sequence changes in Assembly 20
Criteria: ORF has changed in nucleotide sequence, and changes do not fall into the "synonymous changes only" or "simple amino acid changes" categories. This assignment was made computationally.
File contains: list of all ORFs that have changed significantly in sequence between Assembly 19 and Assembly 20, with protein sequence alignments.
File name: complexSeqChangesInAssembly20.txt
Note: This category includes ORFs that may contain substitutions, insertions, deletions, and/or changes to the 5' and/or 3' boundary (annotation changes, in which the ORF boundary is moved without an underlying sequence change, or sequence changes). The protein alignment may show 100% identity if complex changes have taken place outside of the aligned region (e.g., if the N- or C-terminal region has been changed).

7) Excel-format spreadsheet of all Assembly 20 ORFs and a classification of the type of change, if any, that affected the ORF between Assembly 19 and Assembly 20
File contains: Excel workbook with two worksheets. The first worksheet contains a list of all of the Assembly 20 ORFs and their classification into the six categories outlined above. The columns in the first worksheet are as follows:
A) Assembly 20 ORF name
B) Complex Sequence changes in Assembly 20
C) New in Assembly 20
D) No change in Assembly 20
E) Simple sequence including substitutions and indels in Assembly 20
F) Synonymous changes Only in Assembly 20
G) Chromosome
H) Start
I) Stop
J) Strand
K) Exon segments
L) Contig19 coordinates
A "1" in columns B through F indicates that the ORF is classified in the category. The second worksheet contains a list of all of the Assembly 19 ORFs that are not present in Assembly 20, and the Contig19 name and coordinates.
File name: ClassificationTablePerGene.xls

8) Tab-delimited text file of all Assembly 20 ORFs and a classification of the type of change, if any, that affected the ORF between Assembly 19 and Assembly 20
File contains: A list of all of the Assembly 20 ORFs and their classification into the six categories outlined above.
The columns are as follows:
A) Assembly 20 ORF name
B) Classification (into the categories described for files 1-6, above)
C) Chromosome (ORF name appears in this column if ORF is classified as "deleted from Assembly 20")
D) Start coordinate on chromosome (Contig coordinates appear in this column if ORF is classified as "deleted from Assembly 20")
E) Stop coordinate on chromosome
F) Strand
G) Exon Segments
H) Contig19 coordinates
File Name: ClassificationPerGene.txt

9) Merged ORFs
Criteria: Merged ORFs were evaluated as follows: The Assembly 19 nucleotide sequence, with any introns, of each of the ORFs that were deleted from Assembly 20 was compared by BLAST against the set of all Assembly 20 ORFs (nucleotide sequence, with introns). A strong match indicates that the deleted ORF may have been subsumed by the Assembly 20 ORF. Such candidates were evaluated *manually*. If the orf19 names of the possible merged pair were numerically close to each other (e.g., orf19.1556 and orf19.1555), the candidate pairs were evaluated in the GBrowse genome browser. If the ORFs overlapped on the same strand, the pair was scored as "merged." If the ORFs did not overlap, or were on opposite strands, the pair was scored as "not merged." The possible merged pairs with orf19 names that were not close to each other were evaluated in the GBrowse genome browser displaying the position of the Assembly 19 contigs overlaid on the Assembly 20 chromosomes. The ORFs were scored as "merged" if they were located on the overlapping segments of the adjacent contigs or if they spanned a junction between the adjacent contigs.
File contains: The Feature name (orf19 name) of the ORF that remains after the merge, the Locus name (e.g., ABC1) of the ORF that remains after the merge, the Feature name (orf19 name) of the ORF that is deleted (subsumed) during the merge, the Locus name (e.g., ABC1) of the deleted/subsumed ORF.
File name: MergedORFs.txt

10) ORFs truncated by contig ends in Assembly 19, along with the new coordinates in Assembly 20
Criteria: In Assembly 19, one terminus of the ORF was a contig end.
File contains: ORF name, chromosomal coordinates in Assembly 20, contig coordinates in Assembly 19, length of protein in Assembly 20, length of protein in Assembly 19. Tab-delimited file.
File name: OrfsAtEndOfContigInAssembly19.txt
Note: This does not include any ORF whose terminus was near, but not at, the end of a contig in Assembly 19 and which was extended in Assembly 20. However, these ORFs are classified as having "complex sequence changes" as described above.

11) ORFs containing gaps/introns/adjustments in Assembly 19
Criteria: ORFs from Assembly 19 are included in this category if the coding sequence (CDS) comprises more than one segment.
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron; global nucleotide alignment of the entire sequence (including the introns) to the CDS (with introns removed)
File name: OrfsWithIntrons_Assembly19.txt
Note: ***This category includes gaps that are NOT bona fide introns.*** The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. Note that the lengths of some introns/gaps are negative numbers (i.e., a region of the exon is counted twice). All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation. If there are multiple gaps/introns, the sizes of the gaps/introns are separated by commas.
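The gap sizes described in the note above, including the negative values, can be illustrated with a short sketch. This is not the code used at CGD; it is a minimal Python example that assumes each CDS segment is given as a (start, end) pair of 1-based, inclusive coordinates listed in forward-strand order. The function names are hypothetical.

```python
def gap_sizes(segments):
    """Return the size of the gap between each pair of consecutive segments.

    segments: list of (start, end) tuples, 1-based inclusive coordinates,
    in forward-strand order (an assumption made for this sketch).
    A negative size means the two segments overlap, so that part of the
    sequence is counted twice.
    """
    sizes = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        sizes.append(next_start - prev_end - 1)
    return sizes


def format_gap_sizes(segments):
    """Join multiple gap sizes with commas, as in the download files."""
    return ",".join(str(size) for size in gap_sizes(segments))


if __name__ == "__main__":
    # Two segments separated by 50 nucleotides.
    print(format_gap_sizes([(100, 200), (251, 400)]))
    # Two segments that overlap by 2 nucleotides (positions 199 and 200).
    print(format_gap_sizes([(100, 200), (199, 400)]))
```

A very small or negative value in this output is the kind of entry that the note above flags as an annotator's adjustment rather than a bona fide intron.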
12) ORFs containing gaps/introns/adjustments in Assembly 19 (without alignments)
Criteria: ORFs from Assembly 19 are included in this category if the coding sequence (CDS) comprises more than one segment. This file is identical to the file OrfsWithIntrons_Assembly19.txt, except that it does NOT contain the alignments and is therefore more amenable to viewing as a spreadsheet.
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron. The file is in tab-delimited text format.
File name: OrfsWithIntrons_Assembly19_List.txt

13) ORFs containing gaps/introns in Assembly 20
Criteria: ORFs from Assembly 20 are included in this category if the coding sequence (CDS) comprises more than one segment.
File contains: ORF name; chromosome and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron; global nucleotide alignment of the entire sequence (including the introns) to the CDS (with introns removed). The ortholog assignments have been updated to reflect the Assembly 20-based mapping generated on November 26, 2006.
File name: OrfsWithIntrons_Assembly20.txt
Note: ***This category includes gaps that are NOT bona fide introns.*** The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. Some of the gaps introduced by the Annotation Working Group have a length that is a negative number; that is, the coding sequence comprises two overlapping segments, such that some sequence is counted twice. These are called "Adjustments," rather than "Introns" on the Locus page of the affected ORFs.
Like the introns/gaps that are small in size, these "adjustments" should also be considered flags that indicate that resequencing of the area is advised.
Please also note: Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. In several such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). These ORFs are the following: orf19.1261, orf19.130, orf19.1639, orf19.1693, orf19.2440, orf19.3245, orf19.4136, orf19.5880. After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD adjusted the position of these gaps to restore their position as defined in Assembly 19. The other sequence will remain as-is in CGD until further information is available.
All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation. We provide the size of the intron/gap/adjustment in Assembly 20 and information about the S. cerevisiae ortholog in this file to facilitate initial assessment. If there are multiple gaps/introns, the sizes of the gaps/introns are separated by commas.

14) ORFs containing gaps/introns/adjustments in Assembly 20 (without alignments)
Criteria: ORFs from Assembly 20 are included in this category if the coding sequence (CDS) comprises more than one segment. This file is identical to the file OrfsWithIntrons_Assembly20.txt, except that it does NOT contain the alignments and is therefore more amenable to viewing as a spreadsheet.
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron.
The ortholog assignments have been updated to reflect the Assembly 20-based mapping generated on November 26, 2006. The file is in tab-delimited text format.
File name: OrfsWithIntrons_Assembly20_List.txt

15) ORFs with changes to intron/gap/adjustment regions between Assembly 19 and Assembly 20
Criteria: Assembly 20 ORFs are included if the number or nucleotide sequence of introns/gaps/adjustments differs between Assembly 19 and Assembly 20.
File contains: ORF name; coordinates of exons in Assemblies 20 and 19; alignment of the Assembly 19 genomic nucleotide sequence (coding sequence plus intron(s)) vs. the Assembly 20 version; alignment of the Assembly 19 ORF protein sequence vs. the Assembly 20 version.
File name: intronChangesInAssembly20.txt (show alignments: nucleotide and translations)
Note: Small changes in coordinates may not result in changes at either the nucleotide or amino acid sequence levels.
Note: ***Not all gaps are bona fide introns.*** The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation.
Please also note: Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. In several such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD adjusted the position of these gaps to restore their position as defined in Assembly 19.
16) ORFs with changes to intron/gap/adjustment regions between Assembly 19 and Assembly 20 (without alignments)
Criteria: Assembly 20 ORFs are included if the number or the nucleotide sequence of introns/gaps differs between Assembly 19 and Assembly 20. This file is identical to the file intronChangesInAssembly20.txt, except that it does NOT contain the alignments and is therefore more amenable to viewing as a spreadsheet.
File contains: ORF names
File name: intronChangesInAssembly20_OrfList.txt

17) Problem ORFs that have internal stop codons (with translation)
Criteria: This set of ORFs has a stop codon within the reading frame, as presented in the Assembly 20 files from the Annotation Working Group.
File contains: List of ORFs in this category, with nucleotide sequence (full, including any intronic sequence), coding sequence (CDS, with introns removed), and amino acid translation
File name: OrfsWithInternalStopCodonsInAssembly20.txt
Note: Most of the stop codons are near the end of the ORF described in the Assembly 20 file. Some are followed by a few residues of predicted protein sequence; others are followed by additional stop codons. After loading the data from the original Assembly 20 file and archiving this starting data, CGD plans to adjust the boundary of these ORFs in the database and in the subsequent sequence files. The four exceptions are orf19.4384.1, orf19.3813, orf19.359 and orf19.5775.3 (described in more detail in the file "problemORFInEMBLfiles.txt"); these ORFs will remain as-is in CGD until additional data are available.

18) Problem ORFs that are lacking terminal stop codons
Criteria: This set of ORFs lacks the terminal stop codons, as presented in the Assembly 20 files from the Annotation Working Group.
File contains: List of ORFs in this category, with nucleotide sequence (full, including any intronic sequence), coding sequence (CDS, with introns removed), and amino acid translation
File name: OrfsWithoutEndStopCodonInAssembly20.txt
Note: In most of these cases, adjusting the end coordinates to extend the ORF by a few nucleotides, relative to its coordinates in the initial Assembly 20 release, would append an in-frame stop codon. After loading the data from the original Assembly 20 file and archiving these starting data, CGD has adjusted the boundary of these ORFs. The new coordinates now appear in the CGD sequence files. There are two ORFs that end with undetermined sequence ("NNN"), orf19.2657 and orf19.7398.1, and the termini of these two ORFs will not be modified by CGD in the absence of additional sequence data. In addition, orf19.3073 runs off the end of Assembly 20 Chromosome 4 and it therefore lacks a terminal stop.
Also included in this file are ORFs that extend downstream of an in-frame stop codon by a few residues. (These ORFs are also included in the category, "Problem ORFs that have internal stop codons," and are listed in the file OrfsWithInternalStopCodonsInAssembly20.txt, as described above.) The coordinates of ORFs with in-frame stops within a few codons of the terminus have also been adjusted; they have been truncated so that they end at the stop codon. These adjustments were performed after loading the data from the original Assembly 20 EMBL-format files and archiving this starting data at CGD. The adjustments are now present in the CGD sequence files.
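The two checks described in entries 17 and 18, and the truncation to the first in-frame stop, can be sketched as follows. This is an illustration only, not the procedure used at CGD. It assumes that the input is a CDS string with introns already removed, read in frame from its first base, and that the stop codons are TAA, TAG, and TGA. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed stop codons for this sketch


def codons(cds):
    """Split a CDS into complete codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def has_internal_stop(cds):
    """True if a stop codon occurs anywhere before the final codon."""
    return any(codon in STOP_CODONS for codon in codons(cds)[:-1])


def lacks_terminal_stop(cds):
    """True if the final complete codon is not a stop codon.

    A CDS that ends in undetermined sequence ("NNN") is reported as lacking
    a terminal stop, since "NNN" is not one of the stop codons.
    """
    triplets = codons(cds)
    return not triplets or triplets[-1] not in STOP_CODONS


def truncate_at_first_stop(cds):
    """Return the CDS cut so that it ends at the first in-frame stop codon.

    Returns None if there is no in-frame stop codon at all.
    """
    triplets = codons(cds)
    for index, codon in enumerate(triplets):
        if codon in STOP_CODONS:
            return "".join(triplets[:index + 1])
    return None


if __name__ == "__main__":
    example = "ATGTAAGGGTGA"  # made-up sequence with an early in-frame stop
    print(has_internal_stop(example))
    print(lacks_terminal_stop("ATGGGGGGG"))
    print(truncate_at_first_stop(example))
```

Extending an ORF that lacks a terminal stop would additionally require the chromosomal sequence downstream of the annotated end, which this sketch does not model.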
19) ORFs with partial codons
Criteria: Length of the coding sequence (CDS, with any intronic sequence removed), in nucleotides, is not a multiple of three
File contains: ORF name, nucleotide sequence of the ORF (any intronic sequence included), translated sequence
File name: OrfsWithPartialTerminalCodonInAssembly20.txt
Note: Coordinates of ORFs have been adjusted so that the ORF ends at the stop codon; the extra nucleotides (partial codon) have been removed from the CGD sequence files. These adjustments were performed after loading the data from the original Assembly 20 EMBL-format files and archiving this starting data at CGD. Please also note that this query was run after other coordinate adjustments were made; some of the ORFs with partial codons in Assembly 20 were detected by other queries and corrected before this list was generated (e.g., ORFs without terminal stop codons).

20) ORFs with non-AUG start
Criteria: ORF nucleotide sequence does not begin with ATG
File contains: List of ORFs, with nucleotide sequence (including any intronic sequence)
File name: OrfsWithNonAUGstartInAssembly20.txt
Note: Eight cases in Assembly 20.

21) Missing Contig19s, and the Assembly 19 ORFs that they contain
Criteria: Contig19s are included if they are not listed in the EMBL-format Assembly 20 files
File contains: Contig19 name, name of ORF contained on the missing contig, Locus name (if any) of the ORF, Feature Type of ORF, notes
File name: Missing_contigs.xls
Note: The EMBL-format Assembly 20 files released by the Annotation Working Group/Assembly 20 collaboration specify mapping of some of the Assembly 19 contigs to the Assembly 20 chromosomes; however, not all of the Contig19s are included in the EMBL-format files. The file "Missing_contigs.xls" contains information about the Contig19s that are missing from the EMBL-format Assembly 20 files. Each ORF is contained on a single line; missing Contig19s that comprise multiple ORFs are listed on multiple lines.
The Feature Type of each ORF indicates whether it is present in Assembly 20 and, if so, whether the sequence has changed between Assembly 19 and 20. The notes were entered based on manual investigation by BLAST. Excel format file.

22) Subdivided Contig19s
Criteria: Contig19s that are listed in the EMBL-format file, and which are split into pieces in Assembly 20
File contains: ID of Contig19 fragment; name of Contig19; Assembly 20 chromosome where contig fragment matches; chromosomal coordinates of match
File name: SplitContig19ToChromosomes.txt
Note: The subdivided Contig19 fragments are designated numerically, for example, "Contig19-10070_1," "Contig19-10070_2," "Contig19-10070_3."

23) List of other Contig mapping problems
File contains: Notes on some problems with the Contig19 mapping onto Assembly 20 chromosomes from the EMBL-format files.
File name: problemContigMappingToChr.txt

24) Notes on problematic entries in the Assembly 20 files
File contains: List of problematic ORFs from the Assembly 20 EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration. Notes on the way in which these issues will be handled in CGD.
File name: problemORFInEMBLfiles.txt
Note: This file describes the following types of problems in the EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration:
   orf19 names that have been used for two different regions in the EMBL-format Assembly 20 files (4 cases),
   orf19 name that is used as the name of one ORF and also as an allele name of a different ORF (1 case),
   ORF without a name in the EMBL-format Assembly 20 files (1 case),
   ORFs with internal stop codons that are not amenable to correction by a simple adjustment in the terminal coordinate (4 cases),
   ORFs that are extremely changed in sequence between Assembly 19 and Assembly 20 (4 cases), and
   ORFs that contain a stop codon in Assembly 20 in the absence of any underlying sequence changes (the coordinates of an intronic or gap sequence have changed position ("slipped") between the two assemblies, creating an in-frame stop codon).
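For readers who want to see how the criteria for files 1 through 6 fit together, the classification can be read as a decision procedure. The Python sketch below is one possible reading of those criteria, not the code used at CGD. It assumes that the alignment results (whether the protein alignment spans the full length of both ORFs, and the percent amino acid identity) have already been obtained from a separate alignment step and are passed in as arguments. The Orf type and the classify function are hypothetical names.

```python
from typing import NamedTuple, Optional


class Orf(NamedTuple):
    genomic: str  # nucleotide sequence across the whole ORF, introns included
    cds: str      # coding sequence with any intronic sequence removed
    protein: str  # predicted amino acid translation


def classify(orf19: Optional[Orf], orf20: Optional[Orf],
             full_length_aligned: bool = False,
             aa_identity: float = 0.0) -> str:
    """Assign one category, testing the criteria in the order of files 1-6.

    orf19 / orf20 are None when the orf19 name is absent from that assembly.
    full_length_aligned and aa_identity (percent) are assumed to come from
    a protein alignment performed elsewhere.
    """
    if orf19 is None and orf20 is None:
        raise ValueError("ORF must be present in at least one assembly")
    if orf19 is None:
        return "new"
    if orf20 is None:
        return "deleted"
    if orf19.genomic == orf20.genomic:
        return "no change"
    if orf19.cds != orf20.cds and orf19.protein == orf20.protein:
        return "synonymous only"
    # Intron-only changes reach this point with an unchanged protein
    # and therefore fall into the "simple" category.
    if full_length_aligned and aa_identity >= 98.0:
        return "simple"
    return "complex"


if __name__ == "__main__":
    # Made-up sequences: one synonymous substitution in the second codon.
    before = Orf(genomic="ATGGCTTAA", cds="ATGGCTTAA", protein="MA")
    after = Orf(genomic="ATGGCCTAA", cds="ATGGCCTAA", protein="MA")
    print(classify(before, after, full_length_aligned=True, aa_identity=100.0))
```

The order of the tests matters in this reading: an ORF is only considered for the "simple" category after the "no change" and "synonymous only" tests have failed, and everything that remains is treated as "complex."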