Assembly 20 of the C. albicans sequence was a collaborative effort of groups at the Biotechnology Research Institute of the National Research Council, Canada; the University of Minnesota, USA; and Chiba University, Japan.
WARNING: Assembly 20 Sequence Advisory
The collaborative group that generated Assembly 20 has discovered that the sequence traces used to fill some of the gaps and determine overlaps between Assembly 19 contigs were derived from strain WO-1, rather than from the reference strain, SC5314. The sequences of these regions are consequently expected to be inaccurate where WO-1 sequence has been used, and it is also possible that small contigs have been misassembled based on the WO-1 sequence data.
The Biotechnology Research Institute of the National Research Council of Canada released a list of the regions affected. By comparing the 1kb flanking parts for each suspect region against the Contig19 sequences, CGD was able to reduce the size of many of the suspect regions. In CGD, these regions are reflected in the downloadable GFF files.
A list of the reduced regions and their chromosomal locations may be downloaded. A list of the ORFs that are affected by the reduced regions may also be downloaded.
A list of the original regions and their chromosomal locations may be downloaded. A list of the ORFs that are affected by the original regions may also be downloaded.
The physical mapping data are now available from the University of Minnesota, at http://albicansmap.ahc.umn.edu/index.html. The optical mapping data have been made available by P.T. Magee, and are now archived at CGD. The mapping data, which were used to order and orient contigs, originate exclusively from the reference strain SC5314, and may be downloaded.
To ensure that you are working only with sequence from the reference strain SC5314, you may retrieve data from Assembly 19 or Assembly 21 instead of Assembly 20. Please feel free to contact us with any further questions.
Whereas Assembly 19 is a diploid assembly that includes both alleles of each gene for cases in which they show significant sequence differences, Assembly 20 is a haploid assembly: in the production of Assembly 20, updates to Assembly 19 have been made in only one allele of each pair, though in some cases, genes may have been assembled from data from the two different alleles. The chromosomes may be thought of as 'reftigs', in that they are mosaics of haplotypes rather than representations of a single haploid genome in the sequenced strain. The process used to generate this assembly is described on the project web site at URL: http://candida.bri.nrc.ca/candida/alignments/index.cfm. The files generated by these groups are posted at URL: http://candida.bri.nrc.ca/alignments/editedEMBL/final. All of the Assembly 20 data in CGD come from these EMBL-format files, copies of which have also been archived at CGD.
Assembly 20 does not include the sequence of the mitochondrial DNA. Datasets that contain the mitochondrial genome use the sequence from Assembly 19.
The Assembly 20 files were processed at CGD to identify and classify changes that occurred between Assembly 19 and Assembly 20, and to identify other features in which users may be interested (e.g., introns), as described in detail below. Files containing all of these analyses (ORF lists, sequences, and/or alignments) are available for download from CGD. They may be accessed from the CGD Downloads web page, or downloaded by selecting the hyperlinked file names below.
Assembly 20 ORF Classification: The entire classification is summarized in the file, ClassificationTablePerGene.xls.
Sequence comparisons between Assembly 19 and Assembly 20 were performed. Each ORF from Assembly 20 has been classified according to how it changed between Assembly 19 and Assembly 20. The classifications for each gene appear on its CGD Locus page, next to the "Feature type" heading. In addition, ORFs in each category have explanatory Locus History notes on the CGD web site.
The source of all the Assembly 20 information is the set of EMBL-format chromosome files from the BRI, dated May 11, 2006. These were originally posted at http://candida.bri.nrc.ca/alignments/editedEMBL/final/:
Ca20FinalMay11.zip 05/11/2006 01:47:06 PM
A copy of these files has been archived at CGD (please see the CGD Downloads web page for archived, downloadable files).
The source of all Assembly 19 sequence information used for these analyses is the Candida Genome Database, July 2006.
Protein and nucleotide local sequence alignments were performed using bl2seq from the BLAST suite from NCBI. Global nucleotide sequence alignments were performed with the MUSCLE (multiple sequence comparison by log-expectation) software, available at the URL: http://www.drive5.com/muscle/.
1) New ORFs in Assembly 20
Criteria: The orf19 name is new in Assembly 20; it was not present in Assembly 19. This assignment was made computationally.
File contains: List of all ORFs that are new in Assembly 20
File name: NewInAssembly20.txt.
2) ORFs deleted from Assembly 20
Criteria: The orf19 name is present in Assembly 19 and it is not present in Assembly 20. This assignment was made computationally.
File contains: List of all ORFs that were removed during preparation of Assembly 20; they were present in Assembly 19
File name: DeletedFromAssembly20.txt
Note: A subset of the ORFs on this list were subsumed by, or "merged into" another ORF in Assembly 20. Some merged ORFs were combined with a neighboring ORF on the Contig from Assembly 19 (Contig-19). In other cases, an ORF was merged with an ORF that was not adjacent to it in Assembly 19; that is, the Contig-19s containing the two ORFs were not associated with each other in Assembly 19 but have been assembled next to, or overlapping with, each other in Assembly 20.
3) ORFs with no sequence change in Assembly 20
Criteria: The nucleotide sequence of the ORF in Assembly 19 and 20 is the same (sequence across the whole ORF, including any intronic sequence). This assignment was made computationally.
File contains: List of all ORFs with no changes to the nucleotide sequence between Assembly 19 and Assembly 20
File name: NoSeqChangeInAssembly20.txt
Note: These criteria do not exclude ORFs in which adjustments have been made to the position of an intron without any change in the underlying sequence.
4) ORFs with synonymous changes ONLY, between Assembly 19 and Assembly 20
Criteria: The nucleotide sequence of the coding sequence (CDS), excluding any intronic sequence, differs between the two assemblies, but the translated amino acid sequence is the same. This assignment was made computationally.
File contains: List of all ORFs with only synonymous changes between Assembly 19 and Assembly 20 (the nucleotide sequence has changed, yet the predicted amino acid translation is unchanged), with nucleotide alignments between the sequence of the ORF in Assembly 19 and the sequence of the ORF in Assembly 20
File name: SynonymousOnlyChangeInAssembly20.txt
Note: ORFs classified in the categories "Simple Sequence Changes" and "Complex Sequence Changes" may have synonymous changes in addition to other, nonsynonymous sequence changes.
Note: Problem ORFs that have been extended by one or two basepairs in Assembly 20, in the absence of other sequence changes that affect the translated sequence (and therefore the alignment), will meet the criteria for inclusion in this category.
5) ORFs with simple sequence changes in Assembly 20
Criteria: The aligned region encompasses the entire length of the ORF in both Assembly 19 and Assembly 20, and amino acid identity is 98% or greater. This assignment was made computationally.
File contains: List of all ORFs with small changes in protein sequence between Assembly 19 and Assembly 20, with protein sequence alignments.
File name: SimpleSeqChangesInAssembly20.txt
Note: This category includes ORFs that may contain substitutions, small insertions, and/or small deletions, yet overall identity between the two predicted protein sequences is 98% or greater. Cases in which only intronic sequence has changed, and the translated sequence has not been affected, are also included in this category.
6) ORFs with complex sequence changes in Assembly 20
Criteria: ORF has changed in nucleotide sequence, and changes do not fall into the "synonymous changes only" or "simple amino acid changes" categories. This assignment was made computationally.
File contains: List of all ORFs that have changed significantly in sequence between Assembly 19 and Assembly 20, with protein sequence alignments.
File name: complexSeqChangesInAssembly20.txt
Note: This category includes ORFs that may contain substitutions, insertions, deletions, and/or changes to the 5' and/or 3' boundary (annotation changes, in which the ORF boundary is moved without an underlying sequence change, or sequence changes). The protein alignment may show 100% identity if complex changes have taken place outside of the aligned region (e.g., if the N- or C-terminal region has been changed).
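The criteria for categories 1 through 6 can be read as a decision sequence. The following is a minimal sketch of that reading, not the code that CGD used. The function name, its parameters, the label strings, and the order in which the tests are applied are all assumptions made for illustration. The alignment coverage and percent identity are taken as inputs (for example, from a bl2seq protein alignment) rather than computed here.

```python
from typing import Optional


def classify_orf(
    nt19: Optional[str],
    nt20: Optional[str],
    cds19: str = "",
    cds20: str = "",
    prot19: str = "",
    prot20: str = "",
    full_length_alignment: bool = False,
    percent_identity: float = 0.0,
) -> str:
    """Assign one orf19 name to a category (hypothetical helper).

    nt19 / nt20: full ORF nucleotide sequence, introns included,
                 or None if the orf19 name is absent from that assembly.
    cds19 / cds20: coding sequence with intronic sequence removed.
    prot19 / prot20: predicted translations.
    full_length_alignment: True if the protein alignment spans the
                 entire ORF in both assemblies.
    percent_identity: amino acid identity of that alignment (0-100).
    """
    if nt19 is None and nt20 is None:
        raise ValueError("ORF is absent from both assemblies")
    if nt19 is None:
        return "New in Assembly 20"
    if nt20 is None:
        return "Deleted from Assembly 20"
    if nt19 == nt20:
        return "No sequence change"
    # CDS differs but translation is identical: synonymous changes only.
    if cds19 != cds20 and prot19 == prot20:
        return "Synonymous changes only"
    # Intron-only changes (CDS identical) also land here, as the
    # note for category 5 describes.
    if full_length_alignment and percent_identity >= 98.0:
        return "Simple sequence changes"
    return "Complex sequence changes"
```

Under this ordering, an ORF whose only change is intronic reaches the 98% test with an unchanged translation and is therefore scored as a simple change, which matches the note for category 5.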
7) All Assembly 20 ORFs, classified by the type of change, if any, that affected the ORF between Assembly 19 and Assembly 20 (Excel-format spreadsheet)
File contains: Excel workbook with two worksheets. The first worksheet contains a list of all of the Assembly 20 ORFs and their classification into the six categories, by the criteria outlined above for files 1-6. A "1" in columns B through F indicates that the ORF is classified in the category.
The columns in the first worksheet are as follows:
A) ORF name
B) Complex Sequence changes in Assembly 20
C) New in Assembly 20
D) No change in Assembly 20
E) Simple sequence changes including substitutions and indels in Assembly 20
F) Synonymous changes Only in Assembly 20
G) Chromosome
H) Start
I) Stop
J) Strand
K) Exon segments
L) Contig19 coordinates
The second worksheet contains a list of all of the Assembly 19 ORFs that are not present in Assembly 20, and the Contig19 name and coordinates. The columns in the second worksheet are as follows:
A) Assembly 19 ORF name
B) Contig19 name and contig coordinates from Assembly 19
File name: ClassificationTablePerGene.xls
8) All Assembly 20 ORFs, classified by the type of change, if any, that affected the ORF between Assembly 19 and Assembly 20 (Tab-delimited text format)
File contains: List of all of the Assembly 20 ORFs and their classification into the six categories outlined above.
The columns are as follows:
A) ORF name
B) Classification (into the categories described for files 1-6, above)
C) Chromosome (ORF name appears in this column if ORF is classified as "deleted from Assembly 20")
D) Start coordinate on chromosome (Contig coordinates appear in this column if ORF is classified as "deleted from Assembly 20")
E) Stop coordinate on chromosome
F) Strand
G) Exon Segments
H) Contig19 coordinates
File Name: ClassificationPerGene.txt
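The column layout above can be read with a few lines of Python. This is a sketch under stated assumptions: the dictionary keys are invented names for columns A through H, the file is assumed to have no header row, and the exact wording of the classification labels in column B is not specified on this page, so deleted ORFs are detected by a case-insensitive search for the word "deleted".

```python
import csv
from typing import Dict, Iterable, List

# Invented key names for columns A-H, in the order listed above.
COLUMNS = [
    "orf_name",
    "classification",
    "chromosome",
    "start",
    "stop",
    "strand",
    "exon_segments",
    "contig19_coordinates",
]


def read_classification(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse tab-delimited rows into dictionaries keyed by column."""
    rows: List[Dict[str, object]] = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or not fields[0].strip():
            continue  # skip blank lines
        # Pad short rows so every key is present.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        row: Dict[str, object] = dict(zip(COLUMNS, fields))
        # For deleted ORFs, columns C and D hold the ORF name and the
        # contig coordinates instead of chromosome and start.
        row["is_deleted"] = "deleted" in str(row["classification"]).lower()
        rows.append(row)
    return rows
```

An open file handle can be passed directly as the lines argument. Rows flagged with is_deleted should have their chromosome and start fields interpreted as described for columns C and D.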
9) Merged ORFs
Criteria: Merged ORFs were evaluated as follows: The Assembly 19 nucleotide sequence (including any introns) of each ORF that was deleted from Assembly 20 was compared by BLAST against the set of all Assembly 20 ORFs (nucleotide sequence, with introns). A strong match indicates that the deleted ORF may have been subsumed by the Assembly 20 ORF. Such candidates were evaluated *manually*. If the orf19 names of the possible merged pair were numerically close to each other (e.g., orf19.1556 and orf19.1555), the candidate pair was evaluated in the GBrowse genome browser. If the ORFs overlapped on the same strand, the pair was scored as "merged." If the ORFs did not overlap, or were on opposite strands, the pair was scored as "not merged." Possible merged pairs with orf19 names that were not close to each other were evaluated in the GBrowse genome browser displaying the position of the Assembly 19 contigs overlaid on the Assembly 20 chromosomes. These ORFs were scored as "merged" if they were located on the overlapping segments of adjacent contigs or if they spanned a junction between adjacent contigs.
File contains: The Feature name (orf19 name) of the ORF that remains after the merge, the Locus name (e.g., ABC1) of the ORF that remains after the merge, the Feature name (orf19 name) of the ORF that is deleted (subsumed) during the merge, the Locus name (e.g., ABC1) of the deleted/subsumed ORF.
File name: MergedORFs.txt
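The scoring rule for candidate pairs with numerically close orf19 names (overlap on the same strand) can be restated in code. The evaluation described above was performed manually in GBrowse, so this is only an illustrative sketch of the rule; the class, its fields, and the function names are hypothetical. The second rule, for pairs located on different Contig19s, depends on the contig overlay and is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrfPlacement:
    """Position of an ORF on a common coordinate system (hypothetical)."""
    chromosome: str
    start: int
    end: int
    strand: str


def overlaps(a: OrfPlacement, b: OrfPlacement) -> bool:
    """True if the two ORFs share at least one base on one chromosome."""
    if a.chromosome != b.chromosome:
        return False
    # Normalize, since start may exceed end for reverse-strand features.
    a_lo, a_hi = min(a.start, a.end), max(a.start, a.end)
    b_lo, b_hi = min(b.start, b.end), max(b.start, b.end)
    return a_lo <= b_hi and b_lo <= a_hi


def score_close_pair(deleted: OrfPlacement, kept: OrfPlacement) -> str:
    """Apply the same-strand overlap rule to one candidate pair."""
    if deleted.strand == kept.strand and overlaps(deleted, kept):
        return "merged"
    return "not merged"
```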
10) ORFs truncated by contig ends in Assembly 19, along with the new coordinates in Assembly 20
Criteria: In Assembly 19, one terminus of the ORF was a contig end.
File contains: ORF name, chromosomal coordinates in Assembly 20, contig coordinates in Assembly 19, length of protein in Assembly 20, length of protein in Assembly 19. Tab-delimited file.
File name: OrfsAtEndOfContigInAssembly19.txt
Note: This does not include any ORF whose terminus was near, but not at, the end of a contig in Assembly 19 and which was extended in Assembly 20. However, these ORFs are classified as having "complex sequence changes" as described above.
11) ORFs containing gaps/introns/adjustments in Assembly 19
Criteria: ORFs from Assembly 19 are included in this category if the coding sequence (CDS) comprises more than one segment.
*Please note that significant changes to intron and gap-containing ORFs were made in May 2007. This file should be considered a historical record; please see the Intron data and non-intron adjustments to ORF coordinates section of this page for more information.*
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron; global nucleotide alignment of the entire sequence (including the introns) to the CDS (with introns removed)
File name: OrfsWithIntrons_Assembly19.txt
Note: ***This category includes gaps that are NOT bona fide introns.***
The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. Note that the lengths of some introns/gaps are negative numbers (i.e., a region of the exon is counted twice).
All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation.
If there are multiple gaps/introns, the sizes of the gaps/introns are separated by commas.
12) ORFs containing gaps/introns/adjustments in Assembly 19 (without alignments)
Criteria: ORFs from Assembly 19 are included in this category if the coding sequence (CDS) comprises more than one segment. This file is identical to the file OrfsWithIntrons_Assembly19.txt, except that it does NOT contain the alignments and is therefore more amenable to viewing as a spreadsheet.
*Please note that significant changes to intron and gap-containing ORFs were made in May 2007. This file should be considered a historical record; please see the Intron data and non-intron adjustments to ORF coordinates section of this page for more information.*
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron. The file is in tab-delimited text format.
File name: OrfsWithIntrons_Assembly19_List.txt
13) ORFs containing gaps/introns/adjustments in Assembly 20
Criteria: ORFs from Assembly 20 are included in this category if the coding sequence (CDS) comprises more than one segment.
*Please note that significant changes to intron and gap-containing ORFs were made in May 2007. This file should be considered a historical record; please see the Intron data and non-intron adjustments to ORF coordinates section of this page for more information.*
File contains: ORF name; chromosome and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron; global nucleotide alignment of the entire sequence (including the introns) to the CDS (with introns removed). The ortholog assignments have been updated to reflect the Assembly 20-based mapping generated on November 26, 2006.
File name: OrfsWithIntrons_Assembly20.txt
Note: ***This category includes gaps that are NOT bona fide introns.***
The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. Some of the gaps introduced by the Annotation Working Group have a length that is a negative number; that is, the coding sequence comprises two overlapping segments, such that some sequence is counted twice. These are called "Adjustments," rather than "Introns" on the Locus page of the affected ORFs. Like the introns/gaps that are small in size, these "adjustments" should also be considered flags that indicate that resequencing of the area is advised.
Please also note: Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. In several such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). These ORFs are the following: orf19.1261, orf19.130, orf19.1639, orf19.1693, orf19.2440, orf19.3245, orf19.4136, and orf19.5880. After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD adjusted the position of these gaps to restore their position as defined in Assembly 19. The other sequence will remain as-is in CGD until further information is available.
All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation. We provide the size of the intron/gap/adjustment in Assembly 20 and information about the S. cerevisiae ortholog in this file to facilitate initial assessment.
If there are multiple gaps/introns, the sizes of the gaps/introns are separated by commas.
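To make the negative gap sizes concrete, the sketch below computes the size of each gap from consecutive coding segments and labels negative values as adjustments. It assumes 1-based inclusive coordinates and segments listed in ascending order on the forward strand; the function names and labels are invented for illustration and are not taken from the CGD files.

```python
from typing import List, Tuple


def gap_sizes(segments: List[Tuple[int, int]]) -> List[int]:
    """Size of the gap between each pair of consecutive segments.

    A negative value means that the segments overlap, so some
    sequence is counted twice in the coding sequence.
    """
    sizes = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        sizes.append(next_start - prev_end - 1)
    return sizes


def parse_gap_sizes(field: str) -> List[int]:
    """Parse a comma-separated size field such as '75,-3'."""
    return [int(part) for part in field.split(",") if part.strip()]


def label_gaps(field: str) -> List[str]:
    """Label each size: negative lengths are 'adjustment'."""
    return [
        "adjustment" if size < 0 else "intron/gap"
        for size in parse_gap_sizes(field)
    ]
```

For example, segments 1-10 and 8-20 overlap by three bases (positions 8, 9, and 10), so the computed gap size is -3. A positive size does not by itself establish that the gap is a bona fide intron.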
14) ORFs containing gaps/introns/adjustments in Assembly 20 (without alignments)
Criteria: ORFs from Assembly 20 are included in this category if the coding sequence (CDS) comprises more than one segment. This file is identical to the file OrfsWithIntrons_Assembly20.txt, except that it does NOT contain the alignments and is therefore more amenable to viewing as a spreadsheet.
*Please note that significant changes to intron and gap-containing ORFs were made in May 2007. This file should be considered a historical record; please see the Intron data and non-intron adjustments to ORF coordinates section of this page for more information.*
File contains: ORF name; contig and coordinates; size of the intron/gap (nucleotides); orthologous gene from S. cerevisiae, if any; whether or not orthologous gene from S. cerevisiae contains an intron. The ortholog assignments have been updated to reflect the Assembly 20-based mapping generated on November 26, 2006. The file is in tab-delimited text format.
File name: OrfsWithIntrons_Assembly20_List.txt
15) ORFs with changes to intron/gap/adjustment regions between Assembly 19 and Assembly 20
Criteria: Assembly 20 ORFs are included if the number or nucleotide sequence of introns/gaps/adjustments differs between Assembly 19 and Assembly 20.
File contains: ORF name; coordinates of exons in Assemblies 20 and 19; alignment of the Assembly 19 genomic nucleotide sequence (coding sequence plus intron(s)) vs. the Assembly 20 version; alignment of the Assembly 19 ORF protein sequence vs. the Assembly 20 version.
File name: intronChangesInAssembly20.txt
Note: Small changes in coordinates may not result in changes at either the nucleotide or amino acid sequence levels.
Note: ***Not all gaps are bona fide introns.***
The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation.
Please also note: Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. In eight such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD adjusted the position of these gaps to restore their position as defined in Assembly 19.
16) ORFs with changes to intron/gap/adjustment regions between Assembly 19 and Assembly 20 (without alignments)
Criteria: Assembly 20 ORFs are included if the number or sequence of introns/gaps/adjustments differs between Assembly 19 and Assembly 20. This file is identical to the file intronChangesInAssembly20.txt, except that it does NOT contain the alignments.
File contains: ORF names.
File name: intronChangesInAssembly20_OrfList.txt
17) Problem ORFs that have internal stop codons (with translation)
Criteria: This set of ORFs has a stop codon within the reading frame, as presented in the Assembly 20 files from the Annotation Working Group.
File contains: List of ORFs in this category, with nucleotide sequence (full, including any intronic sequence), coding sequence (CDS, with introns removed), and amino acid translation
File name: OrfsWithInternalStopCodonsInAssembly20.txt
Note: Most of the stop codons are near the end of the ORF described in the Assembly 20 file. Some are followed by a few residues of predicted protein sequence, some are followed by additional stop codons. After loading the data from the original Assembly 20 file and archiving this starting data, CGD has adjusted the boundary of these ORFs in the database and in the sequence files. The four exceptions are orf19.4384.1, orf19.3813, orf19.359 and orf19.5775.3 (described in more detail in the file problemORFInEMBLfiles.txt); these ORFs will remain as-is in CGD until additional data are available.
18) Problem ORFs that are lacking terminal stop codons
Criteria: This set of ORFs lacks the terminal stop codons, as presented in the Assembly 20 files from the Annotation Working Group/Assembly 20 collaboration.
File contains: List of ORFs in this category, with nucleotide sequence (full, including any intronic sequence), coding sequence (CDS, with introns removed), and amino acid translation
File name: OrfsWithoutEndStopCodonInAssembly20.txt
Note: In most of these cases, adjusting the end coordinates to extend the ORF by a few nucleotides, relative to its coordinates in the initial Assembly 20 release, would append an in-frame stop codon. After loading the data from the original Assembly 20 file and archiving these starting data, CGD has adjusted the boundary of these ORFs. The new coordinates now appear in the CGD sequence files. There are two ORFs that end with undetermined sequence ("NNN"), orf19.2657 and orf19.7398.1, and the termini of these two ORFs will not be modified by CGD in the absence of additional sequence data. In addition, orf19.3073 runs off the end of Assembly 20 Chromosome 4 and therefore lacks a terminal stop. Also included in this file are ORFs that extend downstream of an in-frame stop codon by a few residues. (These ORFs are also included in the category, "Problem ORFs that have internal stop codons," and are listed in the file OrfsWithInternalStopCodonsInAssembly20.txt, as described above.) The coordinates of ORFs with in-frame stops within a few codons of the terminus have also been adjusted; they have been truncated so that they end at the stop codon. These adjustments were performed after loading the data from the original Assembly 20 EMBL-format files and archiving these starting data at CGD. The adjustments are now present in the CGD sequence files.
19) ORFs with partial codons
Criteria: Length of the coding sequence (CDS, with any intronic sequence removed), in nucleotides, is not a multiple of three
File contains: ORF name, nucleotide sequence of the ORF (any intronic sequence included), translated sequence
File name: OrfsWithPartialTerminalCodonInAssembly20.txt
Note: Coordinates of ORFs have been adjusted so that the ORF ends at the stop codon; the extra nucleotides (partial codon) have been removed from the CGD sequence files. These adjustments were performed after loading the data from the original Assembly 20 EMBL-format files and archiving this starting data at CGD.
This query was run after other coordinate adjustments were made; some of the ORFs with partial codons in Assembly 20 were detected by other queries and corrected before this list was generated (e.g., ORFs without terminal stop codons).
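The boundary adjustments described for files 17 through 19 (extending an ORF to a downstream stop, truncating at a near-terminal stop, and dropping a partial codon) all amount to ending the ORF at the first in-frame stop codon. The following is a minimal sketch of that idea, not CGD's procedure. It assumes an intron-free sequence on the coding strand that begins at the start codon and includes some downstream flanking sequence; the function name is hypothetical.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def end_at_first_stop(sequence: str) -> Optional[int]:
    """Return the length of the ORF ending at the first in-frame stop.

    The returned value is the 0-based exclusive end index, counting
    the stop codon itself. Returns None if no in-frame stop is found,
    for example when the sequence ends in undetermined bases ("NNN")
    or runs off the end of the chromosome.
    """
    seq = sequence.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i + 3
    return None
```

This sketch always stops at the first in-frame stop codon. It therefore does not reproduce the decision to leave certain ORFs as-is, such as those with internal stops far from the terminus.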
20) ORFs with non-AUG start
Criteria: ORF nucleotide sequence does not begin with ATG
File contains: List of ORFs, with nucleotide sequence (including any intronic sequence). There are eight of these in Assembly 20.
File name: OrfsWithNonAUGstartInAssembly20.txt
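The criteria for the problem categories in files 17 through 20 can be checked directly on a coding sequence. The sketch below is illustrative only; the function name and flag strings are invented, and the input is assumed to be the CDS with intronic sequence removed. Because only start and stop codons are inspected, no translation table is needed.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def flag_cds(cds: str) -> List[str]:
    """Return the problem flags that apply to one coding sequence."""
    seq = cds.upper()
    flags: List[str] = []
    if not seq.startswith("ATG"):
        flags.append("non-ATG start")
    if len(seq) % 3 != 0:
        flags.append("partial codon")
    # Complete codons only; a trailing partial codon is ignored here.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        flags.append("no terminal stop")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        flags.append("internal stop")
    return flags
```

A sequence can carry more than one flag, which is consistent with the overlap between the files described above.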
21) Missing Contig19s, and the Assembly 19 ORFs that they contain
Criteria: Contig19s are included if they are not listed in the EMBL-format Assembly 20 files
File contains: Contig19 name, name of ORF contained on the missing contig, Locus name (if any) of the ORF, Feature Type of ORF, notes
File name: Missing_contigs.xls
Note: The EMBL-format Assembly 20 files released by the Annotation Working Group/Assembly 20 collaboration specify mapping of some of the Assembly 19 contigs to the Assembly 20 chromosomes; however, not all of the Contig19s are included in the EMBL-format files. The file "Missing_contigs.xls" contains information about the Contig19s that are missing from the EMBL-format Assembly 20 files.
Each ORF is contained on a single line; missing Contig19s that comprise multiple ORFs are listed on multiple lines. The Feature Type of each ORF indicates whether it is present in Assembly 20 and, if so, whether the sequence has changed between Assembly 19 and 20. The notes were entered based on manual investigation by BLAST. Excel format file.
22) Subdivided Contig19s
Criteria: Contig19s that are listed in the EMBL-format files and are split into pieces in Assembly 20
File contains: ID of Contig19 fragment; name of Contig19, Assembly 20 chromosome where contig fragment matches, chromosomal coordinates of match
File name: SplitContig19ToChromosomes.txt
Note: The subdivided Contig19 fragments are designated numerically, for example, "Contig19-10070_1," "Contig19-10070_2," "Contig19-10070_3."
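The fragment naming convention can be split back into the parent Contig19 name and the fragment number. This is a small sketch that assumes the fragment number always follows the final underscore, as in the examples above; the function name is hypothetical.

```python
from typing import Tuple


def split_fragment_id(fragment_id: str) -> Tuple[str, int]:
    """Split 'Contig19-10070_2' into ('Contig19-10070', 2)."""
    contig, _, index = fragment_id.rpartition("_")
    if not contig or not index.isdigit():
        raise ValueError(f"not a fragment ID: {fragment_id!r}")
    return contig, int(index)
```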
23) List of other Contig mapping problems
File contains: Notes on some problems with the Contig19 mapping onto Assembly 20 chromosomes from the EMBL-format files.
File name: problemContigMappingToChr.txt
24) Notes on problematic entries in the Assembly 20 files
File contains: List of problematic ORFs from the Assembly 20 EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration. Notes on the way in which these issues will be handled in CGD.
File name: problemORFInEMBLfiles.txt
Note: This file describes the following types of problems in the EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration: two different orf19 names that have been used for the same region (2 cases); orf19 names that have been used for two different regions (4 cases); an ORF without a name (1 case); ORFs with internal stop codons that are not amenable to correction by a simple adjustment of the terminal coordinate (4 cases); ORFs that are extremely changed in sequence between Assembly 19 and Assembly 20 (4 cases); and ORFs that contain a stop codon in Assembly 20 in the absence of any underlying sequence change (the coordinates of an intronic or gap sequence have changed position, or "slipped," between the two assemblies, creating an in-frame stop codon).
The archived Assembly 20 EMBL-format files are *unmodified* copies of the files released by the Annotation Working Group/Assembly 20 collaboration. Please note that there are some issues with the data in these files, as described in detail below. Thus, these archival copies are not recommended for use as-is.
Subsequent to the May 2006 release of the EMBL-format Assembly 20 files by the Annotation Working Group/Assembly 20 collaboration, updated EMBL-format files have been released to their web site, http://candida.bri.nrc.ca/alignments/editedEMBL/final/. Please note that issues with the data in these files remain unresolved (issues described in detail on the Sequence Help Page). Thus, the EMBL-format files at http://candida.bri.nrc.ca/alignments/editedEMBL/final/ are not recommended for use as-is.
Not all gaps are bona fide introns. All intron predictions should be considered to be preliminary, and these predictions should be subject to further evaluation. The Annotation Working Group added small gaps to make adjustments to the reading frame, or to eliminate stop codons in cases in which the annotator judged that the sequence was likely to be in error. Places where small gaps have been introduced into an ORF should be considered flags that indicate that resequencing of the area is advised.
Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. All gap/intron changes between Assembly 19 and 20 are listed in the file "intronChangesInAssembly20.txt." In several such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). These ORFs are: orf19.1261, orf19.130, orf19.1639, orf19.1693, orf19.2440, orf19.3245, orf19.4136, and orf19.5880. After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD adjusted the position of these gaps to restore their position as defined in Assembly 19. Current CGD files therefore contain the corrected sequence and coordinate data. The other sequence will remain as-is in CGD until further information is available.
Some of the gaps introduced by the Annotation Working Group have a length that is a negative number; that is, the coding sequence comprises two overlapping segments, such that some sequence is counted twice. These are called "Adjustments," rather than "Introns" on the Locus page of the affected ORFs, and they are listed in the file OrfsWithIntrons_Assembly20.txt (with alignments) and OrfsWithIntrons_Assembly20_List.txt (without alignments). Like the introns/gaps that are small in size, these "adjustments" should also be considered flags that indicate that resequencing of the area is advised.
The original Assembly 20 EMBL-format files include ORFs that have internal stop codons. They are listed in the file "OrfsWithInternalStopCodonsInAssembly20.txt." Most of the stop codons are near the end of the ORF described in the Assembly 20 file. (The exceptions are the four ORFs that have internal stop codons which are not amenable to correction by a simple adjustment of the terminal coordinate: orf19.3813, orf19.4384.1, orf19.359, and orf19.5775.3, as described in more detail in the file problemORFInEMBLfiles.txt; these ORFs will remain as-is in CGD until further information is available.) In addition, there are 15 cases in which ORFs have multiple stop codons in the EMBL-format Assembly 20 files (orf19.2309, orf19.1658, orf19.6947, orf19.5870, orf19.2758, orf19.5046, orf19.3140.1, orf19.942, orf19.4305.1, orf19.5592, orf19.7076, orf19.7056, orf19.2423, orf19.6382, and orf19.854). After loading the data from the original Assembly 20 file and archiving these starting data, CGD adjusted the boundaries of the ORFs with multiple terminal stops and the ORFs with near-terminal stops, and the updated coordinates now appear in the database and in the sequence files released by CGD.
The original Assembly 20 EMBL-format files include ORFs that are lacking stop codons. They are listed in the file "OrfsWithoutEndStopCodonInAssembly20.txt." In most of these cases, adjusting the end coordinates to extend the ORF by a few nucleotides, relative to its position in the original Assembly 20 files, would append an in-frame stop codon. After loading the data from the original Assembly 20 file and archiving these starting data, CGD adjusted the boundary of these ORFs in CGD, and in the sequence files released by CGD, so that each ORF terminates at the next in-frame, downstream stop codon. There are two ORFs that end with undetermined sequence ("NNN"), orf19.2657 and orf19.7398.1. In addition, orf19.3073 runs off the end of Assembly 20 Chromosome 4 and therefore lacks a terminal stop. The termini of these three ORFs will not be modified by CGD in the absence of additional sequence data.
The original Assembly 20 EMBL-format files lack entries for a subset of the contigs from Assembly 19. These are described in the file, /orfsFromMissingContigs_list.txt.
Additional notes on some other issues with the Contig19 mapping onto Assembly 20 chromosomes are contained in the file, problemContigMappingToChr.txt.
The original Assembly 20 EMBL-format files use the same orf19 name for two different ORFs, in four cases. There is also an ORF that has no name in the original Assembly 20 EMBL-format files. The file, problemORFInEMBLfiles.txt contains notes on CGD's investigation of these issues and a detailed description of how CGD has addressed these problems.
It appears that some Assembly 19 ORFs were erroneously deleted during generation of Assembly 20. Two cases were encountered during investigation of the problems described in the file, problemORFInEMBLfiles.txt. These two ORFs, orf19.71 and orf19.544.1, have been reinstated in CGD. In other cases, an ORF was deleted and then replaced with a new, nearly identical ORF (e.g., orf19.2217, which was replaced with orf19.1860.1). In this case, and others like it, the two ORFs have been scored as "merged." The orf19.2217 will retain its deleted status in CGD; the name orf19.2217 will be added as an alias of the ORF with which it has been merged, orf19.1860.1; and any curated information associated with the deleted ORF has been copied to the remaining member of the ORF pair.
CGD would appreciate it if users report other problems that they may encounter, so that issues can be documented and resolved wherever possible.