Index of /download/Assembly21notes

Name                                 Last modified      Size
Parent Directory                                        -
ORFsDeletedFromA21.txt               2009-02-02 10:35   48
ORFsWadjustments.txt                 2009-02-02 10:35   1.2M
ORFsWadjustments_alignment.txt       2009-02-02 10:35   2.5M
ORFsWambiguousBases.txt              2009-02-02 10:35   251K
ORFsWcomplexChanges.txt              2009-02-02 10:35   173K
ORFsWcomplexChanges_alignment.txt    2009-02-02 10:35   363K
ORFsWinternalStops.txt               2009-02-02 10:35   102K
ORFsWintrons.txt                     2009-02-02 10:35   818K
ORFsWintrons_alignment.txt           2009-02-02 10:35   1.7M
ORFsWmanualUpdates.txt               2009-02-02 10:35   165K
ORFsWnoChanges.txt                   2009-02-02 10:35   8.5M
ORFsWnoEndStop.txt                   2009-02-02 10:35   137K
ORFsWnonATGstart.txt                 2009-02-02 10:35   29K
ORFsWnonsynChanges.txt               2009-02-02 10:35   883K
ORFsWnonsynChanges_alignment.txt     2009-02-02 10:35   1.7M
ORFsWpartialCodon.txt                2009-02-02 10:35   271K
ORFsWsimpleChanges.txt               2009-02-02 10:35   965K
ORFsWsimpleChanges_alignment.txt     2009-02-02 10:35   1.9M
ORFsWsynChanges.txt                  2009-02-02 10:35   141K
ORFsWsynChanges_alignment.txt        2009-02-02 10:35   275K

Assembly 21 (A21) is described in van het Hoog M, et al. (2007)
Assembly of the Candida albicans genome into sixteen supercontigs
aligned on the eight chromosomes. Genome Biol 8(4):R52.  URL:
http://genomebiology.com/2007/8/4/R52

In addition to making a chromosome-level assembly by mapping the
contigs of Assembly 19 (A19) to chromosomes and filling many of the
gaps between them, the authors also made numerous and widespread
modifications to the genomic sequence, based on alignments generated
by inputting the SGTC's sequence traces into Sequencher software.
Many of these modifications have introduced insertions, deletions, and
substitutions relative to the Assembly 19 sequence.  In many cases A,
C, T, or G has been substituted with an ambiguous nucleotide; within
ORFs, such ambiguous nucleotides result in ambiguous amino acids in
the predicted ORF translation, each of which is represented as an "X"
within the A21 protein sequence.

The coordinates of each ORF in A21 were determined at CGD using a
sequence-based mapping procedure, described in detail under "Details
of the A21 ORF Mapping Procedure" below.  Where possible, the Assembly 20 (A20)
sequence of each ORF was used to determine the A21 coordinates; if the
A20 ORF sequence did not match any sequence in A21, then the A19
sequence of the ORF was compared to the A21 sequence (because the
sequence of some ORFs changed significantly between A19 and A20).
Where relevant, it is noted in the downloadable files whether the A19
or A20 version of the ORF was used for mapping onto A21 (called
"assembly used for mapping" in the file descriptions below).

Some ORFs from Assembly 20 map to corresponding regions in A21
chromosomes that, due to sequence changes made during generation of
A21, have frameshifts or new in-frame stop codons in A21.  In such
cases, CGD manually added "adjustments" to the ORF coordinates,
splitting the ORF into segments, with gaps or overlap between segments
so as to restore the A21 coding sequence (CDS) to match the A20 CDS.
In cases in which the A21 sequence lacks the stop codon that
terminates the ORF in A20, the A21 ORF has been extended downstream to
the next in-frame stop codon.  Issues encountered in mapping
individual ORFs to A21 are described on the Locus History page for the
ORF in CGD.
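
The downstream extension described above can be illustrated with a
minimal Python sketch.  It assumes a forward-strand ORF, 0-based
half-open coordinates, and the stop codons TAA, TAG, and TGA; the
function name is hypothetical and this is not the code used at CGD.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def extend_to_next_stop(chrom_seq, start, end):
    """Return a new exclusive end coordinate for the forward-strand ORF
    chrom_seq[start:end], extended to include the next in-frame stop
    codon.  Returns None if no in-frame stop is found downstream."""
    seq = chrom_seq.upper()
    # Begin at the last complete codon boundary so the scan stays in frame.
    pos = end - ((end - start) % 3)
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos + 3
        pos += 3
    return None

# ORF ATGAAACCC has no stop; scanning in frame reaches TAA at 15..18.
new_end = extend_to_next_stop("ATGAAACCCGGGTTTTAAGG", 0, 9)
```

An ORF on the reverse strand would need the same scan applied to the
reverse complement of the chromosomal sequence.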
 
ORFs in Assembly 19 that were deleted in A20 were not mapped to A21.

The original Assembly 21 released by van het Hoog et al. does not
include the sequence of the mitochondrial DNA. The mitochondrial
sequence from Assembly 19 has been included in A21 prior to its
release at CGD.

CGD has conducted a series of analyses of the A21 ORF set, the results
of which are available for download and described in detail here:

 
A21 ORF analyses

Note: Throughout these files, coding sequence is denoted by upper case
letters, and noncoding (intron or adjustment/gap) sequences are
denoted by lower case letters.  An ORF's "genomic sequence" is the
chromosomal sequence of the region, including any intronic sequence or
gap sequence within the boundaries of the ORF, from the beginning of
the start codon through the last nucleotide of the stop codon.  An
ORF's "coding sequence" or "CDS" is the sequence of the actual open
reading frame, after introns have been spliced out and any non-intron
adjustments have been made (gaps removed, or overlapping segments
aligned into place).
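
As an illustration of the case convention, the following Python sketch
recovers the upper-case (coding) letters and the positions of the
lower-case (noncoding) runs from a genomic sequence string.  The
function names are hypothetical.  Note the assumption that case alone
is sufficient: for ORFs whose adjustments involve overlapping
segments, the CDS provided in the files should be used instead.

```python
import re

def coding_from_genomic(genomic_seq):
    """Keep only the upper-case (coding) letters of a genomic sequence."""
    return "".join(ch for ch in genomic_seq if ch.isupper())

def noncoding_segments(genomic_seq):
    """Return 0-based half-open (start, end) spans of each lower-case
    run, i.e. each intron or adjustment/gap region."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", genomic_seq)]

example = "ATGgtaagtCCCtagTAA"
cds = coding_from_genomic(example)        # "ATGCCCTAA"
segments = noncoding_segments(example)    # [(3, 9), (12, 15)]
```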

ORFsDeletedFromA21.txt - This file contains a list of ORF names and,
where available, locus names for ORFs that could not be mapped well
(from either A20 or A19) or that were deemed not fixable (e.g., multiple
stop codons within all three A21 reading frames).  Hence, these are
the ORFs that are "Deleted from Assembly 21" and labeled as such on
their Locus pages.  ORFs that have been deleted have been eliminated
from the other files listed below (e.g., an ORF with stop codons
internal to the reading frame that has been deleted from A21 is
present on this list and not included in the ORFsWinternalStops.txt
file).

ORFsWintrons.txt - This file has ORFs that have introns in A21. This
file contains the ORF name, locus name (where available), A21
coordinates, the assembly used for mapping, and full genomic and
coding sequence of each of these ORFs.

ORFsWintrons_alignment.txt - This file has alignments between the full
A21 genomic sequence and the A21 coding sequence of ORFs that have
introns. The alignments were generated using MUSCLE, with the
sequences in the file ORFsWintrons.txt, and the output is in ClustalW
format.

ORFsWambiguousBases.txt - This file contains ORFs with ambiguous bases
(other than A, T, G, or C) in their A21 genomic sequence (the sequence
of the ORF itself and any gap or intron sequences within).  The
following ambiguous bases are contained in the Assembly 21 sequence as
released to CGD:

 B : C or G or T
 D : A or G or T
 K : G or T
 M : A or C
 N : unspecified A or C or G or T
 R : A or G
 S : C or G
 W : A or T
 Y : C or T

This file contains the ORF name, locus name (where available),
A21 coordinates, the assembly used for mapping, and full genomic
sequence of each of these ORFs.
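
The ambiguity codes listed above can be expressed as a lookup table.
The Python sketch below (the names are hypothetical) reports the
position and meaning of every non-ACGT letter in a sequence; it
upper-cases the input so that lower-case noncoding sequence is
handled as well.

```python
# Ambiguity codes present in A21, as listed above.
IUPAC_AMBIGUOUS = {
    "B": "CGT", "D": "AGT", "K": "GT", "M": "AC", "N": "ACGT",
    "R": "AG", "S": "CG", "W": "AT", "Y": "CT",
}

def ambiguous_positions(seq):
    """Return (0-based position, code, possible bases) for each letter
    other than A, C, G, or T.  Codes not in the table map to '?'."""
    hits = []
    for i, ch in enumerate(seq.upper()):
        if ch not in "ACGT":
            hits.append((i, ch, IUPAC_AMBIGUOUS.get(ch, "?")))
    return hits

hits = ambiguous_positions("ATGrCCnTAA")
# [(3, 'R', 'AG'), (6, 'N', 'ACGT')]
```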

ORFsWadjustments.txt - This file has ORFs that have adjustments in
their A21 coordinates. "Adjustments" are gaps or overlap between
coding segments that have been introduced into the ORF coordinates to
compensate for presumed errors in the sequence that cause stop codons
internal to the ORF region (either an in-frame stop codon or a shift
in reading frame).  Please note that some A20 ORFs also have
adjustments, and that the list of A20 ORFs with adjustments differs
from those with adjustments in A21, due to changes that have been made
to the sequence during generation of A21.  This file contains the ORF
name, locus name (where available), A21 coordinates, the assembly used
for mapping, and full genomic and coding sequence of each of these
ORFs.

ORFsWadjustments_alignment.txt - This file has alignments between the
full A21 genomic sequence and the A21 coding sequence of ORFs that
have adjustments. The alignments were generated using MUSCLE, with the
sequences in the file ORFsWadjustments.txt, and the output is in
ClustalW format.

ORFsWinternalStops.txt - This file has protein sequences of ORFs that,
when mapped to A21, have stop codons within the reading frame in the
A21 sequence.  "Adjustments" have been manually introduced to the ORF
coordinates to reconstruct a complete reading frame, where possible
(see also: ORFsWadjustments.txt).  This file contains the ORF name,
locus name (where available), A21 coordinates, the assembly used for
mapping, and the amino acid sequence of the predicted ORF translation.
For ORFs that have been given "adjustments," the adjusted sequence is
also listed.

ORFsWnoEndStop.txt - This file has protein sequences of ORFs that,
when mapped to A21, do not end with a stop codon.  Many of these have
other problems in A21, such as insertions or deletions in the sequence
that disrupt the reading frame.  Where possible, ORF coordinates have
been manually updated to create a complete reading frame (e.g.,
truncated at an in-frame stop, extended downstream to the next
in-frame stop, or given an "adjustment" that changes the reading
frame).  This file contains the ORF name, locus name (where
available), A21 coordinates, the assembly used for mapping, and the
amino acid sequence of the predicted ORF translation.  For cases where
the ORF has been manually updated, the updated sequence is also
included.

ORFsWnonATGstart.txt - This file lists ORFs that do not start with an
ATG in A21.  In some cases, the coordinates were updated manually
during curatorial review, and the updated ORF does start with an ATG;
these cases are also listed in this file.  This file contains the ORF
name, locus name (where available), A21 coordinates, the assembly used
for mapping, and the coding sequence of each of these ORFs.  For cases
where the ORF has been manually updated, the updated sequence is also
included.

ORFsWpartialCodon.txt - This file has coding sequences of ORFs that
have partial codons (i.e., the length of the coding sequence is not a
multiple of 3).  Where possible, ORF coordinates have been manually
updated to create a complete reading frame (e.g., truncated at an
in-frame stop, extended downstream to the next in-frame stop, or given
an "adjustment" that changes the reading frame).  This file contains
the ORF name, locus name (where available), A21 coordinates, the
assembly used for mapping, and the coding sequence (CDS).  For cases
where the ORF has been manually updated, the updated coding sequence
is also included.

ORFsWmanualUpdates.txt - This file lists the A21 "problem ORFs" that
were manually updated at CGD to address problems in A21.  All ORFs
that failed one or more of the checks described below under "Details
of the A21 ORF Mapping Procedure" were subject to manual review at
CGD.  Where possible, the issues were addressed: "adjustment(s)" were
added to change reading frame or to bypass in-frame stop codons,
coordinates were updated to extend or truncate the ORF to acquire a
terminal stop codon, and ORFs that narrowly failed criteria for "good"
matches to A21 were reinstated if they could be mapped to A21.  Only
the ORFs for which the coordinates were changed as a result of this
review are listed in this file.  This file contains the ORF name,
locus name (where available),
A21 coordinates determined by the computational mapping procedure,
manually updated ORF coordinates, descriptive note, the assembly used
for mapping, the predicted ORF translation based on the coordinates
before updating, and the predicted ORF translation based on the
updated coordinates.


A21 ORF mapping reports:

ORFsWnoChanges.txt - This file lists ORFs that have not changed at all
in A21 as compared to the sequence in the assembly (A20/A19) that was
used for mapping the ORF in each case. (Where possible, the A20
sequence of each ORF was used to determine the A21 coordinates; if the
A20 ORF sequence did not match any sequence in A21, then the A19
sequence of the ORF was compared to the A21 sequence; this was
necessary because the sequence of some ORFs changed significantly
between A19 and A20.)  This file contains the ORF name, locus name
(where available), A21 coordinates, the assembly used for mapping, and
the genomic sequence of each of these ORFs.

ORFsWsynChanges.txt - This file lists A21 ORFs that show synonymous
changes when compared to the sequence in the (A20/A19) assembly that
was used for mapping the ORF in each case.  The nucleotide sequence of
the coding sequence (CDS), excluding any intronic sequence or gap
regions ("adjustments"), is not the same between the two assemblies;
however, the translated amino acid sequence is the same.  This file
contains the ORF name, locus name (where available), A21 coordinates,
the assembly used for mapping, and the A21 coding sequence of each of
these ORFs, followed by the ORF name, locus name, A20 or A19
coordinates (the assembly from which the ORF was mapped), and A20 or
A19 coding sequence of the ORF.

ORFsWsynChanges_alignment.txt - This file has alignments between the
A21 coding sequence (CDS) and the A20/A19 CDS of each ORF that shows
synonymous changes upon mapping to Assembly 21. The alignments were
generated using MUSCLE, with the sequences in the file
ORFsWsynChanges.txt, and the output is in ClustalW format.

ORFsWnonsynChanges.txt - This file lists A21 ORFs that show
non-synonymous changes when compared to the sequence in the (A20/A19)
assembly that was used for mapping the ORF in each case.  This file
contains the ORF name, locus name (where available), A21 coordinates,
the assembly used for mapping, and the A21 coding sequence (CDS) of
each of these ORFs, followed by the ORF name, locus name, A20 or A19
coordinates (the assembly from which the ORF was mapped), and A20 or
A19 CDS of the ORF.
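
The distinction between synonymous and non-synonymous changes can be
sketched in Python as follows.  This is an illustration, not the code
used at CGD.  It assumes that C. albicans nuclear genes are translated
with the alternative yeast nuclear code (NCBI translation table 12),
in which CTG encodes serine rather than leucine; that detail is not
stated in these notes.  Codons containing ambiguous bases are
translated as "X".  The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order, then the single table 12 change.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
CODON_TABLE["CTG"] = "S"  # assumption: alternative yeast nuclear code

def translate(cds):
    """Translate complete codons; unknown or ambiguous codons give 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, usable, 3))

def classify_cds_change(old_cds, new_cds):
    """Compare the CDS used for mapping (A20 or A19) with the A21 CDS."""
    if old_cds.upper() == new_cds.upper():
        return "identical"
    if translate(old_cds) == translate(new_cds):
        return "synonymous"
    return "nonsynonymous"

print(classify_cds_change("ATGGCTTAA", "ATGGCCTAA"))  # synonymous
print(classify_cds_change("ATGGCTTAA", "ATGGATTAA"))  # nonsynonymous
```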

ORFsWnonsynChanges_alignment.txt - This file has alignments between
the A21 coding sequence (CDS) and the A20/A19 CDS of each ORF that
shows non-synonymous changes upon mapping to Assembly 21. The
alignments were generated using MUSCLE, with the sequences in the file
ORFsWnonsynChanges.txt, and the output is in ClustalW format.

ORFsWsimpleChanges.txt - This file lists A21 ORFs that show simple
changes when compared to the sequence in the (A20/A19) assembly that
was used for mapping the ORF in each case.  A "simple" change is
defined as follows: the nucleotide identity across the ORFs (the
genomic region from the start to stop codon, including introns or gaps
within the ORF) is 98% or greater, and the aligned region encompasses
the entire length of the ORF.  The "simple change" category includes
ORFs that may contain substitutions, small insertions, and/or small
deletions. Cases in which only sequence within an intronic or gap
("adjustment") region has changed, and the translated sequence has not
been affected, are also included in this category.  This file contains
the ORF name, locus name (where available), A21 coordinates, the
assembly used for mapping, and the A21 genomic sequence of each of
these ORFs, followed by the ORF name, locus name, A20 or A19
coordinates (the assembly from which the ORF was mapped), and A20 or
A19 genomic sequence of the ORF.
 
ORFsWsimpleChanges_alignment.txt - This file has alignments between
the A21 genomic sequence and the A20/A19 sequence of each ORF that
shows simple changes upon mapping to Assembly 21. The alignments were
generated using MUSCLE, with the sequences in the file
ORFsWsimpleChanges.txt, and the output is in ClustalW format.

ORFsWcomplexChanges.txt - This file lists A21 ORFs that show complex
changes when compared to the sequence in the (A20/A19) assembly that
was used for mapping the ORF in each case.  A "complex" change is
defined as follows: the nucleotide identity across the ORFs (the
genomic region from the start to stop codon, including introns or gaps
within the ORF) is less than 98%, and/or the aligned region does not
encompass the entire length of the ORF.  This file contains the ORF
name, locus name (where available), A21 coordinates, the assembly used
for mapping, and the A21 genomic sequence of each of these ORFs,
followed by the ORF name, locus name, A20 or A19 coordinates (the
assembly from which the ORF was mapped), and A20 or A19 genomic
sequence of the ORF.
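
The "simple" and "complex" definitions amount to a two-part test,
sketched below in Python.  The function and parameter names are
hypothetical, and the sketch assumes that the percent identity and
aligned length have already been computed for an ORF whose sequence
differs between the two assemblies.

```python
def classify_genomic_change(percent_identity, aligned_length, orf_length):
    """Return 'simple' if identity is 98% or greater AND the aligned
    region covers the entire ORF; otherwise return 'complex'."""
    covers_full_orf = aligned_length >= orf_length
    if percent_identity >= 98.0 and covers_full_orf:
        return "simple"
    return "complex"

print(classify_genomic_change(99.2, 1200, 1200))  # simple
print(classify_genomic_change(97.9, 1200, 1200))  # complex
print(classify_genomic_change(99.5, 1100, 1200))  # complex
```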

ORFsWcomplexChanges_alignment.txt - This file has alignments between
the A21 sequence and the A20/A19 sequence of each ORF that shows
complex changes upon mapping to Assembly 21. The alignments were
generated using MUSCLE, with the sequences in the file
ORFsWcomplexChanges.txt, and the output is in ClustalW format.


Details of the A21 ORF Mapping Procedure

We first used the A20 sequence of each ORF to find the corresponding
region on A21 chromosomes.  The full A20 genomic DNA sequence of each
ORF (including introns and any gap/adjustment regions) was mapped,
using BLAST, to A21 chromosomes. The threshold parameters used were
that both the percent identity and alignment length of the BLAST HSP
(high-scoring pair) be > 95%. The ORFs that passed this threshold were
called "good" full ORF matches.  If there was no good match to the A20
sequence of that ORF, we tried the mapping using the A19 ORF sequence.
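
The threshold test and the A20-then-A19 fallback can be sketched in
Python as follows.  The sketch assumes that "alignment length > 95%"
means the HSP alignment length as a percentage of the query (ORF)
length; the function names and the tuple layout are hypothetical and
do not correspond to any BLAST output format.

```python
def is_good_match(percent_identity, align_len, query_len,
                  min_identity=95.0, min_coverage=95.0):
    """Both values must be strictly greater than their thresholds."""
    coverage = 100.0 * align_len / query_len
    return percent_identity > min_identity and coverage > min_coverage

def assembly_used_for_mapping(hsp_a20, hsp_a19):
    """Each argument is None (no hit) or a tuple
    (percent_identity, align_len, query_len) for the best HSP.
    A20 is tried first; A19 is used only if A20 gives no good match."""
    for assembly, hsp in (("A20", hsp_a20), ("A19", hsp_a19)):
        if hsp is not None and is_good_match(*hsp):
            return assembly
    return None  # no good match; candidate for manual review

print(assembly_used_for_mapping((99.0, 990, 1000), None))               # A20
print(assembly_used_for_mapping((90.0, 1000, 1000), (98.0, 1000, 1000)))  # A19
```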

Then, for ORFs with multiple exons, the coding sequence segments 
(individually referred to as a "subfeature") were each
separately mapped to the A21 region that matched the full ORF.
Thresholds were stricter for the subfeature mapping: alignment
length was required to be 100% and percent identity > 99%.
The matched subfeature regions were used to define the ORF location
and then various checks were performed on the generated A21 ORF
sequences. 
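
A minimal Python sketch of the subfeature step follows.  It assumes
that an alignment length of 100% means the aligned length equals the
subfeature length, and that the ORF location is taken as the span from
the first matched subfeature to the last; both readings, and the
function names, are assumptions made for illustration.

```python
def subfeature_passes(percent_identity, align_len, subfeature_len):
    """Full-length alignment and identity strictly greater than 99%."""
    return align_len == subfeature_len and percent_identity > 99.0

def orf_span(subfeature_regions):
    """Given matched (start, end) regions on A21 with start <= end,
    return the (start, end) span that encloses all of them."""
    starts = [s for s, _ in subfeature_regions]
    ends = [e for _, e in subfeature_regions]
    return min(starts), max(ends)

print(subfeature_passes(99.5, 300, 300))     # True
print(orf_span([(100, 250), (310, 600)]))    # (100, 600)
```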

Manual review was performed in cases where ORFs did not pass the
criteria for a "good" match to A21 using either A20 or A19 sequence.
There were 16 such cases; some had narrowly failed the criteria used
for computational classification, and were manually assigned to A21
coordinates.

Below is the list of checks that were performed on each ORF.

 (1) check that full DNA sequence does not have invalid bases
 (2) check that full DNA sequence does not have N characters
 (3) check that coding sequence is multiple of three in length (no partial codons)
 (4) check that coding sequence start is ATG
 (5) check that coding sequence ends with a stop codon
 (6) check that coding sequence does not terminate with multiple sequential stop codons
 (7) check that coding sequence does not have stop codons internal to the ORF
 (8) check that exon-intron boundaries have not changed and, for real
     introns (as opposed to "adjustments"), compare sequence around the
     intron and its adjoining exons to make sure it is intact
 (9) check whether the intron splice sites are canonical, that is,
     5' : GT and 3' : AG
(10) line up subfeatures, find gaps or overlaps, and suggest
     adjustments (only exons and real introns were mapped to A21;
     adjustments were not mapped)
(11) check coverage of subfeatures: the start of the first subfeature
     should coincide with the start of the ORF and, similarly, the end
     of the last subfeature should be the same as the end of the ORF
(12) check that full genomic DNA sequence does not have non-ATGC
     characters
(13) check that coding sequence does not have non-ATGC characters
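
Check 10 can be illustrated with a short Python sketch that lines up
mapped subfeatures and reports the gaps and overlaps between
neighbors.  It assumes 1-based inclusive coordinates on the forward
strand, and it does not distinguish real introns from other gaps;
spans that correspond to annotated introns would need to be excluded
before suggesting adjustments.  The function name is hypothetical.

```python
def find_gaps_and_overlaps(subfeatures):
    """subfeatures: list of (start, end), 1-based inclusive.
    Returns a list of ('gap' | 'overlap', start, end) tuples."""
    ordered = sorted(subfeatures)
    findings = []
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        distance = s2 - e1 - 1  # bases lying between the two segments
        if distance > 0:
            findings.append(("gap", e1 + 1, s2 - 1))
        elif distance < 0:
            findings.append(("overlap", s2, min(e1, e2)))
    return findings

print(find_gaps_and_overlaps([(1, 90), (95, 200), (198, 300)]))
# [('gap', 91, 94), ('overlap', 198, 200)]
```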

ORFs that failed one or more of the following checks were reviewed
manually: 3, 4, 5, 7.  Where possible, the coordinates of the ORF were
updated to address the issue.  Please see ORFsWmanualUpdates.txt for a
list of ORFs for which the coordinates were updated during curatorial
review at CGD.
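
Checks 3 through 7, and the rule that failures of checks 3, 4, 5, or 7
trigger manual review, can be sketched in Python as follows.  This is
an illustration under the assumption that the stop codons are TAA,
TAG, and TGA and that codons are read in frame from the first base of
the coding sequence; it is not the code used at CGD, and the function
names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def run_cds_checks(cds):
    """Return the set of failed check numbers (3 to 7) for a CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    failed = set()
    if len(cds) % 3 != 0:
        failed.add(3)                      # partial codon
    if not cds.startswith("ATG"):
        failed.add(4)                      # non-ATG start
    if cds[-3:] not in STOP_CODONS:
        failed.add(5)                      # no terminal stop codon
    if cds[-3:] in STOP_CODONS and cds[-6:-3] in STOP_CODONS:
        failed.add(6)                      # multiple sequential terminal stops
    # Drop the terminal run of stop codons, then look for internal stops.
    body = list(codons)
    while body and body[-1] in STOP_CODONS:
        body.pop()
    if any(codon in STOP_CODONS for codon in body):
        failed.add(7)                      # internal stop codon
    return failed

def needs_manual_review(failed_checks):
    """Failures of checks 3, 4, 5, or 7 were reviewed manually."""
    return bool(set(failed_checks) & {3, 4, 5, 7})

print(run_cds_checks("ATGTAAAAATAA"))                       # {7}
print(needs_manual_review(run_cds_checks("CTGAAATAA")))     # True
```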


Please note: 
The former "CurrentNotes" directory and files therein have 
been removed from the downloads directory.  
/download/CurrentNotes/AllGaps.tab
/download/CurrentNotes/DeletedGenesSinceAssembly21.txt
/download/CurrentNotes/DeletedOrfsSinceAssembly20.txt
/download/CurrentNotes/GenesWithIntrons.tab    
/download/CurrentNotes/NewGenesSinceAssembly21.txt 
/download/CurrentNotes/NewOrfsSinceAssembly20.txt
/download/CurrentNotes/ORFsWithAdjustments.tab 
If you have any need for these files or information, please contact CGD: 
http://www.candidagenome.org/cgi-bin/suggestion