In each case, we BLASTed the Assembly 21 (A21) sequence against all of the sequence traces, both the original set from the Stanford Genome Technology Center (SGTC) shotgun sequencing and the new set of 454 traces, and also against the contig sequences from Assembly 19 (A19). NCBI BLAST was used with an e-value cutoff of 1e-15. The query sequence comprised the "error region" (i.e., the site of the actual "adjustment" itself) plus 100 nucleotides of flanking sequence on each side. We imposed an additional constraint that the match to each sequence trace span the entire "error region" (but not necessarily the entire flanking sequence).
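The query construction and the span constraint described above can be sketched as follows. This is a minimal illustration, not the pipeline that was actually used: the `Hit` record, the function names, and the 0-based half-open coordinate convention are all assumptions made for the example, and the hits are taken to be already parsed from BLAST output.

```python
from typing import List, NamedTuple, Tuple

FLANK = 100          # flanking nucleotides on each side of the error region
MAX_EVALUE = 1e-15   # e-value cutoff stated in the text


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit (0-based, half-open query coordinates)."""
    trace_id: str
    q_start: int
    q_end: int
    evalue: float


def query_window(error_start: int, error_end: int, seq_len: int,
                 flank: int = FLANK) -> Tuple[int, int]:
    """Error region plus flanks, clipped to the ends of the sequence."""
    return max(0, error_start - flank), min(seq_len, error_end + flank)


def spans_error_region(hit: Hit, error_start: int, error_end: int,
                       window_start: int) -> bool:
    """True if the hit covers the whole error region; flanks may be partly covered."""
    rel_start = error_start - window_start  # error region in query coordinates
    rel_end = error_end - window_start
    return hit.q_start <= rel_start and hit.q_end >= rel_end


def filter_hits(hits: List[Hit], error_start: int, error_end: int,
                window_start: int, max_evalue: float = MAX_EVALUE) -> List[Hit]:
    """Keep hits that pass the e-value cutoff and span the error region."""
    return [
        h for h in hits
        if h.evalue <= max_evalue
        and spans_error_region(h, error_start, error_end, window_start)
    ]
```

For an error region at positions 500 to 503 of a 10,000-nucleotide sequence, `query_window(500, 503, 10000)` gives the window `(400, 603)`; a hit covering query positions 0 to 203 is kept, while one ending at query position 101 is dropped because it stops inside the error region.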
BLAST hits were aligned with the MUSCLE alignment software, and we manually curated each alignment. In many cases, the sequence data clearly indicate sequence differences between the two alleles of an ORF in SC5314. Because Assembly 21 is already a mix of haplotypes, we made only the sequence corrections/updates that would restore the open reading frame (insertions and deletions that cause frameshifts, and substitutions that affect in-frame stop codons), and did not attempt to dissect out the alleles or to make other corrections based on these sequence data. (Construction of an updated diploid assembly is beyond the scope of the current work.) Where the traces support a change that restores (or lengthens) the open reading frame, we recorded the "snippet" of sequence from A21, the "snippet" with the change included, and the supporting evidence: the number of traces that support the change and the total number of traces aligned across the region. The sequence changes were parsed from the sequence snippets and then incorporated into the genomic sequence of the affected ORFs (with the "adjustments" removed). The coding sequence of each ORF (taking introns into account) was computationally translated and tested for proper start and stop codons, as well as for stop codons internal to the reading frame. Remaining errors were addressed iteratively: we generated new alignments (centered on a different region, using a more stringent e-value cutoff, or removing the full-length match constraint, as needed), curated new sequence snippets, and repeated the translation tests.
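The translation test described above (proper start codon, proper terminal stop codon, no internal stops, introns taken into account) might look like the sketch below. It is an illustration under stated assumptions rather than the code that was used: the function names and the exon coordinate convention (0-based, half-open, on the coding strand) are hypothetical, and the check works directly on codons, so it does not depend on which amino acid translation table is applied.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(genomic: str, exons: List[Tuple[int, int]]) -> str:
    """Join exon segments (0-based, half-open, coding strand) into a coding sequence."""
    return "".join(genomic[start:end] for start, end in exons)


def check_orf(cds: str) -> List[str]:
    """Return a list of problems found in a coding sequence; empty if it passes."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Only complete codons are examined.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("missing ATG start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    internal = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    if internal:
        problems.append(f"internal stop codon(s) at codon index {internal}")
    return problems
```

An ORF that still reports an internal stop or a length that is not a multiple of three after the updates are applied would be a candidate for another round of alignment and curation.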
As part of this work, all of the "adjustments", which were put into the ORF annotation to compensate for presumed errors in the genomic sequence, have now been removed from Assembly 21. In each case the sequence has been re-examined; underlying errors have been corrected or, in cases where no error is confirmed, the ORF annotation has been updated accordingly. The updates to individual ORFs have been grouped broadly into Sequence Changes and Annotation Changes. Annotation changes include:
- addition of a new ORF (or non-ORF feature such as a pseudogene, or a blocked reading frame, which is a conserved region that includes an ORF from start to stop as well as conserved sequence outside of the ORF),
- change in the location of one or both termini of an ORF (ORF boundary change),
- addition of an old ORF that was previously deleted (a reinstated ORF),
- merge between two or more distinct ORFs which results in a single contiguous ORF (the terms "merge-keep" and "merge-delete" are applied to the ORFs within the pair that become the new primary ORF in CGD and the ORF that is assigned "deleted" status after the merge, respectively),
- separation (un-merging) of two ORFs that were previously merged and which are now being restored as separate entities.
Each change is described in detail in the Locus History of each ORF, which may be accessed from the Locus History tab near the top of each Locus Summary page or from the Locus History link in the Additional Information section near the bottom of the page.
In cases in which there is a predicted conserved coding region but not a clear ORF, or a CGD ORF with an "adjustment" for which the sequence data do not support a sequence correction that would restore the open reading frame, our curation guidelines are the following: If the N- or C-terminus can be moved, such that a smaller intact ORF remains, we annotate the smaller ORF. If this is not possible, we annotate the region as a "pseudogene." In cases where the smaller ORF region appears to be part of a larger conserved coding region, we annotate the larger region as a "blocked reading frame," to make this information accessible to our users.
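One way to read these guidelines is as a small decision rule. The sketch below encodes that reading; the enum, the function name, and the two boolean inputs are hypothetical, and in practice each input reflects a curator's judgment rather than an automated test.

```python
from enum import Enum


class Annotation(Enum):
    SMALLER_ORF = "smaller intact ORF"
    PSEUDOGENE = "pseudogene"
    BLOCKED_READING_FRAME = "blocked reading frame"


def choose_annotation(intact_smaller_orf_possible: bool,
                      within_larger_conserved_region: bool) -> Annotation:
    """Pick an annotation type for a region whose reading frame cannot be restored."""
    if not intact_smaller_orf_possible:
        # No terminus can be moved to leave an intact ORF.
        return Annotation.PSEUDOGENE
    if within_larger_conserved_region:
        # The smaller ORF is part of a larger conserved coding region,
        # so the larger region is annotated instead.
        return Annotation.BLOCKED_READING_FRAME
    return Annotation.SMALLER_ORF
```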
When a new ORF is added to CGD, it receives a name that follows the "orf19" systematic name convention that is in use in the community. The guideline for adding a new ORF name is to identify the neighboring ORFs, start with the flanking ORF that contains the lower number in its numeric name, add a ".1" suffix, and assign this new name to the new ORF. (If the flanking ORF already has a ".1" suffix, we add a ".2" suffix to the new name.) All new ORF names were checked for uniqueness in A19, A20, and A21. Names for pseudogenes and blocked reading frames follow the same convention as ORF names.
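The naming guideline above can be sketched as a short function. This is an illustration only: the function names are hypothetical, the rule is generalized so that a flanking name with suffix n yields suffix n+1 (the text states only the ".1" to ".2" case), and raising an error on a name collision is an assumption, since the text says only that uniqueness was checked.

```python
import re
from typing import Set, Tuple

_NAME = re.compile(r"^orf19\.(\d+)(?:\.(\d+))?$")


def parse_name(name: str) -> Tuple[int, int]:
    """Split 'orf19.N' or 'orf19.N.S' into (N, S), with S = 0 when absent."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not an orf19 systematic name: {name!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def new_orf_name(left: str, right: str, existing: Set[str]) -> str:
    """Name a new ORF after the lower-numbered flanking ORF, bumping the suffix."""
    number, suffix = min(parse_name(left), parse_name(right))
    candidate = f"orf19.{number}.{suffix + 1}"
    if candidate in existing:
        # Assumption: a collision is reported rather than resolved automatically.
        raise ValueError(f"{candidate} is already in use")
    return candidate
```

Here `existing` stands for the combined set of names used in A19, A20, and A21. With flanking ORFs `orf19.1234` and `orf19.1235`, the new name is `orf19.1234.1`; if the lower-numbered neighbor is already `orf19.1234.1`, the new name is `orf19.1234.2`.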