Problematic ORFs in Assembly 20

Note: This file describes the following types of problems in the EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration:

  - two different orf19 names that have been used for the same region in the EMBL-format Assembly 20 files (2 cases),
  - orf19 names that have been used for two different regions in the EMBL-format Assembly 20 files (4 cases),
  - an orf19 name that is used as the name of one ORF and also as an allele name of a different ORF (1 case),
  - an ORF without a name in the EMBL-format Assembly 20 files (1 case),
  - ORFs with internal stop codons that are not amenable to correction by a simple adjustment in the terminal coordinate (4 cases),
  - ORFs that are extremely changed in sequence between Assembly 19 and Assembly 20 (4 cases),
  - ORFs that acquired a stop codon in Assembly 20 in the absence of any underlying sequence changes (the coordinates of an intronic or gap sequence have changed position ("slipped") between the two assemblies, creating an in-frame stop codon; 8 cases).

There are two ORFs in the Chromosome 6 EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration that have the same DNA coordinates:

FT   CDS             393785..394534
FT                   /gene="orf19.3391"
FT   CDS             393785..394534
FT                   /gene="orf19.683"

CGD will merge orf19.683 into orf19.3391 and delete it.
______________________________________________________________________
There are two ORFs in the Chromosome R EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration that have the same DNA coordinates:

FT   CDS             1166552..1166866
FT                   /gene="orf19.5284"
FT   CDS             1166552..1166866
FT                   /gene="orf19.3510"

CGD will merge orf19.3510 into orf19.5284 and delete it.
______________________________________________________________________
There are two ORFs named orf19.3066 in the Chromosome 1 EMBL-format file released by the Annotation Working Group/Assembly 20 collaboration.

FT   CDS             join(761797..764793,764795..765231)
FT                   /gene="orf19.3066"

and

FT   CDS             765849..766900
FT                   /gene="orf19.3066"

They are not related in sequence.

The regions CaChr1:761797..764793 and CaChr1:764795..765231 match Assembly 19 orf19.3066/ENG1 by BLASTN at CGD. This region will retain the name orf19.3066.

The region CaChr1:765849..766900 matches Assembly 19 orf19.71 by BLASTN at CGD. orf19.71 has been deleted from Assembly 20 by the Annotation Working Group/Assembly 20 collaboration.

It appears:
1. that this region corresponds to orf19.71,
2. that the region is mislabeled "orf19.3066" in the EMBL-format file,
3. that the deletion of orf19.71 is an error in the EMBL-format file.

CGD will reinstate orf19.71 in Assembly 20.
______________________________________________________________________
There are two ORFs named orf19.6897 in the EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration.

From the Chromosome 7 file:

FT   CDS             complement(236285..236725)
FT                   /gene="orf19.6897"

and

From the Chromosome 4 file:

FT   CDS             complement(855407..855846)
FT                   /gene="orf19.6897"

The two ORFs are on different chromosomes, one on Chr 4 and one on Chr 7. The two ORFs are related to each other in sequence.

In Assembly 19, there are two ORFs that are related to orf19.6897; they are orf19.7004 and orf19.4484. Neither orf19.7004 nor orf19.4484 is included in Assembly 20.

In Assembly 19, orf19.6897 is on Contig19-2485. This contig corresponds to the region of Chromosome 7 in Assembly 20 that contains the region called orf19.6897.

Because the region CaChr7:complement(236285..236725) corresponds to orf19.6897 from Assembly 19, this region will retain the name orf19.6897. The region CaChr4:complement(855407..855846) will be given the new ORF name orf19.1208.1.
______________________________________________________________________
There are two ORFs named orf19.4379 on Chromosome R in the EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration.

FT   CDS             complement(801765..804368)
FT                   /gene="orf19.4379"

and

FT   CDS             850051..852654
FT                   /gene="orf19.4379"

In Assembly 19, orf19.4379 is on Contig19-10203, and this contig has been subdivided in the construction of Assembly 20. Both of the regions called "orf19.4379" in Assembly 20 are in the same "genomic context" as the Assembly 19 orf19.4379, as if the contig had been split in the region of this ORF, such that there are now two ORFs in Assembly 20 that correspond to orf19.4379 from Assembly 19.

The region CaChrR:complement(801765..804368) will be given the name orf19.4380.1. The region CaChrR:850051..852654 will retain the name orf19.4379.
______________________________________________________________________
There are two ORFs named orf19.730 on Chromosome R in the EMBL-format files released by the Annotation Working Group/Assembly 20 collaboration.

FT   CDS             join(1573570..1575687,1575689..1576210)
FT                   /gene="orf19.730"

and

FT   CDS             1646829..1649468
FT                   /gene="orf19.730"

In Assembly 19, orf19.730 is on Contig19-10070. This contig is subdivided into 3 regions in Assembly 20. Both of the regions called "orf19.730" in Assembly 20 are in the same "genomic context" as the Assembly 19 orf19.730, as if the contig had been split in the region of this ORF, such that there are now two ORFs in Assembly 20 that correspond to orf19.730 from Assembly 19.

The region CaChrR:join(1573570..1575687,1575689..1576210) will retain the name orf19.730. The region CaChrR:1646829..1649468 will be given the name orf19.729.1.
______________________________________________________________________
orf19 name that is used as the name of one ORF and also as an allele name of a different ORF

The name orf19.7542 was added for the region:

Chr R, coordinates 11815 to 10646

This is a new ORF in Assembly 20. It will be renamed "orf19.7539.1".

The name orf19.7542 was already in use as the name of the allele of PGA25/orf19.6336, which is located on Chr 6, coordinates 9900 to 7282.
The name orf19.7542 will remain as the name of the allele of PGA25/orf19.6336.
______________________________________________________________________
This ORF has no name in the Assembly 20 EMBL file. It is on Chr R:

FT   CDS             join(1005780..1005814,1005910..1005929,1005983..1006716)
FT                   /gene=""

By BLAST, these regions match orf19.544.1/PRE6, which was deleted from Assembly 20 by the Annotation Working Group/Assembly 20 collaboration.

CGD will reinstate orf19.544.1/PRE6 in Assembly 20.
______________________________________________________________________
ORF with internal stop codon in Assembly 20
orf19.359

orf19.359 has an internal stop codon. The situation is complicated by an intron/gap sequence. It appears that there are multiple frame shifts in the sequence of this ORF in Assembly 20 compared to Assembly 19, and that sequence that was intronic in Assembly 19 is coding in Assembly 20.

> orf19.359 Assembly 20
MSDSKIILH*HSSPFILLTRKYVRLNYSRSQRVLWLLGELNIPFELKSLLEKKEFRAPKELENVHPLGKSPVIEVIDSKTGKSEIIAETGHIFNYILSNYDTTNILIPFNRDLQSQVDYFLHYTEGTLQPKLVALMVHGVAKKKAPFGARFLMGLLMGGIDDAFYIPDLKKNLKYLENIIHKQHEKGSKYFVGDKLSGADIILEFPVITNIFQNKRGAEQLGAGDVEKEYPHLNQWAEDIKKEPKYIKAQELVAKHETVKPNI

This ORF will remain as-is, until further information becomes available.
______________________________________________________________________
ORF with internal stop codons in Assembly 20
orf19.5775.3

orf19.5775.3 has multiple internal stop codons in Assembly 20. Also, the sequence is very different when compared to orf19.5775.3 from Assembly 19.
> orf19.5775.3_old Assembly 19 VGIDLCLVRESTEDEYSVLENQFHPGAVEFLKIMTKSKSYIFAEAFDFALKNNRELVTAIHKAKFFKLGYGLVKTTLVSWIMLLYQSFSIHDNPLLWPYGFILSNFGAALIDVLGLVTGASFGRCCSGFDQIAVTLLWISKLKIQN > orf19.5775.3_new Assembly 20 ENAEIRVLVAGLCRILYGKSHVTRRRS*GSCTRENERGI***IIIIWELRRELINDVKL*LVVWYCILKFPLI*TSR*ST**SYCK*YS*IYNGYVFYSNVFGRTSHTWYWCHSRLLGLLCQLFQNTVFLKIGLVILRVEAR*HC*LDTLRLI*SRPREY*MTTEKCIPICISRNMSVAMS This ORF will remain as-is, until further information becomes available. ______________________________________________________________________ ORF with internal stop codons in Assembly 20 orf19.4384.1 orf19.4384.1 has multiple internal stop codons. >orf19.4384.1 Assembly 20, Chromosome R (794397, 792205) CDS, reverse complemented, translated using codon table 12 (647 residues) MRQPGLVWLENCYSTRLNISYCPISRFLLKVAQMEKHNQNAINTVSFCGLL*IPMISLHV GFFFFHHFLEYFFVQSSDLQ*TIYQCCSFHWKRIPVCLESMYMMRPRKTSPNMNTRKGWS LVQTLTLTASLKSMPPWVSSRSSRALG*PISKT*SFPLEYPRRCWSRRWVYLSHSVVFFS VSTSR*SRVPPLVWTSPCILTVTKTHWYRRSCRWAQ*SVLFSSRRCRGSGVRRRSWPPVS CTLLVQFCVQPPPFQKNWSMAPSTTIGPETSCMREDLFWGLVWVSKVVVLVCTLVSSCPV TSVGVLFRSINSVLLLERFWGTPLSLSSTRCTEGGVSWWVRRWCFLRLC*LVFSSCQSPR VTWCSRGSLAKHSMCLLG*GTWGTDTARSSLCRWSRSSRQKESCVSRRTRARCGLSWSPS PETAVPWCMVL**SPWVS*RVSTSLCTTCRR*WAESGSPQRTPCLCP*LVAGRCLLVPFQ QSCTWTSLVVVCGGTTLLGSSLGWCWLVLDTRLTLKQTWSSPKVSISSVLFCTWAFSGRT PV*HGLCPPSRFRWGPDLKV*LSVRHCCTCGRSSSHTTSTE*RWP*HTQG*LLGSTVVLR SSVSSTRCCLCQRPRGRLWKRLTRYSRSHPDSSSRRIFRG*RGSGAV This ORF will remain as-is, until further information becomes available. ______________________________________________________________________ ORF with internal stop codons in Assembly 20 orf19.3813 orf19.3813 has two stop codons near the end of the ORF and a third, upstream stop. 
>orf19.3813 Assembly 20, Chromosome 4 (1016696, 1015187) CDS, reverse complemented, translated using codon table 12 (502 residues) MSVFYILFFYSLFFFCQSPIKTSRHPPTYTNCVTMKITRSQRTKHIRLTLISLVIISIFV LVYTLSKSAAVSTYSDVSKMITDFGQKYNPDTEKNDHSSSVSKANSQNVKDTKGNDSKHE VSGNSQSDSTGQVTSQNPGNYLGEIEQMSPEELERAKTIMEEPKEGEETTTKLGVPDPNN SKFSPVTKNKPELPQIPLVGTKDKLKRFKFRIYSHNVKNGGYHSLVPGEQQWSDRLIPLV ASIKFNTFPDNSIVTLQEVYKFQMLDILKELNRHEPGKWDYYGKGRIDGEEMGEFVPNNL ATY*WELMYSDTMWLNDKDPRTSFEGWDAVYARIVSYVTLRHKATDNYINVFNTHFDHIG QEAQVGSAKLIIDTMKELNSWPSFLSGDLNIEPNSDPYKVLNNYLSNTANLATPFNKFGH FKSTVTGFEGEVLLDGGQNIDYIFAPKYTLKMDSDENTKGAGELTTDISLKLYQLGMLHS KFNGKYISDHRPLVADFIMN*I* This ORF will remain as-is, until further information becomes available. ______________________________________________________________________ Sequence has changed a great deal between Assembly 19 and Assembly 20: orf19.2813 The Assembly 19 and Assembly 20 orf19.2813 sequences share a small piece at the N terminus. The Assembly 20 ORF orf19.2813 is significantly shorter than Assembly 19 orf19.2813. > orf19.2813_old Assembly 19 MSESDQLNSTNNTGNTFISLDQASTIMADQTQGGNPQQVMSLARLLNLDNIGAMKAFESVDFPESLKLESKINFQVWRNEILRYARGIGAEFENFVLNETPAHSYDLRLGNMLHQLLIRTVKEKVRMPRQELGKSGKELYLDLIKSFGTQYPYDKFEIVKYYWDQLTNPLINVKRRFEIEEVWVQYINAQTATEREVLNSFVWLHLSKSILPQEYLRSAHPVLDKNVIKIFLDTHPKCDIDQIMSFVNNESINYVGKNDTRENDMGQNLRESDLRESDLSENDIQQNELSESDSSENDLREIATKETVSELFENQCQNCFGLGHDSYECSSAFRNNQYIPDLFSRLQSFRGNRIQNNNRNVWSRFSEQDESIANTEKGN > orf19.2813_new Assembly 20 MSESDQLNSTNKRENTFLSQNQAKTIMADQAQGVNPLHGVGL This ORF will remain as-is, until further information becomes available. ______________________________________________________________________ Sequence has changed a great deal between Assembly 19 and Assembly 20: orf19.6833.2 orf19.6833.2: The sequence of the ORF with this name in Assembly 19 is very different from the Assembly 20 orf19.6833.2. The Assembly 20 sequence does hit the corresponding Assembly 19 ORF by TBLASTN. 
> orf19.6833.2_old Assembly 19 MSHVSIITGASRGGIGKAIAEVLLKDPNTKVVVVARTEAPLEALANKYGSDRVDFVVGDITDSSTSEKAVESAISKFGQLNSIIANAGVLEPVGPIESTSVDDWKRLYDINLFAVVELIKHSLPHLKKTNGKVIAVSSGAATKAYSGWYAYGSSKAALNHLILSLASDEKDVQAISIAPGVVDTEMQTDIREKFGKNMTPEGLQRFVDLHENKQLASPEEPGTVYAKLALQEWSEDLNGKFLRYNDEILKAYQL > orf19.6833.2_new Assembly 20 MTLPLVFFKCGKSCFINSTTANKLISYNLFQSSTEVDSIGPTGSKTPAFAIIEFNCPNFEIASLTAFSEVDESVMSPTTKSTLSEPYLFVRASNGACVLATTTTFVLGSFKSTSAIALPIP This ORF will remain as-is, until further information becomes available. ______________________________________________________________________ Sequence has changed a great deal between Assembly 19 and Assembly 20: orf19.6922 The start of the ORF appears to have moved upstream a bit, and the reading frame has changed entirely, so that the translations are unrelated. The ORF in Assembly 20 is significantly smaller than the ORF in Assembly 19. > orf19.6922_old Assembly 19 MKESFMYKSVDARECRFSIWYCCVNDSGIHMERGTYLPLLRHSSLYDALLYNWIDGDKRIHKILAKMGVPIVAAKQQWQYLDPPIKNKLPGLLKKYLPELPQVEIFYRCGVTSMDVF > orf19.6922_new Assembly 20 MSQSYEGIVYVQAGRCKRMSFFYMVLLCQ This ORF will remain as-is, until further information becomes available. ______________________________________________________________________ Sequence has changed a great deal between Assembly 19 and Assembly 20: orf19.1860 (Note: Assembly 19 sequence of orf19.1860 updated 14 March 2007) The orf19.1860 was once listed with two distinct sets of coordinates as follows (in an Annotation Working Group file): orf19.1860 Contig19-10133 join(33030..33062,33454..>33720) orf19.1860 Contig19-10135 <1..1359 Subsequently the AWG renamed the orf19.1860 Contig19-10135 <1..1359 to its current name, orf19.1860.1. The ORF orf19.1860 was originally entered incorrectly into CGD from the older AWG data. This problem affected only the Assembly 19 (contig) coordinates of the ORF, and these coordinates have been updated in CGD as of March 15, 2007. 
The coordinates of orf19.1860 on the Assembly 20 chromosomes have not changed; these coordinates were not affected. Before correction, orf19.1860 in Assembly 19 had multiple internal stops. > orf19.1860_old Assembly 19 (erroneous coordinates listed on Contig19-10135) *S*SGNLLKIIPRQQ**HRFCRVWFATQKPHLAIDFPVDILATLPTTFSPIASDAFLYR*TQPRNV*SFSSHL*PKFHR*TTH*KYQ*IHCKLHCPW*PI > orf19.1860 Assembly 19 (corrected coordinates listed on Contig19-10133) MLSRSFARISRSAAQQKRFLSLHEYRSAALLSEYGVPIPKGYPATTPEGAYDAAKKLGTNELVIKAQALTGGRGKGHFDSGLQGGVKLISSAEEAKDLAS > orf19.1860_new Assembly 20 MLSRSFARISRSAAQQKRFLSLHEYRSAALLSEYGVPIPKGYPATTPEGAYDAAKKLGTNELVIKAQALTGGRGKGHFDSGLQGGVKLISSAEEAKDLASQMLNHKLITKQTGAAGKEVTAVYIVERRDAASEAYVAILMDRTRQTPVIVASAQGGMDIEGVAAKDPSAIKTFPVPLEEGVSDSLATEIAGALGFTQDAIPEAAKTIQSLYKCFIERDCTQVEINPLSETPDHKVLAMDAKLGFDDNASFRQEEVFSWRDPTQEDPQEAEAGKYGLNFIKLDGNIANIVNGAGLAMATMDIIKLYGGEPANFLDCGGTATPETIEKAFELILSDPKVNGIFVNIFGGIVRCDYVAKGLIAATKNFNLDIPVVVRLQGTNMAEAKELIDNSGLKLYAFEDLDPAAEKIVQLAPKNN By TBLASTN the Assembly 20 orf19.1860 sequence hits near the end of Contig19-10133 a kb or so from orf19.1857, and hits Contig 19-10065 near orf19.710. The orf19.710 was merged with orf19.186 in Assembly 20. ______________________________________________________________________ Problem ORFs with small changes in the position of gaps/introns: Changes in the position of gaps/introns (a gap/intron that had "slid" or "slipped"), without other changes to the annotation of the region, appear to be due to some problem with file manipulations during generation of Assembly 20. All gap/intron changes between Assembly 19 and 20 are listed in the file "intronChangesInAssembly20.txt," which is available from the CGD Downloads page, http://www.candidagenome.org/DownloadContents.shtml or via http://www.candidagenome.org/help/SequenceHelp.shtml. 
In eight such cases, an internal stop codon was generated in Assembly 20 in ORFs that did not have such internal stops in Assembly 19 (and in which the underlying nucleotide sequence was unchanged between the two assemblies). These ORFs are: orf19.1261, orf19.130, orf19.1639, orf19.1693, orf19.2440, orf19.3245, orf19.4136, orf19.5880.

After the initial loading of the Annotation Working Group's Assembly 20 data into CGD, CGD will adjust the position of these gaps to restore their position as defined in Assembly 19. Files subsequently generated from CGD will therefore contain the updated sequence and coordinate data.
______________________________________________________________________
Problem ORFs with incomplete codons

Some ORFs in Assembly 20 have coding sequences (CDSs, from which any intronic sequence is removed) with lengths, in nucleotides, that are not multiples of three. This is likely due to an error in the coordinates. If the incomplete codon falls at the N terminus, it will disrupt the reading frame. If it results from a truncation at the C terminus, we would expect that the ORF will be included on the list of ORFs that lack stop codons. If it results from a C-terminal extension of one or two basepairs, it would not affect the predicted translated sequence, and would therefore not be included on other lists of problem ORFs. C-terminal extensions of three basepairs or greater will be detected on the list of ORFs with internal stop codons.

The ORFs with incomplete codons are listed in the file OrfsWithPartialTerminalCodonInAssembly20.txt.
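
The length check described above can be sketched in a few lines of Python. This is only an illustration, not the procedure CGD used: the function names cds_length and trailing_bases are hypothetical, and the parser assumes simple EMBL-style location strings (start..end spans, optionally wrapped in join() or complement(), with optional < or > partial markers) like the ones quoted in this file.

```python
import re


def cds_length(location: str) -> int:
    """Return the total CDS length in nucleotides for an EMBL-style location.

    Hypothetical helper: it sums every start..end span it finds, so
    join(...) and complement(...) wrappers and the partial markers
    '<' and '>' are tolerated but otherwise ignored.
    """
    spans = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    if not spans:
        raise ValueError(f"no start..end spans found in {location!r}")
    # Coordinates are 1-based and inclusive, hence the +1 per span.
    return sum(int(end) - int(start) + 1 for start, end in spans)


def trailing_bases(location: str) -> int:
    """Return the number of bases left over after whole codons (0, 1 or 2)."""
    return cds_length(location) % 3


if __name__ == "__main__":
    # Location strings copied from the feature-table excerpts in this file.
    for loc in ("393785..394534", "complement(855407..855846)"):
        print(loc, cds_length(loc), trailing_bases(loc))
```

A nonzero result from trailing_bases only says that the coordinates as written do not span a whole number of codons. It does not say which end of the ORF is affected, and it does not indicate whether the ORF appears in OrfsWithPartialTerminalCodonInAssembly20.txt.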