This file contains data for C. albicans SC5314.

The various columns in the mapping summary files are:

Query name : Name of the ORF or contig from the historic assembly that is being mapped to the newer assembly (Assembly 19, 20, or 21)
Best HSP match : Name of the Assembly 20/21 chromosome or Assembly 19 contig that contains the sequence match to the query ORF or contig
Best HSP range : Start and stop coordinates of the best match on the Ca20/Ca21 chromosome or Ca19 contig sequence
Best HSP strand : Strand of the best match (W or C)
Best HSP score : Score of the best HSP match, as reported by BLAST
Best HSP e-value : E-value of the best HSP match, as reported by BLAST
Percent length of query in HSP : Percent of the total length of the query sequence that is contained in the best HSP alignment
Best HSP percent identity : Percent identity between the query and the match, over the region of the best HSP
Best HSP percent gaps : Total gaps in the alignment, as a percent of the length of the best HSP
Match class (pass/fail) : Match class assignment, based on the criteria described below

------------------

Mapping to Assembly 21

BLAST analysis was performed to map contig and ORF sequences from each of the older assemblies to the Assembly 21 chromosomes. The thresholds defining a "good match" were set to:

  i) E-value threshold < 1E-5
  ii) Percent of query sequence in the alignment > 50%, by length
  iii) Percent identity of best HSP match > 60%
  iv) Total gaps in alignment < 15% of best HSP

If the best HSP match passed the above criteria, the match was subject to a name/alias-based test. The HSP was checked to see that the name(s) of the ORFs contained within the query sequence correspond to the name(s) of the ORFs (standard names or aliases) contained within the Assembly 21 matching sequence.
In the case of ORFs from these historic assemblies, the best match sequence from Assembly 21 has to overlap with its correspondingly named ORF by at least 50% of the length of the query ORF. In the case of historic contigs, the best HSP match has to have some overlap with at least half of the expected ORFs.

If the best HSP match passes both the match thresholds and the name/alias overlap requirements, the mapping to Assembly 21 is considered to be a good match, is marked "pass" in the summary files listed below, and the match is displayed in the CGD GBrowse Genome Browser. All others are considered to be poor matches, are marked "fail" in the summary files, and are not displayed in the Genome Browser.

The following individual files, listing the best match mapping to Assembly 21 and whether it passes or fails the above criteria, are available for download from this directory.

Assembly20ORFsMappedtoAssembly21.tab
Assembly19ORFsAWGMappedtoAssembly21.tab
Assembly19ORFsSGTCMappedtoAssembly21.tab
Assembly6ORFsMappedtoAssembly21.tab
Assembly19ContigsMappedtoAssembly21.tab
Assembly6ContigsMappedtoAssembly21.tab
Assembly5ContigsMappedtoAssembly21.tab
Assembly4ContigsMappedtoAssembly21.tab
Assembly6OrfsMappedtoAssembly21Orfs.tab

Assembly6OrfsMappedtoAssembly21Orfs.tab contains a mapping between Assembly 6 ORFs and Assembly 21 ORFs. Its format is similar to that of Assembly6ORFsMappedtoAssembly21.tab, except that it reports the best match between each Assembly 6 ORF and an ORF from Assembly 21, whereas Assembly6ORFsMappedtoAssembly21.tab reports the best match between each Assembly 6 ORF and a region of an Assembly 21 chromosome. This file was generated upon specific request by a CGD user. If you would find it useful to have a similar mapping between other assemblies, please contact us: http://www.candidagenome.org/cgi-bin/suggestion

Generation of A21 is described in part in van het Hoog M, et al.
(2007) Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes. Genome Biol 8(4):R52. URL: http://genomebiology.com/2007/8/4/R52.

The SGTC ORFs and Assembly 19 Contigs are those released by the Stanford Genome Technology Center (SGTC), which sequenced and assembled the Candida albicans genome, as described in Jones et al. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334. The AWG ORFs are the ORFs after modification by the Annotation Working Group (AWG), as described in Braun et al. (2005) A human-curated annotation of the Candida albicans genome. PLoS Genet. 2005 Jul;1(1):36-57.

-------------------

Mapping to Assembly 20

BLAST analysis was performed to map contig and ORF sequences from each of the older assemblies to the Assembly 20 chromosomes. The thresholds defining a "good match" were set to:

  i) E-value threshold < 1E-5
  ii) Percent of query sequence in the alignment > 50%, by length
  iii) Percent identity of best HSP match > 60%
  iv) Total gaps in alignment < 15% of best HSP

If the best HSP match passed the above criteria, the match was subject to a name/alias-based test. The HSP was checked to see that the name(s) of the ORFs contained within the query sequence correspond to the name(s) of the ORFs (standard names or aliases) contained within the Assembly 20 matching sequence. In the case of ORFs from these historic assemblies, the best match sequence from Assembly 20 has to overlap with its correspondingly named ORF by at least 50% of the length of the query ORF. In the case of historic contigs, the best HSP match has to have some overlap with at least half of the expected ORFs.

If the best HSP match passes both the match thresholds and the name/alias overlap requirements, the mapping to Assembly 20 is considered to be a good match, is marked "pass" in the summary files listed below, and the match is displayed in the CGD GBrowse Genome Browser.
All others are considered to be poor matches, are marked "fail" in the summary files, and are not displayed in the Genome Browser.

The following individual files, listing the best match mapping to Assembly 20 and whether it passes or fails the above criteria, are available for download from this directory.

Assembly19ContigsMappedtoAssembly20.tab
Assembly5ContigsMappedtoAssembly20.tab
Assembly6ContigsMappedtoAssembly20.tab
Assembly4ContigsMappedtoAssembly20.tab
Assembly19OrfsAWGMappedtoAssembly20.tab
Assembly19OrfsSGTCMappedtoAssembly20.tab
Assembly6OrfsMappedtoAssembly20.tab

The SGTC ORFs and Assembly 19 Contigs are those released by the Stanford Genome Technology Center (SGTC), which sequenced and assembled the Candida albicans genome, as described in Jones et al. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334. The AWG ORFs are the ORFs after modification by the Annotation Working Group (AWG), as described in Braun et al. (2005) A human-curated annotation of the Candida albicans genome. PLoS Genet. 2005 Jul;1(1):36-57.

-------------------

Mapping to Assembly 19

BLAST analysis was performed to map contig and ORF sequences from each of the older assemblies to the Assembly 19 supercontigs. The match thresholds are the same as those used to generate the mappings to the Assembly 20 chromosomes, as described above:

  i) E-value threshold < 1E-5
  ii) Percent of query sequence in the alignment > 50%, by length
  iii) Percent identity of best HSP match > 60%
  iv) Total gaps in alignment < 15% of best HSP

If the best HSP match passes these match thresholds, it is considered a good mapping to Assembly 19, labeled "pass" in the summary files, and displayed in the CGD GBrowse Genome Browser. All matches that do not meet the criteria are considered to be poor matches, and are labeled "fail" in the summary files.
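
As an illustration, the four match thresholds above can be written as a small Python check. This is only a sketch under stated assumptions, not the script used by CGD: the function names are hypothetical, and the row parser assumes ten tab-separated fields in the column order listed at the top of this file, with plain numeric values (a trailing "%" is tolerated) and no header row. None of these format details is confirmed by this README. The sketch covers only the threshold test, which is the whole criterion for Assembly 19; it does not implement the additional name/alias test applied to the Assembly 20 and Assembly 21 mappings.

```python
def passes_thresholds(evalue, pct_query_in_hsp, pct_identity, pct_gaps):
    """Return True if a best-HSP match meets all four thresholds listed above."""
    return (
        evalue < 1e-5                 # i)   E-value threshold
        and pct_query_in_hsp > 50.0   # ii)  percent of query length in the alignment
        and pct_identity > 60.0       # iii) percent identity of the best HSP
        and pct_gaps < 15.0           # iv)  total gaps as a percent of the best HSP
    )


def classify_row(line):
    """Classify one summary row as "pass" or "fail" using the thresholds only.

    ASSUMPTION: the row has ten tab-separated fields in the order
    query name, best HSP match, range, strand, score, e-value,
    percent query length, percent identity, percent gaps, match class.
    """
    fields = line.rstrip("\n").split("\t")
    evalue = float(fields[5])
    pct_len = float(fields[6].rstrip("%"))
    pct_ident = float(fields[7].rstrip("%"))
    pct_gaps = float(fields[8].rstrip("%"))
    ok = passes_thresholds(evalue, pct_len, pct_ident, pct_gaps)
    return "pass" if ok else "fail"
```

Comparing the result of classify_row with the match class column already present in a summary file is one way to check whether the assumed column order holds for that file.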

The following individual summary files, listing the best match mapping to Assembly 19 and whether it passes or fails the above criteria, are available for download from this directory.

Assembly6ContigsMappedtoAssembly19.tab
Assembly5ContigsMappedtoAssembly19.tab
Assembly4ContigsMappedtoAssembly19.tab
Assembly6OrfsMappedtoAssembly19.tab

------------------

Sequences used for older assemblies:

Assem20Orf - download/sequence/C_albicans_SC5314/Assembly20/current/orf_genomic_assembly_20.fasta.gz
Assem19Contig - /download/sequence/C_albicans_SC5314/Assembly19/current/Ca19-supercontigs.fasta.gz
Assem19Orf (SGTC) - /download/sequence/C_albicans_SC5314/Assembly19/archived_as_released/Ca-Assembly19.orf.gz
Assem19Orf (AWG) - /download/sequence/C_albicans_SC5314/Assembly19/current/orf_genomic_assembly_19.fasta.gz
Assem6Contig - /download/sequence/C_albicans_SC5314/Assembly6/Ca-Assembly6.contigs.gz
Assem6Orf - /download/sequence/C_albicans_SC5314/Assembly6/Ca-Assembly6.orf.gz
Assem5Contig - /download/sequence/C_albicans_SC5314/Assembly5/Ca-Assembly5.contigs.gz
Assem4Contig - /download/sequence/C_albicans_SC5314/Assembly4s/Ca-Assembly4.contigs.gz

------------------

orf19_orf6_mapping.txt

This file provides a mapping from the names of the Open Reading Frames identified in Assembly 19 to the names of the ORFs in Assembly 6. This mapping was done by blasting the haploid set of orf19 predicted proteins (file available at http://www.candidagenome.org/download/sequence/genomic_sequence/orf_protein/orf_trans_all_haploid.fasta, as of October 27, 2005) against orf6 predicted proteins (file from the Stanford Genome Technology Center, downloaded from http://www.candidagenome.org/download/sequence/genomic_sequence/archived_assemblies/Ca-Assembly6.orf_trans). The best hit, or hits with >90% identity, were retained.
The pairs were subsequently screened, such that if an orf6 in an orf19-orf6 pairing had a more significant hit to a different orf19, then the less significant pairing was removed. In cases where multiple orf6 matches were observed for a single orf19, some subsequent manual curation was performed to remove pairs with less significant E-values. An attempt was made to ensure that adjacent orf6 ORFs aligned with adjacent orf19 ORFs; however, this approach proved not to be helpful as a measure of validation due to apparent regions of misassembly in Assembly 6.

Note that this is not necessarily a 1-to-1 mapping; some ORFs have multiple matches.

The file of pairings contains the following columns:

Column  Description
1       The orf19 identifier
2       The Assembly 19 Contig from which the orf19 ORF derives
3       The orf6 identifier
4       The Assembly 6 Contig from which the orf6 ORF derives
5       E, the expectation or E-value
6       N, the number of scores considered jointly in computing E
7       Sprime, the normalized alignment score, expressed in units of bits
8       S, the raw alignment score
9       alignlen, the overall length of the alignment including any gaps
10      nident, the number of identical letter pairs
11      npos, the number of letter pairs contributing a positive score
12      nmism, the number of mismatched letter pairs
13      pcident, percent identity over the alignment length (as a fraction of alignlen)
14      pcpos, percent positive letter pairs over the alignment length (as a fraction of alignlen)
15      qgaps, number of gaps in the query sequence
16      qgaplen, total length of all gaps in the query sequence
17      sgaps, number of gaps in the subject sequence
18      sgaplen, total length of all gaps in the subject sequence
19      qframe, the reading frame in the query sequence (+0 for protein sequences in BLASTP and TBLASTN searches)
20      qstart, the starting coordinate of the alignment in the query sequence
21      qend, the ending coordinate of the alignment in the query sequence
22      sframe, the reading frame in the subject sequence (+0 for protein sequences in BLASTP and BLASTX searches)
23      sstart, the starting coordinate of the alignment in the subject sequence
24      send, the ending coordinate of the alignment in the subject sequence

------------------

orf4_orf19_mapping.txt

This file provides a mapping from the names of the Open Reading Frames identified in Assembly 4 to the names of the ORFs in Assembly 19. This mapping is based on a mapping provided by Judy Berman, with some additional manual curation.
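
For readers working with orf19_orf6_mapping.txt, the 24 columns listed in that section could be loaded along the following lines. This is a rough sketch, not a CGD tool: the short key names for columns 1 to 4 and both function names are hypothetical, and the code assumes the file is tab-delimited with one pairing per row, which this README does not state. Rows that do not have exactly 24 fields (for example a header or comment line) are skipped. Grouping by orf19 identifier makes the non-1-to-1 cases easy to spot.

```python
import csv
from collections import defaultdict

# Keys for columns 1-4 are hypothetical labels; the rest follow the names in the column list.
COLUMNS = [
    "orf19", "orf19_contig", "orf6", "orf6_contig",
    "E", "N", "Sprime", "S", "alignlen", "nident", "npos", "nmism",
    "pcident", "pcpos", "qgaps", "qgaplen", "sgaps", "sgaplen",
    "qframe", "qstart", "qend", "sframe", "sstart", "send",
]


def read_pairs(lines):
    """Parse tab-delimited rows into dicts keyed by column name (values stay strings)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip any row whose field count does not match the 24-column layout.
    return [dict(zip(COLUMNS, row)) for row in reader if len(row) == len(COLUMNS)]


def group_by_orf19(pairs):
    """Map each orf19 identifier to the list of orf6 identifiers paired with it."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair["orf19"]].append(pair["orf6"])
    return dict(groups)
```

With an open file handle passed as lines, any orf19 whose list in the grouped result has more than one entry is one of the ORFs with multiple matches.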