Index of /download/mapping_historic_assemblies

Name                                        Last modified      Size
orf4_orf19_mapping_README.txt               2011-05-23 16:08   280
orf19_orf6_mapping_README.txt               2011-05-23 16:07   2.8K
Assembly19ContigsMappedtoAssembly21.tab     2023-06-29 09:53   39K
Assembly19ContigsMappedtoAssembly20.tab     2023-06-29 09:53   43K
Assembly6ContigsMappedtoAssembly19.tab      2023-06-29 09:53   91K
Assembly6ContigsMappedtoAssembly21.tab      2023-06-29 09:53   105K
Assembly6ContigsMappedtoAssembly20.tab      2023-06-29 09:53   109K
Assembly4ContigsMappedtoAssembly21.tab      2023-06-29 09:53   152K
Assembly5ContigsMappedtoAssembly21.tab      2023-06-29 09:53   156K
Assembly5ContigsMappedtoAssembly19.tab      2023-06-29 09:53   156K
Assembly5ContigsMappedtoAssembly20.tab      2023-06-29 09:53   158K
Assembly4ContigsMappedtoAssembly20.tab      2023-06-29 09:53   161K
Assembly4ContigsMappedtoAssembly19.tab      2023-06-29 09:53   162K
orf4_orf19_mapping.txt                      2023-06-29 09:53   202K
Assembly20ORFsMappedtoAssembly21.tab        2023-06-29 09:53   389K
Assembly6OrfsMappedtoAssembly21Orfs.tab     2023-06-29 09:53   532K
Assembly6ORFsMappedtoAssembly21.tab         2023-06-29 09:53   603K
Assembly6OrfsMappedtoAssembly19.tab         2023-06-29 09:53   635K
Assembly6OrfsMappedtoAssembly20.tab         2023-06-29 09:53   648K
Assembly19ORFsAWGMappedtoAssembly21.tab     2023-06-29 09:53   789K
Assembly19OrfsAWGMappedtoAssembly20.tab     2023-06-29 09:53   801K
orf19_orf6_mapping.txt                      2023-06-29 09:53   912K
Assembly19ORFsSGTCMappedtoAssembly21.tab    2023-06-29 09:53   935K
Assembly19OrfsSGTCMappedtoAssembly20.tab    2023-06-29 09:53   1.0M


This file contains data for C. albicans SC5314.  
The various columns in the mapping summary files are:

Query name : Name of the ORF or contig from the historic assembly that
is being mapped to the newer assembly (Assembly 19, 20, or 21).

Best HSP match : Name of the Assembly 20/21 chromosome or Assembly 19
contig that contains the sequence match to the query ORF or contig

Best HSP range : Start and stop coordinates of the best match on the
Ca20/Ca21 chromosome or Ca19 contig sequence

Best HSP strand : Strand of the best match (W or C)

Best HSP score : Score of the best HSP match, as reported by BLAST

Best HSP e-value : E-value of the best HSP match, as reported by BLAST

Percent length of query in HSP : Percent of the total length of the
query sequence that is contained in the best HSP alignment

Best HSP percent identity : Percent identity between the query and the
match, over the region of the best HSP

Best HSP percent gaps : Total gaps in the alignment, as a percent of
the length of the best HSP

Match class (pass/fail) : Match class assignment, based on the
criteria described below
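
As an illustration of the column layout above, the following Python
sketch reads lines of a mapping summary file into named records and
keeps the rows marked "pass". It assumes the files are tab-delimited
with the ten columns in the order listed; this README does not state
the delimiter, whether a header line is present, or how the range is
written, so all values are kept as text. The field names and sample
lines are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class MappingRow(NamedTuple):
    # Field names are illustrative; the order follows the column list above.
    query_name: str
    best_hsp_match: str
    best_hsp_range: str
    best_hsp_strand: str
    best_hsp_score: str
    best_hsp_evalue: str
    pct_query_in_hsp: str
    pct_identity: str
    pct_gaps: str
    match_class: str


def parse_summary_line(line: str) -> MappingRow:
    """Split one tab-delimited line into a MappingRow (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(MappingRow._fields):
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return MappingRow(*fields)


def passing_rows(lines: Iterable[str]) -> Iterator[MappingRow]:
    """Yield rows whose match class is 'pass'; skip blank or malformed lines."""
    for line in lines:
        if not line.strip():
            continue
        try:
            row = parse_summary_line(line)
        except ValueError:
            continue
        if row.match_class.strip().lower() == "pass":
            yield row
```

A header line, if one exists, would not have "pass" in its last field
and would therefore be skipped by passing_rows.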

------------------


Mapping to Assembly 21

BLAST analysis was performed to map contig and ORF sequences from
each of the older assemblies to the Assembly 21 chromosomes. The
thresholds defining a "good match" were set to:

i) E-value < 1E-5
ii) Percent of query sequence in the alignment > 50%, by length
iii) Percent identity of best HSP match > 60%
iv) Total gaps in alignment < 15% of best HSP
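
The four thresholds can be expressed as a single check. The Python
sketch below is only an illustration of the stated cutoffs, not the
code used to build the files; the function and parameter names are
assumptions, and percentages are taken to be numbers on a 0-100 scale.

```python
# Cutoffs as stated in the text; all comparisons are strict.
MAX_EVALUE = 1e-5
MIN_PCT_QUERY_IN_HSP = 50.0
MIN_PCT_IDENTITY = 60.0
MAX_PCT_GAPS = 15.0


def passes_thresholds(evalue, pct_query_in_hsp, pct_identity, pct_gaps):
    """Return True if the best HSP meets all four 'good match' thresholds."""
    return (
        evalue < MAX_EVALUE
        and pct_query_in_hsp > MIN_PCT_QUERY_IN_HSP
        and pct_identity > MIN_PCT_IDENTITY
        and pct_gaps < MAX_PCT_GAPS
    )
```

Because the comparisons are strict, a match with exactly 50% of the
query in the alignment would not pass under this reading.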

If the best HSP match passed the above criteria, the match was subjected
to a name/alias-based test.  The HSP was checked to see that the
name(s) of the ORFs contained within the query sequence correspond to
the name(s) of the ORFs (standard names or aliases) contained within
the Assembly 21 matching sequence.  For ORFs from these
historic assemblies, the best match sequence from Assembly 21 must
overlap with its correspondingly named ORF by at least 50% of the
length of the query ORF. For historic contigs, the best HSP
match must have some overlap with at least half of the expected
ORFs.
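
The overlap requirements in the name/alias test can be sketched in
Python as follows. This is one reading of the description above, not
the original implementation. It assumes that all ranges lie on the
same chromosome, that coordinates are inclusive, and that a contig
with no expected ORFs does not pass; the function names are
hypothetical.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Bases shared by two inclusive ranges (coordinates in either order)."""
    a_lo, a_hi = sorted((a_start, a_end))
    b_lo, b_hi = sorted((b_start, b_end))
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo) + 1)


def orf_passes_name_test(match_range, named_orf_range, query_orf_length):
    """ORF case: overlap with the same-named ORF is >= 50% of the query length."""
    shared = overlap_length(*match_range, *named_orf_range)
    return shared >= 0.5 * query_orf_length


def contig_passes_name_test(match_range, expected_orf_ranges):
    """Contig case: the match overlaps at least half of the expected ORFs."""
    if not expected_orf_ranges:
        return False  # assumption: nothing to confirm means no pass
    hits = sum(
        1 for orf in expected_orf_ranges
        if overlap_length(*match_range, *orf) > 0
    )
    return hits >= 0.5 * len(expected_orf_ranges)
```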

If the best HSP match passes both the match thresholds and the
name/alias overlap requirements, the mapping to Assembly 21 is
considered to be a good match, is marked "pass" in the summary files
listed below, and the match is displayed in the CGD GBrowse Genome
Browser.  All others are considered to be poor matches, are marked
"fail" in the summary files, and are not displayed in the Genome
Browser.

The following individual files, listing the best match mapping to
Assembly 21 and whether it passes or fails the above criteria, are
available for download from this directory.

Assembly20ORFsMappedtoAssembly21.tab
Assembly19ORFsAWGMappedtoAssembly21.tab
Assembly19ORFsSGTCMappedtoAssembly21.tab
Assembly6ORFsMappedtoAssembly21.tab
Assembly19ContigsMappedtoAssembly21.tab
Assembly6ContigsMappedtoAssembly21.tab
Assembly5ContigsMappedtoAssembly21.tab
Assembly4ContigsMappedtoAssembly21.tab
Assembly6OrfsMappedtoAssembly21Orfs.tab

Assembly6OrfsMappedtoAssembly21Orfs.tab contains a mapping between
Assembly 6 ORFs and Assembly 21 ORFs.  The file format is similar to
the Assembly6ORFsMappedtoAssembly21.tab file, except that the
Assembly6OrfsMappedtoAssembly21Orfs.tab file contains the best match
between each Assembly 6 ORF and an ORF from Assembly 21 (instead of
the best match between each Assembly 6 ORF and a region of an Assembly
21 chromosome, which is contained in the
Assembly6ORFsMappedtoAssembly21.tab file).  This file was generated
upon specific request by a CGD user.  If you would find it useful to
have a similar mapping between other assemblies, please contact us:
http://www.candidagenome.org/cgi-bin/suggestion

The generation of Assembly 21 is described in part in van het Hoog M, et
al. (2007) Assembly of the Candida albicans genome into sixteen
supercontigs aligned on the eight chromosomes. Genome Biol
8(4):R52. URL: http://genomebiology.com/2007/8/4/R52.

The SGTC ORFs and Assembly 19 Contigs are those released by the
Stanford Genome Technology Center (SGTC), which sequenced and
assembled the Candida albicans genome, as described in Jones et
al. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334.

The AWG ORFs are the ORFs after modification by the Annotation Working
Group (AWG), as described in Braun et al. (2005) A human-curated
annotation of the Candida albicans genome.  PLoS Genet. 2005
Jul;1(1):36-57.

-------------------



Mapping to Assembly 20

BLAST analysis was performed to map contig and ORF sequences from
each of the older assemblies to the Assembly 20 chromosomes. The
thresholds defining a "good match" were set to:

i) E-value < 1E-5
ii) Percent of query sequence in the alignment > 50%, by length
iii) Percent identity of best HSP match > 60%
iv) Total gaps in alignment < 15% of best HSP

If the best HSP match passed the above criteria, the match was subjected
to a name/alias-based test.  The HSP was checked to see that the
name(s) of the ORFs contained within the query sequence correspond to
the name(s) of the ORFs (standard names or aliases) contained within
the Assembly 20 matching sequence.  For ORFs from these
historic assemblies, the best match sequence from Assembly 20 must
overlap with its correspondingly named ORF by at least 50% of the
length of the query ORF. For historic contigs, the best HSP
match must have some overlap with at least half of the expected
ORFs.

If the best HSP match passes both the match thresholds and the
name/alias overlap requirements, the mapping to Assembly 20 is
considered to be a good match, is marked "pass" in the summary files
listed below, and the match is displayed in the CGD GBrowse Genome
Browser.  All others are considered to be poor matches, are marked
"fail" in the summary files, and are not displayed in the Genome
Browser.

The following individual files, listing the best match mapping to
Assembly 20 and whether it passes or fails the above criteria, are
available for download from this directory.

Assembly19ContigsMappedtoAssembly20.tab
Assembly5ContigsMappedtoAssembly20.tab
Assembly6ContigsMappedtoAssembly20.tab
Assembly4ContigsMappedtoAssembly20.tab

Assembly19OrfsAWGMappedtoAssembly20.tab
Assembly19OrfsSGTCMappedtoAssembly20.tab
Assembly6OrfsMappedtoAssembly20.tab


The SGTC ORFs and Assembly 19 Contigs are those released by the
Stanford Genome Technology Center (SGTC), which sequenced and
assembled the Candida albicans genome, as described in Jones et
al. (2004) The Diploid Genome of Candida albicans. PNAS 101:7329-7334.

The AWG ORFs are the ORFs after modification by the Annotation Working
Group (AWG), as described in Braun et al. (2005) A human-curated
annotation of the Candida albicans genome.  PLoS Genet. 2005
Jul;1(1):36-57.

-------------------

Mapping to Assembly 19

BLAST analysis was performed to map contig and ORF sequences from
each of the older assemblies to the Assembly 19 supercontigs.

The match thresholds are the same as those used to generate the
mappings to the Assembly 20 chromosomes, as described above:

i) E-value < 1E-5
ii) Percent of query sequence in the alignment > 50%, by length
iii) Percent identity of best HSP match > 60%
iv) Total gaps in alignment < 15% of best HSP

If the best HSP match passes these match thresholds, it is considered
a good mapping to Assembly 19, labeled "pass" in the summary files,
and displayed in the CGD GBrowse Genome Browser.  All the matches that
do not meet the criteria are considered to be poor matches, and are
labeled "fail" in the summary files.

The following individual summary files, listing the best match mapping
to Assembly 19 and whether it passes or fails the above criteria, are
available for download from this directory.

Assembly6ContigsMappedtoAssembly19.tab
Assembly5ContigsMappedtoAssembly19.tab
Assembly4ContigsMappedtoAssembly19.tab
Assembly6OrfsMappedtoAssembly19.tab



------------------

Sequences used for older assemblies: 

Assem20Orf - 
/download/sequence/C_albicans_SC5314/Assembly20/current/orf_genomic_assembly_20.fasta.gz

Assem19Contig -
/download/sequence/C_albicans_SC5314/Assembly19/current/Ca19-supercontigs.fasta.gz

Assem19Orf (SGTC) -
/download/sequence/C_albicans_SC5314/Assembly19/archived_as_released/Ca-Assembly19.orf.gz

Assem19Orf (AWG) -
/download/sequence/C_albicans_SC5314/Assembly19/current/orf_genomic_assembly_19.fasta.gz

Assem6Contig - /download/sequence/C_albicans_SC5314/Assembly6/Ca-Assembly6.contigs.gz

Assem6Orf - /download/sequence/C_albicans_SC5314/Assembly6/Ca-Assembly6.orf.gz

Assem5Contig - /download/sequence/C_albicans_SC5314/Assembly5/Ca-Assembly5.contigs.gz

Assem4Contig - /download/sequence/C_albicans_SC5314/Assembly4s/Ca-Assembly4.contigs.gz

------------------

orf19_orf6_mapping.txt

This file provides a mapping from the names of the Open Reading Frames
identified in Assembly 19, to the names of the ORFs in Assembly 6.
This mapping was done by blasting the haploid set of orf19
predicted proteins (file available at
http://www.candidagenome.org/download/sequence/genomic_sequence/orf_protein/orf_trans_all_haploid.fasta,
as of October 27, 2005) against orf6 predicted proteins (file from the
Stanford Genome Technology Center, downloaded from
http://www.candidagenome.org/download/sequence/genomic_sequence/archived_assemblies/Ca-Assembly6.orf_trans).
The best hit, or hits with >90% identity, were retained.  The pairs were
subsequently screened, such that if an orf6 in an orf19-orf6 pairing
had a more significant hit to a different orf19, then the less
significant pairing was removed.  In cases where multiple orf6 matches
were observed for a single orf19, some subsequent manual curation was
performed to remove pairs with less significant E values.  An attempt
was made to ensure that adjacent orf6's aligned with adjacent orf19's;
however, this approach proved not to be helpful as a measure of
validation due to apparent regions of misassembly in Assembly 6.
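
The automated screening step described above can be illustrated with
the following Python sketch. It is an approximation under stated
assumptions, not the procedure actually run: significance is judged by
E-value alone, pairings tied for the best E-value of an orf6 are all
kept, and the later manual curation is not reproduced. The function
name and the (orf19, orf6, E-value) tuple layout are hypothetical.

```python
def screen_pairs(pairs):
    """Keep, for each orf6, only its most significant orf19 pairing(s).

    pairs: iterable of (orf19_id, orf6_id, evalue) tuples.
    Returns the retained tuples in their original order.
    """
    pairs = list(pairs)

    # Lowest (most significant) E-value seen for each orf6.
    best_evalue = {}
    for _orf19, orf6, evalue in pairs:
        if orf6 not in best_evalue or evalue < best_evalue[orf6]:
            best_evalue[orf6] = evalue

    # Drop any pairing whose orf6 has a more significant hit elsewhere.
    return [p for p in pairs if p[2] == best_evalue[p[1]]]
```

An orf19 can still keep several orf6 partners after this step, which
is consistent with the mapping not being 1-to-1.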

Note that this is not necessarily a 1-to-1 mapping; some ORFs have
multiple matches.  The file of pairings contains the following columns:

Column Description: 
1 The orf19 identifier 
2 The Assembly 19 Contig from which the orf19 ORF derives 
3 The orf6 identifier 
4 The Assembly 6 Contig from which the orf6 ORF derives 
5 E, the expectation or E-value 
6 N, the number of scores considered jointly in computing E 
7 Sprime, the normalized alignment score, expressed in units of bits 
8 S, the raw alignment score 
9 alignlen, the overall length of the alignment including any gaps 
10 nident, the number of identical letter pairs 
11 npos, the number of letter pairs contributing a positive score 
12 nmism, the number of mismatched letter pairs 
13 pcident, percent identity over the alignment length (as a fraction of alignlen)
14 pcpos, percent positive letter pairs over the alignment length (as
   a fraction of alignlen) 
15 qgaps, number of gaps in the query sequence
16 qgaplen, total length of all gaps in the query sequence 
17 sgaps, number of gaps in the subject sequence 
18 sgaplen, total length of all gaps in the subject sequence 
19 qframe, the reading frame in the query sequence 
   (+0 for protein sequences in BLASTP and TBLASTN searches) 
20 qstart, the starting coordinate of the alignment in the query sequence
21 qend, the ending coordinate of the alignment in the query sequence
22 sframe, the reading frame in the subject sequence (+0 for protein
   sequences in BLASTP and BLASTX searches) 
23 sstart, the starting coordinate of the alignment in the subject sequence 
24 send, the ending coordinate of the alignment in the subject sequence

------------------

orf4_orf19_mapping.txt

This file provides a mapping from the names of the Open Reading Frames
identified in Assembly 4, to the names of the ORFs in Assembly 19.
This mapping is based on a mapping provided by Judy Berman, with some
additional manual curation.