Index of /download/External_id_mappings

 Name                                Last modified      Size  Description
 Parent Directory                                         -   
 CGDID_2_GeneID.tab.gz               2009-02-02 10:42   65K  
 CGDID_2_RefSeqID.tab.gz             2009-02-02 10:42   54K  
 gp2protein.cgd.gz                   2022-12-06 11:41  166K  
 gp2protein_C_albicans_SC5314.gz     2023-06-29 09:53   37K  
 gp2protein_C_dubliniensis_CD36.gz   2023-06-29 09:53   32K  
 gp2protein_C_glabrata_CBS138.gz     2023-06-29 09:53   30K  
 gp2protein_C_parapsilosis_CDC317.gz 2023-06-29 09:53   31K

This directory contains mappings between CGD features and sequences
from external resources, such as the UniProt/Swiss-Prot, RefSeq and
Entrez Gene databases. Each of these mappings was generated at CGD
using sequence similarity, as explained in detail below.

As of June 2011, only C. albicans genes are included.  We will add
mappings for genes from other species in CGD in the future.

* gp2protein.cgd.gz This gzipped file contains mappings of CGD
features to UniProt/Swiss-Prot protein sequence records.  The
gp2protein.cgd.gz file is also available for download at the Gene
Ontology Consortium web site at:
http://www.geneontology.org/gp2protein/gp2protein.cgd.gz

* CGDID_2_RefSeqID.tab.gz This gzipped file contains mappings of CGD
features to Entrez RefSeq Nucleotide sequence records.

* CGDID_2_GeneID.tab.gz This gzipped file contains mappings of CGD
features to Entrez Gene records.


Each of the files has two columns:

1) CGDID: Primary CGD identifier (e.g. CAL0069221) of the CGD gene or
   feature that is being mapped.
2) External ID: Identifier of the matching entry at the corresponding
   external resource.
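
The two-column layout described above can be read with a few lines of
Python. This is a minimal sketch, not a CGD-provided tool. It assumes
the columns are tab-delimited (as the .tab extension suggests), that
there is no header row, and that a CGDID may appear on more than one
line. The function names and the external IDs in the sample are
hypothetical placeholders.

```python
import gzip
import io


def read_mapping(handle):
    """Parse two-column, tab-delimited lines into {CGDID: [external IDs]}."""
    mapping = {}
    for line in handle:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' comment lines carry no data.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines rather than guessing
        cgdid, external_id = fields[0], fields[1]
        mapping.setdefault(cgdid, []).append(external_id)
    return mapping


def read_mapping_file(path):
    """Open a gzipped mapping file in text mode and parse it."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return read_mapping(handle)


# Small in-memory sample; EXT_ID_1 and EXT_ID_2 are made-up placeholders.
sample = io.StringIO("CAL0069221\tEXT_ID_1\nCAL0069221\tEXT_ID_2\n")
print(read_mapping(sample))
```

For a downloaded file, a call such as
read_mapping_file("CGDID_2_GeneID.tab.gz") would return the same kind
of dictionary.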


--------------------

Procedure used for Entrez RefSeq mapping, based on sequence
similarity:

We downloaded all sequences from the Entrez Nucleotide database with
an organism-specific query for 'Candida albicans SC5314'. Then, we
performed BLAST comparisons of these sequences against the haploid
set of Assembly 19 sequences.  The following strict thresholds were
used to ensure good quality matches: i) E-value < 1E-5; ii) Percent
of query sequence in the alignment = 100%; iii) Percent of matching
sequence in the alignment = 100%; iv) Percent identity of best HSP =
100%.  For the sequences that could not be mapped to CGD genes in
this first step, we ran another BLAST analysis against the diploid
set of Assembly 19 sequences. At this step, we were able to find
additional mappings to allelic genes. The same strict thresholds were
used for this second run of BLAST analysis.
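
The four thresholds and the two-pass order (haploid set first, diploid
set for anything still unmapped) can be sketched as follows. This is
an illustration of the filtering logic only, not the script used at
CGD. The Hit record, its field names, and both function names are
hypothetical, and the sketch assumes that BLAST results have already
been parsed into such records with the best HSP listed first for each
query.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One parsed BLAST hit (hypothetical record; field names are assumptions)."""
    subject_id: str
    query_length: int
    subject_length: int
    query_aligned: int      # query positions covered by the alignment
    subject_aligned: int    # subject positions covered by the alignment
    percent_identity: float
    evalue: float


def passes_strict_thresholds(hit):
    """Apply thresholds i) to iv) from the procedure above."""
    return (
        hit.evalue < 1e-5
        and hit.query_aligned == hit.query_length        # 100% of query
        and hit.subject_aligned == hit.subject_length    # 100% of match
        and hit.percent_identity >= 100.0                # 100% identity
    )


def two_pass_mapping(query_ids, haploid_hits, diploid_hits):
    """Map each query using haploid hits first, then diploid hits."""
    mapped = {}
    for query_id in query_ids:
        for hits_by_query in (haploid_hits, diploid_hits):
            passing = [h for h in hits_by_query.get(query_id, [])
                       if passes_strict_thresholds(h)]
            if passing:
                # Assumption: hits are ordered best first.
                mapped[query_id] = passing[0].subject_id
                break
    return mapped


# Made-up example: q1 maps in the haploid pass, q2 only in the diploid pass.
full = Hit("orfA", 300, 300, 300, 300, 100.0, 1e-50)
partial = Hit("orfB", 300, 400, 300, 300, 100.0, 1e-50)
print(two_pass_mapping(["q1", "q2"],
                       {"q1": [full], "q2": [partial]},
                       {"q2": [full]}))
```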

Using the above procedure, we were able to map 11517 RefSeq Nucleotide
sequences to CGD genes.

To generate mappings of CGD genes to Entrez Gene records, we used the
RefSeq ID to Entrez Gene ID associations provided at NCBI
(ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz). We used the CGD
ID to RefSeq ID mappings and the gene2refseq file downloaded from NCBI
to get mappings between CGD ID and Entrez Gene IDs.
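
Deriving CGD ID to Entrez Gene ID mappings in this way amounts to a
join on the RefSeq ID. The sketch below shows that join on in-memory
dictionaries. It is an illustration under stated assumptions, not the
code used at CGD: the function name is hypothetical, the IDs are
made-up placeholders, and the RefSeq IDs on both sides are assumed to
be directly comparable (for example, both with or both without a
version suffix). Loading gene2refseq itself is left out, since its
column layout should be checked against the documentation NCBI
provides alongside the file.

```python
def join_on_refseq(cgdid_to_refseq, refseq_to_geneid):
    """Combine CGDID -> RefSeq IDs with RefSeq ID -> Gene ID.

    Returns {CGDID: [Gene IDs]}, dropping CGDIDs with no Gene ID match.
    """
    cgdid_to_geneid = {}
    for cgdid, refseq_ids in cgdid_to_refseq.items():
        gene_ids = []
        for refseq_id in refseq_ids:
            gene_id = refseq_to_geneid.get(refseq_id)
            # Keep each Gene ID once, in first-seen order.
            if gene_id is not None and gene_id not in gene_ids:
                gene_ids.append(gene_id)
        if gene_ids:
            cgdid_to_geneid[cgdid] = gene_ids
    return cgdid_to_geneid


# Placeholder identifiers for illustration only.
cgdid_to_refseq = {
    "CGDID_A": ["REFSEQ_1", "REFSEQ_2"],
    "CGDID_B": ["REFSEQ_3"],
}
refseq_to_geneid = {"REFSEQ_1": "GENE_10", "REFSEQ_2": "GENE_10"}
print(join_on_refseq(cgdid_to_refseq, refseq_to_geneid))
```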

-----------------

Procedure used for UniProt mapping for generation of the
gp2protein.cgd file:

The mappings were generated by using the latest set of C. albicans,
C. glabrata, C. parapsilosis and C. dubliniensis protein sequences as
queries in BLAST searches against the complete corresponding proteome
datasets downloaded from UniProt.  BLAST matches were retained only if
they were an exact match over the full length of the query sequence.