Index of /download/External_id_mappings
Name                       Last modified       Size   Description
CGDID_2_GeneID.tab.gz      01-Oct-2008 16:43   65K    GZIP compressed document
CGDID_2_RefSeqID.tab.gz    01-Oct-2008 16:43   54K    GZIP compressed document
gp2protein.cgd.gz          01-Oct-2008 16:43   43K    GZIP compressed document
This directory contains mappings between CGD features and sequences from external resources, such as the
Uniprot/Swissprot, RefSeq and Entrez Gene databases. Each of these mappings was generated at CGD using
sequence similarity, as explained in detail below.
* gp2protein.cgd.gz
This gzipped file contains mappings of CGD features to Uniprot/Swissprot protein sequence records.
The gp2protein.cgd.gz file is also available for download from the Gene Ontology Consortium web site at:
http://www.geneontology.org/gp2protein/gp2protein.cgd.gz
* CGDID_2_RefSeqID.tab.gz
This gzipped file contains mappings of CGD features to Entrez RefSeq Nucleotide sequence records.
* CGDID_2_GeneID.tab.gz
This gzipped file contains mappings of CGD features to Entrez Gene records.
Each of the files has two columns:
1) CGDID : Primary CGD identifier (e.g., CAL0069221) of the CGD gene or feature that is being mapped.
2) External ID : Identifier of the matching entry at the corresponding external resource.
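As an illustration only, the two-column layout described above could be loaded with a short Python sketch like the one below. It assumes the columns are tab-separated (as the .tab suffix suggests), that there is no header row, and that any lines starting with "!" or "#" are comments; none of these details is stated in this README. The function name read_mapping is hypothetical.

```python
import gzip
from collections import defaultdict


def read_mapping(path):
    """Read a gzipped two-column mapping file into {CGDID: [external IDs]}.

    Assumptions (not confirmed by the README): columns are tab-separated,
    there is no header row, and lines starting with "!" or "#" are comments.
    """
    mapping = defaultdict(list)
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and assumed comment lines.
            if not line or line[0] in "!#":
                continue
            fields = line.split("\t")
            if len(fields) < 2:
                continue  # not a complete CGDID / External ID pair
            cgdid, external_id = fields[0], fields[1]
            # A list is kept per CGDID because one feature may map to
            # more than one external record.
            mapping[cgdid].append(external_id)
    return dict(mapping)
```

For example, read_mapping("CGDID_2_GeneID.tab.gz") would return a dictionary keyed by CGDID, provided the file follows the assumed layout.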
--------------------
Procedure used for mapping based on sequence similarity:
BLAST analysis was performed to map sequences from each of the external database resources to CGD
features. We first downloaded all sequences from Uniprot by querying the database with an
organism-specific query for 'Candida albicans'. Similarly, we downloaded all sequences from the Entrez
Nucleotide database with an organism-specific query for 'Candida albicans SC5314'. Then, we performed
BLAST comparisons for each of these sets of sequences against the haploid set of Assembly 19 sequences.
The following strict thresholds were used to ensure good quality matches: i) E-value threshold < 1E-5;
ii) Percent of query sequence in the alignment = 100%; iii) Percent of matching sequence in the
alignment = 100%; iv) Percent identity of best HSP = 100%.
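The four thresholds above can be expressed as a simple filter. The sketch below is not the script used at CGD; it assumes each BLAST hit has already been parsed into a record holding the E-value, the sequence lengths, and the best-HSP statistics. The Hit class and all of its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One query/subject pair summarized by its best HSP (hypothetical layout)."""
    query_id: str
    subject_id: str
    evalue: float
    query_len: int        # full length of the query sequence
    subject_len: int      # full length of the matching (subject) sequence
    query_aligned: int    # query positions covered by the best HSP
    subject_aligned: int  # subject positions covered by the best HSP
    identities: int       # identical positions in the best HSP
    align_len: int        # alignment length of the best HSP


def passes_thresholds(hit, max_evalue=1e-5):
    """Return True if the hit meets all four criteria listed above.

    The 100% conditions are checked with integer counts rather than
    floating-point percentages, so rounding cannot let a near-match through.
    """
    return (
        hit.evalue < max_evalue                       # i) E-value < 1E-5
        and hit.query_aligned == hit.query_len        # ii) 100% of query aligned
        and hit.subject_aligned == hit.subject_len    # iii) 100% of subject aligned
        and hit.identities == hit.align_len           # iv) 100% identity
    )
```

Comparing counts instead of rounded percentages is a design choice of this sketch: a 99.96% match reported as "100.0" by a formatter would still be rejected.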
For the sequences that could not be mapped to CGD genes using the first step as explained above,
we ran another BLAST analysis of those sequences against the diploid set of Assembly 19 sequences. At
this step, we were able to find additional mappings to allelic genes. The same strict thresholds were
used for this run of BLAST analysis as well.
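The two-pass logic (haploid set first, then the diploid set for whatever is left) can be sketched as follows. This is an illustration under assumptions: the inputs are taken to be dictionaries of matches that have already passed the thresholds, keyed by external sequence ID, and the function and parameter names are hypothetical.

```python
def two_pass_mapping(query_ids, haploid_matches, diploid_matches):
    """Map queries against the haploid set first, then the diploid set.

    query_ids: iterable of external sequence IDs.
    haploid_matches / diploid_matches: dicts {query ID: CGD feature ID}
    containing only matches that already satisfied the strict thresholds.
    Returns (mapping, unmapped) where unmapped lists queries with no match.
    """
    mapping = {}
    first_pass_misses = []
    for query_id in query_ids:
        if query_id in haploid_matches:
            mapping[query_id] = haploid_matches[query_id]
        else:
            first_pass_misses.append(query_id)

    # Second pass: only sequences not mapped in the first pass are
    # looked up in the diploid set, which can recover allelic genes.
    unmapped = []
    for query_id in first_pass_misses:
        if query_id in diploid_matches:
            mapping[query_id] = diploid_matches[query_id]
        else:
            unmapped.append(query_id)
    return mapping, unmapped
```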
Using the above procedure, we were able to map 7481 Uniprot sequences and 11517 RefSeq Nucleotide
sequences to CGD genes.
To generate mappings of CGD genes to Entrez Gene records, we used the RefSeq ID to Entrez Gene ID associations
provided by NCBI (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz). We combined our CGD ID to RefSeq ID
mappings with the gene2refseq file downloaded from NCBI to obtain mappings between CGD IDs and Entrez Gene IDs.
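The join described above might look like the following sketch. It assumes that gene2refseq is tab-delimited with a header line starting with "#", that the GeneID is in the second column and the RNA nucleotide accession in the fourth (check the header of the downloaded file), and that accessions should be compared without their version suffix. These are assumptions, not details given in this README, and the function names are hypothetical.

```python
def strip_version(accession):
    """Drop a trailing version suffix, e.g. 'XM_000001.1' -> 'XM_000001'."""
    return accession.split(".", 1)[0]


def refseq_to_gene(gene2refseq_lines, gene_col=1, rna_col=3):
    """Build {RefSeq accession: GeneID} from lines of gene2refseq.

    gene_col and rna_col are zero-based column indexes (assumed layout).
    """
    refseq2gene = {}
    for line in gene2refseq_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(gene_col, rna_col):
            continue  # incomplete row
        accession = fields[rna_col]
        if accession != "-":  # "-" is assumed to mark a missing value
            refseq2gene[strip_version(accession)] = fields[gene_col]
    return refseq2gene


def cgd_to_gene(cgd2refseq, refseq2gene):
    """Join {CGDID: [RefSeq IDs]} with {RefSeq ID: GeneID}.

    Returns a sorted list of unique (CGDID, GeneID) pairs.
    """
    pairs = set()
    for cgdid, refseq_ids in cgd2refseq.items():
        for refseq_id in refseq_ids:
            gene_id = refseq2gene.get(strip_version(refseq_id))
            if gene_id is not None:
                pairs.add((cgdid, gene_id))
    return sorted(pairs)
```

The lines passed to refseq_to_gene can come from any iterable, for example a file opened with gzip.open(path, "rt").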