This directory contains mappings between CGD features and sequences from external resources, such as the UniProt/Swiss-Prot, RefSeq and Entrez Gene databases. Each of these mappings was generated at CGD using sequence similarity, as explained in detail below. As of June 2011, only C. albicans genes are included. We will add mappings for genes from other species in CGD in the future.

* gp2protein.cgd.gz
This gzipped file contains mappings of CGD features to UniProt/Swiss-Prot protein sequence records. The gp2protein.cgd.gz file is also available for download from the Gene Ontology Consortium web site at:
http://www.geneontology.org/gp2protein/gp2protein.cgd.gz

* CGDID_2_RefSeqID.tab.gz
This gzipped file contains mappings of CGD features to Entrez RefSeq Nucleotide sequence records.

* CGDID_2_GeneID.tab.gz
This gzipped file contains mappings of CGD features to Entrez Gene records.

Each of the files has two columns:

1) CGDID : Primary CGD identifier (e.g. CAL0069221) of the CGD gene or feature that is being mapped.
2) External ID : Identifier of the matching entry at the corresponding external resource.

--------------------

Procedure used for Entrez RefSeq mapping, based on sequence similarity:

We downloaded all sequences from the Entrez Nucleotide database with an organism-specific query for 'Candida albicans SC5314'. Then, we performed BLAST comparisons for each of these sequences against the haploid set of Assembly 19 sequences. The following strict thresholds were used to ensure good quality matches:

i) E-value threshold < 1E-5;
ii) Percent of query sequence in the alignment = 100%;
iii) Percent of matching sequence in the alignment = 100%;
iv) Percent identity of best HSP = 100%.

For the sequences that could not be mapped to CGD genes in this first step, we ran another BLAST analysis against the diploid set of Assembly 19 sequences. At this step, we were able to find additional mappings to allelic genes.
The same strict thresholds were used for this second BLAST run. Using the above procedure, we were able to map 11517 RefSeq Nucleotide sequences to CGD genes.

To generate mappings of CGD genes to Entrez Gene records, we used the RefSeq ID to Entrez Gene ID associations provided by NCBI (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz). We combined the CGD ID to RefSeq ID mappings with the gene2refseq file downloaded from NCBI to obtain mappings between CGD IDs and Entrez Gene IDs.

-----------------

Procedure used for UniProt mapping for generation of the gp2protein.cgd file:

The mappings were generated by using the latest set of C. albicans, C. glabrata, C. parapsilosis and C. dubliniensis protein sequences as queries in BLAST searches against the complete corresponding proteome datasets downloaded from UniProt. A BLAST match was retained only if it was an exact match over the full length of the query sequence.
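
-----------------

Illustration of the CGD ID to Entrez Gene ID step:

The Entrez Gene mapping described above is a join of two tables on the RefSeq ID. The Python sketch below shows one way such a join could be done; it is not the script used at CGD. It assumes that the CGD files are tab-delimited, that gene2refseq holds the Gene ID in its second column and the RNA accession.version in its fourth column, and that version suffixes (".1") can be ignored when matching. The function names are hypothetical, and the RefSeq accession and Gene ID in the demo are invented placeholders, not real records.

```python
from collections import defaultdict


def read_two_column(lines):
    """Parse (CGDID, external ID) pairs from tab-delimited lines (assumed format)."""
    pairs = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment or header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


def refseq_to_gene(lines, gene_col=1, acc_col=3):
    """Build a lookup from RefSeq accession (version removed) to a set of Gene IDs.

    gene_col and acc_col are assumptions about the gene2refseq layout;
    check them against the header line of the downloaded file.
    """
    lookup = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(gene_col, acc_col):
            continue
        accession = fields[acc_col]
        if accession == "-":
            continue  # "-" marks an empty field
        lookup[accession.split(".")[0]].add(fields[gene_col])
    return lookup


def cgd_to_gene(cgd_refseq_pairs, lookup):
    """Join CGDID -> RefSeq ID pairs with the RefSeq -> Gene ID lookup."""
    result = set()
    for cgdid, refseq in cgd_refseq_pairs:
        for gene_id in lookup.get(refseq.split(".")[0], ()):
            result.add((cgdid, gene_id))
    return sorted(result)


# Demo with invented placeholder rows (not real RefSeq or Entrez Gene records).
cgd_lines = ["CAL0069221\tXM_000001.1\n"]
gene2refseq_lines = [
    "#tax_id\tGeneID\tstatus\tRNA_nucleotide_accession.version\n",
    "0\t1111111\tNA\tXM_000001.1\n",
]
for cgdid, gene_id in cgd_to_gene(read_two_column(cgd_lines),
                                  refseq_to_gene(gene2refseq_lines)):
    print(cgdid, gene_id, sep="\t")
```

For the real files, the lines could be read with gzip.open(path, "rt") from the Python standard library. Because one RefSeq ID may be associated with more than one Gene ID, the lookup stores a set of Gene IDs per accession instead of a single value.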