Coding Potential Calculator
This document describes known issues with the CPC algorithm and our solutions to them. If you want to run CPC online, please use the CPC web server.

TERM Recognizing Potential UTR Regions
EXPLAIN In most mammalian genomes, the 3' UTR of a coding transcript may extend for several kb, and UTR-derived sequences are abundant in many EST libraries. The current version of CPC's SVM classifier cannot accurately distinguish transcripts that fall entirely within UTRs from true noncoding transcripts. To address this limitation, the CPC web server provides the option of a BLAST search against UTRdb.

For example, AK057932 is a partial GenBank mRNA derived entirely from the 3' UTR of the human protein-coding gene pantothenate kinase 1 (PANK1). Because it cannot encode a peptide, CPC classified it as "noncoding". However, its five BLAST hits in UTRdb suggest that this transcript is likely derived from a UTR.
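The way a UTRdb search can qualify a "noncoding" call may be sketched as follows. This is only an illustration: the function name `annotate_with_utr_hits`, the returned label text, and the rule "one or more hits is enough" are assumptions made for the example, not part of CPC itself.

```python
def annotate_with_utr_hits(cpc_label: str, utrdb_hit_count: int) -> str:
    """Qualify a CPC label using the number of BLAST hits found in UTRdb.

    Hypothetical helper: the label text and the "any hit" rule are assumptions.
    """
    if utrdb_hit_count < 0:
        raise ValueError("hit count cannot be negative")
    # A "noncoding" transcript with UTRdb hits may be a UTR fragment of a
    # coding gene rather than a true noncoding transcript.
    if cpc_label == "noncoding" and utrdb_hit_count > 0:
        return "noncoding (possibly UTR-derived)"
    return cpc_label


# AK057932: classified as "noncoding", with five BLAST hits in UTRdb.
print(annotate_with_utr_hits("noncoding", 5))
```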
TERM Performance on Short Peptides
EXPLAIN Recent reports suggest that short peptides of no more than 100 amino acids may play key roles in many biological processes and are abundant in the mammalian proteome (Frith, M.C., Forrest, A.R., Nourbakhsh, E., Pang, K.C., Kai, C., Kawai, J., Carninci, P., Hayashizaki, Y., Bailey, T.L. and Grimmond, S.M. (2006) The abundance of short proteins in the mammalian proteome. PLoS Genet, 2, e52).

To assess CPC's performance in identifying short peptides, we built a test dataset by retrieving eukaryotic proteins with no more than 100 residues from the NCBI Entrez system. To ensure the quality of the dataset, only RefSeq entries with the status label "Validated" or "Reviewed" were kept for subsequent analysis. The corresponding RefSeq mRNAs were then fetched using the "Nucleotide Links" function of the Entrez system. This produced a dataset of 2,849 mRNA sequences that encode peptides of no more than 100 amino acids. CPC correctly predicted 92.00% (2,621 out of 2,849) of these short-peptide transcripts as coding.
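The reported percentage can be reproduced from the two counts given above. The snippet below is a small sketch of that arithmetic; the helper name `percent_correct` is chosen here for illustration and is not part of CPC.

```python
def percent_correct(correct: int, total: int) -> float:
    """Return the percentage of correct predictions, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return round(100.0 * correct / total, 2)


# Counts taken from the short-peptide test described above.
print(f"{percent_correct(2621, 2849):.2f}%")
```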
TERM The Prediction Gray Area
EXPLAIN CPC summarizes the main output in a table. Each row corresponds to one input sequence. The columns show the sequence ID, the coding/noncoding classification, and the SVM score (the "distance" to the SVM classification hyperplane in the feature space). In general, the farther the score is from zero, the more reliable the prediction. As a rule of thumb from our experience, transcripts with a score between -1 and 1 are marked as "weak noncoding" ( [-1, 0] ) or "weak coding" ( (0, 1] ).
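The rule of thumb can be written as a small lookup on the SVM score. This is a minimal sketch rather than CPC's own code: the function name `label_from_score` is hypothetical, it assumes the weak coding interval is (0, 1], and it assumes that scores outside the gray area keep the plain "noncoding" or "coding" label.

```python
def label_from_score(score: float) -> str:
    """Map an SVM score to a label, marking the gray area [-1, 1] as weak."""
    if score < -1:
        return "noncoding"
    if score <= 0:
        return "weak noncoding"  # score in [-1, 0]
    if score <= 1:
        return "weak coding"     # score in (0, 1]
    return "coding"


# Illustrative scores only; these are not real CPC outputs.
for s in (-2.5, -1.0, 0.0, 0.3, 1.0, 4.2):
    print(s, label_from_score(s))
```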