To predict gene function in yeast, we used several bio-molecular data sets.
For each data set we selected only the genes annotated to FunCat (funcat-2.1 scheme and funcat-2.1_data_20070316, available from
the MIPS web site, http://mips.gsf.de/projects/funcat), using the HCgene R package [14].
We also removed the genes annotated only with the "99" FunCat class ("UNCLASSIFIED PROTEINS") and selected only the classes with at least
20 positive examples, so that each class has enough positive examples for training.
We also removed uninformative features from the data sets (e.g. features with the same value for all the available examples).
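The filtering steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the HCgene-based procedure actually used. The function names, the dictionary-based annotation format and the list-of-rows feature matrix are assumptions made for the example. It also assumes that each gene's annotation set already includes the ancestors of its classes.

```python
def filter_classes(annotations, min_positives=20, unclassified="99"):
    """Return (kept genes, selected classes).

    annotations: dict mapping gene id -> set of FunCat class codes
    (hypothetical input format, assumed to include ancestor classes).
    """
    # Drop unannotated genes and genes annotated only with the "99" class.
    kept = {g: cs for g, cs in annotations.items()
            if cs and cs != {unclassified}}
    # Count positive examples per class over the remaining genes.
    counts = {}
    for cs in kept.values():
        for c in cs:
            counts[c] = counts.get(c, 0) + 1
    # Keep only classes with enough positive examples.
    classes = {c for c, n in counts.items() if n >= min_positives}
    return kept, classes


def drop_constant_features(rows):
    """Remove columns that take the same value in every row.

    rows: list of equal-length feature vectors (hypothetical format).
    Returns (filtered rows, indices of the columns that were kept).
    """
    if not rows:
        return [], []
    keep = [j for j in range(len(rows[0]))
            if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows], keep
```

For instance, `filter_classes(annotations, min_positives=20)` applies the threshold of 20 positive examples stated above, and `drop_constant_features` discards any feature whose value never varies across the examples.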
At the end of these pre-processing steps we obtained the data sets whose characteristics are summarized in Tab. 1.
Data set | Description | No. of examples | No. of features | No. of classes |
Pfam-1 | protein domain binary data from Pfam [3] | 3529 | 4950 | 211 |
Pfam-2 | protein domain log E data from Pfam [5] | 3529 | 5724 | 211 |
Phylo | phylogenetic data [10] | 2445 | 24 | 187 |
Expr | gene expression data [11,6] | 4532 | 250 | 230 |
PPI-BG | PPI data from BioGRID [12] | 4531 | 5367 | 232 |
PPI-VM | PPI data from von Mering experiments [16] | 2338 | 2559 | 177 |
SP-sim | Sequence pairwise similarity data [9] | 3527 | 6349 | 211 |
Pfam-1 data were originally analyzed by Deng et al. [3]: for each gene product, the presence or absence of 4950 protein domains obtained from the Pfam (Protein families) database [5] is stored as a binary vector. We also used an enriched representation of Pfam domains (Pfam-2), obtained by replacing the binary scoring with log E-values computed with the HMMER software toolkit [4].
Phylogenetic data (Phylo) were obtained through BLAST searches: each feature corresponds to the negative logarithm of the lowest E-value reported by BLAST version 2.0 in a search against a complete genome, with negative values (corresponding to E-values greater than 1) truncated to 0 [10].
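As a minimal sketch of how one such feature could be computed from the lowest E-value of a search, consider the function below. The function name is hypothetical, and the base of the logarithm is an assumption (base 10 is used here), since the text does not state it. How E-values reported as exactly 0 were handled is also not stated, so the sketch simply rejects them.

```python
import math


def phylo_feature(lowest_e_value):
    """Negative log E-value, truncated at 0 (base 10 is an assumption)."""
    if lowest_e_value <= 0:
        # The logarithm is undefined here; the treatment of such
        # values in the original data is not specified.
        raise ValueError("E-value must be positive")
    # E-values greater than 1 give a negative score, which is set to 0.
    return max(0.0, -math.log10(lowest_e_value))
```

With this definition, an E-value of 1e-5 yields a feature value of about 5, while any E-value of 1 or more yields 0.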
We merged the experiments of Spellman et al. (gene expression measures relative to 77 conditions) [11] with the transcriptional responses of yeast to environmental stress (173 conditions) by Gasch et al. [6] to obtain the "Expr" data set (250 features in total).
Protein-protein interaction data (PPI-BG) were downloaded from the BioGRID database, which collects PPI data from both high-throughput studies and conventional focused studies [12]. BioGRID houses high-throughput two-hybrid [13] and mass spectrometric protein interaction data [8], synthetic lethal genetic interactions obtained through synthetic genetic array and molecular barcode methods [2], and a vast collection of well-validated physical and genetic interactions from the literature. Data are binary: they represent the presence or absence of protein-protein interactions.
We also used a second data set of protein-protein interactions (PPI-VM) that collects interaction data from yeast two-hybrid assays, mass spectrometry of purified complexes, correlated mRNA expression and genetic interactions [16]. These data are binary too.
Finally, we considered pairwise similarities between yeast genes (SP-sim), using data collected by William Noble and colleagues [9]. They computed the Smith and Waterman log E-values between all pairs of yeast sequences, obtaining a symmetric matrix that expresses the pairwise similarities between yeast genes.
Different strategies can be chosen to select negative examples for each functional class [1,14]. In this work the negative examples for each class are the genes that are not annotated to the class but are annotated to its parent class (i.e. they are positive for the parent class). In this way only negative examples that are not too dissimilar to the positive ones are selected.
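The selection rule just described can be illustrated with the following sketch. The function names and the dictionary-based annotation format are assumptions made for the example. The sketch also assumes that annotations include all ancestor classes, and that for top-level classes every gene counts as positive for a dummy root (here called "ROOT", a hypothetical label).

```python
def parent_of(code, root="ROOT"):
    """Parent of a FunCat code, e.g. "01.01.03" -> "01.01", "01" -> root."""
    head, sep, _ = code.rpartition(".")
    return head if sep else root


def select_examples(annotations, cls, root="ROOT"):
    """Return (positives, negatives) for class `cls`.

    annotations: dict mapping gene id -> set of FunCat class codes,
    assumed to include ancestor classes (hypothetical input format).
    """
    parent = parent_of(cls, root)
    positives = {g for g, cs in annotations.items() if cls in cs}
    if parent == root:
        # Assumption: every gene is positive for the dummy root.
        candidates = set(annotations)
    else:
        candidates = {g for g, cs in annotations.items() if parent in cs}
    # Negatives are positive for the parent but not for the class itself.
    return positives, candidates - positives
```

For example, for class "01.01" a gene annotated to "01" and "01.02" is a negative example, whereas a gene annotated only to "02" is not used at all for that class.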
The structure of the FunCat tree relative to the gene expression data set is shown in Fig. 1. Note that a dummy node has been added as root to obtain a single tree from a forest of trees.
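A simple way to picture the dummy root is to build the tree directly from the dot-separated FunCat codes, as in the sketch below. The function name and the "ROOT" label are hypothetical. The sketch assumes that the parent of every selected class is itself among the selected classes.

```python
def build_tree(class_codes, root="ROOT"):
    """Map each node to the list of its children.

    Top-level classes (codes without a dot) are attached to a dummy
    root, turning the forest of FunCat trees into a single tree.
    Assumes every parent code is itself present in class_codes.
    """
    children = {root: []}
    for code in sorted(class_codes):
        head, sep, _ = code.rpartition(".")
        parent = head if sep else root
        children.setdefault(parent, []).append(code)
        children.setdefault(code, [])
    return children
```

For the codes "01", "01.01" and "02", the dummy root has children "01" and "02", and "01.01" is the only child of "01".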