Traditional empirical techniques for assigning functions to proteins are time-consuming and expensive. Machine learning (ML) methods have recently been applied to build systems that automatically classify proteins. In [3,7], gene products are classified using data generated by high-throughput bio-technologies; in [13,8], heterogeneous bio-molecular data sources are also used.
Functional classes of genes are hierarchically structured: the Gene Ontology Consortium [6] organizes functional classes (terms) in a Directed Acyclic Graph (DAG), while the FunCat taxonomy [10] organizes gene classes in a tree.
In both cases, gene products are assigned to one or more classes according to the structure of the underlying DAG or tree. Thus, gene function prediction can be naturally viewed as a hierarchical classification problem with structured labels involving multiple and partial paths.
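To illustrate what "multiple and partial paths" means for structured labels, the following minimal sketch closes a set of annotated terms under the ancestor relation of a toy DAG. The term names, the `PARENTS` mapping, and the function `ancestors_closure` are hypothetical and introduced only for illustration; they are not taken from GO, FunCat, or HCGene.

```python
from collections import deque

# Toy hierarchy (hypothetical terms): each term maps to its set of parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},  # two parents: possible in a DAG, not in a tree
    "D": {"C"},
}

def ancestors_closure(terms, parents):
    """Return the given terms together with all of their ancestors."""
    seen = set(terms)
    queue = deque(terms)
    while queue:
        term = queue.popleft()
        for parent in parents[term]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# A gene annotated with C is implicitly labeled with every ancestor of C.
labels = ancestors_closure({"C"}, PARENTS)
print(sorted(labels))
```

In this toy case the label set contains both paths from the root to C (through A and through B), which is the "multiple paths" situation, and it stops at C without reaching the leaf D, which is the "partial path" situation.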
The hierarchical classification of functional classes of genes requires an efficient and automatic organization and processing of the GO and FunCat hierarchies, of the gene products associated with the GO, and of the heterogeneous bio-molecular data (e.g., gene expression or protein interaction data) associated with the gene products. Indeed, although GO has thousands of functional classes, in many practical cases we are only interested in a subset of them, e.g., species-specific ontologies or classes related to specific biological processes. The task of associating instances (genes and gene products) with multilabels (sets of GO terms) is carried out by the members of the Gene Ontology Annotation (GOA) consortium [2]. GOA annotations are complemented by references and an indication of the supporting evidence. FunCat annotations of genomes are performed manually at MIPS or Biomax, and cover several species, from prokaryotic organisms to fungi, plants and animals [9,5,11].
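Restricting the analysis to a subset of functional classes can be pictured as selecting a sub-hierarchy and filtering the annotations accordingly. The sketch below does this on toy data, under the assumption that the subset of interest is the set of all descendants of a chosen term; the term names, gene names, and the functions `subgraph_terms` and `restrict_annotations` are hypothetical and do not describe the HCGene interface.

```python
from collections import defaultdict

# Toy hierarchy (hypothetical terms): each term maps to its set of parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},
    "D": {"C"},
    "E": {"B"},
}

# Toy annotations (hypothetical genes): gene -> set of associated terms.
ANNOTATIONS = {
    "gene1": {"root", "A", "C"},
    "gene2": {"root", "B"},
    "gene3": {"root", "A", "B", "C", "D"},
}

def subgraph_terms(root, parents):
    """Return the chosen term and all of its descendants."""
    children = defaultdict(set)
    for term, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(term)
    selected, stack = {root}, [root]
    while stack:
        term = stack.pop()
        for child in children[term]:
            if child not in selected:
                selected.add(child)
                stack.append(child)
    return selected

def restrict_annotations(annotations, selected):
    """Keep only the selected terms; drop genes left without any term."""
    restricted = {}
    for gene, terms in annotations.items():
        kept = terms & selected
        if kept:
            restricted[gene] = kept
    return restricted

selected = subgraph_terms("A", PARENTS)
print(sorted(selected))
print(restrict_annotations(ANNOTATIONS, selected))
```

Here the sub-hierarchy rooted at A keeps only the terms A, C, and D, and gene2, which has no annotation inside it, is dropped from the restricted data set.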
In this respect, the main goal of HCGene is to provide a software tool that processes GO DAGs, FunCat trees, and related data, in order to enable the efficient application of ML methods to different gene function prediction tasks for various species, using specific subsets of functional classes and different types of large-scale genomic, transcriptomic, and proteomic data.