Subsections


An example of the usage of HCGene for the functional classification of genes with the Gene Ontology

HCGene is mainly designed to support the functional classification of genes. In this example we show how to use HCGene to perform the several tasks related to the classification of yeast genes annotated with TAS (Traceable Author Statement) evidence in the BP (Biological Processes) gene ontology. The source code of the script used for these examples can be downloaded from: http://homes.dsi.unimi.it/valenti/SW/hcgene/examples/GO.class.example.R.

Construction of the GO graph and of the multilabel table

At first we obtain the most specific BP (Biological Processes) GO annotations for the yeast (that is, the "deepest" annotations in the DAG of the BP ontology); only the Traceable Author Statement (TAS) annotations are considered, all the remaining ones are discarded:

> Yeast.specific.TAS <- Get.GO.specific.classes(ontology = "BP", gene2GO = YEASTGO, evidence = "TAS");
To obtain the same annotations for another species (e.g., the mouse), one can simply use the appropriate environment that maps genes to TAS annotated BP GO classes:
> data(MOUSEGO);
> Mouse.specific.TAS <- Get.GO.specific.classes(ontology = "BP", gene2GO = MOUSEGO, evidence = "TAS");

Coming back to the yeast, there are 1201 genes annotated with TAS evidence:

> length(Yeast.specific.TAS)
[1] 1201
For instance, the first five yeast genes are annotated with the following (most specific) BP terms:
> Yeast.specific.TAS[1:5]
$YEL061C
[1] "GO:0000070" "GO:0030472"

$YBR289W
[1] "GO:0006338"

$YGL215W
[1] "GO:0007049"

$YNL289W
[1] "GO:0007049"

$YDR343C
[1] "GO:0008645"
Note that "YEL061C", "YBR289W", "YGL215W", "YNL289W", "YDR343C" are the identifiers of the yeast ORF.

We then obtain all available annotations, exploiting transitivity between GO classes:

> Yeast.general.TAS <- Get.GO.all.classes(Yeast.specific.TAS);
> Yeast.general.TAS[1:5];
$YEL061C
 [1] "GO:0000087" "GO:0000278" "GO:0000279" "GO:0000819" "GO:0006996"
 [6] "GO:0007001" "GO:0007049" "GO:0007059" "GO:0007067" "GO:0008150"
[11] "GO:0009987" "GO:0016043" "GO:0022402" "GO:0022403" "GO:0051276"
[16] "GO:0000070" "GO:0000226" "GO:0007010" "GO:0007017" "GO:0007051"
[21] "GO:0007052" "GO:0030472"
$YBR289W
 [1] "GO:0006139" "GO:0006259" "GO:0006323" "GO:0006325" "GO:0006996"
 [6] "GO:0007001" "GO:0008150" "GO:0008152" "GO:0009987" "GO:0016043"
[11] "GO:0016568" "GO:0043170" "GO:0043283" "GO:0044237" "GO:0044238"
[16] "GO:0051276" "GO:0006338"
$YGL215W
[1] "GO:0008150" "GO:0009987" "GO:0007049"
$YNL289W
[1] "GO:0008150" "GO:0009987" "GO:0007049"
$YDR343C
[1] "GO:0006810" "GO:0008150" "GO:0008643" "GO:0009987" "GO:0015749"
[6] "GO:0051179" "GO:0051234" "GO:0008645"
For instance, the yeast gene YEL061C (Kinesin motor protein involved in mitotic spindle assembly and chromosome segregation) is annotated with 22 BP classes, while YBR289W (Subunit of the SWI/SNF chromatin remodeling complex involved in transcriptional regulation) is annotated with 17 BP terms.

Now we have a list (Yeast.general.TAS) with all the annotated yest genes: each element of the list is a gene with a vector of the corresponding GO identifiers. We can construct a multilabel table T with the multilabel associated to each gene; rows are genes and columns are GO terms: an entry T[i,j] of the table is 1 if the i-th gene is annotated with the j-th class, 0 otherwise:

> Table.classes.TAS <- Build.GO.class.labels(Yeast.general.TAS);
> dim(Table.classes.TAS)
[1] 1201 1074
We thus have 1201 yeast genes; each one can be annotated with TAS evidence with one or more BP terms of the GO; a total of 1074 GO BP classes has yeast genes annotated with TAS evidence.

The resulting directed acyclic graph (DAG) of the GO classes can be constructed using the following code:

> BP.univ.graph <- Build.universal.graph.ontology.down(ontology = "BP");
> BP.Yeast.classes.TAS <- Get.classes(Yeast.general.TAS);
> gYeast.BP.TAS <- subGraph(BP.Yeast.classes.TAS, BP.univ.graph);
The first line of the code constructs the "universal graph" for the BP ontology collecting all the available terms of the BP ontology. This a very big and complex DAG:
> BP.univ.graph
A graphNEL graph with directed edges
Number of Nodes = 13155 
Number of Edges = 23701
The graph of the yeast classes (TAS evidence) is obtained through the last two lines of code. This is significantly smaller, but still quite complex:
> gYeast.BP.TAS
A graphNEL graph with directed edges
Number of Nodes = 1074 
Number of Edges = 1804
The yeast DAG is plotted in Fig. 1

Figure 1: DAG of the yeast BP ontology, with terms having annotated genes with TAS evidence. The DAG includes $1074$ GO classes (nodes), connected by $1804$ edges.
\begin{figure}\centering
\includegraphics [width = 16.0cm] {ps/Yeast.GO.graph.TAS.ps}\end{figure}

We can use the overall graph of the BP ontology to classify the genes/gene products of the yeast, or we may be interested in a classification at "low detail of resolution". For instance, we could only consider classes at distance of at most 3 from the BP root node, excluding all the other more specific classes. To this end, we can use HCgene to extract only the classes of interest:

> classes.depth.2 <- Select.GO.classes.by.distance(gYeast.BP.TAS, distance = 2, ontology = "BP");
> gYeast.depth.2 <- subGraph(classes.depth.2, gYeast.BP.TAS);
> Pretty.plot.graph(gYeast.depth.2,fontsize=12,fillcolor="transparent",height=2,width=3,color="transparent", fontcolor="transparent");
The corresponding graph is shown in Fig. 2. Note that we considered the minimum distance from the root node of the DAG.
Figure 2: DAG of the yeast BP ontology with nodes at distance at most 2 from the root.
\begin{figure}\centering
\includegraphics [width = 16.0cm] {ps/Yeast.GO.graph.TAS.depth.2.ps}\end{figure}

If we choose to study the classification of genes related to a given GO term, considering all the classes rooted in the graph at that term (or at a set of terms), we can apply the function Select.GO.rooted.classes. For instance, if we need to study the classes rooted to the GO term GO:0006041:

> get("GO:0006041",GOTERM)
GOID = GO:0006041

Term = glucosamine metabolic process

Synonym = chitosamine metabolic process
Synonym = chitosamine metabolism
Synonym = glucosamine metabolism

Definition = The chemical reactions and pathways involving glucosamine
     (2-amino-2-deoxyglucopyranose), an aminodeoxysugar that occurs in
     combined form in chitin.

Ontology = BP

we can use the following code:

selected.nodes <- Select.GO.rooted.classes (gYeast.BP.TAS, root.nodes="GO:0006041", ontology="BP");
g.rooted <- subGraph(selected.nodes, gYeast.BP.TAS);
The corresponding graph is depicted in Fig. 3:
Pretty.plot.graph(g.rooted,fontsize=12,fillcolor="yellow",height=0.9,width=1.4,color="transparent", fontcolor="black");
Figure 3: DAG of the yeast BP ontology rooted at the node GO:0006041 (glucosamine metabolic process).
\begin{figure}\centering
\includegraphics [width = 12.0cm] {ps/Yeast.GO.graph.TAS.glucosamine.ps}\end{figure}


Preparing data for the functional classification

Once we choose a specific classification task, we need to prepare data for the classification. For example, we can use phylogenetic data to classify genes annotated with TAS evidence in the yeast BP ontology, considering only classes whose distance from the BP root is at most 2 (Fig. 2).

To this end, we need to prepare phylogenetic data using HCGene facilities (but users can prepare their own data).

library(scehomology);
# get the ORF ID from Yeast.general.TAS
yeast.names.TAS<-names(Yeast.general.TAS)
data(ORF2HGID);
hgid.yeast.TAS <- mget(yeast.names.TAS, ORF2HGID, ifnotfound=list(NA));
hgid.yeast.TAS <- hgid.yeast.TAS[!is.na(hgid.yeast.TAS)];
# get the list of lists of homology data
l <- mget(sapply(hgid.yeast.TAS,function(x) x),scehomologyDATA);
names(l) <- names(hgid.yeast.TAS);

# building matrix of binary homology data
homo.binary.data <- Do.binary.data.homology(l);
# building matrix of floating homology data
homo.float.data <- Do.float.data.homology(l);
# writing binary homology data to a file 
write.table(homo.binary.data, file="Yeast.homo.binary.data.TAS.txt");
To construct the phylogenetic data we use the environment (hash table) ORF2HGID that maps yeast ORFs to HomoloGene IDs and then, using the HomoloGene IDs corresponding to the ORF IDs annotated with TAS evidence in the BP ontology, we extract from scehomologyDATA (package scehomology) the homology data related to the selected HomoloGene IDs. Hence, using the HCGene functions Do.binary.data.homology and Do.float.data.homology, we construct a data frame with the homology data of the yeast genes with respect to other 18 species (see the reference manual for more details about the functions Do.binary.data.homology and Do.float.data.homology). The first rows of the data frame for the binary homology data appears as follows:
> homo.binary.data
        kla ego ncr mgr spo sce ath hsa ptr cfa gga dme aga cel mmu rno osa pfa
YEL061C   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
YBR289W   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
YGL215W   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
YNL289W   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
YDR343C   0   1   1   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0
YEL060C   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0
YDL106C   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
YPL237W   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   0
YPL028W   1   1   1   0   1   0   1   1   1   1   1   1   1   1   1   1   1   0
YHR084W   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
......
Each row corresponds to a yeast gene (with its ORF ID) and the rows correspond to homologies with other species Columns are labeled with three-letter short names of the organisms (e.g. ath corresponds to Arabidopsis thaliana, hsa corresponds to Homo sapiens and so on (see the scehomology package for more details).


Association of the data to the multilabels

After preparing the data, we need to pair them with the multilabels associated to the yeast genes. In principle we could learn a direct mapping from the genes to their multilabels. This would be achieved by combining the previously computed table of multilabels Table.classes.TAS with the corresponding data homo.binary.data, and using the DAG for taking class relationships into account.

In this example we consider a different approach in which we decompose the multiclass classification problem into a set of binary classification tasks, one for each node of the DAG. Positive examples for a given node are those genes which are annotated with the class associated to the node. Negative examples are selected among those genes whose annotations do not include the node under consideration. Each binary classification task in this set can be learned independently, or it can be learned by exploiting the relationships between the class nodes described by the DAG to improve the generalization performance [1].

Negative examples can be selected in different ways. The HCGene package supports three general strategies:

  1. A negative example for a node is any example whose multilabel does not include that node (Get.matrix.data.for.classid).
  2. A negative example for a node is any example whose multilabel does not include that node and any of its ancestors (Get.matrix.data.without.ancestors).
  3. A negative example for a node is any example whose multilabel does not include that node and includes at least one of its parents (Get.matrix.data.from.parent.only).
The following code prepares the data for the binary classification problem using the first strategy for selecting negative examples:
data.0006807.1 <- Get.matrix.data.for.classid(Table.classes.TAS, "GO:0006807", data.matrix=homo.binary.data, ontology = "BP");
This code uses the second strategy:
data.0006807.2 <- Get.matrix.data.without.ancestors(Table.classes.TAS, "GO:0006807", data.matrix=homo.binary.data, ontology = "BP");
Finally, this code uses the third strategy:
data.0006807.3 <- Get.matrix.data.from.parent.only(Table.classes.TAS, "GO:0006807", data.matrix=homo.binary.data, ontology = "BP");

Note that while the positive examples remain the same independently of the chosen strategy, the negative examples are substantially different. Indeed, considering the number of negative examples (element n.neg of the list returned by the three functions) we have:

> data.0006807.1$n.neg
[1] 1024
> data.0006807.2$n.neg
[1] 306
> data.0006807.3$n.neg
[1] 718
while the number of positive examples remains, of course, the same:
> data.0006807.1$n.pos
[1] 96
> data.0006807.2$n.pos
[1] 96
> data.0006807.3$n.pos
[1] 96

The data are stored in the X component of the lists data.0006807.1, data.0006807.2, data.0006807.3, while the labels for the positive examples (1) and the negative examples (2) are stored in the component labels.

By changing the arguments of the functions one can easily control the mix of positive and negative examples. For instance, by selecting the first strategy in the previous case we get ten times more negative examples than positive examples. To obtain a balanced data set is enough to set the argument ratio.negative to 1: in this case the number of positive and negative examples are equal. The negative examples are randomly sampled without replacement (see the reference manual for more details):

> data.0006807.1.balanced <- Get.matrix.data.for.classid(Table.classes.TAS, "GO:0006807", data.matrix=homo.binary.data, ontology = "BP", ratio.negative = 1);
> data.0006807.1.balanced$n.neg
[1] 96

Note that the above function can be used for different types of data (e.g. gene expression data, protein-protein interaction data, protein structural data). For instance, to obtain gene expression data related to the genes annotated with TAS evidence for the previously considered phylogenetic data, we can use the following lines of code based on the data from the Spellman et al. (1998) [12] yeast cell cycle microarray experiment:

> library(yeastCC);
> data(spYCCES);
> exprs.data <- exprs(spYCCES);
> data.0006807.exprs.1 <- Get.matrix.data.for.classid(Table.classes.TAS, "GO:0006807", data.matrix=exprs.data, ontology = "BP");
> dim(data.0006807.exprs.1$X)
[1] 1185   77

These functions can be also used to facilitate the integration between different types of data. For instance, in the above case we would like to select only the genes for which there are data in both the phylogenetic and gene expression datasets (see the reference manual for more details):

> common.genes <- Get.all.common.genes(Table.classes.TAS, homo.binary.data, exprs.data);
> length(common.genes)
[1] 1110
> data.phylo.0006807.1.common <- Get.matrix.data.for.classid(Table.classes.TAS, "GO:0006807", data.matrix=homo.binary.data, ontology = "BP", common.genes = common.genes);
> data.exprs.0006807.1.common <- Get.matrix.data.for.classid(Table.classes.TAS, "GO:0006807", data.matrix=exprs.data, ontology = "BP", common.genes = common.genes);
> dim(data.phylo.0006807.1.common$X)
[1] 1110   18
> dim(data.exprs.0006807.1.common$X)
[1] 1110   77
> sum(row.names(data.phylo.0006807.1.common$X)!=row.names(data.exprs.0006807.1.common$X))
[1] 0
The function Get.all.common.genes automatically finds the genes common to all the listed datasets. Subsequently, the resulting genes (common.genes) are passed as argument to the function Get.matrix.data.for.classid (or to the other related functions Get.matrix.data.without.ancestors and Get.matrix.data.from.parent.only) to obtain datasets having the same common genes. This could be easily extended to an arbitrary number of datasets.

These functions can be used to prepare data for any node in the GO DAG. Note that the same functions can be also used to build data within the FunCat taxonomy (see the reference manual for more details).