Van Dinh Tran

Curriculum
Computer Science and Innovation for Societal Challenges, XXX series
Grant sponsor
BioInfogen project
Supervisor 
Alessandro Sperduti
Co-supervisor 
Franca Stablum

Project: Kernel methods for large-scale graph-based heterogeneous biological data integration.
Full text of the dissertation book can be downloaded from: http://paduaresearch.cab.unipd.it/11217/

Abstract
The last decade has experienced a rapid growth in volume and diversity of biological data, thanks to the development of high-throughput technologies related to web services and embeded systems. It is common that infor- mation related to a given biological phenomenon is encoded in multiple data sources. On the one hand, this provides a great opportunity for biologists and data scientists to have more unified views about phenomenon of interest. On the other hand, this presents challenges for scientists to find optimal ways in order to wisely extract knowledge from such huge amount of data which normally cannot be done without the help of automated learning systems. Therefore, there is a high need of developing smart learning systems, whose input as set of multiple sources, to support experts to form and assess hypotheses in biology and medicine. In these systems, the problem of combining multiple data sources or data integration needs to be effciently solved to achieve high performances.
Biological data can naturally be represented as graphs. By taking graphs for data representation, we can take advantages from the access to a solid and principled mathematical framework for graphs, and the problem of data integration becomes graph-based integration. In recent years, the machine learning community has witnessed the tremendous growth in the development of kernel-based learning algorithms. Kernel methods whose kernel functions allow to separate between the representation of the data and the general learning algorithm. Interestingly, kernel representation can be applied to any type of data, including trees, graphs, vectors, etc. For this reason, kernel methods are a reasonable and logical choice for graph-based inference systems. However, there is a number of challenges for graph-based systems using kernel methods need to be effectively solved, including definition of node similarity measure, graph sparsity, scalability, efficiency, complementary property exploitation, integration methods. The contributions of the thesis aim at investigating to propose solutions that overcome the challenges faced when constructing graph-based data integration learning systems.
The first contribution is the definition of a decompositional graph node kernel, named Conjunctive Disjunctive Node Kernel (CDNK), which intends to measure the similarities between nodes of graphs. Difierently of existing graph node kernels that only exploit the topologies of graphs, the proposed kernel also utilizes the available information on the graph nodes. In CDNK, first, the graph is transformed into a set of linked connected components in which we distinguish between \conjunctive" links whose endpoints are in the same connected components and \disjunctive" links that connect nodes located in different connected components. Then the similarity between any couple of nodes is measured by employing a particular graph kernel on two neighborhood subgraphs rooted as each node. Next, it integrates the side information by applying convolution of the discrete information with the real valued vectors associated to graph nodes. Empirical evaluation shows that the kernel presents better performance compared to state-of-the-art graph node kernels.
The second contribution aims at dealing with the graph sparsity problem. When working with sparse graphs, i.e graphs with a high number of missing links, the available information is not efficient to learn effectively. An idea to overcome this problem is to use link enrichment to enrich information for graphs. However, the performance of a link enrichment strongly depends on the adopted link prediction method. Therefore, we propose an effective link prediction method (JNSL). In this method, first, each link is represented as a joint neighborhood subgraphs. Then link prediction is considered as a binary classiffication. We empirically show that the proposed link prediction outperforms various other methods. Besides, we also present a method to boost the performance of diffusion-based kernels, which are most popularly used, by coupling kernel methods with link enrichment. Experimental results prove that the performances of diffusion-based graph node kernels are considerably improved by using link enrichment.
The last contribution proposes a general kernel-based framework for graph integration that we name Graph-one. Graph-one is designed to over- come the challenges when handling with graph integration. In particular, it is a scalable and efficient framework. Besides, it is able to deal with unbanlanced settings where the number of positive and negative instances are much different. Numerous variations of Graph-one are evaluated in dis- ease gene prioritization context. The results from experiments illustrate the power of the proposed framework. Precisely, Graph-one shows better per- formance than various methods. Moreover, Graph-one with data integration gets higher results than it with any single data source. It presents the ef- fectiveness of Graph-one in exploiting the complementary property of graph integration.