Ananth Kalyanaraman @ WSU

Research @ HPCBio lab

Disclaimer: Unfortunately I have not been able to keep this webpage updated for a while now. To get the best idea about the research that goes on in our lab, please visit the publications page. The following page shows a few examples (some active and some from the past lines of inquiry). Please also check out AgAID.org for work related to AI and its applications to agriculture.

Our research interests lie at the intersection of three broad areas: high performance computing, bioinformatics and computational biology, and combinatorial algorithms. Specifically, we are drawn to problems that are motivated through their applications to data-driven sciences (particularly, from modern day life sciences); that have a combinatorial flavor (e.g., graphs, strings, searching); and that have a need for tackling scale and complexity.

Active Research Topics:

Scalable Graph Analytics: Algorithms and Architectures

o Parallel Graph Community Detection

o Bipartite Graph Community Detection

o Homology Graph Construction

o Graph Coloring

o Parallel Architectures for Graph Analytics and Biocomputing

Bioinformatics

o Genome Assembly

o Topological Data Analytics

Funding Sources: We gratefully acknowledge all our research sponsors that include NSF, DOE, USDA, and CDC.

Scalable Graph Analytics for Big Data Applications and HPC Architectures

A key characteristic of big data that is integral to the discovery pipelines that use the data, is the inherent inter-connectivity of entities captured by the data – e.g., a set of interacting molecules that form biological networks, or a set of neurons signaling each other to dictate the functioning of a brain, or people communicating via social media to form friendship networks. Consequently, graph and network representations have taken a centerstage in modeling the behavior of systems at scale. However, graph algorithms have been known to be notorious for parallelization as they generate irregular memory and data access patterns, creating a body of unique design challenges.

communities_example One of our primary research interests is in designing and developing novel scalable algorithms and software to support large-scale graph analytics for real world applications.

Parallel Graph Community Detection

Graph clustering (or community detection) is a fundamental operation in graph theory, used as a structure discovery tool for analyzing large graphs. The goal is to identify tightly-knit groups of vertices in a given input graph. Community detection finds use in a broad range of application areas including biological networks, citation networks, social networks, among others. Since 2015, we have been developing the Grappolo-Vite graph community detection toolkit.

Grappolo is a multithreaded implementation of the widely used Louvain heuristic for community detection. The novelty in our approach stems from the set of schemes we devised to break dependencies and generate concurrency while processing large graphs. Grappolo has demonstrated the ability to detect communities from graphs containing billions of edges in a matter of minutes without compromising on the quality of the solution.
Vite is an MPI-based implementation for parallel community detection on distributed memory clusters.

Representative Papers:

Parallel heuristics for scalable community detection. H. Lu, M. Halappanavar, A. Kalyanaraman. Parallel Computing, vol. 47, pp. 19-37, 2015.
Distributed Louvain algorithm for graph community detection. S. Ghosh, M. Halappanavar, A. Tumeo, A. Kalyanaraman, H. Lu, D. Chavarria-Miranda, A. Khan, A. Gebremedhin. Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 885-895, 2018.

Software: Grappolo, Vite

Key Collaborators: Mahantesh Halappanavar

Bipartite Graph Community Detection

Heterogeneous graph-theoretic modeling has become an important part of biological network science, owing to the variety in data sources. Analyzing the interrelationships between genes vs. diseases, proteins vs. drugs, transcriptome vs. metabolites, predators vs. preys, or hosts vs. pathogens – all such relationships can be modeled in the form of a bipartite graph.

biLouvain_example

biLouvain is a bipartite graph community detection tool. As part of this work, we defined a new modularity-based metric (which we call Murata+) that can serve as an objective function for bipartite community detection; and designed an efficient algorithm to detect communities from such networks based on the metric. Our findings show that: a) this problem is more time-consuming than the standard graph version of the problem; and b) biLouvain is able to achieve between one to four orders of magnitude speedup over state-of-the-art. We have also extended biLouvain to work for cases where there are intra-type edges (in addition to the inter-type edges of a classical bipartite graph).

Representative Papers:

Efficient detection of communities in biological bipartite networks. P. Pesantez, A. Kalyanaraman. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), In Press, 2017. doi: 10.1109/TCBB.2017.2765319.

Software: biLouvain

Homology Graph Construction

In a number of large-scale graph applications, particularly in the life sciences, an input graph is not readily always available; instead it needs to be constructed using pairwise homology information available from raw data. Our original work in homology graph construction was motivated by its application in identifying protein families from newly sequenced environmental microbial communities (i.e., from metagenomics data). We pose this problem as one of constructing a protein sequence homology graph in the first step, and subsequently identifying dense subgraphs within that graph. This work led to the pGraph-pClust software pipeline, for homology graph construction (pGraph) and graph clustering (pClust). pGraph_performance

Our pGraph work uses string data structures such as suffix trees and arrays, and parallel techniques to perform load balancing such as work stealing. It is designed for distributed memory parallel machines. Notably, the pGraph work showed for the first time that it is possible to construct highly accurate homology graphs for tens of millions of proteins in a matter of minutes.
Our pClust work uses an efficient approach based on MinHashing that made it readily parallelizable on distributed memory supercomputers.

Representative Papers:

A work stealing based approach for enabling scalable optimal sequence homology. J. Daily, A. Kalyanaraman, S. Krishnamoorthy, A. Vishnu. Journal of Parallel and Distributed Computing (JPDC), vol. 79-80, pp. 132-142, 2015. doi: 10.1016/j.jpdc.2014.08.009.
pGraph: Efficient parallel construction of large-scale protein sequence homology graphs. C. Wu, A. Kalyanaraman, W.R. Cannon. IEEE Transactions on Parallel and Distributed Systems (TPDS), 23(10):1923-1933, 2012. doi: 10.1109/TPDS.2012.19.
An efficient parallel approach for identifying protein families in large-scale metagenomic data sets. C. Wu, A. Kalyanaraman. Proc. ACM/IEEE Supercomputing conference (SC'08), pp. 1-10, 2008.

Software: pGraph/pGraph-Tascel, pClust

Key Collaborators: Sriram Krishnamoorthy

Graph Coloring

Coloring is a fundamental graph operation that is widely used by numerous applications that attempt to identify maximally independent subsets of vertices (i.e., those that do not depend on one another). Many parallel computing applications use coloring to identify such subsets so that they determine what subset of vertices can be processed concurrently. However, traditional formulations of graph coloring focus solely on minimizing the number of colors used (i.e., to reduce the number of parallel steps); and in the process they end up generating skewed distributions of color sizes where a a majority of the color classes receive very few vertices (thereby negatively impacting thread utilization). coloring

We present a new variant called the balanced graph coloring, which tries to achieve a dual objective of: a) minimizing the number of colors used; and b) keeping the color classes balanced in size. For this formulation, we designed multiple classes of heuristics and tested them on real world inputs. Our studies showed that our schemes were able to achieve near perfect balancing while keeping the colors as low as the traditional schemes. We also applied the results of balanced coloring in the community detection application and were able to improve its parallel performance two-fold for most inputs. This coloring routine has been integrated into the Grappolo toolkit.

Representative Papers:

Algorithms for balanced colorings with applications in parallel computing. H. Lu, M. Halappanavar, D. Chavarria-Miranda, A. Gebremedhin, A. Panyala, A. Kalyanaraman. IEEE Transactions on Parallel and Distributed Systems (TPDS), 28(5):1240-1256, 2017. doi: 10.1109/TPDS.2016.2620142.

Software: Grappolo

Key Collaborators: Mahantesh Halappanavar, Daniel Chavarria-Miranda, Assefaw Gebremedhin

Parallel Architectures for Graph Analytics and Biocomputing

Mapping irregular application codes from bioinformatics and graph computations, on the next generation of high performance computing architectures, is an important challenge in high performance computing. NoC_hilbert

We have extensively worked on mapping a number of bioinformatics and graph compute kernels to Network-on-Chip (NoC) enabled manycore platforms. Many of these operations, as particularly with graphs, introduce irregular on-chip traffic patterns that call for new types of on-chip interconnects. We have pursued multiple wireless- and wireline-based NoC interconnects. We have also mapped popular bioinformatics software routines for sequence alignment (using dynamic programming) and phylogenetic reconstruction on NoC architectures.

This line of work represents some of the first studies for mapping large-scale combinatorial irregular applications on NoC based manycore architectures.

Representative Papers:

Accelerting graph community detection with approximate updates via an energy-efficient NoC. K. Duraisamy, H. Lu, P. Pande, A. Kalyanaraman. Proc. Design Automation Conference (DAC), p. 89, 2017. doi: 10.1145/3061639.3062194.
Network-on-Chip hardware accelerators for biological sequence alignments. S. Sarkar, G. Kulkarni, P. Pande, A. Kalyanaraman. IEEE Transactions on Computers, 59(1):29-41, 2010.

Key Collaborators: Partha Pande

Bioinformatics Research

Genome Assembly:

De novo genome assembly is a classical problem in bioinformatics that aims to assemble an unknown genome from the short DNA reads obtained from it through sequencing. Due to significant advancements in sequencing technology, de novo genome assembly continues to be an active research topic. Over the years, we have contributed to the development of genome assemblers and their application to multiple genome projects (apple, maize, Brachypodium). Yet, the problem with tackling very large inputs (billions of DNA reads) continues to be both a time- and memory-consuming process. genome_assembly

We have developed a new approach FastEtch for genome assembly. FastEtch is fundamentally different from state-of-the-art assemblers in a couple of ways: i) it constructs an approximate (and incomplete) version of the de Bruijn graph in order to tradeoff quality for performance (both time and memory); and ii) it operates with a sub-linear space complexity and uses fast heuristics to generate the assembly. Our experimental evaluation shows that despite approximation, our assembler is able to reconstruct a target genome nearly as accurate as the best of the exact methods and in time and space that is significantly smaller. We are currently working on designing a new distributed memory genome assembler.

Representative Papers:

FastEtch: A fast sketch-based assembler for genomes. P. Ghosh, A. Kalyanaraman. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), In Press, 2017. doi: 10.1109/TCBB.2017.2737999.
Genome assembly. A. Kalyanaraman. Encyclopedia of Parallel Computing, D. Padua (Ed.), Springer Science+Business Media LLC, pp. 755-768, 2011. doi: 10.1007/978-0-387-09766-4.
The genome of the domesticated apple (Malus domestica Borkh.), R. Velasco, A. Zharkikh, J. Affourtit, A. Dhingra, A. Cestaro, A. Kalyanaraman, et al. Nature Genetics, 42:833-839, 2010. doi: 10:1038/ng.654.
Assembling genomes on large-scale parallel computers. A. Kalyanaraman, S. J. Emrich, P.S. Schnable, S. Aluru. Journal of Parallel and Distributed Computing (JPDC), 67(12):1240-1255, 2007.

Software: FastEtch, PaCE

Key Collaborators: Sriram Krishnamoorthy

Topological Data Analytics with Applications to the Life Sciences

Life science applications are rapidly adopting a wide range of sensing and high-throughput molecular and imaging technologies to generate complex data sets. These data sets are generated, with or without preconceived hypotheses, making the problem of gleaning actionable information from these data difficult. Computational techniques and advanced data mining tools are needed to analyze these complex, high dimensional data sets. Hyppo-X_example

In this project, we are pursuing a novel approach to develop a semi-automated capability for hypothesis extraction from complex, high-dimensional data sets. Our approach uses principles from algebraic topology, which is a branch of computational mathematics that studies shapes and structures, for representing high-dimensional data, and for performing a visual exploration of complex data in order to extract plausible(testable) hypotheses. The Hyppo-X tool is viewed as an instantiation of the Mapper abstraction proposed originally by Carlsson et al. at Stanford. Hyppo-X provides novel capabilities to visually and analytically explore complex data, and in the process, enable the domain expert to identify potentially interesting (hidden) subpopulations that have behave distinctly from the rest of the population. Our research has opened up two new lines of inquiry:
a) understand how to effectively define "interesting" structural features of the data; and
b) efficient algorithms and representations to identify and characterize such interesting structural features.

We are currently applying our Hyppo-X framework on different application use-cases:

Plant phenomics, where the grand challenge question is to understand how different crop genotypes (G) interact with their environments (E) to produce varying degrees of performance in their phenotypes (P); relating to the well known open problem in plant biology of G x E => P), with direct implications on crop breeding and selection.
See NSF-funded project webpage for more details.
Antimicrobial stewardship, where the goal is to understand key factors contributing to antibiotic usage and in the process, help devise mechanisms for effective intervention and prevention of antimicrobial resistance.

Representative Papers:

Towards a scalable framework for complex high-dimensional phenomics data. M. Kamruzzaman, A. Kalyanaraman, B. Krishnamoorthy, P.S. Schnable. arXiV preprint arXiV:1707.04362, 2017.
Detecting divergent subpopulations in phenomics data using interesting flares. M. Kamruzzaman, A. Kalyanaraman, B. Krishnamoorthy. ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), pp. 155-164, 2018.
Interesting paths in Mapper. A. Kalyanaraman, M. Kamruzzaman, B. Krishnamoorthy. arXiV preprint arXiv:1712.10197, 2017.

Software: Hyppo-X

Key Collaborators: Bala Krishnamoorthy, Pat Schnable, Bei Wang Phillips, Zhiwu Zhang, Eric Lofgren, Rebekah Moehring