OMBlast: Alignment Tool for Optical Mapping Using a Seed-and-extend Approach (Bioinformatics, 2017)

Motivation: Optical mapping is a technique for capturing fluorescent signal patterns of long DNA molecules (in the range of 0.1 Mbp to 1 Mbp). Recently, it has been complementing the widely-used short-read sequencing technology by assisting with scaffolding and detecting large and complex structural variations. Here, we introduce a fast, robust and accurate tool called OMBlast for aligning optical maps, the set of signal locations on the molecules generated from optical mapping. Our method is based on the seed-and-extend approach from sequence alignment, with modifications specific to optical mapping.

Results: Experiments with both synthetic and our real data demonstrate that OMBlast has higher accuracy and faster mapping speed than existing alignment methods. Our tool also shows significant improvement when aligning data with structural variations.

Availability and Implementation: OMBlast is implemented for Java 1.7 and is released under a GPL license. OMBlast can be downloaded from https://github.com/aldenleung/OMBlast and run directly on machines equipped with a Java virtual machine.

Contact: kevinyip@cse.cuhk.edu.hk; tf.chan@cuhk.edu.hk

Supplementary Information: Supplementary information is available at Bioinformatics online and the source code is available from the OMBlast website (https://github.com/aldenleung/OMBlast).

Query Modification through External Sources\\to Support Clinical Decisions (TREC, 2014)

For the Clinical Decision Support Track of TREC 2014, we looked into the effect of modifying queries by adding or removing terms. We considered both automatic and manual query modifications that use either external data sources or a domain expert. While each method gave slightly different results, we discovered that the manual method still performed slightly better among the methods we considered. This is despite the fact that the manual queries were formed through just term removal; no new terms were added.

ViralFusionSeq: Accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution (Bioinformatics, 2013)

Summary: Insertional mutagenesis from virus infection is an important pathogenic risk for the development of cancer. Despite the advent of high-throughput sequencing, discovery of viral integration sites and expressed viral fusion events are still limited. Here, we present ViralFusionSeq (VFS), which combines soft-clipping information, read-pair analysis, and targeted de novo assembly to discover and annotate viral-human fusions. VFS was used in an RNA-Seq experiment, simulated DNA-Seq experiment and re-analysis of published DNA-Seq datasets. Our experiments demonstrated that VFS is both sensitive and highly accurate.

Availability: VFS is distributed under GPL version 3 at http://hkbic.cuhk.edu.hk/software/viralfusionseq

Supplementary Information: Supplementary data are available at Bioinformatics Online.

Transformations for the Compression of FASTQ Quality Scores of Next Generation Sequencing Data (Bioinformatics 2012)

Motivation: The growth of next generation sequencing means that more effective and efficient archiving methods are needed to store the generated data for public dissemination and in anticipation of more mature analytical methods later. This paper examines methods for compressing the quality score component of the data to partly address this problem.

Results: We compare several compression policies for quality scores, in terms of both compression effectiveness and overall efficiency. The policies employ lossy and lossless transformations with one of several coding schemes. Experiments show that both lossy and lossless transformations are useful, and that simple coding methods, which consume less computing resources, are highly competitive, especially when random access to reads is needed.

Availability and Implementation: Our C++ implementation, released under the Lesser General Public License, is available for download at http://ihome.cuhk.edu.hk/~b126594/software/qscores-archiver.html.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Sorting Next Generation Sequencing Data Improves Compression Effectiveness (BIBM 2010 workshop)

With the increase usage of next generation sequencing, the problem of effectively storing and transmitting such massive amounts of data will need to be addressed. Current repositories such as the Sequence Read Archive (SRA) currently use the FASTQ format and a general-purpose compression systems (gzip) for data archiving. In this work, we investigate how gzip (and bzip2) can be made more effective for read archiving by pre-sorting the reads. The improvement in compression effectiveness of just the sequences is a reduction of at most 12% and of up to 6% when the original FASTQ data is considered.

HAMSTER: visualizing microarray experiments as a set of minimum spanning trees (SCFBM)

Background: Visualization tools allow researchers to obtain a global view of the interrelationships between the probes or experiments of a gene expression (e.g. microarray) data set. Some existing methods include hierarchical clustering and k-means. In recent years, others have proposed applying minimum spanning trees (MST) for microarray clustering. Although MST-based clustering is formally equivalent to the dendrograms produced by hierarchical clustering under certain conditions; visually they can be quite different.

Methods: HAMSTER (Helpful Abstraction using Minimum Spanning Trees for Expression Relations) is an open source system for generating a set of MSTs from the experiments of a microarray data set. While previous works have generated a single MST from a data set for data clustering, we recursively merge experiments and repeat this process to obtain a set of MSTs for data visualization. Depending on the parameters chosen, each tree is analogous to a snapshot of one step of the hierarchical clustering process. We scored and ranked these trees using one of three proposed schemes. HAMSTER is implemented in C++ and makes use of Graphviz for laying out each MST.

Results: We report on the running time of HAMSTER and demonstrate using data sets from the NCBI Gene Expression Omnibus (GEO) that the images created by HAMSTER offer insights that differ from the dendrograms of hierarchical clustering. In addition to the C++ program which is available as open source, we also provided a web-based version (HAMSTER+) which allows users to apply our system through a web browser without any computer programming knowledge.

Conclusions: Researchers may find it helpful to include HAMSTER in their microarray analysis workflow as it can offer insights that differ from hierarchical clustering. We believe that HAMSTER would be useful for certain types of gradient data sets (e.g time-series data) and data that indicate relationships between cells/tissues. Both the source and the web server variant of HAMSTER are available from http://hamster.cbrc.jp/.

Efficient Probabilistic Latent Semantic Analysis Through Parallelization (AIRS 2009)

Probabilistic latent semantic analysis (PLSA) is considered an effective technique for information retrieval, but has one notable drawback: its dramatic consumption of computing resources, in terms of both execution time and internal memory. This drawback limits the practical application of the technique only to document collections of modest size.

In this paper, we look into the practice of implementing PLSA with the aim of improving its efficiency without changing its output. Recently, Hong et al. [2008] has shown how the execution time of PLSA can be improved by employing OpenMP for shared memory parallelization. We extend their work by also studying the effects from using it in combination with the Message Passing Interface (MPI) for distributed memory parallelization. We show how a more careful implementation of PLSA reduces execution time and memory costs by applying our method on several text collections commonly used in the literature.

Chapter 8: Discovering network motifs in protein interaction networks

This chapter examines some of the available techniques for analyzing a protein interaction network (PIN) when depicted as an undirected graph. Within this graph, algorithms have been developed which identifies "notable" smaller building blocks called network motifs. We examine these algorithms by dividing them into two broad categories based on their definition of "notable": (a) statistically-based methods and (b) frequency-based methods. We describe how these two classes of algorithms differ not only in terms of efficiency, but also in terms of the type of results that they report. Some publicly-available programs are demonstrated as part of our comparison. While most of the techniques are generic and were originally proposed for other types of networks, the focus of this chapter is on the application of these methods and software tools to PINs.

(In X.-L. Li and S.-K. Ng, editors, Biological Data Mining in Protein Interaction Networks)

Term impacts as normalized term frequencies for BM25 similarity scoring (SPIRE 2008)

The BM25 similarity computation has been shown to provide effective document retrieval. In operational terms, the formulae which form the basis for BM25 employ both term frequency and document length normalization. This paper considers an alternative form of normalization using document-centric impacts, and shows that the new normalization simplifies BM25 and reduces the number of tuning parameters. Motivation is provided by a preliminary analysis of a document collection that shows that impacts are more likely to identify documents whose lengths resemble those of the relevant judgments. Experiments on TREC data demonstrate that impact-based BM25 is as good as or better than the original term frequency-based BM25 in terms of retrieval effectiveness.

A framework for determining outlying microarray experiments (IBSB 2008)

Microarrays are high-throughput technologies whose data are known to be noisy. In this work, we propose a graph-based method which first identifies the extent to which a single microarray experiment is noisy and then applies an error function to clean individual expression levels. These two steps are unified within a framework based on a graph representation of a separate data set from some repository.

We demonstrate the utility of our method by comparing our results against statistical methods by applying both techniques to simulated microarray data. Our results are encouraging and indicate one potential use of microarray data from past experiments.

Passage Retrieval with Vector Space and Query-Level Aspect Models (TREC 2007)

This report describes the joint work by Kyoto University and the University of Melbourne for the TREC Genomics Track in 2007. As with 2006, the task for this year was the retrieval of passages from a biomedical document collection. The overall framework of our system from 2006 remains unchanged and is comprised of two parts: a paragraph-level retrieval system and a passage extraction system. These two systems are based on the vector space model and a probabilistic word-based aspect model, respectively. This year, we have adopted numerous changes to our 2007 system which we believe corrected some problems.

Block Merging for Off-Line Compression (JASIST 2007)

To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once -- usually as a block size, but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat, which processes large messages as disjoint blocks in order to limit memory consumption. We show that the blocks emitted by Re-Pair can be post-processed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.

Note: Extended and updated version of the CPM 2001 paper and the PhD thesis (Chapter 5).

Combining Vector-Space and Word-based Aspect Models for Passage Retrieval (TREC 2006)

This report summarizes the work done at Kyoto University and the University of Melbourne for the TREC 2006 Genomics Track. The single task for this year was to retrieve passages from a biomedical document collection. We devised a system made of two parts to deal with this problem. The first part was an existing IR system based on the vector-space model. The second part was a newly developed probabilistic word-based aspect model for identifying passages within relevant documents (or paragraphs).

Applying Gaussian Distribution-dependent Criteria to Decision Trees for High-Dimensional Microarray Data (VDMB 2006)

Biological data presents unique problems for data analysis due to its high dimensions. Microarray data is one example of such data which has received much attention in recent years. Machine learning algorithms such as support vector machines (SVM) are ideal for microarray data due to its high classification accuracies. However, sometimes the information being sought is a list of genes which best separates the classes, and not a classification rate.

Decision trees are one alternative which do not perform as well as SVMs, but their output is easily understood by non-specialists. A major obstacle with applying current decision tree implementations for high-dimensional data sets is their tendency to assign the same scores for multiple attributes. In this paper, we propose two distribution-dependant criteria for decision trees to improve their usefulness for microarray classification.

Cleaning Microarray Expression Data using Markov Random Fields Based on Profile Similarity (SAC 2005)

This paper proposes a method for cleaning the noise found in microarray expression data sets. While other methods either concentrate on the image processing phase, or apply normalization techniques to the resulting microarray raw expression values, we introduce a method based on Markov random fields for the data set. The cleaning process is guided by genes with similar expression profiles. As a result, data cleaning is conducted using biological similarity of genes.

Browsing and Searching Compressed Documents (PhD thesis)

Compression and information retrieval are two areas of document management that exist separately due to the conflicting methods of achieving their goals. This research examines a mechanism which provides lossless compression and phrase-based browsing and searching of large document collections. The framework for the investigation is an existing off-line dictionary-based compression algorithm.

An analysis of the algorithm, supported by previous work and experiments, highlights two factors that are important for retrieval: efficient decoding, and a separate dictionary stream. However, three areas of improvement are necessary, prior to the inclusion of the algorithm into a browsing system.

First, in order to accommodate retrieval, the algorithm must produce a dictionary built up on words, rather than characters. A pre-processing stage is introduced which separates the message into words and non-words, along with word modifiers.

Second, the memory requirements of the algorithm prevent the processing of large documents. Earlier work has proposed a solution which separates the message into individual blocks prior to compression. Here, a post-processing stage is proposed which combines the blocks in a series of phases. Experiments show the trade-offs between the number of phases performed and the improvements in compression levels.

Organisations which provide access to an information retrieval system to users are sometimes concerned with the amount of disk space available. But users have a different point of view and may place response time at a higher priority. Indeed, faster computers and network connections translate into plummeting patience levels of users. The last improvement to the compression algorithm is two new coding schemes which replace the entropy coder that was used in previous work. While deploying them sacrifices compression effectiveness, these two mechanisms offer improved efficiency, as shown through experiments.

With the enhancements to the compression algorithm in hand, a technique to efficiently support phrase browsing is presented. Phrase contexts can be searched and progressively refined through the word modifiers. Because of the three changes to the algorithm, phrases are more visually appealing, larger documents can be processed, and response times are improved.

Evaluating Statisitically Generated Phrases (ADCS 2003)

An experimental framework for the evaluation of statistically generated phrases is described. Two methods are considered. The first compares the phrases with those found by a natural language processing system; while the second examines the ability to locate all instances of each phrase. The motivation for this work is to determine the usefulness of statistically generated phrases in a document retrieval system.

Block Merging for Off-Line Compression (CPM 2002)

To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once. In this work we consider the Re-Pair mechanism of Larsson and Moffat [2000], which processes large messages as disjoint blocks. We show that the blocks emitted by Re-Pair can be post-processed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.

Re-Store: A System for Compressing, Browsing, and Searching Large Documents (SPIRE 2001)

We describe a software system for managing text files of up to several hundred megabytes that combines a number of useful facilities. First, the text is stored compressed using a variant of the Re-Pair mechanism described by Larsson and Moffat, with space savings comparable to those obtained by other widely used general-purpose compression systems. Second, we provide, as a byproduct of the compression process, a phrase-based browsing tool that allows users to explore the contents of the source text in a natural and useful manner. And third, once a set of desired phrases has been determined through the use of the browsing tool, the compressed text can be searched to determine locations at which those phrases appear, without decompressing the whole of the stored text, and without use of an additional index. That is, we show how the Re-Pair compression regime can be extended to allow phrase-based browsing and fast interactive searching.

Effective Compression for the Web: Exploiting Document Linkages (ADC 2001)

Providing the infrastructure that supports the World-Wide Web is expensive. The costs incurred in setting up a web site include those associated with the content being served; those associated with the hardware necessary to support the site; and the network costs incurred in transmitting that content to the end consumers. In this work we examine mechanisms for compressing web content so as to reduce the third of these three costs, and describe a scheme that exploits the known connectivities between web pages to derive improved transmission cost savings compared to the obvious approach of simply compressing each page on the site using a standard tool such as GZip. Experiments on a medium-sized web site confirm our claims that considerable reductions in network bandwidth requirements can be achieved.

List of publications