Authors: Christopher Pockrandt, Mai Alzamel, Costas Iliopoulos and Knut Reinert
Abstract: We present a fast and exact algorithm to compute the (k,e)-mappability. Its inverse, the (k,e)-frequency counts the number of occurrences of each k-mer with up to e errors in a sequence. The algorithm we present is a magnitude faster than the widely used algorithm by the GEM suite, and can even compute the mappability for short k-mers on highly repetitive plant genomes. We also show that mappability can be computed on multiple sequences to identify marker genes illustrated by the example of E. coli strains. GenMap allows exporting the mappability information into different formats such as raw output and wig and bed-files. The application and its C++ source code is available on https://github.com/cpockrandt/genmap.
Authors: Charlotte A. Darby, James R. Fitch, Patrick J. Brennan, Benjamin J. Kelly, Natalie Bir, Vincent Magrini, Jeffrey Leonard, Catherine E. Cottrell, Julie M. Gastier-Foster, Richard K. Wilson, Elaine R. Mardis, Peter White, Ben Langmead and Michael C. Schatz
Abstract: We present Samovar, a mosaic single-nucleotide variant (SNV) caller for linked-read whole-genome shotgun sequencing data. Samovar scores candidate sites using a random forest model trained using the input dataset that considers read quality, phasing, and linked-read characteristics. We show Samovar calls mosaic SNVs within a single sample with accuracy comparable to what previously required trios or matched tumor/normal pairs and outperform single-sample mosaic variant callers at MAF 5%-50% with at least 30x coverage. Furthermore, we use Samovar to find somatic variants in whole genome sequencing of both tumor and normal from 13 pediatric cancer cases that can be corroborated with high recall with whole exome sequencing. Samovar is available open-source at https://github.com/cdarby/samovar under the MIT license.
Authors: Alexander Shlemov and Anton Korobeynikov
Abstract:Recently large databases containing profile Hidden Markov Models (pHMMs) emerged. These pHMMs may represent the sequences of antibiotic resistance genes, or allelic variations amongst highly conserved housekeeping genes used for strain typing, etc. The typical application of such a database includes the alignment of contigs to pHMM hoping that the sequence of gene of interest is located within the single contig. Such a condition is often violated for metagenomes preventing the effective use of such databases.
We present PathRacer – a novel tool that aligns profile HMM directly to the assembly graph (performing the codon translation on fly if necessary for amino acid pHMMs). The tool provides the set of most probable paths traversed by a HMM through the whole assembly graph, regardless whether the sequence of interested is encoded on the single contig or scattered across the set of edges, therefore significantly improving the recovery of sequences of interest even from fragmented metagenome assemblies.
Authors: Cuncong Zhong and Shaojie Zhang
Abstract: MicroRNA (miRNA) trans-regulates the stability of many mRNAs and controls their expression levels. The reconstruction of the miRNA-mRNA interactome is key to the understanding of the miRNA regulatory network and related biological processes. However, existing miRNA target prediction methods are limited to canonical miRNA-mRNA interactions and have high false prediction rates; while other experimental methods are low-throughput and cannot be used to probe genome-wide interactions. To address this challenge, the Crosslinking Ligation And Sequencing of Hybrids (CLASH) technology was developed for high-throughput probing of transcriptome-wide microRNA-mRNA interactions in vivo. However, the mapping of the duplex reads, which are chimeras of two ultra-short RNA strands, poses significant computational challenges to current mapping/alignment methods. To address this issue, we developed a novel algorithm called CrossLinked reads ANalysis tool (CLAN), and systematically benchmarked it with other mapping tools including BLAST, BWA, STAR, HISAT2, and BOWTIE2. The results showed that CLAN generated comparable mapping of singular reads as the other tools, and significantly outperformed the other tools in mapping simulated and real CLASH duplex reads. CLAN is implemented in GNU C++, and is freely available from https://sourceforge.net/projects/clan-mapping.
Authors: Giulia Bernardini, Paola Bonizzoni, Luca Denti, Marco Previtali and Alexander Schönhuth
Abstract: The amount of genetic variation discovered and characterized in human populations is huge, and is growing rapidly with the widespread availability of modern sequencing technologies. Such a great deal of variation data, that accounts for human diversity, leads to various challenging computational tasks, including variant calling and genotyping of newly sequenced individuals. The standard pipelines for addressing these problems include read mapping, which is a computationally expensive procedure. A few mapping-free tools were proposed in recent years to speed up the genotyping process. While such tools have highly efficient run-times, they focus on isolated, bi-allelic SNPs, providing limited support for multi-allelic SNPs, indels, and genomic regions with high variant density.
To address these issues, we introduce MALVA, a fast and lightweight mapping-free method to genotype an individual directly from a sample of reads. MALVA is the first mapping-free tool that is able to genotype multi-allelic SNPs and indels, even in high density genomic regions, and to effectively handle a huge number of variants such as those provided by the 1000 Genome Project. An experimental evaluation on whole-genome data shows that MALVA requires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels, MALVA provides even better results than the most widely adopted variant discovery tools.
Authors: Daniel Standage, C. Titus Brown and Fereydoun Hormozdiari
Abstract: Discovery of genetic variants by whole genome sequencing has proven a powerful approach to study the etiology of complex genetic disorders. Elucidation of all variants is a necessary step in identifying causative variants and disease genes. In particular, there is an increased interest in detection of de novo variation and investigation of its role in various disorders. State-of-the-art methods for variant discovery rely on mapping reads from each individual to a reference genome and predicting variants from differences observed between the mapped reads and the reference genome. This process typically results in millions of variant predictions, most of which are inherited and irrelevant to the phenotype of interest. To distinguish between inherited variation and novel variation resulting from de novo germline mutation, whole-genome sequencing of close relatives (especially parents and siblings) is commonly used. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction, which in many cases arises from problems with read mapping. This is a particular challenge in predicting de novo indels and structural variants.
We have developed a mapping-free method, Kevlar, for de novo variant discovery based on direct comparison of sequence content between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest and retrieves the reads containing these k-mers. These reads are easily partitioned into disjoint sets by shared k-mer content for subsequent locus-by-locus processing and variant calling. Kevlar also utilizes a novel probabilistic approach to score and rank the variant predictions to identify the most likely de novo variants. We evaluated Kevlar on simulated and real pedigrees, and demonstrate its ability to detect both de novo SNVs and indels with high sensitivity and specificity.
Authors: Heng Li
Abstract: Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms.
Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3–4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation: https://github.com/lh3/minimap2
Authors: Martin Steinegger, Mirdita Milot and Johannes Söding
Abstract: Sequencing costs have dropped much faster than Moore’s law in the past decade. The analysis of large metagenomic datasets and not its generation is the now the main time and cost bottleneck. We present three methods that together much alleviate the challenges posed by the exploding amount of metagenomics data and that allow us to go from an experiment-by experiment analysis to large-scale analyses of hundreds or thousands of datasets.
MMseqs2 is a protein sequence and profile search method slightly more sensitive than PSI-BLAST and 400 times faster. MMseqs2 can annotate 1.1 billion sequences in 8.3 hours on 28 cores. MMseqs2 offers great potential to increase the fraction of annotatable (meta)genomic sequences.
Linclust is a sequence clustering method whose run time scales linearly with the input set size, not nearly quadratically as in conventional algorithms. It can cluster 1.6 billion metagenomic sequence fragments in 10 hours on a single server to 50% sequence identity, >1000 times faster than has been possible previously.
PLASS is a metagenomic protein sequence assembler whose runtime and memory scale linearly with dataset size. It can assemble ten times more protein sequences from soil metagenomes, and faster than Megahit and other popular nucleotide-level assembler.
Authors: Maël Kerbiriou and Rayan Chikhi
Abstract: Decompressing a file made by the gzip program at an arbitrary location is in principle impossible, due to the nature of the DEFLATE compression algorithm. Consequently, no existing program can take advantage of parallelism to rapidly decompress large gzip-compressed files. This is an unsatisfactory bottleneck, especially for the analysis of large sequencing data experiments. Here we propose a parallel algorithm and an implementation, pugz, for fast and exact decompression of any text file. We show that pugz is an order of magnitude faster than gunzip, and 5x faster than a highly-optimized sequential implementation (libdeflate). We also study random access to compressed DNA sequence data in the FASTQ format. We give simple models and experimental results that shed light into the structure of gzip-compressed files containing DNA. Preliminary results show that random access to sequences within a gzip-compressed FASTQ file is feasible on almost all files with low compression levels, and is approximate at higher compression levels.
Authors: Camille Marchet, Maël Kerbiriou and Antoine Limasset
Abstract: Background: The need to associate information to words is shared among a plethora of applications and methods in high throughput sequence analysis, and could be marked as fundamental. A scalability problem is promptly met when indexing billions of k-mers, as exact associative indexes can be memory expensive. To leverage this challenge, recent works take advantage of the k-mer sets properties. They exploit the overlaps shared among k-mers by using a De Bruijn graph as a compact k-mer set to provide lightweight structures.
Contribution: We propose a scalable and exact index structure able to associate unique identifiers to indexed k-mers and to reject alien k-mers. The proposed index combines an extremely compact representation along with a high throughput. Moreover, it can be efficiently built from the De Bruijn graph sequences. The efficient index implementation we provide, achieved to index the k-mers from the human genome with 8GB within 30 minutes and was able to scale up to the huge axolotl genome with 63 GB within 10 hours. Furthermore, while being memory efficient, the index allows above a million queries per second on a single CPU in our experiments and its throughput can be raised using multiple cores. Finally, we also present the index ability to practically represent metagenomic and transcriptomic sequencing data to highlight its wide applicative range.
Availability: The index is implemented as a header-only library in C++, is open source under AGPL3 license and available at https://github.com/Malfoy/Blight. It was designed as a user-friendly library and comes along with sample code usage.
Authors: Pierre Morisse, Camille Marchet, Antoine Limasset, Thierry Lecroq and Arnaud Lefebvre
Abstract: Third generation sequencing technologies such as Pacific Biosciences and Oxford Nanopore allow the sequencing of long reads of tens of kbs, that are expected to solve various problems, such as contig and haplotype assembly, scaffolding, and structural variant calling. However, they also reach high error rates of 10 to 30%, and thus require efficient error correction. As first long reads sequencing experiments produced reads displaying error rates higher than 15% on average, most methods relied on the complementary use of short reads data to perform correction, in a hybrid approach. However, these sequencing technologies evolve fast, and the error rate of the long reads is now capped at around 10-12%. As a result, self-correction is now frequently used as a first step of third generation sequencing data analysis projects. As of today, efficient tools allowing to perform self-correction of the long reads are available, and recent observations suggest that avoiding the use of second generation sequencing reads could bypass their inherent bias.
We introduce CONSENT, a new method for the self-correction of long reads that combines different strategies from the state-of-the-art. A multiple sequence alignment strategy is thus combined to the use of local de Bruijn graphs. Moreover, the multiple sequence alignment benefits from an efficient segmentation strategy based on k-mers chaining, allowing to greatly reduce its time footprint. Our experiments show that CONSENT compares well to the latest state-of-the-art self-correction methods, and even outperforms them on real Oxford Nanopore datasets. In particular, they show that CONSENT is the only method able to scale to a human dataset containing Oxford Nanopore ultra-long reads, reaching lengths up to 340 kbp.
CONSENT is implemented is C++, supported on Linux platforms and freely available at https://github.com/morispi/CONSENT.
Authors: Pierre Marijon, Rayan Chikhi and Jean-Stéphane Varré
Abstract: Long-read genome assembly tools are expected to reconstruct bacterial genomes nearly perfectly, however they still produce fragmented assemblies in some cases. It would be beneficial to understand whether these cases are intrinsically impossible to resolve, or if assemblers are at fault, implying that genomes could be refined or even finished with little to no additional experimental cost.
We propose a set of computational techniques to assist inspection of fragmented bacterial genome assemblies, through careful analysis of assembly graphs. By finding paths of overlapping raw reads between pairs of contigs, we recover potential short-range connections between contigs that were lost during the assembly process. We show that our procedure recovers 45% of missed links in fragmented Canu assemblies across samples from the NCTC bacterial sequencing project.
We also observe that a simple procedure based on enumerating weighted Hamiltonian cycles can suggest likely contig orderings. In our tests, the correct contig order is ranked first in half of the cases and within the top-3 predictions in nearly all evaluated cases, providing a direction for finishing fragmented long-read assemblies.
Authors: Robert S. Harris and Paul Medvedev
Abstract: Algorithmic solutions to index and search biological databases are a fundamental part of bioinformatics, providing underlying components to many end-user tools. Inexpensive next generation sequencing has filled publicly available databases such as the Sequence Read Archive beyond the capacity of traditional indexing methods. Recently, the Sequence Bloom Tree (SBT) and its derivatives were proposed as a way to efficiently index such data for queries about transcript presence. We build on the SBT framework to construct the HowDe-SBT data structure, which uses a novel partitioning of information to reduce the construction and query time as well as the size of the index. We evaluate HowDe-SBT by both proving theoretical bounds on its performance and using real RNA-seq data. Compared to previous SBT methods, HowDe-SBT can construct the index in less than 36% the time, and with 39% less space, and can answer small-batch queries at least five times faster. HowDe-SBT is available as a free open source program on https://github.com/medvedevgroup/HowDeSBT. The preprint is available on http://biorxiv.org/cgi/content/short/501452v2.
Authors: Ilia Minkin and Paul Medvedev
Abstract: Multiple whole-genome alignment is a fundamental and challenging problems in bioinformatics. Despite many ongoing successes, today’s methods are not able to keep up with the growing number, length, and complexity of assembled genomes. Approaches based on using compacted de Bruijn graphs to identify and extend anchors into locally collinear blocks hold the potential for scalability, but current algorithms still do not scale to mammalian genomes. We present a novel algorithm SibeliaZ-LCB for identifying collinear blocks in closely related genomes based on the analysis of the de Bruijn graph. We further incorporate it into a multiple whole-genome alignment pipeline called SibeliaZ. SibeliaZ shows drastic run-time improvements over other methods on both simulated and real data, with only a limited decrease in accuracy. On sixteen recently assembled strains of mice, SibeliaZ runs in under 12 hours, while other tools could not run to completion for even eight mice, given a week. SibeliaZ makes a signicant step towards improving scalability of multiple whole-genome alignment and collinear block reconstruction algorithms and will enable many comparative genomics studies in the near future.
Authors: Parsoa Khorsand and Fereydoun Hormozdiari
Abstract: Large scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second and third generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to only specific types of variants and are generally prone to various errors and ambiguities when genotyping events in repeat regions. Thus we are proposing an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches.
Our method Nebula utilizes the changes in the count of kmers to predict the genotype of common structural variations. We have shown that not only Nebula is an order of magnitude faster than mapping based approaches for genotyping deletions and mobile-element insertions, but also has comparable accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event.
Deep learning model to improve speed and accuracy of genome assembly
Authors: Minh Duc Cao, Michael Meyer, Zhizhuo Zhang and Jonathan Rothberg
Abstract: Rapid advances in biotechnology are revolutionizing the genomics landscape. Third-generation sequencing technologies are now able to sequence fragments of DNA that are greater than 10 kb in length. This is essential for applications such as assembling new genomes and calling novel structural variants, a problem which short-read platforms struggle with. This technical advance, together with the high throughput and low costs of these technologies, opens up the opportunity to bring sequencing into the clinic to transform health care and medicine. However, due to the relatively high raw error rates of these technologies (>15%), computational approaches for analyzing these data are still substantially expensive, making a hurdle to bring genomics sequencing into practical use. Here we present an approach that leverages deep learning and cloud computing to substantially reduce the cost of assembling genomes while producing more accurate assemblies than existing assembly pipelines.
Computational methods for assembling long and erroneous reads often use the Overlap-Layout-Consensus (OLC) approach. Reads are first mapped in an all-vs-all fashion to generate an overlapping graph, which is then condensed to produce a layout of the genome. While there are fast and efficient algorithms for these two stages, polishing the draft assembly still presents a big challenge due to the high error rates present in these sequencing platforms. Early platform-specific consensus calling methods (Quiver for Pacbio and Nanopolish for Oxford Nanopore) use Hidden Markov models and are slow. Generic polishing methods such as Racon and Sparc trade accuracy for speed and typically result in low-quality assemblies. Popular assembly pipelines such as Canu and FALCON integrate error-correction in the overlap and layout stages to improve the accuracy at the cost of substantially high computational requirements. No single tool is capable of delivering a fast, high quality, and low-cost genome assembly.
To address this bottleneck, we developed a deep learning model that can learn the specific error profile of a sequencing technology for consensus calling and can quickly and accurately polish the draft assembly. Given a window of the draft assembly, and the alignment of raw reads to the window, the model first constructs the multiple alignment by combining the pairwise alignments of individual reads to the assembly window. It then scans the multiple alignments to identify sites of likely errors on the assembly to correct.
Each window is polished independently. Only data specific to that window need to be loaded into memory. As a result, our approach is highly parallelizable and requires only a small memory footprint. We integrated our deep learning consensus algorithm with minimap2 (for overlapping) and miniasm (for layout) to complete an assembly pipeline. We implemented this in a streaming fashion making it easy to be integrated into a modern industrial distributed system such as Spark. The design also allows us to make use of heterogeneous computational resources such as GPU for deep learning computations, and CPU for bioinformatics processing.
We trained our deep learning models using public datasets, as well as a simulation data, from a Saccharomyces cerevisiae strain. We then applied our method to assemble the genome from a different species, an Escherichia coli strain. We compared the results of our pipeline with that of two existing assemblers, Canu and Racon. We ran minimap2 and miniasm to generate a draft assembly and then applied multiple iterations of polishing using Racon and our novel model. Our method needed only 4 iterations of polishing to obtain an accuracy better than Canu, while requiring less than half of Canu’s CPU time. On the other hand, the assembly polished by Racon converged after 10 iterations but had the lowest accuracy.
In summary, we report here an accurate and efficient pipeline for assembling long read sequencing data, that makes use of deep learning and cloud computing to lower the cost of sequence analysis. We envision that this will contribute to bringing sequencing more accessible to practical applications.
Predicting pathogenicity of missense variants using deep learning
Authors: Haicang Zhang, Jianqi Hong, Wendy Chung and Yufeng Shen
Abstract: Accurate predictions of genetic effects of missense variants is critical to the interpretation of genome sequence. Previously published prediction methods have been used to improve power in genetic studies and to identify pathogenic variants in clinical genetic testing. However, the performance of these methods is suboptimal, due to several issues, including the complexity of the mechanisms of pathogenic variants, false positives in the training data, and suboptimal usage of large training data sets. In addition, lack of direct interpretability and generalization of the predicted scores limit their practical utility. There is an urgent need to make clear relative importance of different features and to understand why a variant is predicted to be pathogenic.
Here we report on a new prediction method MVP2. MVP2 uses a deep learning approach to learn protein-context-specific pathogenicity of amino acid changes from large number of curated pathogenic variants. We define the protein context as a window centered on the missense variant. For each residue in the context, we collect information about structural and biophysical properties, evolutionary conservation, and subgenic coding constraints and gene-level constraints in general population. MVP2 uses convolutional networks to learn salient features without prior knowledge. Using cancer hotspot mutations as a benchmark, MVP2 achieved an area under the receiver operating characteristic curve of 0.9, substantially better than other published methods. To further assess the utility in clinical genetic diagnostic testing, we obtained functional data of ~1800 missense variants (441 deleterious) in BRCA1, ~1800 missense variants (258 deleterious) in PTEN, and ~1600 missense variants (540 deleterious) in TP53 benchmarks. The area under the precision-recall curve of MVP is 0.84, 0.79 and 0.80 in BRCA1, PTEN and TP53, respectively, substantially better than previously published methods (0.77, 0.67 and 0.77, respectively). Setting a threshold to achieve a positive predictive value (PPV) of 0.9, previous methods only correctly classify 20-25% of pathogenic variants, categorizing most of the true pathogenic variants as variants of uncertain significance (VUS). MVP2 is able to identify 40% of true pathogenic variants with the same PPV threshold, substantially reducing the number of VUS. In summary, we show that MVP significantly improves predictions of pathogenic missense variants by capturing protein context-dependent information using deep learning, and it has the potential to significantly increase the power in genetic studies and enhance the utility of clinical genetic testing.
Authors: Qiuchen Meng, Dongyang Luo, Yinqing Li and Xuegong Zhang
Abstract: Perturb-seq (also called CRISP-seq and CROP-seq) uses CRISPR gene-editing technology to perturb specific genes and assess the resulting phenotypes of each perturbation. Its data provide rich information of gene regulation networks as ideally only one gene is perturbed in each cell and the effect of this gene on all other genes can be inferred by computational analysis.
Noises and uncertainties exist in several steps of the Perturb-seq protocol. Among them, one key issue is that the exact original expression profile of each perturbed cell cannot be measured and must be estimated through computation. Some existing work used the assumption that all cells before the perturbation share the same original state when solving the above problem. But this assumption may not be true in many real scenarios. In a population of cells with heterogeneity before the perturbation, differences in gene expressions in the perturbed cells are not only caused by the perturbation, but also caused by intrinsic differences of cell subgroups.
We developed a method to decouple the effect of perturbation from the influence of cell subgroups by solving for the maximum likelihood estimation of the initial gene expression from the observed unperturbed cells’ expression by assuming the two distributions are the same or similar. We conducted a series simulation experiments to investigate the performance of the proposed method. Preliminary results show very promising performances on simulation data with moderate complexity and we are expanding the experiments on more complicated simulation data and real data.
Continuous chromatin state feature annotation of the human epigenome
Authors: Bowen Chen, Neda Shokraneh Kenari, Habibollah Daneshpajouh, Kay Weise and Maxwell W Libbrecht
Abstract: Semi-automated genome annotation (SAGA) methods are widely used to understand genome activity and gene regulation. These methods take as input a set of sequencing-based assays of epigenomic activity (such as ChIP-seq measurements of histone modification and transcription factor binding), and output an annotation of the genome that assigns a chromatin state label to each genomic position. Existing SAGA methods have several limitations caused by the discrete annotation framework: such annotations cannot easily represent varying strengths of genomic elements, and they cannot easily represent combinatorial elements that simultaneously exhibit multiple types of activity. To remedy these limitations, we propose an annotation strategy that instead outputs a vector of chromatin state features at each position rather than a single discrete label. Continuous modeling is common in other fields, such as in topic modeling of text documents. We propose a method, epigenome-ssm, that uses a non-negative Kalman filter state space model to efficiently annotate the genome with chromatin state features. We show that chromatin state features from epigenome-ssm are more useful for several downstream applications than both continuous and discrete alternatives, including their ability to identify expressed genes and enhancers. Therefore, we expect that these continuous chromatin state features will be valuable reference annotations to be used in visualization and downstream analysis.
Dashing: Fast and Accurate Genomic Distances with HyperLogLog
Authors: Daniel Baker and Ben Langmead
Abstract: Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing
Improved linear alignments through selective reference relaxation
Authors: Nae-Chyun Chen, Brad Solomon and Ben Langmead
Abstract: Recent large-scale genome projects have sequenced up to 100k whole-human genomes, creating more comprehensive variant catalogs. Alignment accuracy can be improved by using known genetic variants to remove undesirable alignment penalties. These variants can be taken into account by including them in the reference genome representation, e.g. by using a graph-shaped rather than a linear reference. We observe that the choice of which variants to include in the graph genome substantially affects alignment accuracy. For example, constructing a reference consisting of the major variant alleles is able to recover 35% of the difference between the raw and personalized reference. We also find that some types of reads do not benefit from alignment to a variant-aware genome: particularly reads that align to unique portions of the genome and that do not overlap variants in the population. Here we present a novel strategy, selective reference relaxation, which is capable of recovering 54% of the benefit of a personalized reference. The main innovations of our method are the use of a variant-free major-allele linear reference to produce an unbiased core alignment and an ERG-based algorithm for the gradual addition of variants. By “committing” around 90% of reads aligned to a variant-free linear reference, we are capable of testing a wide range of potential variant sets in a fraction of the standard alignment time and compute resources. We selected a collection of `representative’ variant sets using hierarchical clustering on all the individuals in the 1000 Genomes phase 3 release in order to identify a subset of significantly divergent individuals. Using these variant sets in our proposed method, preliminary tests were capable of recovering up to 62% of the benefit of a personalized reference on average and up to 79% when indels were included.
BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper
Authors: Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick and Aydın Buluç
Abstract: De novo assembly is the process of accurately reconstructing a genome sequence using only overlapping, DNA sequence fragments (reads) that redundantly sample a genome. While longer reads improve the contiguity of the assembly, current long-read technologies come with high error rates. A crucial step of de novo genome assembly for long reads consists of finding overlapping reads. We present Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel algorithm for computing overlaps and alignments that balances the goals of recall and precision, consistently performing well on both.
We present a probabilistic model based on the Markov chain model which demonstrates the feasibility of using short, fixed length k-mers for overlap detection of long-read data with high error rates. Given the error rate, our model guides the choice of the optimal k-mer length. However, not all k-mers are created equal in terms of their usefulness for detecting overlaps. For instance, the overwhelming majority of k-mers that occur only once in the dataset are errors (and are also not useful for seeding overlaps between pairs of reads). Similarly, k-mers that occur more frequently than what would be expected given the sequencing depth and the error profile are likely to come from repetitive regions of the genome. It is a common practice to prune the k-mer space using various methodologies. Consequently, we introduce a notion of reliable k-mers based on our probabilistic model. The reliability of a k-mer is defined as its probability of having originated from a unique (non-repetitive) region of the genome. The use of reliable k-mers eliminates both the k-mer set explosion, that would otherwise occur with highly erroneous reads, and the spurious overlaps from k-mers originating in repetitive regions. BELLA’s reliable k-mer detection is achieved through a probabilistic model exploiting the error rate (which BELLA estimates from the mean error probability of the data) and the sequencing depth. Once detected reliable k-mers, BELLA uses Sparse Generalized Matrix Multiplication (SpGEMM) for overlap detection, which allows our algorithm to achieve fast overlapping without using approximate approaches. This method enables the use of high-performance techniques previously not applied in the context of long-read alignment. It also allows continuing performance improvements in this step due to the ever-improving optimized implementations of SpGEMM.
Finally, BELLA’s overlap detection has been coupled with an optimized state-of-the-art seed-and-extend algorithm, meaning the alignment between two reads starts from a shared seed (identified in the previous overlap detection). In this context, we present a new method based on Chernoff bounds to separate true (genomic) overlaps from false positives depending on the alignment score. Using this methodology, we show that the probability of false positives drops exponentially as the length of overlap between sequences increases.
On simulated data with a 15% error rate, BELLA achieved both high precision and recall, thus obtaining the best F1 score on average, while remaining runtime performance-competitive. BELLA outperformed its closest competitor, Minimap2, with an average of 2.7% higher recall, 17.9% higher precision, and 10.9% higher F1 score. On real data from Pacific Biosciences, the top performer on one data set becomes one of the worst on some other data set whereas BELLA’s F1 score is consistently within 6.5% of the top entry. Future work includes a further characterization of real data features, the impact on BELLA on assembly quality, and the development of a high-performance pairwise alignment algorithm to improve BELLA’s runtime performance.