Introduction
Sequencing technologies have come a long way since the first draft of the human genome was published in 2001. In the early days, scientists laboriously read the DNA sequence one letter at a time. As the field of next-generation sequencing (NGS) has advanced, new technologies have dramatically improved the cost and speed of sequencing, allowing data to be generated faster and more efficiently. Today, a single instrument can sequence multiple human genomes in a single day.
Technological improvements in sequencing have been dominated over the last 20 years by short-read technologies that have progressively advanced throughput primarily by multiplexing – being able to read ever larger numbers of short fragments at a time, typically a few hundred nucleotides per individual read.
Another dimension of advancement has been the emergence of long-read sequencing platforms that can sequence longer individual DNA molecules. These so-called “third generation” platforms, such as the PacBio and Oxford Nanopore long-read technologies, have evolved to a cost per acquired base that now rivals that of mid-range short-read platforms.
With these advancements, there are now additional considerations when selecting a sequencing technology – not only in terms of how much data or genome coverage is needed and how many samples must be multiplexed together, but also what read length would be most beneficial and impactful to the research question at hand. Choosing which type of platform to use is increasingly one of the key early decisions in designing a sequencing project.
This blog focuses on recent developments in short- and long-read sequencing approaches and how both types of sequencing have their own benefits and drawbacks, depending on the specific goals of the experiment.1
The Workhorse in NGS Labs: Short-Read Sequencing
Short-read sequencing is a powerful tool for generating genomic data. With short-read sequencing, DNA or RNA can be sequenced in less time and at lower cost than with traditional methods. This technology has revolutionized biomedical research and led to important discoveries in genomics, evolution, and disease. It has also been used to assemble whole genomes, providing valuable information about the structure and function of genes. Short reads are effective for applications aimed at counting the abundance of specific sequences, identifying variants within otherwise well-conserved sequences, or profiling the expression of particular transcripts.2
Short-read sequencing has long been considered the workhorse in NGS labs, mainly because it is the best way to obtain high depth and high-quality data for the lowest cost per base. Illumina’s platform dominates this field along with Thermo Fisher Scientific’s Ion Proton technology.
However, just since 2020, multiple new sequencing technologies have been introduced, helping to drive short-read sequencing costs even lower. These include platforms from Element Biosciences, Ultima Genomics, and Complete Genomics, as well as PacBio’s Onso.
While these technologies use different chemistries, they all generate reads that are, at most, a few hundred bases long. Even so, the reads are high quality, and there are a lot of them – anywhere from a few million to many billions per run, depending on the sequencer. This means researchers can get higher coverage of their genomes or targets of interest, which enables high-confidence SNP and mutation calling.
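As a rough illustration of how read count and read length translate into coverage, expected average depth can be estimated with the classic Lander-Waterman relationship C = N × L / G. The function name and run parameters below are hypothetical, not taken from any specific platform:

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Lander-Waterman expected average depth: C = N * L / G."""
    return num_reads * read_length / genome_size

# Hypothetical run: 800 million 150 bp reads over a ~3.1 Gb human genome
depth = expected_coverage(800_000_000, 150, 3_100_000_000)
print(f"~{depth:.0f}x average coverage")  # ~39x average coverage
```

Actual usable depth is lower after duplicates, adapter trimming, and mapping losses, which is one reason labs over-sequence relative to the target depth.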
New sequencing technologies have made it possible to generate vast amounts of data, but they have also introduced new bottlenecks in processing samples into sequencer-ready libraries quickly and cost-effectively, at a scale that matches instrument capabilities. In an era where sequencers can generate 1000s of gigabases of raw sequencing data, library preparation now represents a large and growing portion of the challenge of tackling large sequencing projects.
To make full use of the large capacity of short-read sequencing instruments, researchers must streamline library preparation for ever larger numbers of samples. A large part of this challenge lies in normalization: achieving balanced libraries with highly uniform insert size distributions and uniform representation across samples. Normalization can be time-consuming and expensive, but if it is omitted, the result can be uneven read counts and insert size distributions that reduce the amount of usable data obtained from individual samples.

seqWell’s multiplexed library prep technologies, such as purePlex and ExpressPlex, are designed to address these problems by integrating library preparation, multiplexing, and normalization into simple, efficient workflows. ExpressPlex and purePlex allow projects involving 100s or even 1000s of samples to be tackled with minimal automation or extra steps, unlocking a wide range of multiplexed sequencing applications such as microbial genome sequencing, plasmid and vector QC, and low-pass whole-genome sequencing.
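The arithmetic behind manual normalization is simple in principle: pool each library so that it contributes the same molar amount. A minimal sketch, assuming each library has been quantified in nM (the function name, target amount, and concentrations are illustrative):

```python
def equimolar_volumes(concentrations_nM, target_fmol=10.0):
    """Volume (in uL) of each library that contributes the same molar amount.

    Since 1 nM = 1 fmol/uL, volume = target_fmol / concentration_nM.
    """
    return [round(target_fmol / c, 2) for c in concentrations_nM]

# Three hypothetical libraries quantified at different concentrations (nM)
print(equimolar_volumes([4.0, 8.0, 2.0]))  # [2.5, 1.25, 5.0]
```

In practice this per-sample quantification and pipetting is exactly the time-consuming step that normalization-integrated library prep chemistries aim to eliminate.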
All short-read sequencing technologies share a common limitation – the inability to sequence long stretches of DNA in a single read. To sequence a large stretch of DNA using NGS, such as a human genome, the strands have to be fragmented and amplified. Bioinformatic programs are then used to assemble these random fragments into a continuous sequence. Unfortunately, the amplification steps can introduce biases into the samples. In addition, short reads may not share sufficient overlap to be assembled unambiguously, particularly across repetitive regions. Overall, this means that sequencing a highly complex and repetitive genome, like that of a human, can be challenging using these technologies.1
Long Read NGS: More Accessible than Ever
Long-read NGS is another powerful tool that can provide insights across a wide variety of genomic applications. As the name implies, long-read sequencers can generate reads with much longer lengths – anywhere from a few thousand to hundreds of thousands of bases. These longer reads allow researchers to more easily identify complex structural variation such as large insertions/deletions, inversions, repeats, duplications, and translocations. This sequencing technology can also be used to phase SNPs into haplotypes, build scaffolds for de novo assembly, and resolve splicing events in full-length cDNA.
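Phasing works because a single long read that spans two heterozygous SNPs directly reveals which alleles travel together on the same chromosome. A toy sketch of counting allele pairings observed on spanning reads (the reads and function are hypothetical, for illustration only):

```python
from collections import Counter

def infer_haplotype(spanning_reads):
    """Given (allele_at_snp1, allele_at_snp2) pairs observed on individual
    long reads spanning both sites, return the best-supported pairing
    and its read support."""
    pairing, support = Counter(spanning_reads).most_common(1)[0]
    return pairing, support

# Five hypothetical long reads covering both heterozygous SNPs
reads = [("A", "T"), ("A", "T"), ("G", "C"), ("A", "T"), ("G", "C")]
print(infer_haplotype(reads))  # (('A', 'T'), 3)
```

Here A-T reads support one haplotype and G-C reads the complementary one; short reads rarely span distant SNP pairs, so this linkage is lost without long reads or specialized linked-read preps.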
Long-read NGS instruments have been on the market for the past decade, but historically their lower yield, higher error rates, and higher instrument costs have kept them from being more widely adopted. With recent advancements such as the PacBio Revio and Vega and the ONT GridION and PromethION, the scale and cost of long-read sequencing are now comparable, base for base, with many mid-range short-read sequencing instruments.
Another historical downside of long-read sequencing has been much lower per-read accuracy than that of short-read sequencing. Here again, with advancements such as PacBio HiFi sequencing and improved base calling on the ONT platform, this disadvantage is now far less significant than it was even a few years ago.
Another reason why long-read sequencing platforms are attractive is that they offer the ability to directly read the information of epigenetic base modifications – the so-called “second genome” – that are stored in DNA. Short-read instruments generally require additional specialized library preparation methods to be able to access this type of information that is increasingly recognized as an important aspect of biology.
These significant improvements in long-read sequencing technologies have created the same challenge that started to emerge in short-read sequencing over a decade ago: how to efficiently take advantage of instrument capability with improved, streamlined library preparation methods.
Library preparation for long-read sequencing instruments is uniquely challenging: along with the ability to sequence long reads comes the technical demand of reliably producing the high-quality, long DNA fragment substrates these very specialized instruments read. Generating long DNA fragments of sufficient quality has historically been time-consuming and laborious, often limiting sequencing throughput. Here, seqWell is again at the forefront of method development, using its core capabilities in transposase-based library construction to create a highly efficient, enzyme-only long-read library workflow, LongPlex, which reduces the need for expensive and cumbersome mechanical shearing and allows 100s of samples to be multiplexed quickly and easily.
The Best of Both Worlds: Mixing Short- and Long-Read NGS Data
For many projects, mixing short- and long-read data can be advantageous. Researchers can leverage the lower cost per base, high depth, and higher quality of short-read data to generate high-confidence SNP and mutation calls, then layer on long-read data to resolve complex structural variants and phase haplotypes. This of course requires more sophisticated analysis methods, but for de novo assembly or rare disease sequencing projects, using both short- and long-read sequencing technologies can prove highly beneficial, leading to a greater understanding of genetic variation.
Conclusion
The debate between short-read and long-read sequencing is ongoing, but it is clear that each technology has its own unique benefits. For the most comprehensive results, it is often best to combine the two. By doing so, you can get the most complete picture of your data while still taking advantage of the speed and affordability of short-read sequencing.