The Long and the Short of DNA Sequencing

Source: Patricia Dimond, Genetic Engineering and Biotechnology News (11/20/12)

With advances in second-generation sequencing technologies, genome studies have produced an explosion of sequence data at a fraction of earlier costs.

The lowest-cost technology can now generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consists of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences require de novo assembly before most genome analyses can begin. Scientists say that genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information.
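As a rough illustration of what "deep coverage" means here, average depth is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes hypothetical round numbers (one billion 100-nt reads, a 3 Gbp mammalian genome); they are not figures from any project described in this article.

```python
def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


# Hypothetical inputs chosen only for illustration.
depth = average_coverage(
    num_reads=1_000_000_000,    # one billion reads
    read_length=100,            # 100 nt each, within the 50-150 nt range above
    genome_size=3_000_000_000,  # roughly mammalian-sized genome
)
print(f"approximate average coverage: {depth:.1f}x")
```

This estimate says nothing about how evenly the reads are spread; repeats and sequencing bias can leave some regions far below the average.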

Short sequences, researchers report, must be mapped to unique positions in reference genomes. That mapping is complicated because reads often have sequencing errors, the reference genome has repetitive elements, and the orientation of a read relative to the reference genome is not known.

According to Michael Schatz, Ph.D., of Cold Spring Harbor Laboratory (CSHL), "Short-read sequencing is excellent for producing high-quality deep coverage of small to large genomes." However, he said, "The short read length limits its capability to resolve complex regions with repetitive or heterozygous sequences." As a result, important biological sequences like genes or promoter regions are often highly fragmented in short-read assemblies. "The short read length also makes other computations like sequencing entire RNA transcripts or entire 16S rRNA gene sequences in metagenomics projects difficult or impossible."

Long Sequences

One solution to the short-read problem is to produce longer DNA sequences. Third-generation sequencers, which can directly read a single DNA molecule, reportedly provide a clearer view of genomic organization and content. Although these instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly, error rates of single-molecule reads are high, approaching 15%.

Novel approaches to getting to more accurate genomes may include combining short- and long-read sequences. Pacific Biosciences' PacBio® RS High-Resolution Genetic Analyzer, a third-generation sequencer, uses molecular sequencing techniques and advanced analytics, according to the company. The sequencer produces much longer reads than other technologies, up to 100 times longer, thus reportedly providing a more complete picture of genome structure than second-generation technology.

However, the longer sequences, while potentially useful in solving assembly and finishing problems, are single-pass reads in which every eighth or ninth base is incorrect.

To get around the incorrect-base problem, Dr. Schatz, an assistant professor at Cold Spring Harbor Laboratory, and colleagues at the National Biodefense Analysis and Countermeasures Center and the University of Maryland worked with PacBio to develop a correction algorithm for the longer sequences generated by third-generation sequencers, along with an assembly strategy that uses short, high-fidelity sequences to correct the errors in single-molecule sequences. They dubbed the approach "hybrid error correction."
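The general idea of using accurate short reads to fix a noisy long read can be sketched with a toy majority vote. This is not the researchers' algorithm: it assumes the short reads have already been aligned to the long read at known offsets, it handles substitutions only (no insertions or deletions), and the function name `correct_long_read` and the example sequences are hypothetical.

```python
from collections import Counter


def correct_long_read(long_read, aligned_short_reads):
    """Toy substitution-only correction by per-position majority vote.

    aligned_short_reads is a list of (offset, sequence) pairs, assumed to be
    gap-free alignments of accurate short reads onto long_read.
    """
    votes = [Counter() for _ in long_read]
    for offset, seq in aligned_short_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    corrected = []
    for original_base, counter in zip(long_read, votes):
        if counter:
            # Use the base most short reads agree on at this position.
            corrected.append(counter.most_common(1)[0][0])
        else:
            # No short-read coverage here: keep the long-read base.
            corrected.append(original_base)
    return "".join(corrected)


# Hypothetical example: the long read has two wrong bases (positions 3 and 8).
noisy_long_read = "ACGAACGTTC"
short_reads = [(0, "ACGTA"), (2, "GTACG"), (5, "CGTAC")]
print(correct_long_read(noisy_long_read, short_reads))
```

A real pipeline also has to find those alignments in the first place and cope with insertions, deletions, and repeats, which is where most of the difficulty lies.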

Dr. Schatz explained, "The longer read lengths have fundamentally more information than the short reads: infinite coverage with short reads simply won't be enough for resolving really complex regions, but just a few long reads in the right spot can solve them. The same is true for phasing haplotypes in the presence of heterozygosity or identifying proper transcript isoforms in the presence of alternative splicing."

The scientists combined sequences generated by more conventional technology made by Illumina to help correct the mistakes in the single-molecule method. The result is "substantially better" than using Pacific Biosciences' technology alone, he said. "The data is basically perfect."

The scientists showed that the approach could successfully be used on reads generated by a PacBio RS instrument from phage, prokaryotic, and eukaryotic whole genomes, including the previously unassembled genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. The scientists reported that their long read correction achieved >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies. In the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies.

As highlighted in two additional papers using similar hybrid strategies, the long read lengths have made automated, single-contig bacterial chromosome assemblies a reality.

Assembly Algorithms

Jonas Korlach, Ph.D., Pacific Biosciences' CSO, told GEN that the company is working on getting around the need to combine sequences generated by both second- and third-generation sequencing platforms. "Going forward," he said, "we don't think this will remain the paradigm. We have had recent success with assembly algorithms that can take just these long reads from our machine and use a hierarchical assembly process to achieve a finished microbial genome. In a nutshell, the reads are already long and accurate enough to allow for de novo assembly from a single, long-insert DNA library."

Dr. Korlach also explained that the company has developed an improved consensus algorithm called Quiver (currently available on its software sharing site) that can achieve a significant reduction in consensus error rate, "yielding a final sequencing result that is over 99.999% accurate at 20x sequencing coverage."

GEN asked Dr. Schatz whether generation of longer sequences will eventually be able to address more complex genomes, such as human genomes.

"Absolutely," he said. "In the paper we used PacBio long reads to improve the de novo assembly of the 1.2 Gbp parrot genome, and we are currently sequencing several species of rice and worm. In the near future, we have plans to sequence the human genome and the wheat genome with long reads. I expect we will see much more of this as the throughput and read lengths from the instruments improve.

"In the last year, PacBio has improved both read length and throughput by a factor of 3 or 4, and their roadmap shows this trend should continue into the next year."

Earlier this month, PacBio announced that enhancements to its DNA sequencing system, its XL release featuring new chemistry and software, will allow long read lengths averaging 5,000 bases. The company said the new chemistry includes a faster polymerase that reads more bases per second. This release also includes the Stage Start feature, which produces longer reads by enabling sequence data collection to begin when the polymerase is activated. Additionally, PacBio said it has increased the length of time the instrument can record data during the sequencing reaction, also contributing to an increase in read lengths.

The CSHL scientists who were trying to assemble the complex rice genome said the new chemistry produced 9x coverage with long reads—50% of the data came from reads 4,800 base pairs or longer.
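A figure of the form "50% of the data came from reads of a given length or longer" is what is usually called a read-length N50. A minimal sketch of that calculation follows; the function name and the sample lengths are hypothetical and are not the rice data described above.

```python
def read_length_n50(lengths):
    """Return the length L such that reads of length >= L contain
    at least half of all sequenced bases (0 for an empty input)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases until half is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical read lengths in base pairs.
sample_lengths = [100, 200, 300, 400]
print(read_length_n50(sample_lengths))
```

Because the statistic weights reads by the bases they contribute, a small number of very long reads can raise it well above the median read length.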

"Adding the long reads from PacBio doubled the contig connectivity over the current state-of-the-art ALLPATHS-LG assembler and mate-pair recommendations," Dr. Schatz added.

For assembling and finishing more complex genomes, then, it appears that a combination of longer and shorter reads works better.

Patricia Dimond
Genetic Engineering and Biotechnology News
