HiFi Assembly Protocol: From Start to Finish
High-fidelity (HiFi) sequencing, developed by Pacific Biosciences (PacBio), produces reads that are both long (typically 15-20 kb) and highly accurate (at least 99% per read), which simplifies the assembly and analysis of complex genomes. Assemblies built from HiFi reads are generally more complete and accurate than those built from short reads or noisier long reads, making it possible to resolve structural variants and complex repeat regions and to phase haplotypes. This article gives an overview of the HiFi assembly process, from library preparation through sequencing, assembly, polishing, and validation, as a guide for researchers planning to use the technology.
I. Library Preparation: The Foundation of a Successful Assembly
The quality of the HiFi library is paramount for achieving a high-quality assembly. This stage involves several crucial steps:
- DNA Extraction and Quality Control: The process begins with extracting high-molecular-weight (HMW) DNA. Both the quality and the quantity of the DNA are critical. Gentle methods such as magnetic bead-based extraction are preferred because they minimize shearing. Pulsed-field gel electrophoresis (PFGE) or a similar technique is used to assess the size distribution and integrity of the extracted DNA. Contamination and degradation should be minimal.
- Shearing and Size Selection: HiFi sequencing benefits from long reads, but very long inserts leave the polymerase fewer passes over each molecule, which lowers consensus accuracy. Controlled shearing is therefore often performed to obtain the desired size distribution, typically in the 15-20 kb range. Size selection with an instrument such as BluePippin or SageELF then enriches for the target fragment size range, maximizing sequencing efficiency.
- SMRTbell Library Construction: The core of PacBio library preparation is the creation of SMRTbell libraries. This involves repairing DNA damage and fragment ends, ligating hairpin adapters (SMRTbell adapters) to both ends of the DNA fragments, and removing unligated adapters and incompletely ligated molecules. The resulting SMRTbell template is topologically circular, so the polymerase can traverse the insert repeatedly, and these repeated passes are what make HiFi reads possible.
- Quality Control of the SMRTbell Library: Before sequencing, the quantity and size distribution of the SMRTbell library are assessed. Fluorometric assays such as Qubit are used for quantification, while capillary electrophoresis instruments such as the Agilent Bioanalyzer are used to check the size distribution of the finished library.
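The two QC measurements above (mass concentration and mean fragment size) are usually combined into a molar concentration before a library is diluted for loading. The following is a minimal sketch of that conversion, not a vendor procedure. It assumes the common approximation of 660 g/mol per base pair of double-stranded DNA and ignores the small contribution of the hairpin adapters; the function name `library_molarity_nm` is hypothetical.

```python
def library_molarity_nm(conc_ng_per_ul: float,
                        mean_length_bp: float,
                        grams_per_mol_per_bp: float = 660.0) -> float:
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM).

    Assumption: average mass of 660 g/mol per base pair; adapter mass ignored.
    """
    if conc_ng_per_ul < 0:
        raise ValueError("concentration must be non-negative")
    if mean_length_bp <= 0:
        raise ValueError("mean fragment length must be positive")
    # 1 ng/uL = 1e-3 g/L; dividing by molar mass gives mol/L; 1e9 converts to nM.
    return conc_ng_per_ul * 1e6 / (grams_per_mol_per_bp * mean_length_bp)


if __name__ == "__main__":
    # Illustrative values only: a 50 ng/uL library with a 15 kb mean insert.
    print(f"{library_molarity_nm(50.0, 15_000):.2f} nM")
```

Because molarity scales inversely with fragment length, an error in the estimated mean size translates directly into a loading error, which is one reason the size distribution is checked alongside the Qubit reading.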
II. Sequencing: Generating HiFi Reads
The prepared SMRTbell libraries are then sequenced on the PacBio Sequel IIe or Revio systems. These platforms employ Single Molecule, Real-Time (SMRT) sequencing technology.
- SMRT Sequencing: A DNA polymerase bound to the SMRTbell template is immobilized at the bottom of a zero-mode waveguide (ZMW). As the polymerase incorporates fluorescently labeled nucleotides, the emitted light signals are detected in real time. Because the template is circular, the polymerase reads the same insert repeatedly, and each pass over the insert yields one subread.
- Circular Consensus Sequencing (CCS): The subreads from a single DNA molecule are aligned and combined to produce a highly accurate consensus sequence. Because the errors in individual passes are largely random, they cancel out in the consensus. A consensus read is called a HiFi read when its predicted accuracy is at least 99% (Q20), and many reads in a typical dataset are considerably more accurate than that threshold. The number of passes a molecule receives, and therefore the accuracy that can be reached, depends on the insert size.
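The relationship between accuracy and Phred-scaled quality, and between insert size and the number of passes, can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the pass estimate simply divides polymerase read length by insert length (ignoring adapters and partial passes), and the read names and accuracies in the example are invented.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Phred quality Q = -10 * log10(error rate)."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q: float) -> float:
    """Inverse of accuracy_to_phred."""
    return 1.0 - 10.0 ** (-q / 10.0)


def approx_passes(polymerase_read_bp: float, insert_bp: float) -> float:
    """Rough pass count; assumption: adapters and partial passes are ignored."""
    if insert_bp <= 0:
        raise ValueError("insert length must be positive")
    return polymerase_read_bp / insert_bp


def keep_hifi(reads, min_accuracy: float = 0.99):
    """Keep names of reads whose predicted accuracy meets the threshold."""
    return [name for name, accuracy in reads if accuracy >= min_accuracy]


if __name__ == "__main__":
    print(round(accuracy_to_phred(0.999), 2))   # 99.9% accuracy on the Phred scale
    print(approx_passes(150_000, 15_000))       # shorter inserts get more passes
    print(keep_hifi([("read_a", 0.9995), ("read_b", 0.97)]))
```

The threshold is compared in accuracy space rather than on the Phred scale to avoid floating-point surprises exactly at the Q20 boundary.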
III. Genome Assembly: Piecing Together the Puzzle
With the HiFi reads in hand, the genome assembly process can begin. Several long-read assemblers can be used, although they differ in how specifically they are optimized for HiFi data:
- Hierarchical Genome Assembly Process (HGAP): An older pipeline that follows an overlap-layout-consensus (OLC) strategy and was designed for noisier continuous long reads rather than HiFi data. It can still be run, but it is computationally intensive for larger genomes and has largely been superseded by HiFi-aware assemblers.
- Peregrine: A fast assembler designed for accurate long reads such as HiFi data. It finds overlaps by indexing reads with sparse hierarchical minimizers, which keeps run times short even for large genomes.
- HiCanu: A HiFi-specific mode of the Canu assembler, which follows the OLC paradigm. It compresses homopolymer runs before overlapping reads and is well suited to resolving near-identical repeats and separating haplotypes in complex genomes.
- Flye: Known for its speed and its ability to handle highly repetitive genomes, Flye uses a repeat graph approach and provides a dedicated input mode for HiFi reads. It is a versatile option for various genome types.
- Shasta: A fast assembler originally developed for Oxford Nanopore reads. It represents each read as a sequence of k-mer markers and assembles from a graph built on those markers rather than from a conventional de Bruijn graph. It scales to large genomes, but it keeps its data structures in memory and is not primarily tuned for HiFi data, so its suitability should be checked for the version and configuration in use.
The chosen assembler constructs an assembly graph by identifying overlaps between the HiFi reads and then traverses the graph to generate contiguous sequences (contigs). The goal is to obtain the fewest and longest contigs possible, ideally representing complete chromosomes.
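The overlap-then-traverse idea can be illustrated with a toy greedy assembler. This is a teaching sketch under strong assumptions, not how any of the assemblers above is implemented: it requires exact suffix-prefix matches, ignores reverse complements and sequencing errors, compares all read pairs instead of using an index, and merges the best pair greedily instead of building and simplifying a graph. The function names and the example reads are invented.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap: int = 3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_pair = 0, None
        for i, j in permutations(range(len(contigs)), 2):
            olen = overlap_length(contigs[i], contigs[j], min_overlap)
            if olen > best_len:
                best_len, best_pair = olen, (i, j)
        if best_pair is None:
            return contigs  # no remaining overlaps: these are the final contigs
        i, j = best_pair
        merged = contigs[i] + contigs[j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
    print(greedy_assemble(toy_reads))
```

Greedy merging can collapse repeats incorrectly, which is why production assemblers keep ambiguous overlaps in a graph and resolve them only where the reads support a single path.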
IV. Polishing: Refining the Assembly
While HiFi reads are highly accurate, residual errors can still remain after sequencing and assembly. Polishing corrects base-level errors in the contigs. With HiFi assemblies the gains are often small, so it is worth comparing assembly quality before and after polishing.
- Arrow: Developed by PacBio, Arrow computes a consensus from subreads aligned to the assembled contigs, so it needs the subread-level data rather than the HiFi reads alone. It targets small indels and base-calling errors and was designed primarily for assemblies built from noisier long reads.
- FreeBayes: A widely used variant caller, FreeBayes can be employed for polishing by aligning reads back to the assembled contigs, calling variants, and replacing positions where the reads consistently disagree with the assembly with the read-supported allele.
- Pilon: Another popular polishing tool, Pilon uses aligned Illumina short reads to correct errors and improve the consensus accuracy of the assembly. This hybrid approach leverages the high accuracy of short reads, but because short reads map ambiguously in repeats, it should be applied with care to avoid introducing errors there.
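Whichever tool proposes the corrections, the final step is conceptually the same: apply a set of small edits to the contig sequence. The sketch below illustrates only that step. It is not the algorithm of Arrow, FreeBayes, or Pilon; the `Correction` record and `apply_corrections` function are hypothetical, coordinates are assumed to be 0-based, and corrections are assumed not to overlap.

```python
from typing import Iterable, NamedTuple


class Correction(NamedTuple):
    pos: int  # 0-based start position on the unpolished contig
    ref: str  # bases currently present in the contig at that position
    alt: str  # replacement bases supported by the aligned reads


def apply_corrections(contig: str, corrections: Iterable[Correction]) -> str:
    """Return the contig with all corrections applied.

    Edits are applied from right to left so that indels do not shift the
    coordinates of corrections that have not been applied yet.
    """
    polished = contig
    for c in sorted(corrections, key=lambda c: c.pos, reverse=True):
        end = c.pos + len(c.ref)
        if polished[c.pos:end] != c.ref:
            raise ValueError(f"reference mismatch at position {c.pos}")
        polished = polished[:c.pos] + c.alt + polished[end:]
    return polished


if __name__ == "__main__":
    fixes = [
        Correction(pos=2, ref="G", alt="C"),   # substitution
        Correction(pos=4, ref="TT", alt="T"),  # one-base deletion
    ]
    print(apply_corrections("ACGTTTGCA", fixes))
```

Checking the `ref` bases before editing guards against applying corrections to the wrong contig or to a sequence that has already been polished.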
V. Validation and Assessment: Ensuring Assembly Quality
After polishing, the assembled genome undergoes rigorous validation to assess its quality and identify potential errors.
- Alignment to a Reference Genome (if available): If a closely related reference genome exists, the assembled genome can be aligned to it to identify structural variations, large insertions/deletions, and other discrepancies.
- BUSCO (Benchmarking Universal Single-Copy Orthologs): BUSCO assesses the completeness of the assembly by searching for a lineage-specific set of conserved single-copy orthologs and reporting how many are complete, duplicated, fragmented, or missing.
- QUAST (Quality Assessment Tool for Genome Assemblies): QUAST provides a comprehensive set of metrics to evaluate the assembly quality, including contig N50, L50, number of contigs, and total assembly length.
- Merqury: Compares the k-mers in the assembly with the k-mers in the sequencing reads to estimate completeness and consensus quality (QV) without a reference. Assembly k-mers that never occur in the reads point to likely errors.
- Manual Inspection and Curation: Visualizing the assembly using genome browsers like IGV or JBrowse allows for manual inspection and curation of potential misassemblies or problematic regions.
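Two of the metrics mentioned above are simple enough to compute by hand, which helps when interpreting tool reports. The sketch below computes N50 and L50 from a list of contig lengths, and a k-mer-based consensus quality estimate of the kind reported by tools such as Merqury. The function names and the example numbers are invented, and the QV formula assumes that errors are independent and that every assembly k-mer missing from the reads is caused by an error.

```python
import math


def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths.

    N50 is the length of the contig at which the running total, taken from
    longest to shortest, first reaches half of the assembly length; L50 is
    the number of contigs needed to get there.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


def kmer_qv(assembly_only_kmers: int, total_assembly_kmers: int, k: int) -> float:
    """Phred-scaled consensus quality from k-mer counts.

    `assembly_only_kmers` counts assembly k-mers that are absent from the reads.
    """
    if total_assembly_kmers <= 0 or k <= 0:
        raise ValueError("k-mer total and k must be positive")
    if assembly_only_kmers == 0:
        return math.inf  # no unsupported k-mers were observed
    shared_fraction = 1.0 - assembly_only_kmers / total_assembly_kmers
    per_base_error = 1.0 - shared_fraction ** (1.0 / k)
    return -10.0 * math.log10(per_base_error)


if __name__ == "__main__":
    print(n50_l50([100, 80, 60, 40, 20]))
    print(round(kmer_qv(100, 1_000_000, 21), 1))
```

N50 rewards contiguity but says nothing about correctness, so it is best read together with a k-mer or reference-based accuracy measure.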
VI. Advanced Analysis: Unlocking the Genomic Secrets
Once a high-quality assembly is obtained, various downstream analyses can be performed:
- Gene Annotation: Identifying and annotating genes within the assembled genome.
- Variant Calling: Detecting single nucleotide polymorphisms (SNPs), insertions/deletions (indels), and other variants.
- Structural Variation Analysis: Identifying larger-scale genomic rearrangements like inversions, translocations, and duplications.
- Haplotype Phasing: Resolving the two haplotypes of diploid organisms.
- Comparative Genomics: Comparing the assembled genome to other related genomes to identify evolutionary relationships and genomic differences.
VII. Conclusion
HiFi sequencing and assembly have dramatically advanced the field of genomics, enabling researchers to obtain complete and accurate genome assemblies for a wide range of organisms. The detailed workflow outlined in this article provides a comprehensive guide for researchers embarking on HiFi assembly projects. By carefully executing each step, from library preparation to validation and downstream analysis, researchers can harness the power of HiFi sequencing to unlock the full potential of genomic information and accelerate scientific discovery. As technology continues to evolve, HiFi sequencing promises to play an increasingly important role in understanding the complexities of genomes and their impact on biology, health, and disease.