Parallel VCF to Parquet conversion using DuckDB
Source: R/vcf_duckdb.R
vcf_to_parquet_duckdb_parallel.Rd
Processes a VCF/BCF file in parallel by splitting work across chromosomes/contigs using the DuckDB bcf_reader extension. Requires an indexed file. Each thread processes a different chromosome, and the results are then merged into a single Parquet file.
Usage
vcf_to_parquet_duckdb_parallel(
  input_file,
  output_file,
  extension_path = NULL,
  threads = parallel::detectCores(),
  compression = "zstd",
  row_group_size = 100000L,
  columns = NULL,
  tidy_format = FALSE,
  partition_by = NULL,
  con = NULL
)
Arguments
- input_file
Path to input VCF/BCF file (must be indexed)
- output_file
Path for output Parquet file
- extension_path
Path to the bcf_reader.duckdb_extension file.
- threads
Number of parallel workers (default: all cores detected by parallel::detectCores())
- compression
Parquet compression codec (default "zstd")
- row_group_size
Number of rows per Parquet row group (default 100,000)
- columns
Optional character vector of columns to include
- tidy_format
Logical; if TRUE, exports data in tidy (long) format. Default FALSE.
- partition_by
Optional character vector of columns to partition by (Hive-style). Creates directory structure like
output_dir/SAMPLE_ID=HG00098/data_0.parquet. Note: when using partition_by, each contig's data is partitioned separately and then merged into the final partitioned output.
- con
Optional existing DuckDB connection (with extension loaded).
Details
This function:
1. Checks for an index (required for parallel processing)
2. Extracts contig names from the header
3. Processes each contig in parallel using multiple R processes
4. Writes each contig to a temporary Parquet file
5. Merges all temporary files into the final output using DuckDB
Contigs that return no variants are skipped automatically. A sketch of this split-and-merge strategy is shown below.
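The following sketch mirrors these steps using parallel::mclapply() with per-worker DuckDB connections. It is illustrative rather than the package's implementation: the bcf_reader() table-function call and its region argument are assumptions about the extension's interface.

# Sketch only: the bcf_reader() signature is assumed, not taken from the package
library(parallel)

convert_by_contig <- function(input_file, output_file, contigs, ext_path,
                              threads = parallel::detectCores()) {
  tmp_dir <- tempfile("contig_parts_")
  dir.create(tmp_dir)
  # Each forked worker (use parallel::parLapply() on Windows) opens its own
  # DuckDB connection, loads the extension, and writes one contig's variants
  # to a temporary Parquet file.
  mclapply(contigs, mc.cores = threads, FUN = function(ctg) {
    con <- DBI::dbConnect(duckdb::duckdb())
    on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
    DBI::dbExecute(con, sprintf("LOAD '%s'", ext_path))
    DBI::dbExecute(con, sprintf(
      "COPY (SELECT * FROM bcf_reader('%s', region => '%s'))
       TO '%s' (FORMAT PARQUET, COMPRESSION ZSTD)",
      input_file, ctg, file.path(tmp_dir, paste0(ctg, ".parquet"))
    ))
  })
  # Merge all per-contig files into the final output in a single pass.
  con <- DBI::dbConnect(duckdb::duckdb())
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  DBI::dbExecute(con, sprintf(
    "COPY (SELECT * FROM read_parquet('%s')) TO '%s' (FORMAT PARQUET)",
    file.path(tmp_dir, "*.parquet"), output_file
  ))
}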
When partition_by is specified, the function creates a Hive-partitioned directory
structure. This is especially useful with tidy_format = TRUE and
partition_by = "SAMPLE_ID" for efficient per-sample queries on large cohorts.
DuckDB auto-generates Bloom filters for VARCHAR columns like SAMPLE_ID.
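For example, a per-sample query against such an output might look like the following (a minimal sketch; the path, columns, and sample ID are placeholders):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# Hive partition pruning reads only the SAMPLE_ID=HG00098 directory, and
# Bloom filters let DuckDB skip non-matching row groups in scanned files.
one_sample <- dbGetQuery(con, "
  SELECT CHROM, POS, REF, ALT
  FROM read_parquet('wgs_partitioned/**/*.parquet', hive_partitioning = true)
  WHERE SAMPLE_ID = 'HG00098'
")
dbDisconnect(con, shutdown = TRUE)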
See also
vcf_to_parquet_duckdb for single-threaded conversion
Examples
if (FALSE) { # \dontrun{
ext_path <- bcf_reader_build(tempdir())
# Use 8 threads
vcf_to_parquet_duckdb_parallel("wgs.vcf.gz", "wgs.parquet", ext_path, threads = 8)
# With specific columns
vcf_to_parquet_duckdb_parallel(
  "wgs.vcf.gz", "wgs.parquet", ext_path,
  threads = 16,
  columns = c("CHROM", "POS", "REF", "ALT")
)
# Tidy format output
vcf_to_parquet_duckdb_parallel("wgs.vcf.gz", "wgs_tidy.parquet", ext_path,
  threads = 8, tidy_format = TRUE
)
# Tidy format with Hive partitioning by SAMPLE_ID
vcf_to_parquet_duckdb_parallel("wgs_cohort.vcf.gz", "wgs_partitioned/", ext_path,
  threads = 8, tidy_format = TRUE, partition_by = "SAMPLE_ID"
)
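# Reuse an existing DuckDB connection across calls (sketch: the LOAD step
# is an assumption; `con` only needs the bcf_reader extension already loaded)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, sprintf("LOAD '%s'", ext_path))
vcf_to_parquet_duckdb_parallel("wgs.vcf.gz", "wgs.parquet", con = con)
DBI::dbDisconnect(con, shutdown = TRUE)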
} # }