Processes a VCF/BCF file in parallel by splitting the work across chromosomes/contigs. Requires an indexed input file. Each worker processes a different chromosome, and the per-contig results are then merged into a single Parquet file.
Usage
vcf_to_parquet_parallel_arrow(
input_vcf,
output_parquet,
threads = parallel::detectCores(),
compression = "zstd",
row_group_size = 100000L,
streaming = FALSE,
index = NULL,
...
)
Arguments
- input_vcf
Path to input VCF/BCF file (must be indexed)
- output_parquet
Path for output Parquet file
- threads
Number of parallel threads (default: auto-detect)
- compression
Compression codec for the output Parquet file (default: "zstd")
- row_group_size
Number of rows per Parquet row group (default: 100000)
- streaming
Whether to use streaming mode (default: FALSE)
- index
Optional explicit index path
- ...
Additional arguments passed to vcf_open_arrow
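For example, a minimal call might look like the following (file paths are placeholders; the input is assumed to be bgzipped and indexed):

# Convert an indexed VCF to Parquet using four worker processes
vcf_to_parquet_parallel_arrow(
  input_vcf      = "cohort.vcf.gz",     # must be indexed, e.g. cohort.vcf.gz.tbi or .csi
  output_parquet = "cohort.parquet",
  threads        = 4,
  compression    = "zstd",
  row_group_size = 100000L
)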
Details
This function:
1. Checks for an index (required for parallel processing)
2. Extracts contig names from the file header
3. Processes each contig in parallel using multiple R processes
4. Writes each contig to a temporary Parquet file
5. Merges all temporary files into the final output using DuckDB
Contigs that return no variants are skipped automatically.
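The split/merge strategy behind these steps can be pictured with the sketch below. It is illustrative only, not the package's internal code: read_contig_as_table() is a hypothetical stand-in for the per-contig reader built on vcf_open_arrow, while the parallel, arrow, and DuckDB calls are real APIs.

library(parallel)
library(arrow)
library(DBI)
library(duckdb)

parallel_contigs_to_parquet <- function(contigs, output_parquet, threads = 2L) {
  tmp_dir <- tempfile("vcf2parquet_")
  dir.create(tmp_dir)

  # One temporary Parquet file per contig, written by forked R processes
  # (mclapply forks; on Windows a PSOCK cluster would be needed instead)
  tmp_files <- mclapply(contigs, function(ctg) {
    tbl <- read_contig_as_table(ctg)   # hypothetical reader; returns a data.frame or NULL
    if (is.null(tbl) || nrow(tbl) == 0L) return(NULL)   # empty contigs are skipped
    out <- file.path(tmp_dir, paste0(ctg, ".parquet"))
    arrow::write_parquet(tbl, out, compression = "zstd")
    out
  }, mc.cores = threads)
  tmp_files <- unlist(Filter(Negate(is.null), tmp_files))
  if (length(tmp_files) == 0L) stop("No variants found in any contig")

  # Merge all per-contig files into the final output with DuckDB
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)
  dbExecute(con, sprintf(
    "COPY (SELECT * FROM read_parquet('%s/*.parquet')) TO '%s' (FORMAT PARQUET, COMPRESSION ZSTD)",
    tmp_dir, output_parquet
  ))

  unlink(tmp_dir, recursive = TRUE)
  invisible(output_parquet)
}

In the actual function the per-contig reader is configured through vcf_open_arrow and the extra ... arguments; the sketch only illustrates the fan-out over contigs and the DuckDB merge.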