Processes a VCF/BCF file in parallel by splitting the work across chromosomes/contigs. Requires an indexed input file. Each worker processes a different chromosome, and the per-contig results are then merged into a single Parquet file.
Usage
vcf_to_parquet_parallel_arrow(
input_vcf,
output_parquet,
threads = parallel::detectCores(),
compression = "zstd",
row_group_size = 100000L,
streaming = FALSE,
index = NULL,
...
)
Arguments
- input_vcf
Path to input VCF/BCF file (must be indexed)
- output_parquet
Path for output Parquet file
- threads
Number of parallel threads (default: auto-detect)
- compression
Compression codec for the output Parquet file (default: "zstd")
- row_group_size
Number of rows per Parquet row group (default: 100000)
- streaming
Whether to use streaming mode (default: FALSE)
- index
Optional explicit index path
- ...
Additional arguments passed to vcf_open_arrow
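For example, a minimal call might look like the following (file paths are placeholders; the input is assumed to be bgzipped and indexed):

# Convert an indexed VCF to Parquet using four worker processes
vcf_to_parquet_parallel_arrow(
  input_vcf      = "cohort.vcf.gz",     # must be indexed, e.g. cohort.vcf.gz.tbi or .csi
  output_parquet = "cohort.parquet",
  threads        = 4,
  compression    = "zstd",
  row_group_size = 100000L
)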
Details
This function:
1. Checks for an index (required for parallel processing)
2. Extracts contig names from the file header
3. Processes each contig in parallel using multiple R processes
4. Writes each contig to a temporary Parquet file
5. Merges all temporary files into the final output using DuckDB
Contigs that return no variants are skipped automatically.
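The split/merge strategy behind these steps can be pictured with the sketch below. It is illustrative only, not the package's internal code: read_contig_as_table() is a hypothetical stand-in for the per-contig reader built on vcf_open_arrow, while the parallel, arrow, and DuckDB calls are real APIs.

library(parallel)
library(arrow)
library(DBI)
library(duckdb)

parallel_contigs_to_parquet <- function(contigs, output_parquet, threads = 2L) {
  tmp_dir <- tempfile("vcf2parquet_")
  dir.create(tmp_dir)

  # One temporary Parquet file per contig, written by forked R processes
  # (mclapply forks; on Windows a PSOCK cluster would be needed instead)
  tmp_files <- mclapply(contigs, function(ctg) {
    tbl <- read_contig_as_table(ctg)   # hypothetical reader; returns a data.frame or NULL
    if (is.null(tbl) || nrow(tbl) == 0L) return(NULL)   # empty contigs are skipped
    out <- file.path(tmp_dir, paste0(ctg, ".parquet"))
    arrow::write_parquet(tbl, out, compression = "zstd")
    out
  }, mc.cores = threads)
  tmp_files <- unlist(Filter(Negate(is.null), tmp_files))
  if (length(tmp_files) == 0L) stop("No variants found in any contig")

  # Merge all per-contig files into the final output with DuckDB
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)
  dbExecute(con, sprintf(
    "COPY (SELECT * FROM read_parquet('%s/*.parquet')) TO '%s' (FORMAT PARQUET, COMPRESSION ZSTD)",
    tmp_dir, output_parquet
  ))

  unlink(tmp_dir, recursive = TRUE)
  invisible(output_parquet)
}

In the actual function the per-contig reader is configured through vcf_open_arrow and the extra ... arguments; the sketch only illustrates the fan-out over contigs and the DuckDB merge.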