Converts a VCF/BCF file to Apache Parquet format for efficient storage and querying with tools like DuckDB, Spark, or Python pandas/polars.
Usage
vcf_to_parquet_arrow(
input_vcf,
output_parquet,
compression = "zstd",
row_group_size = 100000L,
streaming = FALSE,
threads = 1L,
index = NULL,
...
)Arguments
- input_vcf
Path to input VCF or BCF file
- output_parquet
Path for output Parquet file
- compression
Compression codec: "snappy", "gzip", "zstd", "lz4", "uncompressed"
- row_group_size
Number of rows per row group (default: 100000)
- streaming
Use streaming mode for large files. When TRUE, writes to a temporary Arrow IPC file first (via nanoarrow), then converts to Parquet via DuckDB. This avoids loading the entire VCF into R memory. Requires the DuckDB nanoarrow community extension. Default is FALSE.
- threads
Number of parallel threads for processing (default: 1). When threads > 1 and file is indexed, uses parallel processing by splitting work across chromosomes/contigs. Each thread processes different regions simultaneously. Requires indexed file. See
vcf_to_parquet_parallel_arrowfor details.- index
Optional explicit index file path
- ...
Additional arguments passed to vcf_open_arrow
Details
Processing Modes:
Standard mode (
streaming = FALSE, threads = 1): Loads entire VCF into memory as data.frame before writing. Fast for small-medium files.Streaming mode (
streaming = TRUE, threads = 1): Two-stage streaming via temporary Arrow IPC file. Minimal memory usage for large files.Parallel mode (
threads > 1): Requires indexed file. Splits work by chromosomes, processing multiple regions simultaneously. Near-linear speedup with thread count. Best for whole-genome VCFs.
Examples
if (FALSE) { # \dontrun{
# Standard mode (fast, loads into memory)
vcf_to_parquet_arrow("variants.vcf.gz", "variants.parquet")
# Streaming mode for large files (low memory)
vcf_to_parquet_arrow("huge.vcf.gz", "huge.parquet", streaming = TRUE)
# Parallel mode for whole-genome VCF (requires index)
vcf_to_parquet_arrow("wgs.vcf.gz", "wgs.parquet", threads = 8)
# Parallel + streaming for massive files
vcf_to_parquet_arrow("wgs.vcf.gz", "wgs.parquet", threads = 16, streaming = TRUE)
# With zstd compression
vcf_to_parquet_arrow("variants.vcf.gz", "variants.parquet", compression = "zstd")
# Query with DuckDB
library(duckdb)
con <- dbConnect(duckdb())
dbGetQuery(con, "SELECT CHROM, POS, REF FROM 'variants.parquet' WHERE CHROM = 'chr1'")
} # }