Convert a VCF/BCF file to Parquet format for fast subsequent queries.
Usage
vcf_to_parquet_duckdb(
  input_file,
  output_file,
  extension_path = NULL,
  columns = NULL,
  region = NULL,
  compression = "zstd",
  row_group_size = 100000L,
  threads = 1L,
  tidy_format = FALSE,
  partition_by = NULL,
  include_metadata = TRUE,
  con = NULL
)

Arguments
- input_file
Path to input VCF, VCF.GZ, or BCF file
- output_file
Path to output Parquet file or directory (when using partition_by)
- extension_path
Path to the bcf_reader.duckdb_extension file.
- columns
Optional character vector of columns to include. NULL for all.
- region
Optional genomic region to export (requires index)
- compression
Parquet compression: "snappy", "zstd", "gzip", or "none"
- row_group_size
Number of rows per row group (default: 100000)
- threads
Number of parallel threads for processing (default: 1). When threads > 1 and the file is indexed, work is split across chromosomes/contigs and processed in parallel. See vcf_to_parquet_duckdb_parallel.
- tidy_format
Logical, if TRUE exports data in tidy (long) format with one row per variant-sample combination and a SAMPLE_ID column. Default FALSE.
- partition_by
Optional character vector of columns to partition by (Hive-style). Creates a directory structure like output_dir/SAMPLE_ID=HG00098/data_0.parquet. Particularly useful with tidy_format = TRUE to partition by SAMPLE_ID for efficient per-sample queries. DuckDB auto-generates Bloom filters for VARCHAR columns like SAMPLE_ID, enabling fast row-group pruning; see the query sketch after the examples.
- include_metadata
Logical, if TRUE embeds the full VCF header as Parquet key-value metadata. Default TRUE. This preserves all VCF schema information (INFO, FORMAT, FILTER definitions, contigs, samples), enabling a round trip back to VCF format. Use parquet_kv_metadata to read the header back. Note: not supported with partition_by (a Parquet limitation for partitioned writes).
- con
Optional existing DuckDB connection (with extension loaded).
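Once exported, the Parquet output can be queried directly with DuckDB, which reads only the requested columns and can skip row groups using min/max statistics. A minimal downstream sketch, assuming the DBI and duckdb R packages and a variants.parquet file produced as in the examples below (the coordinate range is purely illustrative):

# Hypothetical downstream query against a file written by vcf_to_parquet_duckdb()
library(DBI)
con <- dbConnect(duckdb::duckdb())
hits <- dbGetQuery(con, "
  SELECT CHROM, POS, REF, ALT
  FROM read_parquet('variants.parquet')
  WHERE CHROM = 'chr22' AND POS BETWEEN 16000000 AND 17000000
")
dbDisconnect(con, shutdown = TRUE)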
Examples
if (FALSE) { # \dontrun{
ext_path <- bcf_reader_build(tempdir())

# Export entire file with metadata
vcf_to_parquet_duckdb("variants.vcf.gz", "variants.parquet", ext_path)

# Read back the embedded metadata
parquet_kv_metadata("variants.parquet")

# Export specific columns
vcf_to_parquet_duckdb("variants.vcf.gz", "variants_slim.parquet", ext_path,
  columns = c("CHROM", "POS", "REF", "ALT", "INFO_AF")
)

# Export a region
vcf_to_parquet_duckdb("variants.vcf.gz", "chr22.parquet", ext_path,
  region = "chr22"
)

# Export in tidy format (one row per variant-sample)
vcf_to_parquet_duckdb("cohort.vcf.gz", "cohort_tidy.parquet", ext_path,
  tidy_format = TRUE
)

# Tidy format with Hive partitioning by SAMPLE_ID (efficient per-sample queries)
vcf_to_parquet_duckdb("cohort.vcf.gz", "cohort_partitioned/", ext_path,
  tidy_format = TRUE,
  partition_by = "SAMPLE_ID"
)

# Partition by both CHROM and SAMPLE_ID for large cohorts
vcf_to_parquet_duckdb("wgs_cohort.vcf.gz", "wgs_partitioned/", ext_path,
  tidy_format = TRUE,
  partition_by = c("CHROM", "SAMPLE_ID")
)

# Parallel mode for whole-genome VCF (requires index)
vcf_to_parquet_duckdb("wgs.vcf.gz", "wgs.parquet", ext_path, threads = 8)
} # }
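The Hive-partitioned, tidy export above enables efficient per-sample queries: with hive_partitioning enabled, only the SAMPLE_ID=HG00098 directory is read, and DuckDB's row-group statistics and Bloom filters prune the rest. A minimal sketch of this downstream query pattern (not part of vcf_to_parquet_duckdb itself), assuming the DBI and duckdb R packages and that the tidy output retains the core variant columns (CHROM, POS, REF, ALT):

# Hypothetical per-sample query over the cohort_partitioned/ output above
library(DBI)
con <- dbConnect(duckdb::duckdb())
hg00098 <- dbGetQuery(con, "
  SELECT CHROM, POS, REF, ALT, SAMPLE_ID
  FROM read_parquet('cohort_partitioned/**/*.parquet', hive_partitioning = true)
  WHERE SAMPLE_ID = 'HG00098'
")
dbDisconnect(con, shutdown = TRUE)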