Convert a VCF/BCF file to Parquet format for fast subsequent queries.
Usage
vcf_to_parquet_duckdb(
  input_file,
  output_file,
  extension_path = NULL,
  columns = NULL,
  region = NULL,
  compression = "zstd",
  row_group_size = 100000L,
  threads = 1L,
  tidy_format = FALSE,
  partition_by = NULL,
  include_metadata = TRUE,
  con = NULL
)

Arguments
- input_file
Path to input VCF, VCF.GZ, or BCF file
- output_file
Path to output Parquet file or directory (when using partition_by)
- extension_path
Path to the bcf_reader.duckdb_extension file.
- columns
Optional character vector of columns to include. NULL for all.
- region
Optional genomic region to export (requires index)
- compression
Parquet compression: "snappy", "zstd", "gzip", or "none"
- row_group_size
Number of rows per row group (default: 100000)
- threads
Number of parallel threads for processing (default: 1). When threads > 1 and the file is indexed, work is split across chromosomes/contigs and processed in parallel. See vcf_to_parquet_duckdb_parallel.
- tidy_format
Logical, if TRUE exports data in tidy (long) format with one row per variant-sample combination and a SAMPLE_ID column. Default FALSE.
- partition_by
Optional character vector of columns to partition by (Hive-style). Creates a directory structure like output_dir/SAMPLE_ID=HG00098/data_0.parquet. Particularly useful with tidy_format = TRUE to partition by SAMPLE_ID for efficient per-sample queries. DuckDB auto-generates Bloom filters for VARCHAR columns like SAMPLE_ID, enabling fast row-group pruning; see the query sketch after the examples.
- include_metadata
Logical, if TRUE embeds the full VCF header as Parquet key-value metadata. Default TRUE. This preserves all VCF schema information (INFO, FORMAT, FILTER definitions, contigs, samples), enabling a round trip back to VCF format. Use parquet_kv_metadata to read the header back. Note: not supported with partition_by (a Parquet limitation for partitioned writes).
- con
Optional existing DuckDB connection (with extension loaded).
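Once exported, the Parquet output can be queried directly with DuckDB, which reads only the requested columns and can skip row groups using min/max statistics. A minimal downstream sketch, assuming the DBI and duckdb R packages and a variants.parquet file produced as in the examples below (the coordinate range is purely illustrative):

# Hypothetical downstream query against a file written by vcf_to_parquet_duckdb()
library(DBI)
con <- dbConnect(duckdb::duckdb())
hits <- dbGetQuery(con, "
  SELECT CHROM, POS, REF, ALT
  FROM read_parquet('variants.parquet')
  WHERE CHROM = 'chr22' AND POS BETWEEN 16000000 AND 17000000
")
dbDisconnect(con, shutdown = TRUE)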
Examples
if (FALSE) { # \dontrun{
ext_path <- bcf_reader_build(tempdir())

# Export entire file with metadata
vcf_to_parquet_duckdb("variants.vcf.gz", "variants.parquet", ext_path)

# Read back the embedded metadata
parquet_kv_metadata("variants.parquet")

# Export specific columns
vcf_to_parquet_duckdb("variants.vcf.gz", "variants_slim.parquet", ext_path,
  columns = c("CHROM", "POS", "REF", "ALT", "INFO_AF")
)

# Export a region
vcf_to_parquet_duckdb("variants.vcf.gz", "chr22.parquet", ext_path,
  region = "chr22"
)

# Export in tidy format (one row per variant-sample)
vcf_to_parquet_duckdb("cohort.vcf.gz", "cohort_tidy.parquet", ext_path,
  tidy_format = TRUE
)

# Tidy format with Hive partitioning by SAMPLE_ID (efficient per-sample queries)
vcf_to_parquet_duckdb("cohort.vcf.gz", "cohort_partitioned/", ext_path,
  tidy_format = TRUE,
  partition_by = "SAMPLE_ID"
)

# Partition by both CHROM and SAMPLE_ID for large cohorts
vcf_to_parquet_duckdb("wgs_cohort.vcf.gz", "wgs_partitioned/", ext_path,
  tidy_format = TRUE,
  partition_by = c("CHROM", "SAMPLE_ID")
)

# Parallel mode for whole-genome VCF (requires index)
vcf_to_parquet_duckdb("wgs.vcf.gz", "wgs.parquet", ext_path, threads = 8)
} # }
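The Hive-partitioned, tidy export above enables efficient per-sample queries: with hive_partitioning enabled, only the SAMPLE_ID=HG00098 directory is read, and DuckDB's row-group statistics and Bloom filters prune the rest. A minimal sketch of this downstream query pattern (not part of vcf_to_parquet_duckdb itself), assuming the DBI and duckdb R packages and that the tidy output retains the core variant columns (CHROM, POS, REF, ALT):

# Hypothetical per-sample query over the cohort_partitioned/ output above
library(DBI)
con <- dbConnect(duckdb::duckdb())
hg00098 <- dbGetQuery(con, "
  SELECT CHROM, POS, REF, ALT, SAMPLE_ID
  FROM read_parquet('cohort_partitioned/**/*.parquet', hive_partitioning = true)
  WHERE SAMPLE_ID = 'HG00098'
")
dbDisconnect(con, shutdown = TRUE)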