Converts VCF/BCF to Parquet using the fast bcf_reader extension, then
registers the Parquet file in a DuckLake catalog table.
Usage
ducklake_load_vcf(
  con,
  table,
  vcf_path,
  extension_path,
  output_path = NULL,
  threads = parallel::detectCores(),
  compression = "zstd",
  row_group_size = 100000L,
  region = NULL,
  columns = NULL,
  overwrite = FALSE,
  allow_evolution = FALSE,
  tidy_format = FALSE,
  partition_by = NULL
)
Arguments
- con
DuckDB connection with DuckLake attached.
- table
Target table name (optionally qualified, e.g., "lake.variants").
- vcf_path
Path/URI to VCF/BCF file.
- extension_path
Path to bcf_reader.duckdb_extension (required).
- output_path
Optional Parquet output path. If NULL, uses DuckLake's DATA_PATH.
- threads
Number of threads for conversion.
- compression
Parquet compression codec.
- row_group_size
Parquet row group size.
- region
Optional region filter (e.g., "chr1:1000-2000").
- columns
Optional character vector of columns to include.
- overwrite
Logical, drop existing table first.
- allow_evolution
Logical; if TRUE, new columns found in the VCF are added to the table via ALTER TABLE before insertion, making all columns queryable. Default: FALSE. Useful for combining VCFs with different annotations (e.g., VEP columns) or different samples (FORMAT_*_SampleName).
- tidy_format
Logical; if TRUE, data are exported in tidy (long) format with one row per variant-sample combination and a SAMPLE_ID column. Default: FALSE. Ideal for cohort analysis and for combining multiple single-sample VCFs.
- partition_by
Optional character vector of columns to partition by (Hive-style). Creates a directory structure like
output_dir/SAMPLE_ID=HG00098/data_0.parquet. Note: DuckLake registration currently requires single Parquet files; when using partition_by, output_path should point to the partition directory and the resulting files must be registered separately.
Details
This is the recommended function for loading VCF data into DuckLake.
It uses the bcf_reader DuckDB extension for fast VCF→Parquet conversion,
which is significantly faster than the nanoarrow streaming path.
Workflow:
1. VCF → Parquet via vcf_to_parquet_duckdb() (bcf_reader)
2. Register the Parquet file in the DuckLake catalog
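The two stages can also be run by hand. Below is a minimal sketch: the argument names passed to vcf_to_parquet_duckdb() are assumptions (see its own help page for the exact signature), and the ducklake_add_data_files() call uses DuckLake's file-registration syntax, which you should verify against your DuckLake version:
# Stage 1: convert VCF to Parquet (argument names assumed; check ?vcf_to_parquet_duckdb)
parquet_path <- file.path(tempdir(), "sample1.parquet")
vcf_to_parquet_duckdb(con, "sample1.vcf.gz", parquet_path, extension_path = ext_path)
# Stage 2: register the staged Parquet file in the DuckLake catalog
DBI::dbExecute(con, sprintf(
  "CALL ducklake_add_data_files('lake', 'variants', '%s')", parquet_path
))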
Schema Evolution (allow_evolution = TRUE):
When loading multiple VCFs with different schemas (e.g., different samples
or different annotation fields), enable allow_evolution to automatically
add new columns to the table schema. This uses DuckLake's ALTER TABLE ADD COLUMN,
which preserves existing data files without rewriting them.
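Conceptually, the evolution step issues one ALTER TABLE per missing column before inserting the new rows; a sketch with a hypothetical VEP annotation column:
# INFO_CSQ is a hypothetical column contributed by a VEP-annotated VCF.
# Existing Parquet files in the lake are left untouched; rows loaded
# before the column existed read it back as NULL.
DBI::dbExecute(con, "ALTER TABLE variants ADD COLUMN INFO_CSQ VARCHAR")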
Tidy Format (tidy_format = TRUE):
When building cohort tables from multiple single-sample VCFs, use tidy_format = TRUE
to get one row per variant-sample combination with a SAMPLE_ID column. This format
is ideal for downstream analysis and MERGE/UPSERT operations on DuckLake tables.
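For example, per-sample summaries reduce to a plain GROUP BY on the tidy table (using the variants_tidy table from the Examples below):
# One row per variant-sample pair, so aggregation by SAMPLE_ID is direct
DBI::dbGetQuery(con, "
  SELECT SAMPLE_ID, COUNT(*) AS n_variants
  FROM variants_tidy
  GROUP BY SAMPLE_ID
  ORDER BY n_variants DESC
")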
Partitioning (partition_by):
When using partition_by, the output is a Hive-partitioned directory structure.
This is useful for large cohorts where you want efficient per-sample queries.
DuckDB auto-generates Bloom filters for VARCHAR columns like SAMPLE_ID.
Note: For DuckLake, partitioned output requires manual file registration.
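A partitioned directory can still be queried directly through DuckDB's Hive-partition reader; a sketch assuming the output_dir layout shown under partition_by:
# Glob across partitions; hive_partitioning = true recovers SAMPLE_ID from
# the directory names, and the WHERE clause prunes untouched partitions.
DBI::dbGetQuery(con, "
  SELECT *
  FROM read_parquet('output_dir/*/*.parquet', hive_partitioning = true)
  WHERE SAMPLE_ID = 'HG00098'
")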
Examples
if (FALSE) { # \dontrun{
# Build extension
ext_path <- bcf_reader_build(tempdir())
# Setup DuckLake
con <- duckdb::dbConnect(duckdb::duckdb())
ducklake_load(con)
ducklake_attach(con, "catalog.ducklake", "/data/parquet/", alias = "lake")
DBI::dbExecute(con, "USE lake")
# Load first VCF
ducklake_load_vcf(con, "variants", "sample1.vcf.gz", ext_path, threads = 8)
# Load second VCF with different annotations, evolving schema
ducklake_load_vcf(con, "variants", "sample2_vep.vcf.gz", ext_path,
allow_evolution = TRUE
)
# Load VCF in tidy format (one row per variant-sample)
ducklake_load_vcf(con, "variants_tidy", "cohort.vcf.gz", ext_path,
tidy_format = TRUE
)
# Query - all columns from both VCFs are available
DBI::dbGetQuery(con, "SELECT CHROM, COUNT(*) FROM variants GROUP BY CHROM")
} # }