Creates a DuckDB connection with the VCF data loaded as a table or view. Supports in-memory or file-backed databases, tidy format output, parallel loading by chromosome, column selection, and optional Hive partitioning.
Usage
vcf_open_duckdb(
  file,
  extension_path,
  table_name = "variants",
  as_view = TRUE,
  dbdir = ":memory:",
  columns = NULL,
  region = NULL,
  tidy_format = FALSE,
  threads = 1L,
  partition_by = NULL,
  overwrite = FALSE,
  config = list()
)
Arguments
- file
Path to VCF, VCF.GZ, or BCF file
- extension_path
Path to the bcf_reader.duckdb_extension file.
- table_name
Name for the table/view (default: "variants")
- as_view
Logical, create a VIEW instead of materializing a TABLE (default: TRUE). Views are instant to create but queries re-read the VCF each time. Tables are slower to create but subsequent queries are fast.
- dbdir
Database location. The default ":memory:" creates an in-memory database. Use a file path for persistent storage (e.g., "variants.duckdb").
- columns
Optional character vector of columns to include. NULL (default) includes all columns.
- region
Optional genomic region filter (e.g., "chr1:1000-2000"). Requires an indexed VCF.
- tidy_format
Logical, if TRUE loads data in tidy (long) format with one row per variant-sample combination and a SAMPLE_ID column. Default FALSE.
- threads
Number of threads for parallel loading (default: 1). When greater than 1 and the VCF is indexed, the behavior depends on as_view:
For views (as_view = TRUE): creates a UNION ALL view of per-contig bcf_read() calls, which DuckDB parallelizes at query time.
For tables (as_view = FALSE): loads each chromosome in parallel, then unions the results into a single table.
- partition_by
Optional character vector of columns to partition by when creating a table (ignored for views). Creates a partitioned table for efficient filtering. Only supported for file-backed databases.
- overwrite
Logical, drop existing table/view if it exists (default: FALSE).
- config
Named list of DuckDB configuration options.
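To make the threads behaviour for views more concrete, the following is a minimal sketch of how a per-contig UNION ALL view statement could be assembled. It is not the package's implementation: build_union_view_sql() is a hypothetical helper, and the region named argument passed to bcf_read() is an assumption about the extension's table function, whose real signature may differ.

```r
# Hypothetical helper: builds SQL for a view that unions one bcf_read()
# call per contig. ASSUMPTION: bcf_read() accepts a `region` named
# argument; check the extension's documentation for the real signature.
build_union_view_sql <- function(file, contigs, view_name = "variants") {
  stopifnot(length(contigs) >= 1)

  # Quote string literals and identifiers, doubling embedded quotes
  quote_lit <- function(x) paste0("'", gsub("'", "''", x, fixed = TRUE), "'")
  quote_id <- function(x) paste0('"', gsub('"', '""', x, fixed = TRUE), '"')

  # One SELECT per contig; sprintf() is vectorised over `contigs`
  selects <- sprintf(
    "SELECT * FROM bcf_read(%s, region = %s)",
    quote_lit(file), quote_lit(contigs)
  )

  sprintf(
    "CREATE VIEW %s AS\n%s",
    quote_id(view_name),
    paste(selects, collapse = "\nUNION ALL\n")
  )
}

sql <- build_union_view_sql("wgs.vcf.gz", c("chr1", "chr2", "chr3"))
cat(sql, "\n")
```

Because each branch of the UNION ALL reads a different contig, the query engine is free to scan the branches concurrently, which is why this form only helps when the file is indexed.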
Value
A list with:
- con
DuckDB connection with extension loaded
- table
Name of the created table/view
- is_view
Logical indicating if a view was created
- file
Path to the source VCF file
- dbdir
Database directory
- tidy_format
Whether tidy format was used
- row_count
Number of rows (NULL for views)
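As a small illustration of working with the returned list, the sketch below summarizes a handle using only the fields documented above. describe_vcf_handle() is a hypothetical helper, not part of the package, and the mock list stands in for a real return value so the snippet runs without a VCF file or database.

```r
# Hypothetical helper (not part of the package): one-line summary of the
# list returned by vcf_open_duckdb(), using only its documented fields.
describe_vcf_handle <- function(vcf) {
  kind <- if (isTRUE(vcf$is_view)) "view" else "table"
  # row_count is documented as NULL for views, so guard before formatting
  rows <- if (is.null(vcf$row_count)) {
    "not counted"
  } else {
    format(vcf$row_count, big.mark = ",")
  }
  sprintf(
    "%s '%s' (source: %s, dbdir: %s, rows: %s)",
    kind, vcf$table, vcf$file, vcf$dbdir, rows
  )
}

# Mock handle with the documented fields; `con` is omitted because no
# database is opened in this sketch.
mock_view <- list(
  table = "variants", is_view = TRUE, file = "variants.vcf.gz",
  dbdir = ":memory:", tidy_format = FALSE, row_count = NULL
)
cat(describe_vcf_handle(mock_view), "\n")
```

With a real handle, the same fields can be used to build queries without hard-coding the table name, for example by interpolating vcf$table into the SQL passed to DBI::dbGetQuery().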
Examples
if (FALSE) { # \dontrun{
ext_path <- bcf_reader_build(tempdir())
# Open as lazy view (default - instant creation, re-reads VCF each query)
vcf <- vcf_open_duckdb("variants.vcf.gz", ext_path)
DBI::dbGetQuery(vcf$con, "SELECT * FROM variants WHERE CHROM = '22'")
vcf_close_duckdb(vcf)
# Parallel view (UNION ALL of per-contig reads, parallelized at query time)
vcf <- vcf_open_duckdb("wgs.vcf.gz", ext_path, threads = 8)
# Open as materialized table (slower to create, fast repeated queries)
vcf <- vcf_open_duckdb("variants.vcf.gz", ext_path, as_view = FALSE)
DBI::dbGetQuery(vcf$con, "SELECT COUNT(*) FROM variants")
# Tidy format with specific columns
vcf <- vcf_open_duckdb("cohort.vcf.gz", ext_path,
  tidy_format = TRUE,
  columns = c("CHROM", "POS", "REF", "ALT", "SAMPLE_ID", "FORMAT_GT")
)
# Parallel table loading for large files
vcf <- vcf_open_duckdb("wgs.vcf.gz", ext_path, as_view = FALSE, threads = 8)
# Persistent file-backed database
vcf <- vcf_open_duckdb("variants.vcf.gz", ext_path,
  dbdir = "my_variants.duckdb"
)
# Partitioned table for efficient sample queries
# (as_view = FALSE because partition_by is ignored for views)
vcf <- vcf_open_duckdb("cohort.vcf.gz", ext_path,
  as_view = FALSE,
  dbdir = "cohort.duckdb",
  tidy_format = TRUE,
  partition_by = "SAMPLE_ID"
)
} # }