RBCFTools provides R bindings to bcftools and htslib, the standard tools for reading and manipulating VCF/BCF files. The package bundles these libraries and command-line tools (bcftools, bgzip, tabix), so no external installation is required. When htslib is compiled with libcurl, remote file access from S3, GCS, and HTTP URLs is supported. The package also supports streaming VCF/BCF to Apache Arrow (IPC) format via nanoarrow, with export to Parquet using DuckDB via either the DuckDB nanoarrow extension or a bcf_reader extension bundled in this package.
Installation
You can install the development version of RBCFTools from r-universe on Unix-alikes (Windows is not supported):
install.packages(
'RBCFTools',
repos = c(
'https://rgenomicsetl.r-universe.dev',
'https://cloud.r-project.org')
)
Version Information
library(RBCFTools)
#> Loading required package: parallel
# Get library versions
bcftools_version()
#> [1] "1.23"
htslib_version()
#> [1] "1.23"Tool Paths
RBCFTools bundles bcftools and htslib command-line tools. Use the path functions to locate the executables
bcftools_path()
#> [1] "/usr/local/lib/R/site-library/RBCFTools/bcftools/bin/bcftools"
bgzip_path()
#> [1] "/usr/local/lib/R/site-library/RBCFTools/htslib/bin/bgzip"
tabix_path()
#> [1] "/usr/local/lib/R/site-library/RBCFTools/htslib/bin/tabix"
# List all available tools
bcftools_tools()
#> [1] "bcftools" "color-chrs.pl" "gff2gff" "gff2gff.py"
#> [5] "guess-ploidy.py" "plot-roh.py" "plot-vcfstats" "roh-viz"
#> [9] "run-roh.pl" "vcfutils.pl" "vrfs-variances"
htslib_tools()
#> [1] "annot-tsv" "bgzip" "htsfile" "ref-cache" "tabix"Capabilities
Check which features were compiled into htslib
# Get all capabilities as a named list
htslib_capabilities()
#> $configure
#> [1] TRUE
#>
#> $plugins
#> [1] TRUE
#>
#> $libcurl
#> [1] TRUE
#>
#> $s3
#> [1] TRUE
#>
#> $gcs
#> [1] TRUE
#>
#> $libdeflate
#> [1] TRUE
#>
#> $lzma
#> [1] TRUE
#>
#> $bzip2
#> [1] TRUE
#>
#> $htscodecs
#> [1] TRUE
# Human-readable feature string
htslib_feature_string()
#> [1] "build=configure libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes plugins=yes plugin-path=/usr/local/lib/R/site-library/RBCFTools/htslib/libexec/htslib: htscodecs=1.6.5"Feature Constants
Use HTS_FEATURE_* constants to check for specific features
# Check individual features
htslib_has_feature(HTS_FEATURE_LIBCURL) # Remote file access via libcurl
#> [1] TRUE
htslib_has_feature(HTS_FEATURE_S3)
#> [1] TRUE
htslib_has_feature(HTS_FEATURE_GCS) # Google Cloud Storage
#> [1] TRUE
htslib_has_feature(HTS_FEATURE_LIBDEFLATE)
#> [1] TRUE
htslib_has_feature(HTS_FEATURE_LZMA)
#> [1] TRUE
htslib_has_feature(HTS_FEATURE_BZIP2)
#> [1] TRUE
These are useful for conditionally enabling features in your code.
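For instance, a minimal sketch of gating remote access on compiled features (not run here; the messages are illustrative):
# Only attempt remote S3/HTTP access when the required features are compiled in
if (htslib_has_feature(HTS_FEATURE_LIBCURL) && htslib_has_feature(HTS_FEATURE_S3)) {
  message("libcurl and S3 support available: remote VCFs can be read directly")
} else {
  message("rebuild htslib with libcurl/S3 support to enable remote file access")
}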
Example Query of a Remote VCF from S3 using bcftools
With libcurl support, bcftools can query remote files directly. Here we count the records in the 1000 Genomes chromosome 22 cohort VCF on S3 by reading its index:
# Setup environment for remote file access (S3/GCS)
setup_hts_env()
# Build S3 URL for 1000 Genomes cohort VCF
s3_base <- "s3://1000genomes-dragen-v3.7.6/data/cohorts/"
s3_path <- "gvcf-genotyper-dragen-3.7.6/hg19/3202-samples-cohort/"
s3_vcf_file <- "3202_samples_cohort_gg_chr22.vcf.gz"
s3_vcf_uri <- paste0(s3_base, s3_path, s3_vcf_file)
# Use processx instead of system2
res <- processx::run(
bcftools_path(),
c("index", "-n", s3_vcf_uri),
error_on_status = FALSE,
echo = FALSE
)
res$stdout
#> [1] "1910766\n"
VCF to Arrow Streams and the DuckDB bcf_reader extension
RBCFTools provides streaming VCF/BCF to Apache Arrow stream conversion via nanoarrow. This enables integration with tools like DuckDB and conversion to Parquet when serializing to Arrow IPC. The bundled bcf_reader DuckDB extension is also supported for reading VCF/BCF directly into DuckDB.
The nanoarrow stream conversion and bcf_reader perform VCF spec conformance checks on headers (similar to htslib’s bcf_hdr_check_sanity()) and emit R warnings or DuckDB logs when correcting non-conformant fields.
Read VCF as Arrow Stream
Open a BCF as an Arrow array stream and read the batches:
bcf_file <- system.file("extdata", "1000G_3samples.bcf", package = "RBCFTools")
stream <- vcf_open_arrow(bcf_file, batch_size = 100L)
batch <- stream$get_next()
#> Warning in x$get_schema(): FORMAT/AD should be declared as Number=R per VCF
#> spec; correcting schema
#> Warning in x$get_schema(): FORMAT/GQ should be Type=Integer per VCF spec, but
#> header declares Type=Float; using header type
#> Warning in x$get_schema(): FORMAT/GT should be declared as Number=1 per VCF
#> spec; correcting schema
#> Warning in x$get_schema(): FORMAT/AD should be declared as Number=R per VCF
#> spec; correcting schema
#> Warning in x$get_schema(): FORMAT/GQ should be Type=Integer per VCF spec, but
#> header declares Type=Float; using header type
#> Warning in x$get_schema(): FORMAT/GT should be declared as Number=1 per VCF
#> spec; correcting schema
head(nanoarrow::convert_array(batch))
#> CHROM POS ID REF ALT QUAL FILTER INFO.DP INFO.AF INFO.CB
#> 1 1 10583 rs58108140 G A NA PASS 1557 0.162 UM,BI,BC,NCBI
#> 2 1 11508 <NA> A G NA PASS 23 0.72 BI,BC
#> 3 1 11565 <NA> G T NA PASS 45 0 BI,BC
#> 4 1 13116 <NA> T G NA PASS 2400 0.016 UM,BI
#> 5 1 13327 <NA> G C NA PASS 2798 0.01 BI,BC
#> 6 1 14699 <NA> C G NA PASS 408 0.02 BI,BC
#> INFO.EUR_R2 INFO.AFR_R2 INFO.ASN_R2 INFO.AC INFO.AN samples.HG00098.AD
#> 1 0.248 NA NA 0 6 NULL
#> 2 0.001 NA NA 6 6 NULL
#> 3 NA 0.003 NA 0 6 NULL
#> 4 0.106 0.286 NA 0 6 NULL
#> 5 0.396 0.481 NA 0 6 NULL
#> 6 0.061 0.184 NA 0 6 NULL
#> samples.HG00098.DP samples.HG00098.GL samples.HG00098.GQ samples.HG00098.GT
#> 1 NA NULL 3.28 0|0
#> 2 NA NULL 2.22 1|1
#> 3 NA NULL 1.48 0|0
#> 4 NA NULL 9.17 0|0
#> 5 NA NULL 15.80 0|0
#> 6 NA NULL 6.29 0|0
#> samples.HG00098.GD samples.HG00098.OG samples.HG00100.AD samples.HG00100.DP
#> 1 NA ./. NULL NA
#> 2 NA ./. NULL NA
#> 3 NA ./. NULL NA
#> 4 NA ./. NULL NA
#> 5 NA ./. NULL NA
#> 6 NA ./. NULL NA
#> samples.HG00100.GL samples.HG00100.GQ samples.HG00100.GT samples.HG00100.GD
#> 1 NULL 3.28 0|0 NA
#> 2 NULL 2.22 1|1 NA
#> 3 NULL 1.48 0|0 NA
#> 4 NULL 9.17 0|0 NA
#> 5 NULL 15.80 0|0 NA
#> 6 NULL 6.29 0|0 NA
#> samples.HG00100.OG samples.HG00106.AD samples.HG00106.DP samples.HG00106.GL
#> 1 ./. NULL NA NULL
#> 2 ./. NULL NA NULL
#> 3 ./. NULL NA NULL
#> 4 ./. NULL NA NULL
#> 5 ./. 2, 0 2 0.00, -0.60, -8.78
#> 6 ./. NULL NA NULL
#> samples.HG00106.GQ samples.HG00106.GT samples.HG00106.GD samples.HG00106.OG
#> 1 3.28 0|0 NA ./.
#> 2 2.22 1|1 NA ./.
#> 3 1.48 0|0 NA ./.
#> 4 9.17 0|0 NA ./.
#> 5 21.74 0|0 NA ./.
#> 6 6.29 0|0 NA ./.
stream$release()
Convert to Data Frame
Convert the entire BCF to a data.frame:
options(warn = -1) # Suppress warnings
df <- vcf_to_arrow(bcf_file, as = "data.frame")
df[, c("CHROM", "POS", "REF", "ALT", "QUAL")] |> head()
#> CHROM POS REF ALT QUAL
#> 1 1 10583 G A NA
#> 2 1 11508 A G NA
#> 3 1 11565 G T NA
#> 4 1 13116 T G NA
#> 5 1 13327 G C NA
#> 6 1 14699 C G NA
Write to Arrow IPC
Write to the Arrow IPC (.arrows) format for interoperability with other Arrow tools, using nanoarrow’s native streaming writer:
# Convert BCF to Arrow IPC
ipc_file <- tempfile(fileext = ".arrows")
vcf_to_arrow_ipc(bcf_file, ipc_file)
# Read back with nanoarrow
ipc_data <- as.data.frame(nanoarrow::read_nanoarrow(ipc_file))
ipc_data[, c("CHROM", "POS", "REF", "ALT")] |> head()
#> CHROM POS REF ALT
#> 1 1 10583 G A
#> 2 1 11508 A G
#> 3 1 11565 G T
#> 4 1 13116 T G
#> 5 1 13327 G C
#> 6 1 14699 C G
Write VCF Streams to Parquet
Use DuckDB to convert a BCF to a Parquet file and run queries on it. This path first converts the VCF stream to a data.frame:
parquet_file <- tempfile(fileext = ".parquet")
vcf_to_parquet_arrow(bcf_file, parquet_file, compression = "snappy")
#> Wrote 11 rows to /tmp/RtmpUoEUOx/file5131f2e9524f8.parquet
con <- duckdb::dbConnect(duckdb::duckdb())
pq_bcf <- DBI::dbGetQuery(con, sprintf("SELECT * FROM '%s' LIMIT 100", parquet_file))
pq_me <- DBI::dbGetQuery(
con,
sprintf("SELECT * FROM parquet_metadata('%s')",
parquet_file))
duckdb::dbDisconnect(con, shutdown = TRUE)
pq_bcf[, c("CHROM", "POS", "REF", "ALT")] |>
head()
#> CHROM POS REF ALT
#> 1 1 10583 G A
#> 2 1 11508 A G
#> 3 1 11565 G T
#> 4 1 13116 T G
#> 5 1 13327 G C
#> 6 1 14699 C G
pq_me |> head()
#> file_name row_group_id row_group_num_rows
#> 1 /tmp/RtmpUoEUOx/file5131f2e9524f8.parquet 0 11
#> 2 /tmp/RtmpUoEUOx/file5131f2e9524f8.parquet 0 11
#> 3 /tmp/RtmpUoEUOx/file5131f2e9524f8.parquet 0 11
#> 4 /tmp/RtmpUoEUOx/file5131f2e9524f8.parquet 0 11
#> 5 /tmp/RtmpUoEUOx/file5131f2e9524f8.parquet 0 11
#> 6 /tmp/RtmpUoEUOx/file5131f2e9524f8.parquet 0 11
#> row_group_num_columns row_group_bytes column_id file_offset num_values
#> 1 36 3135 0 0 11
#> 2 36 3135 1 0 11
#> 3 36 3135 2 0 11
#> 4 36 3135 3 0 11
#> 5 36 3135 4 0 11
#> 6 36 3135 5 0 11
#> path_in_schema type stats_min stats_max stats_null_count
#> 1 CHROM BYTE_ARRAY 1 1 0
#> 2 POS DOUBLE 10583.0 28376.0 0
#> 3 ID BYTE_ARRAY rs58108140 rs58108140 10
#> 4 REF BYTE_ARRAY A T 0
#> 5 ALT, list, element BYTE_ARRAY A T NA
#> 6 QUAL DOUBLE <NA> <NA> 11
#> stats_distinct_count stats_min_value stats_max_value compression
#> 1 1 1 1 SNAPPY
#> 2 NA 10583.0 28376.0 SNAPPY
#> 3 1 rs58108140 rs58108140 SNAPPY
#> 4 NA A T SNAPPY
#> 5 NA A T SNAPPY
#> 6 NA <NA> <NA> SNAPPY
#> encodings index_page_offset dictionary_page_offset data_page_offset
#> 1 PLAIN_DICTIONARY NA 4 24
#> 2 PLAIN NA NA 52
#> 3 PLAIN_DICTIONARY NA 146 175
#> 4 PLAIN NA NA 208
#> 5 PLAIN NA NA 268
#> 6 PLAIN NA NA 335
#> total_compressed_size total_uncompressed_size key_value_metadata
#> 1 48 44 NULL
#> 2 94 113 NULL
#> 3 62 84 NULL
#> 4 60 78 NULL
#> 5 67 85 NULL
#> 6 25 23 NULL
#> bloom_filter_offset bloom_filter_length min_is_exact max_is_exact
#> 1 2261 47 TRUE TRUE
#> 2 NA NA TRUE TRUE
#> 3 2308 47 TRUE TRUE
#> 4 NA NA TRUE TRUE
#> 5 NA NA TRUE TRUE
#> 6 NA NA NA NA
#> row_group_compressed_bytes geo_bbox.xmin geo_bbox.xmax geo_bbox.ymin
#> 1 1 NA NA NA
#> 2 1 NA NA NA
#> 3 1 NA NA NA
#> 4 1 NA NA NA
#> 5 1 NA NA NA
#> 6 1 NA NA NA
#> geo_bbox.ymax geo_bbox.zmin geo_bbox.zmax geo_bbox.mmin geo_bbox.mmax
#> 1 NA NA NA NA NA
#> 2 NA NA NA NA NA
#> 3 NA NA NA NA NA
#> 4 NA NA NA NA NA
#> 5 NA NA NA NA NA
#> 6 NA NA NA NA NA
#> geo_types
#> 1 NULL
#> 2 NULL
#> 3 NULL
#> 4 NULL
#> 5 NULL
#> 6 NULL
Use streaming = TRUE to avoid loading the entire VCF into R memory. This streams the VCF to Arrow IPC (nanoarrow) and then to Parquet via DuckDB, trading memory for serialization overhead through the DuckDB nanoarrow extension.
bcf_larger <- system.file("extdata", "1000G.ALL.2of4intersection.20100804.genotypes.bcf", package = "RBCFTools")
outfile <- tempfile(fileext = ".parquet")
vcf_to_parquet_arrow(
bcf_larger,
outfile,
streaming = TRUE,
batch_size = 10000L,
row_group_size = 100000L,
compression = "zstd"
)
#> Wrote 11 rows to /tmp/RtmpUoEUOx/file5131f42823959.parquet (streaming mode)
Query VCF with DuckDB after converting the Stream
Run SQL queries on a BCF with the DuckDB package. For now this is somewhat limited because the Arrow stream is converted to a data.frame before querying:
vcf_query_arrow(bcf_file, "SELECT CHROM, COUNT(*) as n FROM vcf GROUP BY CHROM")
#> CHROM n
#> 1 1 11
# Filter variants by position
vcf_query_arrow(bcf_file, "SELECT CHROM, POS, REF, ALT FROM vcf LIMIT 5")
#> CHROM POS REF ALT
#> 1 1 10583 G A
#> 2 1 11508 A G
#> 3 1 11565 G T
#> 4 1 13116 T G
#> 5 1 13327 G C
Query VCF with DuckDB Extension
A native DuckDB extension (bcf_reader) enables direct SQL queries on VCF/BCF files without Arrow conversion overhead. The extension uses the DuckDB C API and should be compatible with DuckDB v1.2.0+.
# Build extension (uses package's bundled htslib)
build_dir <- file.path(tempdir(), "bcf_reader")
ext_path <- bcf_reader_build(build_dir, verbose = FALSE)
# Connect and query a VCF.gz file
con <- vcf_duckdb_connect(ext_path)
vcf_file <- system.file("extdata", "test_deep_variant.vcf.gz", package = "RBCFTools")
# Describe schema
DBI::dbGetQuery(con, sprintf("DESCRIBE SELECT * FROM bcf_read('%s')", vcf_file))
#> column_name column_type null key default extra
#> 1 CHROM VARCHAR YES <NA> <NA> <NA>
#> 2 POS BIGINT YES <NA> <NA> <NA>
#> 3 ID VARCHAR YES <NA> <NA> <NA>
#> 4 REF VARCHAR YES <NA> <NA> <NA>
#> 5 ALT VARCHAR[] YES <NA> <NA> <NA>
#> 6 QUAL DOUBLE YES <NA> <NA> <NA>
#> 7 FILTER VARCHAR[] YES <NA> <NA> <NA>
#> 8 INFO_END INTEGER YES <NA> <NA> <NA>
#> 9 FORMAT_GT_test_deep_variant VARCHAR YES <NA> <NA> <NA>
#> 10 FORMAT_GQ_test_deep_variant INTEGER YES <NA> <NA> <NA>
#> 11 FORMAT_DP_test_deep_variant INTEGER YES <NA> <NA> <NA>
#> 12 FORMAT_MIN_DP_test_deep_variant INTEGER YES <NA> <NA> <NA>
#> 13 FORMAT_AD_test_deep_variant INTEGER[] YES <NA> <NA> <NA>
#> 14 FORMAT_VAF_test_deep_variant FLOAT[] YES <NA> <NA> <NA>
#> 15 FORMAT_PL_test_deep_variant INTEGER[] YES <NA> <NA> <NA>
#> 16 FORMAT_MED_DP_test_deep_variant INTEGER YES <NA> <NA> <NA>
# Aggregate query
DBI::dbGetQuery(con, sprintf("
SELECT CHROM, COUNT(*) as n_variants,
MIN(POS) as min_pos, MAX(POS) as max_pos
FROM bcf_read('%s')
GROUP BY CHROM
ORDER BY n_variants DESC
LIMIT 5", vcf_file))
#> CHROM n_variants min_pos max_pos
#> 1 19 35918 111129 59084689
#> 2 1 35846 536895 249211717
#> 3 17 27325 6102 81052229
#> 4 11 24472 180184 134257519
#> 5 2 22032 42993 242836470
# Export directly to Parquet
parquet_out <- tempfile(fileext = ".parquet")
DBI::dbExecute(con, sprintf("
COPY (SELECT * FROM bcf_read('%s'))
TO '%s' (FORMAT PARQUET, COMPRESSION ZSTD)", vcf_file, parquet_out))
#> [1] 368319
# Verify parquet file size
file.info(parquet_out)$size
#> [1] 3998815
# Same query on Parquet file
DBI::dbGetQuery(con, sprintf("
SELECT CHROM, COUNT(*) as n_variants,
MIN(POS) as min_pos, MAX(POS) as max_pos
FROM '%s'
GROUP BY CHROM
ORDER BY n_variants DESC
LIMIT 5", parquet_out))
#> CHROM n_variants min_pos max_pos
#> 1 19 35918 111129 59084689
#> 2 1 35846 536895 249211717
#> 3 17 27325 6102 81052229
#> 4 11 24472 180184 134257519
#> 5 2 22032 42993 242836470
DBI::dbDisconnect(con)
Open VCF as DuckDB Table/View
The vcf_open_duckdb() function provides a convenient interface to open VCF/BCF files as DuckDB tables or views, similar to vcf_open_arrow() for Arrow streams. By default it creates a lazy view that reads from the VCF on each query; creation is instant and incurs no memory overhead.
# Build extension
ext_path <- bcf_reader_build(tempdir(), verbose = FALSE)
vcf_file <- system.file("extdata", "1000G_3samples.vcf.gz", package = "RBCFTools")
# Open as lazy view (default) - instant creation
vcf <- vcf_open_duckdb(vcf_file, ext_path)
#> Created view 'variants' from 1000G_3samples.vcf.gz
print(vcf)
#> VCF DuckDB Connection
#> ---------------------
#> Source file: 1000G_3samples.vcf.gz
#> Table name: variants
#> Type: VIEW (lazy)
#> Database: in-memory
#> Tidy format: FALSE
#>
#> Use DBI::dbGetQuery(vcf$con, 'SELECT ...') to query
#> Use vcf_close_duckdb(vcf) when done
# Query the view
DBI::dbGetQuery(vcf$con, "SELECT CHROM, POS, REF, ALT FROM variants LIMIT 5")
#> CHROM POS REF ALT
#> 1 1 10583 G A
#> 2 1 11508 A G
#> 3 1 11565 G T
#> 4 1 13116 T G
#> 5 1 13327 G C
# Close connection
vcf_close_duckdb(vcf)
For repeated queries on the same data, materialize to a table with as_view = FALSE:
# Materialize to in-memory table (slower to create, fast queries)
vcf <- vcf_open_duckdb(vcf_file, ext_path, as_view = FALSE)
#> Created table 'variants' with 11 rows from 1000G_3samples.vcf.gz
print(vcf)
#> VCF DuckDB Connection
#> ---------------------
#> Source file: 1000G_3samples.vcf.gz
#> Table name: variants
#> Type: TABLE (materialized)
#> Database: in-memory
#> Tidy format: FALSE
#> Row count: 11
#>
#> Use DBI::dbGetQuery(vcf$con, 'SELECT ...') to query
#> Use vcf_close_duckdb(vcf) when done
# Multiple fast queries on materialized data
DBI::dbGetQuery(vcf$con, "SELECT COUNT(*) as n FROM variants")
#> n
#> 1 11
DBI::dbGetQuery(vcf$con, "SELECT CHROM, COUNT(*) as n FROM variants GROUP BY CHROM")
#> CHROM n
#> 1 1 11
vcf_close_duckdb(vcf)
For large VCF files, use the tidy format together with column selection:
# Tidy format with column selection
vcf <- vcf_open_duckdb(
vcf_file, ext_path,
tidy_format = TRUE,
columns = c("CHROM", "POS", "REF", "ALT", "SAMPLE_ID", "FORMAT_GT")
)
#> Created view 'variants' from 1000G_3samples.vcf.gz
DBI::dbGetQuery(vcf$con, "SELECT * FROM variants LIMIT 6")
#> CHROM POS REF ALT SAMPLE_ID FORMAT_GT
#> 1 1 10583 G A HG00098 0|0
#> 2 1 10583 G A HG00100 0|0
#> 3 1 10583 G A HG00106 0|0
#> 4 1 11508 A G HG00098 1|1
#> 5 1 11508 A G HG00100 1|1
#> 6 1 11508 A G HG00106 1|1
vcf_close_duckdb(vcf)
Tidy (Long) Format Export
Export VCF/BCF to a “tidy” format where each row is one variant-sample combination. The native tidy_format parameter in the bcf_reader extension transforms wide FORMAT_<field>_<sample> columns into a single SAMPLE_ID column plus FORMAT_<field> columns, which is much faster than SQL-level UNNEST.
# Build extension
ext_path <- bcf_reader_build(tempdir(), verbose = FALSE)
# Multi-sample VCF: 3 samples x 11 variants = 33 rows
vcf_3samples <- system.file("extdata", "1000G_3samples.vcf.gz", package = "RBCFTools")
tidy_out <- tempfile(fileext = ".parquet")
# Use tidy_format parameter directly
vcf_to_parquet_duckdb(vcf_3samples, tidy_out, extension_path = ext_path, tidy_format = TRUE)
#> Wrote: /tmp/RtmpUoEUOx/file5131f7fa8f670.parquet
# Query the tidy output
con <- duckdb::dbConnect(duckdb::duckdb())
DBI::dbGetQuery(con, sprintf("
SELECT CHROM, POS, REF, SAMPLE_ID, FORMAT_GT
FROM '%s'
LIMIT 9", tidy_out))
#> CHROM POS REF SAMPLE_ID FORMAT_GT
#> 1 1 10583 G HG00098 0|0
#> 2 1 10583 G HG00100 0|0
#> 3 1 10583 G HG00106 0|0
#> 4 1 11508 A HG00098 1|1
#> 5 1 11508 A HG00100 1|1
#> 6 1 11508 A HG00106 1|1
#> 7 1 11565 G HG00098 0|0
#> 8 1 11565 G HG00100 0|0
#> 9 1 11565 G HG00106 0|0
# Verify row expansion: variants * samples
DBI::dbGetQuery(con, sprintf("
SELECT COUNT(*) as total_rows,
COUNT(DISTINCT SAMPLE_ID) as n_samples
FROM '%s'", tidy_out))
#> total_rows n_samples
#> 1 33 3
DBI::dbDisconnect(con, shutdown = TRUE)
This format is ideal for combining single-sample VCFs into cohort tables or appending to DuckLake tables with MERGE operations.
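As a hedged sketch of the cohort use case (the file names below are hypothetical tidy exports produced by vcf_to_parquet_duckdb() with tidy_format = TRUE), single-sample Parquet files can be stacked into one table with plain DuckDB SQL:
con <- duckdb::dbConnect(duckdb::duckdb())
# Stack two hypothetical single-sample tidy exports into one cohort table;
# union_by_name aligns columns even if the FORMAT fields differ between files
DBI::dbExecute(con, "
  CREATE TABLE cohort AS
  SELECT * FROM read_parquet(['sampleA.parquet', 'sampleB.parquet'], union_by_name = true)
")
DBI::dbGetQuery(con, "SELECT SAMPLE_ID, COUNT(*) AS n FROM cohort GROUP BY SAMPLE_ID")
duckdb::dbDisconnect(con, shutdown = TRUE)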
VCF Header Metadata in Parquet
When exporting VCF to Parquet, the full VCF header is embedded as Parquet key-value metadata by default. This preserves all schema information (INFO, FORMAT, FILTER definitions, contigs, samples), enabling a round trip back to VCF format.
# Build extension
ext_path <- bcf_reader_build(tempdir(), verbose = FALSE)
vcf_file <- system.file("extdata", "1000G_3samples.vcf.gz", package = "RBCFTools")
# Export with embedded VCF header (default)
parquet_out <- tempfile(fileext = ".parquet")
vcf_to_parquet_duckdb(vcf_file, parquet_out, ext_path)
#> Wrote: /tmp/RtmpUoEUOx/file5131f727fd4f8.parquet
# Read back the metadata
meta <- parquet_kv_metadata(parquet_out)
print(meta)
#> key
#> 1 vcf_header
#> 2 RBCFTools_version
#> 3 tidy_format
#> value
#> 1 ##fileformat=VCFv4.0\\x0A##FILTER=<ID=PASS,Description=\\x22All filters passed\\x22>\\x0A##filedat=20101112\\x0A##datarelease=20100804\\x0A##samples=629\\x0A##contig=<ID=1,length=249250621>\\x0A##description=\\x22Where BI calls are present, genotypes and alleles are from BI. In there absence, UM genotypes are used. If neither are available, no genotype information is present and the alleles are from the NCBI calls.\\x22\\x0A##FORMAT=<ID=AD,Number=A,Type=Integer,Description=\\x22Allelic depths for the ref and alt alleles in the order listed\\x22>\\x0A##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\\x22Read Depth (only filtered reads used for calling)\\x22>\\x0A##FORMAT=<ID=GL,Number=G,Type=Float,Description=\\x22Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic\\x22>\\x0A##FORMAT=<ID=GQ,Number=1,Type=Float,Description=\\x22Genotype Quality\\x22>\\x0A##FORMAT=<ID=GT,Number=A,Type=String,Description=\\x22Genotype\\x22>\\x0A##FORMAT=<ID=GD,Number=1,Type=Float,Description=\\x22Genotype dosage. Expected count of non-ref alleles [0,2]\\x22>\\x0A##FORMAT=<ID=OG,Number=1,Type=String,Description=\\x22Original Genotype input to Beagle\\x22>\\x0A##INFO=<ID=AF,Number=.,Type=Float,Description=\\x22Allele Frequency, for each ALT allele, in the same order as listed\\x22>\\x0A##INFO=<ID=DP,Number=1,Type=Integer,Description=\\x22Total Depth\\x22>\\x0A##INFO=<ID=CB,Number=.,Type=String,Description=\\x22List of centres that called, UM (University of Michigan), BI (Broad Institute), BC (Boston College), NCBI\\x22>\\x0A##INFO=<ID=EUR_R2,Number=1,Type=Float,Description=\\x22R2 From Beagle based on European Samples\\x22>\\x0A##INFO=<ID=AFR_R2,Number=1,Type=Float,Description=\\x22R2 From Beagle based on AFRICAN Samples\\x22>\\x0A##INFO=<ID=ASN_R2,Number=1,Type=Float,Description=\\x22R2 From Beagle based on Asian Samples\\x22>\\x0A##bcftools_viewVersion=1.9-321-g5774f32+htslib-1.10.2-22-gbfc9f0d\\x0A##bcftools_viewCommand=view -O b -o 1000G.ALL.2of4intersection.20100804.genotypes.bcf 1000G.ALL.2of4intersection.20100804.genotypes.vcf; Date=Fri Apr 24 20:53:38 2020\\x0A##INFO=<ID=AC,Number=A,Type=Integer,Description=\\x22Allele count in genotypes\\x22>\\x0A##INFO=<ID=AN,Number=1,Type=Integer,Description=\\x22Total number of alleles in called genotypes\\x22>\\x0A##bcftools_viewVersion=1.23+htslib-1.23\\x0A##bcftools_viewCommand=view -s HG00098,HG00100,HG00106 -O b -o inst/extdata/1000G_3samples.bcf inst/extdata/1000G.ALL.2of4intersection.20100804.genotypes.bcf; Date=Sat Jan 3 14:41:16 2026\\x0A##bcftools_viewCommand=view -O z -o 1000G_3samples.vcf.gz 1000G_3samples.bcf; Date=Wed Jan 7 14:39:56 2026\\x0A##bcftools_viewCommand=view -h /usr/local/lib/R/site-library/RBCFTools/extdata/1000G_3samples.vcf.gz; Date=Mon Jan 19 21:21:38 2026\\x0A#CHROM\\x09POS\\x09ID\\x09REF\\x09ALT\\x09QUAL\\x09FILTER\\x09INFO\\x09FORMAT\\x09HG00098\\x09HG00100\\x09HG00106
#> 2 1.23.0.0.3
#> 3 false
# Extract the VCF header (stored with escaped newlines)
vcf_header <- meta[meta$key == "vcf_header", "value"]
substr(vcf_header, 1, 200)
#> [1] "##fileformat=VCFv4.0\\x0A##FILTER=<ID=PASS,Description=\\x22All filters passed\\x22>\\x0A##filedat=20101112\\x0A##datarelease=20100804\\x0A##samples=629\\x0A##contig=<ID=1,length=249250621>\\x0A##description="Stream Remote VCF using Arrow or DuckDB Extension
Stream a remote VCF region directly from S3 using either the nanoarrow-based Arrow stream or the DuckDB bcf_reader extension:
s3_vcf_uri <- paste0(s3_base, s3_path, s3_vcf_file)
# Arrow stream
stream <- vcf_open_arrow(
s3_vcf_uri,
region = "chr22:16050000-16050500",
batch_size = 1000L
)
df <- as.data.frame(nanoarrow::convert_array_stream(stream))
df[, c("CHROM", "POS", "REF", "ALT")] |> head()
#> CHROM POS REF ALT
#> 1 chr22 16050036 A C , <NON_REF>
#> 2 chr22 16050151 T G , <NON_REF>
#> 3 chr22 16050213 C T , <NON_REF>
#> 4 chr22 16050219 C A , <NON_REF>
#> 5 chr22 16050224 A C , <NON_REF>
#> 6 chr22 16050229 C A , <NON_REF>
# Query remote VCF with bcf_reader extension
vcf_query_duckdb(
s3_vcf_uri,
ext_path,
region = "chr22:16050000-16050500",
query = "SELECT CHROM, POS, REF, ALT FROM vcf LIMIT 5"
)
#> CHROM POS REF ALT
#> 1 chr22 16050036 A C, <NON_REF>
#> 2 chr22 16050151 T G, <NON_REF>
#> 3 chr22 16050213 C T, <NON_REF>
#> 4 chr22 16050219 C A, <NON_REF>
#> 5 chr22 16050224 A C, <NON_REF>
Custom Index File Path
Region queries require an index file; reading an entire file without a region filter does not. By default, RBCFTools looks for .tbi (tabix) or .csi indexes using standard naming conventions: VCF files try .tbi first and then fall back to .csi, while BCF files use .csi only. For non-standard index locations, such as presigned URLs or custom paths, use the index parameter.
# Explicit index file path
bcf_file <- system.file("extdata", "1000G_3samples.bcf", package = "RBCFTools")
csi_index <- system.file("extdata", "1000G_3samples.bcf.csi", package = "RBCFTools")
stream <- vcf_open_arrow(bcf_file, index = csi_index)
batch <- stream$get_next()
nanoarrow::convert_array(batch)[, c("CHROM", "POS", "REF")] |> head(3)
#> CHROM POS REF
#> 1 1 10583 G
#> 2 1 11508 A
#> 3 1 11565 G
# Alternative: htslib ##idx## syntax in filename
filename_with_idx <- paste0(bcf_file, "##idx##", csi_index)
stream2 <- vcf_open_arrow(filename_with_idx)
batch2 <- stream2$get_next()
nanoarrow::convert_array(batch2)[, c("CHROM", "POS", "REF")] |> head(3)
#> CHROM POS REF
#> 1 1 10583 G
#> 2 1 11508 A
#> 3 1 11565 G
Example Application: DuckLake ETL
DuckLake is an integrated data lake and catalog format. It has two components: Parquet file storage and a metadata database (catalog) layer.
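As a minimal local sketch (independent of the MinIO example below, and assuming the DuckLake extension's documented ATTACH syntax), the two components map to a .ducklake metadata database plus a data directory for Parquet files:
con <- duckdb::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL ducklake")
DBI::dbExecute(con, "LOAD ducklake")
# Metadata catalog in 'my_catalog.ducklake'; Parquet data files under 'lake_data/'
DBI::dbExecute(con, "ATTACH 'ducklake:my_catalog.ducklake' AS lake (DATA_PATH 'lake_data/')")
DBI::dbExecute(con, "USE lake")
duckdb::dbDisconnect(con, shutdown = TRUE)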
DuckLake ETL to MinIO Storage backend
To mimic an S3 backend for DuckLake storage we use MinIO: we convert VCF files to Parquet, insert them into DuckLake, and query the lake. In this example the lake uses a local DuckDB metadata database.
Set up the MinIO storage backend
We start by setting up and configuring a MinIO server:
# DuckLake with S3-compatible storage using local MinIO
bin_dir <- file.path(tempdir(), "ducklake_bins")
dir.create(bin_dir, recursive = TRUE, showWarnings = FALSE)
# Use installed MinIO and MC binaries
minio_bin <- Sys.which('minio')
mc_bin <- Sys.which('mc')
# Start MinIO (ephemeral) and configure mc
data_dir <- file.path(tempdir(), "ducklake_minio")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
port <- 9000
endpoint <- sprintf("127.0.0.1:%d", port)
# Start MinIO server in background
cmd <- sprintf(
"%s server %s --address %s > /dev/null 2>&1 & echo $!",
shQuote(minio_bin),
shQuote(data_dir),
endpoint
)
pid_output <- processx::run("sh", c("-c", cmd), echo = FALSE)$stdout
pid <- as.integer(pid_output)
pid
#> [1] 333002
# Give MinIO time to start
Sys.sleep(10)
# Configure mc alias
# remove previous alias
processx::run(
mc_bin,
c("alias", "remove", "ducklake_local"),
error_on_status = FALSE,
echo = FALSE
)
#> $status
#> [1] 0
#>
#> $stdout
#> [1] "Removed `ducklake_local` successfully.\n"
#>
#> $stderr
#> [1] ""
#>
#> $timeout
#> [1] FALSE
mc_cmd_args <- c("alias", "set", "ducklake_local",
paste0("http://", endpoint), "minioadmin", "minioadmin")
processx::run(mc_bin, mc_cmd_args, echo = FALSE)
#> $status
#> [1] 0
#>
#> $stdout
#> [1] "Added `ducklake_local` successfully.\n"
#>
#> $stderr
#> [1] ""
#>
#> $timeout
#> [1] FALSE
# Create bucket with unique name
bucket <- sprintf("readme-demo-%d", as.integer(Sys.time()))
bucket_cmd_args <- c("mb", paste0("ducklake_local/", bucket))
processx::run(mc_bin, bucket_cmd_args, echo = FALSE)
#> $status
#> [1] 0
#>
#> $stdout
#> [1] "Bucket created successfully `ducklake_local/readme-demo-1768854112`.\n"
#>
#> $stderr
#> [1] ""
#>
#> $timeout
#> [1] FALSE
# Store variables for later use
minio_endpoint <- endpoint
bucket_name <- bucket
data_root_s3 <- paste0("s3://", bucket_name)
data_path_s3 <- paste0(data_root_s3, "/data/")
mc_path <- mc_bin
Attach a DuckLake
con <- duckdb::dbConnect(duckdb::duckdb(config = list(
allow_unsigned_extensions = "true",
enable_external_access = "true"
)))
stopifnot(DBI::dbIsValid(con))
DBI::dbExecute(con, "INSTALL httpfs")
#> [1] 0
DBI::dbExecute(con, "LOAD httpfs")
#> [1] 0
DBI::dbExecute(con, "INSTALL ducklake FROM core_nightly")
#> [1] 0
DBI::dbExecute(con, "LOAD ducklake")
#> [1] 0
# Provide S3 credentials/endpoints for MinIO
Sys.setenv(
AWS_ACCESS_KEY_ID = "minioadmin",
AWS_SECRET_ACCESS_KEY = "minioadmin",
AWS_DEFAULT_REGION = "us-east-1",
AWS_S3_ENDPOINT = paste0("http://", minio_endpoint),
AWS_S3_USE_HTTPS = "FALSE",
AWS_S3_FORCE_PATH_STYLE = "TRUE",
AWS_EC2_METADATA_DISABLED = "TRUE"
)
ducklake_create_s3_secret(
con,
name = "ducklake_minio",
key_id = "minioadmin",
secret = "minioadmin",
endpoint = minio_endpoint,
region = "us-east-1",
use_ssl = FALSE
)
metadata_path <- file.path(tempdir(), sprintf("ducklake_meta_%s.ducklake", bucket_name))
ducklake_attach(
con,
metadata_path = metadata_path,
data_path = data_path_s3,
alias = "lake",
extra_options = list(OVERRIDE_DATA_PATH = TRUE)
)
# Work inside the attached lake database
DBI::dbExecute(con, "USE lake")
#> [1] 0
Write a VCF into MinIO and register it in the lake
# Ensure we are operating inside the DuckLake catalog
DBI::dbExecute(con, "USE lake")
#> [1] 0
# Load variants via fast VCF to Parquet conversion
vcf_file <- system.file("extdata", "test_deep_variant.vcf.gz", package = "RBCFTools")
ext_path <- bcf_reader_build(tempdir())
#> bcf_reader extension already exists at: /tmp/RtmpUoEUOx/build/bcf_reader.duckdb_extension
#> Use force=TRUE to rebuild.
ducklake_load_vcf(
con,
table = "variants",
vcf_path = vcf_file,
extension_path = ext_path,
threads = 1,
tidy_format = TRUE
)
#> Wrote: /tmp/RtmpUoEUOx/variants_20260119_212152.parquet
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#> target signature 'duckdb_connection#Id'.
#> "duckdb_connection#ANY" would also be validAdd another VCF file with different schema
DuckLake supports schema evolution via ALTER TABLE. Use allow_evolution = TRUE to automatically add new columns from incoming files.
DBI::dbExecute(con, "USE lake")
#> [1] 0
variants_count <- DBI::dbGetQuery(con, "SELECT COUNT(*) as n FROM variants")$n[1]
variants_count
#> [1] 368319
vcf_file2 <- system.file("extdata", "test_vep.vcf", package = "RBCFTools")
local_parquet2 <- tempfile(fileext = ".parquet")
vcf_to_parquet_duckdb(vcf_file2, local_parquet2, extension_path = ext_path, tidy_format = TRUE)
#> Wrote: /tmp/RtmpUoEUOx/file5131f2d13d660.parquet
DBI::dbGetQuery(con, sprintf("SELECT COUNT(*) as n FROM read_parquet('%s')", local_parquet2))
#> n
#> 1 802
s3_parquet2 <- paste0(data_path_s3, "variants/variants_vep.parquet")
mc_cmd_args <- c("cp", local_parquet2, paste0("ducklake_local/", bucket_name, "/data/variants/variants_vep.parquet"))
processx::run(mc_bin, mc_cmd_args, echo = FALSE)
#> $status
#> [1] 0
#>
#> $stdout
#> [1] "`/tmp/RtmpUoEUOx/file5131f2d13d660.parquet` -> `ducklake_local/readme-demo-1768854112/data/variants/variants_vep.parquet`\n┌────────────┬─────────────┬──────────┬────────────┐\n│ Total │ Transferred │ Duration │ Speed │\n│ 122.33 KiB │ 122.33 KiB │ 00m00s │ 8.80 MiB/s │\n└────────────┴─────────────┴──────────┴────────────┘\n"
#>
#> $stderr
#> [1] ""
#>
#> $timeout
#> [1] FALSE
DBI::dbGetQuery(con, sprintf("SELECT COUNT(*) as n FROM read_parquet('%s')", s3_parquet2))
#> n
#> 1 802
ducklake_register_parquet(
con,
table = "lake.variants",
parquet_files = s3_parquet2,
create_table = FALSE,
allow_evolution = TRUE
)
variants_count_after <- DBI::dbGetQuery(con, "SELECT COUNT(*) as n FROM variants")$n[1]
variants_count_after
#> [1] 369121
List lake content
DBI::dbExecute(con, "USE lake")
#> [1] 0
DBI::dbGetQuery(con, "
SELECT
COUNT(*) as total_variants,
COUNT(DISTINCT CHROM) as n_chromosomes,
MIN(POS) as min_position,
MAX(POS) as max_position
FROM variants
")
#> total_variants n_chromosomes min_position max_position
#> 1 369121 25 152 249211717
DBI::dbGetQuery(con, "DESCRIBE variants") |>
head(20)
#> column_name column_type null key default extra
#> 1 CHROM VARCHAR YES <NA> <NA> <NA>
#> 2 POS BIGINT YES <NA> <NA> <NA>
#> 3 ID VARCHAR YES <NA> <NA> <NA>
#> 4 REF VARCHAR YES <NA> <NA> <NA>
#> 5 ALT VARCHAR[] YES <NA> <NA> <NA>
#> 6 QUAL DOUBLE YES <NA> <NA> <NA>
#> 7 FILTER VARCHAR[] YES <NA> <NA> <NA>
#> 8 INFO_END INTEGER YES <NA> <NA> <NA>
#> 9 SAMPLE_ID VARCHAR YES <NA> <NA> <NA>
#> 10 FORMAT_GT VARCHAR YES <NA> <NA> <NA>
#> 11 FORMAT_GQ INTEGER YES <NA> <NA> <NA>
#> 12 FORMAT_DP INTEGER YES <NA> <NA> <NA>
#> 13 FORMAT_MIN_DP INTEGER YES <NA> <NA> <NA>
#> 14 FORMAT_AD INTEGER[] YES <NA> <NA> <NA>
#> 15 FORMAT_VAF FLOAT[] YES <NA> <NA> <NA>
#> 16 FORMAT_PL INTEGER[] YES <NA> <NA> <NA>
#> 17 FORMAT_MED_DP INTEGER YES <NA> <NA> <NA>
#> 18 VEP_Allele VARCHAR[] YES <NA> <NA> <NA>
#> 19 VEP_Consequence VARCHAR[] YES <NA> <NA> <NA>
#> 20 VEP_IMPACT VARCHAR[] YES <NA> <NA> <NA>
ducklake_list_files(con, "lake", "variants")
#> data_file
#> 1 s3://readme-demo-1768854112/data/main/variants/ducklake-019bd7eb-cae5-76fb-8753-ac68273e8a4c.parquet
#> 2 s3://readme-demo-1768854112/data/variants/variants_vep.parquet
#> data_file_size_bytes data_file_footer_size data_file_encryption_key
#> 1 5751964 6113 NULL
#> 2 125269 16362 NULL
#> delete_file delete_file_size_bytes delete_file_footer_size
#> 1 <NA> NA NA
#> 2 <NA> NA NA
#> delete_file_encryption_key
#> 1 NULL
#> 2 NULL
Snapshots and time travel
ducklake_snapshots(con, "lake") |> head()
#> snapshot_id snapshot_time schema_version
#> 1 0 2026-01-19 20:21:52 0
#> 2 1 2026-01-19 20:21:52 1
#> 3 2 2026-01-19 20:21:53 2
#> 4 3 2026-01-19 20:21:53 3
#> 5 4 2026-01-19 20:21:53 4
#> 6 5 2026-01-19 20:21:53 5
#> changes author commit_message
#> 1 schemas_created, main <NA> <NA>
#> 2 tables_created, tables_inserted_into, main.variants, 1 <NA> <NA>
#> 3 tables_altered, 1 <NA> <NA>
#> 4 tables_altered, 1 <NA> <NA>
#> 5 tables_altered, 1 <NA> <NA>
#> 6 tables_altered, 1 <NA> <NA>
#> commit_extra_info
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 <NA>
#> 5 <NA>
#> 6 <NA>
ducklake_snapshots(con, "lake") |> tail()
#> snapshot_id snapshot_time schema_version changes
#> 84 83 2026-01-19 20:21:54 83 tables_altered, 1
#> 85 84 2026-01-19 20:21:54 84 tables_altered, 1
#> 86 85 2026-01-19 20:21:54 85 tables_altered, 1
#> 87 86 2026-01-19 20:21:54 86 tables_altered, 1
#> 88 87 2026-01-19 20:21:54 87 tables_altered, 1
#> 89 88 2026-01-19 20:21:54 87 tables_inserted_into, 1
#> author commit_message commit_extra_info
#> 84 <NA> <NA> <NA>
#> 85 <NA> <NA> <NA>
#> 86 <NA> <NA> <NA>
#> 87 <NA> <NA> <NA>
#> 88 <NA> <NA> <NA>
#> 89 <NA> <NA> <NA>
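# Time travel (a hedged sketch, not executed here): DuckLake can query a table
# as of an earlier snapshot, e.g. the first data-bearing snapshot listed above:
# DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM variants AT (VERSION => 1)")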
ducklake_current_snapshot(con, "lake")
#> [1] 88
Configuration
ducklake_options(con, "lake")
#> option_name description
#> 1 created_by Tool used to write the DuckLake
#> 2 data_path Path to data files
#> 3 encrypted Whether or not to encrypt Parquet files written to the data path
#> 4 version DuckLake format version
#> value scope scope_entry
#> 1 DuckDB d1dc88f950 GLOBAL <NA>
#> 2 s3://readme-demo-1768854112/data/ GLOBAL <NA>
#> 3 false GLOBAL <NA>
#> 4 0.3 GLOBAL <NA>
ducklake_set_option(con, "lake", "parquet_compression", "zstd")
ducklake_options(con, "lake")
#> option_name
#> 1 created_by
#> 2 data_path
#> 3 encrypted
#> 4 parquet_compression
#> 5 version
#> description
#> 1 Tool used to write the DuckLake
#> 2 Path to data files
#> 3 Whether or not to encrypt Parquet files written to the data path
#> 4 Compression algorithm for Parquet files (uncompressed, snappy, gzip, zstd, brotli, lz4, lz4_raw)
#> 5 DuckLake format version
#> value scope scope_entry
#> 1 DuckDB d1dc88f950 GLOBAL <NA>
#> 2 s3://readme-demo-1768854112/data/ GLOBAL <NA>
#> 3 false GLOBAL <NA>
#> 4 zstd GLOBAL <NA>
#> 5 0.3 GLOBAL <NA>
DBI::dbDisconnect(con, shutdown = TRUE)
tools::pskill(pid)
Supported Metadata Databases
DuckLake supports multiple catalog backends:
- DuckDB: ducklake:path/to/catalog.ducklake
- SQLite: ducklake:sqlite:path/to/catalog.db
- PostgreSQL: ducklake:postgresql://user:pass@host:5432/db
- MySQL: ducklake:mysql://user:pass@host:3306/db
Direct Connection
# DuckDB backend
ducklake_connect_catalog(
con,
backend = "duckdb",
connection_string = "catalog.ducklake",
data_path = "s3://bucket/data/",
alias = "lake"
)
# PostgreSQL backend
ducklake_connect_catalog(
con,
backend = "postgresql",
connection_string = "user:pass@host:5432/db",
data_path = "s3://bucket/data/",
alias = "lake"
)
Secret-based Connection
# Create catalog secret
ducklake_create_catalog_secret(
con,
name = "pg_catalog",
backend = "postgresql",
connection_string = "user:pass@host:5432/db",
data_path = "s3://bucket/data/"
)
# Connect using secret
ducklake_connect_catalog(
con,
secret_name = "pg_catalog",
alias = "lake"
)
Command-Line Utilities
CLI tools are provided for VCF-to-Parquet conversion and querying, including threaded chunking by contig via either the bcf_reader extension or Arrow IPC.
# Get paths using system.file
SCRIPT=$(Rscript -e "cat(system.file('scripts', 'vcf2parquet_duckdb.R', package='RBCFTools'))")
BCF=$(Rscript -e "cat(system.file('extdata', 'test_deep_variant.vcf.gz', package='RBCFTools'))")
OUT_PQ=$(mktemp --suffix=.parquet)
Log=$(mktemp --suffix=.log)
# Convert BCF to Parquet
time $SCRIPT convert --quiet --tidy -i $BCF -o $OUT_PQ -t 4 > $Log 2>&1
cat $Log
# Query with DuckDB SQL
$SCRIPT query -i $OUT_PQ -q "SELECT * FROM parquet_scan('$OUT_PQ') LIMIT 5"
# Describe table structure
$SCRIPT query -i $OUT_PQ -q "DESCRIBE SELECT * FROM parquet_scan('$OUT_PQ')"
# Show schema
$SCRIPT schema --quiet -i $BCF
# File info
$SCRIPT info -i $OUT_PQ
rm -f $OUT_PQ
#>
#> real 0m1.751s
#> user 0m3.903s
#> sys 0m2.164s
#> Building bcf_reader extension...
#> Build directory: /tmp/RtmpjfS19e
#> Building bcf_reader extension...
#> Build directory: /tmp/RtmpjfS19e
#> Using htslib from: /usr/local/lib/R/site-library/RBCFTools/htslib/lib
#> Running: make with explicit htslib paths
#> make[1]: Entering directory '/tmp/RtmpjfS19e'
#> rm -rf build
#> make[1]: Leaving directory '/tmp/RtmpjfS19e'
#> make[1]: Entering directory '/tmp/RtmpjfS19e'
#> mkdir -p build
#> gcc -O2 -Wall -Wextra -Wno-unused-parameter -fPIC -I/usr/local/lib/R/site-library/RBCFTools/htslib/include -I. -c bcf_reader.c -o build/bcf_reader.o
#> gcc -O2 -Wall -Wextra -Wno-unused-parameter -fPIC -I/usr/local/lib/R/site-library/RBCFTools/htslib/include -I. -c vep_parser.c -o build/vep_parser.o
#> gcc -shared -fPIC -o build/libbcf_reader.so build/bcf_reader.o build/vep_parser.o -L/usr/local/lib/R/site-library/RBCFTools/htslib/lib -Wl,-rpath,/usr/local/lib/R/site-library/RBCFTools/htslib/lib -lhts
#> Creating DuckDB extension with metadata...
#> Created: build/bcf_reader.duckdb_extension
#> Platform: linux_amd64
#> DuckDB Version: v1.2.0
#> Extension Version: 1.0.0
#> make[1]: Leaving directory '/tmp/RtmpjfS19e'
#> Extension built: /tmp/RtmpjfS19e/build/bcf_reader.duckdb_extension
#> ✓ Extension ready: /tmp/RtmpjfS19e/build/bcf_reader.duckdb_extension
#>
#> Converting VCF to Parquet (DuckDB mode)...
#> Input: /usr/local/lib/R/site-library/RBCFTools/extdata/test_deep_variant.vcf.gz
#> Output: /tmp/tmp.zTX6Iw0YdB.parquet
#> Compression: zstd
#> Row group size: 100000
#> Threads: 4
#> Format: tidy (one row per variant-sample)
#> Processing 25 contigs (out of 86 in header) using 4 threads (DuckDB mode)
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0004.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0003.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0002.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0001.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0008.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0007.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0006.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0005.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0012.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0010.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0011.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0009.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0016.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0014.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0013.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0015.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0018.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0020.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0017.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0024.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0019.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0022.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0021.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0023.parquet
#> Wrote: /tmp/RtmpjfS19e/vcf_duckdb_parallel_515d278e301d9/contig_0025.parquet
#> Merging temporary Parquet files... to /tmp/tmp.zTX6Iw0YdB.parquet
#> Merged 25 parquet files -> tmp.zTX6Iw0YdB.parquet (368319 rows)
#>
#> ✓ Conversion complete!
#> Time: 1.06 seconds
#> Output size: 3.77 MB
#> Running query on Parquet file...
#> CHROM POS ID REF ALT QUAL FILTER INFO_END SAMPLE_ID FORMAT_GT
#> 1 1 536895 <NA> T C 0.1 RefCall NA test_deep_variant ./.
#> 2 1 536924 <NA> G A 0.1 RefCall NA test_deep_variant ./.
#> 3 1 536948 <NA> A G 0.0 RefCall NA test_deep_variant 0/0
#> 4 1 536986 <NA> G T 0.0 RefCall NA test_deep_variant 0/0
#> 5 1 544490 <NA> A G 0.8 RefCall NA test_deep_variant ./.
#> FORMAT_GQ FORMAT_DP FORMAT_MIN_DP FORMAT_AD FORMAT_VAF FORMAT_PL
#> 1 18 106 NA 46, 52 0.490566 0, 19, 25
#> 2 18 118 NA 55, 60 0.508475 0, 21, 20
#> 3 31 104 NA 91, 13 0.125 0, 31, 38
#> 4 23 115 NA 55, 60 0.521739 0, 23, 29
#> 5 8 10 NA 6, 4 0.4 0, 7, 15
#> FORMAT_MED_DP
#> 1 NA
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> Running query on Parquet file...
#> column_name column_type null key default extra
#> 1 CHROM VARCHAR YES <NA> <NA> <NA>
#> 2 POS BIGINT YES <NA> <NA> <NA>
#> 3 ID VARCHAR YES <NA> <NA> <NA>
#> 4 REF VARCHAR YES <NA> <NA> <NA>
#> 5 ALT VARCHAR[] YES <NA> <NA> <NA>
#> 6 QUAL DOUBLE YES <NA> <NA> <NA>
#> 7 FILTER VARCHAR[] YES <NA> <NA> <NA>
#> 8 INFO_END INTEGER YES <NA> <NA> <NA>
#> 9 SAMPLE_ID VARCHAR YES <NA> <NA> <NA>
#> 10 FORMAT_GT VARCHAR YES <NA> <NA> <NA>
#> 11 FORMAT_GQ INTEGER YES <NA> <NA> <NA>
#> 12 FORMAT_DP INTEGER YES <NA> <NA> <NA>
#> 13 FORMAT_MIN_DP INTEGER YES <NA> <NA> <NA>
#> 14 FORMAT_AD INTEGER[] YES <NA> <NA> <NA>
#> 15 FORMAT_VAF FLOAT[] YES <NA> <NA> <NA>
#> 16 FORMAT_PL INTEGER[] YES <NA> <NA> <NA>
#> 17 FORMAT_MED_DP INTEGER YES <NA> <NA> <NA>
#> Building bcf_reader extension...
#> Build directory: /tmp/RtmpjWo9GE
#> Building bcf_reader extension...
#> Build directory: /tmp/RtmpjWo9GE
#> Using htslib from: /usr/local/lib/R/site-library/RBCFTools/htslib/lib
#> Running: make with explicit htslib paths
#> make[1]: Entering directory '/tmp/RtmpjWo9GE'
#> rm -rf build
#> make[1]: Leaving directory '/tmp/RtmpjWo9GE'
#> make[1]: Entering directory '/tmp/RtmpjWo9GE'
#> mkdir -p build
#> gcc -O2 -Wall -Wextra -Wno-unused-parameter -fPIC -I/usr/local/lib/R/site-library/RBCFTools/htslib/include -I. -c bcf_reader.c -o build/bcf_reader.o
#> gcc -O2 -Wall -Wextra -Wno-unused-parameter -fPIC -I/usr/local/lib/R/site-library/RBCFTools/htslib/include -I. -c vep_parser.c -o build/vep_parser.o
#> gcc -shared -fPIC -o build/libbcf_reader.so build/bcf_reader.o build/vep_parser.o -L/usr/local/lib/R/site-library/RBCFTools/htslib/lib -Wl,-rpath,/usr/local/lib/R/site-library/RBCFTools/htslib/lib -lhts
#> Creating DuckDB extension with metadata...
#> Created: build/bcf_reader.duckdb_extension
#> Platform: linux_amd64
#> DuckDB Version: v1.2.0
#> Extension Version: 1.0.0
#> make[1]: Leaving directory '/tmp/RtmpjWo9GE'
#> Extension built: /tmp/RtmpjWo9GE/build/bcf_reader.duckdb_extension
#> ✓ Extension ready: /tmp/RtmpjWo9GE/build/bcf_reader.duckdb_extension
#>
#> VCF DuckDB Schema for: /usr/local/lib/R/site-library/RBCFTools/extdata/test_deep_variant.vcf.gz
#>
#> column_name column_type
#> CHROM character
#> POS numeric
#> ID character
#> REF character
#> ALT list
#> QUAL numeric
#> FILTER list
#> INFO_END integer
#> FORMAT_GT_test_deep_variant character
#> FORMAT_GQ_test_deep_variant integer
#> FORMAT_DP_test_deep_variant integer
#> FORMAT_MIN_DP_test_deep_variant integer
#> FORMAT_AD_test_deep_variant list
#> FORMAT_VAF_test_deep_variant list
#> FORMAT_PL_test_deep_variant list
#> FORMAT_MED_DP_test_deep_variant integer
#> Parquet File Information: /tmp/tmp.zTX6Iw0YdB.parquet
#>
#> File size: 3.77 MB
#> Total rows: 368319
#> Number of columns: 28
#>
#> Schema (top-level columns):
#> name type
#> CHROM BYTE_ARRAY
#> POS INT64
#> ID BYTE_ARRAY
#> REF BYTE_ARRAY
#> ALT <NA>
#> QUAL DOUBLE
#> FILTER <NA>
#> INFO_END INT32
#> SAMPLE_ID BYTE_ARRAY
#> FORMAT_GT BYTE_ARRAY
#> FORMAT_GQ INT32
#> FORMAT_DP INT32
#> FORMAT_MIN_DP INT32
#> FORMAT_AD <NA>
#> FORMAT_VAF <NA>
#> FORMAT_PL <NA>
#> FORMAT_MED_DP INT32