Opens a VCF or BCF file and creates an Arrow array stream that produces record batches. This enables efficient, streaming access to variant data in Arrow format.
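A minimal sketch of one way to consume the result, assuming the returned object is a standard nanoarrow array stream (the file path is illustrative); batch-wise iteration is shown under Examples below:
# Materialise the whole stream into a data frame via nanoarrow
stream <- vcf_open_arrow("variants.vcf.gz")
df <- nanoarrow::convert_array_stream(stream)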
Usage
vcf_open_arrow(
  filename,
  batch_size = 10000L,
  region = NULL,
  samples = NULL,
  include_info = TRUE,
  include_format = TRUE,
  index = NULL,
  threads = 0L,
  parse_vep = FALSE,
  vep_tag = NULL,
  vep_columns = NULL,
  vep_transcript = c("first", "all")
)
Arguments
- filename
Path to a VCF or BCF file.
- batch_size
Number of records per batch (default: 10000).
- region
Optional region string for filtering (e.g., "chr1:1000-2000").
- samples
Optional sample filter: comma-separated sample names, or prefix with "-" to exclude the named samples.
- include_info
Include INFO fields in output (default: TRUE).
- include_format
Include FORMAT/sample data in output (default: TRUE).
- index
Optional index file path. If NULL (default), auto-detection is used: VCF files try .tbi first, then .csi; BCF files use .csi only. Useful for non-standard index locations or presigned URLs with different paths. Alternatively, use the htslib ##idx## syntax in the filename (e.g., "file.vcf.gz##idx##custom.tbi"). Note: an index is only required for region queries; whole-file streaming needs no index.
- threads
Number of decompression threads (default: 0 = auto).
- parse_vep
Enable VEP/BCSQ/ANN annotation parsing (default: FALSE). When TRUE, annotation fields are parsed and added as typed columns (see the sketch following this list).
- vep_tag
Annotation tag to parse ("CSQ", "BCSQ", "ANN") or NULL for auto-detect.
- vep_columns
Character vector of VEP fields to extract, or NULL for all fields.
- vep_transcript
Which transcript to extract: "first" (default) or "all". "first" returns scalar columns (one value per variant). "all" returns list columns (all transcripts per variant).
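The annotation arguments combine as in the following sketch; the file path and the field names passed to vep_columns are illustrative and must match the annotation header of the file actually being read:
# Parse VEP CSQ annotations into typed columns, one value per variant
stream <- vcf_open_arrow(
  "annotated.vcf.gz",                                   # illustrative path
  parse_vep = TRUE,
  vep_tag = "CSQ",                                      # or NULL to auto-detect CSQ/BCSQ/ANN
  vep_columns = c("Consequence", "IMPACT", "SYMBOL"),   # illustrative field names
  vep_transcript = "first"                              # scalar columns; "all" gives list columns
)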
Examples
if (FALSE) { # \dontrun{
# Basic usage
stream <- vcf_open_arrow("variants.vcf.gz")

# Read batches
while (!is.null(batch <- stream$get_next())) {
  # Process batch...
  print(nanoarrow::convert_array(batch))
}

# With region filter
stream <- vcf_open_arrow("variants.vcf.gz", region = "chr1:1-1000000")

# With a custom index file (useful for presigned URLs or non-standard locations)
stream <- vcf_open_arrow("variants.vcf.gz", index = "custom_path.tbi", region = "chr1")

# Convert to a data frame
df <- vcf_to_arrow("variants.vcf.gz", as = "data.frame")

# Write to Parquet (uses DuckDB; no arrow package needed)
vcf_to_parquet_arrow("variants.vcf.gz", "variants.parquet")
} # }
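The samples and index arguments described above support two further patterns not shown in the examples; the sample names and index path here are illustrative:
# Keep two named samples, or exclude one by prefixing with "-"
stream <- vcf_open_arrow("variants.vcf.gz", samples = "NA12878,NA12891")
stream <- vcf_open_arrow("variants.vcf.gz", samples = "-NA12878")

# htslib ##idx## syntax in the filename as an alternative to the index argument
stream <- vcf_open_arrow("variants.vcf.gz##idx##custom.tbi", region = "chr1")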