Opens a VCF or BCF file and creates an Arrow array stream that yields record batches of up to batch_size records each. This gives efficient, streaming access to variant data in Arrow format.

Usage

vcf_open_arrow(
  filename,
  batch_size = 10000L,
  region = NULL,
  samples = NULL,
  include_info = TRUE,
  include_format = TRUE,
  index = NULL,
  threads = 0L,
  parse_vep = FALSE,
  vep_tag = NULL,
  vep_columns = NULL,
  vep_transcript = c("first", "all")
)

Arguments

filename

Path to VCF or BCF file

batch_size

Number of records per batch (default: 10000)

region

Optional region string for filtering (e.g., "chr1:1000-2000"). Region queries require an index (see index).

samples

Optional sample filter (comma-separated names or "-" prefixed to exclude)

include_info

Include INFO fields in output (default: TRUE)

include_format

Include FORMAT/sample data in output (default: TRUE)

index

Optional path to the index file. If NULL (default), the index is auto-detected: VCF files try .tbi first, then .csi; BCF files use .csi only. Set this for non-standard index locations or for presigned URLs whose index path differs from the data path. Alternatively, use the htslib ##idx## syntax in filename (e.g., "file.vcf.gz##idx##custom.tbi"). Note: an index is only required for region queries; whole-file streaming needs no index.

threads

Number of decompression threads (default: 0 = auto)

parse_vep

Enable VEP/BCSQ/ANN annotation parsing (default: FALSE). When TRUE, annotation fields are parsed and added as typed columns.

vep_tag

Annotation tag to parse ("CSQ", "BCSQ", or "ANN"), or NULL (default) to auto-detect.

vep_columns

Character vector of VEP fields to extract, or NULL (default) for all fields.

vep_transcript

Which transcript to extract: "first" (default) or "all". "first" returns scalar columns (one value per variant). "all" returns list columns (all transcripts per variant).
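
The string formats described for samples and for the ##idx## form of filename can be assembled with base R. The following is a minimal sketch, not part of the package: make_sample_filter() and with_index() are hypothetical helpers, the sample names and file names are placeholders, and the VEP field names ("Consequence", "SYMBOL") are assumed to exist in the file's annotation header.

```r
# Hypothetical helpers (NOT part of the package) that build argument
# values in the formats described in the Arguments section.

# Collapse a character vector of sample names into the comma-separated
# string expected by `samples`.
make_sample_filter <- function(sample_names) {
  stopifnot(is.character(sample_names), length(sample_names) > 0)
  paste(sample_names, collapse = ",")
}

# Attach an explicit index path to a file name using the htslib
# ##idx## syntax, as an alternative to passing `index`.
with_index <- function(filename, index) {
  paste0(filename, "##idx##", index)
}

sample_filter <- make_sample_filter(c("NA12878", "NA12891"))
indexed_name <- with_index("file.vcf.gz", "custom.tbi")

if (FALSE) { # not run: needs vcf_open_arrow() and real input files
  # Keep only the two named samples
  stream <- vcf_open_arrow("file.vcf.gz", samples = sample_filter)

  # Region query with the index location embedded in the file name
  stream <- vcf_open_arrow(indexed_name, region = "chr1:1000-2000")

  # Parse CSQ annotations, keeping two fields for every transcript
  # (field names are assumptions; they depend on the file's header)
  stream <- vcf_open_arrow(
    "file.vcf.gz",
    parse_vep = TRUE,
    vep_tag = "CSQ",
    vep_columns = c("Consequence", "SYMBOL"),
    vep_transcript = "all"
  )
}
```

Building the strings separately keeps the call sites short and makes the chosen samples and index path easy to inspect before a file is opened.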

Value

A nanoarrow_array_stream object. Batches are retrieved with stream$get_next(), which returns NULL once the stream is exhausted.
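
To illustrate how a nanoarrow_array_stream is typically consumed, the sketch below builds a small in-memory stream with nanoarrow::basic_array_stream() as a stand-in for the object returned by vcf_open_arrow(). The CHROM and POS columns are placeholders chosen for the example, not a statement about the columns the function actually produces.

```r
library(nanoarrow)

# Stand-in stream (an assumption for illustration): two small batches
# built in memory instead of being read from a VCF/BCF file.
batches <- list(
  as_nanoarrow_array(data.frame(CHROM = "chr1", POS = c(100L, 200L))),
  as_nanoarrow_array(data.frame(CHROM = "chr1", POS = 300L))
)
stream <- basic_array_stream(batches)

# The schema is available before any batch has been read.
schema <- stream$get_schema()
column_names <- names(schema$children)

# Pull one batch at a time; get_next() returns NULL at the end of the stream.
n_rows <- 0L
while (!is.null(batch <- stream$get_next())) {
  df <- convert_array(batch)   # struct array -> data.frame
  n_rows <- n_rows + nrow(df)
}

# Release the stream once it is no longer needed.
stream$release()
```

Processing one batch per iteration keeps memory use bounded by the batch size rather than by the size of the whole file, which is the main reason to prefer the stream over a single data frame.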

Examples

if (FALSE) { # \dontrun{
# Basic usage
stream <- vcf_open_arrow("variants.vcf.gz")

# Read batches
while (!is.null(batch <- stream$get_next())) {
  # Process batch...
  print(nanoarrow::convert_array(batch))
}

# With region filter
stream <- vcf_open_arrow("variants.vcf.gz", region = "chr1:1-1000000")

# With custom index file (useful for presigned URLs or non-standard locations)
stream <- vcf_open_arrow("variants.vcf.gz", index = "custom_path.tbi", region = "chr1")

# Convert to data frame
df <- vcf_to_arrow("variants.vcf.gz", as = "data.frame")

# Write to parquet (uses DuckDB, no arrow package needed)
vcf_to_parquet_arrow("variants.vcf.gz", "variants.parquet")
} # }