Skip to contents

CRAN StatusR-universe version

Rduckhts provides an R interface to a DuckDB HTS (High Throughput Sequencing) file reader extension. This enables reading common bioinformatics file formats such as VCF/BCF, SAM/BAM/CRAM, FASTA, FASTQ, GFF, GTF, and tabix-indexed files directly from R using SQL queries via duckhts.

How it works

Following RBCFTools, tables are created and returned instead of data frames. VCF/BCF, SAM/BAM/CRAM, FASTA, FASTQ, GFF, GTF, and tabix formats can be queried. We support region queries for indexed files, and we target Linux, macOS, and RTools. htslib 1.23 is bundled so build dependencies stay minimal. The extensnion is built by adapting the generic extension infracstructure by using only makefiles unlike the submitted communtity extension duckhts.

Installation

The package can be installed from r-universe

# Install 'Rduckhts' in R:
install.packages('Rduckhts', repos = c('https://rgenomicsetl.r-universe.dev', 'https://cloud.r-project.org'))
# When on CRAN
install.packages("Rduckhts")

System Requirements

Installation requires htslib dependencies such ad zlib and libbz2, and optionally for full functionally liblzma, libcurl, and openssl. The package requires GNU make. On Windows’s Rtools, htslib plugins are not enabled.

Quick Start

The extension is loaded with rduckhts_load(con, extension_path = NULL). The wrappers break down into:

Start with one reader, then materialize tables and compose the richer helpers around them.

library(DBI)
library(duckdb)
library(Rduckhts)


fasta_path <- system.file("extdata", "ce.fa", package = "Rduckhts")
fastq_r1 <- system.file("extdata", "r1.fq", package = "Rduckhts")
fastq_r2 <- system.file("extdata", "r2.fq", package = "Rduckhts")
con <- dbConnect(duckdb::duckdb(config = list(allow_unsigned_extensions = "true")))
rduckhts_load(con)
#> [1] TRUE

rduckhts_fasta(con, "sequences", fasta_path, overwrite = TRUE)
rduckhts_fastq(con, "reads", fastq_r1, mate_path = fastq_r2, overwrite = TRUE)

dbGetQuery(con, "SELECT COUNT(*) AS n FROM sequences")
#>   n
#> 1 7
dbGetQuery(con, "SELECT COUNT(*) AS n FROM reads")
#>    n
#> 1 10

Function Catalog

Use rduckhts_functions() inside R to inspect the generated extension catalog.

Extension Function Catalog

This section is generated from functions.yaml.

Readers

Function Kind Returns R helper Description
read_bcf table table rduckhts_bcf Read VCF and BCF variant data with typed INFO, FORMAT, typed CSQ/ANN/BCSQ subfields, optional tidy sample output, and optional bcftools-style CSQ type overrides.
read_bam table table rduckhts_bam Read SAM, BAM, and CRAM alignments with optional typed SAMtags and auxiliary tag maps. Use sequence_encoding := ‘nt16’ to return SEQ as UTINYINT[] and quality_representation := ‘phred’ to return QUAL as UTINYINT[] instead of VARCHAR.
read_fasta table table rduckhts_fasta Read FASTA records or indexed FASTA regions as sequence rows. Use sequence_encoding := ‘nt16’ to return SEQUENCE as UTINYINT[] (htslib nt16 4-bit codes) instead of VARCHAR.
read_bed table table rduckhts_bed Read BED3-BED12 interval files with canonical typed columns and optional tabix-backed region filtering.
fasta_nuc table table rduckhts_fasta_nuc Compute bedtools nuc-style nucleotide composition for supplied BED intervals or generated fixed-width bins over a FASTA reference.
read_fastq table table rduckhts_fastq Read single-end, paired-end, or interleaved FASTQ files with optional legacy quality decoding. By default, FASTQ qualities are interpreted as modern Phred+33 input. Use sequence_encoding := ‘nt16’ to return SEQUENCE as UTINYINT[] and quality_representation := ‘phred’ to return QUALITY as UTINYINT[] instead of VARCHAR. input_quality_encoding accepts ‘phred33’, ‘auto’, ‘phred64’, or ‘solexa64’.
read_gff table table rduckhts_gff Read GFF annotations with optional parsed attribute maps and indexed region filtering.
read_gtf table table rduckhts_gtf Read GTF annotations with optional parsed attribute maps and indexed region filtering.
read_tabix table table rduckhts_tabix Read generic tabix-indexed text data with optional header handling and type inference.
fasta_index table table rduckhts_fasta_index Build a FASTA index (.fai) and return a single row with columns success (BOOLEAN) and index_path (VARCHAR).

Metadata

Function Kind Returns R helper Description
detect_quality_encoding table table rduckhts_detect_quality_encoding Inspect a FASTQ file’s observed quality ASCII range and report compatible legacy encodings with a heuristic guessed encoding.
read_hts_header table table rduckhts_hts_header Inspect HTS headers in parsed, raw, or combined form across supported formats.
read_hts_index table table rduckhts_hts_index Inspect high-level HTS index metadata such as sequence names and mapped counts.
read_hts_index_spans table_macro table rduckhts_hts_index_spans Expand index metadata into span and chunk rows suitable for low-level index inspection.
read_hts_index_raw table_macro table rduckhts_hts_index_raw Return the raw on-disk HTS index blob together with basic identifying metadata.

Compression

Function Kind Returns R helper Description
bgzip table table rduckhts_bgzip Compress a plain file to BGZF and return the created output path and byte counts.
bgunzip table table rduckhts_bgunzip Decompress a BGZF-compressed file and return the created output path and byte counts.

Indexing

Function Kind Returns R helper Description
bam_index table table rduckhts_bam_index Build a BAM or CRAM index and report the written index path and format.
bcf_index table table rduckhts_bcf_index Build a TBI or CSI index for a VCF or BCF file and report the written index path and format.
tabix_index table table rduckhts_tabix_index Build a tabix index for a BGZF-compressed text file using a preset or explicit coordinate columns.

Variants

Function Kind Returns R helper Description
bcftools_liftover scalar STRUCT rduckhts_liftover Row-oriented liftover kernel intended to mirror bcftools +liftover semantics as closely as possible while returning one STRUCT per input row with fields: src_chrom, src_pos, src_ref, src_alt, dest_chrom, dest_pos, dest_end, dest_ref, dest_alt, mapped, reverse_complemented, swap, reject_reason, and note. Set no_left_align := true to skip post-liftover left-alignment of lifted indels (mirrors –no-left-align in bcftools +liftover).
duckdb_liftover table_macro table rduckhts_liftover DuckDB-specific wrapper over bcftools_liftover that takes either a table name or a derived-table expression plus column-name strings for chrom/pos/ref/alt and returns the lifted table. The no_left_align parameter mirrors –no-left-align in bcftools +liftover.
bcftools_score table table rduckhts_score Compute polygenic scores from one genotype BCF/VCF and one summary-statistics file with bcftools +score-compatible GT/DS/HDS/AP/GP/AS dosage semantics, sample subsetting, and region/target/FILTER-string controls.
bcftools_munge_row scalar STRUCT Normalize one summary-statistics row into GWAS-VCF-style fields (chrom/pos/ref/alt/effect metrics), resolving REF/ALT orientation against a FASTA reference and applying swap-aware sign/frequency/count transforms. The output flag alleles_swapped means REF/ALT orientation was swapped to match the FASTA reference.
duckdb_munge table_macro table rduckhts_munge DuckDB macro wrapper over bcftools_munge_row that maps source columns (via preset or explicit map) and returns normalized GWAS-VCF-style rows with lean outputs and explicit alleles_swapped semantics. Output columns: chrom, pos, id, ref, alt, alleles_swapped, filter, ns, ez, nc, es, se, lp, af, ac, ne (16 columns). For METAL meta-analysis output with SI/I2/CQ/ED columns, use duckdb_munge_metal.
duckdb_munge_metal table_macro table rduckhts_munge Extended munge macro with METAL meta-analysis output columns. Same as duckdb_munge but additionally emits: si (imputation info, from INFO input), i2 (Cochran’s I² heterogeneity, from HET_I2), cq (Cochran’s Q -log10 p, from HET_LP or -log10(HET_P)), and ed (effect direction string, from DIRE; +/- flipped on allele swap). The R wrapper rduckhts_munge() auto-dispatches to this macro when metal keys (INFO, HET_I2, HET_P, HET_LP, DIRE) are present in the resolved column map.

Sequence UDFs

Function Kind Returns R helper Description
seq_revcomp scalar VARCHAR Compute the reverse complement of a DNA sequence using A, C, G, T, and N bases.
seq_canonical scalar VARCHAR Return the lexicographically smaller of a sequence and its reverse complement.
seq_hash_2bit scalar UBIGINT Encode a short DNA sequence as a 2-bit unsigned integer hash.
seq_encode_4bit scalar UTINYINT[] Encode an IUPAC DNA sequence as a list of 4-bit base codes, preserving ambiguity symbols including N.
seq_decode_4bit scalar VARCHAR Decode a list of 4-bit IUPAC DNA base codes back into a sequence string.
seq_gc_content scalar DOUBLE Compute GC fraction for a DNA sequence as a value between 0 and 1.
seq_kmers table table Expand a sequence into positional k-mers with optional canonicalization.

SAM Flag UDFs

Function Kind Returns R helper Description
sam_flag_bits scalar STRUCT Decode a SAM flag into a struct of boolean bit fields using explicit SAM-oriented names such as is_paired, is_proper_pair, is_next_segment_unmapped, and is_supplementary.
sam_flag_has scalar BOOLEAN Test whether any bits from the provided SAM flag mask are set in a flag value.
is_forward_aligned scalar BOOLEAN Test whether a mapped segment is aligned to the forward strand. Returns NULL for unmapped segments because SAM flag 0x10 does not define genomic strand when 0x4 is set.
is_paired scalar BOOLEAN Test whether the SAM flag indicates that the template has multiple segments in sequencing (0x1).
is_proper_pair scalar BOOLEAN Test whether the SAM flag indicates that each segment is properly aligned according to the aligner (0x2).
is_unmapped scalar BOOLEAN Test whether the read itself is unmapped according to the SAM flag.
is_next_segment_unmapped scalar BOOLEAN Test whether the next segment in the template is flagged as unmapped (0x8).
is_reverse_complemented scalar BOOLEAN Test whether SEQ is stored reverse complemented (0x10); for mapped reads this corresponds to reverse-strand alignment.
is_next_segment_reverse_complemented scalar BOOLEAN Test whether SEQ of the next segment in the template is stored reverse complemented (0x20).
is_first_segment scalar BOOLEAN Test whether the read is marked as the first segment in the template.
is_last_segment scalar BOOLEAN Test whether the read is marked as the last segment in the template.
is_secondary scalar BOOLEAN Test whether the alignment is marked as secondary.
is_qc_fail scalar BOOLEAN Test whether the read failed vendor or pipeline quality checks.
is_duplicate scalar BOOLEAN Test whether the alignment is flagged as a duplicate.
is_supplementary scalar BOOLEAN Test whether the alignment is marked as supplementary.

CIGAR Utils

Function Kind Returns R helper Description
cigar_has_soft_clip scalar BOOLEAN Test whether a CIGAR string contains any soft-clipped segment (S).
cigar_has_hard_clip scalar BOOLEAN Test whether a CIGAR string contains any hard-clipped segment (H).
cigar_left_soft_clip scalar BIGINT Return the left-end soft-clipped length from a CIGAR string, or zero if the alignment does not start with S.
cigar_right_soft_clip scalar BIGINT Return the right-end soft-clipped length from a CIGAR string, or zero if the alignment does not end with S.
cigar_query_length scalar BIGINT Return the query-consuming length from a CIGAR string, counting M, I, S, =, and X.
cigar_aligned_query_length scalar BIGINT Return the aligned query length from a CIGAR string, counting M, =, and X but excluding clips and insertions.
cigar_reference_length scalar BIGINT Return the reference-consuming length from a CIGAR string, counting M, D, N, =, and X.
cigar_has_op scalar BOOLEAN Test whether a CIGAR string contains at least one instance of the requested operator.

Common Workflows

Region-aware variant and alignment queries

bcf_path <- system.file("extdata", "vcf_file.bcf", package = "Rduckhts")
bcf_index_path <- system.file("extdata", "vcf_file.bcf.csi", package = "Rduckhts")
bam_path <- system.file("extdata", "range.bam", package = "Rduckhts")
bam_index_path <- system.file("extdata", "range.bam.bai", package = "Rduckhts")

rduckhts_bcf(
  con, "variants_idx", bcf_path,
  region = "1:3000150-3000151",
  index_path = bcf_index_path,
  overwrite = TRUE
)
dbGetQuery(con, "SELECT CHROM, POS, REF, ALT FROM variants_idx")
#>   CHROM     POS REF ALT
#> 1     1 3000150   C   T
#> 2     1 3000151   C   T

rduckhts_bam(
  con, "bam_idx_reads", bam_path,
  region = "CHROMOSOME_I:1-1000",
  index_path = bam_index_path,
  overwrite = TRUE
)
dbGetQuery(con, "SELECT QNAME, FLAG, POS, MAPQ FROM bam_idx_reads")
#>                           QNAME FLAG POS MAPQ
#> 1 HS18_09653:4:1315:19857:61712  145 914   23
#> 2 HS18_09653:4:1308:11522:27107  161 934    0

Interval + reference helpers

bed_path <- system.file("extdata", "targets.bed", package = "Rduckhts")
fai_path <- tempfile("duckhts_readme_", fileext = ".fai")
rduckhts_fasta_index(con, fasta_path, index_path = fai_path)
#>   success                                        index_path
#> 1    TRUE /tmp/Rtmp6jhy2i/duckhts_readme_3bf93a50fe717b.fai

rduckhts_bed(con, "targets", bed_path, overwrite = TRUE)
dbGetQuery(con, "SELECT chrom, start, \"end\", name, block_count FROM targets")
#>            chrom start end    name block_count
#> 1   CHROMOSOME_I     0  10 target1           2
#> 2   CHROMOSOME_I    10  20 target2           1
#> 3  CHROMOSOME_II     0   8 target3          NA
#> 4 CHROMOSOME_III     0   6 target4           1

rduckhts_fasta_nuc(con, fasta_path, bed_path = bed_path, index_path = fai_path)
#>            chrom start end pct_at pct_gc num_a num_c num_g num_t num_n
#> 1   CHROMOSOME_I     0  10  0.400  0.600     2     4     2     2     0
#> 2   CHROMOSOME_I    10  20  0.500  0.500     4     3     2     1     0
#> 3  CHROMOSOME_II     0   8  0.375  0.625     2     4     1     1     0
#> 4 CHROMOSOME_III     0   6  0.500  0.500     2     2     1     1     0
#>   num_other seq_len
#> 1         0      10
#> 2         0      10
#> 3         0       8
#> 4         0       6
rduckhts_fasta_nuc(con, fasta_path, bin_width = 10, region = "CHROMOSOME_I:1-20", index_path = fai_path)
#>          chrom start end pct_at pct_gc num_a num_c num_g num_t num_n num_other
#> 1 CHROMOSOME_I     0  10    0.4    0.6     2     4     2     2     0         0
#> 2 CHROMOSOME_I    10  20    0.5    0.5     4     3     2     1     0         0
#>   seq_len
#> 1      10
#> 2      10
unlink(fai_path)

Liftover score-style rows

lift_src <- tempfile("duckhts_liftover_src_", fileext = ".fa")
lift_dst <- tempfile("duckhts_liftover_dst_", fileext = ".fa")
lift_chain <- tempfile("duckhts_liftover_", fileext = ".chain")

writeLines(c(
  ">chrF",
  "ACGTACGTAA",
  ">chrR",
  "AACCGGTTAA"
), lift_src)
writeLines(c(
  ">chrLiftF",
  "ACGTACGTAA",
  ">chrLiftR",
  "TTAACCGGTT"
), lift_dst)
writeLines(c(
  "chain 100 chrF 10 + 0 10 chrLiftF 10 + 0 10 1",
  "10",
  "",
  "chain 100 chrR 10 + 0 10 chrLiftR 10 - 0 10 2",
  "10"
), lift_chain)

rduckhts_fasta_index(con, lift_src, index_path = paste0(lift_src, ".fai"))
#>   success                                                 index_path
#> 1    TRUE /tmp/Rtmp6jhy2i/duckhts_liftover_src_3bf93a13f914ae.fa.fai
rduckhts_fasta_index(con, lift_dst, index_path = paste0(lift_dst, ".fai"))
#>   success                                                 index_path
#> 1    TRUE /tmp/Rtmp6jhy2i/duckhts_liftover_dst_3bf93a70dc46e7.fa.fai

lifted <- rduckhts_liftover(
  con,
  query = paste(
    "SELECT * FROM (VALUES",
    "('chrF', 2, 'C', 'T'),",
    "('chrR', 2, 'A', 'G'),",
    "('chrF', 11, 'A', 'T')",
    ") AS t(chrom, pos, ref, alt)"
  ),
  chain_path = lift_chain,
  dst_fasta_ref = lift_dst,
  ref_col = "ref",
  alt_col = "alt",
  src_fasta_ref = lift_src
)

lifted[, c(
  "src_chrom", "src_pos", "dest_chrom", "dest_pos",
  "dest_ref", "dest_alt", "mapped", "reverse_complemented",
  "reject_reason", "note"
)]
#>   src_chrom src_pos dest_chrom dest_pos dest_ref dest_alt mapped
#> 1      chrF       2   chrLiftF        2        C        T   TRUE
#> 2      chrR       2   chrLiftR        9        T        C   TRUE
#> 3      chrF      11   chrLiftF       10        A    AA,AT   TRUE
#>   reverse_complemented reject_reason   note
#> 1                FALSE          <NA>   <NA>
#> 2                 TRUE          <NA>   <NA>
#> 3                FALSE          <NA> Padded

unlink(c(lift_src, paste0(lift_src, ".fai"), lift_dst, paste0(lift_dst, ".fai"), lift_chain))

Munge score-style rows

munge_fasta <- tempfile("duckhts_munge_", fileext = ".fa")
writeLines(c(
  ">chrF",
  "ACGTACGTAA"
), munge_fasta)
rduckhts_fasta_index(con, munge_fasta, index_path = paste0(munge_fasta, ".fai"))
#>   success                                          index_path
#> 1    TRUE /tmp/Rtmp6jhy2i/duckhts_munge_3bf93a1d442d89.fa.fai

munge_out <- rduckhts_munge(
  con,
  query = paste(
    "SELECT * FROM (VALUES",
    "('rs1', 2, 'chrF', 'A', 'C', 0.01, 1.10, 0.20, 0.98, 0.10, 0.01, 1000),",
    "('rs2', 2, 'chrF', 'C', 'A', 0.02, 0.90, -0.20, 0.98, 0.90, 0.01, 1000)",
    ") AS t(SNP, BP, CHR, A1, A2, P, OR_VALUE, BETA, INFO, FRQ, SE, N)"
  ),
  fasta_ref = munge_fasta,
  column_map = c(
    SNP = "SNP", BP = "BP", CHR = "CHR", A1 = "A1", A2 = "A2",
    P = "P", OR = "OR_VALUE", BETA = "BETA", INFO = "INFO", FRQ = "FRQ", SE = "SE", N = "N"
  )
)

munge_out[, c("chrom", "pos", "id", "ref", "alt", "alleles_swapped", "filter", "af", "es", "ns")]
#>   chrom pos  id ref alt alleles_swapped filter  af  es   ns
#> 1  chrF   2 rs2   C   A            TRUE   <NA> 0.1 0.2 1000
#> 2  chrF   2 rs1   C   A           FALSE   <NA> 0.1 0.2 1000

unlink(c(munge_fasta, paste0(munge_fasta, ".fai")))

Polygenic risk scoring

rduckhts_score() computes per-sample polygenic risk scores (PRS) from a genotype VCF/BCF and a GWAS summary statistics file, wrapping bcftools_score.

vcf_path    <- system.file("extdata", "score_input.vcf",        package = "Rduckhts")
dosage_path <- system.file("extdata", "score_dosage.vcf",       package = "Rduckhts")
sumf_path   <- system.file("extdata", "score_summary.tsv",      package = "Rduckhts")
gwas_path   <- system.file("extdata", "score_gwas_summary.vcf", package = "Rduckhts")

# Hard-call (GT) PRS with PLINK-format summary statistics
# S1: 0×0.5 + 1×(−0.2) + 2×1.0 = 1.8
# S2: 1×0.5 + 2×(−0.2) + 0×1.0 = 0.1
gt_prs <- rduckhts_score(con, vcf_path, sumf_path, use = "GT", columns = "PLINK")
gt_prs[, c("SAMPLE", "score_summary")]
#>   SAMPLE score_summary
#> 1     S1    1.80000000
#> 2     S2    0.09999999

# Dosage-based PRS (DS field) for imputed genotypes
# S1: 0.1×0.5 + 0.8×(−0.2) + 1.8×1.0 = 1.69
# S2: 1.0×0.5 + 1.9×(−0.2) + 0.2×1.0 = 0.32
ds_prs <- rduckhts_score(con, dosage_path, sumf_path, use = "DS", columns = "PLINK")
ds_prs[, c("SAMPLE", "score_summary")]
#>   SAMPLE score_summary
#> 1     S1          1.69
#> 2     S2          0.32

# GWAS-VCF multi-PRS: each FORMAT/ES sample column becomes a separate PRS track
gwas_prs <- rduckhts_score(con, vcf_path, gwas_path, use = "GT")
gwas_prs[, c("SAMPLE", "PRS_A", "PRS_B")]
#>   SAMPLE      PRS_A PRS_B
#> 1     S1 1.80000000   1.0
#> 2     S2 0.09999999   0.3

Compression + tabix round-trips

tmp_bed <- tempfile("duckhts_targets_", fileext = ".bed")
tmp_bgz <- paste0(tmp_bed, ".gz")
tmp_tbi <- paste0(tmp_bgz, ".tbi")
writeLines(c("chr1\t0\t10\ta", "chr1\t10\t20\tb"), tmp_bed)

rduckhts_bgzip(con, tmp_bed, output_path = tmp_bgz, keep = TRUE, overwrite = TRUE)
#>   success                                           output_path bytes_in
#> 1    TRUE /tmp/Rtmp6jhy2i/duckhts_targets_3bf93a55ef740e.bed.gz       25
#>   bytes_out
#> 1        84
rduckhts_tabix_index(con, tmp_bgz, preset = "bed", index_path = tmp_tbi, threads = 1)
#>   success                                                index_path
#> 1    TRUE /tmp/Rtmp6jhy2i/duckhts_targets_3bf93a55ef740e.bed.gz.tbi
#>   index_format
#> 1          TBI
rduckhts_bed(con, "targets_idx", tmp_bgz, region = "chr1:1-20", index_path = tmp_tbi, overwrite = TRUE)
dbGetQuery(con, "SELECT * FROM targets_idx")
#>   chrom start end name score strand thick_start thick_end item_rgb block_count
#> 1  chr1     0  10    a  <NA>   <NA>          NA        NA     <NA>          NA
#> 2  chr1    10  20    b  <NA>   <NA>          NA        NA     <NA>          NA
#>   block_sizes block_starts extra
#> 1        <NA>         <NA>  <NA>
#> 2        <NA>         <NA>  <NA>

unlink(c(tmp_bed, tmp_bgz, tmp_tbi))

Sequence UDFs

The extension also exposes sequence utility UDFs directly in DuckDB SQL, including 4-bit IUPAC DNA encode/decode helpers. These can be applied to SEQUENCE columns from FASTA and FASTQ scans.

dbGetQuery(
  con,
  "SELECT
     NAME,
     seq_hash_2bit(substr(SEQUENCE, 1, 12)) AS hash_2bit_prefix,
     seq_encode_4bit(substr(SEQUENCE, 1, 16)) AS codes,
     seq_decode_4bit(seq_encode_4bit(substr(SEQUENCE, 1, 16))) AS roundtrip
   FROM sequences
   LIMIT 2"
)
#>            NAME hash_2bit_prefix                                          codes
#> 1  CHROMOSOME_I          9898352 4, 2, 2, 8, 1, 1, 4, 2, 2, 8, 1, 1, 4, 2, 2, 8
#> 2 CHROMOSOME_II          6038978 2, 2, 8, 1, 1, 4, 2, 2, 8, 1, 1, 4, 2, 2, 8, 1
#>          roundtrip
#> 1 GCCTAAGCCTAAGCCT
#> 2 CCTAAGCCTAAGCCTA

dbGetQuery(
  con,
  "SELECT
     NAME,
     MATE,
     seq_encode_4bit(substr(SEQUENCE, 1, 12)) AS codes,
     seq_decode_4bit(seq_encode_4bit(substr(SEQUENCE, 1, 12))) AS roundtrip
   FROM reads
   LIMIT 2"
)
#>                              NAME MATE                              codes
#> 1 HS25_09827:2:1201:1505:59795#49    1 2, 2, 4, 8, 8, 1, 4, 1, 4, 2, 1, 8
#> 2 HS25_09827:2:1201:1505:59795#49    2 1, 1, 4, 4, 1, 1, 1, 4, 1, 1, 4, 4
#>      roundtrip
#> 1 CCGTTAGAGCAT
#> 2 AAGGAAAGAAGG

FASTA region queries

read_fasta supports indexed region queries via rduckhts_fasta(..., region = ...).

fai_path <- tempfile("duckhts_readme_", fileext = ".fai")
fai_info <- rduckhts_fasta_index(con, fasta_path, index_path = fai_path)
fai_info
#>   success                                        index_path
#> 1    TRUE /tmp/Rtmp6jhy2i/duckhts_readme_3bf93a16f0ce0c.fai

rduckhts_fasta(
  con, "fasta_region", fasta_path,
  region = "CHROMOSOME_I:1-25",
  overwrite = TRUE
)
dbGetQuery(con, "SELECT NAME, length(SEQUENCE) AS n FROM fasta_region")
#>           NAME  n
#> 1 CHROMOSOME_I 25
unlink(fai_path)

Examples

Region Queries

Region queries can use implicit sidecar indexes or an explicit index_path for custom index names/locations.

bcf_path <- system.file("extdata", "vcf_file.bcf", package = "Rduckhts")
bcf_index_path <- system.file("extdata", "vcf_file.bcf.csi", package = "Rduckhts")
rduckhts_bcf(con, "variants", bcf_path, overwrite = TRUE)
variants <- dbGetQuery(con, "SELECT * FROM variants LIMIT 5")
variants
#>   CHROM     POS    ID  REF  ALT  QUAL FILTER INFO_TEST   INFO_DP4 INFO_AC
#> 1     1 3000150  <NA>    C    T  59.2   PASS        NA       NULL       2
#> 2     1 3000151  <NA>    C    T  59.2   PASS        NA       NULL       2
#> 3     1 3062915  id3D GTTT    G  12.9    q10        NA 1, 2, 3, 4       2
#> 4     1 3062915 idSNP    G T, C  12.6   test         5 1, 2, 3, 4    1, 1
#> 5     1 3106154  <NA> CAAA    C 342.0   PASS        NA       NULL       2
#>   INFO_AN INFO_INDEL INFO_STR FORMAT_TT_A FORMAT_GT_A FORMAT_GQ_A FORMAT_DP_A
#> 1       4      FALSE     <NA>        NULL         0/1         245          NA
#> 2       4      FALSE     <NA>        NULL         0/1         245          32
#> 3       4       TRUE     test        NULL         0/1         409          35
#> 4       3      FALSE     <NA>        0, 1         0/1         409          35
#> 5       4      FALSE     <NA>        NULL         0/1         245          32
#>                  FORMAT_GL_A FORMAT_TT_B FORMAT_GT_B FORMAT_GQ_B FORMAT_DP_B
#> 1                       NULL        NULL         0/1         245          NA
#> 2                       NULL        NULL         0/1         245          32
#> 3               -20, -5, -20        NULL         0/1         409          35
#> 4 -20, -5, -20, -20, -5, -20        0, 1           2         409          35
#> 5                       NULL        NULL         0/1         245          32
#>    FORMAT_GL_B
#> 1         NULL
#> 2         NULL
#> 3 -20, -5, -20
#> 4 -20, -5, -20
#> 5         NULL

rduckhts_bcf(
  con, "variants_idx", bcf_path,
  region = "1:3000150-3000151",
  index_path = bcf_index_path,
  overwrite = TRUE
)
dbGetQuery(con, "SELECT count(*) AS n FROM variants_idx")
#>   n
#> 1 2

# Span-oriented index view from the same file
index_spans_preview <- rduckhts_hts_index_spans(con, bcf_path, index_path = bcf_index_path)
head(index_spans_preview[, c("seqname", "tid", "index_type", "chunk_beg_vo", "chunk_end_vo")], 5)
#>   seqname tid index_type chunk_beg_vo chunk_end_vo
#> 1       1   0        CSI           NA           NA
#> 2       2   1        CSI           NA           NA
#> 3       3   2        CSI           NA           NA
#> 4       4   3        CSI           NA           NA

Remote VCF on S3

S3 files can be query when htslib is built with plugins enable. This is not the case on RTools

# Enable htslib plugins for remote access (S3/GCS/HTTP)
setup_hts_env()

# Example S3 URL (1000 Genomes cohort VCF)
s3_base <- "s3://1000genomes-dragen-v3.7.6/data/cohorts/"
s3_path <- "gvcf-genotyper-dragen-3.7.6/hg19/3202-samples-cohort/"
s3_vcf_file <- "3202_samples_cohort_gg_chr22.vcf.gz"
s3_vcf_uri <- paste0(s3_base, s3_path, s3_vcf_file)

rduckhts_bcf(con, "s3_variants", s3_vcf_uri, region = "chr22:16050000-16050500", overwrite = TRUE)
dbGetQuery(con, "SELECT CHROM, COUNT(*) AS n FROM s3_variants GROUP BY CHROM")
#>   CHROM  n
#> 1 chr22 11

FASTQ files

Three modes for fastq files, single, paired and interleaved

r1 <- system.file("extdata", "r1.fq", package = "Rduckhts")
r2 <- system.file("extdata", "r2.fq", package = "Rduckhts")
interleaved <- system.file("extdata", "interleaved.fq", package = "Rduckhts")
rduckhts_fastq(con, "paired_reads", r1, mate_path = r2, overwrite = TRUE)
rduckhts_fastq(con, "interleaved_reads", interleaved, interleaved = TRUE, overwrite = TRUE)
pairs <- dbGetQuery(con, "SELECT * FROM paired_reads WHERE MATE = 1 LIMIT 5")
pairs
#>                              NAME DESCRIPTION
#> 1 HS25_09827:2:1201:1505:59795#49        <NA>
#> 2 HS25_09827:2:1201:1559:70726#49        <NA>
#> 3 HS25_09827:2:1201:1564:39627#49        <NA>
#> 4 HS25_09827:2:1201:1565:91731#49        <NA>
#> 5 HS25_09827:2:1201:1624:69925#49        <NA>
#>                                                                                               SEQUENCE
#> 1 CCGTTAGAGCATTTGTTGAAAATGCTTTCCTTGCTCCATGTGATGACTCTGGTGCCCTTGTCAAAAGCCAGCTGGGCCTATTCGTGTGGGTCTGTTTCTG
#> 2 TTGTTAAAATGACCATACCCAAAGTGATCTACAGACTCAATACAATTTCTATTGAAATACCAATCACACTCTTCACAGAACTAGAAAAACAGTTCTAAAA
#> 3 ACGCGGCAATCCAATGTGTGAGTTGAGAAGCGGTGAGGAGGGAATCCTAATTTTATGAGCAGGTCAGGACCGTGGGAGATACCTGACACCTGAGATGGTA
#> 4 GACATGCCATAACATTCATGTTTTATGTGTACAAGTCAATGAATTTTAGTATATTTACAGAGTTGTATGACTGTCTCCACAATCTAATTTTAGGTTTCCA
#> 5 GCCAGCCTCCTTCTCAATGGTCTTTTTAAACATTATATGAAAACCAGACATTTACATTTGATTTCTTTTTCAATACTATACAGTTCTAAGAGAAAAAACA
#>                                                                                                QUALITY
#> 1 CABCFGDEEFFEFHGHGGFFGDIGIJFIFHHGHEIFGHBCGHDIFBE9GIAICGGICFIBFGGHGDGGGHE?GIGDFGGHEGIEJG>;FG<GGHACEFGH
#> 2 CABEFGFFGFHGGGGJGGFFGKIHHJFIEHHHGIEGGEHJGHDHFGHIGICIJEFIFGIF8GGHKFHGGFEI6GGGFIGHGGIE>EFCFHGGGHEJEAJE
#> 3 BACCFGBFGFHGGJGHGGFEGHIGIJHFEH:HHEHGHHBGGH9IAGHGFHIFJFFAFGIFDIGHKEIG<C>F,CGD66?7EFI5EEG>EGGGGD5=HH6E
#> 4 CABFFGFFJFHEGEGJGGDG?FIGHHHBGHHHGIIGHGHGGHDGHFHIDFCIKEGIFHGGII9HFFGGGEEIGGEEHGGEEGDEHFH>FGGGGHAFAHGE
#> 5 CABEFGFGIFGGGJGHGGFH?FDHGHDHGHEHHJCGHHFHDHDHFGHIGHIFFHGHFGGGI9GHF@IGGH;FICGEFEIHGGIEEFC:DEGGGBDJHHFF
#>   MATE                         PAIR_ID
#> 1    1 HS25_09827:2:1201:1505:59795#49
#> 2    1 HS25_09827:2:1201:1559:70726#49
#> 3    1 HS25_09827:2:1201:1564:39627#49
#> 4    1 HS25_09827:2:1201:1565:91731#49
#> 5    1 HS25_09827:2:1201:1624:69925#49

FASTQ quality decoding and per-position histograms

FASTQ quality handling has two separate knobs:

  • input_quality_encoding controls how incoming FASTQ ASCII is decoded. The default is modern phred33; use phred64, solexa64, or auto only for legacy data.
  • quality_representation controls how qualities are returned to DuckDB: canonical Phred+33 text ("string") or numeric Phred arrays ("phred").

The flow is:

  1. Decode FASTQ ASCII using input_quality_encoding.
  2. Normalize to numeric Phred qualities.
  3. Return either numeric arrays or canonical Phred+33 text.

BAM/CRAM reads skip the text-decoding step because qualities are already stored numerically.

legacy_fastq <- system.file("extdata", "legacy_phred64.fq", package = "Rduckhts")

rduckhts_detect_quality_encoding(con, legacy_fastq)
#>   format observed_ascii_min observed_ascii_max records_sampled
#> 1  fastq                104                104               1
#>       compatible_encodings guessed_encoding is_ambiguous
#> 1 phred33,phred64,solexa64          phred64         TRUE

quality_hist <- dbGetQuery(
  con,
  sprintf(
    "WITH q AS (
       SELECT NAME, QUALITY
       FROM read_fastq('%s', quality_representation := 'phred')
     ),
     expanded AS (
       SELECT
         NAME,
         generate_subscripts(QUALITY, 1) AS pos,
         unnest(QUALITY) AS q
       FROM q
     )
     SELECT pos, q AS phred, count(*) AS n_reads
     FROM expanded
     GROUP BY pos, phred
     ORDER BY pos, phred
     LIMIT 12",
    fastq_r1
  )
)
quality_hist
#>    pos phred n_reads
#> 1    1    33       1
#> 2    1    34       4
#> 3    2    32       5
#> 4    3    33       4
#> 5    3    34       1
#> 6    4    34       2
#> 7    4    36       2
#> 8    4    37       1
#> 9    5    37       5
#> 10   6    38       5
#> 11   7    33       1
#> 12   7    35       1

GFF files

These can be open with or with attributes maps

gff_path <- system.file("extdata", "gff_file.gff.gz", package = "Rduckhts")
rduckhts_gff(con, "genes", gff_path, attributes_map = TRUE, overwrite = TRUE)
gene_annotations <- dbGetQuery(con, "SELECT seqname, start, \"end\" FROM genes WHERE feature = 'gene' LIMIT 5")
gene_annotations
#>   seqname   start     end
#> 1       X 2934816 2964270

BAM/CRAM

When built with htslib codec, CRAM can be opened in addition to BAM files. index_path can also be passed for region scans with non-standard index names.

cram_path <- system.file("extdata", "range.cram", package = "Rduckhts")
ref_path <- system.file("extdata", "ce.fa", package = "Rduckhts")
bam_path <- system.file("extdata", "range.bam", package = "Rduckhts")
bam_index_path <- system.file("extdata", "range.bam.bai", package = "Rduckhts")

rduckhts_bam(con, "cram_reads", cram_path, reference = ref_path, overwrite = TRUE)
cram_reads <- dbGetQuery(con, "SELECT QNAME, FLAG, POS, MAPQ FROM cram_reads LIMIT 5")
cram_reads
#>                           QNAME FLAG  POS MAPQ
#> 1 HS18_09653:4:1315:19857:61712  145  914   23
#> 2 HS18_09653:4:1308:11522:27107  161  934    0
#> 3 HS18_09653:4:2314:14991:85680   83 1020   10
#> 4 HS18_09653:4:2108:14085:93656  147 1122   60
#> 5  HS18_09653:4:1303:4347:38100   83 1137   37

rduckhts_bam(
  con, "bam_idx_reads", bam_path,
  region = "CHROMOSOME_I:1-1000",
  index_path = bam_index_path,
  overwrite = TRUE
)
dbGetQuery(con, "SELECT count(*) AS n FROM bam_idx_reads")
#>   n
#> 1 2

SAMtags + auxiliary tags

Standard SAMtags can be exposed as typed columns, and any remaining tags are available via AUXILIARY_TAGS:

aux_path <- system.file("extdata", "aux_tags.sam.gz", package = "Rduckhts")
rduckhts_bam(con, "aux_reads", aux_path, standard_tags = TRUE, auxiliary_tags = TRUE, overwrite = TRUE)
dbGetQuery(con, "SELECT RG, NM, map_extract(AUXILIARY_TAGS, 'XZ') AS XZ FROM aux_reads LIMIT 1")
#>   RG NM  XZ
#> 1 x1  2 foo

Tabix headers + types

Use header = TRUE to use the first non-meta row as column names, and auto_detect = TRUE / column_types to control column typing:

tabix_header <- system.file("extdata", "header_tabix.tsv.gz", package = "Rduckhts")
tabix_meta <- system.file("extdata", "meta_tabix.tsv.gz", package = "Rduckhts")

rduckhts_tabix(con, "header_tabix", tabix_header, header = TRUE, overwrite = TRUE)
dbGetQuery(con, "SELECT chrom, pos FROM header_tabix LIMIT 2")
#>   chrom pos
#> 1  chr1   1
#> 2  chr1   2

rduckhts_tabix(con, "typed_tabix", tabix_meta, auto_detect = TRUE, overwrite = TRUE)
dbGetQuery(con, "SELECT typeof(column1) AS column1_type FROM typed_tabix LIMIT 1")
#>   column1_type
#> 1       BIGINT

rduckhts_tabix(con, "typed_tabix_explicit", tabix_header,
               header = TRUE,
               column_types = c("VARCHAR", "BIGINT", "VARCHAR"),
               overwrite = TRUE)
dbGetQuery(con, "SELECT pos + 1 AS pos_plus_one FROM typed_tabix_explicit LIMIT 1")
#>   pos_plus_one
#> 1            2

HTS header and index metadata

Use metadata helpers to inspect parsed headers, raw header lines, index summaries, span-oriented index views, and raw index blobs.

header_meta <- rduckhts_hts_header(con, bcf_path)
head(header_meta[, c("record_type", "id", "number", "value_type")], 5)
#>   record_type   id number value_type
#> 1  fileformat <NA>   <NA>       <NA>
#> 2      FILTER PASS   <NA>       <NA>
#> 3        INFO TEST      1    Integer
#> 4      FORMAT   TT      A    Integer
#> 5        INFO  DP4      4    Integer

header_raw <- rduckhts_hts_header(con, bcf_path, mode = "raw")
head(header_raw[, c("idx", "raw")], 5)
#>   idx
#> 1   0
#> 2   1
#> 3   2
#> 4   3
#> 5   4
#>                                                                                                                                                          raw
#> 1                                                                                                                                       ##fileformat=VCFv4.1
#> 2                                                                                                        ##FILTER=<ID=PASS,Description="All filters passed">
#> 3                                                                                           ##INFO=<ID=TEST,Number=1,Type=Integer,Description="Testing Tag">
#> 4 ##FORMAT=<ID=TT,Number=A,Type=Integer,Description="Testing Tag, with commas and \\"escapes\\" and escaped escapes combined with \\\\\\"quotes\\\\\\\\\\"">
#> 5                       ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">

index_meta <- rduckhts_hts_index(con, bcf_path, index_path = bcf_index_path)
head(index_meta[, c("seqname", "mapped", "unmapped", "index_type")], 5)
#>   seqname mapped unmapped index_type
#> 1       1     11        0        CSI
#> 2       2      1        0        CSI
#> 3       3      1        0        CSI
#> 4       4      2        0        CSI

index_spans <- rduckhts_hts_index_spans(con, bcf_path, index_path = bcf_index_path)
head(index_spans[, c("seqname", "tid", "index_type", "chunk_beg_vo", "chunk_end_vo")], 5)
#>   seqname tid index_type chunk_beg_vo chunk_end_vo
#> 1       1   0        CSI           NA           NA
#> 2       2   1        CSI           NA           NA
#> 3       3   2        CSI           NA           NA
#> 4       4   3        CSI           NA           NA

index_raw <- rduckhts_hts_index_raw(con, bcf_path, index_path = bcf_index_path)
head(index_raw, 1)
#> [1] index_type                                                       
#> [2] '/usr/local/lib/R/site-library/Rduckhts/extdata/vcf_file.bcf.csi'
#> [3] raw                                                              
#> <0 rows> (or 0-length row.names)

Remote GTEx tabix example

GTEx eQTL matrices on EBI are tabix-indexed

gtex_url <- "http://ftp.ebi.ac.uk/pub/databases/spot/eQTL/imported/GTEx_V8/ge/Brain_Cerebellar_Hemisphere.tsv.gz" 
rduckhts_tabix(con, "gtex_eqtl", gtex_url, region = "1:11868-14409",
                 header = TRUE, auto_detect = TRUE, overwrite = TRUE)
dbGetQuery(con, "SELECT * FROM gtex_eqtl LIMIT 5")
#>          variant r2    pvalue molecular_trait_object_id molecular_trait_id
#> 1 chr1_13550_G_A NA 0.0204520           ENSG00000188290    ENSG00000188290
#> 2 chr1_13550_G_A NA 0.0303633           ENSG00000230699    ENSG00000230699
#> 3 chr1_13550_G_A NA 0.1057900           ENSG00000177757    ENSG00000177757
#> 4 chr1_13550_G_A NA 0.1617190           ENSG00000241860    ENSG00000241860
#> 5 chr1_13550_G_A NA 0.1919580           ENSG00000198744    ENSG00000198744
#>         maf         gene_id median_tpm      beta       se  an ac chromosome
#> 1 0.0114286 ENSG00000188290  6.3960000  0.633986 0.270285 350  4          1
#> 2 0.0114286 ENSG00000230699  0.0674459 -0.980082 0.447861 350  4          1
#> 3 0.0114286 ENSG00000177757  1.2659000  0.631359 0.387738 350  4          1
#> 4 0.0114286 ENSG00000241860  0.1081970 -0.791695 0.562674 350  4          1
#> 5 0.0114286 ENSG00000198744 21.6284000 -0.592354 0.451705 350  4          1
#>   position ref alt type        rsid
#> 1    13550   G   A  SNP rs554008981
#> 2    13550   G   A  SNP rs554008981
#> 3    13550   G   A  SNP rs554008981
#> 4    13550   G   A  SNP rs554008981
#> 5    13550   G   A  SNP rs554008981
dbDisconnect(con, shutdown = TRUE)

License

GPL-3.