Rduckhts provides an R interface to a DuckDB HTS (High Throughput Sequencing) file reader extension. This enables reading common bioinformatics file formats such as VCF/BCF, SAM/BAM/CRAM, FASTA, FASTQ, GFF, GTF, and tabix-indexed files directly from R using SQL queries via duckhts.
How it works
Following RBCFTools, tables are created and returned instead of data frames. VCF/BCF, SAM/BAM/CRAM, FASTA, FASTQ, GFF, GTF, and tabix formats can be queried. We support region queries for indexed files, and we target Linux, macOS, and RTools. htslib 1.23 is bundled so build dependencies stay minimal. The extensnion is built by adapting the generic extension infracstructure by using only makefiles unlike the submitted communtity extension duckhts.
Installation
The package can be installed from r-universe
# Install 'Rduckhts' in R:
install.packages('Rduckhts', repos = c('https://rgenomicsetl.r-universe.dev', 'https://cloud.r-project.org'))
# When on CRAN
install.packages("Rduckhts")System Requirements
Installation requires htslib dependencies such ad zlib and libbz2, and optionally for full functionally liblzma, libcurl, and openssl. The package requires GNU make. On Windows’s Rtools, htslib plugins are not enabled.
Quick Start
The extension is loaded with rduckhts_load(con, extension_path = NULL). The wrappers break down into:
- readers:
rduckhts_bcf(),rduckhts_bam(),rduckhts_fasta(),rduckhts_fastq(),rduckhts_gff(),rduckhts_gtf(),rduckhts_tabix(),rduckhts_bed() - reference helpers:
rduckhts_fasta_index(),rduckhts_fasta_nuc() - compression/indexing:
rduckhts_bgzip(),rduckhts_bgunzip(),rduckhts_bam_index(),rduckhts_bcf_index(),rduckhts_tabix_index() - metadata helpers:
rduckhts_hts_header(),rduckhts_hts_index(),rduckhts_hts_index_spans(),rduckhts_hts_index_raw()
Start with one reader, then materialize tables and compose the richer helpers around them.
library(DBI)
library(duckdb)
library(Rduckhts)
fasta_path <- system.file("extdata", "ce.fa", package = "Rduckhts")
fastq_r1 <- system.file("extdata", "r1.fq", package = "Rduckhts")
fastq_r2 <- system.file("extdata", "r2.fq", package = "Rduckhts")
con <- dbConnect(duckdb::duckdb(config = list(allow_unsigned_extensions = "true")))
rduckhts_load(con)
#> [1] TRUE
rduckhts_fasta(con, "sequences", fasta_path, overwrite = TRUE)
rduckhts_fastq(con, "reads", fastq_r1, mate_path = fastq_r2, overwrite = TRUE)
dbGetQuery(con, "SELECT COUNT(*) AS n FROM sequences")
#> n
#> 1 7
dbGetQuery(con, "SELECT COUNT(*) AS n FROM reads")
#> n
#> 1 10Function Catalog
Use rduckhts_functions() inside R to inspect the generated extension catalog.
Extension Function Catalog
This section is generated from functions.yaml.
Readers
| Function | Kind | Returns | R helper | Description |
|---|---|---|---|---|
read_bcf |
table | table | rduckhts_bcf |
Read VCF and BCF variant data with typed INFO, FORMAT, typed CSQ/ANN/BCSQ subfields, optional tidy sample output, and optional bcftools-style CSQ type overrides. |
read_bam |
table | table | rduckhts_bam |
Read SAM, BAM, and CRAM alignments with optional typed SAMtags and auxiliary tag maps. Use sequence_encoding := ‘nt16’ to return SEQ as UTINYINT[] and quality_representation := ‘phred’ to return QUAL as UTINYINT[] instead of VARCHAR. |
read_fasta |
table | table | rduckhts_fasta |
Read FASTA records or indexed FASTA regions as sequence rows. Use sequence_encoding := ‘nt16’ to return SEQUENCE as UTINYINT[] (htslib nt16 4-bit codes) instead of VARCHAR. |
read_bed |
table | table | rduckhts_bed |
Read BED3-BED12 interval files with canonical typed columns and optional tabix-backed region filtering. |
fasta_nuc |
table | table | rduckhts_fasta_nuc |
Compute bedtools nuc-style nucleotide composition for supplied BED intervals or generated fixed-width bins over a FASTA reference. |
read_fastq |
table | table | rduckhts_fastq |
Read single-end, paired-end, or interleaved FASTQ files with optional legacy quality decoding. By default, FASTQ qualities are interpreted as modern Phred+33 input. Use sequence_encoding := ‘nt16’ to return SEQUENCE as UTINYINT[] and quality_representation := ‘phred’ to return QUALITY as UTINYINT[] instead of VARCHAR. input_quality_encoding accepts ‘phred33’, ‘auto’, ‘phred64’, or ‘solexa64’. |
read_gff |
table | table | rduckhts_gff |
Read GFF annotations with optional parsed attribute maps and indexed region filtering. |
read_gtf |
table | table | rduckhts_gtf |
Read GTF annotations with optional parsed attribute maps and indexed region filtering. |
read_tabix |
table | table | rduckhts_tabix |
Read generic tabix-indexed text data with optional header handling and type inference. |
fasta_index |
table | table | rduckhts_fasta_index |
Build a FASTA index (.fai) and return a single row with columns success (BOOLEAN) and index_path (VARCHAR). |
Metadata
| Function | Kind | Returns | R helper | Description |
|---|---|---|---|---|
detect_quality_encoding |
table | table | rduckhts_detect_quality_encoding |
Inspect a FASTQ file’s observed quality ASCII range and report compatible legacy encodings with a heuristic guessed encoding. |
read_hts_header |
table | table | rduckhts_hts_header |
Inspect HTS headers in parsed, raw, or combined form across supported formats. |
read_hts_index |
table | table | rduckhts_hts_index |
Inspect high-level HTS index metadata such as sequence names and mapped counts. |
read_hts_index_spans |
table_macro | table | rduckhts_hts_index_spans |
Expand index metadata into span and chunk rows suitable for low-level index inspection. |
read_hts_index_raw |
table_macro | table | rduckhts_hts_index_raw |
Return the raw on-disk HTS index blob together with basic identifying metadata. |
Compression
| Function | Kind | Returns | R helper | Description |
|---|---|---|---|---|
bgzip |
table | table | rduckhts_bgzip |
Compress a plain file to BGZF and return the created output path and byte counts. |
bgunzip |
table | table | rduckhts_bgunzip |
Decompress a BGZF-compressed file and return the created output path and byte counts. |
Indexing
| Function | Kind | Returns | R helper | Description |
|---|---|---|---|---|
bam_index |
table | table | rduckhts_bam_index |
Build a BAM or CRAM index and report the written index path and format. |
bcf_index |
table | table | rduckhts_bcf_index |
Build a TBI or CSI index for a VCF or BCF file and report the written index path and format. |
tabix_index |
table | table | rduckhts_tabix_index |
Build a tabix index for a BGZF-compressed text file using a preset or explicit coordinate columns. |
Variants
| Function | Kind | Returns | R helper | Description |
|---|---|---|---|---|
bcftools_liftover |
scalar | STRUCT | rduckhts_liftover |
Row-oriented liftover kernel intended to mirror bcftools +liftover semantics as closely as possible while returning one STRUCT per input row with fields: src_chrom, src_pos, src_ref, src_alt, dest_chrom, dest_pos, dest_end, dest_ref, dest_alt, mapped, reverse_complemented, swap, reject_reason, and note. Set no_left_align := true to skip post-liftover left-alignment of lifted indels (mirrors –no-left-align in bcftools +liftover). |
duckdb_liftover |
table_macro | table | rduckhts_liftover |
DuckDB-specific wrapper over bcftools_liftover that takes either a table name or a derived-table expression plus column-name strings for chrom/pos/ref/alt and returns the lifted table. The no_left_align parameter mirrors –no-left-align in bcftools +liftover. |
bcftools_score |
table | table | rduckhts_score |
Compute polygenic scores from one genotype BCF/VCF and one summary-statistics file with bcftools +score-compatible GT/DS/HDS/AP/GP/AS dosage semantics, sample subsetting, and region/target/FILTER-string controls. |
bcftools_munge_row |
scalar | STRUCT | Normalize one summary-statistics row into GWAS-VCF-style fields (chrom/pos/ref/alt/effect metrics), resolving REF/ALT orientation against a FASTA reference and applying swap-aware sign/frequency/count transforms. The output flag alleles_swapped means REF/ALT orientation was swapped to match the FASTA reference. |
|
duckdb_munge |
table_macro | table | rduckhts_munge |
DuckDB macro wrapper over bcftools_munge_row that maps source columns (via preset or explicit map) and returns normalized GWAS-VCF-style rows with lean outputs and explicit alleles_swapped semantics. Output columns: chrom, pos, id, ref, alt, alleles_swapped, filter, ns, ez, nc, es, se, lp, af, ac, ne (16 columns). For METAL meta-analysis output with SI/I2/CQ/ED columns, use duckdb_munge_metal. |
duckdb_munge_metal |
table_macro | table | rduckhts_munge |
Extended munge macro with METAL meta-analysis output columns. Same as duckdb_munge but additionally emits: si (imputation info, from INFO input), i2 (Cochran’s I² heterogeneity, from HET_I2), cq (Cochran’s Q -log10 p, from HET_LP or -log10(HET_P)), and ed (effect direction string, from DIRE; +/- flipped on allele swap). The R wrapper rduckhts_munge() auto-dispatches to this macro when metal keys (INFO, HET_I2, HET_P, HET_LP, DIRE) are present in the resolved column map. |
Sequence UDFs
| Function | Kind | Returns | R helper | Description |
|---|---|---|---|---|
seq_revcomp |
scalar | VARCHAR | Compute the reverse complement of a DNA sequence using A, C, G, T, and N bases. | |
seq_canonical |
scalar | VARCHAR | Return the lexicographically smaller of a sequence and its reverse complement. | |
seq_hash_2bit |
scalar | UBIGINT | Encode a short DNA sequence as a 2-bit unsigned integer hash. | |
seq_encode_4bit |
scalar | UTINYINT[] | Encode an IUPAC DNA sequence as a list of 4-bit base codes, preserving ambiguity symbols including N. | |
seq_decode_4bit |
scalar | VARCHAR | Decode a list of 4-bit IUPAC DNA base codes back into a sequence string. | |
seq_gc_content |
scalar | DOUBLE | Compute GC fraction for a DNA sequence as a value between 0 and 1. | |
seq_kmers |
table | table | Expand a sequence into positional k-mers with optional canonicalization. |
SAM Flag UDFs
| Function | Kind | Returns | R helper | Description |
|---|---|---|---|---|
sam_flag_bits |
scalar | STRUCT | Decode a SAM flag into a struct of boolean bit fields using explicit SAM-oriented names such as is_paired, is_proper_pair, is_next_segment_unmapped, and is_supplementary. |
|
sam_flag_has |
scalar | BOOLEAN | Test whether any bits from the provided SAM flag mask are set in a flag value. | |
is_forward_aligned |
scalar | BOOLEAN | Test whether a mapped segment is aligned to the forward strand. Returns NULL for unmapped segments because SAM flag 0x10 does not define genomic strand when 0x4 is set. |
|
is_paired |
scalar | BOOLEAN | Test whether the SAM flag indicates that the template has multiple segments in sequencing (0x1). |
|
is_proper_pair |
scalar | BOOLEAN | Test whether the SAM flag indicates that each segment is properly aligned according to the aligner (0x2). |
|
is_unmapped |
scalar | BOOLEAN | Test whether the read itself is unmapped according to the SAM flag. | |
is_next_segment_unmapped |
scalar | BOOLEAN | Test whether the next segment in the template is flagged as unmapped (0x8). |
|
is_reverse_complemented |
scalar | BOOLEAN | Test whether SEQ is stored reverse complemented (0x10); for mapped reads this corresponds to reverse-strand alignment. |
|
is_next_segment_reverse_complemented |
scalar | BOOLEAN | Test whether SEQ of the next segment in the template is stored reverse complemented (0x20). |
|
is_first_segment |
scalar | BOOLEAN | Test whether the read is marked as the first segment in the template. | |
is_last_segment |
scalar | BOOLEAN | Test whether the read is marked as the last segment in the template. | |
is_secondary |
scalar | BOOLEAN | Test whether the alignment is marked as secondary. | |
is_qc_fail |
scalar | BOOLEAN | Test whether the read failed vendor or pipeline quality checks. | |
is_duplicate |
scalar | BOOLEAN | Test whether the alignment is flagged as a duplicate. | |
is_supplementary |
scalar | BOOLEAN | Test whether the alignment is marked as supplementary. |
CIGAR Utils
| Function | Kind | Returns | R helper | Description |
|---|---|---|---|---|
cigar_has_soft_clip |
scalar | BOOLEAN | Test whether a CIGAR string contains any soft-clipped segment (S). |
|
cigar_has_hard_clip |
scalar | BOOLEAN | Test whether a CIGAR string contains any hard-clipped segment (H). |
|
cigar_left_soft_clip |
scalar | BIGINT | Return the left-end soft-clipped length from a CIGAR string, or zero if the alignment does not start with S. |
|
cigar_right_soft_clip |
scalar | BIGINT | Return the right-end soft-clipped length from a CIGAR string, or zero if the alignment does not end with S. |
|
cigar_query_length |
scalar | BIGINT | Return the query-consuming length from a CIGAR string, counting M, I, S, =, and X. |
|
cigar_aligned_query_length |
scalar | BIGINT | Return the aligned query length from a CIGAR string, counting M, =, and X but excluding clips and insertions. |
|
cigar_reference_length |
scalar | BIGINT | Return the reference-consuming length from a CIGAR string, counting M, D, N, =, and X. |
|
cigar_has_op |
scalar | BOOLEAN | Test whether a CIGAR string contains at least one instance of the requested operator. |
Common Workflows
Region-aware variant and alignment queries
bcf_path <- system.file("extdata", "vcf_file.bcf", package = "Rduckhts")
bcf_index_path <- system.file("extdata", "vcf_file.bcf.csi", package = "Rduckhts")
bam_path <- system.file("extdata", "range.bam", package = "Rduckhts")
bam_index_path <- system.file("extdata", "range.bam.bai", package = "Rduckhts")
rduckhts_bcf(
con, "variants_idx", bcf_path,
region = "1:3000150-3000151",
index_path = bcf_index_path,
overwrite = TRUE
)
dbGetQuery(con, "SELECT CHROM, POS, REF, ALT FROM variants_idx")
#> CHROM POS REF ALT
#> 1 1 3000150 C T
#> 2 1 3000151 C T
rduckhts_bam(
con, "bam_idx_reads", bam_path,
region = "CHROMOSOME_I:1-1000",
index_path = bam_index_path,
overwrite = TRUE
)
dbGetQuery(con, "SELECT QNAME, FLAG, POS, MAPQ FROM bam_idx_reads")
#> QNAME FLAG POS MAPQ
#> 1 HS18_09653:4:1315:19857:61712 145 914 23
#> 2 HS18_09653:4:1308:11522:27107 161 934 0Interval + reference helpers
bed_path <- system.file("extdata", "targets.bed", package = "Rduckhts")
fai_path <- tempfile("duckhts_readme_", fileext = ".fai")
rduckhts_fasta_index(con, fasta_path, index_path = fai_path)
#> success index_path
#> 1 TRUE /tmp/Rtmp6jhy2i/duckhts_readme_3bf93a50fe717b.fai
rduckhts_bed(con, "targets", bed_path, overwrite = TRUE)
dbGetQuery(con, "SELECT chrom, start, \"end\", name, block_count FROM targets")
#> chrom start end name block_count
#> 1 CHROMOSOME_I 0 10 target1 2
#> 2 CHROMOSOME_I 10 20 target2 1
#> 3 CHROMOSOME_II 0 8 target3 NA
#> 4 CHROMOSOME_III 0 6 target4 1
rduckhts_fasta_nuc(con, fasta_path, bed_path = bed_path, index_path = fai_path)
#> chrom start end pct_at pct_gc num_a num_c num_g num_t num_n
#> 1 CHROMOSOME_I 0 10 0.400 0.600 2 4 2 2 0
#> 2 CHROMOSOME_I 10 20 0.500 0.500 4 3 2 1 0
#> 3 CHROMOSOME_II 0 8 0.375 0.625 2 4 1 1 0
#> 4 CHROMOSOME_III 0 6 0.500 0.500 2 2 1 1 0
#> num_other seq_len
#> 1 0 10
#> 2 0 10
#> 3 0 8
#> 4 0 6
rduckhts_fasta_nuc(con, fasta_path, bin_width = 10, region = "CHROMOSOME_I:1-20", index_path = fai_path)
#> chrom start end pct_at pct_gc num_a num_c num_g num_t num_n num_other
#> 1 CHROMOSOME_I 0 10 0.4 0.6 2 4 2 2 0 0
#> 2 CHROMOSOME_I 10 20 0.5 0.5 4 3 2 1 0 0
#> seq_len
#> 1 10
#> 2 10
unlink(fai_path)Liftover score-style rows
lift_src <- tempfile("duckhts_liftover_src_", fileext = ".fa")
lift_dst <- tempfile("duckhts_liftover_dst_", fileext = ".fa")
lift_chain <- tempfile("duckhts_liftover_", fileext = ".chain")
writeLines(c(
">chrF",
"ACGTACGTAA",
">chrR",
"AACCGGTTAA"
), lift_src)
writeLines(c(
">chrLiftF",
"ACGTACGTAA",
">chrLiftR",
"TTAACCGGTT"
), lift_dst)
writeLines(c(
"chain 100 chrF 10 + 0 10 chrLiftF 10 + 0 10 1",
"10",
"",
"chain 100 chrR 10 + 0 10 chrLiftR 10 - 0 10 2",
"10"
), lift_chain)
rduckhts_fasta_index(con, lift_src, index_path = paste0(lift_src, ".fai"))
#> success index_path
#> 1 TRUE /tmp/Rtmp6jhy2i/duckhts_liftover_src_3bf93a13f914ae.fa.fai
rduckhts_fasta_index(con, lift_dst, index_path = paste0(lift_dst, ".fai"))
#> success index_path
#> 1 TRUE /tmp/Rtmp6jhy2i/duckhts_liftover_dst_3bf93a70dc46e7.fa.fai
lifted <- rduckhts_liftover(
con,
query = paste(
"SELECT * FROM (VALUES",
"('chrF', 2, 'C', 'T'),",
"('chrR', 2, 'A', 'G'),",
"('chrF', 11, 'A', 'T')",
") AS t(chrom, pos, ref, alt)"
),
chain_path = lift_chain,
dst_fasta_ref = lift_dst,
ref_col = "ref",
alt_col = "alt",
src_fasta_ref = lift_src
)
lifted[, c(
"src_chrom", "src_pos", "dest_chrom", "dest_pos",
"dest_ref", "dest_alt", "mapped", "reverse_complemented",
"reject_reason", "note"
)]
#> src_chrom src_pos dest_chrom dest_pos dest_ref dest_alt mapped
#> 1 chrF 2 chrLiftF 2 C T TRUE
#> 2 chrR 2 chrLiftR 9 T C TRUE
#> 3 chrF 11 chrLiftF 10 A AA,AT TRUE
#> reverse_complemented reject_reason note
#> 1 FALSE <NA> <NA>
#> 2 TRUE <NA> <NA>
#> 3 FALSE <NA> Padded
unlink(c(lift_src, paste0(lift_src, ".fai"), lift_dst, paste0(lift_dst, ".fai"), lift_chain))Munge score-style rows
munge_fasta <- tempfile("duckhts_munge_", fileext = ".fa")
writeLines(c(
">chrF",
"ACGTACGTAA"
), munge_fasta)
rduckhts_fasta_index(con, munge_fasta, index_path = paste0(munge_fasta, ".fai"))
#> success index_path
#> 1 TRUE /tmp/Rtmp6jhy2i/duckhts_munge_3bf93a1d442d89.fa.fai
munge_out <- rduckhts_munge(
con,
query = paste(
"SELECT * FROM (VALUES",
"('rs1', 2, 'chrF', 'A', 'C', 0.01, 1.10, 0.20, 0.98, 0.10, 0.01, 1000),",
"('rs2', 2, 'chrF', 'C', 'A', 0.02, 0.90, -0.20, 0.98, 0.90, 0.01, 1000)",
") AS t(SNP, BP, CHR, A1, A2, P, OR_VALUE, BETA, INFO, FRQ, SE, N)"
),
fasta_ref = munge_fasta,
column_map = c(
SNP = "SNP", BP = "BP", CHR = "CHR", A1 = "A1", A2 = "A2",
P = "P", OR = "OR_VALUE", BETA = "BETA", INFO = "INFO", FRQ = "FRQ", SE = "SE", N = "N"
)
)
munge_out[, c("chrom", "pos", "id", "ref", "alt", "alleles_swapped", "filter", "af", "es", "ns")]
#> chrom pos id ref alt alleles_swapped filter af es ns
#> 1 chrF 2 rs2 C A TRUE <NA> 0.1 0.2 1000
#> 2 chrF 2 rs1 C A FALSE <NA> 0.1 0.2 1000
unlink(c(munge_fasta, paste0(munge_fasta, ".fai")))Polygenic risk scoring
rduckhts_score() computes per-sample polygenic risk scores (PRS) from a genotype VCF/BCF and a GWAS summary statistics file, wrapping bcftools_score.
vcf_path <- system.file("extdata", "score_input.vcf", package = "Rduckhts")
dosage_path <- system.file("extdata", "score_dosage.vcf", package = "Rduckhts")
sumf_path <- system.file("extdata", "score_summary.tsv", package = "Rduckhts")
gwas_path <- system.file("extdata", "score_gwas_summary.vcf", package = "Rduckhts")
# Hard-call (GT) PRS with PLINK-format summary statistics
# S1: 0×0.5 + 1×(−0.2) + 2×1.0 = 1.8
# S2: 1×0.5 + 2×(−0.2) + 0×1.0 = 0.1
gt_prs <- rduckhts_score(con, vcf_path, sumf_path, use = "GT", columns = "PLINK")
gt_prs[, c("SAMPLE", "score_summary")]
#> SAMPLE score_summary
#> 1 S1 1.80000000
#> 2 S2 0.09999999
# Dosage-based PRS (DS field) for imputed genotypes
# S1: 0.1×0.5 + 0.8×(−0.2) + 1.8×1.0 = 1.69
# S2: 1.0×0.5 + 1.9×(−0.2) + 0.2×1.0 = 0.32
ds_prs <- rduckhts_score(con, dosage_path, sumf_path, use = "DS", columns = "PLINK")
ds_prs[, c("SAMPLE", "score_summary")]
#> SAMPLE score_summary
#> 1 S1 1.69
#> 2 S2 0.32
# GWAS-VCF multi-PRS: each FORMAT/ES sample column becomes a separate PRS track
gwas_prs <- rduckhts_score(con, vcf_path, gwas_path, use = "GT")
gwas_prs[, c("SAMPLE", "PRS_A", "PRS_B")]
#> SAMPLE PRS_A PRS_B
#> 1 S1 1.80000000 1.0
#> 2 S2 0.09999999 0.3Compression + tabix round-trips
tmp_bed <- tempfile("duckhts_targets_", fileext = ".bed")
tmp_bgz <- paste0(tmp_bed, ".gz")
tmp_tbi <- paste0(tmp_bgz, ".tbi")
writeLines(c("chr1\t0\t10\ta", "chr1\t10\t20\tb"), tmp_bed)
rduckhts_bgzip(con, tmp_bed, output_path = tmp_bgz, keep = TRUE, overwrite = TRUE)
#> success output_path bytes_in
#> 1 TRUE /tmp/Rtmp6jhy2i/duckhts_targets_3bf93a55ef740e.bed.gz 25
#> bytes_out
#> 1 84
rduckhts_tabix_index(con, tmp_bgz, preset = "bed", index_path = tmp_tbi, threads = 1)
#> success index_path
#> 1 TRUE /tmp/Rtmp6jhy2i/duckhts_targets_3bf93a55ef740e.bed.gz.tbi
#> index_format
#> 1 TBI
rduckhts_bed(con, "targets_idx", tmp_bgz, region = "chr1:1-20", index_path = tmp_tbi, overwrite = TRUE)
dbGetQuery(con, "SELECT * FROM targets_idx")
#> chrom start end name score strand thick_start thick_end item_rgb block_count
#> 1 chr1 0 10 a <NA> <NA> NA NA <NA> NA
#> 2 chr1 10 20 b <NA> <NA> NA NA <NA> NA
#> block_sizes block_starts extra
#> 1 <NA> <NA> <NA>
#> 2 <NA> <NA> <NA>
unlink(c(tmp_bed, tmp_bgz, tmp_tbi))Sequence UDFs
The extension also exposes sequence utility UDFs directly in DuckDB SQL, including 4-bit IUPAC DNA encode/decode helpers. These can be applied to SEQUENCE columns from FASTA and FASTQ scans.
dbGetQuery(
con,
"SELECT
NAME,
seq_hash_2bit(substr(SEQUENCE, 1, 12)) AS hash_2bit_prefix,
seq_encode_4bit(substr(SEQUENCE, 1, 16)) AS codes,
seq_decode_4bit(seq_encode_4bit(substr(SEQUENCE, 1, 16))) AS roundtrip
FROM sequences
LIMIT 2"
)
#> NAME hash_2bit_prefix codes
#> 1 CHROMOSOME_I 9898352 4, 2, 2, 8, 1, 1, 4, 2, 2, 8, 1, 1, 4, 2, 2, 8
#> 2 CHROMOSOME_II 6038978 2, 2, 8, 1, 1, 4, 2, 2, 8, 1, 1, 4, 2, 2, 8, 1
#> roundtrip
#> 1 GCCTAAGCCTAAGCCT
#> 2 CCTAAGCCTAAGCCTA
dbGetQuery(
con,
"SELECT
NAME,
MATE,
seq_encode_4bit(substr(SEQUENCE, 1, 12)) AS codes,
seq_decode_4bit(seq_encode_4bit(substr(SEQUENCE, 1, 12))) AS roundtrip
FROM reads
LIMIT 2"
)
#> NAME MATE codes
#> 1 HS25_09827:2:1201:1505:59795#49 1 2, 2, 4, 8, 8, 1, 4, 1, 4, 2, 1, 8
#> 2 HS25_09827:2:1201:1505:59795#49 2 1, 1, 4, 4, 1, 1, 1, 4, 1, 1, 4, 4
#> roundtrip
#> 1 CCGTTAGAGCAT
#> 2 AAGGAAAGAAGGFASTA region queries
read_fasta supports indexed region queries via rduckhts_fasta(..., region = ...).
fai_path <- tempfile("duckhts_readme_", fileext = ".fai")
fai_info <- rduckhts_fasta_index(con, fasta_path, index_path = fai_path)
fai_info
#> success index_path
#> 1 TRUE /tmp/Rtmp6jhy2i/duckhts_readme_3bf93a16f0ce0c.fai
rduckhts_fasta(
con, "fasta_region", fasta_path,
region = "CHROMOSOME_I:1-25",
overwrite = TRUE
)
dbGetQuery(con, "SELECT NAME, length(SEQUENCE) AS n FROM fasta_region")
#> NAME n
#> 1 CHROMOSOME_I 25
unlink(fai_path)Examples
Region Queries
Region queries can use implicit sidecar indexes or an explicit index_path for custom index names/locations.
bcf_path <- system.file("extdata", "vcf_file.bcf", package = "Rduckhts")
bcf_index_path <- system.file("extdata", "vcf_file.bcf.csi", package = "Rduckhts")
rduckhts_bcf(con, "variants", bcf_path, overwrite = TRUE)
variants <- dbGetQuery(con, "SELECT * FROM variants LIMIT 5")
variants
#> CHROM POS ID REF ALT QUAL FILTER INFO_TEST INFO_DP4 INFO_AC
#> 1 1 3000150 <NA> C T 59.2 PASS NA NULL 2
#> 2 1 3000151 <NA> C T 59.2 PASS NA NULL 2
#> 3 1 3062915 id3D GTTT G 12.9 q10 NA 1, 2, 3, 4 2
#> 4 1 3062915 idSNP G T, C 12.6 test 5 1, 2, 3, 4 1, 1
#> 5 1 3106154 <NA> CAAA C 342.0 PASS NA NULL 2
#> INFO_AN INFO_INDEL INFO_STR FORMAT_TT_A FORMAT_GT_A FORMAT_GQ_A FORMAT_DP_A
#> 1 4 FALSE <NA> NULL 0/1 245 NA
#> 2 4 FALSE <NA> NULL 0/1 245 32
#> 3 4 TRUE test NULL 0/1 409 35
#> 4 3 FALSE <NA> 0, 1 0/1 409 35
#> 5 4 FALSE <NA> NULL 0/1 245 32
#> FORMAT_GL_A FORMAT_TT_B FORMAT_GT_B FORMAT_GQ_B FORMAT_DP_B
#> 1 NULL NULL 0/1 245 NA
#> 2 NULL NULL 0/1 245 32
#> 3 -20, -5, -20 NULL 0/1 409 35
#> 4 -20, -5, -20, -20, -5, -20 0, 1 2 409 35
#> 5 NULL NULL 0/1 245 32
#> FORMAT_GL_B
#> 1 NULL
#> 2 NULL
#> 3 -20, -5, -20
#> 4 -20, -5, -20
#> 5 NULL
rduckhts_bcf(
con, "variants_idx", bcf_path,
region = "1:3000150-3000151",
index_path = bcf_index_path,
overwrite = TRUE
)
dbGetQuery(con, "SELECT count(*) AS n FROM variants_idx")
#> n
#> 1 2
# Span-oriented index view from the same file
index_spans_preview <- rduckhts_hts_index_spans(con, bcf_path, index_path = bcf_index_path)
head(index_spans_preview[, c("seqname", "tid", "index_type", "chunk_beg_vo", "chunk_end_vo")], 5)
#> seqname tid index_type chunk_beg_vo chunk_end_vo
#> 1 1 0 CSI NA NA
#> 2 2 1 CSI NA NA
#> 3 3 2 CSI NA NA
#> 4 4 3 CSI NA NARemote VCF on S3
S3 files can be query when htslib is built with plugins enable. This is not the case on RTools
# Enable htslib plugins for remote access (S3/GCS/HTTP)
setup_hts_env()
# Example S3 URL (1000 Genomes cohort VCF)
s3_base <- "s3://1000genomes-dragen-v3.7.6/data/cohorts/"
s3_path <- "gvcf-genotyper-dragen-3.7.6/hg19/3202-samples-cohort/"
s3_vcf_file <- "3202_samples_cohort_gg_chr22.vcf.gz"
s3_vcf_uri <- paste0(s3_base, s3_path, s3_vcf_file)
rduckhts_bcf(con, "s3_variants", s3_vcf_uri, region = "chr22:16050000-16050500", overwrite = TRUE)
dbGetQuery(con, "SELECT CHROM, COUNT(*) AS n FROM s3_variants GROUP BY CHROM")
#> CHROM n
#> 1 chr22 11FASTQ files
Three modes for fastq files, single, paired and interleaved
r1 <- system.file("extdata", "r1.fq", package = "Rduckhts")
r2 <- system.file("extdata", "r2.fq", package = "Rduckhts")
interleaved <- system.file("extdata", "interleaved.fq", package = "Rduckhts")
rduckhts_fastq(con, "paired_reads", r1, mate_path = r2, overwrite = TRUE)
rduckhts_fastq(con, "interleaved_reads", interleaved, interleaved = TRUE, overwrite = TRUE)
pairs <- dbGetQuery(con, "SELECT * FROM paired_reads WHERE MATE = 1 LIMIT 5")
pairs
#> NAME DESCRIPTION
#> 1 HS25_09827:2:1201:1505:59795#49 <NA>
#> 2 HS25_09827:2:1201:1559:70726#49 <NA>
#> 3 HS25_09827:2:1201:1564:39627#49 <NA>
#> 4 HS25_09827:2:1201:1565:91731#49 <NA>
#> 5 HS25_09827:2:1201:1624:69925#49 <NA>
#> SEQUENCE
#> 1 CCGTTAGAGCATTTGTTGAAAATGCTTTCCTTGCTCCATGTGATGACTCTGGTGCCCTTGTCAAAAGCCAGCTGGGCCTATTCGTGTGGGTCTGTTTCTG
#> 2 TTGTTAAAATGACCATACCCAAAGTGATCTACAGACTCAATACAATTTCTATTGAAATACCAATCACACTCTTCACAGAACTAGAAAAACAGTTCTAAAA
#> 3 ACGCGGCAATCCAATGTGTGAGTTGAGAAGCGGTGAGGAGGGAATCCTAATTTTATGAGCAGGTCAGGACCGTGGGAGATACCTGACACCTGAGATGGTA
#> 4 GACATGCCATAACATTCATGTTTTATGTGTACAAGTCAATGAATTTTAGTATATTTACAGAGTTGTATGACTGTCTCCACAATCTAATTTTAGGTTTCCA
#> 5 GCCAGCCTCCTTCTCAATGGTCTTTTTAAACATTATATGAAAACCAGACATTTACATTTGATTTCTTTTTCAATACTATACAGTTCTAAGAGAAAAAACA
#> QUALITY
#> 1 CABCFGDEEFFEFHGHGGFFGDIGIJFIFHHGHEIFGHBCGHDIFBE9GIAICGGICFIBFGGHGDGGGHE?GIGDFGGHEGIEJG>;FG<GGHACEFGH
#> 2 CABEFGFFGFHGGGGJGGFFGKIHHJFIEHHHGIEGGEHJGHDHFGHIGICIJEFIFGIF8GGHKFHGGFEI6GGGFIGHGGIE>EFCFHGGGHEJEAJE
#> 3 BACCFGBFGFHGGJGHGGFEGHIGIJHFEH:HHEHGHHBGGH9IAGHGFHIFJFFAFGIFDIGHKEIG<C>F,CGD66?7EFI5EEG>EGGGGD5=HH6E
#> 4 CABFFGFFJFHEGEGJGGDG?FIGHHHBGHHHGIIGHGHGGHDGHFHIDFCIKEGIFHGGII9HFFGGGEEIGGEEHGGEEGDEHFH>FGGGGHAFAHGE
#> 5 CABEFGFGIFGGGJGHGGFH?FDHGHDHGHEHHJCGHHFHDHDHFGHIGHIFFHGHFGGGI9GHF@IGGH;FICGEFEIHGGIEEFC:DEGGGBDJHHFF
#> MATE PAIR_ID
#> 1 1 HS25_09827:2:1201:1505:59795#49
#> 2 1 HS25_09827:2:1201:1559:70726#49
#> 3 1 HS25_09827:2:1201:1564:39627#49
#> 4 1 HS25_09827:2:1201:1565:91731#49
#> 5 1 HS25_09827:2:1201:1624:69925#49FASTQ quality decoding and per-position histograms
FASTQ quality handling has two separate knobs:
-
input_quality_encodingcontrols how incoming FASTQ ASCII is decoded. The default is modernphred33; usephred64,solexa64, orautoonly for legacy data. -
quality_representationcontrols how qualities are returned to DuckDB: canonical Phred+33 text ("string") or numeric Phred arrays ("phred").
The flow is:
- Decode FASTQ ASCII using
input_quality_encoding. - Normalize to numeric Phred qualities.
- Return either numeric arrays or canonical Phred+33 text.
BAM/CRAM reads skip the text-decoding step because qualities are already stored numerically.
legacy_fastq <- system.file("extdata", "legacy_phred64.fq", package = "Rduckhts")
rduckhts_detect_quality_encoding(con, legacy_fastq)
#> format observed_ascii_min observed_ascii_max records_sampled
#> 1 fastq 104 104 1
#> compatible_encodings guessed_encoding is_ambiguous
#> 1 phred33,phred64,solexa64 phred64 TRUE
quality_hist <- dbGetQuery(
con,
sprintf(
"WITH q AS (
SELECT NAME, QUALITY
FROM read_fastq('%s', quality_representation := 'phred')
),
expanded AS (
SELECT
NAME,
generate_subscripts(QUALITY, 1) AS pos,
unnest(QUALITY) AS q
FROM q
)
SELECT pos, q AS phred, count(*) AS n_reads
FROM expanded
GROUP BY pos, phred
ORDER BY pos, phred
LIMIT 12",
fastq_r1
)
)
quality_hist
#> pos phred n_reads
#> 1 1 33 1
#> 2 1 34 4
#> 3 2 32 5
#> 4 3 33 4
#> 5 3 34 1
#> 6 4 34 2
#> 7 4 36 2
#> 8 4 37 1
#> 9 5 37 5
#> 10 6 38 5
#> 11 7 33 1
#> 12 7 35 1GFF files
These can be open with or with attributes maps
gff_path <- system.file("extdata", "gff_file.gff.gz", package = "Rduckhts")
rduckhts_gff(con, "genes", gff_path, attributes_map = TRUE, overwrite = TRUE)
gene_annotations <- dbGetQuery(con, "SELECT seqname, start, \"end\" FROM genes WHERE feature = 'gene' LIMIT 5")
gene_annotations
#> seqname start end
#> 1 X 2934816 2964270BAM/CRAM
When built with htslib codec, CRAM can be opened in addition to BAM files. index_path can also be passed for region scans with non-standard index names.
cram_path <- system.file("extdata", "range.cram", package = "Rduckhts")
ref_path <- system.file("extdata", "ce.fa", package = "Rduckhts")
bam_path <- system.file("extdata", "range.bam", package = "Rduckhts")
bam_index_path <- system.file("extdata", "range.bam.bai", package = "Rduckhts")
rduckhts_bam(con, "cram_reads", cram_path, reference = ref_path, overwrite = TRUE)
cram_reads <- dbGetQuery(con, "SELECT QNAME, FLAG, POS, MAPQ FROM cram_reads LIMIT 5")
cram_reads
#> QNAME FLAG POS MAPQ
#> 1 HS18_09653:4:1315:19857:61712 145 914 23
#> 2 HS18_09653:4:1308:11522:27107 161 934 0
#> 3 HS18_09653:4:2314:14991:85680 83 1020 10
#> 4 HS18_09653:4:2108:14085:93656 147 1122 60
#> 5 HS18_09653:4:1303:4347:38100 83 1137 37
rduckhts_bam(
con, "bam_idx_reads", bam_path,
region = "CHROMOSOME_I:1-1000",
index_path = bam_index_path,
overwrite = TRUE
)
dbGetQuery(con, "SELECT count(*) AS n FROM bam_idx_reads")
#> n
#> 1 2SAMtags + auxiliary tags
Standard SAMtags can be exposed as typed columns, and any remaining tags are available via AUXILIARY_TAGS:
aux_path <- system.file("extdata", "aux_tags.sam.gz", package = "Rduckhts")
rduckhts_bam(con, "aux_reads", aux_path, standard_tags = TRUE, auxiliary_tags = TRUE, overwrite = TRUE)
dbGetQuery(con, "SELECT RG, NM, map_extract(AUXILIARY_TAGS, 'XZ') AS XZ FROM aux_reads LIMIT 1")
#> RG NM XZ
#> 1 x1 2 fooTabix headers + types
Use header = TRUE to use the first non-meta row as column names, and auto_detect = TRUE / column_types to control column typing:
tabix_header <- system.file("extdata", "header_tabix.tsv.gz", package = "Rduckhts")
tabix_meta <- system.file("extdata", "meta_tabix.tsv.gz", package = "Rduckhts")
rduckhts_tabix(con, "header_tabix", tabix_header, header = TRUE, overwrite = TRUE)
dbGetQuery(con, "SELECT chrom, pos FROM header_tabix LIMIT 2")
#> chrom pos
#> 1 chr1 1
#> 2 chr1 2
rduckhts_tabix(con, "typed_tabix", tabix_meta, auto_detect = TRUE, overwrite = TRUE)
dbGetQuery(con, "SELECT typeof(column1) AS column1_type FROM typed_tabix LIMIT 1")
#> column1_type
#> 1 BIGINT
rduckhts_tabix(con, "typed_tabix_explicit", tabix_header,
header = TRUE,
column_types = c("VARCHAR", "BIGINT", "VARCHAR"),
overwrite = TRUE)
dbGetQuery(con, "SELECT pos + 1 AS pos_plus_one FROM typed_tabix_explicit LIMIT 1")
#> pos_plus_one
#> 1 2HTS header and index metadata
Use metadata helpers to inspect parsed headers, raw header lines, index summaries, span-oriented index views, and raw index blobs.
header_meta <- rduckhts_hts_header(con, bcf_path)
head(header_meta[, c("record_type", "id", "number", "value_type")], 5)
#> record_type id number value_type
#> 1 fileformat <NA> <NA> <NA>
#> 2 FILTER PASS <NA> <NA>
#> 3 INFO TEST 1 Integer
#> 4 FORMAT TT A Integer
#> 5 INFO DP4 4 Integer
header_raw <- rduckhts_hts_header(con, bcf_path, mode = "raw")
head(header_raw[, c("idx", "raw")], 5)
#> idx
#> 1 0
#> 2 1
#> 3 2
#> 4 3
#> 5 4
#> raw
#> 1 ##fileformat=VCFv4.1
#> 2 ##FILTER=<ID=PASS,Description="All filters passed">
#> 3 ##INFO=<ID=TEST,Number=1,Type=Integer,Description="Testing Tag">
#> 4 ##FORMAT=<ID=TT,Number=A,Type=Integer,Description="Testing Tag, with commas and \\"escapes\\" and escaped escapes combined with \\\\\\"quotes\\\\\\\\\\"">
#> 5 ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
index_meta <- rduckhts_hts_index(con, bcf_path, index_path = bcf_index_path)
head(index_meta[, c("seqname", "mapped", "unmapped", "index_type")], 5)
#> seqname mapped unmapped index_type
#> 1 1 11 0 CSI
#> 2 2 1 0 CSI
#> 3 3 1 0 CSI
#> 4 4 2 0 CSI
index_spans <- rduckhts_hts_index_spans(con, bcf_path, index_path = bcf_index_path)
head(index_spans[, c("seqname", "tid", "index_type", "chunk_beg_vo", "chunk_end_vo")], 5)
#> seqname tid index_type chunk_beg_vo chunk_end_vo
#> 1 1 0 CSI NA NA
#> 2 2 1 CSI NA NA
#> 3 3 2 CSI NA NA
#> 4 4 3 CSI NA NA
index_raw <- rduckhts_hts_index_raw(con, bcf_path, index_path = bcf_index_path)
head(index_raw, 1)
#> [1] index_type
#> [2] '/usr/local/lib/R/site-library/Rduckhts/extdata/vcf_file.bcf.csi'
#> [3] raw
#> <0 rows> (or 0-length row.names)Remote GTEx tabix example
GTEx eQTL matrices on EBI are tabix-indexed
gtex_url <- "http://ftp.ebi.ac.uk/pub/databases/spot/eQTL/imported/GTEx_V8/ge/Brain_Cerebellar_Hemisphere.tsv.gz"
rduckhts_tabix(con, "gtex_eqtl", gtex_url, region = "1:11868-14409",
header = TRUE, auto_detect = TRUE, overwrite = TRUE)
dbGetQuery(con, "SELECT * FROM gtex_eqtl LIMIT 5")
#> variant r2 pvalue molecular_trait_object_id molecular_trait_id
#> 1 chr1_13550_G_A NA 0.0204520 ENSG00000188290 ENSG00000188290
#> 2 chr1_13550_G_A NA 0.0303633 ENSG00000230699 ENSG00000230699
#> 3 chr1_13550_G_A NA 0.1057900 ENSG00000177757 ENSG00000177757
#> 4 chr1_13550_G_A NA 0.1617190 ENSG00000241860 ENSG00000241860
#> 5 chr1_13550_G_A NA 0.1919580 ENSG00000198744 ENSG00000198744
#> maf gene_id median_tpm beta se an ac chromosome
#> 1 0.0114286 ENSG00000188290 6.3960000 0.633986 0.270285 350 4 1
#> 2 0.0114286 ENSG00000230699 0.0674459 -0.980082 0.447861 350 4 1
#> 3 0.0114286 ENSG00000177757 1.2659000 0.631359 0.387738 350 4 1
#> 4 0.0114286 ENSG00000241860 0.1081970 -0.791695 0.562674 350 4 1
#> 5 0.0114286 ENSG00000198744 21.6284000 -0.592354 0.451705 350 4 1
#> position ref alt type rsid
#> 1 13550 G A SNP rs554008981
#> 2 13550 G A SNP rs554008981
#> 3 13550 G A SNP rs554008981
#> 4 13550 G A SNP rs554008981
#> 5 13550 G A SNP rs554008981
dbDisconnect(con, shutdown = TRUE)References
- DuckDB: https://duckdb.org/
- DuckDB Extension API: https://duckdb.org/docs/extensions/overview
- DuckDB extension template (C): https://github.com/duckdb/extension-template-c
- htslib: https://github.com/samtools/htslib
- RBCFTools: https://github.com/RGenomicsETL/RBCFTools