Using the ModSeqR package
Hailey Zimmerman and Jonathon T. Hill, PhD
2026-05-14
Source:vignettes/ModSeqRWalkthrough.Rmd
ModSeqRWalkthrough.RmdIntroduction
The ModSeqR package provides basic functions for
importing and working with third-generation methylation sequencing
data.
This vignette will walk you through a typical workflow. If sequencing
data is coming from Wasatch Biolabs (https://www.wasatchbiolabs.com/), .ch3
files will be delivered directly if requested for your batch. These
.ch3 files contain methylation calls from a
third-generation sequencing run. A .ch3 file will typically
be created for each sample within a run, and is meant as an intermediate
existing to compress large amounts of data into a usable form.
If not delivered through Wasatch Biolabs, .ch3 files can
be built using this package.
It’s most helpful to store all the sample CH3 files from a run you want to analyze within a single directory to get started.
Creating .ch3 files from a TSV (optional)
If your methylation calls are in a delimited text file (e.g., from a
modkit extract-calls TSV), you can convert them into
compressed .ch3 archives using
make_mod_archive(). This will dramatically reduce file size
and allow use of the ModSeqR package.
make_mod_archive() reads a tab-delimited calls file,
selects and transforms the required columns, and writes a compressed
Arrow dataset to out_path using zstd compression. The
function derives start/end coordinates per CpG
(accounting for strand and the CG k-mer), optionally shortens long
read_id’s to shrink file size
(short_ids = TRUE), and embeds your
sample_name into the output filenames. The output may be
sharded into multiple files (e.g., SampleA-0.ch3,
SampleA-1.ch3, …) depending on Arrow’s writer.
Coordinate convention:
The .ch3 files use 0-based, half-open coordinates for
start/end (like BED format). This means the
start column is inclusive and end is
exclusive. Keep this in mind if you compare .ch3 positions
to genome browsers or annotation files that use 1-based inclusive
coordinates (e.g., GTF/GFF). Conversion may be required when joining
with external annotations.
Expected input columns (typical from modkit):
read_id,forward_read_position,
ref_position, ref_mod_strand,
chrom, read_length, query_kmer,
call_prob, call_code, base_qual,
flag.
After the .ch3 files are written, point
make_mod_db() at the directory containing them (or pass
their paths directly), and the package will take care of the rest.
# Convert a modkit TSV to a compressed .ch3 archive
make_mod_archive(
file_name = "/path/to/modkit_calls.tsv", # tab-delimited input
sample_name = "Astrocytes", # used in output filenames
out_path = "/path/to/ch3_out", # directory to write .ch3 files
short_ids = TRUE # shorten read_id to reduce size
)
# The directory now contains e.g.: Astrocytes-0.ch3, Astrocytes-1.ch3, ...
# Build a DB from those .ch3 files:
make_mod_db(
ch3_files = "/path/to/ch3_out", # a directory OR a vector of file paths
db_name = "astro_db"
)Notes & tips
- If your input file contains unplaced minus-strand rows at position 0 (common in some extract outputs), those are filtered automatically.
- If you already have concise read IDs, set short_ids = FALSE.
- You can run make_mod_archive() for each sample TSV, all writing into the same out_path; make_mod_db() can then ingest the whole directory in one call.
- The .ch3 files are standard Parquet—fast to read, compact on disk, and splittable for parallel import.
Loading Data
The first step of a workflow will be to upload data from a directory holding sample sequencing data in the form of CH3 files. These files are typically created by Wasatch Biolabs.
Once data is obtained in this format, the first step is done using
the make_mod_db() function. This function directly imports
a directory of CH3 files, and creates a database in the form of a local
file.
make_mod_db
make_mod_db() builds the DuckDB database from one or
more .ch3 Parquet files.
It requires two arguments: ch3_files, which can be a
directory or a vector of file paths, and db_name, which is
the path where the DuckDB file will be created (the .mod.db suffix is
added if missing).
You may optionally pass a named character vector to assign explicit
sample_names; otherwise, sample_name is
derived from the file’s basename by removing a trailing -chrom (vector of chromosomes to keep),
min_read_length (default 50), min_call_prob
(default 0.9), min_base_qual (default 10), and an optional
flag filter. The function configures DuckDB pragmas for
performance (temp directory, threads ≈ all-but-one core, memory ≈ 50%
RAM), drops any existing tables, and creates a calls table
with: sample_name, chrom, start,
end,read_position, call_code,
read_length, call_prob,
base_qual, and flag.
Here is an example:
# Example: directory of CH3 files; auto-derived sample names
ch3_files <- system.file("TEST_DATA", package = "ModSeqR") # adjust as needed
mod_db <- tempfile("example_db")
make_mod_db(ch3_files, mod_db)
# Example: explicit sample names override the fallback
make_mod_db(
ch3_files = c(
sample1 = "/data/sample1-0.ch3",
sample2 = "/data/sample2.ch3"
),
db_name = "named_samples"
)After this step, a database is made and the raw calls table is
included. A user may continue their analysis in any direction they want.
They may run a quality control (run_mod_qc()), summarize
data by positions (summarize_mod_positions()), regions
(summarize_mod_regions()), or windows
(summarize_mod_windows()). At any point in the analysis, a
user can export a table in the database by calling
export_mod_tables(), or use get_mod_table()to
import as a tibble into the local environment. All these functions will
be explained below.
summarize_mod_positions
This function summarizes methylation data (from the calls table) by each unique position in the genome. A table called “positions” will be built into the database.
summarize_mod_positions() aggregates the
calls table at each genomic position for each sample and
writes a positions table. It always produces
num_calls, and, for each unique modification you request, a
<label>_counts and <label>_frac
column (ex. methylation would be m_counts and m_frac). You control
modification types via mod_code (e.g., c(“m”,“h”,“m + h”)
or a novel “a”).
Labels are formed by removing spaces/“+” (so "m + h"
becomes mh). The unmodified class is defined by
unmod_code (default "-") and named by
unmod_label (default "c", yielding
c_counts and c_frac). Optional
chrs (defaults to all chromosomes in the data),
samples, and min_num_calls let you filter
chromosomes, samples, and minimum coverage per position.
# Minimal usage: default m/h classes
summarize_mod_positions(mod_db)
# Custom classes, including a novel code 'a', with stricter filtering
summarize_mod_positions(
mod_db,
mod_code = c("a", "m + h"),
unmod_code = "-",
unmod_label = "c",
min_num_calls = 5
)summarize_mod_regions
summarize_mod_regions() joins per-position counts to an
annotation file of regions (BED/TSV/CSV with chrom,
start, end, and optional
region_name) and writes a regions table. If
region_name is missing, it is synthesized as
chrom_start_end.
You choose the join type via
join = "inner" | "left" | "right"; inner keeps only
overlapping positions/regions, left keeps all regions (filling zeros
where appropriate), and right keeps all positions with region
matches.
As with positions, you specify any set of mod_code
values (single or combined like “m + h”), unmod_code,
unmod_label, and filters (chrs,
samples, min_num_calls). Output columns
include sample_name, region_name,
chrom, start, end,
num_sites (number of unique positions aggregated),
num_calls, and the
<label>_counts/<label>_frac columns for all
requested labels.
region_bed = system.file("Islands_hg38_test.csv", package = "ModSeqR")
# Minimal call
summarize_mod_regions(mod_db, region_file = region_bed)
# Custom call
summarize_mod_regions(
mod_db,
region_file = region_bed,
join = "inner",
mod_code = c("m","h","m + h"), # supports novel codes too, e.g., "a"
min_num_calls = 5
)summarize_mod_windows
summarize_mod_windows() produces sliding/offset windows
from per-position counts and writes a windows table.
By default it tiles with 1,000 bp windows and 10 bp offsets so
positions are assigned to multiple staggered windows. The output
contains num_sites, num_calls, and
<label>_counts/<label>_frac for all requested
modification labels (as defined by mod_code,
unmod_code, and unmod_label).
You can adjust window_size, step_size,
min_num_calls, limit to particular chrs
(defaults to all chromosomes in the data) or samples, and
choose to overwrite an existing table. Default overwrite is
TRUE.
# 100 bp windows, non-overlapping (step_size = window_size)
summarize_mod_windows(mod_db, window_size = 100, step_size = 100)
# 2 kb windows with offsets; include a novel 'a' class
summarize_mod_windows(
mod_db,
window_size = 2000,
step_size = 20,
mod_code = c("a", "m + h"),
min_num_calls = 25,
overwrite = TRUE
)summarize_mod_reads
This function summarizes methylation data from a mod_db database by the reads. A table called “reads” will be built into the database.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Calculate windows
summarize_mod_reads(mod_db)There is only 1 required arguments. mod_db is a list
containing the database file path. This should be a string representing
a path to the database, or a valid “mod_db” class object. Optionally,
key_table is a path to an external table which can be used
to filter for unique windows of interest. It should have a “chrom”,
“start”, and “end” column to specify which windows are of interest. The
reads table will be filtered for just those windows from the
key_table.Lastly, min_CGs specifies the minimum number of
CG sites for a read to be included. If a read has less than that
argument, it will not be included in the table.
Convenience utilities (inspect/rename/tidy)
Sometimes you just want to peek at what’s in your database, check a table’s shape, or quickly fix sample names. The helpers below are lightweight wrappers around DuckDB/DBI that do exactly that.
get_mod_dbinfo()
Prints a quick summary of the database: on-disk size, a list of tables, and the distinct sample_names (if a calls table exists). Invisibly returns the closed mod_db object.
# Path or mod_db object both work
get_mod_dbinfo("my_data.mod.db")get_mod_tableinfo()
Summarizes a single table: total rows, per-sample row counts (if the table has a sample_name column), and a list of columns. Good first check after creating a table.
get_mod_tableinfo("my_data.mod.db", table_name = "positions")get_mod_cols()
Returns (and prints) the column names for a given table. Handy when you’re about to write a query and can’t remember exact column names.
get_mod_cols("my_data.mod.db", "windows")get_mod_cpg_count()
Counts unique CpG sites in a table, where uniqueness is defined by distinct (start, end) pairs. Prints a small summary and invisibly returns the count.
n_cpg <- get_mod_cpg_count("my_data.mod.db", table_name = "calls")rename_mod_samples()
Renames sample_names directly in a specified table (e.g., positions, windows, regions, calls). You provide a mapping from old → new names as a named character vector (c(old = “new”, …)) or a two-column data frame with old and new. The function validates inputs, updates in place, and (optionally) prints a before/after list of distinct names.
# Rename in the 'positions' table
rename_mod_samples(
"my_data.mod.db",
table = "positions",
samples_map = c("Cortical_Neurons" = "Cortex", "Blood" = "PBMC"),
preview = TRUE
)remove_mod_table()
Deletes a table by name if it exists, with a clear message either way (no error if the table is missing). Useful to clean up experiments.
remove_mod_table("my_data.mod.db", "tmp_debug_table")Tip: All helpers accept either a path to the .mod.db file or an existing mod_db object. They open a connection, do the work, and close it for you.
Differential Methylation
calc_mod_diff
This function calculates differential methylation between specified case and control groups using various statistical methods.
Differential methylation can be calculated from positional, regional, or window methylation data.
Calling this function creates a unique mod_diff_* table
within the database, where differential methylation results are
stored.
The table name depends on the input call_type:
call_type = "positions"→mod_diff_positionscall_type = "regions"→mod_diff_regionscall_type = "windows"→mod_diff_windows
Calling calc_mod_diff() again on the same type of data
will overwrite the existing mod_diff_{type} table
(e.g. mod_diff_windows will be rewritten if you run it
again on window data).
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Get methylation statistics for the 'positions' call type without plotting
calc_mod_diff(mod_db = mod_db,
call_type = "positions",
cases = "blood",
controls = "sperm")Modification type (mod_type)
The mod_type argument specifies which modification to
test. The default is "mh" (methylation/hydroxymethylation).
Other codes include "a" for 6mA, "17596" for
inosine, and "17802" for pseudouridine. Bare numeric codes
are automatically prefixed with "m_", so
mod_type = "17596" and mod_type = "m_17596"
are equivalent.
Automatic choice of statistical test (calc_type)
The argument calc_type controls which statistical method
is used to test for differential methylation. It can be set manually to
one of:
"beta_bin"– beta-binomial likelihood-ratio test (accounts for overdispersion across replicates)"wilcox"– Wilcoxon rank-sum test on per-sample methylation fractions"fast_fisher"– fast Fisher’s exact test on aggregated counts"r_fisher"– Fisher’s exact test via base R"log_reg"– logistic regression–based test
If calc_type is left as NULL (the default),
the function automatically chooses an appropriate method based on how
many samples are in each group:
5 or more samples in both groups →
"wilcox"— enough samples for a non-parametric rank-based test.2–4 samples in either group →
"beta_bin"— models biological variability (overdispersion) across replicates while respecting per-sample coverage differences.1 sample in either group →
"fast_fisher"— no replicate information available; directly compares aggregated counts.
A message is printed indicating which method was chosen.
Why this behavior?
The choice of statistical test depends on how many biological replicates are available:
With many replicates (≥ 5 per group), the Wilcoxon rank-sum test works well. It compares the distribution of methylation fractions between groups without assuming a specific distribution shape. With enough samples, rank-based tests have good power and produce granular p-values.
With few replicates (2–4 per group), the beta-binomial test is preferred. The Wilcoxon test has limited power with so few values per group (e.g., with 3 vs 3 samples, the smallest possible Wilcoxon p-value is 0.05, which cannot survive multiple testing correction). The pooled Fisher test ignores between-sample variability entirely, treating all replicates as one large observation (pseudo-replication), which leads to inflated significance. The beta-binomial solves both problems: it models each sample’s count data individually, accounts for varying coverage across samples, and explicitly estimates biological variability (overdispersion) between replicates through a likelihood-ratio test.
With no replicates (1 vs 1), the fast Fisher’s exact test is the appropriate choice. There is no between-sample variation to model, and the test directly compares modification counts in a 2×2 contingency table.
Beta-binomial test details
The beta-binomial test (calc_type = "beta_bin") models
each sample’s modified-read count at a locus as a draw from a
beta-binomial distribution. This captures two sources of variation:
sampling noise (from finite read counts) and biological variability
(different samples having genuinely different modification rates at the
same site).
The test fits two models at each locus:
- Null model: all samples (case + control) share one mean modification rate μ
- Alternative model: cases have μcase and controls have μcontrol, with shared overdispersion
The difference in log-likelihoods is compared to a χ² distribution to
obtain a p-value. The output includes an overdispersion
column (φ, ranging 0–1) that indicates how much the true modification
rate varies across replicates at each locus. Values near 0 mean
replicates are very consistent; higher values indicate more biological
variability.
Key advantages of the beta-binomial:
- Samples with higher coverage naturally contribute more to the estimate (a sample with 500 calls is more informative than one with 10 calls)
- Biological variability between replicates is explicitly modeled, not ignored
- No additional R package dependencies are required (implemented in base R)
# Auto-select: with 3 vs 3 samples, beta_bin is chosen automatically
calc_mod_diff(mod_db,
call_type = "windows",
cases = c("case1", "case2", "case3"),
controls = c("ctrl1", "ctrl2", "ctrl3"))
# Explicitly request beta-binomial
calc_mod_diff(mod_db,
call_type = "windows",
cases = c("case1", "case2"),
controls = c("ctrl1", "ctrl2"),
calc_type = "beta_bin")
# 1 vs 1: fast_fisher is chosen automatically
calc_mod_diff(mod_db,
call_type = "windows",
cases = "tumor",
controls = "normal")You can always override the automatic selection by explicitly setting
calc_type yourself.
Coverage filtering with min_coverage
When working with window-level data, some windows may have poor
breadth of coverage — reads may only cover a small portion of the window
in certain samples. The min_coverage parameter filters
these out before testing.
min_coverage is a proportion between 0 and 1
representing the minimum fraction of positions within a window that must
have modification calls for each sample. It is computed per sample as
num_sites / (end - start + 1). Windows where any sample
falls below this threshold are dropped before statistical testing.
This is especially useful for larger windows (e.g., 10 kb), where uneven read coverage can lead to unreliable modification fraction estimates. Filtering improves result quality and reduces the multiple testing burden, since fewer windows are tested.
# Require each sample to cover at least 50% of positions in each window
calc_mod_diff(mod_db,
call_type = "windows",
cases = c("sample1", "sample2", "sample3"),
controls = c("ctrl1", "ctrl2", "ctrl3"),
min_coverage = 0.5)
# More stringent: 80% breadth of coverage required
calc_mod_diff(mod_db,
call_type = "windows",
cases = c("sample1", "sample2", "sample3"),
controls = c("ctrl1", "ctrl2", "ctrl3"),
min_coverage = 0.8)Guidance on values:
-
min_coverage = 0.5— moderate filtering; good default for most analyses -
min_coverage = 0.8— stringent; best for high-coverage datasets -
min_coverage = 0.1— lenient; useful for sparse modifications like m6A
Note: min_coverage only applies when the input table has
num_sites, start, and end columns
(i.e., window data). It is silently skipped for position-level analyses.
If the filter removes all windows, the function will stop with a message
suggesting you lower the threshold.
collapse_mod_windows
This function only works on a mod_diff_windows table already in the database. After a differential methylation analysis is called on windows, this will collapse significant windows by merging contiguous regions that meet the specified criteria.
The only required parameter is the link to the database. Optionally,
a user may specify the following arguments.max_distanceis
the maximum allowable distance between consecutive significant windows
for merging (default: 1000). sig_cutoff is the significance
threshold for adjusted p-values (default: 0.05). Lastly,
min_diff specifies the minimum absolute methylation
difference required for inclusion in the analysis (default: 0.5).
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Collapse significant windows
collapse_mod_windows("my_data.mod.db")run_mod_analysis
This function automates an entire workflow in one command. It identifies differences in DNA methylation patterns between two groups (cases and controls) by summarizing methylation data at various genomic scales (positions, regions, or windows) and then performing statistical tests, outputting the results into organized CSV files within one directory.
run_mod_analysis() creates a new subdirectory named
“Mod_Diff_Analysis_Results” within the out_path where all the data will
be written to. It writes three CSV files into the
“Mod_Diff_Analysis_Results” directory by directly copying tables from
the DuckDB database:
mod_diff.csv: Contains the results of the differential analysis (e.g., p-values, fold changes) for each position, region, or window. The source table in the database is determined by diff_table_name.
All_CpGs.csv: Contains all the summarized methylation data for the chosen call_type (positions, regions, or windows) across all samples. The source table in the database is determined by call_type.
Sig_CpGs.csv: Contains summarized methylation data for only those positions/regions/windows that are deemed significantly differentially methylated, based on the p_val_max threshold.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Collapse significant windows
run_mod_analysis(mod_db,
out_path = "/Users/analysis",
call_type = "windows",
cases = c("sperm"),
controls = c("blood"))Quality Control
A variety of quality control functions are available to visually assess methylation data.
plot_mod_cov
This function calculates and optionally plots statistics for coverage data from methylation sequencing experiments. It can handle both positional and regional methylation data.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Get coverage statistics for the 'positions' call type
plot_mod_cov(mod_db = mod_db, call_type = "positions")plot_mod_modfrac
This function retrieves and calculates methylation statistics (mean methylation fractions) from a specified table in the mod database. It can either return summary statistics or plot a histogram of the methylation values, depending on the user’s preference.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Get methylation statistics for the 'positions' call type without plotting
plot_mod_modfrac(mod_db = mod_db, call_type = "positions")calc_mod_samplecor
This function calculates and optionally plots a correlation matrix for methylation or other modification fraction data from genomic positions. It can handle both position-based and region-based calls and supports visualization using ggplot2.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Run the correlation matrix function using the 'positions' call type and plot the results
calc_mod_samplecor(mod_db = mod_db, call_type = "positions", plot = TRUE)plot_mod_pca
This function performs Principal Component Analysis (PCA) on methylation data retrieved from the database.It aggregates the methylation fraction data based on the specified call type and prepares it for PCA analysis.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Calculate PCA
plot_mod_pca(mod_db)run_mod_qc
Combining all functions above, the `run_mod_qc()`` function computes all QC checks on methylation data at the same time.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Calculate windows
run_mod_qc(mod_db)Retrieving and Exporting Tables
You can access data in your database with two different functions.
get_mod_table
To retrieve a single table into your local environment for further
analysis within R, using get_mod_table will transform it
into a tibble and store it into a variable. Arguments required includes
the database, and name of the table to be retrieved.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Get any table in your database and create a variable in R to work with
positions = get_mod_table("my_data.mod.db", "positions")
regions = get_mod_table("my_data.mod.db", "regions")
mod_diff_regions = get_mod_table("my_data.mod.db", "mod_diff_regions")
collapsed_windows = get_mod_table("my_data.mod.db", "collapsed_windows")export_mod_table
To export one or more tables into a CSV file, use
export_mod_table. This function takes three arguments, the
database, which tables to export, and the path where the CSV files will
be saved.
# Specify the path to the database
mod_db <- system.file("my_data.mod.db", package = "ModSeqR")
# Export the table to your computer
export_mod_table("my_data.mod.db",
table = "windows",
out_path = "/Desktop/My_Folder/windows.csv")
export_mod_table("my_data.mod.db",
table = "mod_diff_windows",
out_path = "/Desktop/My_Folder/mod_diff_windows.csv")Conclusion
In this vignette, we have walked you through the basic functions in
the ModSeqR package. This work is a work in progress and we
hope to improve its functionality. If you have any suggestions or
requested features, please email Jonathon Hill at jhill@byu.edu.