Using the ModSeqR package
Hailey Zimmerman and Jonathon T. Hill, PhD
2026-01-09
Source:vignettes/ModSeqRWalkthrough.Rmd
ModSeqRWalkthrough.RmdIntroduction
The ModSeqR package provides basic functions for
importing and working with third-generation methylation sequencing
data.
This vignette will walk you through a typical workflow. If sequencing
data is coming from Wasatch Biolabs (https://www.wasatchbiolabs.com/), .ch3
files will be delivered directly if requested for your batch. These
.ch3 files contain methylation calls from a
third-generation sequencing run. A .ch3 file will typically
be created for each sample within a run, and is meant as an intermediate
existing to compress large amounts of data into a usable form.
If not delivered through Wasatch Biolabs, .ch3 files can
be built using this package.
It’s most helpful to store all the sample CH3 files from a run you want to analyze within a single directory to get started.
Creating .ch3 files from a TSV (optional)
If your methylation calls are in a delimited text file (e.g., from a
modkit extract-calls TSV), you can convert them into
compressed .ch3 archives using
make_mod_archive(). This will dramatically reduce file size
and allow use of the ModSeqR package.
make_mod_archive() reads a tab-delimited calls file,
selects and transforms the required columns, and writes a compressed
Arrow dataset to out_path using zstd compression. The
function derives start/end coordinates per CpG
(accounting for strand and the CG k-mer), optionally shortens long
read_id’s to shrink file size
(short_ids = TRUE), and embeds your
sample_name into the output filenames. The output may be
sharded into multiple files (e.g., SampleA-0.ch3,
SampleA-1.ch3, …) depending on Arrow’s writer.
Coordinate convention:
The .ch3 files use 0-based, half-open coordinates for
start/end (like BED format). This means the
start column is inclusive and end is
exclusive. Keep this in mind if you compare .ch3 positions
to genome browsers or annotation files that use 1-based inclusive
coordinates (e.g., GTF/GFF). Conversion may be required when joining
with external annotations.
Expected input columns (typical from modkit):
read_id,forward_read_position,
ref_position, ref_mod_strand,
chrom, read_length, query_kmer,
call_prob, call_code, base_qual,
flag.
After the .ch3 files are written, point
make_mod_db() at the directory containing them (or pass
their paths directly), and the package will take care of the rest.
# Convert a modkit TSV to a compressed .ch3 archive
make_mod_archive(
file_name = "/path/to/modkit_calls.tsv", # tab-delimited input
sample_name = "Astrocytes", # used in output filenames
out_path = "/path/to/ch3_out", # directory to write .ch3 files
short_ids = TRUE # shorten read_id to reduce size
)
# The directory now contains e.g.: Astrocytes-0.ch3, Astrocytes-1.ch3, ...
# Build a DB from those .ch3 files:
make_mod_db(
ch3_files = "/path/to/ch3_out", # a directory OR a vector of file paths
db_name = "astro_db"
)Notes & tips
- If your input file contains unplaced minus-strand rows at position 0 (common in some extract outputs), those are filtered automatically.
- If you already have concise read IDs, set short_ids = FALSE.
- You can run make_mod_archive() for each sample TSV, all writing into the same out_path; make_mod_db() can then ingest the whole directory in one call.
- The .ch3 files are standard Parquet—fast to read, compact on disk, and splittable for parallel import.
Loading Data
The first step of a workflow will be to upload data from a directory holding sample sequencing data in the form of CH3 files. These files are typically created by Wasatch Biolabs.
Once data is obtained in this format, the first step is done using
the make_mod_db() function. This function directly imports
a directory of CH3 files, and creates a database in the form of a local
file.
make_mod_db
make_mod_db() builds the DuckDB database from one or
more .ch3 Parquet files.
It requires two arguments: ch3_files, which can be a
directory or a vector of file paths, and db_name, which is
the path where the DuckDB file will be created (the .ch3.db suffix is
added if missing).
You may optionally pass a named character vector to assign explicit
sample_names; otherwise, sample_name is
derived from the file’s basename by removing a trailing -chrom (vector of chromosomes to keep),
min_read_length (default 50), min_call_prob
(default 0.9), min_base_qual (default 10), and an optional
flag filter. The function configures DuckDB pragmas for
performance (temp directory, threads ≈ all-but-one core, memory ≈ 50%
RAM), drops any existing tables, and creates a calls table
with: sample_name, chrom, start,
end,read_position, call_code,
read_length, call_prob,
base_qual, and flag.
Here is an example:
# Example: directory of CH3 files; auto-derived sample names
ch3_files <- system.file("TEST_DATA", package = "ModSeqR") # adjust as needed
mod_db <- tempfile("example_db")
make_mod_db(ch3_files, mod_db)
# Example: explicit sample names override the fallback
make_mod_db(
ch3_files = c(
sample1 = "/data/sample1-0.ch3",
sample2 = "/data/sample2.ch3"
),
db_name = "named_samples"
)After this step, a database is made and the raw calls table is
included. A user may continue their analysis in any direction they want.
They may run a quality control (run_mod_qc()), summarize
data by positions (summarize_mod_positions()), regions
(summarize_mod_regions()), or windows
(summarize_mod_windows()). At any point in the analysis, a
user can export a table in the database by calling
export_mod_tables(), or use get_mod_table()to
import as a tibble into the local environment. All these functions will
be explained below.
summarize_mod_positions
This function summarizes methylation data (from the calls table) by each unique position in the genome. A table called “positions” will be built into the database.
summarize_mod_positions() aggregates the
calls table at each genomic position for each sample and
writes a positions table. It always produces
num_calls, and, for each unique modification you request, a
<label>_counts and <label>_frac
column (ex. methylation would be m_counts and m_frac). You control
modification types via mod_code (e.g., c(“m”,“h”,“m + h”)
or a novel “a”).
Labels are formed by removing spaces/“+” (so "m + h"
becomes mh). The unmodified class is defined by
unmod_code (default "-") and named by
unmod_label (default "c", yielding
c_counts and c_frac). Optional
chrs, samples, and min_num_calls
let you filter chromosomes, samples, and minimum coverage per
position.
# Minimal usage: default m/h classes
summarize_mod_positions(mod_db)
# Custom classes, including a novel code 'a', with stricter filtering
summarize_mod_positions(
mod_db,
mod_code = c("a", "m + h"),
unmod_code = "-",
unmod_label = "c",
min_num_calls = 5
)summarize_mod_regions
summarize_mod_regions() joins per-position counts to an
annotation file of regions (BED/TSV/CSV with chrom,
start, end, and optional
region_name) and writes a regions table. If
region_name is missing, it is synthesized as
chrom_start_end.
You choose the join type via
join = "inner" | "left" | "right"; inner keeps only
overlapping positions/regions, left keeps all regions (filling zeros
where appropriate), and right keeps all positions with region
matches.
As with positions, you specify any set of mod_code
values (single or combined like “m + h”), unmod_code,
unmod_label, and filters (chrs,
samples, min_num_calls). Output columns
include sample_name, region_name,
chrom, start, end,
num_CpGs (number of positions aggregated),
num_calls, and the
<label>_counts/<label>_frac columns for all
requested labels.
region_bed = system.file("Islands_hg38_test.csv", package = "ModSeqR")
# Minimal call
summarize_mod_regions(mod_db, region_file = region_bed)
# Custom call
summarize_mod_regions(
mod_db,
region_file = region_bed,
join = "inner",
mod_code = c("m","h","m + h"), # supports novel codes too, e.g., "a"
min_num_calls = 5
)summarize_mod_windows
summarize_mod_windows() produces sliding/offset windows
from per-position counts and writes a windows table.
By default it tiles with 1,000 bp windows and 10 bp offsets so
positions are assigned to multiple staggered windows. The output
contains num_CpGs, num_calls, and
<label>_counts/<label>_frac for all requested
modification labels (as defined by mod_code,
unmod_code, and unmod_label).
You can adjust window_size, step_size,
min_num_calls, limit to particular chrs or
samples, and choose to overwrite an existing
table. Default overwrite is TRUE.
# 100 bp windows, non-overlapping (step_size = window_size)
summarize_mod_windows(mod_db, window_size = 100, step_size = 100)
# 2 kb windows with offsets; include a novel 'a' class
summarize_mod_windows(
mod_db,
window_size = 2000,
step_size = 20,
mod_code = c("a", "m + h"),
min_num_calls = 25,
overwrite = TRUE
)summarize_mod_reads
This function summarizes methylation data from a mod_db database by the reads. A table called “reads” will be built into the database.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Calculate windows
summarize_mod_reads(mod_db)There is only 1 required arguments. mod_db is a list
containing the database file path. This should be a string representing
a path to the database, or a valid “ch3_db” class object. Optionally,
key_table is a path to an external table which can be used
to filter for unique windows of interest. It should have a “chrom”,
“start”, and “end” column to specify which windows are of interest. The
reads table will be filtered for just those windows from the
key_table.Lastly, min_CGs specifies the minimum number of
CG sites for a read to be included. If a read has less than that
argument, it will not be included in the table.
Convenience utilities (inspect/rename/tidy)
Sometimes you just want to peek at what’s in your database, check a table’s shape, or quickly fix sample names. The helpers below are lightweight wrappers around DuckDB/DBI that do exactly that.
get_mod_dbinfo()
Prints a quick summary of the database: on-disk size, a list of tables, and the distinct sample_names (if a calls table exists). Invisibly returns the closed ch3_db object.
# Path or ch3_db object both work
get_mod_dbinfo("my_data.ch3.db")get_mod_tableinfo()
Summarizes a single table: total rows, per-sample row counts (if the table has a sample_name column), and a list of columns. Good first check after creating a table.
get_mod_tableinfo("my_data.ch3.db", table_name = "positions")get_mod_cols()
Returns (and prints) the column names for a given table. Handy when you’re about to write a query and can’t remember exact column names.
get_mod_cols("my_data.ch3.db", "windows")get_mod_cpg_count()
Counts unique CpG sites in a table, where uniqueness is defined by distinct (start, end) pairs. Prints a small summary and invisibly returns the count.
n_cpg <- get_mod_cpg_count("my_data.ch3.db", table_name = "calls")rename_mod_samples()
Renames sample_names directly in a specified table (e.g., positions, windows, regions, calls). You provide a mapping from old → new names as a named character vector (c(old = “new”, …)) or a two-column data frame with old and new. The function validates inputs, updates in place, and (optionally) prints a before/after list of distinct names.
# Rename in the 'positions' table
rename_mod_samples(
"my_data.ch3.db",
table = "positions",
samples_map = c("Cortical_Neurons" = "Cortex", "Blood" = "PBMC"),
preview = TRUE
)remove_mod_table()
Deletes a table by name if it exists, with a clear message either way (no error if the table is missing). Useful to clean up experiments.
remove_mod_table("my_data.ch3.db", "tmp_debug_table")Tip: All helpers accept either a path to the .ch3.db file or an existing ch3_db object. They open a connection, do the work, and close it for you.
Differential Methylation
calc_mod_diff
This function calculates differential methylation between specified case and control groups using various statistical methods.
Differential methylation can be calculated from positional, regional, or window methylation data.
Calling this function creates a unique mod_diff_* table
within the database, where differential methylation results are
stored.
The table name depends on the input call_type:
call_type = "positions"→mod_diff_positionscall_type = "regions"→mod_diff_regionscall_type = "windows"→mod_diff_windows
Calling calc_mod_diff() again on the same type of data
will overwrite the existing mod_diff_{type} table
(e.g. mod_diff_windows will be rewritten if you run it
again on window data).
Automatic choice of statistical test (calc_type)
The argument calc_type controls which statistical method
is used to test for differential methylation. It can be set manually to
one of:
"wilcox"– Wilcoxon rank-sum test on per-sample methylation fractions"fast_fisher"– fast Fisher’s exact test on aggregated counts"r_fisher"– Fisher’s exact test via base R"log_reg"– logistic regression–based test
If calc_type is left as NULL (the default),
the function automatically chooses an appropriate method based on how
many samples are in each group:
-
If there is more than one sample in either the case or
control group
-
calc_typeis set to"wilcox" - The function compares the distribution of methylation fractions across samples between groups.
-
-
If there is exactly one sample in the case group and one in
the control group
-
calc_typeis set to"fast_fisher" - The function aggregates counts and performs a Fisher’s exact test on the resulting 2×2 contingency table.
-
A message is printed indicating which method was chosen.
Why this behavior?
- With replicates (more than one sample in a group), the Wilcoxon rank-sum test is preferred because it uses the variation between samples and compares the distributions of methylation fractions across samples. This better reflects biological variability.
- With only one sample per group, there is no between-sample variation to model. In that case, collapsing to counts and using Fisher’s exact test is more appropriate and statistically well-defined.
You can always override this behavior by explicitly setting
calc_type yourself.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Get methylation statistics for the 'positions' call type without plotting
calc_mod_diff(mod_db = mod_db,
call_type = "positions",
cases = "blood",
controls = "sperm")collapse_mod_windows
This function only works on a mod_diff_windows table already in the database. After a differential methylation analysis is called on windows, this will collapse significant windows by merging contiguous regions that meet the specified criteria.
The only required parameter is the link to the database. Optionally,
a user may specify the following arguments.max_distanceis
the maximum allowable distance between consecutive significant windows
for merging (default: 1000). sig_cutoff is the significance
threshold for adjusted p-values (default: 0.05). Lastly,
min_diff specifies the minimum absolute methylation
difference required for inclusion in the analysis (default: 0.5).
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Collapse significant windows
collapse_mod_windows("my_data.ch3.db")run_mod_analysis
This function automates an entire workflow in one command. It identifies differences in DNA methylation patterns between two groups (cases and controls) by summarizing methylation data at various genomic scales (positions, regions, or windows) and then performing statistical tests, outputting the results into organized CSV files within one directory.
run_mod_analysis() creates a new subdirectory named
“Mod_Diff_Analysis_Results” within the out_path where all the data will
be written to. It writes three CSV files into the
“Mod_Diff_Analysis_Results” directory by directly copying tables from
the DuckDB database:
mod_diff.csv: Contains the results of the differential analysis (e.g., p-values, fold changes) for each position, region, or window. The source table in the database is determined by diff_table_name.
All_CpGs.csv: Contains all the summarized methylation data for the chosen call_type (positions, regions, or windows) across all samples. The source table in the database is determined by call_type.
Sig_CpGs.csv: Contains summarized methylation data for only those positions/regions/windows that are deemed significantly differentially methylated, based on the p_val_max threshold.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Collapse significant windows
run_mod_analysis(mod_db,
out_path = "/Users/analysis",
call_type = "windows",
cases = c("sperm"),
controls = c("blood"))Quality Control
A variety of quality control functions are available to visually assess methylation data.
plot_mod_cov
This function calculates and optionally plots statistics for coverage data from methylation sequencing experiments. It can handle both positional and regional methylation data.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Get coverage statistics for the 'positions' call type
plot_mod_cov(mod_db = mod_db, call_type = "positions")plot_mod_modfrac
This function retrieves and calculates methylation statistics (mean methylation fractions) from a specified table in the mod database. It can either return summary statistics or plot a histogram of the methylation values, depending on the user’s preference.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Get methylation statistics for the 'positions' call type without plotting
plot_mod_modfrac(mod_db = mod_db, call_type = "positions")calc_mod_samplecor
This function calculates and optionally plots a correlation matrix for methylation or other modification fraction data from genomic positions. It can handle both position-based and region-based calls and supports visualization using ggplot2.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Run the correlation matrix function using the 'positions' call type and plot the results
calc_mod_samplecor(mod_db = mod_db, call_type = "positions", plot = TRUE)plot_mod_pca
This function performs Principal Component Analysis (PCA) on methylation data retrieved from the database.It aggregates the methylation fraction data based on the specified call type and prepares it for PCA analysis.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Calculate PCA
plot_mod_pca(mod_db)run_mod_qc
Combining all functions above, the `run_mod_qc()`` function computes all QC checks on methylation data at the same time.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Calculate windows
run_mod_qc(mod_db)Retrieving and Exporting Tables
You can access data in your database with two different functions.
get_mod_table
To retrieve a single table into your local environment for further
analysis within R, using get_mod_table will transform it
into a tibble and store it into a variable. Arguments required includes
the database, and name of the table to be retrieved.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Get any table in your database and create a variable in R to work with
positions = get_mod_table("my_data.ch3.db", "positions")
regions = get_mod_table("my_data.ch3.db", "regions")
mod_diff_regions = get_mod_table("my_data.ch3.db", "mod_diff_regions")
collapsed_windows = get_mod_table("my_data.ch3.db", "collapsed_windows")export_mod_table
To export one or more tables into a CSV file, use
export_mod_table. This function takes three arguments, the
database, which tables to export, and the path where the CSV files will
be saved.
# Specify the path to the database
mod_db <- system.file("my_data.ch3.db", package = "ModSeqR")
# Export the table to your computer
export_mod_table("my_data.ch3.db",
table = "windows",
out_path = "/Desktop/My_Folder/windows.csv")
export_mod_table("my_data.ch3.db",
table = "mod_diff_windows",
out_path = "/Desktop/My_Folder/mod_diff_windows.csv")Conclusion
In this vignette, we have walked you through the basic functions in
the ModSeqR package. This work is a work in progress and we
hope to improve its functionality. If you have any suggestions or
requested features, please email Jonathon Hill at jhill@byu.edu.