Skip to contents

This function classifies reads in the `reads` table of a DuckDB database as either `"case"`, `"control"`, or `"unknown"` based on similarity to reference methylation fractions in a `key_table` (e.g., a collapsed windows file). Classification is based on how close the read's `mh_frac` is to the average case or control `mh_frac`, within a user-defined `meth_diff_threshold`.

Usage

classify_mod_reads(
  mod_db,
  table_name = "classified_reads",
  reads_table = "reads",
  key_table,
  case,
  control,
  meth_diff_threshold = 0.1
)

Arguments

mod_db

Path to a DuckDB `.db` file created by this package (e.g., from `summarize_reads()`).

table_name

Character. Name of the output table to store classified reads (default: "classified_reads").

reads_table

Character. Name of the reads table in which you want to classify the reads (default: "reads").

key_table

Path to a CSV, TSV, or BED file generated by `collapse_windows()`. Must include the columns: `chrom`, `start`, `end`, `avg_mh_frac_control`, `avg_mh_frac_case`, and `avg_meth_diff`.

case

Character string used to label case reads (e.g., `"case"`).

control

Character string used to label control reads (e.g., `"control"`).

meth_diff_threshold

Numeric value specifying the maximum difference in `mh_frac` allowed to match either case or control averages. Must be less than half the minimum absolute value of `avg_meth_diff` to prevent ambiguous classifications.

Value

Invisibly returns the open database connection with a new table named `classified_reads` added to the database. This table includes:

  • sample_name – Sample identifier

  • read_id – Unique read identifier

  • first_cpg_pos – First CpG position of the read

  • last_cpg_pos – Last CpG position of the read

  • mh_frac – Methylation fraction of the read

  • classification – `"case"`, `"control"`, or `"unknown"`

Details

This function runs entirely in SQL for scalability. It performs an interval join between the `reads` table and the key table on `chrom` and CpG position range, then classifies each read based on proximity of `mh_frac` to either `avg_mh_frac_control` or `avg_mh_frac_case`.

Examples

if (FALSE) { # \dontrun{
classify_mod_reads(
  mod_db = "my_data.mod.db",
  key_table = "key_table.csv",
  case = "treated",
  control = "untreated",
  meth_diff_threshold = 0.1
)
} # }