Clustering Tutorial

This tutorial walks through intron clustering from RNA-seq junction files using leafcutter-cluster.

Prerequisites

Aligned RNA-seq BAM files (or pre-extracted junction files)
A file listing junction file paths (one per line)

Intron clustering options:

Clustering with leafcutter2, which annotates unproductive splicing events.
Clustering with leafcutter-cluster available through this package outlined below.

Step 1: Extract junctions

If starting from BAM files, extract splice junctions using regtools:

regtools junctions extract -s 0 -a 8 -m 50 -M 500000 sample.bam -o sample.junc

Create a file listing all junction files:

ls *.junc > junc_files.txt

Step 2: Run clustering

leafcutter-cluster \
    --juncfiles junc_files.txt \
    --outprefix my_study \
    --minclureads 30 \
    --mincluratio 0.001 \
    --maxintronlen 100000 \
    --rundir output/

Output description: leafcutter-cluster (this workflow)

Counts file (`_perind_numers.counts.gz`)

The main counts file has one row per intron and one column per sample. The first column is the intron ID; subsequent columns are the counts corresponding to each intron junction count. This is the file passed to leafcutter-ds.

chrom                              sample1   sample2   ...
chr1:14829:14970:clu_1_-           348      351
chr1:14829:15021:clu_1_-           0        1
chr1:15947:16607:clu_2_-           13       3

Field	Description
`chr:start:end`	Intron coordinates (1-based, inclusive)
`clu_N`	Cluster ID
`strand`	`+`, `-`, or `NA`
`count`	Reads supporting this intron for that junction

Numerators file (`_perind.counts.gz`)

Same layout, but each sample contains the raw read count / the total read count the cluster for that sample.

File	Description
`my_study_perind_numers.counts.gz`	Per sample junction counts; input to differential splicing
`my_study_perind.counts.gz`	Per-sample intron usage ratios (numerator/total)

Options reference

Option	Default	Description
`-j / --juncfiles`	(required)	Path to a text file listing junction files, one per line
`-o / --outprefix`	`leafcutter`	Prefix for all output file names
`-r / --rundir`	`./`	Directory to write output files
`-l / --maxintronlen`	`100000`	Maximum intron length in bp; longer junctions are discarded
`-m / --minclureads`	`30`	Minimum total reads across all samples for a cluster to be retained
`-p / --mincluratio`	`0.001`	Minimum fraction of cluster reads required to keep an individual intron
`-c / --cluster`	—	Provide a pre-computed refined cluster file to skip the clustering step
`-k / --nochromcheck`	`False`	Disable the check that chromosomes are named `chr1…chrY` or `1…Y`
`-C / --includeconst`	`False`	Also output constitutive introns (used in only one cluster)
`-q / --quiet`	`False`	Suppress progress messages

Output description: leafcutter2

See leafcutter2 github for details.

Output:

leafcutter2.cluster_ratios.gz a table quantifying splice junction read counts for each intron, divided by the total number of reads observed for the intron cluster to which the intron belongs. Each row corresponds to a splice junction, with the first column indicating the splice junction ID and all subsequent columns corresponding to each sample. The splice junction ID has the following format: chr10:134786:179993:clu_1_+:PR, indicating the chromosome, start and end of the splice junction, the LeafCutter intron cluster to which it belongs (in this case, clu_1_+), and a label indicating the splice junction's function: PR (productive/protein-coding), UP (unproductive), NE (ambiguous in their functional effect) or IN (intergenic).
leafcutter2.junction_counts.gz a table similar to leafcutter2.cluster_ratios.gz, except it only keeps track of the splice junction read counts for each intron. This can be used downstream for leafcutter-ds.
clustering/ a directory containing files relevant for clustering and annotation.
- clustering/leafcutter2_clusters contains the intron clusters. Useful for skipping the clustering step in repeated runs.
- Other files documenting stats from the clustering and classification algorithm. Useful for debugging.

Next steps

Proceed to Differential Splicing to test for group differences.