stats
- Count Matrix Statistics
Easy access to informative count matrix statistics. Each of these functions produces two outputs:
- a JSON-formatted file containing relevant statistics in a machine-parsable format
- a human-friendly HTML page displaying the results
All of the commands accept a single counts file as input, with optional
arguments as indicated in the documentation. By default, the JSON and HTML
output files take the basename of the counts file, with its extension
replaced by .json or .html as appropriate. E.g., counts.csv will produce
counts.json and counts.html in the current directory. These default
filenames can be changed for all commands using the optional command line
arguments --json=<json fn> and --html=<html fn>. If <json fn>, either
default or specified, already exists, it is read in, parsed, and
added to. The HTML report is overwritten on every invocation using the
contents of the JSON file.
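The default naming and the read-and-add behaviour described above can be sketched in a few lines of Python. This is an illustration, not detk's own code: the function names default_output_paths and add_module_result are hypothetical, and the choice to replace an earlier entry from the same module (rather than keep both) is an assumption that the documentation does not confirm.

```python
import json
from pathlib import Path


def default_output_paths(counts_fn):
    """Derive the default JSON/HTML names: the counts file's basename
    without its extension, placed in the current directory."""
    stem = Path(counts_fn).stem
    return Path(stem + ".json"), Path(stem + ".html")


def add_module_result(json_path, result):
    """Read an existing JSON array if the file exists, add one module's
    result to it, and write the array back."""
    json_path = Path(json_path)
    modules = json.loads(json_path.read_text()) if json_path.exists() else []
    # ASSUMPTION: rerunning a module replaces its earlier entry.
    modules = [m for m in modules if m.get("name") != result["name"]]
    modules.append(result)
    json_path.write_text(json.dumps(modules, indent=2))
    return modules
```

For example, default_output_paths("data/counts.csv") gives counts.json and counts.html, matching the counts.csv example above.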
JSON output format
The JSON file produced by these modules is formatted as a JSON array containing objects that each correspond to a stats module. For example:
[
    {
        'name': 'base',
        'stats': {
            'num_cols': 50,
            'num_rows': 27143
        }
    },
    {
        'name': 'coldist',
        'stats': {
            'pct' : [ 0, 5, 10, 20, ...],
            'counts' : [
                {
                    'name': 'H_0001',
                    'counts': [ 129, 317, 900, 1325, ...],
                    'frac': [ 0.01, 0.02, 0.04, 0.06, ...]
                },
                {
                    'name': 'H_0002',
                    'counts': [ 502, 127, 222, 591, ...],
                    'frac': [ 0.05, 0.01, 0.02, 0.05, ...]
                }
            ]
        }
    },
    {
        'name': 'rowdist',
        'stats': ...
    },
    ...
]
The object format for each module is described in detail below.
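Because the file is an array of objects that each carry a name and a stats key, a consumer can index it by module name. The sketch below assumes the file on disk is standard JSON (double-quoted strings, unlike the single-quoted notation used in the examples on this page); index_by_module is a hypothetical helper, not part of detk.

```python
import json


def index_by_module(json_text):
    """Map each module name to its stats object."""
    modules = json.loads(json_text)
    return {m["name"]: m["stats"] for m in modules}


# A tiny report containing only the 'base' module, using the values
# from the example above.
report = '[{"name": "base", "stats": {"num_cols": 50, "num_rows": 27143}}]'
stats = index_by_module(report)
print(stats["base"]["num_rows"])
```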
summary
- Summary Statistics
Compute summary statistics on a counts matrix file:
detk-stats [--json=<json_fn>] [--html=<html_fn>] summary <counts file>
This command is equivalent to running each of the stats commands documented below and concatenating the results.
base
- Basic statistics
Usage:
detk-stats base <counts file>
The most basic statistics of the counts file, including:
- number of columns (samples)
- number of rows
Example JSON output:
{
    'name': 'base',
    'stats': {
        'num_cols': 50,
        'num_rows': 27143
    }
}
coldist
- Column-wise distribution of counts
Usage:
detk-stats [options] coldist [--bins=<bins>] [--log] [--density] <counts file>
Options:
    --bins=<bins>  The number of bins to use when computing the counts
                   distribution.
    --log          Perform a log10 transform on the counts before calculating
                   the distribution. Zeros are omitted prior to histogram
                   calculation.
    --density      Return a density distribution instead of counts, such that
                   the values in dist for each column sum to approximately 1.
Compute the distribution of counts column-wise. Each column is subject to binning by percentile, with output identical to that produced by numpy.histogram.
Example JSON output:
{
    'name': 'coldist',
    'stats': {
        'pct' : [ 5, 10, 20, ..., 95 ],
        'dists' : [
            {
                'name': 'H_0001',
                'dist': [ 129, 317, 900, 1325, ...],
                'bins': [ 100, 200, 300, 400, ...],
                'extrema': {
                    'lower': [1, 2, 5],
                    'upper': [19325, 5233]
                }
            },
            {
                'name': 'H_0002',
                'dist': [ 502, 127, 222, 591, ...],
                'bins': [ 6000, 6200, 6400, 6600, ...],
                'extrema': {
                    'lower': [419, 2, 20],
                    'upper': [21999, 74381]
                }
            }
        ]
    }
}
In the stats object, the fields are defined as follows:
- pct
- The percentiles of the distributions in the range 0 < pct < 100, by default in increments of 5. This defines the length of the dist and bins arrays in each of the objects for each sample.
- dists
- Array of objects containing one object for each column, described below.
Each item of dists is an object with the following keys:
- name
- Column name from original file
- dist
- Array of raw or normalized counts in each bin according to the percentiles from pct
- bins
- Array of the bin boundary values for the distribution. Should be of length len(dist)+1. These are what would be the x-axis labels if this were plotted as a histogram.
- extrema
- Object with two keys (lower and upper in the example above) that contain the literal count values lying more than 1.5 times the interquartile range below or above the bulk of the distribution. These could be marked as outliers in a boxplot, for example.
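The relationship between dist, bins, and extrema can be illustrated with a small pure-Python sketch. It is not detk's implementation: histogram and extrema are hypothetical helpers, the equal-width binning stands in for whatever binning the tool applies, and measuring the 1.5 times interquartile range distance from the first and third quartiles (the conventional boxplot rule) is an assumption.

```python
import statistics


def histogram(values, nbins):
    """Equal-width histogram returning (dist, bins), where bins holds the
    bin boundaries and therefore has len(dist) + 1 entries."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0  # avoid zero width for constant data
    bins = [lo + i * width for i in range(nbins + 1)]
    dist = [0] * nbins
    for v in values:
        # The last bin is closed on the right, so the maximum lands in it.
        idx = min(int((v - lo) / width), nbins - 1)
        dist[idx] += 1
    return dist, bins


def extrema(values):
    """Return (low, high): values beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low = sorted(v for v in values if v < q1 - 1.5 * iqr)
    high = sorted(v for v in values if v > q3 + 1.5 * iqr)
    return low, high


dist, bins = histogram(list(range(11)), nbins=5)
low, high = extrema([10, 12, 11, 13, 12, 11, 10, 12, 500])
```

Here bins has six entries for five bins, and the single count of 500 is the only value reported as an upper extreme.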
rowdist
- Row-wise distribution of counts
Usage:
detk-stats [options] rowdist [--bins=<bins>] [--log] [--density] <counts file>
Identical to coldist, except calculated across rows. The name key is
rowdist, and the name key of the items in dists is the row name from the
counts file.
colzero
- Column-wise distribution of zero counts
Usage:
detk-stats [options] colzero <counts fn>
Compute the number and fraction of exact zero counts for each column. Example JSON output:
{
    'name': 'colzero',
    'stats': {
        'zeros' : [
            {
                'name': 'col1',
                'zero_count': 20,
                'zero_frac': 0.2,
                'mean': 101.31,
                'nonzero_mean': 155.23
            },
            {
                'name': 'col2',
                'zero_count': 0,
                'zero_frac': 0,
                'mean': 3021.92,
                'nonzero_mean': 3021.92
            }
        ]
    }
}
The zeros key of the stats object is an array containing one object per column, with the following keys:
- name
- column name
- zero_count
- absolute count of rows with exactly zero counts
- zero_frac
- zero_count divided by the number of rows
- col_mean
- the mean of counts in the column
- nonzero_col_mean
- the mean of only the non-zero counts in the column
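As a minimal sketch of how the per-column values relate to one another, the function below computes them for a single column. The name zero_stats is hypothetical, the key names of the two means follow the example JSON output above, and returning 0.0 for the non-zero mean of an all-zero column is an assumption; the documentation does not say what the tool reports in that case.

```python
def zero_stats(name, counts):
    """Zero-count statistics for one column of a counts matrix."""
    n = len(counts)
    nonzero = [c for c in counts if c != 0]
    zero_count = n - len(nonzero)
    return {
        "name": name,
        "zero_count": zero_count,
        "zero_frac": zero_count / n,
        "mean": sum(counts) / n,
        # ASSUMPTION: an all-zero column reports a non-zero mean of 0.0.
        "nonzero_mean": sum(nonzero) / len(nonzero) if nonzero else 0.0,
    }


col = zero_stats("col1", [0, 4, 0, 8])
```

With two zeros out of four rows, zero_frac is 0.5, the mean over all rows is 3.0, and the mean over the non-zero rows is 6.0.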
rowzero
- Row-wise distribution of zero counts
Usage:
detk-stats [options] rowzero <counts fn>
Identical to colzero, only computed across rows instead of columns. The
name key is rowzero, and the name key of the items in zeros is the
row name from the counts file.
entropy
- Row-wise sample entropy calculation
Usage:
detk-stats [options] entropy <counts fn>
Sample entropy is a metric that can be used to identify outlier samples by locating rows which are overly influenced by a single count value. This metric can be calculated for a single row as follows:
\[p_i = \frac{c_i}{\sum_j c_j}, \qquad H = -\sum_i p_i \log_2 p_i\]
Here, \(c_i\) is the number of counts in sample \(i\), \(p_i\) is the fraction of reads contributed by sample \(i\) to the overall counts of the row, and \(H\) is the Shannon entropy of the row when using \(\log_2\). The maximum value possible for \(H\) is \(\log_2 m\), where \(m\) is the number of samples; it is reached when the counts are spread evenly across all samples (for example, \(H = 2\) for four samples).
A very low \(H\) indicates that a row has most of its count mass contained in a small number of columns. These are rows that are likely to drive outliers in downstream analysis, e.g. differential expression.
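The calculation for a single row can be sketched directly from the definitions of \(c_i\), \(p_i\), and \(H\) given above. The function name row_entropy is hypothetical, and two conventions are assumptions rather than documented behaviour: zero counts contribute nothing to the sum, and a row whose counts are all zero is given an entropy of 0.0.

```python
import math


def row_entropy(counts):
    """Shannon entropy (log base 2) of one row of a counts matrix."""
    total = sum(counts)
    if total == 0:
        return 0.0  # ASSUMPTION: an all-zero row has entropy 0
    h = 0.0
    for c in counts:
        if c > 0:  # terms with p = 0 are treated as contributing 0
            p = c / total
            h -= p * math.log2(p)
    return h


even = row_entropy([5, 5, 5, 5])      # counts spread evenly over 4 samples
skewed = row_entropy([100, 0, 0, 0])  # all counts in a single sample
```

The evenly spread row has an entropy of 2.0, while the row dominated by one sample has an entropy of 0.0, which is the kind of row the text above flags as likely to drive outliers.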
Example JSON output:
{
    'name': 'entropy',
    'stats': {
        'entropies': [
            {
                'name': 'row1',
                'entropy': 1.013
            },
            {
                'name': 'row2',
                'entropy': 0.001
            }
        ]
    }
}
The key entropies is an array containing one object per row with the following keys:
- name
- row name from counts file
- entropy
- the value of \(H\) calculated as above for that row
pca
- Principal Component Analysis
Usage:
detk-stats [options] pca [-m <column data fn> -f <column_variable> [-f <column_variable>]] <counts fn>
This module performs Principal Component Analysis (PCA) on an \(n \times m\) counts matrix, where \(n\) is the number of rows (genes) and \(m\) is the number of columns (samples). Briefly, PCA identifies the directions (e.g. genes and their magnitudes) of maximal variance in a dataset. The output of PCA is a set of principal components, where each principal component consists of an \(m\)-length vector of weights or loadings and an \(n\)-length vector of scores. Each principal component describes a percentage of the overall variance of the dataset. There are exactly \(m\) principal components identified by a PCA, but typically only a small subset of these components explains a large amount of the variance in a real dataset.
This module performs PCA on a provided counts matrix and returns the principal component weights, scores, and variances. In addition, the weights and scores for each individual component can be combined to define the projection of each sample along that component. Commonly, projections of each sample against each principal component can be used to identify outlier samples, batch effects, sample group, etc. by describing how each sample contributes to the variance in each component. Therefore, the projections for each sample for each component are also included in the output.
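To make the idea of a component, its share of the variance, and the per-sample projections concrete, here is a small pure-Python sketch that finds only the first principal component by power iteration. It is not the method detk uses: the function name first_component is hypothetical, treating samples (columns) as observations and centering each row without scaling are assumptions, and the sketch makes no claim about how its outputs map onto the weights and scores keys of the module's output.

```python
import math


def first_component(matrix, iters=200):
    """matrix: list of rows (genes), each a list of per-sample counts.
    Returns (direction, projections, frac_variance) for the first PC:
    one direction value per row, one projection per column."""
    n, m = len(matrix), len(matrix[0])
    # Center each row across samples (ASSUMPTION: no scaling applied).
    centered = []
    for row in matrix:
        mean = sum(row) / m
        centered.append([v - mean for v in row])
    # Covariance between rows (n x n), fine for small illustrations only.
    cov = [[sum(a * b for a, b in zip(ri, rj)) / (m - 1) for rj in centered]
           for ri in centered]
    total_var = sum(cov[i][i] for i in range(n))
    if total_var == 0:
        raise ValueError("matrix has no variance")
    # Power iteration from an arbitrary, non-uniform start vector.
    direction = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iters):
        nxt = [sum(cov[i][j] * direction[j] for j in range(n))
               for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # start vector happened to carry no variance
        direction = [x / norm for x in nxt]
    variance = sum(direction[i] * sum(cov[i][j] * direction[j]
                                      for j in range(n)) for i in range(n))
    # Projection of each sample onto the component direction.
    projections = [sum(direction[i] * centered[i][s] for i in range(n))
                   for s in range(m)]
    return direction, projections, variance / total_var


# Two perfectly correlated rows: one component carries all the variance.
d, proj, frac = first_component([[1, 2, 3, 4], [2, 4, 6, 8]])
```

In this toy input the second row is exactly twice the first, so the first component explains essentially all of the variance and the four sample projections are ordered along a single line. A real analysis would use a full decomposition (for instance a singular value decomposition) to obtain every component at once.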
Experimental: The PCA module can also accept a metadata file that contains information about the samples in each column. The user can specify some of these columns to include as variables for plotting purposes. The idea is that columns labeled with the same class will be colored according to their class, such that separations in the data can be more easily observed when projections are plotted.
Example JSON output:
{
    'name': 'pca',
    'stats': {
        'column_names': ['sample1','sample2',...],
        'column_variables': {
            'sample_type':['HD','HD','C',...],
            'sample_batch':['Batch1','Batch2','Batch2',...]
        },
        'components': [
            {
                'name': 'PC1',
                'scores': [0.126,0.975,...], # length n
                'projections': [-8.01,5.93,...], # length m, ordered by 'column_names'
                'perc_variance': 0.75
            },
            {
                'name': 'PC2',
                'scores' : [0.126,0.975,...], # length n
                'projections': [5.93,-5.11,...], # length m
                'perc_variance': 0.22
            }
        ]
    }
}
The HTML output for this module contains three plots: a scree plot, a set of line plots containing the sample projections, and an interactive scatter plot where the user can choose which principal component projections to plot on the X and Y axes. Both types of projection plot are also interactive, allowing the user to select which column variable to use for coloring the plotted projection points.