stats - Count Matrix Statistics
- Tabular output format
- JSON output format
- API Documentation
  - base - Basic statistics
  - coldist - Column-wise counts distributions
  - rowdist - Row-wise counts distributions
  - colzero - Column-wise statistics on zero counts
  - rowzero - Row-wise statistics on zero counts
  - entropy - Row-wise sample entropy calculation
  - pca - Principal component analysis
  - summary - Common statistics set
Easy access to informative count matrix statistics. Each of these functions produces three outputs:
- a tabular form of the statistics, formatted either as CSV or as a human-readable table using the terminaltables package
- a JSON-formatted file containing relevant statistics in a machine-parsable format
- a human-friendly HTML page displaying the results
All of the commands accept a single counts file as input, with optional arguments as indicated in the documentation of each subtool. By default, the JSON and HTML output files have the same basename as the counts file, with the extension replaced by .json or .html as appropriate. E.g., counts.csv will produce counts.json and counts.html in the current directory. These default filenames can be changed for all commands using the optional command line arguments --json=<json fn> and --html=<html fn>.
If <json fn>, either default or specified, already exists, it is read in, parsed, and added to. The HTML report is overwritten on every invocation using the contents of the JSON file.
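The default naming rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the toolkit's own code; the helper name `default_report_names` is hypothetical.

```python
from pathlib import Path


def default_report_names(counts_fn):
    """Return (json_fn, html_fn) derived from a counts file name.

    Hypothetical helper: it drops the directory and the final extension,
    then appends .json and .html, so the files land in the current directory.
    """
    stem = Path(counts_fn).stem  # "data/counts.csv" -> "counts"
    return f"{stem}.json", f"{stem}.html"


json_fn, html_fn = default_report_names("data/counts.csv")
print(json_fn, html_fn)
```

Passing --json or --html on the command line replaces the corresponding default name.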
Tabular output format
Each tool prints out the statistics it calculates to standard output by default. The standard output format is comma separated values, e.g.:
$ detk-stats base test_counts.csv
stat,val
num_cols,3
num_rows,4
If desired, the -f table argument may be passed to pretty-print the table instead:
$ detk-stats base -f table test_counts.csv
+base------+-----+
| stat     | val |
+----------+-----+
| num_cols | 3   |
| num_rows | 4   |
+----------+-----+
The summary module is slightly different, as it executes multiple subtools. The CSV output of the summary module adds a line starting with # before each different output:
$ detk-stats summary --bins=2 test_counts.csv
#base
stat,val
num_cols,3
num_rows,4
#coldist
colname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
a,55.0,100.0,2.0,2.0
b,5500.0,10000.0,2.0,2.0
c,550000.0,1000000.0,2.0,2.0
#rowdist
rowname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
gene1,50005.0,100000.0,2.0,1.0
The pretty-printed output simply outputs each table serially.
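Because each table in the summary CSV output is introduced by a line starting with #, the stream can be split back into separate tables. The following is a minimal sketch using only the Python standard library; the function name `split_summary_csv` is hypothetical, and the sample text is abbreviated from the example above.

```python
import csv
import io


def split_summary_csv(text):
    """Split summary CSV output into {section name: list of row dicts}."""
    sections = {}
    name, lines = None, []

    def flush():
        # Parse the lines collected for the current section, if any.
        if name is not None:
            reader = csv.DictReader(io.StringIO("\n".join(lines)))
            sections[name] = [dict(row) for row in reader]

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            name, lines = line[1:].strip(), []
        elif line.strip():
            lines.append(line)
    flush()
    return sections


example = """#base
stat,val
num_cols,3
num_rows,4
#coldist
colname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
a,55.0,100.0,2.0,2.0
"""

tables = split_summary_csv(example)
print(tables["base"])
```

All values come back as strings, so numeric columns need an explicit conversion before use.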
JSON output format
The JSON file produced by these modules is formatted as a JSON array containing objects that each correspond to a stats module. For example:
[
    {
        'name': 'base',
        'stats': {
            'num_cols': 50,
            'num_rows': 27143
        }
    },
    {
        'name': 'coldist',
        'stats': {
            'dists': [
                {
                    'name': 'H_0001',
                    'dist': [ [5, 129], [103, 317], ...],
                    'percentiles': [ [0, 193], [1, 362], ...]
                },
                {
                    'name': 'H_0002',
                    'dist': [ [6, 502], [122, 127], ...],
                    'percentiles': [ [0, 6000], [1, 6200], ...]
                }
            ]
        }
    },
    {
        'name': 'rowdist',
        'stats': ...
    },
    ...
]
The example above has been pretty-printed for visibility; the actual output is written to a single line. The object format for each module is described in detail below.
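Since the file is an array of objects keyed by name, a consumer will usually want to index it by module name. The sketch below assumes the file on disk is valid JSON (double-quoted strings, unlike the single-quoted illustrations in this document); the helper name `stats_by_module` and the sample string are made up for the example.

```python
import json


def stats_by_module(json_text):
    """Map each module name to its 'stats' object (hypothetical helper)."""
    return {entry["name"]: entry["stats"] for entry in json.loads(json_text)}


# Single-line sample shaped like the base entry shown above.
report = '[{"name": "base", "stats": {"num_cols": 50, "num_rows": 27143}}]'

base = stats_by_module(report)["base"]
print(base["num_cols"], base["num_rows"])
```

If a module name appears more than once in the array, this sketch keeps only the last entry.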
API Documentation
base - Basic statistics
class de_toolkit.stats.BaseStats(count_mat)
Basic statistics of the counts file
The most basic statistics of the counts file, including:
- number of columns
- number of rows
output
Example output:
+basestats-+-----+
| stat     | val |
+----------+-----+
| num_cols | 4   |
| num_rows | 3   |
+----------+-----+
Command line usage:
Usage:
detk-stats base [options] <counts_fn>
Options:
-o FILE --output=FILE Destination of primary output [default: stdout]
-f FMT --format=FMT Format of output, either csv or table [default: csv]
--json=<json_fn> Name of JSON output file
--html=<html_fn> Name of HTML output file
coldist - Column-wise counts distributions
class de_toolkit.stats.ColDist(count_mat, bins=100, log=False, density=False)
Column-wise distribution of counts
Compute the distribution of counts column-wise. Each column is subject to binning by percentile, with output identical to that produced by np.histogram.
Parameters:
- count_mat (CountMatrix) – count matrix containing counts
- bins (int) – number of bins to use when computing distribution
- log (bool) – take the log10 of counts+1 prior to computing distribution
- density (bool) – return densities rather than absolute bin counts for the distribution, densities sum to 1
output
Tabular output is a table with four columns per input counts column:
- bin start value (column name: sampleA__binstart)
- number of features with counts or density in bin (sampleA__bincount)
- percentile increment (i.e. 0, 1, etc) (sampleA__pct)
- percentile value for corresponding percentile (sampleA__pctVal)
properties
In the properties object, the fields are defined as follows:
- dists: Array of objects containing one object for each column, described below.
Each item of dists is an object with the following keys:
- name: Column name from original file
- dist: Array of (bin start, count) pairs defining the counts histogram
- percentile: Array of (percentile, count) pairs defining the counts percentiles
Example JSON properties output:
{
    'dists': [
        {
            'name': 'H_0001',
            'dist': [ [5, 129], [103, 317], ...],
            'percentiles': [ [0, 193], [1, 362], ...]
        },
        {
            'name': 'H_0002',
            'dist': [ [6, 502], [122, 127], ...],
            'percentiles': [ [0, 6000], [1, 6200], ...]
        }
    ]
}
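To make the roles of bins, log and density concrete, here is a rough sketch of how a single column could be turned into (bin start, count) pairs with np.histogram. It is an illustration under stated assumptions, not the ColDist implementation: the function name `column_distribution` is hypothetical, the log option follows the log10(counts+1) wording of the parameter list, and density is computed by dividing by the total so that the values sum to 1.

```python
import numpy as np


def column_distribution(values, bins=20, log=False, density=False):
    """Return [bin start, height] pairs for one column of counts."""
    values = np.asarray(values, dtype=float)
    if log:
        # Assumption: log10 of counts + 1, as in the parameter description.
        values = np.log10(values + 1)
    counts, edges = np.histogram(values, bins=bins)
    # Assumption: "density" means heights that sum to 1 across bins.
    heights = counts / counts.sum() if density else counts
    return [[float(start), float(h)] for start, h in zip(edges[:-1], heights)]


print(column_distribution([0, 1, 2, 3], bins=2))
print(column_distribution([0, 1, 2, 3], bins=2, density=True))
```

Note that np.histogram's own density=True option normalizes by bin width so that the integral is 1, which is a different quantity from the sum-to-1 heights used here.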
Command line usage:
Usage:
detk-stats coldist [options] <counts_fn>
Options:
--bins=N The number of bins to use when computing the counts
distribution [default: 20]
--log Perform a log10 transform on the counts before
calculating the distribution. Zeros are omitted
prior to histogram calculation.
--density Return a density distribution instead of counts,
such that the sum of values in *dist* for each
column approximately sum to 1.
-o FILE --output=FILE Destination of primary output [default: stdout]
-f FMT --format=FMT Format of output, either csv or table [default: csv]
--json=<json_fn> Name of JSON output file
--html=<html_fn> Name of HTML output file
rowdist - Row-wise counts distributions
class de_toolkit.stats.RowDist(count_obj, bins=100, log=False, density=False)
Row-wise distribution of counts
Identical to coldist except calculated across rows. The name key is rowdist, and the name key of the items in dists is the row name from the counts file.
Parameters:
- count_mat (CountMatrix) – count matrix containing counts
- bins (int) – number of bins to use when computing distribution
- log (bool) – take the log10 of counts prior to computing distribution
- density (bool) – return densities rather than absolute bin counts for the distribution, densities sum to 1
output
Tabular output is a table where each row corresponds to a row of the counts file, with the row name as the first column. The next columns are broken into two parts:
- the bin start values, named like bin_N, where N is the percentile
- the bin count values, named like dist_N, where N is the percentile
Command line usage:
Usage:
detk-stats rowdist [options] <counts_fn>
Options:
--bins=N The number of bins to use when computing the counts
distribution [default: 20]
--log Perform a log10 transform on the counts before calculating
the distribution. Zeros are omitted prior to histogram
calculation.
--density Return a density distribution instead of counts, such that
the sum of values in *dist* for each row approximately
sum to 1.
-o FILE --output=FILE Destination of primary output [default: stdout]
-f FMT --format=FMT Format of output, either csv or table [default: csv]
--json=<json_fn> Name of JSON output file
--html=<html_fn> Name of HTML output file
colzero - Column-wise statistics on zero counts
class de_toolkit.stats.ColZero(count_mat)
Column-wise distribution of zero counts
Compute the number and fraction of exact zero counts for each column.
output
Tabular output is a table where each row corresponds to a column with the following fields:
- name: Column name
- zero_count: Number of zero counts
- zero_frac: Fraction of zero counts
- mean: Overall mean count
- median: Overall median count
- nonzero_mean: Mean of non-zero counts only
- nonzero_median: Median of non-zero counts only
properties
The stats value is an array containing one object per column as follows:
- name: column name
- zero_count: absolute count of rows with exactly zero counts
- zero_frac: zero_count divided by the number of rows
- col_mean: the mean of counts in the column
- col_median: the median of counts in the column
- nonzero_col_mean: the mean of only the non-zero counts in the column
- nonzero_col_median: the median of only the non-zero counts in the column
Example JSON output:
{
    'zeros': [
        {
            'name': 'col1',
            'zero_count': 20,
            'zero_frac': 0.2,
            'mean': 101.31,
            'median': 31.31,
            'nonzero_mean': 155.23,
            'nonzero_median': 55.18
        },
        {
            'name': 'col2',
            'zero_count': 0,
            'zero_frac': 0,
            'mean': 3021.92,
            'median': 329.23,
            'nonzero_mean': 3021.92,
            'nonzero_median': 329.23
        }
    ]
}
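The per-column fields can be reproduced with the standard library alone. This sketch is not the ColZero code; the function name `column_zero_stats` is hypothetical, the key names follow the tabular field list, and returning None when a column has no non-zero counts is an assumption made for the example.

```python
from statistics import mean, median


def column_zero_stats(name, counts):
    """Zero-count statistics for one column of counts."""
    nonzero = [c for c in counts if c != 0]
    zero_count = len(counts) - len(nonzero)
    return {
        "name": name,
        "zero_count": zero_count,
        "zero_frac": zero_count / len(counts),
        "mean": mean(counts),
        "median": median(counts),
        # Assumption: report None when every count in the column is zero.
        "nonzero_mean": mean(nonzero) if nonzero else None,
        "nonzero_median": median(nonzero) if nonzero else None,
    }


stats = column_zero_stats("col1", [0, 0, 4, 6])
print(stats)
```

When a column has no zeros, the non-zero mean and median equal the overall mean and median.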
Command line usage:
Usage:
detk-stats colzero [options] <counts_fn>
Options:
-o FILE --output=FILE Destination of primary output [default: stdout]
-f FMT --format=FMT Format of output, either csv or table [default: csv]
--json=<json_fn> Name of JSON output file
--html=<html_fn> Name of HTML output file
rowzero - Row-wise statistics on zero counts
class de_toolkit.stats.RowZero(count_mat)
Row-wise distribution of zero counts
Computes statistics about the mean and median counts of rows by the number of zeros.
output
Tabular output is a table where each row corresponds to rows having a given number of zero columns with the following fields:
- num_zero: the number of zeros for this row
- num_features: the number of features with this number of zeros
- feature_frac: the fraction of features with this number of zeros
- cum_feature_frac: cumulative fraction of features remaining with this number of zeros or fewer
- mean: the mean count mean of genes with this number of zeros
- nonzero_mean: the mean count mean of genes with this number of zeros not including zero counts
- median: the median count median of genes with this number of zeros
- nonzero_median: the median count median of genes with this number of zeros, not including zero counts
properties
The stats value is an array containing one object per number of zeros as follows:
- num_zero: the number of zeros for this group of features
- num_features: the number of features with this number of zeros
- feature_frac: the fraction of features with this number of zeros
- cum_feature_frac: cumulative fraction of features remaining with this number of zeros or fewer
- mean: the mean count mean of genes with this number of zeros
- nonzero_mean: the mean count mean of genes with this number of zeros not including zero counts
- median: the median count mean of genes with this number of zeros
- nonzero_median: the median count mean of genes with this number of zeros, not including zero counts
Example JSON output:
{
    'zeros': [
        {
            'num_zeros': 0,
            'num_features': 14031,
            'feature_frac': .61,
            'cum_feature_frac': .61,
            'mean': 3351.13,
            'nonzero_mean': 3351.13,
            'median': 2125.9,
            'nonzero_median': 2125.9
        },
        {
            'num_zeros': 1,
            'num_features': 5031,
            'feature_frac': .21,
            'cum_feature_frac': .82,
            'mean': 3125.91,
            'nonzero_mean': 3295.4,
            'median': 1825.8,
            'nonzero_median': 1976.1
        }
    ]
}
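The grouping behind num_features, feature_frac and cum_feature_frac can be sketched as follows. This is an illustration only, not the RowZero code: the function name `zero_count_table` is hypothetical, the key num_zero follows the field list (the example above shows num_zeros), and the mean and median fields are left out for brevity.

```python
from collections import Counter


def zero_count_table(rows):
    """Group features (rows) by how many zero counts they contain."""
    zeros_per_row = Counter(sum(1 for c in row if c == 0) for row in rows)
    total = len(rows)
    table, cumulative = [], 0.0
    for num_zero in sorted(zeros_per_row):
        frac = zeros_per_row[num_zero] / total
        cumulative += frac  # features with this many zeros or fewer
        table.append({
            "num_zero": num_zero,
            "num_features": zeros_per_row[num_zero],
            "feature_frac": frac,
            "cum_feature_frac": cumulative,
        })
    return table


rows = [[1, 2, 3], [0, 2, 3], [0, 0, 3], [4, 5, 6]]
for entry in zero_count_table(rows):
    print(entry)
```

The last entry's cum_feature_frac is 1 (up to floating point rounding), since every feature has at most the largest observed number of zeros.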
Command line usage:
Usage:
detk-stats rowzero [options] <counts_fn>
Options:
-o FILE --output=FILE Destination of primary output [default: stdout]
-f FMT --format=FMT Format of output, either csv or table [default: csv]
--json=<json_fn> Name of JSON output file
--html=<html_fn> Name of HTML output file
entropy - Row-wise sample entropy calculation
class de_toolkit.stats.Entropy(count_mat)
Row-wise sample entropy calculation
Sample entropy is a metric that can be used to identify outlier samples by locating rows which are overly influenced by a small number of count values. This metric can be calculated for a single row as follows:
$$p_i = \frac{c_i}{\sum_j c_j}, \qquad \sum_i p_i = 1, \qquad H = -\sum_i p_i \log_2 p_i$$
Here, $c_i$ is the number of counts in sample $i$, $p_i$ is the fraction of reads contributed by sample $i$ to the overall counts of the row, and $H$ is the Shannon entropy of the row when using log2. The maximum possible value of $H$ is $\log_2 n$, where $n$ is the number of samples, reached when every sample contributes the same number of counts.
A very low H indicates that a row has most of its count mass contained in a small number of columns. These are rows that are likely to drive outliers in downstream analysis, e.g. differential expression.
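The formula above is short enough to check by hand. Here is a minimal sketch in plain Python; the function name `row_entropy` is hypothetical, and the handling of zero counts (skipped, since p*log2(p) tends to 0) and of all-zero rows (entropy 0) is an assumption rather than documented behavior of the Entropy class.

```python
import math


def row_entropy(counts):
    """Shannon entropy (log base 2) of one row of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an all-zero row gets entropy 0
    # Zero counts are skipped because p * log2(p) -> 0 as p -> 0.
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)


print(row_entropy([1, 1, 1, 1]))  # evenly spread over 4 samples: log2(4)
print(row_entropy([0, 0, 0, 5]))  # all mass in one sample
```

With four samples the evenly spread row reaches the maximum of log2(4) = 2, while the row dominated by a single sample has entropy 0.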
output
Tabular output is a table where each row corresponds to a percentile with the following columns:
- pct: percentile of entropy distribution
- pctVal: the entropy value for each percentile
- num_features: the number of features with entropy in the corresponding percentile
- frac_features: the fraction of features with entropy in the corresponding percentile
- cum_frac_features: the cumulative fraction of features with entropy in the corresponding percentile, i.e. the fraction of features with pctVal entropy or lower
- exemplar_feature: the name of a feature with an entropy in the given percentile
properties
The key entropies contains a single object with the following keys:
- pct: percentile of entropy distribution
- pctVal: the entropy value for each percentile
- num_features: the number of features with entropy in the corresponding percentile
- frac_features: the fraction of features with entropy in the corresponding percentile
- cum_frac_features: the cumulative fraction of features with entropy in the corresponding percentile, i.e. the fraction of features with pctVal entropy or lower
- exemplar_features: an array of objects with an exemplar feature for each percentile with the following fields:
  - name: the name of the feature
  - entropy: the sample entropy of the feature
  - counts: array of [column name, count] pairs sorted by count ascending
Example JSON output:
{
    'pct': [0, 1, 2, 3, ...],
    'pctVal': [0, 0.1, 0.5, 0.9, ...],
    'num_features': [10, 12, 23, 100, ...],
    'frac_features': [0.001, 0.0012, 0.0023, 0.01, ...],
    'cum_frac_features': [0.001, 0.0022, 0.0045, 0.0145, ...],
    'exemplar_features': [
        {
            'name': 'ENSG0000055095.1',
            'entropy': 0,
            'counts': [ ['sampleA', 0], ['sampleB', 0], ..., ['sampleN', 1] ]
        },
        {
            'name': 'ENSG0000398715.1',
            'entropy': 0.11,
            'counts': [ ['sampleA', 0], ['sampleB', 0], ..., ['sampleM', 5] ]
        }
    ]
}
Command line usage:
Usage:
detk-stats [options] entropy <counts_fn>
Options:
-o FILE --output=FILE Destination of primary output [default: stdout]
-f FMT --format=FMT Format of output, either csv or table [default: csv]
--json=<json_fn> Name of JSON output file
--html=<html_fn> Name of HTML output file
pca - Principal component analysis
class de_toolkit.stats.CountPCA(count_mat)
Principal component analysis of the counts matrix.
This module performs PCA on a provided counts matrix and returns the principal component weights, scores, and variances. In addition, the weights and scores for each individual component can be combined to define the projection of each sample along that component.
The PCA module can also use a counts matrix that has associated column data information about the samples in each column. The user can specify some of these columns to include as variables for plotting purposes. The idea is that columns labeled with the same class will be colored according to their class, such that separations in the data can be more easily observed when projections are plotted.
output
Tabular output is a table where each row corresponds to a column in the counts matrix with the following fields:
- name: name of the column for the row
- PCX_YY: projections of principal component X (e.g. 1) that explains YY percent of the variance for each column
properties
Example JSON output:
{
    'name': 'pca',
    'stats': {
        'column_names': ['sample1', 'sample2', ...],
        'column_variables': {
            'sample_names': ['sample1', 'sample2', ...],
            'columns': [
                { 'column': 'status', 'values': ['disease', 'control', ...] },
                { 'column': 'batch', 'values': ['b1', 'b1', ...] }
            ]
        },
        'components': [
            {
                'name': 'PC1',
                'scores': [0.126, 0.975, ...],  # length n
                'projections': [-8.01, 5.93, ...],  # length m, ordered by 'column_names'
                'perc_variance': 0.75
            },
            {
                'name': 'PC2',
                'scores': [0.126, 0.975, ...],  # length n
                'projections': [5.93, -5.11, ...],  # length m
                'perc_variance': 0.22
            }
        ]
    }
}
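For readers unfamiliar with how per-sample projections and the fraction of variance per component arise, the following is a generic PCA sketch using NumPy's SVD. It is not the CountPCA implementation: the function name `pca_projections` is hypothetical, and the preprocessing (samples as observations, mean-centering only, no scaling or log transform) is an assumption, since this document does not say how the counts are prepared.

```python
import numpy as np


def pca_projections(matrix):
    """PCA of a features x samples matrix via SVD.

    Returns (projections, perc_variance), where projections has one row
    per sample and one column per principal component.
    """
    X = np.asarray(matrix, dtype=float).T  # samples become rows
    X = X - X.mean(axis=0)                 # center each feature
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    projections = U * S                    # sample coordinates along each PC
    perc_variance = S**2 / np.sum(S**2)    # fraction of variance per PC
    return projections, perc_variance


counts = [
    [1.0, 2.0, 3.0, 4.0],   # feature 1 across 4 samples
    [2.0, 4.0, 6.0, 8.5],   # feature 2
    [0.0, 1.0, 0.0, 1.0],   # feature 3
]
projections, perc_variance = pca_projections(counts)
print(projections.shape, perc_variance)
```

The first column of projections plays the role of the PC1 'projections' array in the JSON above, and perc_variance[0] the role of its 'perc_variance'.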
Command line usage:
Usage:
detk-stats pca [options] <counts_fn>
Options:
-m FN --column-data=FN Column data for annotating PCA results and
plots (experimental)
-f NAME --column-name=NAME Column name from provided column data for
annotating PCA results and plots (experimental)
-o FILE --output=FILE Destination of primary output [default: stdout]
-f FMT --format=FMT Format of output, either csv or table [default: csv]
--json=<json_fn> Name of JSON output file
--html=<html_fn> Name of HTML output file
summary - Common statistics set
de_toolkit.stats.summary(count_mat, bins=20, log=False, density=False)
Compute summary statistics on a counts matrix file.
This is equivalent to running each of these tools separately:
- basestats
- coldist
- colzero
- rowzero
- entropy
- pca
Parameters:
- count_mat (CountMatrix object) – count matrix object
- bins (int) – number of bins, passed to coldist
- log (bool) – perform log10 transform of counts in coldist
- density (bool) – return a density distribution from coldist
Returns: list of DetkModule subclasses for each of the called submodules
Return type: list
Command line usage:
Usage:
detk-stats summary [options] <counts_fn>
Options:
-h --help
--column-data=FN Use column data provided in FN, only used in PCA
--color-col=COLNAME Use column data column COLNAME for coloring output plots
--bins=BINS Number of bins to use for the calculated
distributions [default: 20]
--log log transform count statistics
--density Produce density distribution by dividing each distribution
by the appropriate sum
-o FILE --output=FILE Destination of primary output [default: stdout]
-f FMT --format=FMT Format of output, either csv or table [default: csv]
--json=<json_fn> Name of JSON output file
--html=<html_fn> Name of HTML output file