`stats` - Count Matrix Statistics

Tabular output format
JSON output format
API Documentation

Easy access to informative count matrix statistics. Each of these functions produces three outputs:

a tabular form of the statistics, formatted either as CSV or as a human-readable table using the terminaltables package
a JSON-formatted file containing relevant statistics in a machine-parsable format
a human-friendly HTML page displaying the results

All of the commands accept a single counts file as input, with optional arguments as indicated in the documentation of each subtool. By default, the JSON and HTML output files take the basename of the counts file, with its extension replaced by .json or .html. For example, counts.csv produces counts.json and counts.html in the current directory. These default filenames can be changed for any command with the optional command line arguments --json=<json fn> and --html=<html fn>. If <json fn>, either default or specified, already exists, it is read in, parsed, and added to. The HTML report is overwritten on every invocation using the contents of the JSON file.
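
The default naming rule can be illustrated with a minimal sketch. The helper name `default_report_names` is hypothetical and is not part of de_toolkit; it only mirrors the counts.csv example, and how the tool treats names with several extensions (e.g. counts.tsv.gz) is not stated here.

```python
from pathlib import Path


def default_report_names(counts_fn):
    """Return (json_fn, html_fn) derived from the counts filename.

    Hypothetical helper: drops the directory and the final extension,
    so the reports land in the current directory.
    """
    stem = Path(counts_fn).stem  # "data/counts.csv" -> "counts"
    return f"{stem}.json", f"{stem}.html"


print(default_report_names("data/counts.csv"))
```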

Tabular output format

Each tool prints out the statistics it calculates to standard output by default. The standard output format is comma separated values, e.g.:

$ detk-stats base test_counts.csv
stat,val
num_cols,3
num_rows,4

If desired, the -f table argument may be passed to pretty-print the table instead:

$ detk-stats base -f table test_counts.csv
+base------+-----+
| stat     | val |
+----------+-----+
| num_cols | 3   |
| num_rows | 4   |
+----------+-----+

The summary module is slightly different, as it executes multiple subtools. The CSV output of the summary module adds a line starting with # before each different output:

$ detk-stats summary --bins=2 test_counts.csv
#base
stat,val
num_cols,3
num_rows,4
#coldist
colname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
a,55.0,100.0,2.0,2.0
b,5500.0,10000.0,2.0,2.0
c,550000.0,1000000.0,2.0,2.0
#rowdist
rowname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
gene1,50005.0,100000.0,2.0,1.0

The pretty-printed output simply outputs each table serially.
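
Because each table in the summary CSV is introduced by a line starting with #, the output can be split back into separate tables with the standard library. The following is a sketch only; `split_summary_csv` is a hypothetical helper, not part of de_toolkit, and it leaves all values as strings.

```python
import csv
import io


def split_summary_csv(text):
    """Split summary CSV output into {section name: list of row dicts}."""
    sections = {}
    name, lines = None, []

    def flush():
        # Parse the lines collected for the current section, if any.
        if name is not None:
            sections[name] = list(csv.DictReader(io.StringIO("\n".join(lines))))

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            name, lines = line[1:].strip(), []
        elif line.strip():
            lines.append(line)
    flush()
    return sections


example = """#base
stat,val
num_cols,3
num_rows,4
#coldist
colname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
a,55.0,100.0,2.0,2.0
"""
tables = split_summary_csv(example)
print(tables["base"])
```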

JSON output format

The JSON file produced by these modules is formatted as a JSON array containing objects that each correspond to a stats module. For example:

[
  {
    'name': 'base',
    'stats': {
      'num_cols': 50,
      'num_rows': 27143
    }
  },
  {
    'name': 'coldist',
    'stats': {
        'dists' : [
          {
            'name': 'H_0001',
            'dist': [ [5, 129], [103, 317], ...],
            'percentiles': [ [0, 193], [1, 362], ...],
          },
          {
            'name': 'H_0002',
            'dist': [ [6, 502], [122, 127], ...],
            'percentiles': [ [0, 6000], [1, 6200], ...],
          }
        ]
    }
  },
  {
    'name': 'rowdist',
    'stats': ...
  }
  ...
]

The example above has been pretty-printed for visibility; the actual output is written to a single line. The object format for each module is described in detail below.
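
A consumer of the JSON file will usually want to look modules up by name. The sketch below assumes only the array-of-objects layout described above, with `name` and `stats` keys; `index_modules` is a hypothetical helper, and if a name occurs more than once the last entry wins.

```python
import json


def index_modules(json_text):
    """Map each module's 'name' to its 'stats' object."""
    return {entry["name"]: entry["stats"] for entry in json.loads(json_text)}


# A tiny stand-in for the contents of a real JSON output file.
sample = '[{"name": "base", "stats": {"num_cols": 50, "num_rows": 27143}}]'
stats = index_modules(sample)
print(stats["base"]["num_rows"])
```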

API Documentation

`base` - Basic statistics

class de_toolkit.stats.BaseStats(count_mat)

Basic statistics of the counts file

The most basic statistics of the counts file, including:

number of columns
number of rows

output

Example output:

+basestats-+-----+
| stat     | val |
+----------+-----+
| num_cols | 4   |
| num_rows | 3   |
+----------+-----+

Command line usage:

Usage:
    detk-stats base [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

`coldist` - Column-wise counts distributions

class de_toolkit.stats.ColDist(count_mat, bins=100, log=False, density=False)

Column-wise distribution of counts

Compute the distribution of counts column-wise. Each column is subject to binning by percentile, with output identical to that produced by np.histogram.

Parameters:
count_mat (CountMatrix) – count matrix containing counts
bins (int) – number of bins to use when computing distribution
log (bool) – take the log10 of counts+1 prior to computing distribution
density (bool) – return densities rather than absolute bin counts for the distribution, densities sum to 1

output

Tabular output is a table with four columns per input counts column:

bin start value (column name: sampleA__binstart)
number of features with counts or density in bin (sampleA__bincount)
percentile increment (i.e. 0, 1, etc) (sampleA__pct)
percentile value for corresponding percentile (sampleA__pctVal)

properties

In the properties object, the fields are defined as follows:

dists: Array of objects containing one object for each column, described below.

Each item of dists is an object with the following keys:

name: Column name from original file
dist: Array of (bin start, count) pairs defining the counts histogram
percentiles: Array of (percentile, count) pairs defining the counts percentiles

Example JSON properties output:

{
  'dists' : [
    {
      'name': 'H_0001',
      'dist': [ [5, 129], [103, 317], ...],
      'percentiles': [ [0, 193], [1, 362], ...],
    },
    {
      'name': 'H_0002',
      'dist': [ [6, 502], [122, 127], ...],
      'percentiles': [ [0, 6000], [1, 6200], ...],
    }
  ]
}
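
To make the shape of the dist array concrete, here is a standard-library sketch that produces (bin start, count) pairs for one column. It assumes equal-width bins between the minimum and maximum, which is the default behaviour of np.histogram; it is not the de_toolkit implementation, `histogram_pairs` is a hypothetical name, and edge cases (such as all values being equal) are handled more simply than in NumPy.

```python
def histogram_pairs(values, bins):
    """Return [bin start, count] pairs for equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right, so the maximum lands in it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [[lo + i * width, counts[i]] for i in range(bins)]


column = [0, 1, 2, 3, 10]
print(histogram_pairs(column, bins=2))
```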

Command line usage:

Usage:
    detk-stats coldist [options] <counts_fn>

Options:
    --bins=N               The number of bins to use when computing the counts
                           distribution [default: 20]
    --log                  Perform a log10 transform on the counts before
                           calculating the distribution. Zeros are omitted
                           prior to histogram calculation.
    --density              Return a density distribution instead of counts,
                           such that the sum of values in *dist* for each
                           column approximately sum to 1.
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

`rowdist` - Row-wise counts distributions

class de_toolkit.stats.RowDist(count_obj, bins=100, log=False, density=False)

Row-wise distribution of counts

Identical to coldist except calculated across rows. The name key is rowdist, and the name key of the items in dists is the row name from the counts file.

Parameters:
count_mat (CountMatrix) – count matrix containing counts
bins (int) – number of bins to use when computing distribution
log (bool) – take the log10 of counts prior to computing distribution
density (bool) – return densities rather than absolute bin counts for the distribution, densities sum to 1

output

Tabular output is a table where each row corresponds to a row of the counts matrix, with the row name as the first column. The remaining columns fall into two groups:

the bin start values, named like bin_N, where N is the percentile

the bin count values, named like dist_N, where N is the percentile
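
The bin_N / dist_N naming makes it straightforward to re-pair bin starts with bin counts. The sketch below uses the header and row shown in the summary example earlier on this page; `row_pairs` is a hypothetical helper and assumes every bin_N column has a matching dist_N column.

```python
def row_pairs(header, row):
    """Turn one rowdist CSV row into (row name, [[bin start, count], ...])."""
    record = dict(zip(header, row))
    # Collect the N labels in header order from the bin_N columns.
    labels = [h[len("bin_"):] for h in header if h.startswith("bin_")]
    pairs = [[float(record[f"bin_{n}"]), float(record[f"dist_{n}"])] for n in labels]
    return record[header[0]], pairs


header = ["rowname", "bin_50.0", "bin_100.0", "dist_50.0", "dist_100.0"]
row = ["gene1", "50005.0", "100000.0", "2.0", "1.0"]
print(row_pairs(header, row))
```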

Command line usage:

Usage:
    detk-stats rowdist [options] <counts_fn>

Options:
    --bins=N               The number of bins to use when computing the counts
                           distribution [default: 20]
    --log                  Perform a log10 transform on the counts before calculating
                           the distribution. Zeros are omitted prior to histogram
                           calculation.
    --density              Return a density distribution instead of counts, such that
                           the sum of values in *dist* for each row approximately
                           sum to 1.
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

`colzero` - Column-wise statistics on zero counts

class de_toolkit.stats.ColZero(count_mat)

Column-wise distribution of zero counts

Compute the number and fraction of exact zero counts for each column.

output

Tabular output is a table where each row corresponds to a column with the following fields:

name: Column name
zero_count: Number of zero counts
zero_frac: Fraction of zero counts
mean: Overall mean count
median: Overall median count
nonzero_mean: Mean of non-zero counts only
nonzero_median: Median of non-zero counts only

properties

The stats value is an array containing one object per column as follows:

name: column name
zero_count: absolute count of rows with exactly zero counts
zero_frac: zero_count divided by the number of rows
col_mean: the mean of counts in the column
col_median: the median of counts in the column
nonzero_col_mean: the mean of only the non-zero counts in the column
nonzero_col_median: the median of only the non-zero counts in the column

Example JSON output:

{
  'zeros' : [
    {
      'name': 'col1',
      'zero_count': 20,
      'zero_frac': 0.2,
      'mean': 101.31,
      'median': 31.31,
      'nonzero_mean': 155.23,
      'nonzero_median': 55.18
    },
    {
      'name': 'col2',
      'zero_count': 0,
      'zero_frac': 0,
      'mean': 3021.92,
      'median': 329.23,
      'nonzero_mean': 3021.92,
      'nonzero_median': 329.23
    },
  ]
}
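
The per-column fields can be reproduced for a single column with the standard library, as in the following sketch. It uses the field names from the tabular output; `column_zero_stats` is a hypothetical helper, and returning None when a column has no non-zero counts is an assumption, since the behaviour in that case is not documented here.

```python
from statistics import mean, median


def column_zero_stats(name, counts):
    """Compute zero-count statistics for one column of counts."""
    nonzero = [c for c in counts if c != 0]
    zero_count = len(counts) - len(nonzero)
    return {
        "name": name,
        "zero_count": zero_count,
        "zero_frac": zero_count / len(counts),
        "mean": mean(counts),
        "median": median(counts),
        # Assumption: undefined for an all-zero column, so None is used.
        "nonzero_mean": mean(nonzero) if nonzero else None,
        "nonzero_median": median(nonzero) if nonzero else None,
    }


col_stats = column_zero_stats("col1", [0, 0, 4, 6, 10])
print(col_stats)
```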

Command line usage:

Usage:
    detk-stats colzero [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

`rowzero` - Row-wise statistics on zero counts

class de_toolkit.stats.RowZero(count_mat)

Row-wise distribution of zero counts

Computes statistics about the mean and median counts of rows by the number of zeros.

output

Tabular output is a table where each row corresponds to rows having a given number of zero columns with the following fields:

num_zero: the number of zeros for this row

num_features: the number of features with this number of zeros

feature_frac: the fraction of features with this number of zeros

cum_feature_frac: cumulative fraction of features remaining with this number of zeros or fewer

mean: the mean of the mean counts of genes with this number of zeros

nonzero_mean: the mean of the mean counts of genes with this number of zeros, not including zero counts

median: the median of the median counts of genes with this number of zeros

nonzero_median: the median of the median counts of genes with this number of zeros, not including zero counts

properties

The stats value is an array containing one object per number of zeros as follows:

num_zero: the number of zeros for this group of features
num_features: the number of features with this number of zeros
feature_frac: the fraction of features with this number of zeros
cum_feature_frac: cumulative fraction of features remaining with this number of zeros or fewer
mean: the mean of the mean counts of genes with this number of zeros
nonzero_mean: the mean of the mean counts of genes with this number of zeros, not including zero counts
median: the median of the median counts of genes with this number of zeros
nonzero_median: the median of the median counts of genes with this number of zeros, not including zero counts

Example JSON output:

{
  'zeros' : [
    {
        'num_zeros': 0,
        'num_features': 14031,
        'feature_frac': .61,
        'cum_feature_frac': .61,
        'mean': 3351.13,
        'nonzero_mean': 3351.13,
        'median': 2125.9,
        'nonzero_median': 2125.9
    },
    {
        'num_zeros': 1,
        'num_features': 5031,
        'feature_frac': .21,
        'cum_feature_frac': .82,
        'mean': 3125.91,
        'nonzero_mean': 3295.4,
        'median': 1825.8,
        'nonzero_median': 1976.1
    },
  ]
}
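
The grouping behind this table can be sketched as follows: count the zeros in each row, then tally how many rows share each zero count. This is an illustration of the feature-fraction fields only (the mean and median fields are omitted); `zero_count_table` is a hypothetical helper, and the key name num_zero follows the field list above.

```python
from collections import Counter


def zero_count_table(rows):
    """Group rows by their number of zero counts and report feature fractions."""
    zeros_per_row = Counter(sum(1 for c in row if c == 0) for row in rows)
    total = len(rows)
    table, cumulative = [], 0
    for num_zero in sorted(zeros_per_row):
        num_features = zeros_per_row[num_zero]
        cumulative += num_features  # rows with this many zeros or fewer
        table.append({
            "num_zero": num_zero,
            "num_features": num_features,
            "feature_frac": num_features / total,
            "cum_feature_frac": cumulative / total,
        })
    return table


rows = [[5, 3, 8], [0, 2, 7], [0, 0, 9], [4, 4, 4]]
zero_table = zero_count_table(rows)
for entry in zero_table:
    print(entry)
```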

Command line usage:

Usage:
    detk-stats rowzero [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

`entropy` - Row-wise sample entropy calculation

class de_toolkit.stats.Entropy(count_mat)

Row-wise sample entropy calculation

Sample entropy is a metric that can be used to identify outlier samples by locating rows which are overly influenced by a small number of count values. This metric can be calculated for a single row as follows:

$$p_i = \frac{c_i}{\sum_j c_j}, \qquad \sum_i p_i = 1$$

$$H = -\sum_i p_i \log_2 p_i$$

Here, $c_i$ is the number of counts in sample $i$, $p_i$ is the fraction of reads contributed by sample $i$ to the overall counts of the row, and $H$ is the Shannon entropy of the row when using $\log_2$. The maximum possible value of $H$ is $\log_2 n$ for a row with $n$ samples, reached when every sample contributes equally (for example, 2 when $n = 4$).

A very low $H$ indicates that a row has most of its count mass contained in a small number of columns. These are rows that are likely to drive outliers in downstream analysis, e.g. differential expression.
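
The calculation for a single row can be sketched directly from the definition. `row_entropy` is a hypothetical helper rather than the de_toolkit implementation, and assigning entropy 0 to an all-zero row is an assumption made here to avoid dividing by zero.

```python
import math


def row_entropy(counts):
    """Shannon entropy (base 2) of one row of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an all-zero row is given entropy 0
    entropy = 0.0
    for c in counts:
        if c > 0:  # a zero count contributes nothing (0 * log2(0) taken as 0)
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


print(row_entropy([10, 10, 10, 10]))  # counts spread evenly over four samples
print(row_entropy([0, 0, 0, 40]))     # all counts in a single sample
```

The first row has its counts spread evenly and reaches log2(4) = 2, while the second has all of its mass in one sample and has entropy 0, which is the pattern the module is meant to flag.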

output

Tabular output is a table where each row corresponds to a percentile with the following columns:

pct: percentile of entropy distribution
pctVal: the entropy value for each percentile
num_features: the number of features with entropy in the corresponding percentile
frac_features: the fraction of features with entropy in the corresponding percentile
cum_frac_features: the cumulative fraction of features with entropy in the corresponding percentile, i.e. the fraction of features with pctVal entropy or lower
exemplar_feature: the name of a feature with an entropy in the given percentile

properties

The key entropies contains a single object with the following keys:

pct: percentile of entropy distribution
pctVal: the entropy value for each percentile
num_features: the number of features with entropy in the corresponding percentile
frac_features: the fraction of features with entropy in the corresponding percentile
cum_frac_features: the cumulative fraction of features with entropy in the corresponding percentile, i.e. the fraction of features with pctVal entropy or lower
exemplar_features: an array of objects with an exemplar feature for each percentile, with the following fields:

name: the name of the feature
entropy: the sample entropy of the feature
counts: array of [column name, count] pairs sorted by count ascending

Example JSON output:

{
    'pct': [0, 1, 2, 3, ...],
    'pctVal': [0, 0.1, 0.5, 0.9, ...],
    'num_features': [10, 12, 23, 100, ...],
    'frac_features': [0.001, 0.0012, 0.0023, 0.01, ...],
    'cum_frac_features': [0.001, 0.0022, 0.0045, 0.0145, ...],
    'exemplar_features': [
        {
            'name': 'ENSG0000055095.1',
            'entropy': 0,
            'counts': [ ['sampleA', 0], ['sampleB',0], ..., ['sampleN',1]]
        },
        {
            'name': 'ENSG0000398715.1',
            'entropy': 0.11,
            'counts': [ ['sampleA', 0], ['sampleB',0], ..., ['sampleM',5]]
        }
    ]
}

Command line usage:

Usage:
    detk-stats [options] entropy <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

`pca` - Principal component analysis

class de_toolkit.stats.CountPCA(count_mat)

Principal component analysis of the counts matrix.

This module performs PCA on a provided counts matrix and returns the principal component weights, scores, and variances. In addition, the weights and scores for each individual component can be combined to define the projection of each sample along that component.

The PCA module can also use a counts matrix that has associated column data information about the samples in each column. The user can specify some of these columns to include as variables for plotting purposes. The idea is that columns labeled with the same class will be colored according to their class, such that separations in the data can be more easily observed when projections are plotted.

output

Tabular output is a table where each row corresponds to a column in the counts matrix with the following fields:

name: name of the column for the row
PCX_YY: projection onto principal component X (e.g. 1), which explains YY percent of the variance, for each column

properties

Example JSON output:

{
    'name': 'pca',
    'stats': {
        'column_names': ['sample1','sample2',...],
        'column_variables': {
            'sample_names': ['sample1','sample2',...],
            'columns': [
                {
                    'column':'status',
                    'values':['disease','control',...]
                },
                {
                    'column':'batch',
                    'values':['b1','b1',...]
                }
            ]
        },
        'components': [
            {
                'name': 'PC1',
                'scores': [0.126,0.975,...], # length n
                'projections': [-8.01,5.93,...], # length m, ordered by 'column_names'
                'perc_variance': 0.75
            },
            {
                'name': 'PC2',
                'scores' : [0.126,0.975,...], # length n
                'projections': [5.93,-5.11,...], # length m
                'perc_variance': 0.22
            }
        ]
    }
}
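
Since projections are ordered by column_names, the two arrays can be zipped to recover a per-sample value for any component, for example before plotting. This is a sketch against the JSON layout shown above; `projections_by_sample` is a hypothetical helper, and the small dictionary stands in for the stats object of a real output file.

```python
def projections_by_sample(pca_stats, component):
    """Pair each column name with its projection on the named component."""
    for comp in pca_stats["components"]:
        if comp["name"] == component:
            return dict(zip(pca_stats["column_names"], comp["projections"]))
    raise KeyError(component)


# Stand-in for the 'stats' object of the pca module.
pca_stats = {
    "column_names": ["sample1", "sample2"],
    "components": [
        {"name": "PC1", "projections": [-8.01, 5.93], "perc_variance": 0.75},
        {"name": "PC2", "projections": [5.93, -5.11], "perc_variance": 0.22},
    ],
}
print(projections_by_sample(pca_stats, "PC1"))
```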

Command line usage:

Usage:
    detk-stats pca [options] <counts_fn>

Options:
    -m FN --column-data=FN      Column data for annotating PCA results and
                                plots (experimental)
    -f NAME --column-name=NAME  Column name from provided column data for
                                annotation PCA results and plots (experimental)
    -o FILE --output=FILE       Destination of primary output [default: stdout]
    -f FMT --format=FMT         Format of output, either csv or table [default: csv]
    --json=<json_fn>            Name of JSON output file
    --html=<html_fn>            Name of HTML output file

`summary` - Common statistics set

de_toolkit.stats.summary(count_mat, bins=20, log=False, density=False)

Compute summary statistics on a counts matrix file.

This is equivalent to running each of these tools separately:

basestats
coldist
colzero
rowzero
entropy
pca

Parameters:
count_mat (CountMatrix object) – count matrix object
bins (int) – number of bins, passed to coldist
log (bool) – perform log10 transform of counts in coldist
density (bool) – return a density distribution from coldist

Returns: list of DetkModule subclasses for each of the called submodules
Return type: list

Command line usage:

Usage:
    detk-stats summary [options] <counts_fn>

Options:
    -h --help
    --column-data=FN       Use column data provided in FN, only used in PCA
    --color-col=COLNAME    Use column data column COLNAME for coloring output plots
    --bins=BINS            Number of bins to use for the calculated
                           distributions [default: 20]
    --log                  log transform count statistics
    --density              Produce density distribution by dividing each distribution
                           by the appropriate sum
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file
