stats - Count Matrix Statistics

Easy access to informative count matrix statistics. Each of these functions produces three outputs:

  • a tabular form of the statistics, formatted either as CSV or as a human-readable table using the terminaltables package
  • a JSON-formatted file containing relevant statistics in a machine-parsable format
  • a human-friendly HTML page displaying the results

All of the commands accept a single counts file as input, with optional arguments as indicated in the documentation of each subtool. By default, the JSON and HTML output files take the basename of the counts file, with its extension replaced by .json or .html as appropriate. E.g., counts.csv will produce counts.json and counts.html in the current directory. These default filenames can be changed for all commands using the optional command line arguments --json=<json fn> and --html=<html fn>. If <json fn>, either default or specified, already exists, it is read in, parsed, and added to. The HTML report is overwritten on every invocation using the contents of the JSON file.
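The default naming rule can be sketched in a few lines of Python. This is only an illustration of the rule as described in this section, not the tool's own code; the helper name default_report_names is hypothetical.

```python
from pathlib import Path


def default_report_names(counts_fn):
    """Return (json_name, html_name) for a counts file.

    Assumption: the basename is taken without its final extension and the
    reports land in the current directory, as described in the text.
    """
    stem = Path(counts_fn).stem  # "data/counts.csv" -> "counts"
    return f"{stem}.json", f"{stem}.html"


print(default_report_names("counts.csv"))
```

Passing --json or --html on the command line replaces the corresponding default name.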

Tabular output format

Each tool prints the statistics it calculates to standard output by default. The default output format is comma-separated values, e.g.:

$ detk-stats base test_counts.csv
stat,val
num_cols,3
num_rows,4

If desired, the -f table argument may be passed to pretty-print the table instead:

$ detk-stats base -f table test_counts.csv
+base------+-----+
| stat     | val |
+----------+-----+
| num_cols | 3   |
| num_rows | 4   |
+----------+-----+

The summary module is slightly different, as it executes multiple subtools. The CSV output of the summary module adds a line starting with # before each different output:

$ detk-stats summary --bins=2 test_counts.csv
#base
stat,val
num_cols,3
num_rows,4
#coldist
colname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
a,55.0,100.0,2.0,2.0
b,5500.0,10000.0,2.0,2.0
c,550000.0,1000000.0,2.0,2.0
#rowdist
rowname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
gene1,50005.0,100000.0,2.0,1.0

The pretty-printed output simply prints each table in sequence.
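Because each table in the summary CSV is preceded by a # line, the output can be split back into separate tables with the standard library. The following is a minimal sketch under that assumption; split_summary_csv is a hypothetical helper, not part of de_toolkit.

```python
import csv
import io


def split_summary_csv(text):
    """Split summary CSV text into {section name: list of row dicts}."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            current = line[1:].strip()  # "#base" -> "base"
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    # Parse each section separately, since the header row differs per section.
    return {
        name: list(csv.DictReader(io.StringIO("\n".join(lines))))
        for name, lines in sections.items()
    }


example = """#base
stat,val
num_cols,3
num_rows,4
#rowdist
rowname,bin_50.0,bin_100.0,dist_50.0,dist_100.0
gene1,50005.0,100000.0,2.0,1.0
"""

tables = split_summary_csv(example)
print(tables["base"])
```

All values are returned as strings; convert them to numbers as needed.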

JSON output format

The JSON file produced by these modules is formatted as a JSON array containing objects that each correspond to a stats module. For example:

[
  {
    'name': 'base',
    'stats': {
      'num_cols': 50,
      'num_rows': 27143
    }
  },
  {
    'name': 'coldist',
    'stats': {
        'dists' : [
          {
            'name': 'H_0001',
            'dist': [ [5, 129], [103, 317], ...],
            'percentiles': [ [0, 193], [1, 362], ...],
          },
          {
            'name': 'H_0002',
            'dist': [ [6, 502], [122, 127], ...],
            'percentiles': [ [0, 6000], [1, 6200], ...],
          }
        ]
    }
  },
  {
    'name': 'rowdist',
    'stats': ...
  }
  ...
]

The example above has been pretty-printed for visibility; the actual output is written to a single line. The object format for each module is described in detail below.
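Since the file is an array of objects keyed by name, a consumer will usually want to look up one module's statistics. A minimal sketch follows, assuming the file on disk is standard JSON with double-quoted strings (the example above uses single quotes for display); stats_by_module is a hypothetical helper.

```python
import json


def stats_by_module(json_text):
    """Index the top-level array by each object's 'name' key."""
    return {entry["name"]: entry["stats"] for entry in json.loads(json_text)}


# A one-module report, written on a single line as the tool is described to do.
report = '[{"name": "base", "stats": {"num_cols": 50, "num_rows": 27143}}]'
modules = stats_by_module(report)
print(modules["base"]["num_rows"])
```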

API Documentation

base - Basic statistics

class de_toolkit.stats.BaseStats(count_mat)[source]

Basic statistics of the counts file

The most basic statistics of the counts file, including:

  • number of columns
  • number of rows

output

Example output:

+basestats-+-----+
| stat     | val |
+----------+-----+
| num_cols | 4   |
| num_rows | 3   |
+----------+-----+

Command line usage:

Usage:
    detk-stats base [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

coldist - Column-wise counts distributions

class de_toolkit.stats.ColDist(count_mat, bins=100, log=False, density=False)[source]

Column-wise distribution of counts

Compute the distribution of counts column-wise. Each column is subject to binning by percentile, with output identical to that produced by np.histogram.

Parameters:
  • count_mat (CountMatrix) – count matrix containing counts
  • bins (int) – number of bins to use when computing distribution
  • log (bool) – take the log10 of counts+1 prior to computing distribution
  • density (bool) – return densities rather than absolute bin counts for the distribution, densities sum to 1

output

Tabular output is a table with four columns per input counts column:

  • bin start value (column name: sampleA__binstart)
  • number of features with counts or density in bin (sampleA__bincount)
  • percentile increment (i.e. 0, 1, etc) (sampleA__pct)
  • percentile value for corresponding percentile (sampleA__pctVal)

properties

In the properties object, the fields are defined as follows:

dists
Array of objects containing one object for each column, described below.

Each item of dists is an object with the following keys:

name
Column name from original file
dist
Array of (bin start, count) pairs defining the counts histogram
percentiles
Array of (percentile, count) pairs defining the counts percentiles

Example JSON properties output:

{
  'dists' : [
    {
      'name': 'H_0001',
      'dist': [ [5, 129], [103, 317], ...],
      'percentiles': [ [0, 193], [1, 362], ...],
    },
    {
      'name': 'H_0002',
      'dist': [ [6, 502], [122, 127], ...],
      'percentiles': [ [0, 6000], [1, 6200], ...],
    }
  ]
}
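The (bin start, count) pairs in dist can be illustrated with a small pure-Python histogram. This is a sketch only: it assumes equal-width bins spanning the minimum to the maximum value, as np.histogram does by default, and it is not the module's own implementation. The function name histogram_pairs is hypothetical.

```python
def histogram_pairs(values, bins=20, density=False):
    """Return [bin start, count] pairs over equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The top edge belongs to the last bin, as in np.histogram.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    if density:
        # Assumption: "density" means bin counts rescaled so that they sum to 1.
        total = sum(counts)
        counts = [c / total for c in counts]
    return [[lo + i * width, c] for i, c in enumerate(counts)]


print(histogram_pairs([1, 2, 3, 4, 10], bins=3))
# prints [[1.0, 3], [4.0, 1], [7.0, 1]]
```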

Command line usage:

Usage:
    detk-stats coldist [options] <counts_fn>

Options:
    --bins=N               The number of bins to use when computing the counts
                           distribution [default: 20]
    --log                  Perform a log10 transform on the counts before
                           calculating the distribution. Zeros are omitted
                           prior to histogram calculation.
    --density              Return a density distribution instead of counts,
                           such that the sum of values in *dist* for each
                           column approximately sum to 1.
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

rowdist - Row-wise counts distributions

class de_toolkit.stats.RowDist(count_obj, bins=100, log=False, density=False)[source]

Row-wise distribution of counts

Identical to coldist except calculated across rows. The name key is rowdist, and the name key of the items in dists is the row name from the counts file.

Parameters:
  • count_mat (CountMatrix) – count matrix containing counts
  • bins (int) – number of bins to use when computing distribution
  • log (bool) – take the log10 of counts prior to computing distribution
  • density (bool) – return densities rather than absolute bin counts for the distribution, densities sum to 1

output

Tabular output is a table where each row corresponds to a row of the counts matrix, with the row name as the first column. The remaining columns are broken into two parts:

  • the bin start values, named like bin_N, where N is the percentile
  • the bin count values, named like dist_N, where N is the percentile
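The bin_N and dist_N columns share the same N suffix, so they can be paired back up per row. Below is a minimal sketch using the rowdist lines from the summary example earlier in this document; row_histograms is a hypothetical helper, and the rowname header is taken from that example.

```python
import csv
import io


def row_histograms(csv_text):
    """Map each row name to a list of (bin value, count) pairs."""
    result = {}
    for record in csv.DictReader(io.StringIO(csv_text)):
        # The suffix N in bin_N / dist_N links a bin to its count.
        suffixes = [key[len("bin_"):] for key in record if key.startswith("bin_")]
        result[record["rowname"]] = [
            (float(record[f"bin_{s}"]), float(record[f"dist_{s}"]))
            for s in suffixes
        ]
    return result


text = (
    "rowname,bin_50.0,bin_100.0,dist_50.0,dist_100.0\n"
    "gene1,50005.0,100000.0,2.0,1.0\n"
)
print(row_histograms(text))
# prints {'gene1': [(50005.0, 2.0), (100000.0, 1.0)]}
```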

Command line usage:

Usage:
    detk-stats rowdist [options] <counts_fn>

Options:
    --bins=N               The number of bins to use when computing the counts
                           distribution [default: 20]
    --log                  Perform a log10 transform on the counts before calculating
                           the distribution. Zeros are omitted prior to histogram
                           calculation.
    --density              Return a density distribution instead of counts, such that
                           the sum of values in *dist* for each row approximately
                           sum to 1.
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

colzero - Column-wise statistics on zero counts

class de_toolkit.stats.ColZero(count_mat)[source]

Column-wise distribution of zero counts

Compute the number and fraction of exact zero counts for each column.

output

Tabular output is a table where each row corresponds to a column with the following fields:

  • name: Column name
  • zero_count: Number of zero counts
  • zero_frac: Fraction of zero counts
  • mean: Overall mean count
  • median: Overall median count
  • nonzero_mean: Mean of non-zero counts only
  • nonzero_median: Median of non-zero counts only

properties

The stats value is an array containing one object per column as follows:

name
column name
zero_count
absolute count of rows with exactly zero counts
zero_frac
zero_count divided by the number of rows
col_mean
the mean of counts in the column
col_median
the median of counts in the column
nonzero_col_mean
the mean of only the non-zero counts in the column
nonzero_col_median
the median of only the non-zero counts in the column

Example JSON output:

{
  'zeros' : [
    {
      'name': 'col1',
      'zero_count': 20,
      'zero_frac': 0.2,
      'mean': 101.31,
      'median': 31.31,
      'nonzero_mean': 155.23,
      'nonzero_median': 55.18
    },
    {
      'name': 'col2',
      'zero_count': 0,
      'zero_frac': 0,
      'mean': 3021.92,
      'median': 329.23,
      'nonzero_mean': 3021.92,
      'nonzero_median': 329.23
    },
  ]
}
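The per-column fields can be reproduced for a single column with the standard library. This is a sketch of the definitions given above, using the key names from the example JSON; the handling of an all-zero column is an assumption, and column_zero_stats is a hypothetical helper.

```python
from statistics import mean, median


def column_zero_stats(name, counts):
    """Compute zero-count statistics for one column of counts."""
    nonzero = [c for c in counts if c != 0]
    zero_count = len(counts) - len(nonzero)
    return {
        "name": name,
        "zero_count": zero_count,
        "zero_frac": zero_count / len(counts),
        "mean": mean(counts),
        "median": median(counts),
        # Assumption: an all-zero column reports None for the non-zero statistics.
        "nonzero_mean": mean(nonzero) if nonzero else None,
        "nonzero_median": median(nonzero) if nonzero else None,
    }


print(column_zero_stats("col1", [0, 0, 4, 6, 10]))
```

Note that a column with no zeros has identical values for median and nonzero_median, and likewise for mean and nonzero_mean.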

Command line usage:

Usage:
    detk-stats colzero [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

rowzero - Row-wise statistics on zero counts

class de_toolkit.stats.RowZero(count_mat)[source]

Row-wise distribution of zero counts

Computes statistics about the mean and median counts of rows, grouped by the number of zero counts in each row.

output

Tabular output is a table where each row corresponds to rows having a given number of zero columns with the following fields:

  • num_zero: the number of zeros for this row
  • num_features: the number of features with this number of zeros
  • feature_frac: the fraction of features with this number of zeros
  • cum_feature_frac: cumulative fraction of features remaining with this number of zeros or fewer
  • mean: the mean of the per-gene mean counts for genes with this number of zeros
  • nonzero_mean: the mean of the per-gene mean counts for genes with this number of zeros, not including zero counts
  • median: the median of the per-gene median counts for genes with this number of zeros
  • nonzero_median: the median of the per-gene median counts for genes with this number of zeros, not including zero counts

properties

The stats value is an array containing one object per number of zeros as follows:

num_zero
the number of zeros for this group of features
num_features
the number of features with this number of zeros
feature_frac
the fraction of features with this number of zeros
cum_feature_frac
cumulative fraction of features remaining with this number of zeros or fewer
mean
the mean of the per-gene mean counts for genes with this number of zeros
nonzero_mean
the mean of the per-gene mean counts for genes with this number of zeros, not including zero counts
median
the median of the per-gene median counts for genes with this number of zeros
nonzero_median
the median of the per-gene median counts for genes with this number of zeros, not including zero counts

Example JSON output:

{
  'zeros' : [
    {
        'num_zeros': 0,
        'num_features': 14031,
        'feature_frac': .61,
        'cum_feature_frac': .61,
        'mean': 3351.13,
        'nonzero_mean': 3351.13,
        'median': 2125.9,
        'nonzero_median': 2125.9
    },
    {
        'num_zeros': 1,
        'num_features': 5031,
        'feature_frac': .21,
        'cum_feature_frac': .82,
        'mean': 3125.91,
        'nonzero_mean': 3295.4,
        'median': 1825.8,
        'nonzero_median': 1976.1
    },
  ]
}
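The grouping by number of zeros can be sketched as follows. The sketch assumes that mean is the mean of per-feature means and median is the median of per-feature medians, which is one reading of the field descriptions above; the nonzero variants are omitted for brevity. The function name row_zero_stats is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def row_zero_stats(rows):
    """rows: mapping of feature name -> list of counts, one per column."""
    groups = defaultdict(list)
    for counts in rows.values():
        groups[sum(1 for c in counts if c == 0)].append(counts)

    total = len(rows)
    cumulative = 0.0
    stats = []
    for num_zero in sorted(groups):
        members = groups[num_zero]
        frac = len(members) / total
        cumulative += frac
        stats.append({
            "num_zero": num_zero,
            "num_features": len(members),
            "feature_frac": frac,
            "cum_feature_frac": cumulative,
            # Assumed reading: summarize each feature first, then the group.
            "mean": mean([mean(c) for c in members]),
            "median": median([median(c) for c in members]),
        })
    return stats


rows = {"g1": [1, 2, 3], "g2": [0, 4, 8], "g3": [0, 2, 2], "g4": [5, 5, 5]}
for entry in row_zero_stats(rows):
    print(entry)
```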

Command line usage:

Usage:
    detk-stats rowzero [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

entropy - Row-wise sample entropy calculation

class de_toolkit.stats.Entropy(count_mat)[source]

Row-wise sample entropy calculation

Sample entropy is a metric that can be used to identify outlier samples by locating rows which are overly influenced by a small number of count values. This metric can be calculated for a single row as follows:

p_i = c_i / sum_j(c_j)
sum_i(p_i) = 1
H = -sum_i(p_i * log2(p_i))

Here, c_i is the number of counts in sample i, p_i is the fraction of reads contributed by sample i to the overall counts of the row, and H is the Shannon entropy of the row when using log2. The maximum value possible for H is log2(n), where n is the number of samples; e.g., H is at most 2 for a row with four samples.

A very low H indicates that a row has most of its count mass concentrated in a small number of columns. These are rows that are likely to drive outliers in downstream analysis, e.g. differential expression.
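The calculation for a single row can be written directly from the formula. This is a minimal sketch rather than the module's own code; treating an all-zero row as having zero entropy is an assumption, and row_entropy is a hypothetical name.

```python
import math


def row_entropy(counts):
    """Shannon entropy (log2) of the per-sample count fractions of one row."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an all-zero row is assigned zero entropy
    entropy = 0.0
    for c in counts:
        if c > 0:  # a zero count contributes nothing (0 * log2(0) taken as 0)
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


print(row_entropy([5, 5, 5, 5]))   # prints 2.0
print(row_entropy([0, 0, 0, 20]))  # prints 0.0
```

With four samples, a row spread evenly across all columns reaches the maximum of log2(4) = 2, while a row whose counts all come from one sample has entropy 0.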

output

Tabular output is a table where each row corresponds to a percentile with the following columns:

pct
percentile of entropy distribution
pctVal
the entropy value for each percentile
num_features
the number of features with entropy in the corresponding percentile
frac_features
the fraction of features with entropy in the corresponding percentile
cum_frac_features
the cumulative fraction of features with entropy in the corresponding percentile, i.e. the fraction of features with pctVal entropy or lower
exemplar_feature
the name of a feature with an entropy in the given percentile

properties

The key entropies contains a single object with the following keys:

pct
percentile of entropy distribution
pctVal
the entropy value for each percentile
num_features
the number of features with entropy in the corresponding percentile
frac_features
the fraction of features with entropy in the corresponding percentile
cum_frac_features
the cumulative fraction of features with entropy in the corresponding percentile, i.e. the fraction of features with pctVal entropy or lower
exemplar_features

an array of objects with an exemplar feature for each percentile with the following fields:

name
the name of the feature
entropy
the sample entropy of the feature
counts
array of [column name, count] pairs sorted by count ascending

Example JSON output:

{
    'pct': [0, 1, 2, 3, ...],
    'pctVal': [0, 0.1, 0.5, 0.9, ...],
    'num_features': [10, 12, 23, 100, ...],
    'frac_features': [0.001, 0.0012, 0.0023, 0.01, ...],
    'cum_frac_features': [0.001, 0.0022, 0.0045, 0.0145, ...],
    'exemplar_features': [
        {
            'name': 'ENSG0000055095.1',
            'entropy': 0,
            'counts': [ ['sampleA', 0], ['sampleB',0], ..., ['sampleN',1]]
        },
        {
            'name': 'ENSG0000398715.1',
            'entropy': 0.11,
            'counts': [ ['sampleA', 0], ['sampleB',0], ..., ['sampleM',5]]
        }
    ]
}

Command line usage:

Usage:
    detk-stats [options] entropy <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file

pca - Principal component analysis

class de_toolkit.stats.CountPCA(count_mat)[source]

Principal component analysis of the counts matrix.

This module performs PCA on a provided counts matrix and returns the principal component weights, scores, and variances. In addition, the weights and scores for each individual component can be combined to define the projection of each sample along that component.

The PCA module can also use a counts matrix that has associated column data information about the samples in each column. The user can specify some of these columns to include as variables for plotting purposes. The idea is that columns labeled with the same class will be colored according to their class, such that separations in the data can be more easily observed when projections are plotted.

output

Tabular output is a table where each row corresponds to a column in the counts matrix with the following fields:

name
name of the column for the row
PCX_YY
projections of principal component X (e.g. 1) that explains YY percent of the variance for each column

properties

Example JSON output:

{
    'name': 'pca',
    'stats': {
        'column_names': ['sample1','sample2',...],
        'column_variables': {
            'sample_names': ['sample1','sample2',...],
            'columns': [
                {
                    'column':'status',
                    'values':['disease','control',...]
                },
                {
                    'column':'batch',
                    'values':['b1','b1',...]
                }
            ]
        },
        'components': [
            {
                'name': 'PC1',
                'scores': [0.126,0.975,...], # length n
                'projections': [-8.01,5.93,...], # length m, ordered by 'column_names'
                'perc_variance': 0.75
            },
            {
                'name': 'PC2',
                'scores' : [0.126,0.975,...], # length n
                'projections': [5.93,-5.11,...], # length m
                'perc_variance': 0.22
            }
        ]
    }
}
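Because projections are ordered by column_names, the two arrays can be zipped to recover a per-sample value for any component. This is a small sketch that assumes the stats object has already been loaded into a Python dict with the keys shown in the example; projections_by_sample is a hypothetical helper.

```python
def projections_by_sample(pca_stats, component="PC1"):
    """Return {column name: projection} for one component of the 'stats' object."""
    for comp in pca_stats["components"]:
        if comp["name"] == component:
            # 'projections' follows the order of 'column_names'.
            return dict(zip(pca_stats["column_names"], comp["projections"]))
    raise KeyError(component)


stats = {
    "column_names": ["sample1", "sample2"],
    "components": [
        {"name": "PC1", "projections": [-8.01, 5.93], "perc_variance": 0.75},
        {"name": "PC2", "projections": [5.93, -5.11], "perc_variance": 0.22},
    ],
}
print(projections_by_sample(stats, "PC1"))
# prints {'sample1': -8.01, 'sample2': 5.93}
```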

Command line usage:

Usage:
    detk-stats pca [options] <counts_fn>

Options:
    -m FN --column-data=FN      Column data for annotating PCA results and
                                plots (experimental)
    -f NAME --column-name=NAME  Column name from provided column data for
                                annotation PCA results and plots (experimental)
    -o FILE --output=FILE       Destination of primary output [default: stdout]
    -f FMT --format=FMT         Format of output, either csv or table [default: csv]
    --json=<json_fn>            Name of JSON output file
    --html=<html_fn>            Name of HTML output file

summary - Common statistics set

de_toolkit.stats.summary(count_mat, bins=20, log=False, density=False)[source]

Compute summary statistics on a counts matrix file.

This is equivalent to running each of these tools separately:

  • basestats
  • coldist
  • colzero
  • rowzero
  • entropy
  • pca

Parameters:
  • count_mat (CountMatrix object) – count matrix object
  • bins (int) – number of bins, passed to coldist
  • log (bool) – perform log10 transform of counts in coldist
  • density (bool) – return a density distribution from coldist
Returns:

list of DetkModule subclasses for each of the called submodules

Return type:

list

Command line usage:

Usage:
    detk-stats summary [options] <counts_fn>

Options:
    -h --help
    --column-data=FN       Use column data provided in FN, only used in PCA
    --color-col=COLNAME    Use column data column COLNAME for coloring output plots
    --bins=BINS            Number of bins to use for the calculated
                           distributions [default: 20]
    --log                  log transform count statistics
    --density              Produce density distribution by dividing each distribution
                           by the appropriate sum
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file