
r'''
Easy access to informative count matrix statistics. Each of these functions
produces two outputs:

- a json formatted file containing relevant statistics in a machine-parsable
  format
- an optional human-friendly HTML page displaying the results

All of the commands accept a single counts file as input, with optional
arguments as indicated in the documentation. By default, the JSON and HTML
output files have the same basename as the counts file, with the extension
replaced by .json or .html as appropriate. E.g., counts.csv produces
counts.json and counts.html in the current directory. These default filenames
can be changed for all commands using the optional command line arguments
--json=<json fn> and --html=<html fn>. If <json fn>, either default or
specified, already exists, it is read in, parsed, and added to. The HTML
report is overwritten on every invocation using the contents of the JSON file.

Usage:
    detk-stats summary [options] <counts_fn>
    detk-stats basestats [options] <counts_fn>
    detk-stats coldist [options] <counts_fn>
    detk-stats rowdist [options] <counts_fn>
    detk-stats colzero [options] <counts_fn>
    detk-stats rowzero [options] <counts_fn>
    detk-stats entropy [options] <counts_fn>
    detk-stats pca [options] <counts_fn>

Options:
    -h --help       Access detailed help for individual commands
'''
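
# Example invocations (file names are illustrative):
#
#   detk-stats summary counts.csv
#   detk-stats coldist --bins=50 --log counts.csv
#   detk-stats pca --column-data=samples.csv counts.csv --json=pca.json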

cmd_opts = {
    'summary':r'''
Compute summary statistics on a counts matrix file.

This is equivalent to running each of these tools separately:

- basestats
- coldist
- colzero
- rowzero
- entropy
- pca

Usage:
    detk-stats summary [options] <counts_fn>

Options:
    -h --help
    --column-data=FN       Use column data provided in FN, only used in PCA
    --color-col=COLNAME    Use column data column COLNAME for coloring output plots
    --bins=BINS            Number of bins to use for the calculated
                           distributions [default: 20]
    --log                  Log transform counts before computing statistics
    --density              Produce a density distribution by dividing each
                           distribution by the appropriate sum
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file


''',
    'basestats':r'''
Calculate basic statistics of the counts file, including:
    number of samples
    number of rows

Usage:
    detk-stats basestats [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file
''',
    'coldist':r'''
Column-wise distribution of counts

Compute the distribution of counts column-wise. Each column is subject to
binning by percentile, with output identical to that produced by np.histogram.

In the stats object, the fields are defined as follows:
    pct
        The percentiles of the distributions in the range 0 < pct < 100, by
        default in increments of 5. This defines the length of the dist and
        bins arrays in each of the objects for each sample.
    dists
        Array of objects containing one object for each column, described below.
    Each item of dists is an object with the following keys:
        name
            Column name from original file
        dist
            Array of raw or normalized counts in each bin according to the
            percentiles from pct
        bins
            Array of the bin boundary values for the distribution. Should
            be of length len(dist)+1. These are what would be the x-axis
            labels if this was plotted as a histogram.
        extrema
            Object with two keys, lower and upper, that contain the literal
            count values for counts that fall more than 1.5*IQR (the
            interquartile range) below the first or above the third quartile
            of the distribution. These could be marked as outliers in a
            boxplot, for example.

Usage:
    detk-stats coldist [options] <counts_fn>

Options:
    --bins=N               The number of bins to use when computing the counts
                           distribution [default: 20]
    --log                  Perform a log10 transform on counts+1 before
                           calculating the distribution
    --density              Return a density distribution instead of counts,
                           such that the sum of values in *dist* for each
                           column approximately sum to 1.
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file
''',
    'rowdist':r'''
Row-wise distribution of counts

Compute the distribution of counts row-wise. Each row is subject to binning by
percentile, with output identical to that produced by np.histogram.

In the stats object, the fields are defined as follows:
    pct
        The percentiles of the distributions in the range 0 < pct < 100, by
        default in increments of 5. This defines the length of the dist and
        bins arrays in each of the objects for each row.
    dists
        Array of objects containing one object for each row, described
        below.
    Each item of dists is an object with the following keys:
        name
            Row name from original file
        dist
            Array of raw or normalized counts in each bin according to the
            percentiles from pct
        bins
            Array of the bin boundary values for the distribution. Should
            be of length len(dist)+1. These are what would be the x-axis
            labels if this was plotted as a histogram.
        extrema
            Object with two keys, lower and upper, that contain the literal
            count values for counts that fall more than 1.5*IQR (the
            interquartile range) below the first or above the third quartile
            of the distribution. These could be marked as outliers in a
            boxplot, for example.

Usage:
    detk-stats rowdist [options] <counts_fn>

Options:
    --bins=N               The number of bins to use when computing the counts
                           distribution [default: 20]
    --log                  Perform a log10 transform on the counts before calculating
                           the distribution. Zeros are omitted prior to histogram
                           calculation.
    --density              Return a density distribution instead of counts, such that
                           the sum of values in *dist* for each row approximately
                           sum to 1.
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file
''',
    'colzero':r'''
Column-wise distribution of zero counts

Compute the number and fraction of exact zero counts for each column.
The stats value is an array containing one object per column as follows:
    name
        column name
    zero_count
        absolute count of rows with exactly zero counts
    zero_frac
        zero_count divided by the number of rows
    mean
        the mean of counts in the column
    median
        the median of counts in the column
    nonzero_mean
        the mean of only the non-zero counts in the column
    nonzero_median
        the median of only the non-zero counts in the column

Usage:
    detk-stats colzero [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file
''',
    'rowzero':r'''
Row-wise distribution of zero counts

Group rows by their number of zero-count columns and compute statistics on
each group. The stats value is an array containing one object per number of
zeros as follows:
    num_zeros
        the number of zero columns for this group of rows
    num_features
        the number of rows with this number of zeros
    feature_frac
        num_features divided by the total number of rows
    cum_feature_frac
        cumulative fraction of rows with this number of zeros or fewer
    mean
        the mean of the row count means in the group
    nonzero_mean
        the mean of the row count means in the group, excluding zero counts
    median
        the median of the row count medians in the group
    nonzero_median
        the median of the row count medians in the group, excluding zero
        counts

Usage:
    detk-stats rowzero [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file
''',
    'entropy':r'''
Row-wise sample entropy calculation

Sample entropy is a metric that can be used to identify outlier samples by
locating rows which are overly influenced by a small number of count values.
This metric can be calculated for a single row as follows:

    p_i = c_i/sum_j(c_j)
    sum_i(p_i) = 1
    H = -sum_i(p_i*log2(p_i))

Here, c_i is the number of counts in sample i, p_i is the fraction of reads
contributed by sample i to the overall counts of the row, and H is the Shannon
entropy of the row when using log2. The maximum possible value of H is
log2(m), where m is the number of columns.

Rows with a very low H have most of their count mass contained in a small
number of columns. These are rows that are likely to drive outliers in
downstream analysis, e.g. differential expression.

The key entropies contains a single object with the following keys:
    pct
        percentiles of the entropy distribution
    pctVal
        the entropy value at each percentile
    num_features
        the number of features with entropy in each percentile bin
    frac_features
        the fraction of features with entropy in each percentile bin
    cum_frac_features
        the cumulative fraction of features with entropy at or below each
        percentile value
    exemplar_features
        an exemplar feature for each percentile bin

Usage:
    detk-stats entropy [options] <counts_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]
    -f FMT --format=FMT    Format of output, either csv or table [default: csv]
    --json=<json_fn>       Name of JSON output file
    --html=<html_fn>       Name of HTML output file
''',
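    # Worked example (illustrative): for a row with counts [8, 1, 1] across
    # m = 3 columns, p = [0.8, 0.1, 0.1] and
    # H = -(0.8*log2(0.8) + 2*0.1*log2(0.1)) ~= 0.922 bits, well below the
    # maximum log2(3) ~= 1.585 bits attained by perfectly even counts.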
    'pca':r'''
Principal component analysis of the counts matrix.

This module performs PCA on a provided counts matrix and returns the principal
component weights, scores, and variances. In addition, the weights and scores
for each individual component can be combined to define the projection of each
sample along that component.

The PCA module can also accept a metadata file that contains information about
the samples in each column. The user can specify some of these columns to
include as variables for plotting purposes. The idea is that samples labeled
with the same class will be colored according to their class, such that
separations in the data can be more easily observed when projections are
plotted.

Usage:
    detk-stats pca [options] <counts_fn>

Options:
    -m FN --column-data=FN      Column data for annotating PCA results and
                                plots (experimental)
    --column-name=NAME          Column name from provided column data for
                                annotating PCA results and plots (experimental)
    -o FILE --output=FILE       Destination of primary output [default: stdout]
    -f FMT --format=FMT         Format of output, either csv or table [default: csv]
    --json=<json_fn>            Name of JSON output file
    --html=<html_fn>            Name of HTML output file
'''
}
from collections import OrderedDict, defaultdict
import csv
from docopt import docopt
import json
import math
import numpy as np
import pandas
import pkg_resources
import os.path
import scipy.stats
from sklearn.decomposition import PCA
from string import Template
from sklearn.preprocessing import scale
import sys
import warnings

from .common import CountMatrixFile, DetkModule, _cli_doc
from .report import DetkReport

def summary(count_mat, bins=20, log=False, density=False) :
    '''
    Compute summary statistics on a counts matrix file.

    This is equivalent to running each of these tools separately:

    - basestats
    - coldist
    - colzero
    - rowzero
    - entropy
    - pca

    Parameters
    ----------
    count_mat : CountMatrix object
        count matrix object
    bins : int
        number of bins, passed to coldist
    log : bool
        perform log10 transform of counts in coldist
    density : bool
        return a density distribution from coldist

    Returns
    -------
    list
        list of DetkModule subclasses for each of the called submodules
    '''
    total_output = [
        BaseStats(count_mat),
        ColDist(count_mat, bins, log, density),
        #RowDist(count_mat, bins, log, density),
        ColZero(count_mat),
        RowZero(count_mat),
        Entropy(count_mat),
        CountPCA(count_mat)
    ]
    return total_output
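
# A minimal, illustrative usage sketch of summary (never called here);
# 'counts.csv' is a hypothetical input file, and .name on each returned
# module is the DetkModule attribute relied on by main() below.
def _example_summary_usage():
    mat = CountMatrixFile('counts.csv')
    # each returned object is a DetkModule with .properties and .output
    for module in summary(mat, bins=20, log=True, density=True):
        print(module.name, sorted(module.properties))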

class BaseStats(DetkModule) :
    '''
    Basic statistics of the counts file

    The most basic statistics of the counts file, including:

    - number of columns
    - number of rows
    '''
    def __init__(self, count_mat) :
        self.count_mat = count_mat
    @property
    def properties(self):
        # get number of rows and number of columns from the counts matrix
        return {
            'num_rows': self.count_mat.counts.shape[0],
            'num_cols': self.count_mat.counts.shape[1]
        }
    @property
    def output(self):
        '''
        Example output::

            +basestats-+-----+
            | stat     | val |
            +----------+-----+
            | num_cols | 4   |
            | num_rows | 3   |
            +----------+-----+
        '''
        return [
            ['stat','val'],
            ['num_cols',self.properties['num_cols']],
            ['num_rows',self.properties['num_rows']]
        ]

class ColDist(DetkModule) :
    '''
    Column-wise distribution of counts

    Compute the distribution of counts column-wise. Each column is subject
    to binning by percentile, with output identical to that produced by
    np.histogram.

    Parameters
    ----------
    count_mat : CountMatrix
        count matrix containing counts
    bins : int
        number of bins to use when computing distribution
    log : bool
        take the log10 of counts+1 prior to computing distribution
    density : bool
        return densities rather than absolute bin counts for the
        distribution, densities sum to 1
    '''
    def __init__(self,count_mat,bins=100,log=False,density=False):
        self['params'] = {
            'bins': bins,
            'log': log,
            'density': density
        }
        self['pct'] = pct = np.arange(bins)/bins
        self['dists'] = []
        self.stats = stats = OrderedDict()
        for col in count_mat.counts:
            # access the data in each column
            data = count_mat.counts[col]
            # take the log10 of each count if the log option is specified
            if log :
                data = np.log10(data+1)
            # histogram bin edges and counts
            n, dist_bins = np.histogram(data,bins=bins,density=density)
            binstart = dist_bins[:-1]
            bincount = n
            pctVal = np.percentile(data,100*pct)
            stats[col] = OrderedDict(
                binstart=binstart,
                bincount=bincount,
                pct=pct,
                pctVal=pctVal
            )
            # unlog binstarts and pctVals
            if log :
                binstart = 10**binstart
                pctVal = 10**pctVal
            # make the dict for each sample
            self['dists'].append(
                {
                    'name':col,
                    'dist':list(zip(binstart,bincount)),
                    'percentiles':list(zip(pct,pctVal))
                }
            )
    @property
    def output(self) :
        '''
        Tabular output is a table with four columns per input counts column:

        - bin start value (column name: sampleA__binstart)
        - number of features with counts or density in bin (sampleA__bincount)
        - percentile increment (i.e. 0, 1, etc) (sampleA__pct)
        - percentile value for corresponding percentile (sampleA__pctVal)
        '''
        res = []
        for col in self.stats :
            for colstat in self.stats[col] :
                res.append(
                    ['{}__{}'.format(col,colstat)]+list(self.stats[col][colstat])
                )
        return list(list(_) for _ in zip(*res))
    @property
    def properties(self) :
        '''
        In the properties object, the fields are defined as follows:

        dists
            Array of objects containing one object for each column,
            described below.

        Each item of dists is an object with the following keys:

        name
            Column name from original file
        dist
            Array of (bin start, count) pairs defining the counts histogram
        percentiles
            Array of (percentile, count value) pairs defining the counts
            percentiles

        Example JSON properties output::

            {
              'dists' : [
                {
                  'name': 'H_0001',
                  'dist': [ [5, 129], [103, 317], ...],
                  'percentiles': [ [0, 193], [1, 362], ...],
                },
                {
                  'name': 'H_0002',
                  'dist': [ [6, 502], [122, 127], ...],
                  'percentiles': [ [0, 6000], [1, 6200], ...],
                }
              ]
            }
        '''
        return { 'dists': self['dists'] }
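
# A minimal sketch of the per-column binning ColDist performs, using numpy
# directly; the data values are made up.
def _example_coldist_binning():
    data = np.array([0., 3., 10., 12., 100., 250.])
    logged = np.log10(data + 1)
    bincount, edges = np.histogram(logged, bins=5)
    pct = np.arange(5) / 5
    pctVal = np.percentile(logged, 100 * pct)
    # 'dist' pairs each bin start with its count; 'percentiles' pairs each
    # percentile with its value, un-logged back into count space
    dist = list(zip(10 ** edges[:-1], bincount))
    percentiles = list(zip(pct, 10 ** pctVal))
    return dist, percentiles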

class RowDist(DetkModule):
    '''
    Row-wise distribution of counts

    Identical to coldist except calculated across rows. The name key is
    rowdist, and the name key of the items in dists is the row name from
    the counts file.

    Parameters
    ----------
    count_mat : CountMatrix
        count matrix containing counts
    bins : int
        number of bins to use when computing distribution
    log : bool
        take the log10 of counts prior to computing distribution
    density : bool
        return densities rather than absolute bin counts for the
        distribution, densities sum to 1
    '''
    def __init__(self, count_obj, bins=100, log=False, density=False) :
        self['params'] = {
            'bins': bins,
            'log': log,
            'density': density
        }
        self['pct'] = list(100*(_+1)/bins for _ in range(bins))
        self['dists'] = []
        for i in range(len(count_obj.feature_names)):
            # access the data in each row
            data = count_obj.counts.iloc[i]
            # compute log10 of each count if the log option is specified
            if log :
                data = np.log10(data)
            # quartiles for identifying upper and lower outliers
            Q1 = np.percentile(data, 25)
            Q3 = np.percentile(data, 75)
            IQR = Q3 - Q1
            # histogram bin edges and counts
            n, dist_bins = np.histogram(data,bins=bins,density=density)
            # make the dict for each row
            self['dists'].append(
                {
                    'name':count_obj.feature_names[i],
                    'dist':list(n),
                    'bins':list(dist_bins)[1:],
                    'extrema': {
                        'lower':[v for v in data if v < Q1-1.5*IQR],
                        'upper':[v for v in data if v > Q3+1.5*IQR]
                    }
                }
            )
    @property
    def output(self) :
        '''
        Tabular output is a table where each row corresponds to a row with
        row name as the first column. The next columns are broken into two
        parts:

        - the bin start values, named like bin_N, where N is the percentile
        - the bin count values, named like dist_N, where N is the percentile
        '''
        colnames = ['rowname']+\
            ['bin_{}'.format(_) for _ in self['pct']]+\
            ['dist_{}'.format(_) for _ in self['pct']]
        res = [colnames]
        for dist in self['dists'] :
            res.append([dist['name']]+dist['bins']+dist['dist'])
        return res
    @property
    def properties(self) :
        '''Same format as ColDist'''
        return {
            'pct': self['pct'],
            'dists': self['dists']
        }

class ColZero(DetkModule) :
    '''
    Column-wise distribution of zero counts

    Compute the number and fraction of exact zero counts for each column.
    '''
    def __init__(self,count_mat) :
        # get number of rows, number of columns, and sample names
        num_rows, num_cols = count_mat.counts.shape
        col_names = count_mat.sample_names
        # calculate zero counts, zero fractions, means, and nonzero means
        # for each column
        # the mean and median functions raise warnings when a row/col is all
        # zero, ignore them
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            zero_counts = (count_mat.counts==0).sum(axis=0).fillna(0)
            zero_fracs = zero_counts/num_rows
            col_means = count_mat.counts.mean(axis=0)
            col_medians = count_mat.counts.median(axis=0)
            nonzero_col_means = count_mat.counts[count_mat.counts!=0].mean(axis=0)
            nonzero_col_medians = count_mat.counts[count_mat.counts!=0].median(axis=0)
        self['zeros'] = []
        for i in range(0, num_cols):
            col = {}
            col['name'] = col_names[i]
            col['zero_count'] = zero_counts[i]
            col['zero_frac'] = zero_fracs[i]
            col['mean'] = col_means[i]
            col['median'] = col_medians[i]
            col['nonzero_mean'] = nonzero_col_means[i]
            col['nonzero_median'] = nonzero_col_medians[i]
            self['zeros'].append(col)
    @property
    def output(self) :
        '''
        Tabular output is a table where each row corresponds to a column
        with the following fields:

        - name: Column name
        - zero_count: Number of zero counts
        - zero_frac: Fraction of zero counts
        - mean: Overall mean count
        - median: Overall median count
        - nonzero_mean: Mean of non-zero counts only
        - nonzero_median: Median of non-zero counts only
        '''
        res = [['name','zero_count','zero_frac','mean','median',
            'nonzero_mean','nonzero_median']]
        for col in self['zeros'] :
            res.append([col[_] for _ in res[0]])
        return res
    @property
    def properties(self):
        '''
        The stats value is an array containing one object per column as
        follows:

        name
            column name
        zero_count
            absolute count of rows with exactly zero counts
        zero_frac
            zero_count divided by the number of rows
        mean
            the mean of counts in the column
        median
            the median of counts in the column
        nonzero_mean
            the mean of only the non-zero counts in the column
        nonzero_median
            the median of only the non-zero counts in the column

        Example JSON output::

            {
              'zeros' : [
                {
                  'name': 'col1',
                  'zero_count': 20,
                  'zero_frac': 0.2,
                  'mean': 101.31,
                  'median': 31.31,
                  'nonzero_mean': 155.23,
                  'nonzero_median': 55.18
                },
                {
                  'name': 'col2',
                  'zero_count': 0,
                  'zero_frac': 0,
                  'mean': 3021.92,
                  'median': 329.23,
                  'nonzero_mean': 3021.92,
                  'nonzero_median': 819.32
                },
              ]
            }
        '''
        return { 'zeros':self['zeros'] }
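
# A minimal sketch of the per-column zero statistics computed by ColZero,
# using a small made-up frame.
def _example_colzero():
    df = pandas.DataFrame({'s1': [0, 5, 10], 's2': [0, 0, 9]})
    zero_count = (df == 0).sum(axis=0)       # s1: 1, s2: 2
    zero_frac = zero_count / df.shape[0]     # s1: 1/3, s2: 2/3
    nonzero_mean = df[df != 0].mean(axis=0)  # s1: 7.5, s2: 9.0
    return zero_count, zero_frac, nonzero_mean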

class RowZero(DetkModule):
    '''
    Row-wise distribution of zero counts

    Computes statistics about the mean and median counts of rows grouped
    by their number of zero columns.
    '''
    def __init__(self,count_mat):
        # get counts and the number of columns
        cnts = count_mat.counts
        num_cols = cnts.shape[1]
        self['zeros'] = []
        num_zeros = (cnts==0).sum(axis=1)
        cum_frac = 0
        for i in range(0, num_cols+1) :
            # the mean and median functions raise warnings on empty subsets,
            # ignore them
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                cnts_subset = cnts[num_zeros==i]
                frac = cnts_subset.shape[0]/cnts.shape[0]
                cum_frac += frac
                num_zero = {
                    'num_zeros': i,
                    'num_features': (num_zeros==i).sum(),
                    'feature_frac': frac,
                    'cum_feature_frac': cum_frac,
                    'mean': cnts_subset.mean(axis=1).fillna(0).mean(),
                    'nonzero_mean':
                        cnts_subset[cnts_subset!=0].mean(axis=1).fillna(0).mean(),
                    'median': cnts_subset.median(axis=1).fillna(0).median(),
                    'nonzero_median':
                        cnts_subset[cnts_subset!=0].median(axis=1).fillna(0).median()
                }
            self['zeros'].append(num_zero)
    @property
    def output(self) :
        '''
        Tabular output is a table where each row corresponds to rows having
        a given number of zero columns with the following fields:

        - num_zeros: the number of zeros for this row
        - num_features: the number of features with this number of zeros
        - feature_frac: the fraction of features with this number of zeros
        - cum_feature_frac: cumulative fraction of features remaining with
          this number of zeros or fewer
        - mean: the mean count mean of genes with this number of zeros
        - nonzero_mean: the mean count mean of genes with this number of
          zeros, not including zero counts
        - median: the median count median of genes with this number of zeros
        - nonzero_median: the median count median of genes with this number
          of zeros, not including zero counts
        '''
        res = [['num_zeros','num_features','feature_frac','cum_feature_frac',
            'mean','nonzero_mean','median','nonzero_median']]
        for col in self['zeros'] :
            res.append([col[_] for _ in res[0]])
        return res
    @property
    def properties(self) :
        '''
        The stats value is an array containing one object per number of
        zeros as follows:

        num_zeros
            the number of zeros for this group of features
        num_features
            the number of features with this number of zeros
        feature_frac
            the fraction of features with this number of zeros
        cum_feature_frac
            cumulative fraction of features remaining with this number of
            zeros or fewer
        mean
            the mean count mean of genes with this number of zeros
        nonzero_mean
            the mean count mean of genes with this number of zeros, not
            including zero counts
        median
            the median count median of genes with this number of zeros
        nonzero_median
            the median count median of genes with this number of zeros, not
            including zero counts

        Example JSON output::

            {
              'zeros' : [
                {
                  'num_zeros': 0,
                  'num_features': 14031,
                  'feature_frac': .61,
                  'cum_feature_frac': .61,
                  'mean': 3351.13,
                  'nonzero_mean': 3351.13,
                  'median': 2125.9,
                  'nonzero_median': 2125.9
                },
                {
                  'num_zeros': 1,
                  'num_features': 5031,
                  'feature_frac': .21,
                  'cum_feature_frac': .82,
                  'mean': 3125.91,
                  'nonzero_mean': 3295.4,
                  'median': 1825.8,
                  'nonzero_median': 1976.1
                },
              ]
            }
        '''
        return { 'zeros':self['zeros'] }
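
# A minimal sketch of how RowZero groups rows by their number of zero
# columns, using a small made-up frame; empty groups yield NaN means.
def _example_rowzero_grouping():
    df = pandas.DataFrame({'s1': [0, 5, 0], 's2': [0, 2, 3], 's3': [1, 4, 6]})
    num_zeros = (df == 0).sum(axis=1)  # zeros per row: 2, 0, 1
    groups = []
    for i in range(df.shape[1] + 1):
        subset = df[num_zeros == i]
        # e.g. i=1 selects only the third row, whose mean is (0+3+6)/3 = 3.0
        groups.append((i, subset.shape[0], subset.mean(axis=1).mean()))
    return groups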

class Entropy(DetkModule) :
    '''
    Row-wise sample entropy calculation

    Sample entropy is a metric that can be used to identify outlier samples
    by locating rows which are overly influenced by a small number of count
    values. This metric can be calculated for a single row as follows::

        p_i = c_i/sum_j(c_j)
        sum_i(p_i) = 1
        H = -sum_i(p_i*log2(p_i))

    Here, c_i is the number of counts in sample i, p_i is the fraction of
    reads contributed by sample i to the overall counts of the row, and H is
    the `Shannon entropy`_ of the row when using log2. The maximum possible
    value of H is log2(m), where m is the number of columns.

    Rows with a very low H have most of their count mass contained in a
    small number of columns. These are rows that are likely to drive
    outliers in downstream analysis, e.g. differential expression.

    .. _Shannon entropy: https://en.wikipedia.org/wiki/Entropy_(information_theory)
    '''
    def __init__(self,count_mat) :
        # get counts, number of columns, and number of rows
        cnts = count_mat.counts.values
        num_cols = len(cnts[0])
        num_rows = len(cnts)
        # the entropy function raises warnings on all-zero rows, ignore them
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            entropies = count_mat.counts.apply(scipy.stats.entropy,axis=1).fillna(0)
        # the number of percentile bins is the minimum of:
        # - the number of distinct entropy values
        # - the number of features
        # - 100
        pct = list(np.linspace(0,100,min(len(set(entropies)),cnts.shape[0],101)))
        pctVal = np.percentile(entropies,pct,interpolation='higher')
        # format output
        self['entropies'] = res = defaultdict(list)
        res.update({
            'pct':pct,
            'pctVal':pctVal
        })
        for p1, p2 in zip(pctVal.tolist(),pctVal[1:].tolist()+[1e6]) :
            pct_features = entropies.index[(entropies>=p1) & (entropies<p2)]
            res['num_features'].append(pct_features.size)
            res['frac_features'].append(pct_features.size/entropies.size)
            res['cum_frac_features'].append(sum(res['frac_features']))
            if entropies[pct_features].size != 0 :
                min_feature = entropies[pct_features].idxmin()
                res['exemplar_features'].append({
                    'name': min_feature,
                    'entropy': entropies[min_feature],
                    'counts': list(zip(
                            count_mat.counts.columns,
                            count_mat.counts.loc[min_feature].tolist()
                        )
                    )
                })
            else :
                res['exemplar_features'].append({
                    'name': 'No genes in bin',
                    'entropy': [],
                    'counts': []
                })
    @property
    def output(self) :
        '''
        Tabular output is a table where each row corresponds to a percentile
        with the following columns:

        pct
            percentile of entropy distribution
        pctVal
            the entropy value for each percentile
        num_features
            the number of features with entropy in the corresponding
            percentile
        frac_features
            the fraction of features with entropy in the corresponding
            percentile
        cum_frac_features
            the cumulative fraction of features with entropy in the
            corresponding percentile, i.e. the fraction of features with
            pctVal entropy or lower
        exemplar_feature
            the name of a feature with an entropy in the given percentile
        '''
        res = [['pct','pctVal','num_features','frac_features',
            'cum_frac_features','exemplar_feature']]
        fields = [self['entropies'][_] for _ in res[0][:-1]]
        exemplar_names = [[_['name'] for _ in self['entropies']['exemplar_features']]]
        res.extend(list(zip(*fields+exemplar_names)))
        return res
    @property
    def properties(self) :
        '''
        The key entropies contains a single object with the following keys:

        pct
            percentile of entropy distribution
        pctVal
            the entropy value for each percentile
        num_features
            the number of features with entropy in the corresponding
            percentile
        frac_features
            the fraction of features with entropy in the corresponding
            percentile
        cum_frac_features
            the cumulative fraction of features with entropy in the
            corresponding percentile, i.e. the fraction of features with
            pctVal entropy or lower
        exemplar_features
            an array of objects with an exemplar feature for each percentile
            with the following fields:

            name
                the name of the feature
            entropy
                the sample entropy of the feature
            counts
                array of [column name, count] pairs sorted by count ascending

        Example JSON output::

            {
              'pct': [0, 1, 2, 3, ...],
              'pctVal': [0, 0.1, 0.5, 0.9, ...],
              'num_features': [10, 12, 23, 100, ...],
              'frac_features': [0.001, 0.0012, 0.0023, 0.01, ...],
              'cum_frac_features': [0.001, 0.0022, 0.0045, 0.0145, ...],
              'exemplar_features': [
                {
                  'name': 'ENSG0000055095.1',
                  'entropy': 0,
                  'counts': [ ['sampleA', 0], ['sampleB', 0], ..., ['sampleN', 1]]
                },
                {
                  'name': 'ENSG0000398715.1',
                  'entropy': 0.11,
                  'counts': [ ['sampleA', 0], ['sampleB', 0], ..., ['sampleM', 5]]
                }
              ]
            }
        '''
        return {'entropies': self['entropies']}
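
# A worked example of the row entropy computation above. Note that
# scipy.stats.entropy normalizes the row to probabilities and defaults to
# the natural log, so pass base=2 to obtain the log2 values described in
# the docstring; the counts below are made up.
def _example_row_entropy():
    row = np.array([8., 1., 1.])          # p = [0.8, 0.1, 0.1]
    H = scipy.stats.entropy(row, base=2)  # ~0.922 bits
    H_max = np.log2(len(row))             # ~1.585 bits for perfectly even counts
    return H, H_max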

class CountPCA(DetkModule) :
    '''
    Principal component analysis of the counts matrix.

    This module performs PCA on a provided counts matrix and returns the
    principal component weights, scores, and variances. In addition, the
    weights and scores for each individual component can be combined to
    define the projection of each sample along that component.

    The PCA module can also use a counts matrix that has associated column
    data information about the samples in each column. The user can specify
    some of these columns to include as variables for plotting purposes.
    The idea is that samples labeled with the same class will be colored
    according to their class, such that separations in the data can be more
    easily observed when projections are plotted.
    '''
    def __init__(self,count_mat) :
        # get counts from file and scale counts
        # counts matrices are n_features x n_samples, need to transpose
        # since PCA expects n_samples x n_features
        cnts = scale(count_mat.counts.values.astype(float).T)
        # perform PCA and fit to the data
        pca = PCA(n_components=min(*cnts.shape))
        pca.fit(cnts)
        X = pca.transform(cnts)
        # get sample names
        sample_names = list(count_mat.counts.columns)
        # format output
        self['column_names'] = sample_names
        self['column_variables'] = {}
        # if metadata option is given, get column variables
        if count_mat.column_data is not None :
            columns = []
            for k,v in count_mat.column_data.items() :
                if k != 'counts' :
                    columns.append({'column':k,'values':v.tolist()})
            self['column_variables'] = {
                'sample_names': count_mat.column_data.index.tolist(),
                'columns': columns
            }
        self['components'] = []
        for i in range(0, pca.n_components_):
            comp = {}
            comp['name'] = 'PC' + str(i+1)
            comp['scores'] = [row[i] for row in X]
            comp['projections'] = [row[i] for row in pca.components_]
            comp['perc_variance'] = pca.explained_variance_ratio_[i]
            if np.isnan(comp['perc_variance']) :
                raise Exception('nan encountered when calculating PCA '
                    'component percent variance, meaning the counts '
                    'features have zero total variance and PCA cannot be '
                    'computed. Examine your counts matrix if you did not '
                    'expect this.')
            self['components'].append(comp)
    @property
    def name(self):
        return 'pca'
    @property
    def output(self) :
        '''
        Tabular output is a table where each row corresponds to a column in
        the counts matrix with the following fields:

        name
            name of the column for the row
        PC*X*_*YY*
            projection of principal component X (e.g. 1), which explains YY
            percent of the variance, for each column
        '''
        res = [['colname']+self['column_names']]
        for comp in self['components'] :
            name = '{}_{:03d}'.format(comp['name'],int(100*comp['perc_variance']))
            res.append([name]+comp['projections'])
        # transpose the list of lists
        res = list(zip(*res))
        return res
    @property
    def properties(self) :
        '''
        Example JSON output::

            {
              'column_names': ['sample1','sample2',...],
              'column_variables': {
                'sample_names': ['sample1','sample2',...],
                'columns': [
                  {
                    'column':'status',
                    'values':['disease','control',...]
                  },
                  {
                    'column':'batch',
                    'values':['b1','b1',...]
                  }
                ]
              },
              'components': [
                {
                  'name': 'PC1',
                  'scores': [0.126,0.975,...], # length n
                  'projections': [-8.01,5.93,...], # length m, ordered by 'column_names'
                  'perc_variance': 0.75
                },
                {
                  'name': 'PC2',
                  'scores' : [0.126,0.975,...], # length n
                  'projections': [5.93,-5.11,...], # length m
                  'perc_variance': 0.22
                }
              ]
            }
        '''
        return {
            'column_names': self['column_names'],
            'column_variables': self['column_variables'],
            'components': self['components']
        }
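
# A minimal sketch of the PCA conventions used by CountPCA: the counts
# matrix is features x samples, so it is transposed and scaled before
# fitting; the values below are made up.
def _example_count_pca():
    counts = np.array([[10., 200., 30., 4.],
                       [40., 5., 60., 8.],
                       [7., 30., 2., 90.]])  # 3 features x 4 samples
    X = scale(counts.T)                      # rows become samples
    pca = PCA(n_components=min(*X.shape))
    scores = pca.fit_transform(X)            # per-sample coordinates
    return scores, pca.explained_variance_ratio_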

def main(argv=sys.argv) :

    if '--version' in argv :
        from .version import __version__
        print(__version__)
        return

    # add the common opts to the docopt strings
    cmd_opts_aug = {}
    for k,v in cmd_opts.items() :
        cmd_opts_aug[k] = _cli_doc(v)

    if len(argv) < 2 or (len(argv) > 1 and argv[1] not in cmd_opts) :
        docopt(_cli_doc(__doc__))
    argv = argv[1:]
    cmd = argv[0]

    if cmd == 'pca' :
        args = docopt(cmd_opts_aug['pca'],argv)
        counts_obj = CountMatrixFile(
            args['<counts_fn>'],
            column_data_f=args['--column-data']
        )
        output = CountPCA(counts_obj)
    elif cmd == 'summary' :
        args = docopt(cmd_opts_aug['summary'],argv)
        counts_obj = CountMatrixFile(
            args['<counts_fn>'],
            column_data_f=args['--column-data']
        )
        output = summary(counts_obj,
            int(args['--bins']),
            args['--log'],
            args['--density']
        )
    elif cmd == 'coldist' :
        args = docopt(cmd_opts_aug['coldist'],argv)
        counts_obj = CountMatrixFile(args['<counts_fn>'])
        output = ColDist(counts_obj,
            bins=int(args['--bins']),
            log=args['--log'],
            density=args['--density']
        )
    elif cmd == 'rowdist' :
        args = docopt(cmd_opts_aug['rowdist'],argv)
        counts_obj = CountMatrixFile(args['<counts_fn>'])
        output = RowDist(counts_obj,
            bins=int(args['--bins']),
            log=args['--log'],
            density=args['--density']
        )
    elif cmd == 'colzero' :
        args = docopt(cmd_opts_aug['colzero'],argv)
        counts_obj = CountMatrixFile(args['<counts_fn>'])
        output = ColZero(counts_obj)
    elif cmd == 'rowzero' :
        args = docopt(cmd_opts_aug['rowzero'],argv)
        counts_obj = CountMatrixFile(args['<counts_fn>'])
        output = RowZero(counts_obj)
    elif cmd == 'entropy' :
        args = docopt(cmd_opts_aug['entropy'],argv)
        counts_obj = CountMatrixFile(args['<counts_fn>'])
        output = Entropy(counts_obj)
    elif cmd == 'basestats' :
        args = docopt(cmd_opts_aug['basestats'],argv)
        counts_obj = CountMatrixFile(args['<counts_fn>'])
        output = BaseStats(counts_obj)

    # make output a list if it is a singleton
    if not isinstance(output,list) :
        output = [output]

    # obtain string used to name output files, unless filename is specified
    filename_prefix = os.path.splitext(args['<counts_fn>'])[0]

    outf = sys.stdout
    if args['--output'] != 'stdout' :
        outf = open(args['--output'],'wt')

    if args['--format'] == 'table' :
        from terminaltables import AsciiTable
        for out in output :
            table = AsciiTable(out.output)
            table.title = out.name
            outf.write(table.table+'\n')
    else : # csv is default
        out_writer = csv.writer(outf,delimiter=',')
        # write out the output data
        if len(output) == 1 :
            out_writer.writerows(output[0].output)
        else :
            for out in output :
                out_writer.writerow(['#{}'.format(out.name)])
                out_writer.writerows(out.output)

    # write out the report json
    with DetkReport(args['--report-dir']) as r :
        for out in output :
            r.add_module(
                out,
                in_file_path=args['<counts_fn>'],
                out_file_path=args['--output'],
                column_data_path=args.get('--column-data'),
                workdir=os.getcwd()
            )

if __name__ == '__main__':
    main()