outlier - Outlier Identification

Functions for identifying/manipulating outlier counts.

entropy

de_toolkit.outlier.entropy(counts_obj, threshold)[source]

Calculate sample entropy for each gene and flag genes whose entropy falls below the lower percentile threshold.

Sample entropy is a metric that can be used to identify outlier samples by locating rows which are overly influenced by a single count value. This metric is calculated for each gene/feature g as follows:

p_i = c_i / sum_j(c_j)
sum_i(p_i) = 1
H_g = -sum_i(p_i * log2(p_i))

Here, c_i is the number of counts in sample i, p_i is the fraction of reads contributed by sample i to the overall counts of the row, and H_g is the Shannon entropy of the row when using log2. The maximum possible value of H_g is log2(n) for a row with n samples, reached when every sample contributes the same number of counts (for example, 2 when there are four samples). Genes/features with very low entropy are those where a small number of samples makes up most of the counts across all samples.
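The formula above can be sketched with NumPy as follows. This is a minimal illustration, not the de_toolkit implementation: the function name row_entropy is hypothetical, the input is assumed to be a plain 2-D array with genes as rows and samples as columns, and zero proportions are assumed to contribute nothing to the sum (the usual 0 * log2(0) = 0 convention).

```python
import numpy as np

def row_entropy(counts):
    """Shannon entropy (log2) of each row of a 2-D count array (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # p_i = c_i / sum_j(c_j); rows whose total is zero keep all-zero proportions
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # Assumption: 0 * log2(0) is treated as 0, so zero proportions are skipped
    logp = np.log2(p, out=np.zeros_like(p), where=p > 0)
    # H_g = -sum_i(p_i * log2(p_i))
    return -(p * logp).sum(axis=1)

counts = np.array([
    [10, 10, 10, 10],  # counts spread evenly over four samples
    [40, 0, 0, 0],     # a single sample holds every count
    [37, 1, 1, 1],     # one sample dominates the row
])
H = row_entropy(counts)
```

With four samples, the evenly spread row gets an entropy of log2(4) = 2, the row held by a single sample gets 0, and the dominated row falls in between, close to the low end.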

Parameters:
  • counts_obj (de_toolkit.CountMatrix) – count matrix object
  • threshold (float) – the lower percentile below which to flag genes
Returns:

data frame with one row for each row in the input counts matrix and two columns:

  • entropy: the calculated entropy value for that row
  • entropy_p0_XX: a True/False column for genes flagged as having an entropy value less than the 0.XX percentile; XX is the first two digits of the selected threshold

Return type:

pandas.DataFrame
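As a rough sketch of the returned structure, the snippet below builds a two-column data frame from precomputed entropy values. It is not the library's code: flag_low_entropy is a hypothetical helper, the threshold is assumed to be a fraction between 0 and 1 that is passed to pandas' quantile, and the way the flag column name is derived from the threshold is a guess at the entropy_p0_XX pattern described above.

```python
import pandas as pd

def flag_low_entropy(entropy, threshold):
    """Flag rows whose entropy lies below the given lower quantile (hypothetical helper)."""
    entropy = pd.Series(entropy, name="entropy")
    # Assumption: threshold is a fraction in [0, 1], interpreted as a quantile
    cutoff = entropy.quantile(threshold)
    # Assumption: 0.25 becomes "entropy_p0_25"; the real naming rule may differ
    flag_col = "entropy_p" + f"{threshold:.2f}".replace(".", "_")
    return pd.DataFrame({"entropy": entropy, flag_col: entropy < cutoff})

entropy = pd.Series(
    [2.0, 0.0, 0.5, 1.9, 1.8],
    index=["g1", "g2", "g3", "g4", "g5"],
)
result = flag_low_entropy(entropy, 0.25)
```

In this toy input the 0.25 quantile is 0.5, so only g2, whose entropy is strictly below that cutoff, is flagged.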

Command line usage:

Usage:
    detk-outlier entropy <counts_fn> [options]

Options:
    -p P --percentile=P    Float value between 0 and 1
    -o FILE --output=FILE  Name of the output csv
    --plot-output=FILE     Name of the plot png

shrink

de_toolkit.outlier.shrink(count_obj, shrink_factor=0.25, p_max=None, iters=1000)[source]

Outlier count shrinkage routine as described in Labadorf et al., PLOS ONE (2015).

This algorithm identifies features where a small number of samples contains a disproportionately large number of the overall counts for that feature across samples. For each feature the algorithm is as follows:

  1. Divide each sample count by the sum of counts (i.e. sample count proportions)

  2. Identify samples whose count proportion is greater than p_max

    1. If no samples are identified, return the most recent set of adjusted counts
    2. Else, shrink each identified outlier sample o toward s, the largest sample whose count proportion is below p_max: multiply the difference between o and s by the shrinkage factor, then replace o with s plus the shrunken difference
  3. Go to 1, repeat until no samples exceed p_max count proportion

This strategy assumes that samples with a disproportionate count contribution are outliers, and that the rank order of the samples is correct even when the magnitude is not. The order of the samples is thus always maintained, and the shrinking does not introduce new false positives beyond what would already be in the dataset. The maximum proportion of reads allowed in one sample, p, and the shrinkage factor were both set to 0.2. Note that the defaults of this function differ: shrink_factor defaults to 0.25 and p_max to sqrt(1/num_samples).
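The per-feature loop described above can be sketched for a single row of counts as follows. This is an illustrative reading of the steps, not the de_toolkit source: shrink_row is a hypothetical name, iters is assumed to be a cap on the number of passes, and stopping when every sample exceeds p_max (so that no reference sample s exists) is an added safeguard.

```python
import numpy as np

def shrink_row(counts, shrink_factor=0.25, p_max=None, iters=1000):
    """Shrink outlier counts in one feature row (hypothetical sketch)."""
    x = np.asarray(counts, dtype=float).copy()
    if p_max is None:
        # Documented default: sqrt(1 / num_samples)
        p_max = np.sqrt(1.0 / x.size)
    for _ in range(iters):  # assumption: iters caps the number of passes
        total = x.sum()
        if total == 0:
            break
        props = x / total                # step 1: sample count proportions
        outliers = props > p_max         # step 2: samples above p_max
        # Stop when nothing exceeds p_max, or (added safeguard) when every
        # sample does and there is no reference sample left to shrink toward
        if not outliers.any() or outliers.all():
            break
        s = x[~outliers].max()           # largest non-outlier count
        # Replace each outlier o with s + shrink_factor * (o - s)
        x[outliers] = s + shrink_factor * (x[outliers] - s)
    return x

row = [10, 12, 11, 9, 500]
shrunk = shrink_row(row)
```

Because each outlier is moved toward s but never below it, the largest sample stays the largest after shrinking, which matches the order-preserving property described above.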

Parameters:
  • count_obj (de_toolkit.CountMatrix object) – counts object
  • shrink_factor (float) – number between 0 and 1 that determines how much the residual counts of outlier samples are shrunk in each iteration
  • p_max (float) – number between 0 and 1 that indicates the maximum proportion of counts a sample may have before being considered an outlier, default is sqrt(1/num_samples)

Command line usage:

Usage:
    detk-transform shrink [options] <count_fn>

Options:
    -o FILE --output=FILE  Destination of primary output [default: stdout]