`filter` - Filtering Count Matrices

Functions for filtering count matrices based on various criteria.

The module accepts a single counts file as input and writes a file containing only the rows that pass a filter command. By default, the output file has the input's basename followed by ‘_filtered’ and the same file extension, so counts.csv will produce counts_filtered.csv. The default output filename can be changed with the optional command line argument ‘--output=<out_fn>’.
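The default naming rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not detk's own code; the helper name `default_output_name` is hypothetical.

```python
import os

def default_output_name(counts_fn):
    """Hypothetical helper: insert '_filtered' between the basename and the extension."""
    base, ext = os.path.splitext(counts_fn)
    return f"{base}_filtered{ext}"

print(default_output_name("counts.csv"))  # counts_filtered.csv
```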

Quick start

Here is an example command that takes a normalized count matrix and retains only those genes that have a mean count greater than 10 across all samples.

detk-filter -o counts_gt10.csv 'mean(all) > 10' norm_counts.csv
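To make the effect of 'mean(all) > 10' concrete, here is a minimal Python sketch of the same row selection on toy data. The counts and the function name `filter_mean_all` are made up for illustration, and the sketch assumes the comparison is strict, as the '>' in the command suggests.

```python
from statistics import mean

# Toy normalized counts: gene -> counts across all samples (made-up values).
counts = {
    "geneA": [12.0, 15.0, 9.0, 20.0],   # mean 14.0
    "geneB": [0.0, 3.0, 1.0, 2.0],      # mean 1.5
    "geneC": [10.0, 10.0, 10.0, 10.0],  # mean exactly 10.0
}

def filter_mean_all(counts, threshold):
    """Keep rows whose mean over all samples is strictly greater than threshold."""
    return {gene: row for gene, row in counts.items() if mean(row) > threshold}

kept = filter_mean_all(counts, 10)
print(sorted(kept))  # ['geneA']
```

Note that geneC is dropped in this sketch because its mean equals the threshold rather than exceeding it.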

Incorporating column data into filter

A column data file can optionally be supplied to the filter module. The column data file should specify subsets of the samples that the filter can then be applied to separately. The first column of the file must match the sample names given in the counts file. For example, if your counts file contains samples ‘A’, ‘B’, ‘C’, ‘D’, and ‘E’, a column data file might look like this:

sample_name, condition
A, test
B, test
C, test
D, control
E, control
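Before filtering, it can help to confirm that the sample names in the two files agree. The sketch below parses the example above with Python's csv module and reports names found in only one of the files. The function names are hypothetical, and how strictly detk itself enforces the match is not stated here.

```python
import csv
import io

column_data_text = """sample_name, condition
A, test
B, test
C, test
D, control
E, control
"""

def read_column_data(text):
    """Map each sample name (first column) to its remaining column values."""
    # skipinitialspace drops the blank that follows each comma in the example.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    return {row[0]: dict(zip(header[1:], row[1:])) for row in reader if row}

def compare_sample_names(counts_samples, column_data):
    """Return (names only in the counts file, names only in the column data)."""
    counts_set, coldata_set = set(counts_samples), set(column_data)
    return sorted(counts_set - coldata_set), sorted(coldata_set - counts_set)

column_data = read_column_data(column_data_text)
print(column_data["D"])  # {'condition': 'control'}
print(compare_sample_names(["A", "B", "C", "D", "E"], column_data))  # ([], [])
```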

Using the column data, the filter module can be run in two different ways. The first way applies the filter to each group separately; a row is filtered out only if every group fails the filter criteria. To use this method, write the filter command as follows:

mean(condition) > 10

The condition column spec above refers to the condition column in the column data file. This filter will retain genes that have a mean count greater than 10 within the test samples, the control samples, or both. This enables expressive filtering schemes, for example:
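The per-group rule for 'mean(condition) > 10' can be sketched in Python as follows. The data and function names are invented for illustration; the sketch only assumes what is described above, namely that a row is kept when at least one group passes.

```python
from statistics import mean

samples = ["A", "B", "C", "D", "E"]
condition = {"A": "test", "B": "test", "C": "test", "D": "control", "E": "control"}

counts = {
    "gene1": [30, 25, 35, 0, 0],  # test mean 30, control mean 0
    "gene2": [2, 3, 1, 4, 2],     # test mean 2, control mean 3
    "gene3": [5, 5, 5, 40, 60],   # test mean 5, control mean 50
}

def group_columns(samples, labels):
    """Map each group label to the column indices of its samples."""
    groups = {}
    for idx, sample in enumerate(samples):
        groups.setdefault(labels[sample], []).append(idx)
    return groups

def filter_mean_by_group(counts, groups, threshold):
    """Keep rows where the mean within at least one group exceeds threshold."""
    kept = {}
    for gene, row in counts.items():
        group_means = [mean([row[i] for i in idxs]) for idxs in groups.values()]
        if any(m > threshold for m in group_means):
            kept[gene] = row
    return kept

groups = group_columns(samples, condition)
kept = filter_mean_by_group(counts, groups, 10)
print(sorted(kept))  # ['gene1', 'gene3']
```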

nonzero(condition) > 0.5

This filter retains genes that have fewer than 50% zero counts in at least one condition, so genes that are expressed only in test or only in control will still proceed to downstream analysis. Filtering on the overall mean may eliminate these genes, which are often the most interesting, from consideration.
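The following Python sketch illustrates why a per-group nonzero filter can keep a gene that an overall mean filter drops. It assumes that nonzero yields the fraction of samples in a group with a nonzero count, which is how the text above reads; the data and function names are hypothetical.

```python
from statistics import mean

# Columns 0-2 are test samples, columns 3-4 are control samples (toy layout).
groups = {"test": [0, 1, 2], "control": [3, 4]}

counts = {
    "gene_test_only": [8, 12, 10, 0, 0],  # expressed only in test
    "gene_sparse": [0, 0, 5, 0, 0],       # mostly zero everywhere
    "gene_half": [0, 4, 0, 3, 0],         # exactly half nonzero in control
}

def nonzero_fraction(values):
    """Fraction of entries that are not zero."""
    return sum(1 for v in values if v != 0) / len(values)

def filter_nonzero_by_group(counts, groups, threshold):
    """Keep rows where at least one group has a nonzero fraction above threshold."""
    return {
        gene: row
        for gene, row in counts.items()
        if any(nonzero_fraction([row[i] for i in idxs]) > threshold
               for idxs in groups.values())
    }

kept_by_group = filter_nonzero_by_group(counts, groups, 0.5)
kept_by_overall_mean = {g: r for g, r in counts.items() if mean(r) > 10}

print(sorted(kept_by_group))         # ['gene_test_only']
print(sorted(kept_by_overall_mean))  # []
```

Here gene_test_only has an overall mean of 6, so an overall 'mean(all) > 10' filter would discard it, while the per-group nonzero filter keeps it.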

The second way to use column data is to filter rows based on one specific group. The command names a value of the column in square brackets, which subsets the columns of the counts matrix so that the filter is applied only to that group:

mean(condition[test]) > 10

This filter will retain genes that have a mean count greater than 10 in the test samples, regardless of the counts in the control samples.
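As a rough Python illustration of 'mean(condition[test]) > 10', the sketch below restricts the mean to the columns labeled test and ignores the rest. The data and the function name `filter_mean_in_group` are made up for the example.

```python
from statistics import mean

# Group label for each column of the toy counts matrix, in column order.
labels = ["test", "test", "test", "control", "control"]

counts = {
    "gene1": [30, 25, 35, 0, 0],  # test mean 30
    "gene3": [5, 5, 5, 40, 60],   # test mean 5, control mean 50
}

def filter_mean_in_group(counts, labels, group, threshold):
    """Keep rows whose mean over the columns of one named group exceeds threshold."""
    idxs = [i for i, label in enumerate(labels) if label == group]
    return {
        gene: row
        for gene, row in counts.items()
        if mean([row[i] for i in idxs]) > threshold
    }

kept = filter_mean_in_group(counts, labels, "test", 10)
print(sorted(kept))  # ['gene1']
```

In this sketch gene3 is dropped even though its control counts are high, because only the test columns are considered.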
