enrich - Set Enrichment Methods

Functions for performing statistical set enrichment methods, e.g. Gene Set Enrichment Analysis

fgsea

de_toolkit.enrich.fgsea(gmt, stat, minSize=15, maxSize=500, nperm=10000, nproc=None, rda_fn=None)

Perform pre-ranked Gene Set Enrichment Analysis using the fgsea Bioconductor package

Compute GSEA enrichment for the gene sets in the GMT object gmt, using the statistics in the pandas.Series stat. The fgsea Bioconductor package must be installed on your system for this function to work.

The output dataframe contains one result row per feature set in the GMT file, in the same order. Output columns include:

  • name: name of feature set
  • ES: GSEA enrichment score
  • NES: GSEA normalized enrichment score
  • pval: nominal p-value
  • padj: Benjamini-Hochberg adjusted p-value
  • nMoreExtreme: number of permutations with a more extreme NES than the observed value
  • size: number of features in the feature set
  • leadingEdge: the leading edge features as defined by GSEA (string with space-separated feature names)
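Because leadingEdge is stored as a single space-separated string, downstream code usually needs to split it. The following is a minimal sketch using pandas only; the result rows, the `leadingEdge_list` column name, and the 0.05 cutoff are made up for illustration and are not produced by de_toolkit.

```python
import pandas as pd

# Hypothetical fgsea-style result rows; all values are invented for illustration.
res = pd.DataFrame({
    "name": ["SET_A", "SET_B"],
    "NES": [2.1, -1.4],
    "padj": [0.003, 0.20],
    "leadingEdge": ["TP53 MYC EGFR", "BRCA1 ATM"],
})

# Split the space-separated leadingEdge string into a list of feature names.
res["leadingEdge_list"] = res["leadingEdge"].str.split(" ")

# Keep only the feature sets passing an arbitrary adjusted p-value cutoff.
significant = res[res["padj"] < 0.05]
```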

Command line usage:

Perform preranked Gene Set Enrichment Analysis using the fgsea Bioconductor
package on the given GMT gene set file.

The GMT file must be tab delimited with set name in the first column, a
description in the second column (ignored by detk), and an individual feature
ID in each column after, one feature set per line. The result file can be any
character-delimited file, and is assumed to have column names in the first row.
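To make the GMT layout described above concrete, here is a small sketch of a reader for that format. The `parse_gmt` function and the example sets are hypothetical and written only for illustration; this is not the parser detk uses internally.

```python
import io

def parse_gmt(handle):
    """Return {set_name: [feature_id, ...]} from a tab-delimited GMT handle."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and sets with no features
        # First column is the set name, second is a description (ignored),
        # and every remaining column is one feature ID.
        name, _description, *features = fields
        gene_sets[name] = [f for f in features if f]
    return gene_sets

# Two made-up feature sets, one per line.
example = io.StringIO(
    "SET_A\tfirst example set\tTP53\tMYC\tEGFR\n"
    "SET_B\tsecond example set\tBRCA1\tATM\n"
)
gene_sets = parse_gmt(example)
```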

The feature IDs must be from the same system (e.g. gene symbols, ENSGIDs, etc.)
in both GMT and result files. The user will likely have to provide:

- -i <col>: column name in the results file that contains feature IDs, e.g.
  gene_name
- -c <col>: column name in the results file that contains the statistics to
  use when computing enrichment, e.g. log2FoldChange

fgsea: https://bioconductor.org/packages/release/bioc/html/fgsea.html

Usage:
    detk-enrich fgsea [options] <gmt_fn> <result_fn>

Options:
    -h --help                 Print out this help
    -o FILE --output=FILE     Destination of fgsea output [default: stdout]
    -p PROCS --cores=PROCS    Ask BiocParallel to use PROCS processes when
                              executing fgsea in parallel; requires the
                              BiocParallel package to be installed
    -i FIELD --idcol=FIELD    Column name or 0-based integer index to use as
                              the gene identifier [default: 0]
    -c FIELD --statcol=FIELD  Column name or 0-based integer index to use as
                              the statistic for ranking, defaults to the last
                              numeric column in the file
    -a --ascending            Sort column ascending (default is descending);
                              use this if you are sorting by p-value or want
                              to reverse the directionality of the NES scores
    --abs                     Take the absolute value of the column before
                              passing to fgsea
    --minSize=INT             minSize argument to fgsea [default: 15]
    --maxSize=INT             maxSize argument to fgsea [default: 500]
    --nperm=INT               nperm argument to fgsea [default: 10000]
    --rda=FILE                Write out the fgsea result to the provided file
                              using saveRDS() in R
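The --idcol, --statcol, --ascending and --abs options together determine the ranked statistic that is passed to fgsea. The sketch below illustrates that idea with pandas. The `prepare_stat` helper, its parameter names, and the example table are assumptions made for illustration; the actual detk implementation may differ.

```python
import io
import pandas as pd

def prepare_stat(df, idcol=0, statcol=None, ascending=False, use_abs=False):
    """Build a ranked pandas.Series of statistics indexed by feature ID."""
    def resolve(col):
        # Accept either a column name or a 0-based integer position.
        return df.columns[col] if isinstance(col, int) else col

    id_name = resolve(idcol)
    if statcol is None:
        # Fall back to the last numeric column in the table.
        stat_name = df.select_dtypes(include="number").columns[-1]
    else:
        stat_name = resolve(statcol)

    stat = df.set_index(id_name)[stat_name]
    if use_abs:
        stat = stat.abs()  # rank by magnitude, ignoring direction
    return stat.sort_values(ascending=ascending)

# A made-up, tab-delimited differential expression result table.
results = pd.read_csv(io.StringIO(
    "gene_name\tlog2FoldChange\tpadj\n"
    "TP53\t-2.0\t0.01\n"
    "MYC\t1.5\t0.04\n"
    "EGFR\t0.3\t0.60\n"
), sep="\t")

ranked = prepare_stat(results, idcol="gene_name", statcol="log2FoldChange")
ranked_abs = prepare_stat(results, idcol="gene_name",
                          statcol="log2FoldChange", use_abs=True)
```

In this example, leaving `statcol` unset would select padj, the last numeric column, which is rarely the intended ranking statistic. Passing the statistic column explicitly is usually the safer choice.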