Details of predictive capability columns

By default, epcy pred and epcy pred_rna write an output file, predictive_capability.tsv, with 9 columns:

Default columns:
- id: the id of each feature.
- l2fc: log2 fold change.
- kernel_mcc: Matthews Correlation Coefficient (MCC) computed by a predictor using KDE.
- kernel_mcc_low, kernel_mcc_high: lower and upper bounds of the 90% confidence interval.
- mean_query: average value of this feature across samples in the subgroup of interest, defined using the --query parameter.
- mean_ref: average value of this feature across samples in the reference group (not in the query subset).
- bw_query: estimated bandwidth used by KDE, to calculate the density of query samples.
- bw_ref: estimated bandwidth used by KDE, to calculate the density of ref samples.
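
As an illustration of how these columns can be used downstream, the sketch below ranks features by kernel_mcc using only the Python standard library. It assumes the file is tab-separated with a header row matching the column names listed above; the helper name top_features and the values in the example table are made up for this illustration and are not EPCY output.

```python
import csv
import io

def top_features(tsv_text, n=3):
    """Return the n rows with the highest kernel_mcc, as dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows where the score is empty or nan, so that sorting is well defined.
    rows = [r for r in reader if r["kernel_mcc"].strip().lower() not in ("", "nan")]
    rows.sort(key=lambda r: float(r["kernel_mcc"]), reverse=True)
    return rows[:n]

# Hypothetical content with the 9 default columns (illustrative values only).
example = (
    "id\tl2fc\tkernel_mcc\tkernel_mcc_low\tkernel_mcc_high"
    "\tmean_query\tmean_ref\tbw_query\tbw_ref\n"
    "geneA\t2.1\t0.85\t0.78\t0.91\t7.4\t5.3\t0.31\t0.28\n"
    "geneB\t0.2\t0.10\t0.02\t0.19\t4.1\t3.9\t0.40\t0.42\n"
    "geneC\t-1.6\t0.62\t0.51\t0.70\t2.2\t3.8\t0.25\t0.33\n"
)

for row in top_features(example, n=2):
    print(row["id"], row["kernel_mcc"], row["kernel_mcc_low"], row["kernel_mcc_high"])
```

To apply the same helper to a real run, read the text of predictive_capability.tsv from the output directory and pass it in.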

Several options expand or modify this default output. For the full list, run:

epcy pred -h

Predictive scores

By default, EPCY returns MCC scores. Other predictive scores can be computed if they are more suitable for your needs. Each of the following parameters adds new columns to the default output:

--ppv:
- kernel_ppv: Positive Predictive Value (PPV, precision) computed by a predictor using KDE.
- kernel_ppv_low, kernel_ppv_high: lower and upper bounds of the 90% confidence interval.
--npv:
- kernel_npv: Negative Predictive Value (NPV) computed by a predictor using KDE.
- kernel_npv_low, kernel_npv_high: lower and upper bounds of the 90% confidence interval.
--tpr:
- kernel_tpr: True Positive Rate (TPR, sensitivity) computed by a predictor using KDE.
- kernel_tpr_low, kernel_tpr_high: lower and upper bounds of the 90% confidence interval.
--tnr:
- kernel_tnr: True Negative Rate (TNR, specificity) computed by a predictor using KDE.
- kernel_tnr_low, kernel_tnr_high: lower and upper bounds of the 90% confidence interval.
--fnr:
- kernel_fnr: False Negative Rate (FNR, miss rate) computed by a predictor using KDE.
- kernel_fnr_low, kernel_fnr_high: lower and upper bounds of the 90% confidence interval.
--fpr:
- kernel_fpr: False Positive Rate (FPR, fall-out) computed by a predictor using KDE.
- kernel_fpr_low, kernel_fpr_high: lower and upper bounds of the 90% confidence interval.
--fdr:
- kernel_fdr: False Discovery Rate (FDR) computed by a predictor using KDE.
- kernel_fdr_low, kernel_fdr_high: lower and upper bounds of the 90% confidence interval.
--for:
- kernel_for: False Omission Rate (FOR) computed by a predictor using KDE.
- kernel_for_low, kernel_for_high: lower and upper bounds of the 90% confidence interval.
--ts:
- kernel_ts: Threat Score (TS, critical success index) computed by a predictor using KDE.
- kernel_ts_low, kernel_ts_high: lower and upper bounds of the 90% confidence interval.
--acc:
- kernel_acc: Accuracy (ACC) computed by a predictor using KDE.
- kernel_acc_low, kernel_acc_high: lower and upper bounds of the 90% confidence interval.
--f1:
- kernel_f1: F1 score (F1) computed by a predictor using KDE.
- kernel_f1_low, kernel_f1_high: lower and upper bounds of the 90% confidence interval.
--auc:
- auc: Area Under the Curve
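
For reference, the scores listed above have standard definitions in terms of confusion-matrix counts (true/false positives and negatives). The sketch below spells them out in plain Python. It only illustrates the definitions: how EPCY derives these quantities from its KDE predictor, and how it estimates the confidence intervals, is not shown here. The function name scores is hypothetical.

```python
import math

def scores(tp, fp, tn, fn):
    """Standard predictive scores from confusion-matrix counts."""
    def ratio(num, den):
        # Return nan instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "ppv": ratio(tp, tp + fp),       # precision
        "npv": ratio(tn, tn + fn),
        "tpr": ratio(tp, tp + fn),       # sensitivity
        "tnr": ratio(tn, tn + fp),       # specificity
        "fnr": ratio(fn, fn + tp),       # miss rate
        "fpr": ratio(fp, fp + tn),       # fall-out
        "fdr": ratio(fp, fp + tp),
        "for": ratio(fn, fn + tn),
        "ts": ratio(tp, tp + fn + fp),   # critical success index
        "acc": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Example: 10 query samples (8 predicted correctly), 20 reference samples (18 correct).
result = scores(tp=8, fp=2, tn=18, fn=2)
print(result["mcc"], result["ppv"], result["tpr"])
```

Unlike the rates, which each summarize one row or column of the confusion matrix, MCC uses all four counts, which makes it less sensitive to unbalanced group sizes.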

Statistical test

EPCY can also perform statistical tests, using:

--auc --utest:
- u_pv: p-value computed by a Mann-Whitney rank test
--ttest:
- t_pv: p-value computed by ttest_ind
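
The Mann-Whitney U statistic and the AUC are closely related: dividing U by the number of (query, reference) pairs gives the AUC. The sketch below shows that relation with a direct pairwise count in plain Python. It is not EPCY's implementation, and the function names are hypothetical; computing the p-value itself is left to a statistics library.

```python
def mann_whitney_u(query, ref):
    """U statistic: number of (query, ref) pairs where the query value is larger.

    Ties count as half a pair.
    """
    u = 0.0
    for q in query:
        for r in ref:
            if q > r:
                u += 1.0
            elif q == r:
                u += 0.5
    return u

def auc_from_u(query, ref):
    """AUC as the fraction of (query, ref) pairs ranked in favor of query."""
    return mann_whitney_u(query, ref) / (len(query) * len(ref))

# Illustrative expression values for one feature.
query_values = [5.0, 6.0, 7.0]
ref_values = [1.0, 2.0, 6.0]
print(mann_whitney_u(query_values, ref_values), auc_from_u(query_values, ref_values))
```

An AUC of 0.5 means the feature does not separate the two groups at all, while values close to 0 or 1 indicate strong separation in one direction or the other.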

Normal distribution

The type of classifier used to evaluate the predictive score of each gene (feature) is a parameter of EPCY. By default, EPCY uses a KDE classifier. It is possible to replace the KDE classifier with a normal classifier, using --normal.

With the normal classifier, all predictive scores listed above remain available. For consistency, the column name of each predictive score then starts with normal instead of kernel (normal_mcc instead of kernel_mcc).
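
To make the difference between the two classifier types concrete, here is a minimal sketch in plain Python. It assumes a Gaussian kernel, fixed bandwidths, and equal weight for both groups; none of these choices is confirmed as EPCY's actual behavior, and all function names are hypothetical. A KDE density is an average of small bumps centered on each sample, while the normal classifier fits a single bell curve per group.

```python
import math
import statistics

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kde_density(x, samples, bw):
    """Gaussian kernel density estimate at x, with bandwidth bw."""
    return sum(gaussian_pdf(x, s, bw) for s in samples) / len(samples)

def normal_density(x, samples):
    """Density at x of a single normal distribution fitted to the samples."""
    return gaussian_pdf(x, statistics.mean(samples), statistics.stdev(samples))

def is_query_kde(x, query, ref, bw_query, bw_ref):
    """Predict 'query' when the query KDE density exceeds the ref KDE density."""
    return kde_density(x, query, bw_query) > kde_density(x, ref, bw_ref)

def is_query_normal(x, query, ref):
    """Predict 'query' when the fitted query normal density exceeds the ref one."""
    return normal_density(x, query) > normal_density(x, ref)

# Illustrative expression values for one feature.
query = [8.1, 8.9, 9.4, 10.2]
ref = [1.0, 2.2, 2.9, 3.5]
print(is_query_kde(9.0, query, ref, bw_query=0.5, bw_ref=0.5))
print(is_query_normal(2.0, query, ref))
```

The KDE variant can follow multimodal distributions (for example, a subgroup inside the query samples), at the cost of depending on a bandwidth choice; the normal variant is simpler but assumes each group is roughly bell-shaped.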

Missing values

On some datasets (for example in proteomics or single-cell), the quantitative matrix can contain missing values (nan). In that case, there are different ways to manage these missing values with EPCY:

- Impute missing values before running EPCY.
- Replace missing values by a constant, using --replacena.
- For each gene (or feature), remove samples with missing values.

If you choose to remove samples with missing values, EPCY returns a predictive_capability.tsv with two additional columns, sample_query and sample_ref, which report, for each gene (feature), the number of query and reference samples actually used (those without missing values).
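
The difference between replacing and removing missing values can be sketched for a single feature as follows. This is an illustration in plain Python under simple assumptions (one list of values per feature, one boolean per sample marking query membership); the function names are hypothetical and do not correspond to EPCY internals.

```python
import math

def replace_missing(values, constant):
    """Replace every nan by a constant (the idea behind --replacena)."""
    return [constant if math.isnan(v) else v for v in values]

def drop_missing(values, is_query):
    """Remove samples with nan and count the remaining samples per group.

    Returns (kept, sample_query, sample_ref), where kept is a list of
    (value, is_query) pairs.
    """
    kept = [(v, q) for v, q in zip(values, is_query) if not math.isnan(v)]
    sample_query = sum(1 for _, q in kept if q)
    sample_ref = len(kept) - sample_query
    return kept, sample_query, sample_ref

# One feature measured on 5 samples: the first two belong to the query group.
values = [1.0, float("nan"), 3.0, float("nan"), 5.0]
is_query = [True, True, False, False, False]

print(replace_missing(values, 0.0))
kept, sample_query, sample_ref = drop_missing(values, is_query)
print(sample_query, sample_ref)
```

Because samples are removed per feature, the counts can differ from one row of the output to the next, which is why they are worth checking before comparing scores across features.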

If you have downloaded the source code or data from git, you can test these procedures using:

epcy pred --norm --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/matrix.tsv -o ./data/small_for_test/using_na
epcy pred --replacena 0 --norm --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/matrix.tsv -o ./data/small_for_test/replace_na