Details of predictive capability columns
By default, epcy pred and epcy pred_rna write an output file, predictive_capability.tsv, with 9 columns:
Default columns:
id: the id of each feature.
l2fc: log2 fold change.
kernel_mcc: Matthews Correlation Coefficient (MCC) computed by a predictor using kernel density estimation (KDE).
kernel_mcc_low, kernel_mcc_high: lower and upper bounds of the 90% confidence interval of kernel_mcc.
mean_query: average value of this feature for samples in the subgroup of interest, defined using the --query parameter.
mean_ref: average value of this feature for samples in the reference group (not in the query subset).
bw_query: estimated bandwidth used by KDE to calculate the density of query samples.
bw_ref: estimated bandwidth used by KDE to calculate the density of ref samples.
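As a rough illustration of how these columns can be used downstream, the sketch below parses a tab-separated table with the nine default column names and keeps features whose lower confidence bound clears a threshold. The rows in TOY_TSV are invented for the example, and top_features and the 0.4 threshold are hypothetical names and values, not part of EPCY.

```python
import csv
import io

# Toy stand-in for predictive_capability.tsv; the values are made up for illustration.
TOY_TSV = (
    "id\tl2fc\tkernel_mcc\tkernel_mcc_low\tkernel_mcc_high\t"
    "mean_query\tmean_ref\tbw_query\tbw_ref\n"
    "geneA\t2.1\t0.85\t0.78\t0.91\t6.3\t4.2\t0.30\t0.25\n"
    "geneB\t-0.4\t0.10\t0.02\t0.19\t3.1\t3.5\t0.41\t0.38\n"
    "geneC\t1.2\t0.55\t0.47\t0.63\t5.0\t3.8\t0.28\t0.33\n"
)

def top_features(handle, min_mcc_low=0.4):
    """Return (id, kernel_mcc) pairs with kernel_mcc_low >= min_mcc_low, best first."""
    rows = csv.DictReader(handle, delimiter="\t")
    kept = [
        (row["id"], float(row["kernel_mcc"]))
        for row in rows
        if float(row["kernel_mcc_low"]) >= min_mcc_low
    ]
    # Sort by kernel_mcc, highest first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

ranked = top_features(io.StringIO(TOY_TSV))
print(ranked)
```

To run this on a real result, replace io.StringIO(TOY_TSV) with an open file handle on predictive_capability.tsv. Filtering on kernel_mcc_low rather than kernel_mcc is a conservative choice made for this sketch only.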
Several options expand or modify this default output. To list them, run:
epcy pred -h
Predictive scores
By default, EPCY returns MCC scores. However, other predictive scores can be computed if they are more suitable for your needs. Each of the following parameters adds a new column to the default output:
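--ppv: positive predictive value (precision)
--npv: negative predictive value
--tpr: true positive rate (sensitivity, recall)
--tnr: true negative rate (specificity)
--fnr: false negative rate
--fpr: false positive rate
--fdr: false discovery rate
--for: false omission rate
--ts: threat score
--acc: accuracy
--f1: F1 score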
--auc:
auc: Area Under the Curve
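For reference, the scores named by these flags have standard definitions in terms of confusion-matrix counts (true/false positives and negatives). The sketch below implements those textbook formulas. It is not EPCY's code, and how EPCY obtains the counts from its predictor is not covered here. The function name scores and the example counts are hypothetical.

```python
import math

def scores(tp, fp, tn, fn):
    """Compute common binary classification scores from confusion-matrix counts."""
    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "tpr": ratio(tp, tp + fn),
        "tnr": ratio(tn, tn + fp),
        "fnr": ratio(fn, fn + tp),
        "fpr": ratio(fp, fp + tn),
        "fdr": ratio(fp, fp + tp),
        "for": ratio(fn, fn + tn),
        "ts": ratio(tp, tp + fn + fp),
        "acc": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Made-up counts: 10 query samples (8 found), 20 reference samples (18 rejected).
example = scores(tp=8, fp=2, tn=18, fn=2)
print(example["mcc"], example["ppv"], example["tnr"])
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, which makes it less sensitive to unbalanced group sizes.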
Statistical test
EPCY can also perform statistical tests, using:
--auc --utest:
u_pv: p-value computed by a Mann-Whitney rank test
--ttest:
t_pv: p-value computed by ttest_ind
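The Mann-Whitney U statistic and the AUC are closely related: for two groups of sizes n_query and n_ref, AUC = U / (n_query * n_ref). This identity may be why --utest is used together with --auc, though that reading is an assumption rather than something stated above. The sketch below computes U and the AUC by direct pair counting on made-up values. It does not compute the p-value (for that, scipy.stats.mannwhitneyu is one option), and the function names are hypothetical.

```python
def mann_whitney_u(query, ref):
    """U statistic for the query group: count pairs with query > ref, ties count 0.5."""
    u = 0.0
    for q in query:
        for r in ref:
            if q > r:
                u += 1.0
            elif q == r:
                u += 0.5
    return u

def auc_from_u(query, ref):
    """AUC as the fraction of (query, ref) pairs ranked in the expected order."""
    return mann_whitney_u(query, ref) / (len(query) * len(ref))

# Made-up expression values for one feature.
query = [5.1, 6.3, 4.8, 7.0]
ref = [3.2, 4.9, 2.7, 3.9, 4.1]

u_stat = mann_whitney_u(query, ref)
auc = auc_from_u(query, ref)
print(u_stat, auc)
```

The double loop is quadratic in the number of samples. It is fine for a sketch, but rank-based implementations are preferable on large cohorts.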
Normal distribution
The type of classifier used to evaluate the predictive score of each gene (feature) is a parameter of EPCY. By default, EPCY uses a KDE classifier. However, the KDE classifier can be replaced by a normal classifier, using --normal.
With the normal classifier, all predictive scores listed above remain available. However, for consistency, the column name of each predictive score starts with normal instead of kernel (normal_mcc vs kernel_mcc).
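To make the idea of a normal classifier concrete, the sketch below fits one Gaussian per group and assigns a value to the group with the higher prior-weighted density. It is a minimal illustration under stated assumptions: priors proportional to group sizes, parameters fitted on all samples, and made-up data. How EPCY actually fits and evaluates its normal classifier is not described here, and normal_classifier is a hypothetical name.

```python
from statistics import NormalDist

def normal_classifier(query, ref):
    """Fit one Gaussian per group; return a function predicting True for 'query'."""
    dist_q = NormalDist.from_samples(query)
    dist_r = NormalDist.from_samples(ref)
    # Class priors taken from group sizes (an assumption of this sketch).
    prior_q = len(query) / (len(query) + len(ref))
    prior_r = 1.0 - prior_q

    def predict(x):
        # Compare prior-weighted densities of the two fitted Gaussians at x.
        return prior_q * dist_q.pdf(x) > prior_r * dist_r.pdf(x)

    return predict

# Made-up expression values for one feature.
query = [5.1, 6.3, 4.8, 7.0]
ref = [3.2, 4.9, 2.7, 3.9, 4.1]

predict = normal_classifier(query, ref)
print(predict(6.5), predict(3.0))
```

A KDE classifier follows the same decision rule but replaces each fitted Gaussian with a kernel density estimate, which is where the bw_query and bw_ref bandwidths come in.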
Missing values
In some datasets (as in proteomics or single-cell), the quantitative matrix can have missing values (nan). In that case, there are several ways to manage these missing values within EPCY:
* Impute missing values before running EPCY.
* Replace missing values by a constant, using --replacena.
* For each gene (or feature), remove samples with missing values.
If you choose to remove samples with missing values, EPCY will return a predictive_capability.tsv with two new columns, sample_query and sample_ref, to report for each gene (feature), the number of query and reference samples used (without missing values).
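The difference between replacing and removing missing values can be sketched for a single feature as follows. The helper names and data are hypothetical. The sketch only mirrors the behaviour described above (a constant substitution, or per-feature removal with the resulting sample counts) and is not EPCY's implementation.

```python
import math

def replace_missing(values, constant):
    """Replace NaN entries by a constant, as the --replacena option is described to do."""
    return [constant if math.isnan(v) else v for v in values]

def drop_missing(values, is_query):
    """Split one feature's values into query/ref lists, skipping NaN entries."""
    query = [v for v, q in zip(values, is_query) if q and not math.isnan(v)]
    ref = [v for v, q in zip(values, is_query) if not q and not math.isnan(v)]
    return query, ref

nan = float("nan")
# Made-up values for one feature across 7 samples (3 query, 4 reference).
values = [5.1, nan, 4.8, 3.2, 4.9, nan, nan]
is_query = [True, True, True, False, False, False, False]

query, ref = drop_missing(values, is_query)
# These counts correspond to what sample_query and sample_ref report.
sample_query, sample_ref = len(query), len(ref)
print(sample_query, sample_ref)
print(replace_missing(values, 0))
```

With removal, each feature may be scored on a different number of samples, which is why the two extra columns are useful when comparing features.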
If you have downloaded the source code or data from the git repository, you can test these procedures using:
epcy pred --norm --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/matrix.tsv -o ./data/small_for_test/using_na
epcy pred --replacena 0 --norm --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/matrix.tsv -o ./data/small_for_test/replace_na