WCLFJHEF

Weighted Combination of Lukasiewicz Implication and Fuzzy Jaccard similarity in Hybrid Ensemble Framework (WCLFJHEF) for Gene Selection

Authors: Sukriti Roy, Joginder Singh, and Shubhra Sankar Ray

research.sr22@gmail.com, joginder265@gmail.com, and shubhra@isical.ac.in

To select genes with WCLFJHEF (using Python 3 and 16 GB RAM), follow these steps:

1. Download cancer datasets from: Expression_leukemia.csv, Expression_breast.csv, Expression_srbct.csv

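2. Open Python and install the packages numpy, pandas, sklearn, ReliefF, mlxtend, matplotlib, skfuzzy, shap, interpret, and xgboost. The modules math, csv, operator, itertools, collections, warnings, and sys are part of the Python standard library and need no installation. mlxtend.feature_selection and interpret.glassbox are submodules of mlxtend and interpret, so installing those two packages is enough.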

Use the command 'pip install package_name', e.g., 'pip install ReliefF'. On systems where pip points to Python 2, use pip3 in place of pip.

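Two packages are installed under a different name than they are imported: for skfuzzy, use 'pip install scikit-fuzzy', and for sklearn, use 'pip install scikit-learn'.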

3. Download code from: WCLFJHEF.py

4. Keep the code and the datasets in the same folder; otherwise, change the folder path and the dataset name in the code (line 45).

5. Run the code and, when prompted, provide the number of genes to be selected.

6. Three files will be generated: Result.csv, Genenames.txt, and Gene_id.txt. "Result.csv" contains the classification results for the selected genes, "Genenames.txt" contains only the gene names, and "Gene_id.txt" contains only the gene indices.

Note: to try other methods, search the code for comments such as "clustering" and "ensembling" and make changes accordingly.
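Before running the code, it can help to confirm that the dataset files are where the code expects them. The following is a minimal sketch, not part of WCLFJHEF.py; the helper name `missing_datasets` is hypothetical, and the file names are the ones listed in step 1.

```python
from pathlib import Path

# Dataset file names as listed in step 1 of this README.
DATASETS = [
    "Expression_leukemia.csv",
    "Expression_breast.csv",
    "Expression_srbct.csv",
]


def missing_datasets(folder=".", names=DATASETS):
    """Return the dataset file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


# Check the current working folder (assumed to be the folder holding the code).
missing = missing_datasets(".")
if missing:
    print("Not found in the current folder:", ", ".join(missing))
else:
    print("All datasets found.")
```

If a file is reported as missing, either move it next to the code or adjust the path in the code as described in step 4.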

The steps to biologically validate the selected genes are as follows:

I. For Gene Ontology:

a) Open webpage: DAVID.

b) Copy the gene names from "Genenames.txt".

c) Paste the gene names in the webpage.

d) Select the identifier type. For example, select 'Affymetrix-3prime-IVT-ID' for the leukemia dataset and 'RefSeqRNA' for the breast cancer dataset.

e) Select the species "Homo sapiens" and click on submit list.
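Steps b) and c) above can be made less error-prone by cleaning the gene list before pasting it. The sketch below assumes that "Genenames.txt" holds one gene name per line, which this README does not state; the function names and the example identifiers are hypothetical.

```python
def clean_gene_list(lines):
    """Strip whitespace, drop blank entries, and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for line in lines:
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


def load_gene_names(path="Genenames.txt"):
    """Read gene names from a text file, assuming one name per line."""
    with open(path, encoding="utf-8") as handle:
        return clean_gene_list(handle)


# In-memory example with made-up identifiers, for illustration only.
example = ["GENE_A\n", "  GENE_B \n", "\n", "GENE_A\n"]
print("\n".join(clean_gene_list(example)))
```

Calling `print("\n".join(load_gene_names()))` in the folder that contains "Genenames.txt" would print a list that can be copied into the webpage directly.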

II. For KEGG pathway:

a) Open webpage: PantherDB.

b) Copy gene names from "Genenames.txt" and paste them in the webpage.

c) Select the analysis "Statistical enrichment test" and click on submit.

To rank the selected genes using an explainable model based on SHAP values:

1. Download code from: ExplainableAI.py.

To run the above code in a Windows environment, use float64 in place of float on line 26.

2. Keep datasets, "Gene_id.txt", and the above code in the same folder and run the code.

3. A file Generank.txt will be generated containing a sorted list of gene names and their corresponding SHAP values.
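To illustrate what a sorted list of genes and SHAP values can look like, here is a minimal sketch that ranks genes by their mean absolute SHAP value, a common convention for global feature importance. Whether ExplainableAI.py aggregates SHAP values in this way is an assumption; the function name, gene names, and numbers are hypothetical, and the SHAP matrix is a toy array rather than the output of a fitted model.

```python
import numpy as np


def rank_genes_by_shap(shap_values, gene_names):
    """Rank genes by mean absolute SHAP value, largest first.

    shap_values: array-like of shape (n_samples, n_genes).
    gene_names:  sequence of n_genes names, in column order.
    """
    shap_values = np.asarray(shap_values, dtype=np.float64)
    if shap_values.ndim != 2 or shap_values.shape[1] != len(gene_names):
        raise ValueError("shap_values must have one column per gene name")
    importance = np.abs(shap_values).mean(axis=0)
    # Negate to sort in descending order; a stable sort keeps ties in input order.
    order = np.argsort(-importance, kind="stable")
    return [(gene_names[i], float(importance[i])) for i in order]


# Toy SHAP matrix: 2 samples x 3 genes (made-up values).
toy_shap = [[0.1, -0.5, 0.0],
            [-0.3, 0.7, 0.1]]
toy_names = ["gene_a", "gene_b", "gene_c"]

ranked = rank_genes_by_shap(toy_shap, toy_names)
for name, score in ranked:
    print(f"{name}\t{score:.3f}")
```

Taking the absolute value before averaging prevents positive and negative contributions of the same gene from cancelling out across samples.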

Hello users! We hope you found this page helpful. N.B.: Before running the above code, make sure all the packages are installed in your working environment or on your local machine.
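The check mentioned in the N.B. can be sketched as follows. This is a minimal sketch, not part of the distributed code: the mapping from import names to pip names is an assumption based on the package list in step 2, and the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import name -> name to pass to pip (assumed from the package list in step 2).
THIRD_PARTY = {
    "numpy": "numpy",
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "ReliefF": "ReliefF",
    "mlxtend": "mlxtend",
    "matplotlib": "matplotlib",
    "skfuzzy": "scikit-fuzzy",
    "shap": "shap",
    "interpret": "interpret",
    "xgboost": "xgboost",
}


def missing_packages(packages=THIRD_PARTY):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in packages.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install the missing packages with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

The check only looks up each package without importing it, so it runs quickly even when heavy packages such as xgboost are installed.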