Significant Research Contribution


Source Code and Some Important Results

  1. FRSAC: Fuzzy-Rough Supervised Attribute/Feature Clustering

  2. MRMS: Rough Set Based Maximum Relevance-Maximum Significance Criterion for Feature/Gene Selection

  3. FEPM: Fuzzy Equivalence Partition Matrix for Feature/Gene Selection

  4. f-mRMR: f-Information Measures for Feature/Gene Selection

  5. RFPCM: Rough Set Based Generalized Fuzzy C-Means Algorithm


Research Contribution

  1. Design of String Kernel Function for Biological Sequence Analysis
    To apply powerful kernel based pattern recognition algorithms such as support vector machines to the prediction of functional sites in proteins, amino acids need encoding prior to input. In this regard, a new string kernel function, termed the modified bio-basis function, is proposed that maps a nonnumerical sequence space to a numerical feature space. The proposed string kernel function is developed based on the conventional bio-basis function and, like the conventional kernel function, needs a bio-basis string as a support. The concept of zone of influence of a bio-basis string is introduced in the proposed kernel function to take into account the influence of each bio-basis string in the nonnumerical sequence space. An efficient method is described to select a set of bio-basis strings for the proposed kernel function, integrating the Fisher ratio and a novel concept of degree of resemblance. The integration enables the method to select a reduced set of relevant and nonredundant bio-basis strings. Some new quantitative indices are introduced for evaluating the quality of the selected bio-basis strings. The effectiveness of the new string kernel function and bio-basis string selection method, along with a comparison with the existing bio-basis function and related bio-basis string selection methods, is demonstrated on different protein data sets using the new quantitative indices and support vector machines.

    Publication:

    1. P. Maji and C. Das, Efficient Design of Bio-Basis Function to Predict Protein Functional Sites Using Kernel-Based Classifiers, IEEE Transactions on NanoBioscience, 9(4), pp. 242--249, December 2010.

    2. P. Maji and C. Das, Protein Functional Sites Prediction Using Modified Bio-Basis Function and Quantitative Indices, IEEE Transactions on NanoBioscience, 9(4), pp. 250--257, December 2010.


  2. Supervised Attribute Clustering
    One of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with sample categories. In this regard, a new supervised attribute clustering algorithm is presented to find such groups of genes. The proposed algorithm directly incorporates the information of sample categories into the gene clustering process. Some new quantitative measures are introduced based on fuzzy-rough sets and mutual information that incorporate the information of sample categories to measure the similarity among genes. The proposed algorithm is based on measuring the similarity between genes using the new quantitative measures, whereby redundancy among the genes is removed. The clusters are refined incrementally based on sample categories. The effectiveness of the proposed algorithm, along with a comparison with existing supervised and unsupervised gene selection and clustering algorithms, is demonstrated on different cancer and arthritis data sets based on the class separability index and the predictive accuracy of the naive Bayes classifier, the K-nearest neighbor rule, and the support vector machine. The biological significance of the generated clusters is interpreted using the gene ontology. An important finding is that the proposed supervised attribute clustering algorithm is effective for identifying biologically significant gene clusters with excellent predictive capability.

    Publication:

    1. P. Maji, Mutual Information Based Supervised Attribute Clustering for Microarray Sample Classification, IEEE Transactions on Knowledge and Data Engineering, 24(1), pp. 127--140, 2012.

    2. P. Maji, Fuzzy-Rough Supervised Attribute Clustering Algorithm and Classification of Microarray Data, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Cybernetics, 41(1), pp. 222--233, February 2011.


  3. MRMS: Rough Set Based Maximum Relevance-Maximum Significance Criterion for Feature/Gene Selection
    A new feature selection algorithm is presented based on rough set theory. It selects a set of features from a data set by maximizing the relevance and significance of the selected features. A theoretical analysis is presented to justify the use of both relevance and significance criteria for selecting a reduced feature set with high predictive accuracy. The importance of rough set theory for computing both relevance and significance of the features is also established. The performance of the proposed algorithm, along with a comparison with other related methods, is studied using the predictive accuracy of K-nearest neighbor rule and support vector machine on three QSAR, five cancer and two arthritis microarray data sets.
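
    The two criteria can be illustrated with a minimal sketch, assuming discrete-valued attributes and the standard rough set dependency degree (the fraction of objects whose equivalence class lies entirely inside one decision class). The function names, the toy decision table, and the greedy scoring rule (relevance plus average significance with respect to the already selected features) are illustrative assumptions, not the published implementation.

```python
from collections import defaultdict

def dependency(rows, attrs, labels):
    """Rough set dependency of the decision on attrs: share of objects whose
    equivalence class (same values on attrs) contains a single class label."""
    seen_labels = defaultdict(set)
    sizes = defaultdict(int)
    for row, y in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        seen_labels[key].add(y)
        sizes[key] += 1
    positive = sum(n for key, n in sizes.items() if len(seen_labels[key]) == 1)
    return positive / len(rows)

def significance(rows, labels, selected_attr, candidate):
    """Change in dependency when candidate is added to selected_attr."""
    both = dependency(rows, [selected_attr, candidate], labels)
    return both - dependency(rows, [selected_attr], labels)

def select_features(rows, labels, k):
    """Greedy selection (assumed scoring): most relevant attribute first, then
    repeatedly add the attribute with the highest relevance + mean significance."""
    remaining = list(range(len(rows[0])))
    first = max(remaining, key=lambda a: dependency(rows, [a], labels))
    selected = [first]
    remaining.remove(first)
    while len(selected) < k and remaining:
        def score(a):
            relevance = dependency(rows, [a], labels)
            mean_sig = sum(significance(rows, labels, s, a) for s in selected) / len(selected)
            return relevance + mean_sig
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy decision table: attribute 0 determines the class exactly.
toy_rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
toy_labels = [0, 0, 1, 1]
print(select_features(toy_rows, toy_labels, 2))
```

    Real-valued gene expression data would need discretization before a crisp dependency of this kind can be computed; that step is omitted here.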

    Publication:

    1. P. Maji and S. Paul, Rough Sets for Selection of Molecular Descriptors to Predict Biological Activity of Molecules, IEEE Transactions on Systems, Man, and Cybernetics, Part C, Applications and Reviews, 40(6), pp. 639--648, November 2010.

    2. P. Maji and S. Paul, Rough Set Based Maximum Relevance-Maximum Significance Criterion and Gene Selection from Microarray Data, International Journal of Approximate Reasoning, 52(3), pp. 408--426, March 2011.


  4. FEPM: Fuzzy Equivalence Partition Matrix for Feature/Gene Selection
    The selection of nonredundant and relevant features of real valued data sets is a highly challenging problem. A novel feature selection method is presented here based on fuzzy-rough sets by maximizing the relevance and minimizing the redundancy of the selected features. By introducing the fuzzy equivalence partition matrix, a novel representation of Shannon's entropy for fuzzy approximation spaces is proposed to measure the relevance and redundancy of features suitable for real valued data sets. The fuzzy equivalence partition matrix is based on the theory of fuzzy-rough sets, where each row of the matrix represents a fuzzy equivalence partition that can be automatically derived from the given data. It also offers an efficient way to calculate many more information measures, termed f-information measures. Several f-information measures are shown to be effective for selecting nonredundant and relevant features of real valued data sets. This work compares the performance of different f-information measures for feature selection in fuzzy approximation spaces. Some quantitative indices are introduced based on fuzzy-rough sets for evaluating the performance of the proposed method. The effectiveness of the proposed method, along with a comparison with other methods, is demonstrated on a set of real life data sets and microarray gene expression data.

    Publication:

    1. P. Maji and S. K. Pal, Feature Selection Using f-Information Measures in Fuzzy Approximation Spaces, IEEE Transactions on Knowledge and Data Engineering, 22(6), pp. 854--867, June 2010.

    2. P. Maji and S. K. Pal, Fuzzy-Rough Sets for Information Measures and Selection of Relevant Genes from Microarray Data, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Cybernetics, 40(3), pp. 741--752, June 2010.


  5. f-mRMR: f-Information Measures for Feature/Gene Selection
    Among the large number of genes present in microarray gene expression data, only a small fraction is effective for performing a certain diagnostic test. In this regard, mutual information has been shown to be successful for selecting a set of relevant and nonredundant genes from microarray data. However, information theory offers many more measures, such as the f-information measures, that may be suitable for selection of genes from microarray gene expression data. This work presents different f-information measures as the evaluation criteria for the gene selection problem. To compute the gene-gene redundancy (respectively, gene-class relevance), these information measures calculate the divergence of the joint distribution of two genes' expression values (respectively, the expression values of a gene and the class labels of samples) from the joint distribution when the two genes (respectively, the gene and class label) are considered to be completely independent. The performance of different f-information measures is compared with that of mutual information based on the predictive accuracy of the naive Bayes classifier, the K-nearest neighbor rule, and the support vector machine. An important finding is that some f-information measures are effective for selecting relevant and nonredundant genes from microarray data. The effectiveness of different f-information measures, along with a comparison with mutual information, is demonstrated on breast cancer, leukemia, and colon cancer data sets. While some f-information measures provide 100% prediction accuracy for all three microarray data sets, mutual information attains this accuracy only for the breast cancer data set, and 98.6% and 93.6% for the leukemia and colon cancer data sets, respectively.
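
    The divergence between the joint distribution and the product of the marginals can be sketched as follows. This is a minimal illustration, assuming already discretized values and showing only two measures: mutual information and the V-information (sum of absolute differences between the joint and the product of marginals). The function names and toy vectors are assumptions made for the example.

```python
import math
from collections import Counter

def distributions(xs, ys):
    """Empirical joint and marginal distributions of two discrete sequences."""
    n = len(xs)
    joint = {pair: c / n for pair, c in Counter(zip(xs, ys)).items()}
    px = {v: c / n for v, c in Counter(xs).items()}
    py = {v: c / n for v, c in Counter(ys).items()}
    return joint, px, py

def mutual_information(xs, ys):
    """Sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    joint, px, py = distributions(xs, ys)
    return sum(p * math.log2(p / (px[a] * py[b])) for (a, b), p in joint.items())

def v_information(xs, ys):
    """Sum over all value pairs of |p(x,y) - p(x) p(y)|."""
    joint, px, py = distributions(xs, ys)
    return sum(abs(joint.get((a, b), 0.0) - px[a] * py[b]) for a in px for b in py)

# Hypothetical discretized gene and class labels.
gene = [0, 0, 1, 1]
same_as_class = [0, 0, 1, 1]      # fully dependent on the gene
unrelated = [0, 1, 0, 1]          # independent of the gene

print(mutual_information(gene, same_as_class), v_information(gene, same_as_class))
print(mutual_information(gene, unrelated), v_information(gene, unrelated))
```

    Both measures are zero when the two variables are independent and grow as the joint distribution moves away from the product of the marginals, which is what lets either serve as a relevance (gene versus class) or redundancy (gene versus gene) score.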

    Publication:

    1. P. Maji, f-Information Measures for Efficient Selection of Discriminative Genes from Microarray Data, IEEE Transactions on Biomedical Engineering, 56(4), pp. 1063--1069, April 2009.


  6. RFPCM: Rough Set Based Generalized Fuzzy C-Means Algorithm
    A generalized hybrid unsupervised learning algorithm, termed as rough-fuzzy-possibilistic c-means, is proposed in this work. It comprises a judicious integration of the principles of rough sets and fuzzy sets. While the concept of lower and upper approximations of rough sets deals with uncertainty, vagueness, and incompleteness in class definition, the membership function of fuzzy sets enables efficient handling of overlapping partitions. It incorporates both probabilistic and possibilistic memberships simultaneously to avoid the problems of noise sensitivity of fuzzy c-means and the coincident clusters of possibilistic c-means. The concept of crisp lower bound and fuzzy boundary of a class, introduced in rough-fuzzy-possibilistic c-means, enables efficient selection of cluster prototypes. The algorithm is generalized in the sense that all the existing variants of c-means algorithms can be derived from the proposed algorithm as a special case. Several quantitative indices are introduced based on rough sets for evaluating the performance of the proposed c-means algorithm. The effectiveness of the algorithm, along with a comparison with other algorithms, has been demonstrated both qualitatively and quantitatively on a set of real life data sets.
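
    The difference between probabilistic and possibilistic memberships can be seen in a small sketch. It uses the textbook fuzzy c-means membership and the textbook possibilistic c-means typicality, not the combined rough-fuzzy-possibilistic update itself; the function names, the scale parameters `etas`, and the sample distances are assumptions for illustration.

```python
def fcm_memberships(dists, m=2.0):
    """Fuzzy c-means memberships of one point, given its (positive) distances
    to every cluster prototype. Memberships always sum to 1."""
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

def pcm_typicalities(dists, etas, m=2.0):
    """Possibilistic c-means typicalities: each value depends only on the
    distance to its own prototype and the scale eta of that cluster."""
    power = 1.0 / (m - 1.0)
    return [1.0 / (1.0 + (d * d / eta) ** power) for d, eta in zip(dists, etas)]

# A point close to the first of two prototypes.
print(fcm_memberships([1.0, 3.0]))

# A distant outlier, equally far from both prototypes (hypothetical values).
outlier = [10.0, 10.0]
print(fcm_memberships(outlier))               # still forced to share 0.5 / 0.5
print(pcm_typicalities(outlier, [1.0, 1.0]))  # close to zero for both clusters
```

    The outlier receives a membership of 0.5 in each cluster under the sum-to-one constraint, which is the noise sensitivity mentioned above, while its typicality in both clusters is near zero.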

    Publication:

    1. P. Maji and S. K. Pal, Rough Set Based Generalized Fuzzy C-Means Algorithm and Quantitative Indices, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Cybernetics, 37(6), pp. 1529--1540, December 2007.

    2. P. Maji and S. K. Pal, RFCM: A Hybrid Clustering Algorithm Using Rough and Fuzzy Sets, Fundamenta Informaticae, 80(4), pp. 475--496, November 2007.

    3. P. Maji and S. K. Pal, Maximum Class Separability for Rough-Fuzzy C-Means Based Brain MR Image Segmentation, LNCS Transactions on Rough Sets, IX, 5390, pp. 114--134, December 2008.


  7. RFCMdd: Rough-Fuzzy C-Medoids Algorithm
    In most pattern recognition algorithms, amino acids cannot be used directly as inputs since they are nonnumerical variables. They, therefore, need encoding prior to input. In this regard, the bio-basis function maps a nonnumerical sequence space to a numerical feature space. It is designed using an amino acid mutation matrix. One of the important issues for the bio-basis function is how to select a minimum set of bio-bases with maximum information. In this work, the rough-fuzzy c-medoids algorithm is proposed to select the most informative bio-bases. It comprises a judicious integration of the principles of rough sets, fuzzy sets, the c-medoids algorithm, and the amino acid mutation matrix. While the membership function of fuzzy sets enables efficient handling of overlapping partitions, the concept of lower and upper bounds of rough sets deals with uncertainty, vagueness, and incompleteness in class definition. The concept of crisp lower bound and fuzzy boundary of a class, introduced in rough-fuzzy c-medoids, enables efficient selection of a minimum set of the most informative bio-bases. Some new indices are introduced for quantitatively evaluating the quality of the selected bio-bases. The effectiveness of the proposed algorithm, along with a comparison with other algorithms, has been demonstrated on different types of protein data sets.
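
    How a bio-basis function turns a sequence into a number can be sketched as below, assuming the commonly cited form exp(gamma * (h(x, b) - h(b, b)) / h(b, b)), where h sums pairwise scores from a mutation matrix over aligned positions. The scoring function here is a hypothetical stand-in (4 for a match, 1 for a mismatch), not a real amino acid mutation matrix, and the sequences and gamma value are illustrative only.

```python
import math

def toy_score(a, b):
    """Hypothetical stand-in for a mutation matrix entry (NOT a real matrix)."""
    return 4 if a == b else 1

def homology(x, b, score=toy_score):
    """Sum of pairwise scores over aligned positions of two equal-length strings."""
    if len(x) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(score(p, q) for p, q in zip(x, b))

def bio_basis(x, b, gamma=1.0, score=toy_score):
    """Similarity of sequence x to bio-basis string b; equals 1 when x == b."""
    h_bb = homology(b, b, score)
    return math.exp(gamma * (homology(x, b, score) - h_bb) / h_bb)

def encode(x, bases, gamma=1.0):
    """Numerical feature vector of x: one bio-basis value per selected string."""
    return [bio_basis(x, b, gamma) for b in bases]

# Hypothetical 4-residue subsequences used as bio-basis strings.
bases = ["KQTA", "LLGA"]
print(encode("KQTG", bases))
```

    Each selected bio-basis string contributes one coordinate of the feature vector, which is why choosing a small but informative set of bio-bases directly controls the dimensionality of the encoded data.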

    Publication:

    1. P. Maji and S. K. Pal, Rough-Fuzzy C-Medoids Algorithm and Selection of Bio-Basis for Amino Acid Sequence Analysis, IEEE Transactions on Knowledge and Data Engineering, 19(6), pp. 859--872, June 2007.

    2. P. Maji and S. K. Pal, Protein Sequence Analysis Using Relational Soft Clustering Algorithms, International Journal of Computer Mathematics, 84(5), pp. 599--617, May 2007.


  8. Content Based Image Retrieval Using Visually Significant Point Features
    This work presents a new image retrieval scheme using visually significant point features. The clusters of points around significant curvature regions (high, medium, and weak type) are extracted using a fuzzy set theoretic approach. Some invariant color features are computed from these points to evaluate the similarity between images. A set of relevant and non-redundant features is selected using the mutual information based minimum redundancy-maximum relevance framework. The relative importance of each feature is evaluated using a fuzzy entropy based measure, which is computed from the sets of retrieved images marked relevant and irrelevant by the users. The performance of the system is evaluated using different sets of examples from a general purpose image database. The robustness of the system is also shown when the images undergo different transformations.

    Publication:

    1. M. Banerjee, M. K. Kundu, and P. Maji, Content-Based Image Retrieval Using Visually Significant Point Features, Fuzzy Sets and Systems, 160(23), pp. 3323--3341, December 2009.


  9. Second Order Fuzzy Measure and Weighted Co-Occurrence Matrix for Segmentation of Brain MR Images
    A robust thresholding technique is proposed in this work for segmentation of brain MR images. It is based on fuzzy thresholding techniques. Its aim is to threshold the gray level histogram of brain MR images by splitting the image histogram into multiple crisp subsets. The histogram of the given image is thresholded according to the similarity between gray levels. The similarity is assessed through a second order fuzzy measure such as fuzzy correlation, fuzzy entropy, or index of fuzziness. To calculate the second order fuzzy measure, a weighted co-occurrence matrix is presented, which extracts the local information more accurately. Two quantitative indices are introduced to determine the multiple thresholds of the given histogram. The effectiveness of the proposed algorithm, along with a comparison with standard thresholding techniques, is demonstrated on a set of brain MR images.

    Publication:

    1. P. Maji, M. K. Kundu, and B. Chanda, Second Order Fuzzy Measure and Weighted Co-Occurrence Matrix for Segmentation of Brain MR Images, Fundamenta Informaticae, 88(1-2), pp. 161--176, 2008.


  10. NNTree: Efficient Design of Neural Network Tree Using A New Splitting Criterion
    This work presents the design of a hybrid learning model, termed the neural network tree (NNTree). It incorporates the advantages of both decision trees and neural networks. An NNTree is a decision tree in which each non-terminal node contains a neural network. The idea of the proposed method is to use the framework of the multilayer perceptron to design a tree-structured pattern classifier. At each non-terminal node, the multilayer perceptron partitions the dataset into m subsets, m being the number of classes in the dataset present at that node. The NNTree is designed by splitting the non-terminal nodes of the tree so as to maximize the classification accuracy of the multilayer perceptron. In effect, it produces a reduced height m-ary tree. The versatility of the proposed scheme is illustrated through its application in diverse fields. The effectiveness of the hybrid algorithm, along with a comparison with other related algorithms, has been demonstrated on a set of benchmark datasets. Simulation results show that the NNTree achieves excellent performance in terms of classification accuracy, size of the tree, and classification time.

    Publication:

    1. P. Maji, Efficient Design of Neural Network Tree Using A New Splitting Criterion, Neurocomputing, 71(4-6), pp. 787--800, January 2008.

    2. P. Maji and C. Das, Pattern Classification Using NNTree: Design and Application for Biological Dataset, Journal of Intelligent Systems, 17(1-3), pp. 51--71, 2008.


  11. Cellular Automata Evolution for Pattern Recognition
    The study of Cellular Automata (CA) as a modeling tool has received considerable attention in recent years. Researchers from different fields (image processing, language recognition, pattern recognition, VLSI testing, etc.) have proposed CA based models for different applications. However, the research community from diverse disciplines looks forward to a versatile and robust CA based modeling tool to study physical systems observed in nature or designed artificially. The following two objectives need due consideration to build such a tool: the analytical framework to derive the global dynamics of the CA from the local rules; and inversely, to derive the local rules of the desired CA simulating the global behavior of the system to be modeled. The thesis realizes these two objectives for a specific application domain. It presents an analysis and synthesis framework of a special class of linear CA termed Multiple Attractor CA (MACA). This class of CA employs only XOR logic as the next state function of a cell. The thesis addresses a number of open problems associated with MACA characterization. The new characterization has provided the foundation for MACA synthesis in O(n) time complexity, where n is the number of CA cells. The synthesis framework is supported with genetic evolution to arrive at the desired local rules of a CA modeling a global function. The versatility of the proposed model derives from three basic frameworks: analysis, synthesis, and genetic evolution. The MACA based modeling tool is next employed to address the pattern recognition problem. The thesis presents the design and application of a MACA based pattern classifier in diverse fields like data mining, image compression, fault diagnosis, etc. The evolution scheme is next extended for non-linear CA synthesis. Such a CA employs all types of next state function (linear/additive/non-linear) for a CA cell. The non-linear CA is explored for modeling associative memory. An in-depth study of the non-linear CA state space is also reported. Solutions of many real life problems display fuzzy characteristics. In this context, the last section of the thesis introduces Fuzzy CA (FCA), the CA that employs fuzzy logic as the next state function. The application of FCA has been demonstrated in the field of pattern recognition/classification.
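
    A CA whose cells use only XOR logic can be sketched in a few lines. The example below is a generic three-neighborhood linear CA with a null boundary; it illustrates the XOR next state function and the resulting linearity, not the MACA characterization or synthesis procedure. The rule encoding as (left, self, right) masks and the chosen rule vector are assumptions for illustration.

```python
# Each rule says which neighbors are XORed: (use_left, use_self, use_right).
RULE_90 = (1, 0, 1)    # next = left XOR right
RULE_150 = (1, 1, 1)   # next = left XOR self XOR right

def step(state, rules):
    """One synchronous update of a null-boundary CA with XOR-only rules."""
    n = len(state)
    nxt = []
    for i, (use_left, use_self, use_right) in enumerate(rules):
        left = state[i - 1] if i > 0 else 0        # null boundary on the left
        right = state[i + 1] if i < n - 1 else 0   # null boundary on the right
        nxt.append((use_left & left) ^ (use_self & state[i]) ^ (use_right & right))
    return nxt

def xor_states(a, b):
    """Bitwise XOR of two CA states."""
    return [p ^ q for p, q in zip(a, b)]

# Hypothetical 4-cell hybrid rule vector.
rules = [RULE_150, RULE_90, RULE_150, RULE_90]
print(step([1, 0, 0, 0], rules))

# Linearity: stepping the XOR of two states equals the XOR of their successors.
a, b = [1, 0, 1, 1], [0, 1, 1, 0]
print(step(xor_states(a, b), rules) == xor_states(step(a, rules), step(b, rules)))
```

    Because every next state function is an XOR of neighbors, one update is a linear map over GF(2), which is what makes the global dynamics of such a CA amenable to analysis from its local rules.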

    Publication:

    1. P. Maji, On Characterization of Attractor Basins of Fuzzy Multiple Attractor Cellular Automata, Fundamenta Informaticae, 86(1-2), pp. 143--168, 2008.

    2. P. Maji and P. Pal Chaudhuri, Non-uniform Cellular Automata Based Associative Memory: Evolutionary Design and Basins of Attraction, Information Sciences, 178(10), pp. 2315--2336, May 2008.

    3. P. Maji and P. Pal Chaudhuri, RBFFCA: A Hybrid Pattern Classifier Using Radial Basis Function and Fuzzy Cellular Automata, Fundamenta Informaticae, 78(3), pp. 369--396, August 2007.

    4. P. Maji and P. Pal Chaudhuri, Fuzzy Cellular Automata for Modeling Pattern Classifier, IEICE Transactions on Information and Systems, E88-D(4), pp. 691--702, April 2005.

    5. N. Ganguly, P. Maji, B. K. Sikdar, and P. Pal Chaudhuri, Design and Characterization of Cellular Automata Based Associative Memory for Pattern Recognition, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Cybernetics, 34(1), pp. 672--678, February 2004.

    6. P. Maji, C. Shaw, N. Ganguly, B. K. Sikdar, and P. Pal Chaudhuri, Theory and Application of Cellular Automata for Pattern Classification, Fundamenta Informaticae, 58(3-4), pp. 321--354, December 2003.

    7. P. Maji, N. Ganguly, and P. Pal Chaudhuri, Error Correcting Capability of Cellular Automata Based Associative Memory, IEEE Transactions on Systems, Man, and Cybernetics, Part A, Systems and Humans, 33(4), pp. 466--480, July 2003.

    8. N. Ganguly, P. Maji, B. K. Sikdar, and P. Pal Chaudhuri, Generalized Multiple Attractor Cellular Automata (GMACA) Model for Associative Memory, International Journal of Pattern Recognition and Artificial Intelligence, 16(7), pp. 781--795, November 2002.



