Short Term Course on Machine Learning for Practitioners

Ankita Mandal and Aparajita Khan

Indian Statistical Institute

E-mail: amandal@isical.ac.in, aparajitak_r@isical.ac.in

November 19, 2019

Supervised Feature Selection

Hypothesis Testing based feature selection using $t$-Test

Two classes:

\begin{align} \text{Class} \space \space \boldsymbol{\omega_1} \space \space \text{with mean} \space \space \mu_1 \space \space \text{variance} \space \space \sigma_1^2 \\ \text{Class} \space \space \boldsymbol{\omega_2} \space \space \text{with mean} \space \space \mu_2 \space \space \text{variance} \space \space \sigma_2^2 \end{align}

Assumption: $ \sigma_1^2 = \sigma_2^2 = \sigma^2 $

Hypothesis design for feature selection in classification problem:

\begin{align} & H_1 \mbox{ (Difference between class means is non-zero): }& \mu_1 - \mu_2 \neq 0 \\ & H_0 \mbox{ (Difference between class means is zero): }& \mu_1 - \mu_2 = 0 \\ \end{align}

Feature with higher difference between class means has greater class separability, hence better feature.

In [1]:
N=200
w1<-cbind(rnorm(N,mean=5,sd=1),rnorm(N,mean=4,sd=1))
w2<-cbind(rnorm(N,mean=15,sd=1),rnorm(N,mean=6,sd=1))
X<-rbind(w1,w2)
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N))
colvec = c("coral3","darkseagreen3")[true]
pchs= c(22,24)[true]
plot(X,col="black",bg=colvec,pch=pchs,xlab="Feature1",ylab="Feature2",main="Scatter plot of Data")
 Glimpse of Dataset X: 
         [,1]     [,2]
[1,] 3.789171 4.237898
[2,] 6.264211 4.287199
[3,] 5.703193 3.444035
[4,] 4.090515 2.175635
[5,] 6.070596 5.547517

 Dimension of Dataset: 	 Samples: 400 	 Features: 2

Test Statistic

\begin{equation} q = \frac{(\bar{x}-\bar{y})-(\mu_1 - \mu_2)}{s \sqrt{\frac{2}{N}}} \tag{1} \end{equation}

where \begin{align} & \bar{x}= \frac{1}{N} \sum\limits_{i=1}^{N} x_i, \hspace{1cm} x_i \mbox{ are samples values of the feature in class } \boldsymbol{\omega_1} \\ & \bar{y}= \frac{1}{N} \sum\limits_{i=1}^{N} y_i, \hspace{1cm} y_i \mbox{ are samples values of the feature in class } \boldsymbol{\omega_2} \\ & s^2= \frac{1}{2N-2} \left( \sum\limits_{i=1}^{N} (x_i - \bar{x})^2 + \sum\limits_{i=1}^{N} (y_i - \bar{y})^2\right) \hspace{1cm} \mbox{class-wise sample variance.} \end{align}

According to null-hypothesis $H_0$, $\mu_1 - \mu_2=0$

$q$ follows $t$-distributution with $2N-2$ degrees of freedom.

For Feature 1:

In [2]:
f=X[,1]
fw1=f[1:N]
fw2=f[(N+1):(2*N)]
xbar=mean(fw1)
ybar=mean(fw2)
s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
q=(xbar-ybar)/(s*sqrt(2/N))
cat("\n xbar=",xbar," ybar=",ybar)
cat("\n\n Test statistic q= ",q)
cat("\n q follows t- distribution with ",2*N-2," degrees of freedom")
 xbar= 4.949148  ybar= 15.03573

 Test statistic q=  -7.19591
 q follows t- distribution with  398  degrees of freedom

Compute $p$ value

If p-value < 0.05: Feature has signficant difference between the class means

If p-value $\geq$ 0.05: Difference between class means along the feature is not significant

In [3]:
p.val=pt(q,2*N-2)
cat("\n p-value for hypothesis test on Feature 1= ",p.val)
alpha=0.05
if(p.val< alpha){
    cat("\n Feature is relevant")  
}else     
    cat("\n Feature  is not relevant, lacks class separability")
 p-value for hypothesis test on Feature 1=  1.5546e-12
 Feature is relevant

For Feature 2:

In [4]:
f=X[,2]
fw1=f[1:N]
fw2=f[(N+1):(2*N)]
xbar=mean(fw1)
ybar=mean(fw2)
s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
q=(xbar-ybar)/(s*sqrt(2/N))
cat("\n xbar=",xbar," ybar=",ybar)
cat("\n\n Test statistic q= ",q)
p.val=pt(q,2*N-2)
cat("\n p-value for hypothesis test on Feature 2= ",p.val)
if(p.val< alpha){
    cat("\n Feature is relevant")  
}else     
    cat("\n Feature  is not relevant, lacks class separability")
 xbar= 4.016379  ybar= 6.02883

 Test statistic q=  -1.330049
 p-value for hypothesis test on Feature 2=  0.09213189
 Feature  is not relevant, lacks class separability

Selection of relevant features from multiple features

In [5]:
N=200
w1<-cbind(rnorm(N,mean=4,sd=1),rnorm(N,mean=7,sd=1),rnorm(N,mean=10,sd=1),rnorm(N,mean=-3.5,sd=1))
w2<-cbind(rnorm(N,mean=7,sd=1),rnorm(N,mean=16,sd=1),rnorm(N,mean=12,sd=1),rnorm(N,mean=2,sd=1))
X<-rbind(w1,w2)
colnames(X) <- c("Feature1","Feature2","Feature3","Feature4")
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N))
colvec = c("deepskyblue3","orange2")[true]
pchs= c(22,24)[true]
pairs(X, col=colvec, pch=pchs)
 Glimpse of Dataset X: 
     Feature1 Feature2  Feature3  Feature4
[1,] 3.174616 7.783940  9.648941 -3.437333
[2,] 2.675435 7.345537  8.887468 -4.900301
[3,] 4.798304 6.754174 10.188016 -4.987679
[4,] 3.822017 5.577670 10.504906 -3.293340
[5,] 3.950374 7.028291  9.211185 -2.662460

 Dimension of Dataset: 	 Samples: 400 	 Features: 4
In [6]:
d=ncol(X)
q.val=rep(0,d)
for(i in 1:d)
{
    f=X[,i]
    fw1=f[1:N]
    fw2=f[(N+1):(2*N)]
    xbar=mean(fw1)
    ybar=mean(fw2)
    s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
    q=(xbar-ybar)/(s*sqrt(2/N))
    p.val=pt(q,2*N-2)
    cat("\n\n Test statistic q for Feature ",i," = ",q,"p-value=",p.val)
    q.val[i]=q
}
ord=sort(abs(q.val),decreasing=TRUE,index.return=TRUE)$ix
cat("\n\n Ordering of features based on t-Test=",ord)

 Test statistic q for Feature  1  =  -2.250143 p-value= 0.0124927

 Test statistic q for Feature  2  =  -6.566627 p-value= 8.052884e-11

 Test statistic q for Feature  3  =  -1.322181 p-value= 0.09343364

 Test statistic q for Feature  4  =  -3.71284 p-value= 0.0001170868

 Ordering of features based on t-Test= 2 4 1 3

Feature selection using Fisher Discriminant Ratio

Multi-class case:

Classes = $M$

\begin{equation} FDR =\sum\limits_{i=1}^M \sum\limits_{j \neq i}^M \frac{ (\mu_i -\mu_j)^2}{\sigma_i^2+\sigma_j^2} \tag{2} \end{equation}
In [7]:
cat("IRIS dataset\n")
print(iris[1:5,])
X<-iris[,-5]
class=as.numeric(iris$Species)
cat("\n Samples: ",dim(X)[1],"\t Features: ",dim(X)[2],"\t Classes: 3:- ",paste0(levels(iris$Species),collapse=", "))
colvec = c("coral3","darkseagreen3","darkgoldenrod2")[class]
pchs= c(22,23,24)[class]
pairs(X, col=colvec, pch=pchs)
IRIS dataset
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

 Samples:  150 	 Features:  4 	 Classes: 3:-  setosa, versicolor, virginica
In [8]:
nclass=length(unique(class))
FDR=rep(0,ncol(X))
for(d in 1:ncol(X))
{
    FDRd=0
    f=X[,d]
     for(i in 1:nclass)
     {
         for(j in 1:nclass)
         {
            if(i!=j)
            {
                fi=f[which(class==i)]
                fj=f[which(class==j)]
                FDRd=FDRd+((mean(fi)-mean(fj))^2)/(var(fi)+var(fj))
            }
            
         } 
     }
    cat("\n FDR for Feature ",d," = ",FDRd)
    FDR[d]=FDRd
}
ord=sort(abs(FDR),decreasing=TRUE,index.return=TRUE)$ix
cat("\n\n Ordering of features based on FDR: \n",paste0(colnames(X)[ord],collapse=", "))
 FDR for Feature  1  =  15.16455
 FDR for Feature  2  =  5.651219
 FDR for Feature  3  =  168.686
 FDR for Feature  4  =  128.2398

 Ordering of features based on FDR: 
 Petal.Length, Petal.Width, Sepal.Length, Sepal.Width

Feature selection using Scatter Matrices

Generate Dataset

In [9]:
N=250
w1<-cbind(rnorm(N,mean=5,sd=1.5),rnorm(N,mean=4,sd=1.2))
w2<-cbind(rnorm(N,mean=9,sd=1.3),rnorm(N,mean=8,sd=1.8))
w3<-cbind(rnorm(N,mean=15,sd=1),rnorm(N,mean=1,sd=1))
X<-rbind(w1,w2,w3)
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N),rep(3,N))
colvec = c("lightpink2","turquoise2","darkolivegreen")[true]
pchs= c(22,24,21)[true]
plot(X,col="black",bg=colvec,pch=pchs,xlab="Feature1",ylab="Feature2",main="Scatter plot of Data")
 Glimpse of Dataset X: 
         [,1]     [,2]
[1,] 3.596704 2.816212
[2,] 5.376808 2.965242
[3,] 5.921716 5.442881
[4,] 4.947277 4.982256
[5,] 4.056940 5.642649

 Dimension of Dataset: 	 Samples: 750 	 Features: 2

$M$ classes: $\space \space \boldsymbol{\omega_1}, \space \boldsymbol{\omega_2}, \space ...\space, \space \boldsymbol{\omega_M}$

Class-wise prior probability: \begin{equation} P(\boldsymbol{\omega_i}) \simeq \frac{N_i}{N} \end{equation} $\space N_i$ number of samples in class $\boldsymbol{\omega_i}$ out of a total of $N$ samples.

Class specific covariance matrix: \begin{equation} \Sigma_i=E[(x-\mu_i)(x-\mu_i)^T] \end{equation}

Global mean vector: \begin{equation} \mu_0= \sum \limits_{i=1}^M P(\boldsymbol{\omega_i}) \mu_i \end{equation}

In [10]:
N=nrow(X)
nclass=length(unique(class))
mu0<-colMeans(X)
S<-list()
Mu<-list()
P<-list()
for(i in 1:nclass)
{
    wi=X[which(class==i),]
    Ni=nrow(wi)
    P[[i]]<-Ni/N
    Mu[[i]]<-colMeans(wi)
    S[[i]]<-cov(wi)
    cat("\n\n Class ",i,": ")
    cat("\n Prior P: ",P[[i]])
    cat("\n Mean Mu: ",Mu[[i]])
    cat("\n Covariance Matrix: \n")
    print(S[[i]])
}

 Class  1 : 
 Prior P:  0.06666667
 Mean Mu:  5.114443 4.004758
 Covariance Matrix: 
           [,1]       [,2]
[1,]  1.9929670 -0.4663106
[2,] -0.4663106  1.7277439


 Class  2 : 
 Prior P:  0.06666667
 Mean Mu:  5.12744 3.734254
 Covariance Matrix: 
          [,1]      [,2]
[1,] 2.0348658 0.1327746
[2,] 0.1327746 1.4389839


 Class  3 : 
 Prior P:  0.06666667
 Mean Mu:  5.376476 3.849957
 Covariance Matrix: 
            [,1]        [,2]
[1,]  2.79877673 -0.06810733
[2,] -0.06810733  1.31435905

Within-class scatter matrix

\begin{equation} S_w =\sum\limits_{i=1}^M P(\boldsymbol{\omega_i}) \Sigma_i \tag{3} \end{equation}

Between-class scatter matrix

\begin{equation} S_b=\sum \limits_{i=1}^M P(\boldsymbol{\omega_i}) (\mu_i-\mu_0)(\mu_i-\mu_0)^T \tag{4} \end{equation}

Mixture scatter matrix

\begin{equation} S_m=E[(x-\mu_0)(x-\mu_0)^T]=S_w+S_b \tag{5} \end{equation}
In [11]:
Sw=0
Sb=0
for(i in 1:nclass)
{
    mui0=Mu[[i]]-mu0
    Sw=Sw+ P[[i]]*S[[i]]
    Sb=Sb+ P[[i]]*outer(mui0,mui0)
}
Sm=Sw+Sb
cat("\n\n Within-class scatter matrix: \n")
print(Sw)
cat("\n\n Between-class scatter matrix: \n")
print(Sb)
cat("\n\n Mixture scatter matrix: \n")
print(Sm)

 Within-class scatter matrix: 
            [,1]        [,2]
[1,]  0.45510730 -0.02677622
[2,] -0.02677622  0.29873912


 Between-class scatter matrix: 
          [,1]      [,2]
[1,] 4.0280189 0.3458786
[2,] 0.3458786 0.0322358


 Mixture scatter matrix: 
          [,1]      [,2]
[1,] 4.4831262 0.3191024
[2,] 0.3191024 0.3309749

Class separability measures

\begin{equation} J_1=\frac{trace\{S_m\}}{trace\{S_w\}} \tag{5} \end{equation}\begin{equation} J_2=\frac{\lvert S_m \rvert}{\lvert S_w \rvert}=\lvert S_w^{-1} S_m \rvert \tag{6} \end{equation}\begin{equation} J_3=trace\{ S_w^{-1} S_m \} \tag{7} \end{equation}

$\lvert A \rvert$ denotes determinant of matrix $A$.

In [12]:
J1=sum(diag(Sm))/sum(diag(Sw))
J2=det(solve(Sw)%*%Sm)
J3=sum(diag(solve(Sw)%*%Sm))

cat("\n J1= ",J1,"\n J2= ",J2,"\n J3= ",J3)
 J1=  6.38605 
 J2=  10.21859 
 J3=  11.14306