1 Description of the dataset

The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA).

The BRCA data is publicly available (https://cancergenome.nih.gov/) and is decomposed into two datasets:

  1. the gene expression data, composed of 57251 variables measured on 1222 samples from 1097 individuals. Of those samples, 1102 are from primary solid tumor, 7 from metastatic tissue and 113 from normal tissue;

  2. the clinical data, composed of 113 variables obtained from the same individuals (much more is available in the gdc data, such as follow_up, drug, radiation, …).

2 Importing the dataset

The BRCA datasets will be imported using the package ‘brca.data’. Paired samples will be considered for model building, corresponding to individuals for whom normal and cancer tissue was sampled, i.e., 113 individuals.

# Package is retrieved from our group server
library(devtools)
# only install if the package is missing or its version is below 1.0
if (!('brca.data' %in% installed.packages()[,'Package']) || packageVersion('brca.data') < "1.0")
  devtools::install_url('http://sels.tecnico.ulisboa.pt/gitlab/SOUND/brca.data/repository/archive.zip')
library(brca.data)

data(fpkm.per.tissue)
data(clinical)

data.short.name <- clinical$all$bcr_patient_barcode

data.tumor <- fpkm.per.tissue$primary.solid.tumor
data.normal <- fpkm.per.tissue$solid.tissue.normal

rm(fpkm.per.tissue)

tcga.barcode.pattern <- '(TCGA-[A-Z0-9a-z]{2}-[a-zA-Z0-9]{4})-([0-9]{2}).+'
data.tumor.short.name <- gsub(tcga.barcode.pattern,'\\1',colnames(data.tumor))
data.normal.short.name <- gsub(tcga.barcode.pattern,'\\1',colnames(data.normal))
data.tumor.paired <- data.tumor[,data.tumor.short.name %in% data.normal.short.name]
dim(data.tumor.paired)
## [1] 57251   117
dim(data.normal)
## [1] 57251   113
# removing samples which are duplicates and triplicates
data.tumor.paired <- data.tumor.paired[,-c(5,6,8,10)]
data.raw <- t(cbind(data.tumor.paired,data.normal))

# remove variables with sd = 0
sd.0.ix <- which(sapply(seq(ncol(data.raw)), function(e){sd(data.raw[,e])}) == 0)
data.raw <- data.raw[, -sd.0.ix]
dim(data.raw)
## [1]   226 54820

Normalizing the data by subtracting its mean from each gene vector and dividing by its standard deviation:

data <- (data.raw - matrix(apply(data.raw, 2, mean), nrow(data.raw), ncol(data.raw), byrow = TRUE)) /
  matrix(apply(data.raw, 2, sd), nrow(data.raw), ncol(data.raw), byrow = TRUE)
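
For reference, the same column-wise standardization could be obtained more compactly with base R's scale(); a minimal sketch (the explicit matrix expression above is kept for transparency):

# Sketch: scale() centers each column to mean 0 and scales it to unit standard deviation,
# matching the explicit computation above (up to the attributes that scale() attaches)
data.scaled <- scale(data.raw)
# max(abs(data - data.scaled))   # expected to be ~0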

2.1 Building training and test sets

One third of the samples will be assigned to the test set, as follows:

data.Y <- rbind(matrix(1,ncol(data.tumor.paired),1),matrix(0,ncol(data.normal),1))

data.train <- data[-seq(1,nrow(data),by=3),]
data.Y.train <- data.Y[-seq(1,nrow(data),by=3)]

data.test <- data[seq(1,nrow(data),by=3),]
data.Y.test <- data.Y[seq(1,nrow(data),by=3)]
set.seed(1985)

3 Generalized linear model (Binomial) with variable selection (VS)

When modelling a dataset with 54820 variables, such as the breast carcinoma dataset prepared above, multicollinearity problems are expected to arise. It is therefore highly recommended to reduce the dimensionality of such a dataset so that multicollinearity can be minimised.

Next, we will fit a logistic model after a variable selection (VS) step.

The ‘glmnet’ package will be used. This package fits LASSO and elastic-net regularization paths for linear, logistic and multinomial regression using coordinate descent. The algorithm is extremely fast and exploits sparsity in the input X matrix where it exists.
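
For reference, for the binomial family glmnet minimises the penalized negative log-likelihood below, where alpha = 1 gives the LASSO and 0 < alpha < 1 the elastic net:

$$ \min_{\beta_0,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i(\beta_0 + x_i^\top\beta) - \log\big(1 + e^{\beta_0 + x_i^\top\beta}\big)\Big] + \lambda\Big[(1-\alpha)\tfrac{1}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\Big] $$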

3.1 With LASSO regularization

Fit the logistic model by cross-validation:

library(glmnet)
## Loading required package: Matrix
## Loading required package: parallel
## Loaded glmnet 3.0-99
fit.sparse.logit.LASSO.cv.VS <- cv.glmnet(as.matrix(data.train),as.factor(data.Y.train),family="binomial",standardize = F)
plot(fit.sparse.logit.LASSO.cv.VS)

A total of 39 variables were selected out of 54820, corresponding to the nonzero coefficients:

fit.sparse.logit.LASSO.VS.var.selected.index <- which(fit.sparse.logit.LASSO.cv.VS$glmnet.fit$beta[,which(fit.sparse.logit.LASSO.cv.VS$cvm == min(fit.sparse.logit.LASSO.cv.VS$cvm))] != 0)

length(fit.sparse.logit.LASSO.VS.var.selected.index)
## [1] 39
fit.sparse.logit.LASSO.VS.var.selected <- colnames(data.train[,fit.sparse.logit.LASSO.VS.var.selected.index])
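
An equivalent way to recover the selected variables, shown here as a sketch, is to query the coefficients directly at the cross-validated lambda.min:

# Sketch: nonzero coefficients at lambda.min, extracted via coef()
coef.min <- coef(fit.sparse.logit.LASSO.cv.VS, s = "lambda.min")   # sparse column matrix
selected.genes <- setdiff(rownames(coef.min)[which(coef.min[, 1] != 0)], "(Intercept)")
# length(selected.genes)   # expected to match the count above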

Predictions obtained for the train set:

pred.train.sparse.logit.LASSO.VS <- predict(fit.sparse.logit.LASSO.cv.VS,as.matrix(data.train),s="lambda.min",type="class")
table(pred.train.sparse.logit.LASSO.VS,as.factor(data.Y.train))
##                                 
## pred.train.sparse.logit.LASSO.VS  0  1
##                                0 75  0
##                                1  0 75

Zero false negatives and zero false positives.

Now, predicting for a new dataset:

pred.test.sparse.logit.LASSO.VS <- predict(fit.sparse.logit.LASSO.cv.VS,as.matrix(data.test),s="lambda.min",type="class")

table(pred.test.sparse.logit.LASSO.VS,as.factor(data.Y.test))
##                                
## pred.test.sparse.logit.LASSO.VS  0  1
##                               0 38  0
##                               1  0 38

Zero false negatives and zero false positives.

3.2 With Elastic net regularization

Fit the logistic model by cross-validation:

fit.sparse.logit.EN.cv.VS <- cv.glmnet(as.matrix(data.train),as.factor(data.Y.train),family="binomial",alpha=0.7)

plot(fit.sparse.logit.EN.cv.VS)

A total of 87 variables were selected, corresponding to the nonzero coefficients:

fit.sparse.logit.EN.VS.var.selected.index <- which(fit.sparse.logit.EN.cv.VS$glmnet.fit$beta[,which(fit.sparse.logit.EN.cv.VS$cvm == min(fit.sparse.logit.EN.cv.VS$cvm))] != 0)

length(fit.sparse.logit.EN.VS.var.selected.index)
## [1] 87
fit.sparse.logit.EN.VS.var.selected <- colnames(data.train[,fit.sparse.logit.EN.VS.var.selected.index])

Predictions obtained for the train set:

pred.train.sparse.logit.EN.VS <- predict(fit.sparse.logit.EN.cv.VS,as.matrix(data.train),s="lambda.min",type="class")
table(pred.train.sparse.logit.EN.VS,as.factor(data.Y.train))
##                              
## pred.train.sparse.logit.EN.VS  0  1
##                             0 75  0
##                             1  0 75

Zero false negatives and zero false positives.

Now, predicting for a new dataset:

pred.test.sparse.logit.EN.VS <- predict(fit.sparse.logit.EN.cv.VS,as.matrix(data.test),s="lambda.min",type="class")

table(pred.test.sparse.logit.EN.VS,as.factor(data.Y.test))
##                             
## pred.test.sparse.logit.EN.VS  0  1
##                            0 38  0
##                            1  0 38

Zero false negatives and zero false positives.

4 Partial Least Squares - Discriminant Analysis (PLS-DA)

This section introduces feature extraction, which aims at creating new variables from the original ones, accounting for as much of the information contained in the data as possible.

Partial least squares (PLS) regression will be used for that purpose. PLS extracts k latent variables (i.e., linear combinations of the original variables) that maximise the covariance between the explanatory variables and the response variable. PLS-DA will be evaluated independently for feature extraction (FE), as well as combined with variable selection (VSFE).
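
In other words, the first PLS direction solves (a standard formulation, given here for reference; subsequent directions are obtained under orthogonality constraints on the resulting scores):

$$ w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Cov}(Xw, y)^2 $$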

4.1 Feature extraction without variable selection (FE)

Fit the plsda model by cross-validation:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
fit.plsda <- plsda(as.matrix(data.train),as.factor(data.Y.train), ncomp = 30, probMethod = "Softmax",method="simpls", validation="CV",segments=10,segment.type="consecutive") # The softmax function transforms the model predictions to "probability-like" values (e.g. on [0, 1] and sum to 1)

print(fit.plsda)
## Partial least squares classification, fitted with the simpls algorithm.
## Cross-validated using 10 consecutive segments.

Sum of squares of the prediction error (PRESS) obtained for k = 1, 2, …, ncomp latent variables:

plot(fit.plsda$validation$PRESS[1,],ylab="PRESS")

Cumulative variance explained in X for each extracted latent variable (about 40 % with 7 latent variables):

cumsum(fit.plsda$Xvar/fit.plsda$Xtotvar) 
##     Comp 1     Comp 2     Comp 3     Comp 4     Comp 5     Comp 6 
## 0.07434659 0.24552708 0.28473613 0.31324608 0.35428621 0.38245812 
##     Comp 7     Comp 8     Comp 9    Comp 10    Comp 11    Comp 12 
## 0.39616532 0.41223681 0.43417608 0.45325027 0.46773444 0.47656015 
##    Comp 13    Comp 14    Comp 15    Comp 16    Comp 17    Comp 18 
## 0.48633951 0.49727607 0.50523121 0.51331503 0.52037251 0.52852006 
##    Comp 19    Comp 20    Comp 21    Comp 22    Comp 23    Comp 24 
## 0.53653343 0.54452078 0.55204219 0.55822986 0.56431110 0.57159306 
##    Comp 25    Comp 26    Comp 27    Comp 28    Comp 29    Comp 30 
## 0.57958883 0.58639688 0.59261207 0.60014958 0.60755271 0.61271710

PLS model using 7 latent variables:

fit.plsda <- plsda(as.matrix(data.train),as.factor(data.Y.train), ncomp = 7, probMethod = "Softmax",method="simpls",scale=FALSE) 

Predictions obtained for the train set:

pred.plsda.train <- predict(fit.plsda, as.matrix(data.train))
confusionMatrix(pred.plsda.train,as.factor(data.Y.train))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 75  0
##          1  0 75
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9757, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : 0          
## 

Zero false negatives and zero false positives.

Now, predicting for a new dataset:

pred.plsda.test <- predict(fit.plsda, as.matrix(data.test))
confusionMatrix(pred.plsda.test,as.factor(data.Y.test))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 38  0
##          1  0 38
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9526, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : 0          
## 
which(pred.plsda.test != data.Y.test)
## integer(0)
rm(fit.plsda)

Zero false negatives and zero false positives.

4.2 Feature extraction with variable selection (VSFE)

By sparse PLS-DA (sPLS-DA)

sPLS-DA using the ‘spls’ library requires the sparsity parameter eta to be tuned, which determines the amount of sparsity. It varies between 0 and 1: the closer to 1, the smaller the number of variables selected. sPLS-DA is performed in two steps: one step for variable selection with sPLS and one step for classification (Chun and Keles, 2010).
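
A possible tuning step is sketched below, assuming the cross-validation helper cv.splsda of the ‘spls’ package is available; the grid values are illustrative and the optimal pair is read from the function's reported misclassification rates:

# Sketch: grid search for eta (sparsity) and K (number of components) by cross-validation
cv.fit <- cv.splsda(as.matrix(data.train), as.factor(data.Y.train),
                    fold = 5, K = c(4, 6, 8), eta = c(0.9, 0.95, 0.99),
                    classifier = "logistic", scale.x = FALSE)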

Fit the plsda model with variable selection:

library(spls)
## Sparse Partial Least Squares (SPLS) Regression and
## Classification (version 2.2-1)
## 
## Attaching package: 'spls'
## The following object is masked from 'package:caret':
## 
##     splsda
fit.splsda.LOGIT <- splsda(as.matrix(data.train),as.factor(data.Y.train), K = 8, eta = 0.99,kappa=0.5,classifier="logistic",scale.x=FALSE)

The number and identity of the variables selected:

length(fit.splsda.LOGIT$spls.fit$A)
## [1] 42
print(fit.splsda.LOGIT) 
## 
## Sparse Partial Least Squares Discriminant Analysis
## ----
## Parameters: eta = 0.99, K = 8
## Classifier: Logistic regression
## 
## SPLSDA chose 42 variables among 54820 variables
## 
## Selected variables: 
## ENSG00000011523  ENSG00000012822 ENSG00000065802 ENSG00000069702 ENSG00000072163 
## ENSG00000079337  ENSG00000100307 ENSG00000102543 ENSG00000106004 ENSG00000108799 
## ENSG00000108924  ENSG00000113594 ENSG00000119771 ENSG00000132561 ENSG00000136158 
## ENSG00000137070  ENSG00000137960 ENSG00000142910 ENSG00000145777 ENSG00000149451 
## ENSG00000154065  ENSG00000154240 ENSG00000154265 ENSG00000154736 ENSG00000158966 
## ENSG00000162407  ENSG00000163239 ENSG00000173757 ENSG00000177098 ENSG00000182463 
## ENSG00000184271  ENSG00000184500 ENSG00000185432 ENSG00000187824 ENSG00000196549 
## ENSG00000197576  ENSG00000198478 ENSG00000198947 ENSG00000243836 ENSG00000245025 
## ENSG00000263400  ENSG00000264016 
select.var.splsda.LOGIT <- fit.splsda.LOGIT$A # selected variables

select.coef.splsda.LOGIT <- fit.splsda.LOGIT$W # coefficients of the selected variables

42 variables selected out of 54820.

Predictions obtained for the train set:

pred.splsda.LOGIT.train <- predict(fit.splsda.LOGIT, as.matrix(data.train))
table(pred.splsda.LOGIT.train,as.factor(data.Y.train))
##                        
## pred.splsda.LOGIT.train  0  1
##                       0 75  0
##                       1  0 75

Zero false negatives and zero false positives.

Now, predicting for a new dataset:

pred.splsda.LOGIT.test <- predict(fit.splsda.LOGIT, as.matrix(data.test))
table(pred.splsda.LOGIT.test,as.factor(data.Y.test))
##                       
## pred.splsda.LOGIT.test  0  1
##                      0 38  0
##                      1  0 38
rm(fit.splsda.LOGIT)

Zero false negatives and zero false positives.

5 Outlier detection in BRCA data

Potential outliers will be inspected next. The entire dataset will be used to build three different logistic models, obtained either by sparse-based variable selection (VS), by PLS feature extraction (FE), or by the combination of sparse-based variable selection and feature extraction by PLS-DA (VSFE), as previously shown.

5.1 Logistic regression

5.1.1 With variable selection using LASSO regularization (VS)

fit.sparse.logit.LASSO.cv.VS <- cv.glmnet(as.matrix(data),as.factor(data.Y),family="binomial")
# print(fit.sparse.logit.cv.VS)
plot(fit.sparse.logit.LASSO.cv.VS)

Identifying the variables selected:

fit.logit.LASSO.VS.var.selected.index <- which(fit.sparse.logit.LASSO.cv.VS$glmnet.fit$beta[,which(fit.sparse.logit.LASSO.cv.VS$cvm == min(fit.sparse.logit.LASSO.cv.VS$cvm))] != 0)
length(fit.logit.LASSO.VS.var.selected.index)
## [1] 39

A total of 39 variables were selected.

Building the logistic model with the variables selected:

data.glm.out.LASSO.VS <- data.frame(data.Y,data[,fit.logit.LASSO.VS.var.selected.index])
brca.glm.out.LASSO.VS <- glm(data.Y~.,family=binomial,data=data.glm.out.LASSO.VS,control = glm.control(maxit = 1000))
# summary(brca.glm.out.LASSO.VS)

Predictions obtained by the model:

pred.glm.out.LASSO.VS <- brca.glm.out.LASSO.VS$fitted.values
plot(pred.glm.out.LASSO.VS,xlab="",ylab="")

# identify misclassified individuals
# identify(pred.glm.out.LASSO.VS,labels=as.character(seq(1,length(pred.glm.out.LASSO.VS),1))) 

# Please note that function 'identify' fails to identify observations when generating the .HTML file.

Zero misclassifications obtained.

The Cook’s distance. Cook’s distance measures the influence exerted by each observation on the predicted outcome, i.e., for each observation i it measures the change in y_hat (the fitted values) for all observations with and without observation i, so we know how much observation i impacts the fitted values.
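
To make this definition concrete, a leave-one-out sketch is given below; it is slower than, and only approximately equal to, the one-step approximation computed by cooks.distance() further down:

# Illustrative sketch (not part of the original analysis): Cook's distance as the
# scaled change in fitted values when observation i is left out and the model refitted
cooks.refit <- function(fit, df) {
  p <- length(coef(fit))                    # number of model parameters
  phi <- summary(fit)$dispersion            # 1 for the binomial family
  yhat <- fitted(fit)
  sapply(seq_len(nrow(df)), function(i) {
    fit.i <- update(fit, data = df[-i, ])   # refit without observation i
    yhat.i <- predict(fit.i, newdata = df, type = "response")
    sum((yhat - yhat.i)^2) / (p * phi)
  })
}
# e.g. cooks.refit(brca.glm.out.LASSO.VS, data.glm.out.LASSO.VS)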

library(binomTools)

brca.glm.out.res.LASSO.VS <- Residuals(brca.glm.out.LASSO.VS)

brca.glm.out.cook.dist.LASSO.VS <- cooks.distance(brca.glm.out.LASSO.VS) # Cook's Distance

plot(brca.glm.out.cook.dist.LASSO.VS,pch="*",cex=2,main="Influential observations by the Cook's Distance",ylab="Cook's distance",xlab="Individuals")
abline(a=4*mean(brca.glm.out.cook.dist.LASSO.VS,na.rm=T),b=0,col="red") # add cutoff line

text(x=1:length(brca.glm.out.cook.dist.LASSO.VS)+1,y=brca.glm.out.cook.dist.LASSO.VS,labels=ifelse(brca.glm.out.cook.dist.LASSO.VS>4*mean(brca.glm.out.cook.dist.LASSO.VS,na.rm=T),names(brca.glm.out.cook.dist.LASSO.VS),""),col=2)

which(brca.glm.out.cook.dist.LASSO.VS>4*mean(brca.glm.out.cook.dist.LASSO.VS,na.rm=T))
## TCGA-BH-A1F8-01A-11R-A13Q-07 TCGA-BH-A0BM-01A-11R-A056-07 
##                           34                           46 
## TCGA-BH-A0DO-01B-11R-A12D-07 TCGA-BH-A1F0-01A-11R-A137-07 
##                           84                           89 
## TCGA-BH-A0DQ-11A-12R-A089-07 TCGA-A7-A0DC-11A-41R-A089-07 
##                          138                          166 
## TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-E9-A1ND-11A-43R-A144-07 
##                          171                          177 
## TCGA-BH-A0H7-11A-13R-A089-07 TCGA-BH-A1FG-11B-12R-A13Q-07 
##                          180                          196 
## TCGA-BH-A1EW-11B-33R-A137-07 TCGA-GI-A2C8-11A-22R-A16F-07 
##                          205                          216 
## TCGA-E9-A1N5-11A-41R-A14D-07 
##                          220

Identifies samples # 34, 46, 84, 89, 138, 166, 171, 177, 180, 196, 205, 216, 220 as influential observations.

The Bonferroni outlier test. The Bonferroni outlier test reports the Bonferroni p-values for the largest Studentized residual, i.e., the most extreme observation.
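
A rough sketch of the idea behind the test (car::outlierTest, used below, handles the distributional details):

# Sketch: take the largest absolute Studentized residual, compute its two-sided
# p-value and apply a Bonferroni correction over the n observations tested
rs <- rstudent(brca.glm.out.LASSO.VS)
i.max <- which.max(abs(rs))
p.unadj <- 2 * pnorm(abs(rs[i.max]), lower.tail = FALSE)
p.bonf <- min(1, p.unadj * length(rs))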

library("car")
outlierTest(brca.glm.out.LASSO.VS)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##                                   rstudent unadjusted p-value Bonferonni p
## TCGA-BH-A0H7-11A-13R-A089-07 -1.587212e-05            0.99999           NA

Observation # 180, the most extreme observation, is not an outlier.

The concordance c-index. The concordance c-index is the proportion of pairs of observations with different outcome values in which the subject who experienced the event had a higher predicted probability of experiencing the event than the subject who did not. The c-index is frequently used for assessing the discriminative ability of a logistic regression model. For outlier detection purposes, for each observation i, B bootstrap samples can be created by sampling with replacement n - 1 observations from the original data; the concordance for each bootstrap sample is computed, and the proportion of bootstrap samples with an equal or lower c-index compared to the original (with all observations) corresponds to the p-value for observation i. The c-index is equivalent to the area under the ROC curve (AUC).
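
Since the helper script aucLOGIT.boot.R used below is not included in this document, one possible reading of the procedure just described is sketched here; the exact resampling scheme and the interface are assumptions:

# Hypothetical sketch of the bootstrap c-index test described above (slow; the
# original analysis used a parallelised implementation in aucLOGIT.boot.R)
library(pROC)
auc.boot.pvalues <- function(df, B = 1000) {
  fit.full <- glm(data.Y ~ ., family = binomial, data = df)
  auc.full <- as.numeric(auc(df$data.Y, fitted(fit.full)))   # c-index with all observations
  n <- nrow(df)
  sapply(seq_len(n), function(i) {
    auc.b <- replicate(B, {
      idx <- sample(setdiff(seq_len(n), i), n - 1, replace = TRUE)  # resample without obs i
      fit.b <- glm(data.Y ~ ., family = binomial, data = df[idx, ])
      as.numeric(auc(df$data.Y[idx], fitted(fit.b)))
    })
    mean(auc.b <= auc.full)   # proportion of bootstrap c-indices <= the full-data c-index
  })
}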

# library(pROC)
# library(parallel)
# source('~/aucLOGIT.boot.R')
# brca.auc.LASSO.VS <- aucLOGIT.boot.parallel(data.glm.out.LASSO.VS,1000)
# plot(brca.auc.LASSO.VS,xlab="Variables index",ylab="pvalue")

No outliers detected.

The Influence plot, i.e., the plot of Studentized residuals by hat values, with the area of each circle proportional to the observation's Cook’s distance:

influencePlot(brca.glm.out.LASSO.VS, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

Index plots of Cook’s distances, leverages expressed by hat values, Studentized residuals, and outlier significance levels by the Bonferroni outlier test:

influenceIndexPlot(brca.glm.out.LASSO.VS) 

These plots summarize the previous results, with observation # 180 being the most influential according to its Cook’s distance.

The DFBETAS (standardized difference of the beta). DFBETAS is a measure that standardizes the difference in parameter estimates between a regression model based on the full set of data and a model from which a (potentially influential) observation has been removed.
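
As a concrete check of this definition, one can refit the model without a given observation and standardize the resulting coefficient changes directly (a sketch; dfbetas() itself uses a one-step approximation rather than a full refit):

# Sketch (not part of the original analysis): refit-based DFBETAS for one observation
i <- 180                                   # observation highlighted in the plot below
fit.i <- update(brca.glm.out.LASSO.VS, data = data.glm.out.LASSO.VS[-i, ])
dfbetas.manual <- (coef(brca.glm.out.LASSO.VS) - coef(fit.i)) / sqrt(diag(vcov(fit.i)))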

plot(dfbetas(brca.glm.out.LASSO.VS)[180,2:ncol(dfbetas(brca.glm.out.LASSO.VS))],type='l',col="black",ylab="DFBETAS",xlab="Selected variables")
legend(0,-1.2,"obs 180",bty="n",lty=c(1,1),lwd=c(2.5,2.5),col="black")

Some of the variables show larger differences in the parameter estimates.

5.1.2 With variable selection using Elastic net regularization (VS)

fit.sparse.logit.EN.cv.VS <- cv.glmnet(as.matrix(data),as.factor(data.Y),family="binomial",alpha=0.7)
# print(fit.sparse.logit.cv.VS)
plot(fit.sparse.logit.EN.cv.VS)

Identifying the variables selected:

fit.logit.EN.VS.var.selected.index <- which(fit.sparse.logit.EN.cv.VS$glmnet.fit$beta[,which(fit.sparse.logit.EN.cv.VS$cvm == min(fit.sparse.logit.EN.cv.VS$cvm))] != 0)
length(fit.logit.EN.VS.var.selected.index)
## [1] 80

A total of 80 variables were selected.

Building the logistic model with the variables selected:

data.glm.out.EN.VS <- data.frame(data.Y,data[,fit.logit.EN.VS.var.selected.index])
brca.glm.out.EN.VS <- glm(data.Y ~., family=binomial,data=data.glm.out.EN.VS,control = glm.control(maxit = 1000))
# summary(brca.glm.out.EN.VS)

Predictions obtained by the model:

pred.glm.out.EN.VS <- brca.glm.out.EN.VS$fitted.values
plot(pred.glm.out.EN.VS)

Zero misclassifications.

The Cook’s distance. Cook’s distance measures the influence exerted by each observation on the predicted outcome, i.e., for each observation i it measures the change in y_hat (the fitted values) for all observations with and without observation i, so we know how much observation i impacts the fitted values.

library(binomTools)

brca.glm.out.res.EN.VS <- Residuals(brca.glm.out.EN.VS)

brca.glm.out.cook.dist.EN.VS <- cooks.distance(brca.glm.out.EN.VS) # Cook's Distance

plot(brca.glm.out.cook.dist.EN.VS,pch="*",cex=2,main="Influential observations by the Cook's Distance",ylab="Cook's distance",xlab="Individuals")
abline(a=4*mean(brca.glm.out.cook.dist.EN.VS,na.rm=T),b=0,col="red") # add cutoff line

text(x=1:length(brca.glm.out.cook.dist.EN.VS)+1,y=brca.glm.out.cook.dist.EN.VS,labels=ifelse(brca.glm.out.cook.dist.EN.VS>4*mean(brca.glm.out.cook.dist.EN.VS,na.rm=T),names(brca.glm.out.cook.dist.EN.VS),""),col=2)

which(brca.glm.out.cook.dist.EN.VS>4*mean(brca.glm.out.cook.dist.EN.VS,na.rm=T))
## TCGA-BH-A1F8-01A-11R-A13Q-07 TCGA-BH-A0BM-01A-11R-A056-07 
##                           34                           46 
## TCGA-BH-A1F0-01A-11R-A137-07 TCGA-BH-A0DK-11A-13R-A089-07 
##                           89                          134 
## TCGA-BH-A1FG-11B-12R-A13Q-07 TCGA-BH-A1EW-11B-33R-A137-07 
##                          196                          205 
## TCGA-BH-A209-11A-42R-A157-07 TCGA-BH-A0BW-11A-12R-A115-07 
##                          208                          214 
## TCGA-GI-A2C8-11A-22R-A16F-07 
##                          216

Identifies samples # 34, 46, 89, 134, 196, 205, 208 and 216 as influential observations.

The Bonferroni outlier test. The Bonferroni outlier test reports the Bonferroni p-values for the largest Studentized residual, i.e., the most extreme observation.

library("car")
outlierTest(brca.glm.out.EN.VS)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##                                   rstudent unadjusted p-value Bonferonni p
## TCGA-BH-A1EW-11B-33R-A137-07 -1.295482e-05            0.99999           NA

Observation # 205, the most extreme observation, is not an outlier.

The concordance c-index. The concordance c-index is the proportion of pairs of observations with different outcome values in which the subject who experienced the event had a higher predicted probability of experiencing the event than the subject who did not. It is equivalent to the area under the ROC curve (AUC).

# library(pROC)
# library(parallel)
# source('~/aucLOGIT.boot.R')
# brca.auc.EN.VS <- aucLOGIT.boot.parallel(data.glm.out.EN.VS,1000)
# plot(brca.auc.EN.VS,xlab="Variables index",ylab="pvalue")

No outliers detected.

The Influence plot, i.e., the plot of Studentized residuals by hat values, with the area of each circle proportional to the observation's Cook’s distance:

influencePlot(brca.glm.out.EN.VS, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

Index plots of Cook’s distances, leverages, Studentized residuals, and outlier significance levels:

influenceIndexPlot(brca.glm.out.EN.VS) 

These plots summarize the previous results.

5.1.3 With feature extraction (FE)

Building the PLS model:

fit.plsda.out <- plsda(as.matrix(data),as.factor(data.Y), ncomp = 30, probMethod = "Softmax",method="simpls", validation="CV",segments=10,segment.type="consecutive") 

plot(fit.plsda.out$validation$PRESS[1,],ylab="PRESS")

Cumulative variance explained in X for each extracted latent variable (about 40 % with 8 latent variables):

cumsum(fit.plsda.out$Xvar/fit.plsda.out$Xtotvar) 
##     Comp 1     Comp 2     Comp 3     Comp 4     Comp 5     Comp 6 
## 0.07715368 0.24105047 0.27385614 0.30300583 0.34420550 0.36566756 
##     Comp 7     Comp 8     Comp 9    Comp 10    Comp 11    Comp 12 
## 0.39164370 0.41098434 0.43279805 0.44600679 0.46786811 0.48400417 
##    Comp 13    Comp 14    Comp 15    Comp 16    Comp 17    Comp 18 
## 0.49550583 0.50939018 0.51926742 0.52931648 0.53732578 0.54423928 
##    Comp 19    Comp 20    Comp 21    Comp 22    Comp 23    Comp 24 
## 0.54971792 0.55614727 0.56157698 0.56735428 0.57176774 0.57679384 
##    Comp 25    Comp 26    Comp 27    Comp 28    Comp 29    Comp 30 
## 0.58223806 0.58693541 0.59161861 0.59651796 0.60097278 0.60543981

PLS-DA model with the number of latent variables selected:

fit.plsda.out <- plsda(as.matrix(data),as.factor(data.Y), ncomp = 8, probMethod = "Softmax",method="simpls") 

The scores plot, i.e., the new sample coordinates in the space of reduced dimension:

plot(fit.plsda.out$scores[1:113,1],fit.plsda.out$scores[1:113,2],col="black",xlim=c(-0.15,0.28),ylim=c(-0.55,0.18),ylab="LV2",xlab="LV1")
points(fit.plsda.out$scores[113:226,1],fit.plsda.out$scores[113:226,2],col="gray60")
identify(fit.plsda.out$scores[1:226,1],fit.plsda.out$scores[1:226,2],labels=as.character(seq(1,nrow(data),1))) 

## integer(0)

A good separation of the ‘tumor’ (black circles) and ‘normal’ (gray circles) classes is obtained.

Predictions obtained by the model:

pred.plsda <- predict(fit.plsda.out, as.matrix(data))
confusionMatrix(pred.plsda,as.factor(data.Y))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 113   0
##          1   0 113
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9838, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : 0          
## 

Zero misclassifications.

Building the logistic model with the new sample coordinates obtained by PLS-DA, accounting for 8 latent variables:

data.glm.out.FE <- as.data.frame(cbind(data.Y,fit.plsda.out$scores))
colnames(data.glm.out.FE)[1] <- 'data.Y'
brca.glm.out.FE <- glm(data.Y ~ . ,family=binomial,data = data.glm.out.FE,control = glm.control(maxit = 1000))
#summary(brca.glm.out.FE)

The Cook’s distance:

brca.glm.out.res.FE <- Residuals(brca.glm.out.FE)

brca.glm.out.cook.dist.FE <- cooks.distance(brca.glm.out.FE) # Cook's Distance

plot(brca.glm.out.cook.dist.FE,pch="*",cex=2,main="Influential observations by the Cook's Distance",ylab="Cook's distance", xlab="Individuals")
abline(a=4*mean(brca.glm.out.cook.dist.FE,na.rm=T),b=0,col="red") # add cutoff line

text(x=1:length(brca.glm.out.cook.dist.FE)+1,y=brca.glm.out.cook.dist.FE,labels=ifelse(brca.glm.out.cook.dist.FE>4*mean(brca.glm.out.cook.dist.FE,na.rm=T),names(brca.glm.out.cook.dist.FE),""),col=2) # add labels

which(brca.glm.out.cook.dist.FE>4*mean(brca.glm.out.cook.dist.FE,na.rm=T))
##   1  79 205 
##   1  79 205
#row.names(data)[c(1,79,205)]

Identifies samples # 1, 79 and 205 as influential observations.

The Bonferroni outlier test:

outlierTest(brca.glm.out.FE)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferonni p
## 1 1.139599e-05            0.99999           NA

Observation # 1, the most extreme observation, is not an outlier.

The concordance c-index:

# library(pROC)
# library(parallel)
# source('~/aucLOGIT.boot.R')
# brca.auc.FE <- aucLOGIT.boot.parallel(data.glm.out.FE,1000)
# plot(brca.auc.FE,xlab="Variables index",ylab="pvalue")

No outliers detected.

The Influence plot:

influencePlot(brca.glm.out.FE, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

Index plots of Cook’s distances, leverages, Studentized residuals, and outlier significance levels:

influenceIndexPlot(brca.glm.out.FE) 

rm(fit.plsda.out)

These plots summarize the previous results.

5.1.4 With variable selection (Elastic net) and feature extraction (VSFE)

fit.splsda.out <- splsda(as.matrix(data),as.factor(data.Y), K = 8, eta = .7,kappa=0.5,classifier="logistic",scale.x=FALSE)

Predictions obtained by the model:

pred.splsda.out <- predict(fit.splsda.out, as.matrix(data))
table(pred.splsda.out,as.factor(data.Y))
##                
## pred.splsda.out   0   1
##               0 113   0
##               1   0 113

Zero misclassifications.

Building the logistic model with the new sample coordinates obtained by PLS-DA, accounting for 8 latent variables based on 295 out of 54820 variables:

data.glm.out.VSFE <- data.frame(data.Y,data[,fit.splsda.out$spls.fit$A]%*%fit.splsda.out$W)
brca.glm.out.VSFE <- glm(data.Y~.,family=binomial,data=data.glm.out.VSFE,control = glm.control(maxit = 1000))

The Cook’s distance:

brca.glm.out.res.VSFE <- Residuals(brca.glm.out.VSFE)

brca.glm.out.cook.dist.VSFE <- cooks.distance(brca.glm.out.VSFE) # Cook's Distance

plot(brca.glm.out.cook.dist.VSFE,pch="*",cex=2,main="Influential observations by the Cook's Distance",ylab="Cook's distance", xlab="Individuals")
abline(a=4*mean(brca.glm.out.cook.dist.VSFE,na.rm=T),b=0,col="red") # add cutoff line

text(x=1:length(brca.glm.out.cook.dist.VSFE)+1,y=brca.glm.out.cook.dist.VSFE,labels=ifelse(brca.glm.out.cook.dist.VSFE>4*mean(brca.glm.out.cook.dist.VSFE,na.rm=T),names(brca.glm.out.cook.dist.VSFE),""),col=2) # add labels

which(brca.glm.out.cook.dist.VSFE>4*mean(brca.glm.out.cook.dist.VSFE,na.rm=T))
## TCGA-A7-A0DC-01B-04R-A22O-07 TCGA-A7-A0DB-01A-11R-A277-07 
##                            1                           93 
## TCGA-BH-A1EW-11B-33R-A137-07 
##                          205

Identifies samples # 1, 93 and 205 as the most influential observations.

The Bonferroni Outlier test:

outlierTest(brca.glm.out.VSFE)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##                                   rstudent unadjusted p-value Bonferonni p
## TCGA-BH-A1EW-11B-33R-A137-07 -1.037988e-05            0.99999           NA

Observation # 205, the most extreme observation, is not an outlier.

The concordance c-index:

# library(pROC)
# library(parallel)
# source('~/aucLOGIT.boot.R')
# brca.auc.VSFE <- aucLOGIT.boot.parallel(data.glm.out.VSFE,1000)
# plot(brca.auc.VSFE,xlab="Variables index",ylab="pvalue")

No outliers detected.

The Influence plot:

influencePlot(brca.glm.out.VSFE, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

Index plots of Cook’s distances, leverages, Studentized residuals, and outlier significance levels:

influenceIndexPlot(brca.glm.out.VSFE)

rm(fit.splsda.out)

These plots summarize the previous results.

5.2 Robust logistic regression

The strategy of removing observations, although apparently simple, can be not only impractical but also very misleading. Graphical and diagnostic tools applied before model fitting, as well as residual analysis after model fitting, can become unreliable, depending on the complexity (dimensionality) of the dataset, the appropriateness of the chosen measures of ‘outlyingness’ and the masking effects produced. In many cases it can be more efficient to down-weight rather than discard certain observations, as robust regression does.

5.2.1 With variable selection (LASSO) (VS)

library(robustbase)
brca.glm.out.rob=glmrob(data.Y~.,family=binomial,data=as.data.frame(data.glm.out.LASSO.VS),control=glmrobMqle.control(tcc=1.5))
summary(brca.glm.out.rob) 
## 
## Call:  glmrob(formula = data.Y ~ ., family = binomial, data = as.data.frame(data.glm.out.LASSO.VS),      control = glmrobMqle.control(tcc = 1.5)) 
## 
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -7.773e-01  4.935e+06       0        1
## ENSG00000003096 -4.451e+00  1.090e+07       0        1
## ENSG00000065802 -3.483e+00  1.040e+07       0        1
## ENSG00000072778 -3.009e+00  9.025e+06       0        1
## ENSG00000078596 -5.475e+00  1.379e+07       0        1
## ENSG00000118946  3.415e+00  6.910e+06       0        1
## ENSG00000125841  1.013e+00  6.725e+06       0        1
## ENSG00000132561 -1.050e+00  1.232e+07       0        1
## ENSG00000135046 -2.969e+00  1.406e+07       0        1
## ENSG00000135823  4.082e+00  7.007e+06       0        1
## ENSG00000136158 -4.763e+00  1.486e+07       0        1
## ENSG00000137070 -1.442e+00  1.403e+07       0        1
## ENSG00000139874 -2.628e+00  1.010e+07       0        1
## ENSG00000141404 -2.696e+00  1.101e+07       0        1
## ENSG00000142910 -2.915e+00  1.207e+07       0        1
## ENSG00000145777 -5.172e-01  1.254e+07       0        1
## ENSG00000149090 -2.326e+00  1.151e+07       0        1
## ENSG00000154736 -4.227e+00  1.357e+07       0        1
## ENSG00000162849  5.058e+00  6.991e+06       0        1
## ENSG00000163239  6.055e-02  1.491e+07       0        1
## ENSG00000164220  4.501e+00  7.546e+06       0        1
## ENSG00000165197 -8.783e-01  1.139e+07       0        1
## ENSG00000165795 -7.605e+00  1.336e+07       0        1
## ENSG00000175806  1.070e+00  9.572e+06       0        1
## ENSG00000177098 -1.477e+00  1.579e+07       0        1
## ENSG00000180383 -6.493e+00  7.155e+06       0        1
## ENSG00000184271  3.880e+00  1.418e+07       0        1
## ENSG00000185345 -7.810e+00  1.066e+07       0        1
## ENSG00000187621  5.006e+00  5.983e+06       0        1
## ENSG00000187824 -7.823e-01  1.241e+07       0        1
## ENSG00000189134 -1.442e+01  1.432e+07       0        1
## ENSG00000196549 -1.455e+00  1.087e+07       0        1
## ENSG00000198478 -4.788e+00  1.168e+07       0        1
## ENSG00000198947  4.668e+00  1.390e+07       0        1
## ENSG00000230018 -5.975e+00  1.107e+07       0        1
## ENSG00000235505  3.671e-01  1.396e+07       0        1
## ENSG00000245857 -6.752e+00  8.889e+06       0        1
## ENSG00000254122 -5.280e+00  1.025e+07       0        1
## ENSG00000272841  1.651e+00  8.098e+06       0        1
## ENSG00000278058 -1.596e-01  1.162e+07       0        1
## Robustness weights w.r * w.x: 
##  All 226 weights are ~= 1.
## 
## Number of observations: 226 
## Fitted by method 'Mqle'  (in 50 iterations)
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## No deviance values available 
## Algorithmic parameters: 
##    acc    tcc 
## 0.0001 1.5000 
## maxit 
##    50 
## test.acc 
##   "coef"

No downweighted observations, i.e., no outliers.
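
Any downweighted observations could also be listed directly from the robustness weights stored in the fitted object (a sketch; w.r is the robustness-weight component returned by glmrob):

# Sketch: observations with a robustness weight noticeably below 1 are the ones the
# robust fit downweights (here all weights are ~1, so none are expected)
which(brca.glm.out.rob$w.r < 0.99)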

5.2.2 With variable selection (Elastic net) (VS)

library(robustbase)
brca.glm.out.rob=glmrob(as.factor(data.Y)~.,family=binomial,data=as.data.frame(data.glm.out.EN.VS),control=glmrobMqle.control(tcc=1.5))
summary(brca.glm.out.rob) 
## 
## Call:  glmrob(formula = as.factor(data.Y) ~ ., family = binomial, data = as.data.frame(data.glm.out.EN.VS),      control = glmrobMqle.control(tcc = 1.5)) 
## 
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -6.899e-01  4.935e+06       0        1
## ENSG00000003096 -9.266e+00  2.067e+07       0        1
## ENSG00000012822  4.747e+00  1.473e+07       0        1
## ENSG00000015413  3.289e+00  6.689e+06       0        1
## ENSG00000064300 -6.808e+00  1.785e+07       0        1
## ENSG00000065802  4.103e-01  1.467e+07       0        1
## ENSG00000072163 -1.364e+00  2.739e+07       0        1
## ENSG00000072778 -2.283e+00  1.120e+07       0        1
## ENSG00000078596 -3.231e-01  1.681e+07       0        1
## ENSG00000081041 -3.507e+00  9.649e+06       0        1
## ENSG00000101955  4.980e+00  2.314e+07       0        1
## ENSG00000105894 -6.352e+00  1.333e+07       0        1
## ENSG00000106004 -3.860e+00  1.506e+07       0        1
## ENSG00000113578  1.887e+00  1.850e+07       0        1
## ENSG00000113594  1.100e+00  1.534e+07       0        1
## ENSG00000117640 -2.665e+00  9.878e+06       0        1
## ENSG00000118946  1.814e+00  8.183e+06       0        1
## ENSG00000125841  1.330e+00  7.367e+06       0        1
## ENSG00000125869 -9.834e-01  1.039e+07       0        1
## ENSG00000132561 -1.346e+00  1.479e+07       0        1
## ENSG00000133816 -3.324e-01  9.868e+06       0        1
## ENSG00000135046  3.344e+00  2.225e+07       0        1
## ENSG00000135823  8.196e-01  1.014e+07       0        1
## ENSG00000136158 -1.902e+00  1.850e+07       0        1
## ENSG00000137070 -2.544e+00  1.946e+07       0        1
## ENSG00000137225 -3.385e-01  1.860e+07       0        1
## ENSG00000137960  6.330e-01  1.450e+07       0        1
## ENSG00000138316 -9.958e-01  1.014e+07       0        1
## ENSG00000139874 -1.114e+00  1.323e+07       0        1
## ENSG00000140743  2.644e+00  7.726e+06       0        1
## ENSG00000141404 -1.268e+00  1.265e+07       0        1
## ENSG00000142910  1.525e+00  1.672e+07       0        1
## ENSG00000143376  5.214e-01  9.043e+06       0        1
## ENSG00000145777 -5.447e-02  1.675e+07       0        1
## ENSG00000146592 -1.876e+00  1.346e+07       0        1
## ENSG00000149090 -1.020e+00  1.730e+07       0        1
## ENSG00000149451 -3.736e+00  2.203e+07       0        1
## ENSG00000154065  7.329e-01  1.580e+07       0        1
## ENSG00000154240 -6.189e+00  2.323e+07       0        1
## ENSG00000154736 -3.433e+00  1.690e+07       0        1
## ENSG00000154856 -5.498e-01  1.420e+07       0        1
## ENSG00000158850  2.076e+00  8.843e+06       0        1
## ENSG00000158966  6.672e-01  1.774e+07       0        1
## ENSG00000162849  1.889e+00  1.096e+07       0        1
## ENSG00000163239 -2.680e+00  2.315e+07       0        1
## ENSG00000164220  4.800e+00  9.227e+06       0        1
## ENSG00000164932  3.037e-01  1.229e+07       0        1
## ENSG00000165197  3.323e-01  1.649e+07       0        1
## ENSG00000165795 -7.379e+00  2.100e+07       0        1
## ENSG00000167641  8.860e-01  1.744e+07       0        1
## ENSG00000167910 -1.694e+00  8.940e+06       0        1
## ENSG00000172831 -1.631e+00  1.001e+07       0        1
## ENSG00000172987 -3.826e+00  1.505e+07       0        1
## ENSG00000173757 -1.314e+00  1.701e+07       0        1
## ENSG00000175806 -2.414e-01  1.301e+07       0        1
## ENSG00000177098 -2.301e+00  2.050e+07       0        1
## ENSG00000179314 -3.621e+00  1.339e+07       0        1
## ENSG00000180383 -2.142e+00  8.974e+06       0        1
## ENSG00000182253  4.613e+00  1.700e+07       0        1
## ENSG00000184271  1.495e+00  1.773e+07       0        1
## ENSG00000184500 -2.424e+00  1.912e+07       0        1
## ENSG00000185345 -3.668e+00  1.272e+07       0        1
## ENSG00000187824 -4.932e+00  1.644e+07       0        1
## ENSG00000189134 -1.215e+01  1.751e+07       0        1
## ENSG00000196549 -1.391e+00  1.355e+07       0        1
## ENSG00000197576  8.299e+00  2.146e+07       0        1
## ENSG00000198478 -3.382e+00  1.510e+07       0        1
## ENSG00000198947  4.153e-01  1.812e+07       0        1
## ENSG00000230018 -6.849e+00  1.472e+07       0        1
## ENSG00000232021  2.889e+00  9.934e+06       0        1
## ENSG00000235505  6.225e-02  1.681e+07       0        1
## ENSG00000245857 -3.030e+00  1.087e+07       0        1
## ENSG00000254122 -4.177e+00  1.223e+07       0        1
## ENSG00000258331  2.965e+00  8.458e+06       0        1
## ENSG00000260329 -4.329e+00  1.371e+07       0        1
## ENSG00000267365  3.223e+00  1.224e+07       0        1
## ENSG00000267532 -1.876e+00  1.853e+07       0        1
## ENSG00000272501 -7.278e-01  1.109e+07       0        1
## ENSG00000272841  2.899e+00  9.327e+06       0        1
## ENSG00000278058 -1.552e+00  1.413e+07       0        1
## ENSG00000279587  5.792e+00  2.163e+07       0        1
## Robustness weights w.r * w.x: 
##  All 226 weights are ~= 1.
## 
## Number of observations: 226 
## Fitted by method 'Mqle'  (in 50 iterations)
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## No deviance values available 
## Algorithmic parameters: 
##    acc    tcc 
## 0.0001 1.5000 
## maxit 
##    50 
## test.acc 
##   "coef"

No downweighted observations, i.e., no outliers.

5.2.3 With feature extraction (FE)

brca.glm.out.rob=glmrob(data.Y~.,family=binomial,data=as.data.frame(data.glm.out.FE),control=glmrobMqle.control(tcc=1.5))
summary(brca.glm.out.rob) # no outliers
## 
## Call:  glmrob(formula = data.Y ~ ., family = binomial, data = as.data.frame(data.glm.out.FE),      control = glmrobMqle.control(tcc = 1.5)) 
## 
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.945e-02  4.935e+06       0        1
## `Comp 1`    1.068e+03  6.140e+07       0        1
## `Comp 2`    3.607e+02  6.690e+07       0        1
## `Comp 3`    2.316e+02  7.614e+07       0        1
## `Comp 4`    2.015e+02  7.036e+07       0        1
## `Comp 5`    1.288e+02  6.966e+07       0        1
## `Comp 6`    1.054e+02  7.198e+07       0        1
## `Comp 7`    8.583e+01  6.978e+07       0        1
## `Comp 8`    7.692e+01  7.080e+07       0        1
## Robustness weights w.r * w.x: 
##  All 226 weights are ~= 1.
## 
## Number of observations: 226 
## Fitted by method 'Mqle'  (in 50 iterations)
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## No deviance values available 
## Algorithmic parameters: 
##    acc    tcc 
## 0.0001 1.5000 
## maxit 
##    50 
## test.acc 
##   "coef"

No downweighted observations, i.e., no outliers.

5.2.4 With variable selection and feature extraction (VSFE)

brca.glm.out.rob=glmrob(as.factor(data.Y)~.,family=binomial,data=as.data.frame(data.glm.out.VSFE),control=glmrobMqle.control(tcc=1.5))
summary(brca.glm.out.rob)# no outliers
## 
## Call:  glmrob(formula = as.factor(data.Y) ~ ., family = binomial, data = as.data.frame(data.glm.out.VSFE),      control = glmrobMqle.control(tcc = 1.5)) 
## 
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.596e-01  4.935e+06       0        1
## Comp.1       1.079e+03  6.139e+07       0        1
## Comp.2       3.567e+02  6.688e+07       0        1
## Comp.3       2.284e+02  7.639e+07       0        1
## Comp.4       2.046e+02  7.047e+07       0        1
## Comp.5       1.275e+02  6.941e+07       0        1
## Comp.6       1.117e+02  7.208e+07       0        1
## Comp.7       9.418e+01  6.995e+07       0        1
## Comp.8       8.062e+01  7.049e+07       0        1
## Robustness weights w.r * w.x: 
##  All 226 weights are ~= 1.
## 
## Number of observations: 226 
## Fitted by method 'Mqle'  (in 50 iterations)
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## No deviance values available 
## Algorithmic parameters: 
##    acc    tcc 
## 0.0001 1.5000 
## maxit 
##    50 
## test.acc 
##   "coef"

No downweighted observations, i.e., no outliers.

6 Outlier detection in noisy BRCA data

With the methods described above for outlier detection, influential observations could be identified for all datasets built using the different dimensionality-reduction strategies. However, no observation reached significance as an outlier, which indicates that either the data are indeed clean or that numerical problems occurred in model fitting, as suggested by the very large estimated standard errors obtained.

Starting from the assumption that the BRCA data have no outliers, the strategy adopted to evaluate the performance of the outlier detection methods was to introduce outliers into the response variable, i.e., ‘tumor’ versus ‘normal’.

Taking a logistic model with variable selection via LASSO regularization, we introduced 10 % of (random) outliers into the response variable and evaluated the effectiveness of the outlier detection methodologies described above:

# index.outliers <- sample(length(data.Y),20,replace=FALSE)
# sort(index.outliers)
index.outliers <- c(23,26,29,35,41,44,54,79,84,91,98,116,129,136,154,155,161,183,187,218) 
# fixing the outliers vector for data analysis reproducibility

data.Y.outliers <- data.Y
data.Y.outliers[index.outliers] <- abs(data.Y[index.outliers]-1)

Building the logistic model:

fit.logit.LASSO.cv.outliers <- cv.glmnet(as.matrix(data),as.factor(data.Y.outliers),family="binomial",standardize = F)

Identifying the variables selected:

fit.logit.LASSO.outliers.var.selected.index <- which(fit.logit.LASSO.cv.outliers$glmnet.fit$beta[,which(fit.logit.LASSO.cv.outliers$cvm == min(fit.logit.LASSO.cv.outliers$cvm))] != 0)
# length(fit.logit.LASSO.outliers.var.selected.index)

# Fixing the variables selected for data analysis reproducibility
fit.logit.LASSO.outliers.var.selected.index <- c(98,800,1081,1204,1295,2035,3213,3543,3981,6348,6603,7428,7878,8003,8089,8246,10703,11537,12739,13212,14119,14192,14304,15927,16530,16695,17217,17410,17562,17702,17814,20244,21030,21698,22824,23334,25202,25404,25478,25682,26501,26562,27666,31078,31402,32782,33286,33561,33583,35651,37174,37221,38473,42388,43115,43140,44616,45676,53486,54816)

A total of 60 variables were selected.

Building the logistic model with the variables selected:

data.glm.LASSO.outliers <- data.frame(data.Y.outliers,data[,fit.logit.LASSO.outliers.var.selected.index])
brca.glm.LASSO.outliers <- glm(data.Y.outliers~.,family=binomial,data=data.glm.LASSO.outliers,control = glm.control(maxit = 1000))
summary(brca.glm.LASSO.outliers)
## 
## Call:
## glm(formula = data.Y.outliers ~ ., family = binomial, data = data.glm.LASSO.outliers, 
##     control = glm.control(maxit = 1000))
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -4.290e-06  -2.110e-08  -2.110e-08   2.110e-08   5.117e-06  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -5.231e-03  1.367e+05       0        1
## ENSG00000005448  2.179e+00  2.011e+05       0        1
## ENSG00000059573 -4.004e+00  1.937e+05       0        1
## ENSG00000069696  4.906e+00  1.340e+05       0        1
## ENSG00000072778  7.613e-01  3.274e+05       0        1
## ENSG00000075234  6.264e-01  1.802e+05       0        1
## ENSG00000095585  2.266e+00  1.853e+05       0        1
## ENSG00000105894 -5.498e+00  1.683e+05       0        1
## ENSG00000108551 -4.105e+00  1.169e+05       0        1
## ENSG00000111961 -1.435e+00  2.821e+05       0        1
## ENSG00000130827  5.512e-01  2.280e+05       0        1
## ENSG00000132561 -5.214e-01  4.533e+05       0        1
## ENSG00000137225  3.539e+00  2.844e+05       0        1
## ENSG00000139874  1.722e+00  2.917e+05       0        1
## ENSG00000140743  6.682e+00  1.651e+05       0        1
## ENSG00000141404 -2.159e+00  2.780e+05       0        1
## ENSG00000142611 -3.296e+00  2.428e+05       0        1
## ENSG00000162384 -6.307e+00  2.436e+05       0        1
## ENSG00000165197  3.841e+00  3.142e+05       0        1
## ENSG00000170166 -4.288e+00  1.928e+05       0        1
## ENSG00000172150 -7.523e+00  2.681e+05       0        1
## ENSG00000176754  6.255e+00  3.848e+05       0        1
## ENSG00000177098 -4.478e+00  4.314e+05       0        1
## ENSG00000177606 -3.778e+00  2.722e+05       0        1
## ENSG00000185345 -1.332e+00  1.907e+05       0        1
## ENSG00000187908 -5.895e+00  1.353e+05       0        1
## ENSG00000188580 -6.720e+00  8.735e+04       0        1
## ENSG00000196944 -6.538e+00  2.996e+05       0        1
## ENSG00000197561 -3.851e-01  2.548e+05       0        1
## ENSG00000198010 -1.693e+00  2.137e+05       0        1
## ENSG00000198478  1.956e+00  3.965e+05       0        1
## ENSG00000198788 -3.402e+00  4.426e+04       0        1
## ENSG00000206630  1.082e+00  1.564e+05       0        1
## ENSG00000207873  1.533e+00  9.777e+04       0        1
## ENSG00000212789 -8.099e-01  1.247e+05       0        1
## ENSG00000215612 -5.499e+00  2.155e+05       0        1
## ENSG00000219755 -1.459e+00  2.314e+05       0        1
## ENSG00000224958 -2.753e+00  3.355e+05       0        1
## ENSG00000225269  7.657e-01  1.866e+05       0        1
## ENSG00000225386 -3.962e+00  1.717e+05       0        1
## ENSG00000225701  5.386e+00  1.043e+05       0        1
## ENSG00000226903  5.082e+00  6.200e+04       0        1
## ENSG00000226990 -1.408e+00  1.753e+05       0        1
## ENSG00000228695 -1.472e+00  2.112e+05       0        1
## ENSG00000233866 -6.143e+00  1.025e+05       0        1
## ENSG00000234336 -3.184e-01  1.818e+05       0        1
## ENSG00000236444 -2.155e-01  1.379e+05       0        1
## ENSG00000237205  6.077e-01  9.470e+04       0        1
## ENSG00000237603  3.137e+00  1.615e+05       0        1
## ENSG00000237630  3.646e+00  6.928e+04       0        1
## ENSG00000242766  2.200e+00  9.381e+04       0        1
## ENSG00000248869  1.019e+00  2.290e+05       0        1
## ENSG00000248946  2.389e+00  1.237e+05       0        1
## ENSG00000250921 -4.954e+00  2.219e+05       0        1
## ENSG00000256790  3.943e-01  1.529e+05       0        1
## ENSG00000258331  3.977e+00  1.586e+05       0        1
## ENSG00000258376 -5.169e+00  2.118e+05       0        1
## ENSG00000260223  5.430e+00  8.375e+04       0        1
## ENSG00000261722 -3.734e+00  1.423e+05       0        1
## ENSG00000279117 -2.406e-01  1.988e+05       0        1
## ENSG00000281904  7.355e-01  1.841e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3.1323e+02  on 225  degrees of freedom
## Residual deviance: 3.9884e-10  on 165  degrees of freedom
## AIC: 122
## 
## Number of Fisher Scoring iterations: 27

Predictions obtained by the model:

pred.glm.LASSO.outliers <- brca.glm.LASSO.outliers$fitted.values
pred.glm.LASSO.outliers <- round(pred.glm.LASSO.outliers)
plot(pred.glm.LASSO.outliers)

table(pred.glm.LASSO.outliers,data.Y.outliers)
##                        data.Y.outliers
## pred.glm.LASSO.outliers   0   1
##                       0 115   0
##                       1   0 111

Introducing 20 outliers in the response variable yielded a model with no misclassifications, i.e., every mislabelled observation was fitted to its flipped (wrong) class.

The Cook’s distance:

library(binomTools)

brca.glm.res.LASSO.outliers <- Residuals(brca.glm.LASSO.outliers)

brca.glm.cook.dist.LASSO.outliers <- cooks.distance(brca.glm.LASSO.outliers) # Cook's Distance

plot(brca.glm.cook.dist.LASSO.outliers,pch="*",cex=2,main="Influential observations by the Cook's Distance",ylab="Cook's distance",xlab="Individuals")
abline(a=4*mean(brca.glm.cook.dist.LASSO.outliers,na.rm=T),b=0,col="red") # add cutoff line

text(x=1:length(brca.glm.cook.dist.LASSO.outliers)+1,y=brca.glm.cook.dist.LASSO.outliers,labels=ifelse(brca.glm.cook.dist.LASSO.outliers>4*mean(brca.glm.cook.dist.LASSO.outliers,na.rm=T),names(brca.glm.cook.dist.LASSO.outliers),""),col=2)

which(brca.glm.cook.dist.LASSO.outliers>4*mean(brca.glm.cook.dist.LASSO.outliers,na.rm=T))
## TCGA-BH-A0DV-11A-22R-A12P-07 TCGA-BH-A0BS-11A-11R-A12P-07 
##                          136                          187

Identifies observations # 136 and 187 (true outliers) as influential observations.

The Bonferroni outlier test:

library("car")
outlierTest(brca.glm.LASSO.outliers)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##                                 rstudent unadjusted p-value Bonferonni p
## TCGA-BH-A0BS-11A-11R-A12P-07 0.000151042            0.99988           NA

Observation # 187, the most extreme observation, is not considered an outlier.

The concordance c-index:

# library(pROC)
# source('~/aucLOGIT.boot.R')
# brca.auc.LASSO.outliers <- aucLOGIT.boot.parallel(data.glm.LASSO.outliers,1000)
# plot(brca.auc.LASSO.outliers,xlab="Variables index",ylab="pvalue")

No outliers detected.

The Influence plot, i.e., the plot of Studentized residuals by hat values, with the area of each circle proportional to the observation's Cook’s distance:

influencePlot(brca.glm.LASSO.outliers, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

Performing robust logistic regression:

brca.glm.LASSO.outliers.rob=glmrob(as.factor(data.Y.outliers)~.,family=binomial,data=as.data.frame(data.glm.LASSO.outliers),control=glmrobMqle.control(tcc=1.5))
summary(brca.glm.LASSO.outliers.rob)
## 
## Call:  glmrob(formula = as.factor(data.Y.outliers) ~ ., family = binomial,      data = as.data.frame(data.glm.LASSO.outliers), control = glmrobMqle.control(tcc = 1.5)) 
## 
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -7.070e-01  4.943e+06       0        1
## ENSG00000005448  3.007e+00  6.293e+06       0        1
## ENSG00000059573 -5.128e+00  8.085e+06       0        1
## ENSG00000069696  1.020e+01  6.057e+06       0        1
## ENSG00000072778 -5.985e-01  9.110e+06       0        1
## ENSG00000075234 -1.132e+00  7.477e+06       0        1
## ENSG00000095585  5.347e+00  6.637e+06       0        1
## ENSG00000105894 -1.275e+01  8.387e+06       0        1
## ENSG00000108551 -7.369e+00  7.959e+06       0        1
## ENSG00000111961 -3.992e+00  1.024e+07       0        1
## ENSG00000130827  3.206e+00  6.209e+06       0        1
## ENSG00000132561 -8.377e-01  1.360e+07       0        1
## ENSG00000137225  6.392e+00  1.546e+07       0        1
## ENSG00000139874  5.241e+00  1.112e+07       0        1
## ENSG00000140743  1.175e+01  6.719e+06       0        1
## ENSG00000141404 -5.000e+00  1.050e+07       0        1
## ENSG00000142611 -7.117e+00  8.283e+06       0        1
## ENSG00000162384 -1.009e+01  1.000e+07       0        1
## ENSG00000165197  5.061e+00  1.217e+07       0        1
## ENSG00000170166 -1.020e+01  8.167e+06       0        1
## ENSG00000172150 -1.216e+01  8.452e+06       0        1
## ENSG00000176754  1.094e+01  1.203e+07       0        1
## ENSG00000177098 -9.763e+00  1.709e+07       0        1
## ENSG00000177606 -8.881e+00  8.400e+06       0        1
## ENSG00000185345 -2.554e+00  1.051e+07       0        1
## ENSG00000187908 -9.711e+00  7.049e+06       0        1
## ENSG00000188580 -1.332e+01  6.333e+06       0        1
## ENSG00000196944 -1.015e+01  9.641e+06       0        1
## ENSG00000197561 -2.558e-01  9.917e+06       0        1
## ENSG00000198010 -4.285e+00  9.309e+06       0        1
## ENSG00000198478  4.025e+00  1.167e+07       0        1
## ENSG00000198788 -9.042e+00  6.059e+06       0        1
## ENSG00000206630  1.495e+00  6.085e+06       0        1
## ENSG00000207873  5.710e+00  5.413e+06       0        1
## ENSG00000212789 -1.134e+00  6.518e+06       0        1
## ENSG00000215612 -1.224e+01  1.388e+07       0        1
## ENSG00000219755 -9.844e-01  8.579e+06       0        1
## ENSG00000224958 -2.245e+00  1.553e+07       0        1
## ENSG00000225269 -9.560e-01  9.790e+06       0        1
## ENSG00000225386 -7.792e+00  7.700e+06       0        1
## ENSG00000225701  7.679e+00  5.826e+06       0        1
## ENSG00000226903  1.079e+01  5.459e+06       0        1
## ENSG00000226990 -3.473e+00  8.353e+06       0        1
## ENSG00000228695 -2.501e+00  1.349e+07       0        1
## ENSG00000233866 -1.149e+01  6.629e+06       0        1
## ENSG00000234336 -2.250e+00  9.563e+06       0        1
## ENSG00000236444 -2.093e+00  7.177e+06       0        1
## ENSG00000237205  3.824e+00  5.686e+06       0        1
## ENSG00000237603  6.581e+00  5.594e+06       0        1
## ENSG00000237630  7.269e+00  5.161e+06       0        1
## ENSG00000242766  2.239e+00  6.658e+06       0        1
## ENSG00000248869  6.472e-01  1.087e+07       0        1
## ENSG00000248946  6.524e+00  5.261e+06       0        1
## ENSG00000250921 -8.654e+00  8.372e+06       0        1
## ENSG00000256790  1.197e-01  7.459e+06       0        1
## ENSG00000258331  7.806e+00  5.992e+06       0        1
## ENSG00000258376 -8.748e+00  9.202e+06       0        1
## ENSG00000260223  1.197e+01  5.167e+06       0        1
## ENSG00000261722 -1.066e+01  7.169e+06       0        1
## ENSG00000279117 -1.106e+00  8.386e+06       0        1
## ENSG00000281904  2.624e+00  7.141e+06       0        1
## Robustness weights w.r * w.x: 
##  All 226 weights are ~= 1.
## 
## Number of observations: 226 
## Fitted by method 'Mqle'  (in 50 iterations)
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## No deviance values available 
## Algorithmic parameters: 
##    acc    tcc 
## 0.0001 1.5000 
## maxit 
##    50 
## test.acc 
##   "coef"

No observations were down-weighted, i.e., no outliers were detected.

For the BRCA dataset, numerical problems were encountered when applying logistic regression that invalidate the detection of influential observations and outliers, even when outliers are introduced in the response data. To illustrate the applicability of the methods described for the detection of outliers and influential observations, a subset of the BRCA dataset was therefore used, based on two variables selected by stepwise logistic regression after dimensionality reduction with LASSO regularization. The introduction of outliers, by randomly switching the response variable (‘tumour’ vs ‘normal’) in 10 % of the observations, was also investigated in order to evaluate the effectiveness of the methods used for the identification of outliers and influential observations.

Stepwise logistic regression was used for selecting a small set of variables to illustrate the above methodologies for outlier detection.

# brca.glm.out.LASSO.VS.step <- step(brca.glm.out.LASSO.VS,direction="backward")
# summary(brca.glm.out.LASSO.VS.step)

# fixing the variables selected for data analysis reproducibility
data.2var <- data[,c("ENSG00000072778","ENSG00000235505")]

Building the logistic model on the resulting matrix with the two variables selected by stepwise logistic regression after LASSO regularization:

data.glm.2var <- data.frame(data.Y,data.2var)
brca.glm.2var <- glm(data.Y~.,family=binomial,data=data.glm.2var,control = glm.control(maxit = 1000))
summary(brca.glm.2var)
## 
## Call:
## glm(formula = data.Y ~ ., family = binomial, data = data.glm.2var, 
##     control = glm.control(maxit = 1000))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82273  -0.01172   0.00018   0.01649   2.56918  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.6004     0.5260  -1.142    0.254    
## ENSG00000072778  -5.1556     1.2990  -3.969 7.22e-05 ***
## ENSG00000235505  -7.3072     1.7627  -4.145 3.39e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.303  on 225  degrees of freedom
## Residual deviance:  30.776  on 223  degrees of freedom
## AIC: 36.776
## 
## Number of Fisher Scoring iterations: 9

Predictions obtained by the model:

pred.glm.2var <- brca.glm.2var$fitted.values
plot(pred.glm.2var)

pred.glm.2var <- round(pred.glm.2var)
table(pred.glm.2var,data.Y)
##              data.Y
## pred.glm.2var   0   1
##             0 109   4
##             1   4 109
which(pred.glm.2var != data.Y)
## [1]  22  56  60  85 134 157 206 220

Eight misclassifications: 4 false positives and 4 false negatives, corresponding to individuals # 22, 56, 60, 85, 134, 157, 206 and 220.
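
For reference, the usual performance measures can be derived directly from this confusion matrix; a minimal sketch, not part of the original analysis:

conf.2var <- table(pred.glm.2var,data.Y)
accuracy <- sum(diag(conf.2var))/sum(conf.2var)          # (109+109)/226
sensitivity <- conf.2var["1","1"]/sum(conf.2var[,"1"])   # correctly predicted tumour samples
specificity <- conf.2var["0","0"]/sum(conf.2var[,"0"])   # correctly predicted normal samples
c(accuracy=accuracy,sensitivity=sensitivity,specificity=specificity)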

The Cook’s distance:

library(binomTools)

brca.glm.out.cook.dist.2var <- cooks.distance(brca.glm.2var) # Cook's Distance

plot(brca.glm.out.cook.dist.2var,pch="*",cex=2,main="Influential observations by the Cook's Distance",ylab="Cook's distance",xlab="Individuals")
abline(a=4*mean(brca.glm.out.cook.dist.2var,na.rm=T),b=0,col="red") # add cutoff line

text(x=1:length(brca.glm.out.cook.dist.2var)+1,y=brca.glm.out.cook.dist.2var,labels=ifelse(brca.glm.out.cook.dist.2var>4*mean(brca.glm.out.cook.dist.2var,na.rm=T),names(brca.glm.out.cook.dist.2var),""),col=2)

which(brca.glm.out.cook.dist.2var>4*mean(brca.glm.out.cook.dist.2var,na.rm=T))
## TCGA-BH-A0H5-01A-21R-A115-07 TCGA-BH-A0BT-01A-11R-A12P-07 
##                           22                           56 
## TCGA-E2-A15M-01A-11R-A12D-07 TCGA-BH-A0E1-01A-11R-A056-07 
##                           60                           85 
## TCGA-BH-A0DK-11A-13R-A089-07 TCGA-BH-A0BA-11A-22R-A19E-07 
##                          134                          157 
## TCGA-BH-A0E1-11A-13R-A089-07 TCGA-E9-A1N5-11A-41R-A14D-07 
##                          206                          220

Using the cutoff of four times the mean Cook’s distance, samples # 22, 56, 60, 85, 134, 157, 206 and 220 are flagged as influential observations; these are exactly the observations misclassified by logistic regression.

The Bonferroni outlier test:

library("car")
outlierTest(brca.glm.2var)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##                              rstudent unadjusted p-value Bonferonni p
## TCGA-BH-A0BT-01A-11R-A12P-07 2.895083          0.0037906      0.85667

Observation # 56, the observation with the largest Studentized residual, is not flagged as an outlier.
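
The Bonferroni-adjusted p-value reported by outlierTest is the unadjusted p-value multiplied by the number of observations tested, which can be verified directly:

# check of the Bonferroni adjustment for observation # 56
0.0037906*nrow(data.glm.2var)   # = 0.0037906*226 ~ 0.8567, matching the reported value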

The concordance c-index:

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:glmnet':
## 
##     auc
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(parallel)
source('./aucLOGIT.boot.R')
brca.auc.2var <- aucLOGIT.boot.parallel(data.glm.2var,1000)
plot(brca.auc.2var,xlab="Variables index",ylab="pvalue")

which(brca.auc.2var < 0.25)
## [1]  56  60 134 206 220
# identify(brca.auc.2var,labels=as.character(seq(1,length(brca.auc.2var),1)))
# Please note that function 'identify' fails to identify observations when generating the .HTML file.

The most outlying observations provided by the c-index are observations # 56, 60, 134, 206 and 220.
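
The per-observation scores above are produced by the accompanying aucLOGIT.boot.R script, which is not listed in this document. As a complementary, model-level summary, the overall concordance c-index of the fitted model can also be obtained directly with ‘pROC’; a small sketch, not part of the original analysis:

# sketch: overall concordance (c-index / AUC) of the two-variable logistic model
roc.2var <- roc(as.numeric(data.Y), brca.glm.2var$fitted.values)
auc(roc.2var)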

The influence plot, i.e., the plot of Studentized residuals against hat values, with the area of each circle proportional to the observation’s Cook’s distance:

influencePlot(brca.glm.2var, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

Index plots of Cook’s distances, leverages, Studentized residuals, and outlier significance levels:

influenceIndexPlot(brca.glm.2var) 

These plots summarize the previous results: observation # 56 has the largest Studentized residual and is the most influential by Cook’s distance, yet it is not considered an outlier by the Bonferroni outlier test.

Performing robust logistic regression:

library(robustbase)
brca.glm.2var.rob=glmrob(data.Y~.,family=binomial,data=as.data.frame(data.glm.2var),control=glmrobMqle.control(tcc=1.5))
summary(brca.glm.2var.rob) 
## 
## Call:  glmrob(formula = data.Y ~ ., family = binomial, data = as.data.frame(data.glm.2var),      control = glmrobMqle.control(tcc = 1.5)) 
## 
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.5306     0.5673  -0.935 0.349629    
## ENSG00000072778  -5.3685     1.6523  -3.249 0.001158 ** 
## ENSG00000235505  -6.8525     2.0051  -3.417 0.000632 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Robustness weights w.r * w.x: 
##  223 weights are ~= 1. The remaining 3 ones are
##     56     85    134 
## 0.1971 0.9762 0.9036 
## 
## Number of observations: 226 
## Fitted by method 'Mqle'  (in 9 iterations)
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## No deviance values available 
## Algorithmic parameters: 
##    acc    tcc 
## 0.0001 1.5000 
## maxit 
##    50 
## test.acc 
##   "coef"

Three observations are down-weighted by robust logistic regression.

Inspection of down-weighted observations:

plot(brca.glm.2var.rob$w.r)

which(brca.glm.2var.rob$w.r < 1)
## [1]  56  85 134

The down-weighted observations are # 56 (the most strongly down-weighted), 85 and 134, all previously identified as influential observations.
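
If the corresponding sample barcodes are needed, the indices can be mapped back through the row names of the expression matrix (assuming the barcodes are kept as row names, as in the Cook’s distance output above):

# map the down-weighted indices back to TCGA sample barcodes
rownames(data.2var)[brca.glm.2var.rob$w.r < 1]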

pred.brca.glm.2var.rob <- brca.glm.2var.rob$fitted.values
pred.brca.glm.2var.rob <- round(pred.brca.glm.2var.rob)
table(pred.brca.glm.2var.rob,data.Y)
##                       data.Y
## pred.brca.glm.2var.rob   0   1
##                      0 109   3
##                      1   4 110
which(pred.brca.glm.2var.rob != data.Y)
## [1]  56  60  85 134 157 206 220

A total of 7 misclassifications were obtained by robust logistic regression: 4 false positives and 3 false negatives. These observations correspond to those misclassified by logistic regression, except for observation # 22.

To evaluate the effectiveness of the outlier detection methodologies described above, the response value was switched to the opposite class for a random 10 % of the individuals:

index.outliers <- c(23,26,29,35,41,44,54,79,84,91,98,116,129,136,154,155,161,183,187,218)
# the same vector generated before, for the analysis of the full BRCA dataset.
data.Y.outliers <- data.Y
data.Y.outliers[index.outliers] <- abs(data.Y[index.outliers]-1)
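
For reference, an index vector of this kind can be drawn with sample(); the fixed vector above is kept so that the analysis remains reproducible. A hypothetical sketch of the sampling step (not the code originally used):

# sketch: drawing roughly 10 % of the observations at random
# set.seed(1985)
# index.outliers <- sort(sample(nrow(data),round(0.1*nrow(data))))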

Building the logistic model with the two variables selected by stepwise logistic regression:

data.glm.2var.out <- data.frame(data.Y.outliers,data.2var)
brca.glm.2var.out <- glm(data.Y.outliers~.,family=binomial,data=data.glm.2var.out,control = glm.control(maxit = 1000))
summary(brca.glm.2var.out)
## 
## Call:
## glm(formula = data.Y.outliers ~ ., family = binomial, data = data.glm.2var.out, 
##     control = glm.control(maxit = 1000))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.66490  -0.48270  -0.07846   0.49653   2.51459  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.1724     0.2045  -0.843    0.399    
## ENSG00000072778  -1.3700     0.2594  -5.281 1.29e-07 ***
## ENSG00000235505  -1.5750     0.2552  -6.172 6.76e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.23  on 225  degrees of freedom
## Residual deviance: 162.61  on 223  degrees of freedom
## AIC: 168.61
## 
## Number of Fisher Scoring iterations: 5

Predictions obtained by the model:

pred.glm.2var.out <- brca.glm.2var.out$fitted.values
plot(pred.glm.2var.out)

pred.glm.2var.out <- round(pred.glm.2var.out)

table(pred.glm.2var.out,data.Y.outliers)
##                  data.Y.outliers
## pred.glm.2var.out   0   1
##                 0 101  13
##                 1  14  98
which(pred.glm.2var.out !=data.Y.outliers)
##  [1]  23  26  29  35  41  44  54  56  60  75  79  84  85  91  98 116 129
## [18] 134 136 154 155 157 161 183 187 206 218

A total of 27 misclassifications: 14 false positives and 13 false negatives. Of these, the 20 simulated outliers were misclassified into their original class, and 7 additional observations were misclassified (# 56, 60, 75, 85, 134, 157 and 206).

The Cook’s distance:

library(binomTools)

brca.glm.cook.dist.2var.out <- cooks.distance(brca.glm.2var.out) # Cook's Distance

plot(brca.glm.cook.dist.2var.out,pch="*",cex=2,main="Influential observations by the Cook's Distance",ylab="Cook's distance",xlab="Individuals")
abline(a=4*mean(brca.glm.cook.dist.2var.out,na.rm=T),b=0,col="red") # add cutoff line

text(x=1:length(brca.glm.cook.dist.2var.out)+1,y=brca.glm.cook.dist.2var.out,labels=ifelse(brca.glm.cook.dist.2var.out>4*mean(brca.glm.cook.dist.2var.out,na.rm=T),names(brca.glm.cook.dist.2var.out),""),col=2)

which(brca.glm.cook.dist.2var.out>4*mean(brca.glm.cook.dist.2var.out,na.rm=T))
## TCGA-E9-A1N6-01A-11R-A144-07 TCGA-A7-A13F-01A-11R-A12P-07 
##                           23                           29 
## TCGA-BH-A0DG-01A-21R-A12P-07 TCGA-BH-A0DK-01A-21R-A056-07 
##                           35                           41 
## TCGA-E9-A1R7-01A-11R-A14M-07 TCGA-BH-A209-01A-11R-A157-07 
##                           44                           54 
## TCGA-BH-A0BT-01A-11R-A12P-07 TCGA-A7-A13E-01B-06R-A277-07 
##                           56                           79 
## TCGA-BH-A0E1-01A-11R-A056-07 TCGA-BH-A1EO-01A-11R-A137-07 
##                           85                           91 
## TCGA-BH-A203-01A-12R-A169-07 TCGA-A7-A0CH-11A-32R-A089-07 
##                           98                          116 
## TCGA-BH-A1EN-11A-23R-A13Q-07 TCGA-BH-A1FN-11A-34R-A13Q-07 
##                          154                          155 
## TCGA-BH-A1F6-11B-94R-A13Q-07 TCGA-BH-A1EV-11A-24R-A137-07 
##                          161                          183 
## TCGA-BH-A0BS-11A-11R-A12P-07 TCGA-E9-A1N6-11A-32R-A144-07 
##                          187                          218

Sixteen of the 20 simulated outliers are identified as influential observations (all except samples # 26, 84, 129 and 136). Two additional samples are also identified as influential: # 56 and # 85. The most influential observation by the Cook’s distance is # 155.

The Bonferroni outlier test:

library("car")
outlierTest(brca.glm.2var.out)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##                               rstudent unadjusted p-value Bonferonni p
## TCGA-E9-A1R7-01A-11R-A14M-07 -2.704399          0.0068428           NA

Observation # 44, the observation with the largest Studentized residual, is not flagged as an outlier.

The concordance c-index:

library(pROC)
library(parallel)
source('./aucLOGIT.boot.R')
brca.auc.2var.out <- aucLOGIT.boot.parallel(data.glm.2var.out, 1000)
plot(brca.auc.2var.out,xlab="Variables index",ylab="pvalue")

which(brca.auc.2var.out < 0.50)
##  [1]  23  25  26  29  35  41  44  54  55  56  79  84  91  98 116 122 129
## [18] 136 154 155 157 161 185 187 218

The observations flagged by the c-index include all simulated outliers except # 183, together with additional observations # 25, 55, 56, 122, 157 and 185.

The influence plot, i.e., the plot of Studentized residuals against hat values, with the area of each circle proportional to the observation’s Cook’s distance:

influencePlot(brca.glm.2var.out, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

Index plots of Cook’s distances, leverages, Studentized residuals, and outlier significance levels:

influenceIndexPlot(brca.glm.2var.out) 

Robust regression:

library(robustbase)
brca.glm.2var.out.rob=glmrob(data.Y.outliers~.,family=binomial,data=as.data.frame(data.glm.2var.out),control=glmrobMqle.control(tcc=1.5))
summary(brca.glm.2var.out.rob) 
## 
## Call:  glmrob(formula = data.Y.outliers ~ ., family = binomial, data = as.data.frame(data.glm.2var.out),      control = glmrobMqle.control(tcc = 1.5)) 
## 
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.2842     0.2410  -1.179    0.238    
## ENSG00000072778  -1.3644     0.2905  -4.697 2.64e-06 ***
## ENSG00000235505  -1.9491     0.3276  -5.950 2.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Robustness weights w.r * w.x: 
##  206 weights are ~= 1. The remaining 20 ones are summarized as
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2222  0.3460  0.4910  0.5048  0.6115  0.9219 
## 
## Number of observations: 226 
## Fitted by method 'Mqle'  (in 10 iterations)
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## No deviance values available 
## Algorithmic parameters: 
##    acc    tcc 
## 0.0001 1.5000 
## maxit 
##    50 
## test.acc 
##   "coef"

Robust regression down-weighted 20 observations.

Inspection of down-weighted observations:

plot(brca.glm.2var.out.rob$w.r)

which(brca.glm.2var.out.rob$w.r < 1)
##  [1]  23  26  29  35  41  44  54  56  84  91  98 116 129 136 154 155 161
## [18] 183 187 218

Down-weighted observations correspond to all simulated outliers except # 79, plus observation # 56, previously down-weighted for the original dataset (with no simulated outliers).

pred.brca.glm.2var.out.rob <- brca.glm.2var.out.rob$fitted.values
pred.brca.glm.2var.out.rob <- round(pred.brca.glm.2var.out.rob)
table(pred.brca.glm.2var.out.rob,data.Y.outliers)
##                           data.Y.outliers
## pred.brca.glm.2var.out.rob   0   1
##                          0 101  14
##                          1  14  97
which(pred.brca.glm.2var.out.rob != data.Y.outliers)
##  [1]  22  23  26  29  35  41  44  54  56  60  79  84  85  89  91  98 116
## [18] 129 134 136 154 155 161 183 187 206 218 220

A total of 28 misclassifications were obtained: 14 false positives and 14 false negatives. From these, the 20 outliers were misclassified into the original class, and 8 more observations were misclassified (# 22, 56, 60, 85, 89, 134, 206 and 220).

pred.table <- cbind(data.Y[index.outliers],data.Y.outliers[index.outliers],pred.glm.2var.out[index.outliers],pred.brca.glm.2var.out.rob[index.outliers])
colnames(pred.table) <- c("Y","Y.outlier","Y.GLM","Y.robustGLM")
pred.table
##                              Y Y.outlier Y.GLM Y.robustGLM
## TCGA-E9-A1N6-01A-11R-A144-07 1         0     1           1
## TCGA-BH-A0BJ-01A-11R-A056-07 1         0     1           1
## TCGA-A7-A13F-01A-11R-A12P-07 1         0     1           1
## TCGA-BH-A0DG-01A-21R-A12P-07 1         0     1           1
## TCGA-BH-A0DK-01A-21R-A056-07 1         0     1           1
## TCGA-E9-A1R7-01A-11R-A14M-07 1         0     1           1
## TCGA-BH-A209-01A-11R-A157-07 1         0     1           1
## TCGA-A7-A13E-01B-06R-A277-07 1         0     1           1
## TCGA-BH-A0DO-01B-11R-A12D-07 1         0     1           1
## TCGA-BH-A1EO-01A-11R-A137-07 1         0     1           1
## TCGA-BH-A203-01A-12R-A169-07 1         0     1           1
## TCGA-A7-A0CH-11A-32R-A089-07 0         1     0           0
## TCGA-BH-A0HA-11A-31R-A12P-07 0         1     0           0
## TCGA-BH-A0DV-11A-22R-A12P-07 0         1     0           0
## TCGA-BH-A1EN-11A-23R-A13Q-07 0         1     0           0
## TCGA-BH-A1FN-11A-34R-A13Q-07 0         1     0           0
## TCGA-BH-A1F6-11B-94R-A13Q-07 0         1     0           0
## TCGA-BH-A1EV-11A-24R-A137-07 0         1     0           0
## TCGA-BH-A0BS-11A-11R-A12P-07 0         1     0           0
## TCGA-E9-A1N6-11A-32R-A144-07 0         1     0           0

7 Discussion

Given the high dimensionality of the data and the fact that most data variability might be contained in a small subset of the original variables, three strategies for data dimension reduction were considered before outlier detection:

  1. variable selection by sparse logistic regression (VS) based on LASSO and Elastic net regularization, using the ‘glmnet’ R package;

  2. feature extraction by PLS-DA (FE), using the ‘caret’ R package;

  3. sparse variable selection and feature extraction by sPLS-DA (VSFE), using the ‘spls’ R package.

The first strategy, VS, enabled the selection of 39 of the 54820 original variables. With the FE strategy, 7 latent variables, which are linear combinations of the original variables, were extracted from the data, summarizing 40 % of the data variance. Finally, VSFE extracted 8 latent variables built from 42 original variables. The three models were fitted on a training set and their predictive performance evaluated on a test set. For both datasets, the three models yielded no misclassifications.

When searching for real and simulated outliers in the entire dataset after the above dimensionality reduction strategies, numerical problems were encountered: model fitting produced very large standard errors and z statistics of essentially zero. Logistic regression is sensitive to collinearities among the independent variables in the model, and numerical problems associated with complete separation and collinearity manifest themselves as extraordinarily large estimated standard errors and, sometimes, large estimated coefficients as well. This corresponds to the problems encountered for the BRCA dataset, which yielded untrustworthy models that compromised the subsequent detection of outliers and influential observations.
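
This behaviour is easy to reproduce on a toy example: with a completely separated predictor, glm() yields very large coefficients and standard errors and z statistics close to zero (a minimal illustration, unrelated to the BRCA data):

# sketch: complete separation inflates estimates and standard errors
y.sep <- c(0,0,0,0,1,1,1,1)
x.sep <- 1:8                     # perfectly separates the two classes
summary(glm(y.sep~x.sep,family=binomial))$coefficients
# glm() warns that fitted probabilities of 0 or 1 occurred; the estimates and
# standard errors are extremely large and the z values are essentially zero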

To illustrate the suitability of the methods described for the detection of outliers and influential observations, a subset of the BRCA dataset was used, based on two variables selected by stepwise logistic regression after dimensionality reduction using LASSO regularization. No outliers were detected, only influential observations.

The introduction of outliers, by randomly switching the response variable (‘tumour’ vs ‘normal’) in 10 % of the observations, was also investigated in order to evaluate the effectiveness of the methods used for the identification of outliers and influential observations. Both logistic and robust logistic regression were able to identify the 20 simulated outliers: with logistic regression, the outlier detection measures flagged the simulated outliers as influential observations and the model misclassified them into their original classes, while robust logistic regression down-weighted the simulated outliers and likewise classified them into their original classes.