Brain decoding with MLP

This part of the session aims to familiarize participants with Multilayer Perceptrons as one possible decoding model that can be applied to brain data. The objectives 📍 are:

  • get to know the basics of Multilayer Perceptrons

    • model creation

    • model training

    • model testing

Multilayer Perceptron

_images/multilayer-perceptron.png

Fig. 5 A multilayer perceptron with 25 units on the input layer, a single hidden layer with 17 units, and an output layer with 9 units. Figure generated with the NN-SVG tool by Alexander Lenail. The figure is shared under a CC-BY 4.0 license.

We are going to train a Multilayer Perceptron (MLP) classifier for brain decoding on the Haxby dataset. MLPs are one of the most basic architectures of artificial neural networks. As such, MLPs consist of input and output layers as well as hidden layers that process the input through a succession of transformations towards the output layer, which performs the task at hand, e.g. a classification or regression. Like other machine learning models for supervised learning, an MLP initially goes through a training phase. During this supervised phase, the network is taught what to look for and what the desired output is via its objective function. Concretely, this means minimizing the loss, i.e. the deviation of the predictions from the “ground truth”, and thus increasing performance.

Perceptrons were actually among the first ANNs to appear, specifically the Mark I Perceptron, which you can see below.

https://preview.redd.it/wgzps0pvcny91.jpg?width=640&crop=smart&auto=webp&s=0b2e56dc4eaa886ebd01ac0cd8e51fc4efdb1d01

Fig. 6 Frank Rosenblatt with a Mark I Perceptron computer in 1960.

In this tutorial, we are going to train the simplest MLP architecture featuring one input layer, one output layer and just one hidden layer.
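
To give a first impression of what such a network computes, here is a minimal NumPy sketch of a forward pass through a one-hidden-layer MLP (random, untrained weights, purely for illustration; the dimensions anticipate our data below):

import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.standard_normal((4, 675))    # 4 samples, 675 input features (voxels)

W1 = rng.standard_normal((675, 50))      # input -> hidden weights
b1 = np.zeros(50)
W2 = rng.standard_normal((50, 9))        # hidden -> output weights
b2 = np.zeros(9)

hidden = np.maximum(0, X_toy @ W1 + b1)  # hidden layer with ReLU activation
logits = hidden @ W2 + b2
# softmax turns the raw outputs into class probabilities
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(probs.shape)                       # (4, 9): one probability per category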

Theoretical motivation

The previous tutorial on brain decoding with SVM shows how to use a linear combination of brain features to train a predictor.

Let’s take a moment to consider this: a 1-layer perceptron with a sigmoid activation function models the relation between X (the input data) and y (the predicted data) the same way a logistic regression would: \(\hat{y} = \sigma(X \beta + \beta_0)\)

_images/logistic_regression.png

Fig. 7 A fitted logistic regression function classifying two different classes. Courtesy of Jérôme Dockès.
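
To make this equivalence concrete, here is a small self-contained NumPy sketch (toy data and random weights, purely for illustration) of the prediction function above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X_toy = rng.standard_normal((5, 3))    # 5 samples, 3 features (toy data)
beta = rng.standard_normal(3)          # weights
beta0 = 0.5                            # intercept

y_hat = sigmoid(X_toy @ beta + beta0)  # predicted probabilities, shape (5,)
print(y_hat)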

If one optimizes the parameters of this MLP to minimize a cross-entropy loss, they’re actually optimizing for the same objective function as in a classical logistic regression problem: \(\underset{\beta, \beta_0}{\min} \; -\sum_k \left[ y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \right]\)
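
Continuing the toy sketch from above, the corresponding cross-entropy loss for hypothetical binary labels can be computed directly:

y_toy = np.array([0, 1, 1, 0, 1])  # hypothetical ground-truth labels
loss = -np.sum(y_toy * np.log(y_hat) + (1 - y_toy) * np.log(1 - y_hat))
print(loss)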

As a rule of thumb, one can consider that a 1-layer perceptron (and therefore any last layer of a multi-layer perceptron) works similarly to an SVC.

A big motivation for using multilayer perceptrons is that they can capture non-linear relationships in the data. When training such models, the hope is that the hidden layers of the model will find meaningful non-linear combinations of the input features which help us solve our decoding problem.
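
A small sketch can illustrate why the non-linear activation between layers matters: without it, stacked linear layers collapse into a single linear transformation (toy dimensions, purely for illustration):

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10)
W1 = rng.standard_normal((8, 10))
W2 = rng.standard_normal((4, 8))

# two linear layers are equivalent to one linear layer with weights W2 @ W1
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))               # True

# a ReLU between the layers breaks this equivalence
print(np.allclose(W2 @ (W1 @ x), W2 @ np.maximum(0, W1 @ x)))  # False (in general)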

Getting the data

We are going to work with the Haxby dataset [HGF+01] again. You can check the section An overview of the Haxby dataset for more details. Here we are going to quickly download the dataset and prepare it for machine learning applications, with a set of predictive variables, the brain time series X, and a dependent variable, the respective cognitive processes/functions/percepts y.

import os
import warnings
warnings.filterwarnings(action='once')

from nilearn import datasets
# We are fetching the data for subject 4
data_dir = os.path.join('..', 'data')
sub_no = 4
haxby_dataset = datasets.fetch_haxby(subjects=[sub_no], fetch_stimuli=True, data_dir=data_dir)
func_file = haxby_dataset.func[0]

# mask the data
from nilearn.input_data import NiftiMasker
mask_filename = haxby_dataset.mask_vt[0]
masker = NiftiMasker(mask_img=mask_filename, standardize=True, detrend=True)
X = masker.fit_transform(func_file)

# cognitive annotations
import pandas as pd
behavioral = pd.read_csv(haxby_dataset.session_target[0], delimiter=' ')
y = behavioral['labels']

As an initial check, we’ll have a look at the size of X and y:

categories = y.unique()
print(categories)
print(y.shape)
print(X.shape)
['rest' 'face' 'chair' 'scissors' 'shoe' 'scrambledpix' 'house' 'cat'
 'bottle']
(1452,)
(1452, 675)

So we have 1452 time points, each with one label indicating the respective stimulus percept, and for each time point we have recordings of brain activity obtained via fMRI across 675 voxels (within the VT mask). We can also see that the stimulus percepts span 9 different categories.

However, concerning our planned analyses, we need to convert our categories into a one-hot encoding:

# creating instance of one-hot-encoder
from sklearn.preprocessing import OneHotEncoder
import numpy as np
enc = OneHotEncoder(handle_unknown='ignore')
y_onehot = enc.fit_transform(np.array(y).reshape(-1, 1))
# turn the sparse matrix into a pandas dataframe
y = pd.DataFrame(y_onehot.toarray())
display(y[:10])
0 1 2 3 4 5 6 7 8
0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
6 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
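
Note that the one-hot columns follow the alphabetical order of the category names (as determined by the OneHotEncoder), not the order returned by y.unique(). We can check this, which will also matter later when labeling the confusion matrix:

# the encoder stores the category corresponding to each one-hot column
print(enc.categories_[0])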

Training a model

As introduced in the prior tutorials, one of the most important aspects of machine learning is the split between training and test sets. MLPs are no exception to that, and thus we need to split our dataset accordingly. We will keep 20% of the time points as a test set; during training, a further 20% of the training set will be held out for validation (via the validation_split argument further below).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)   

With that, we can already build our MLP. Here, we are going to use TensorFlow and Keras. As with every other ANN, we need to import the respective components, here the model and layer type. In our case we will use a Sequential model and Dense layers.

from keras.models import Sequential
from keras.layers import Dense

A note regarding our MLP

Please note that the example MLP we are going to create and train here is rather simple, as we want to enable its application on machines with rather limited computational resources (i.e. your laptops or Binder). “Real-world” models are usually more complex and might also entail different types of layers.

Initially, we need to create our, so far, empty model.

# create our, so far, empty Sequential model
model_mlp = Sequential()

Next, we can add the layers to our model, starting with the input layer. Given that this is a rather short introduction to the topic that does not focus on ANNs, we are going to set the kernel initialization and activation function to appropriate defaults (please have a look at the Introduction to deep learning session for more information).

model_mlp.add(Dense(50 , input_dim = 675, kernel_initializer="uniform", activation = 'relu'))

As noted above, we are using Dense layers, and as you can see, we set the input dimension to 675. You might have already noticed that this is the number of voxels we have data from. Setting the input dimension according to the dimensions of the data is rather important and relates to what is referred to as the semantic gap: the transformation of actions & percepts conducted/perceived by humans into computational representations. For example, to a computer a picture is “nothing” but a huge array, and that array is what will be submitted to the input layer of an ANN (note: this also holds true for basically any other type of data). Here, our MLP receives the extracted brain activity patterns as input, which are already in the right array format thanks to nilearn. Thus, always think carefully about what your input data entail and how they are structured, and then set up your input layer accordingly.
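
Rather than hard-coding 675, the input dimension can also be read directly off the data; a quick sanity check (a sketch, not required for the model above):

# the second dimension of X_train is the number of voxels, i.e. input features
n_voxels = X_train.shape[1]
print(n_voxels)   # 675 for the VT mask of this subject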

Next, we are going to add one hidden layer.

model_mlp.add(Dense(30, kernel_initializer="uniform", activation = 'relu'))

And because we are creating a very simple MLP with only three layers, we can already add our output layer. It uses the softmax activation function, given that we aim to train our MLP to predict the different categories that were perceived by the participants from their brain activity patterns.

model_mlp.add(Dense(len(categories), activation = 'softmax'))

To get a nice overview of our ANN, we can now use the .summary() function, which will provide us with the model type, the model parameters, and, for each layer, its type, shape and number of parameters.

model_mlp.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 50)                33800     
                                                                 
 dense_1 (Dense)             (None, 30)                1530      
                                                                 
 dense_2 (Dense)             (None, 9)                 279       
                                                                 
=================================================================
Total params: 35,609
Trainable params: 35,609
Non-trainable params: 0
_________________________________________________________________
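
The Param # column can be verified by hand: each Dense layer has (number of inputs × number of units) weights plus one bias per unit:

# verifying the parameter counts from the summary by hand
print(675 * 50 + 50)   # 33800 -> dense
print(50 * 30 + 30)    # 1530  -> dense_1
print(30 * 9 + 9)      # 279   -> dense_2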

With that, we have already created our MLP architecture, which is now ready to be compiled! In this step, we will set the optimizer, loss function and metric, i.e. the components that define how our MLP will learn.

model_mlp.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

Now it’s time to train our MLP. Thus, we have to fit it to our data, specifically only the training data. Here, we are going to provide a few more hyperparameters that define how our MLP is going to learn: the batch size, the number of epochs, and the validation split. We will assign the respective output to a variable so that we can investigate our MLP’s learning process.

history = model_mlp.fit(X_train, y_train, batch_size = 10,
                             epochs = 10, validation_split = 0.2)
Epoch 1/10
93/93 [==============================] - 1s 3ms/step - loss: 1.6752 - accuracy: 0.4224 - val_loss: 1.4664 - val_accuracy: 0.4506
Epoch 2/10
93/93 [==============================] - 0s 1ms/step - loss: 1.1285 - accuracy: 0.6207 - val_loss: 1.2011 - val_accuracy: 0.5665
Epoch 3/10
93/93 [==============================] - 0s 1ms/step - loss: 0.8159 - accuracy: 0.7392 - val_loss: 1.0526 - val_accuracy: 0.6266
Epoch 4/10
93/93 [==============================] - 0s 2ms/step - loss: 0.5575 - accuracy: 0.8394 - val_loss: 0.9237 - val_accuracy: 0.6781
Epoch 5/10
93/93 [==============================] - 0s 2ms/step - loss: 0.3661 - accuracy: 0.8912 - val_loss: 0.8652 - val_accuracy: 0.7124
Epoch 6/10
93/93 [==============================] - 0s 2ms/step - loss: 0.2418 - accuracy: 0.9364 - val_loss: 0.8276 - val_accuracy: 0.7554
Epoch 7/10
93/93 [==============================] - 0s 2ms/step - loss: 0.1345 - accuracy: 0.9752 - val_loss: 0.7945 - val_accuracy: 0.7639
Epoch 8/10
93/93 [==============================] - 0s 2ms/step - loss: 0.0738 - accuracy: 0.9946 - val_loss: 0.8261 - val_accuracy: 0.7468
Epoch 9/10
93/93 [==============================] - 0s 1ms/step - loss: 0.0827 - accuracy: 0.9806 - val_loss: 0.8526 - val_accuracy: 0.7639
Epoch 10/10
93/93 [==============================] - 0s 1ms/step - loss: 0.0321 - accuracy: 0.9989 - val_loss: 0.8581 - val_accuracy: 0.7639

This looks about like what we would expect the learning process to be: across epochs, the loss is decreasing and the accuracy is increasing.

A note regarding the learning process of our MLP

Comparable to its architecture, our MLP’s learning process is also not really what you would see in the “real world”. Usually, ANNs are trained far more extensively: for longer periods of time, over more epochs, and on more data. However, we keep it rather short here as we want to enable its application on machines with rather limited computational resources (i.e. your laptops or Binder).
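
For illustration, here is a sketch of how longer training runs are commonly safeguarded: an EarlyStopping callback stops training once the validation loss stops improving (shown commented out; we keep the short 10-epoch run from above):

from keras.callbacks import EarlyStopping

# stop once val_loss has not improved for 5 epochs, keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
# model_mlp.fit(X_train, y_train, batch_size=10, epochs=100,
#               validation_split=0.2, callbacks=[early_stop])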

While this is already informative, we can also plot the loss and accuracy in the training and validation sets respectively. Let’s start with the loss.

import matplotlib.pyplot as plt
import seaborn as sns

plt.plot(history.history['loss'], color='m')
plt.plot(history.history['val_loss'], color='c')
plt.title('MLP loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc = 'upper right')

sns.despine(offset=5)

plt.show()
_images/mlp_decoding_26_0.png

And now the same for the accuracy.

import matplotlib.pyplot as plt
import seaborn as sns

plt.plot(history.history['accuracy'], color='m')
plt.plot(history.history['val_accuracy'], color='c')
plt.title('MLP accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc = 'upper left')

sns.despine(offset=5)

plt.show()
_images/mlp_decoding_28_0.png

How would you interpret these plots…

concerning our MLP’s learning process? Does it make sense? If not, what should it look like? Could you use these plots to evaluate certain aspects of the learning process, e.g. regularization?

Assessing performance

After evaluating the training of our MLP, we of course also need to evaluate its (predictive) performance. Here, this refers to the accuracy of our MLP’s outcomes, i.e. its predictions. We already saw this in the above plots and during the training across epochs, but let’s check the accuracy of the predictions on the training set again:

from sklearn.metrics import classification_report
y_train_pred = model_mlp.predict(X_train)
print(classification_report(y_train.values.argmax(axis = 1), y_train_pred.argmax(axis=1)))
              precision    recall  f1-score   support

           0       0.84      0.91      0.87        85
           1       0.97      0.97      0.97        88
           2       0.99      0.88      0.93        90
           3       0.99      0.96      0.97        81
           4       0.99      0.95      0.97        91
           5       0.97      0.98      0.97       471
           6       0.87      0.96      0.91        81
           7       0.97      0.97      0.97        90
           8       0.93      0.88      0.90        84

    accuracy                           0.95      1161
   macro avg       0.94      0.94      0.94      1161
weighted avg       0.95      0.95      0.95      1161

While you might think: “Oh, that’s awesome, great performance.”, such outcomes are usually perceived as dangerously high and indicate that something is off…

Why should a close-to-perfect performance indicate that something is wrong?

What do you think is the rationale for saying that very high scores are actually “suspicious” and tell us that something is most likely wrong? Try thinking about the things you’ve learned so far: training/test/validation datasets and their sizes, models, predictions, etc.

Luckily, we did split our dataset into independent training and test sets. So, let’s check our MLP’s performance on the test set:

y_test_pred = model_mlp.predict(X_test)
print(classification_report(y_test.values.argmax(axis = 1), y_test_pred.argmax(axis=1)))
              precision    recall  f1-score   support

           0       0.61      0.74      0.67        23
           1       1.00      0.55      0.71        20
           2       0.69      0.61      0.65        18
           3       0.93      0.96      0.95        27
           4       1.00      0.82      0.90        17
           5       0.87      0.91      0.89       117
           6       0.67      0.81      0.73        27
           7       0.94      0.89      0.91        18
           8       0.68      0.62      0.65        24

    accuracy                           0.82       291
   macro avg       0.82      0.77      0.78       291
weighted avg       0.83      0.82      0.82       291

As you can see, the scores, i.e. the performance, drop quite a bit. Do you know why, and which scores you would report, e.g. in a publication?

Besides checking the overall scores, there are other options to further evaluate our MLP’s (or basically any other model’s) performance. One of the most commonly used is the confusion matrix (which you have most likely seen before in this course). A confusion matrix displays how often samples of a given true label were predicted as each of the possible labels, thus, for example, providing insights into how well categories can be differentiated. To implement this, we initially have to compute the confusion matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

cm_mlp = confusion_matrix(y_test.values.argmax(axis = 1), y_test_pred.argmax(axis=1))
# normalize each row so that entries reflect proportions per true label
model_conf_matrix = cm_mlp.astype('float') / cm_mlp.sum(axis = 1)[:, np.newaxis]

After that, we can plot it for evaluation.

import pandas as pd
import seaborn as sns

# rows are the true labels, columns the predictions; both follow the
# alphabetical category order of the one-hot encoding
df_cm = pd.DataFrame(model_conf_matrix, index = enc.categories_[0],
                     columns = enc.categories_[0])

plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot = True, cmap = 'Blues', square = True)
plt.xticks(rotation = 45)
plt.title('MLP decoding results - confusion matrix' , fontsize = 15, fontweight = 'bold')
plt.xlabel("true labels", fontsize = 14, fontweight = 'bold')
plt.ylabel("predicted labels", fontsize = 14, fontweight = 'bold')
plt.show()
_images/mlp_decoding_37_0.png

Based on this outcome: how would you interpret the confusion matrix? Are some categories better "decodable" than others? Could we even make such a statement?

Summary

With that, we have already reached the end of this tutorial, in which we talked about how to create, train and evaluate an MLP as one possible decoding model that can be applied to brain data. As mentioned before, the MLP utilized here is rather simple, and models you see (and maybe use) out in the “real world” will most likely be way more complex. However, their application to brain data concerning input, hidden and output layers follows the same outline.

Tip

Unfortunately, visualizing the features/transformations of an ANN is quite often not straightforward as it depends on the given ANN architecture. However, you can check this fantastic distill article to learn more about feature visualization in artificial neural networks.

Exercises

  • What is the most difficult category to decode? Why?

  • The model seemed to overfit. Try adding a Dropout layer to regularize the model. You can read about dropout in Keras in this blog post.

  • Try to add layers or hidden units, and observe the impact on overfitting and training time.