Shapley values - prediction of lung cancer histological subtype with an aggregated radiomic model
In this jupyter notebook we provide a short example for the use of Shapley values to explain aggregated radiomic models. We study the classification of the histological subtype of non-small cell lung cancer (NSCLC) from baseline 18F-FDG PET images. To do so, we use a radiomic model that aggregates features extracted from all the metastases of the patient.
[1]:
%load_ext autoreload
%autoreload 2
# Setup for local running - please delete this block
import sys
sys.path.append('C:\\Users\\ncaptier\\Documents\\GitHub\\radshap')
import collections
from tqdm import tqdm
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, recall_score, balanced_accuracy_score
from sklearn.base import clone
from radshap.shapley import Shapley
from radshap.plot import plot_bars, plot_pet
0. Load radiomic and clinical data
Our cohort gathers 110 NSCLC metastatic patients with a baseline 18F-FDG PET scan (for the diagnosis of the metastatic disease) acquired before the beginning of their first line treatment. All their metastases were segmented by a 12 years experienced nuclear medicine physician. Here we study the binary classification task that consists in distinguishing the adenocarcinomas (i.e most common subtype of NSCLC) from the other subtypes (including squamous cell carcinomas).
Note: We will need to adress the class imbalance to avoid ending up with a very low specificity.
[2]:
df_clinicals = pd.read_csv("..\\data\\data_clinicals.csv", index_col=0)
fig, ax = plt.subplots(figsize=(8, 6))
sns.barplot(data=df_clinicals["Histology"].value_counts().reset_index(), x="index", y="Histology", ax=ax)
ax.set(xlabel=None, ylabel=None)
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)
sns.despine()
We processed the segmented lesions with 3 steps:
We applied an absolute threshold of 2.5 standardized uptake value (SUV) units, meaning that all voxel values less than 2.5 were excluded from the segmented tumor regions.
We filtered out very small lesions (< 1 mL) as well as diffuse lesions whose precise segmentation and characterization by shape features is questionable (lymphangitic spread, diffuse pleural metastases, diffuse myocardial metastases, and diffuse subdiaphragmatic metastases).
For all the remaining lesions we used pyradiomics to extract 4 shape radiomic features and 6 first-order radiomic features.
We first load the radiomic features of all the filtered lesions of every patients (each row corresponds to one lesion of one patient). In the following we will use this dataframe to aggregate features accross all the lesions and obtain one vector of features per patient.
[3]:
df_radiomics_all = pd.read_csv("..\\data\\data_allmeta.csv", index_col=0)
# we log-transform the volume feature to avoid linear model to deal with highly right skewed distributions
df_radiomics_all.loc[:, "original_shape_VoxelVolume"] = df_radiomics_all["original_shape_VoxelVolume"].apply(lambda x: np.log(x+1))
df_radiomics_all.iloc[:, 2:].head()
[3]:
| original_shape_Sphericity | original_shape_Elongation | original_shape_Flatness | original_shape_VoxelVolume | original_firstorder_Entropy | original_firstorder_Kurtosis | original_firstorder_Maximum | original_firstorder_Mean | original_firstorder_QuartileDispersion | original_firstorder_Skewness | |
|---|---|---|---|---|---|---|---|---|---|---|
| patient_0 | 0.532479 | 0.741784 | 0.388096 | 9.558812 | 2.746364 | 4.173133 | 5.917239 | 3.208351 | 0.118057 | 1.088778 |
| patient_0 | 0.651127 | 0.581162 | 0.453587 | 8.592857 | 1.959371 | 3.138165 | 3.937612 | 2.918054 | 0.067240 | 0.769157 |
| patient_0 | 0.616592 | 0.451983 | 0.247072 | 7.636752 | 2.197037 | 2.902717 | 4.058545 | 2.969259 | 0.093076 | 0.728094 |
| patient_0 | 0.863482 | 0.942119 | 0.830428 | 9.157889 | 4.209365 | 2.012266 | 8.961629 | 4.781307 | 0.290141 | 0.368464 |
| patient_0 | 0.624193 | 0.811512 | 0.427006 | 10.863470 | 3.150671 | 22.400791 | 13.207974 | 3.491949 | 0.129028 | 3.597876 |
We then load the radiomic features of the primary tumor of each patient. In the following we will use this dataframe to build a model only based on the radiomic characterization of the primary tumor and compare it to the aggregated radiomic model.
[4]:
df_radiomics_primary = pd.read_csv("..\\data\\data_primary.csv", index_col = 0)
# we log-transform the volume feature to avoid linear model to deal with highly right skewed distributions
df_radiomics_primary.loc[:, "original_shape_VoxelVolume"] = df_radiomics_primary["original_shape_VoxelVolume"].apply(lambda x: np.log(x+1))
df_radiomics_primary.iloc[:, 2:].head()
[4]:
| original_shape_Sphericity | original_shape_Elongation | original_shape_Flatness | original_shape_VoxelVolume | original_firstorder_Entropy | original_firstorder_Kurtosis | original_firstorder_Maximum | original_firstorder_Mean | original_firstorder_QuartileDispersion | original_firstorder_Skewness | |
|---|---|---|---|---|---|---|---|---|---|---|
| patient_0 | 0.624193 | 0.811512 | 0.427006 | 10.863470 | 3.150671 | 22.400791 | 13.207974 | 3.491949 | 0.129028 | 3.597876 |
| patient_1 | 0.436950 | 0.604767 | 0.406622 | 10.511757 | 3.481742 | 5.855306 | 9.717072 | 3.664580 | 0.185013 | 1.707722 |
| patient_2 | 0.780999 | 0.705620 | 0.581406 | 8.998260 | 3.800767 | 2.483848 | 7.343111 | 4.192277 | 0.211119 | 0.623496 |
| patient_3 | 0.668431 | 0.595939 | 0.350714 | 11.943259 | 4.430960 | 2.596109 | 12.187289 | 7.437633 | 0.151533 | 0.041012 |
| patient_4 | 0.699124 | 0.642183 | 0.575850 | 12.270703 | 5.803551 | 3.076536 | 34.302750 | 9.359064 | 0.411648 | 0.582178 |
1. Aggregated model vs primary tumor model
Here we compare two simple logistic regression classifiers: * A classifier which takes as input 10 radiomic features from the primary tumor - primary tumor model * A classifier which takes as input the minimum value, the maximum value, and the standard deviation of the same 10 radiomic features over all the metastases of the patient - aggregated radiomic model
1.1 Aggregate features with min, max and std
We apply min, max, and standard deviation (i.e., std) functions over all the lesions of every patients for each radiomic features.
Note: We need to make sure that the features are properly ordered (i.e all the max features followed by all the min features and all the std features). This reordering will greatly facilitate the application of our Shapley function to interpret the aggregated radiomic model. By default by pandas agg function orders columns like this | max_feature1 | min_feature1 | std_feature_1 | … | max_feature10 | min_feature10 | std_feature10 |, we rather want the columns to be ordered like | max_feature1 … | max_feature10 | min_feature1 | … | min_feature10 | std_feature1 | … | std_feature10 |
[5]:
def reorder_columns(columns, function_names):
d = {f_name: [] for f_name in function_names}
for col in columns:
d[col[1]].append(col)
return sum(d.values(), [])
[6]:
fun_std = lambda x: x.std(ddof=0) #ddof is set to 0 to normalize by N and not N-1 and thus handle the cases where there is only one lesion
fun_std.__name__ = "std"
df_agg = df_radiomics_all.iloc[:, 2:].groupby(level=0).agg(("max", "min", fun_std)).loc[df_radiomics_primary.index]
#reorder columns
df_agg = df_agg[reorder_columns(df_agg.columns, function_names = ("max", "min", "std"))]
df_agg.columns = df_agg.columns.map('_'.join)
df_agg.head()
[6]:
| original_shape_Sphericity_max | original_shape_Elongation_max | original_shape_Flatness_max | original_shape_VoxelVolume_max | original_firstorder_Entropy_max | original_firstorder_Kurtosis_max | original_firstorder_Maximum_max | original_firstorder_Mean_max | original_firstorder_QuartileDispersion_max | original_firstorder_Skewness_max | ... | original_shape_Sphericity_std | original_shape_Elongation_std | original_shape_Flatness_std | original_shape_VoxelVolume_std | original_firstorder_Entropy_std | original_firstorder_Kurtosis_std | original_firstorder_Maximum_std | original_firstorder_Mean_std | original_firstorder_QuartileDispersion_std | original_firstorder_Skewness_std | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| patient_0 | 0.863482 | 0.942119 | 0.830428 | 10.863470 | 4.209365 | 22.400791 | 13.207974 | 4.781307 | 0.290141 | 3.597876 | ... | 0.101204 | 0.160786 | 0.177240 | 0.976948 | 0.726723 | 7.276279 | 3.289046 | 0.626308 | 0.071618 | 1.105949 |
| patient_1 | 0.837041 | 0.900909 | 0.596313 | 10.511757 | 4.305888 | 5.855306 | 10.006734 | 4.819726 | 0.283288 | 1.707722 | ... | 0.162958 | 0.138764 | 0.117702 | 1.389979 | 0.537648 | 1.340308 | 1.941857 | 0.531018 | 0.054307 | 0.387661 |
| patient_2 | 0.780999 | 0.705620 | 0.581406 | 8.998260 | 3.800767 | 2.483848 | 7.343111 | 4.192277 | 0.211119 | 0.623496 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| patient_3 | 0.867612 | 0.825841 | 0.696419 | 11.943259 | 4.430960 | 2.748872 | 12.187289 | 7.437633 | 0.219823 | 0.692009 | ... | 0.081349 | 0.103375 | 0.144040 | 1.953295 | 0.227022 | 0.062997 | 1.544987 | 1.066581 | 0.027920 | 0.270425 |
| patient_4 | 0.888275 | 0.899604 | 0.889127 | 12.270703 | 5.803551 | 7.975884 | 34.302750 | 9.359064 | 0.476792 | 2.106428 | ... | 0.086014 | 0.146873 | 0.145913 | 1.590111 | 0.915027 | 1.607292 | 8.746587 | 1.971811 | 0.124980 | 0.451651 |
5 rows × 30 columns
1.2 Train and test logistic regression models with repeated cross-validation schemes
We use a repeated stratified cross-validation scheme with 5 folds and 100 repeats to train and test our models. Several classification metrics are computed for each cross-validation scheme and averaged over the 100 repeats (balanced accuracy, ROC AUC, sensitivity, and specificity).
[7]:
# define specificity scorer
def specificity_score(y_true, y_pred):
TN = ((y_true == 0) & (y_pred == 0)).sum()
N = (1-y_true).sum()
return TN/N
def fit_and_score(clf, X, y, train, test, results_dict):
Xtrain, Xtest, ytrain, ytest = X[train, :], X[test, :], y[train], y[test]
clf.fit(Xtrain, ytrain)
preds = clf.predict(Xtest)
probas = clf.predict_proba(Xtest)[:,1]
results_dict["roc_auc"].append(roc_auc_score(ytest, probas))
results_dict["balanced_accuracy"].append(balanced_accuracy_score(ytest, preds))
results_dict["sensitivity"].append(recall_score(ytest, preds))
results_dict["specificity"].append(specificity_score(ytest, preds))
return probas
[8]:
X_all = df_agg.values
X_primary = df_radiomics_primary.iloc[:, 2:].values
y = 1*(df_clinicals["Histology"] == 'adenocarcinoma').values
[9]:
pipe = Pipeline(steps = [("scaling", StandardScaler()),
("logistic", LogisticRegression(class_weight = "balanced"))
]
)
[10]:
np.random.seed(0) #fix seed for reproducibility
results_all = collections.defaultdict(list)
results_primary = collections.defaultdict(list)
for i in tqdm(range(100)):
prim_dict = collections.defaultdict(list)
all_dict = collections.defaultdict(list)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
# Iterate over the 5 folds of the cross-validation scheme
for train, test in cv.split(np.zeros(len(y)), y):
probas_primary = fit_and_score(clone(pipe), X_primary, y, train, test, results_dict=prim_dict)
probas_all = fit_and_score(clone(pipe), X_all, y, train, test, results_dict=all_dict)
# Aggregate the results of the 5 folds
for metric in prim_dict.keys():
results_primary[metric].append(np.mean(prim_dict[metric]))
results_all[metric].append(np.mean(all_dict[metric]))
100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:07<00:00, 13.76it/s]
1.3 Display the results
[11]:
results_all = pd.DataFrame(results_all)
results_primary = pd.DataFrame(results_primary)
results_all['model'] = 'all'
results_primary['model'] = 'primary'
results = pd.concat([results_all, results_primary], axis=0)
results.groupby("model").mean()
[11]:
| roc_auc | balanced_accuracy | sensitivity | specificity | |
|---|---|---|---|---|
| model | ||||
| all | 0.777538 | 0.710908 | 0.762750 | 0.659067 |
| primary | 0.722199 | 0.661276 | 0.702551 | 0.620000 |
[12]:
results.groupby("model").std()
[12]:
| roc_auc | balanced_accuracy | sensitivity | specificity | |
|---|---|---|---|---|
| model | ||||
| all | 0.038762 | 0.046618 | 0.035242 | 0.077155 |
| primary | 0.023648 | 0.027652 | 0.027423 | 0.047690 |
[13]:
fig, ax = plt.subplots(figsize=(10, 7))
sns.barplot(data = results.melt(id_vars = ["model"]), x="variable", y="value", hue="model", ci=None, ax=ax)
for i, m in enumerate(['all', 'primary']):
ax.errorbar(x=np.arange(4) + 0.2*(-1 + 2*i),
y=results[results["model"]== m].iloc[:, :-1].mean(axis=0).values,
yerr=results[results["model"]== m].iloc[:, :-1].std(axis=0).values,
fmt='none',
ecolor='k',
capsize=5,
elinewidth=1)
ax.set_ylim(0.5, 0.82)
ax.set(xlabel=None, ylabel=None)
ax.set_axisbelow(True)
ax.yaxis.grid(color='gray', linestyle='dashed')
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)
ax.legend(fontsize=12)
sns.despine()
The aggregated radiomic model that takes into account all the metastases of the patient outperforms the model that only considers his primary lung tumor. In particular, it is more sensitive and detects better the adenocarcinomas. It is a promising result since the aggregation functions that we used were very simple and we can thus hope for more striking improvements with more complex aggregation strategies.
2. Interpretation of the aggregated model with Shapley values
Here we use the radshap package to interpret our aggregated radiomic model and highlight the importance of each lesion for the prediction of the histological subtype. In particular we use the radshap package to study which metastases the aggregated model uses (other than the primary tumor) to better predict the histological subtype of the patient
We first re-train both the aggregated radiomic model and the primary model on the whole data set (i.e 110 patients).
We use these trained models to predict the histological subtype of each patient. We then focus on those who were missclassified by the primary model and correctly classified by the aggregated model.
For such patients we compute the Shapley values of each of their metastases to highlight their importance for the prediction made by the aggregated radiomic model.
2.1 Train and predict on the whole data set
[14]:
model_all = clone(pipe).fit(X_all, y)
model_primary = clone(pipe).fit(X_primary, y)
[15]:
df_probas = pd.DataFrame(np.hstack([model_all.predict_proba(X_all)[:, 1].reshape(-1, 1),
model_primary.predict_proba(X_primary)[:, 1].reshape(-1, 1),
y.reshape(-1, 1)]
),
index=df_agg.index,
columns = ["probas_all", "probas_primary", "labels"])
df_probas["probas_gap"] = np.abs(df_probas["probas_all"] - df_probas["probas_primary"])
mask = (model_all.predict(X_all) == y) & (model_primary.predict(X_primary) != y)
df_probas[mask].sort_values(by="probas_gap", ascending = False).iloc[:, :-1].head(10)
[15]:
| probas_all | probas_primary | labels | |
|---|---|---|---|
| patient_5 | 0.918168 | 0.400524 | 1.0 |
| patient_106 | 0.640970 | 0.150564 | 1.0 |
| patient_7 | 0.702580 | 0.229073 | 1.0 |
| patient_31 | 0.316072 | 0.789244 | 0.0 |
| patient_27 | 0.945055 | 0.485545 | 1.0 |
| patient_107 | 0.872115 | 0.454678 | 1.0 |
| patient_109 | 0.246746 | 0.590126 | 0.0 |
| patient_72 | 0.191475 | 0.530324 | 0.0 |
| patient_44 | 0.745959 | 0.418653 | 1.0 |
| patient_73 | 0.816111 | 0.497330 | 1.0 |
Overall 20 patients were misclassified by the primary model and correctly classified by the aggregated radiomic model. The table above shows the predicted probabilities of the class 1 (i.e adenocarcinoma) for both models as well as the true label of each patient. The 20 patients are ordered by the gap in absolute value between the predicted probability of the primary model and the predicted probability of the aggregated model (from top to bottom).
2.2 Classical feature importance analysis
We can first take advantage of the coefficients of the logistic regression to highlight the importance of each aggregated radiomic feature. A positive coefficient means that a high value of the associated feature increases the probability of adenocarcinoma while a negative coefficient means that a high value of the associated feature decreases the probability of adenocarcinoma.
Note: Since the features are standardized before training the logisitic regression, a “high value” should be understood as “a value greater than the average of this feature”.
[16]:
fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(data = pd.DataFrame(model_all["logistic"].coef_.reshape(-1,1), index = df_agg.columns).sort_values(by=0, ascending=False).reset_index(),
y="index",
x=0,
orient="h",
ax=ax,
color = "gray")
ax.set(xlabel=None, ylabel=None)
ax.set_axisbelow(True)
ax.yaxis.grid(color="gray", linestyle="dashed")
Even for such a simple model we see that it is not straightforward to understand the model’s behavior from this feature importance analysis alone.
2.3 Compute and display Shapley values
Contrary to the analysis above, Shapley values computed by the radshap package allow us to study each patient specifically and highlight the importance of their metastases for the prediction of the model. Besides it can explain the decision of the model for new unseen patients.
[17]:
shap = Shapley(predictor = lambda x: model_all.predict_proba(x)[:, 1], aggregation = [('max', None), ('min', None), ('std', None)])
Patient 31 (diagnosed with another subtype)
[23]:
# extract the radiomic features of all the lesions of the patient 31
x_explain = df_radiomics_all.loc["patient_31"].iloc[:, 2:].copy().values
shap_values = shap.explain(x_explain)
We first use a bar plot to display the Shapley values. They are ranked from top to bottom in decreasing order with respect to their absolute value. A negative Shapley value means that considering the associated lesion for the prediction of the histological subtype with the aggregated model decreases the probability of the class 1 (i.e adenocarcinoma) while a positive Shapley value means that it increases the probability of the class 1.
For this patient with only two lesions it is quite simple to explicitly detail the formula of the Shapley value. It will help us better understand its meaning:
\begin{align*} &\phi(\text{META_LUNG}) = \dfrac{1}{2} \left( pred(\{\text{META_LUNG, PRIMARY}\}) - pred(\{\text{PRIMARY}\}) \right) + \dfrac{1}{2} \left( pred(\{\text{META_LUNG}\}) - pred(\emptyset) \right)\\ &\phi(\text{META_LUNG}) = \dfrac{1}{2} \left(0.32 - 0.84 \right) + \dfrac{1}{2} \left(0.27 - 0.5 \right) = -0.38 \end{align*}
It averages the marginal contributions of the lesion when it is considered on its own and when it is added to the other lesion.
[37]:
plot_bars(shap_values, names=["META_LUNG_CL", "PRIMARY_LUNG"], nbest=15)
We can also use the Maximum Intensity Projection (MIP) of the baseline PET image as well as a color scale to display the Shapley values.
[38]:
image_path = "E:\\Images\\patient_31\\baseline_PET\\PET.nii.gz"
mask_path = "E:\\Images\\patient_31\\baseline_PET\\segmentations\\"
mask_paths = []
for name in df_radiomics_all.loc["patient_31", "localization"].values:
mask_paths.append(mask_path + name + '.uint16.nii.gz')
plot_pet(shap_values, image_path=image_path, masks_paths=mask_paths, cmap_lim = 0.3)
For the patient 31, the radiomic characteristics of the primary tumor bias the model towards the prediction of an adenocarcinoma. It is the presence of the controlateral lung metastasis (and its radiomic characteristics) that helps the model to correctly predict another histological subtype than adenocarcinoma.
Patient 106 (diagnosed with adenocarcinoma)
[39]:
# extract the radiomic features of all the lesions of the patient 106
x_explain = df_radiomics_all.loc["patient_106"].iloc[:, 2:].copy().values
shap_values = shap.explain(x_explain, n_jobs=-1)
[40]:
plot_bars(shap_values, names=df_radiomics_all.loc["patient_106", "localization"].values, nbest=15)
[42]:
image_path = "E:\\Images\\patient_106\\baseline_PET\\PET.nii.gz"
mask_path = "E:\\Images\\patient_106\\baseline_PET\\segmentations\\"
mask_paths = []
for name in df_radiomics_all.loc["patient_106", "localization"].values:
mask_paths.append(mask_path + name + '.uint16.nii.gz')
plot_pet(shap_values, image_path=image_path, masks_paths=mask_paths, cmap_lim = 0.3)
For the patient 106, the radiomic characteristics of the primary tumor bias the model towards the prediction of another subtype. It is the presence of mediastinal lymph node metastases (and their radiomic characteristics) that help the model to correctly predict an adenocarcinoma subtype.