Shapley values - prediction of lung cancer histological subtype with an aggregated radiomic model

In this jupyter notebook we provide a short example for the use of Shapley values to explain aggregated radiomic models. We study the classification of the histological subtype of non-small cell lung cancer (NSCLC) from baseline 18F-FDG PET images. To do so, we use a radiomic model that aggregates features extracted from all the metastases of the patient.

[1]:

%load_ext autoreload
%autoreload 2

# Setup for local running - please delete this block
import sys
sys.path.append('C:\\Users\\ncaptier\\Documents\\GitHub\\radshap')

import collections
from tqdm import tqdm
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, recall_score, balanced_accuracy_score
from sklearn.base import clone

from radshap.shapley import Shapley
from radshap.plot import plot_bars, plot_pet

0. Load radiomic and clinical data

Our cohort gathers 110 NSCLC metastatic patients with a baseline 18F-FDG PET scan (for the diagnosis of the metastatic disease) acquired before the beginning of their first line treatment. All their metastases were segmented by a 12 years experienced nuclear medicine physician. Here we study the binary classification task that consists in distinguishing the adenocarcinomas (i.e most common subtype of NSCLC) from the other subtypes (including squamous cell carcinomas).

Note: We will need to adress the class imbalance to avoid ending up with a very low specificity.

[2]:

df_clinicals = pd.read_csv("..\\data\\data_clinicals.csv", index_col=0)

fig, ax = plt.subplots(figsize=(8, 6))
sns.barplot(data=df_clinicals["Histology"].value_counts().reset_index(), x="index", y="Histology", ax=ax)
ax.set(xlabel=None, ylabel=None)
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)
sns.despine()

../_images/_examples_nsclc_subtype_classification_4_0.png

We processed the segmented lesions with 3 steps:

We applied an absolute threshold of 2.5 standardized uptake value (SUV) units, meaning that all voxel values less than 2.5 were excluded from the segmented tumor regions.
We filtered out very small lesions (< 1 mL) as well as diffuse lesions whose precise segmentation and characterization by shape features is questionable (lymphangitic spread, diffuse pleural metastases, diffuse myocardial metastases, and diffuse subdiaphragmatic metastases).
For all the remaining lesions we used pyradiomics to extract 4 shape radiomic features and 6 first-order radiomic features.

We first load the radiomic features of all the filtered lesions of every patients (each row corresponds to one lesion of one patient). In the following we will use this dataframe to aggregate features accross all the lesions and obtain one vector of features per patient.

[3]:

df_radiomics_all = pd.read_csv("..\\data\\data_allmeta.csv", index_col=0)

# we log-transform the volume feature to avoid linear model to deal with highly right skewed distributions
df_radiomics_all.loc[:, "original_shape_VoxelVolume"] = df_radiomics_all["original_shape_VoxelVolume"].apply(lambda x: np.log(x+1))

df_radiomics_all.iloc[:, 2:].head()

[3]:

	original_shape_Sphericity	original_shape_Elongation	original_shape_Flatness	original_shape_VoxelVolume	original_firstorder_Entropy	original_firstorder_Kurtosis	original_firstorder_Maximum	original_firstorder_Mean	original_firstorder_QuartileDispersion	original_firstorder_Skewness
patient_0	0.532479	0.741784	0.388096	9.558812	2.746364	4.173133	5.917239	3.208351	0.118057	1.088778
patient_0	0.651127	0.581162	0.453587	8.592857	1.959371	3.138165	3.937612	2.918054	0.067240	0.769157
patient_0	0.616592	0.451983	0.247072	7.636752	2.197037	2.902717	4.058545	2.969259	0.093076	0.728094
patient_0	0.863482	0.942119	0.830428	9.157889	4.209365	2.012266	8.961629	4.781307	0.290141	0.368464
patient_0	0.624193	0.811512	0.427006	10.863470	3.150671	22.400791	13.207974	3.491949	0.129028	3.597876

We then load the radiomic features of the primary tumor of each patient. In the following we will use this dataframe to build a model only based on the radiomic characterization of the primary tumor and compare it to the aggregated radiomic model.

[4]:

df_radiomics_primary = pd.read_csv("..\\data\\data_primary.csv", index_col = 0)

# we log-transform the volume feature to avoid linear model to deal with highly right skewed distributions
df_radiomics_primary.loc[:, "original_shape_VoxelVolume"] = df_radiomics_primary["original_shape_VoxelVolume"].apply(lambda x: np.log(x+1))

df_radiomics_primary.iloc[:, 2:].head()

[4]:

	original_shape_Sphericity	original_shape_Elongation	original_shape_Flatness	original_shape_VoxelVolume	original_firstorder_Entropy	original_firstorder_Kurtosis	original_firstorder_Maximum	original_firstorder_Mean	original_firstorder_QuartileDispersion	original_firstorder_Skewness
patient_0	0.624193	0.811512	0.427006	10.863470	3.150671	22.400791	13.207974	3.491949	0.129028	3.597876
patient_1	0.436950	0.604767	0.406622	10.511757	3.481742	5.855306	9.717072	3.664580	0.185013	1.707722
patient_2	0.780999	0.705620	0.581406	8.998260	3.800767	2.483848	7.343111	4.192277	0.211119	0.623496
patient_3	0.668431	0.595939	0.350714	11.943259	4.430960	2.596109	12.187289	7.437633	0.151533	0.041012
patient_4	0.699124	0.642183	0.575850	12.270703	5.803551	3.076536	34.302750	9.359064	0.411648	0.582178

1. Aggregated model vs primary tumor model

Here we compare two simple logistic regression classifiers: * A classifier which takes as input 10 radiomic features from the primary tumor - primary tumor model * A classifier which takes as input the minimum value, the maximum value, and the standard deviation of the same 10 radiomic features over all the metastases of the patient - aggregated radiomic model

1.1 Aggregate features with min, max and std

We apply min, max, and standard deviation (i.e., std) functions over all the lesions of every patients for each radiomic features.

[5]:

def reorder_columns(columns, function_names):
    d = {f_name: [] for f_name in function_names}
    for col in columns:
        d[col[1]].append(col)
    return sum(d.values(), [])

[6]:

fun_std = lambda x: x.std(ddof=0) #ddof is set to 0 to normalize by N and not N-1 and thus handle the cases where there is only one lesion
fun_std.__name__ = "std"

df_agg = df_radiomics_all.iloc[:, 2:].groupby(level=0).agg(("max", "min", fun_std)).loc[df_radiomics_primary.index]

#reorder columns
df_agg = df_agg[reorder_columns(df_agg.columns, function_names = ("max", "min", "std"))]
df_agg.columns = df_agg.columns.map('_'.join)

df_agg.head()

[6]:

	original_shape_Sphericity_max	original_shape_Elongation_max	original_shape_Flatness_max	original_shape_VoxelVolume_max	original_firstorder_Entropy_max	original_firstorder_Kurtosis_max	original_firstorder_Maximum_max	original_firstorder_Mean_max	original_firstorder_QuartileDispersion_max	original_firstorder_Skewness_max	...	original_shape_Sphericity_std	original_shape_Elongation_std	original_shape_Flatness_std	original_shape_VoxelVolume_std	original_firstorder_Entropy_std	original_firstorder_Kurtosis_std	original_firstorder_Maximum_std	original_firstorder_Mean_std	original_firstorder_QuartileDispersion_std	original_firstorder_Skewness_std
patient_0	0.863482	0.942119	0.830428	10.863470	4.209365	22.400791	13.207974	4.781307	0.290141	3.597876	...	0.101204	0.160786	0.177240	0.976948	0.726723	7.276279	3.289046	0.626308	0.071618	1.105949
patient_1	0.837041	0.900909	0.596313	10.511757	4.305888	5.855306	10.006734	4.819726	0.283288	1.707722	...	0.162958	0.138764	0.117702	1.389979	0.537648	1.340308	1.941857	0.531018	0.054307	0.387661
patient_2	0.780999	0.705620	0.581406	8.998260	3.800767	2.483848	7.343111	4.192277	0.211119	0.623496	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
patient_3	0.867612	0.825841	0.696419	11.943259	4.430960	2.748872	12.187289	7.437633	0.219823	0.692009	...	0.081349	0.103375	0.144040	1.953295	0.227022	0.062997	1.544987	1.066581	0.027920	0.270425
patient_4	0.888275	0.899604	0.889127	12.270703	5.803551	7.975884	34.302750	9.359064	0.476792	2.106428	...	0.086014	0.146873	0.145913	1.590111	0.915027	1.607292	8.746587	1.971811	0.124980	0.451651

5 rows × 30 columns

1.2 Train and test logistic regression models with repeated cross-validation schemes

We use a repeated stratified cross-validation scheme with 5 folds and 100 repeats to train and test our models. Several classification metrics are computed for each cross-validation scheme and averaged over the 100 repeats (balanced accuracy, ROC AUC, sensitivity, and specificity).

[7]:

# define specificity scorer
def specificity_score(y_true, y_pred):
    TN = ((y_true == 0) & (y_pred == 0)).sum()
    N = (1-y_true).sum()
    return TN/N

def fit_and_score(clf, X, y, train, test, results_dict):
    Xtrain, Xtest, ytrain,  ytest = X[train, :], X[test, :], y[train], y[test]
    clf.fit(Xtrain, ytrain)
    preds = clf.predict(Xtest)
    probas = clf.predict_proba(Xtest)[:,1]
    results_dict["roc_auc"].append(roc_auc_score(ytest, probas))
    results_dict["balanced_accuracy"].append(balanced_accuracy_score(ytest, preds))
    results_dict["sensitivity"].append(recall_score(ytest, preds))
    results_dict["specificity"].append(specificity_score(ytest, preds))
    return probas

[8]:

X_all = df_agg.values
X_primary = df_radiomics_primary.iloc[:, 2:].values
y = 1*(df_clinicals["Histology"] == 'adenocarcinoma').values

[9]:

pipe = Pipeline(steps = [("scaling", StandardScaler()),
                         ("logistic", LogisticRegression(class_weight = "balanced"))
                        ]
               )

[10]:

np.random.seed(0) #fix seed for reproducibility

results_all =  collections.defaultdict(list)
results_primary =  collections.defaultdict(list)

for i in tqdm(range(100)):

    prim_dict = collections.defaultdict(list)
    all_dict = collections.defaultdict(list)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)

    # Iterate over the 5 folds of the cross-validation scheme
    for train, test in cv.split(np.zeros(len(y)), y):

        probas_primary = fit_and_score(clone(pipe), X_primary, y, train, test, results_dict=prim_dict)
        probas_all = fit_and_score(clone(pipe), X_all, y, train, test, results_dict=all_dict)

    # Aggregate the results of the 5 folds
    for metric in prim_dict.keys():
        results_primary[metric].append(np.mean(prim_dict[metric]))
        results_all[metric].append(np.mean(all_dict[metric]))

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:07<00:00, 13.76it/s]

1.3 Display the results

[11]:

results_all = pd.DataFrame(results_all)
results_primary = pd.DataFrame(results_primary)
results_all['model'] = 'all'
results_primary['model'] = 'primary'
results = pd.concat([results_all, results_primary], axis=0)

results.groupby("model").mean()

[11]:

	roc_auc	balanced_accuracy	sensitivity	specificity
model
all	0.777538	0.710908	0.762750	0.659067
primary	0.722199	0.661276	0.702551	0.620000

[12]:

results.groupby("model").std()

[12]:

	roc_auc	balanced_accuracy	sensitivity	specificity
model
all	0.038762	0.046618	0.035242	0.077155
primary	0.023648	0.027652	0.027423	0.047690

[13]:

fig, ax = plt.subplots(figsize=(10, 7))
sns.barplot(data = results.melt(id_vars = ["model"]), x="variable", y="value", hue="model", ci=None, ax=ax)

for i, m in enumerate(['all', 'primary']):
    ax.errorbar(x=np.arange(4) + 0.2*(-1 + 2*i),
                y=results[results["model"]== m].iloc[:, :-1].mean(axis=0).values,
                yerr=results[results["model"]== m].iloc[:, :-1].std(axis=0).values,
                fmt='none',
                ecolor='k',
                capsize=5,
                elinewidth=1)

ax.set_ylim(0.5, 0.82)
ax.set(xlabel=None, ylabel=None)
ax.set_axisbelow(True)
ax.yaxis.grid(color='gray', linestyle='dashed')
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)
ax.legend(fontsize=12)
sns.despine()

../_images/_examples_nsclc_subtype_classification_24_0.png

The aggregated radiomic model that takes into account all the metastases of the patient outperforms the model that only considers his primary lung tumor. In particular, it is more sensitive and detects better the adenocarcinomas. It is a promising result since the aggregation functions that we used were very simple and we can thus hope for more striking improvements with more complex aggregation strategies.

2. Interpretation of the aggregated model with Shapley values

Here we use the radshap package to interpret our aggregated radiomic model and highlight the importance of each lesion for the prediction of the histological subtype. In particular we use the radshap package to study which metastases the aggregated model uses (other than the primary tumor) to better predict the histological subtype of the patient

We first re-train both the aggregated radiomic model and the primary model on the whole data set (i.e 110 patients).
We use these trained models to predict the histological subtype of each patient. We then focus on those who were missclassified by the primary model and correctly classified by the aggregated model.
For such patients we compute the Shapley values of each of their metastases to highlight their importance for the prediction made by the aggregated radiomic model.

2.1 Train and predict on the whole data set

[14]:

model_all = clone(pipe).fit(X_all, y)
model_primary = clone(pipe).fit(X_primary, y)

[15]:

df_probas = pd.DataFrame(np.hstack([model_all.predict_proba(X_all)[:, 1].reshape(-1, 1),
                                    model_primary.predict_proba(X_primary)[:, 1].reshape(-1, 1),
                                    y.reshape(-1, 1)]
                                  ),
                         index=df_agg.index,
                         columns = ["probas_all", "probas_primary", "labels"])

df_probas["probas_gap"] = np.abs(df_probas["probas_all"] - df_probas["probas_primary"])

mask = (model_all.predict(X_all) == y) & (model_primary.predict(X_primary) != y)

df_probas[mask].sort_values(by="probas_gap", ascending = False).iloc[:, :-1].head(10)

[15]:

	probas_all	probas_primary	labels
patient_5	0.918168	0.400524	1.0
patient_106	0.640970	0.150564	1.0
patient_7	0.702580	0.229073	1.0
patient_31	0.316072	0.789244	0.0
patient_27	0.945055	0.485545	1.0
patient_107	0.872115	0.454678	1.0
patient_109	0.246746	0.590126	0.0
patient_72	0.191475	0.530324	0.0
patient_44	0.745959	0.418653	1.0
patient_73	0.816111	0.497330	1.0

Overall 20 patients were misclassified by the primary model and correctly classified by the aggregated radiomic model. The table above shows the predicted probabilities of the class 1 (i.e adenocarcinoma) for both models as well as the true label of each patient. The 20 patients are ordered by the gap in absolute value between the predicted probability of the primary model and the predicted probability of the aggregated model (from top to bottom).

2.2 Classical feature importance analysis

We can first take advantage of the coefficients of the logistic regression to highlight the importance of each aggregated radiomic feature. A positive coefficient means that a high value of the associated feature increases the probability of adenocarcinoma while a negative coefficient means that a high value of the associated feature decreases the probability of adenocarcinoma.

Note: Since the features are standardized before training the logisitic regression, a “high value” should be understood as “a value greater than the average of this feature”.

[16]:

fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(data = pd.DataFrame(model_all["logistic"].coef_.reshape(-1,1), index = df_agg.columns).sort_values(by=0, ascending=False).reset_index(),
            y="index",
            x=0,
            orient="h",
            ax=ax,
            color = "gray")
ax.set(xlabel=None, ylabel=None)
ax.set_axisbelow(True)
ax.yaxis.grid(color="gray", linestyle="dashed")

../_images/_examples_nsclc_subtype_classification_34_0.png

Even for such a simple model we see that it is not straightforward to understand the model’s behavior from this feature importance analysis alone.

2.3 Compute and display Shapley values

Contrary to the analysis above, Shapley values computed by the radshap package allow us to study each patient specifically and highlight the importance of their metastases for the prediction of the model. Besides it can explain the decision of the model for new unseen patients.

[17]:

shap = Shapley(predictor = lambda x: model_all.predict_proba(x)[:, 1], aggregation = [('max', None), ('min', None), ('std', None)])

Patient 31 (diagnosed with another subtype)

[23]:

# extract the radiomic features of all the lesions of the patient 31
x_explain = df_radiomics_all.loc["patient_31"].iloc[:, 2:].copy().values

shap_values = shap.explain(x_explain)

We first use a bar plot to display the Shapley values. They are ranked from top to bottom in decreasing order with respect to their absolute value. A negative Shapley value means that considering the associated lesion for the prediction of the histological subtype with the aggregated model decreases the probability of the class 1 (i.e adenocarcinoma) while a positive Shapley value means that it increases the probability of the class 1.

For this patient with only two lesions it is quite simple to explicitly detail the formula of the Shapley value. It will help us better understand its meaning:

\begin{align*} &\phi(\text{META_LUNG}) = \dfrac{1}{2} \left( pred(\{\text{META_LUNG, PRIMARY}\}) - pred(\{\text{PRIMARY}\}) \right) + \dfrac{1}{2} \left( pred(\{\text{META_LUNG}\}) - pred(\emptyset) \right)\\ &\phi(\text{META_LUNG}) = \dfrac{1}{2} \left(0.32 - 0.84 \right) + \dfrac{1}{2} \left(0.27 - 0.5 \right) = -0.38 \end{align*}

It averages the marginal contributions of the lesion when it is considered on its own and when it is added to the other lesion.

[37]:

plot_bars(shap_values, names=["META_LUNG_CL", "PRIMARY_LUNG"], nbest=15)

../_images/_examples_nsclc_subtype_classification_42_0.png

We can also use the Maximum Intensity Projection (MIP) of the baseline PET image as well as a color scale to display the Shapley values.

[38]:

image_path = "E:\\Images\\patient_31\\baseline_PET\\PET.nii.gz"
mask_path = "E:\\Images\\patient_31\\baseline_PET\\segmentations\\"
mask_paths = []
for name in df_radiomics_all.loc["patient_31", "localization"].values:
    mask_paths.append(mask_path + name + '.uint16.nii.gz')

plot_pet(shap_values, image_path=image_path, masks_paths=mask_paths, cmap_lim = 0.3)

../_images/_examples_nsclc_subtype_classification_44_0.png

For the patient 31, the radiomic characteristics of the primary tumor bias the model towards the prediction of an adenocarcinoma. It is the presence of the controlateral lung metastasis (and its radiomic characteristics) that helps the model to correctly predict another histological subtype than adenocarcinoma.

Patient 106 (diagnosed with adenocarcinoma)

[39]:

# extract the radiomic features of all the lesions of the patient 106
x_explain = df_radiomics_all.loc["patient_106"].iloc[:, 2:].copy().values

shap_values = shap.explain(x_explain, n_jobs=-1)

[40]:

plot_bars(shap_values, names=df_radiomics_all.loc["patient_106", "localization"].values, nbest=15)

../_images/_examples_nsclc_subtype_classification_48_0.png

[42]:

image_path = "E:\\Images\\patient_106\\baseline_PET\\PET.nii.gz"
mask_path = "E:\\Images\\patient_106\\baseline_PET\\segmentations\\"
mask_paths = []
for name in df_radiomics_all.loc["patient_106", "localization"].values:
    mask_paths.append(mask_path + name + '.uint16.nii.gz')

plot_pet(shap_values, image_path=image_path, masks_paths=mask_paths, cmap_lim = 0.3)

../_images/_examples_nsclc_subtype_classification_49_0.png

For the patient 106, the radiomic characteristics of the primary tumor bias the model towards the prediction of another subtype. It is the presence of mediastinal lymph node metastases (and their radiomic characteristics) that help the model to correctly predict an adenocarcinoma subtype.