radshap.shapley.RobustShapley

class radshap.shapley.RobustShapley(predictor, aggregation, background_data, invalid_features=None, empty_value=0.5)[source]

Compute the Shapley value of every element of a collection of instances that are aggregated into a single input of a trained predictive algorithm, while handling invalid inputs.

Cases where the aggregation function returns an invalid input for the predictive algorithm (i.e. unexpected NaN values) are handled by replacing the missing values with values from a background data set and averaging the predictions over all these replacements.

Parameters:
predictor: callable (input: 2D array of shape (n_inputs, n_input_features), output: 1D array of shape (n_inputs,))

Trained predictor that returns one single real-value prediction per input.

aggregation: callable, tuple, list of tuples

Aggregator that transforms an array of n_instances (with each one characterized by n_instance_features) into one single vector of shape (n_input_features) that will be used as an input of the predictor.

Aggregator should deal with cases where some aggregated features cannot be computed by returning NaN values instead.

To define the aggregator one can use:

  • a callable that takes as input a 2D array of shape (n_instances, n_instance_features) and returns a 1D array of shape (1, n_input_features).

  • a tuple (method, subset), where method is a string that refers to a numpy aggregating function (e.g. ‘sum’, ‘min’, ‘std’…) and subset is a 1D array that defines the subset of columns/features to which this method is applied (or None to apply it to all the columns/features).

  • a list of tuples [(method_1, subset_1), (method_2, subset_2), …] to define several aggregators. Note that the aggregated features follow the order of the provided list (i.e. [agg_feature_method_1, …, agg_feature_method_2, …]).
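The tuple and list forms above can be read as a recipe for building a callable. The following is only a minimal sketch of that idea, not the library's internal code: build_aggregator is a hypothetical helper, and looking up the method name with getattr on numpy is an assumption about how the string is resolved.

```python
import numpy as np

def build_aggregator(spec):
    """Hypothetical helper: turn (method, subset) or a list of such tuples into a callable."""
    specs = [spec] if isinstance(spec, tuple) else list(spec)

    def aggregate(instances):
        instances = np.asarray(instances, dtype=float)
        parts = []
        for method, subset in specs:
            func = getattr(np, method)  # e.g. np.sum, np.nanmean, np.std
            cols = instances if subset is None else instances[:, subset]
            parts.append(func(cols, axis=0))  # one aggregated value per selected column
        # Aggregated features are concatenated in the order of the provided list.
        return np.concatenate(parts)

    return aggregate

# 'sum' over all columns, then 'max' over column 0 only.
agg = build_aggregator([("sum", None), ("max", np.array([0]))])
X = np.array([[1.0, 2.0], [3.0, 4.0]])
out = agg(X)
```

Here out holds the two column sums followed by the maximum of the first column, which illustrates the ordering rule stated in the last bullet.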

background_data: 2D array of shape (n_background, n_input_features)

Background data set used to deal with invalid inputs for the predictor. In the case of an invalid input, we generate “n_background” new inputs by replacing the invalid values with the corresponding values from each row of the background data set. We then apply the predictor to each of these generated inputs and return the average prediction.
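The replacement strategy described above can be sketched as follows. This is an illustrative reading of the description, not the package's implementation: predict_with_background is a hypothetical name, and the toy predictor and background values are made up for the example.

```python
import numpy as np

def predict_with_background(predictor, x, background_data):
    """Hypothetical helper: average the prediction over background replacements of NaN values."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    if not missing.any():
        # Valid input: a single call to the predictor is enough.
        return float(predictor(x[None, :])[0])
    # One candidate input per background row, with NaN entries filled from that row.
    candidates = np.tile(x, (background_data.shape[0], 1))
    candidates[:, missing] = background_data[:, missing]
    return float(np.mean(predictor(candidates)))

# Toy predictor (sum of the features) and a two-row background set.
predictor = lambda inputs: inputs.sum(axis=1)
background = np.array([[0.0, 1.0], [0.0, 3.0]])
value = predict_with_background(predictor, np.array([5.0, np.nan]), background)
```

In this toy case the two generated inputs are [5, 1] and [5, 3], so the averaged prediction is the mean of 6 and 8.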

invalid_features: 1D array of shape (n_input_features), optional

Boolean array to specify the features for which a NaN value means that the input is not valid. In such cases, we handle them with the background data set. If None all the features will be taken into account to define invalid inputs. The default is None.
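One way to picture the role of this mask is shown below. It is a sketch under the assumption that an input counts as invalid as soon as any flagged feature is NaN; is_invalid is a hypothetical helper, not part of the documented API.

```python
import numpy as np

def is_invalid(x, invalid_features=None):
    """Hypothetical check: does x contain a NaN in a feature flagged as invalidating?"""
    nan_mask = np.isnan(np.asarray(x, dtype=float))
    if invalid_features is None:
        # Default: a NaN in any feature makes the input invalid.
        return bool(nan_mask.any())
    flagged = np.asarray(invalid_features, dtype=bool)
    return bool(nan_mask[flagged].any())

x = np.array([1.0, np.nan, 3.0])
all_features = is_invalid(x)                         # NaN anywhere counts
only_first = is_invalid(x, [True, False, False])     # NaN in feature 1 is ignored
```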

empty_value: float, optional

Prediction made by the algorithm for an input with no instances (i.e. a random prediction). The default is 0.5.

Attributes:
shapleyvalues_: 1D array of shape (n_instances,)

Shapley value associated with each instance.

Notes

The strategy to deal with invalid inputs was inspired by SHAP (https://shap.readthedocs.io).

Examples

>>> import numpy as np
>>> import joblib
>>> from radshap.shapley import RobustShapley
>>>
>>> model = joblib.load("trained_logistic_regression.joblib")
>>> shap = RobustShapley(predictor=lambda x: model.predict_proba(x)[:, 1],
...                      aggregation=('nanmean', None),
...                      background_data=Xback)  # Xback: a 2D array of shape (n_samples_background, n_input_features)
>>> shapvalues = shap.explain(X) # X a 2D array of shape (n_instances, n_instance_features)

Methods

explain(X[, estimation_method, nsamples, n_jobs])

Compute the Shapley values for each row of X.

explain(X, estimation_method='auto', nsamples=1000, n_jobs=1)[source]

Compute the Shapley values for each row of X.

X must correspond to a valid collection of instances (shape (n_instances, n_instance_features)) that will be passed to the aggregator function and then used as an input for the predictor.

Parameters:
X: 2D array of shape (n_instances, n_instance_features)

X corresponds to a collection of instances that will be aggregated into a single input.

estimation_method: str {‘auto’, ‘exact’, ‘antithetic’}, optional

Estimation method for the Shapley values. The default is “auto”.

  • If estimation_method = ‘exact’, an exact enumeration strategy is used. All the permutations of the instances are considered.

  • If estimation_method = ‘antithetic’, a Monte-Carlo scheme with random permutations is applied. Antithetic sampling is used to reduce the variance of the estimator. In that case the number of samples is defined by nsamples.

  • If estimation_method = ‘auto’, the estimation method is chosen based on the number of instances. If the number of instances is > 8, a Monte-Carlo scheme is applied. Otherwise, an exact scheme is applied.
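The two schemes above can be illustrated with a small permutation-based sketch. It is not the library's code: the function names are hypothetical, the toy value function (mean of one feature, with 0.5 for the empty coalition) is made up, and pairing each random permutation with its reverse is one common antithetic scheme assumed here for illustration.

```python
import itertools
import numpy as np

def marginal_contributions(value, order):
    """Marginal contribution of each instance when instances are added in the given order."""
    contrib = np.zeros(len(order))
    previous = value(())  # coalition with no instances
    for k, i in enumerate(order):
        current = value(tuple(sorted(order[: k + 1])))
        contrib[i] = current - previous
        previous = current
    return contrib

def shapley_exact(value, n):
    """Average the marginal contributions over all n! permutations."""
    perms = list(itertools.permutations(range(n)))
    return sum(marginal_contributions(value, p) for p in perms) / len(perms)

def shapley_antithetic(value, n, nsamples=1000, seed=0):
    """Monte-Carlo estimate using each random permutation together with its reverse."""
    rng = np.random.default_rng(seed)
    total = np.zeros(n)
    pairs = nsamples // 2
    for _ in range(pairs):
        perm = rng.permutation(n)
        total += marginal_contributions(value, perm)
        total += marginal_contributions(value, perm[::-1])  # antithetic counterpart
    return total / (2 * pairs)

# Toy value function: "prediction" is the mean of the selected instances.
X = np.array([[1.0], [2.0], [3.0]])
empty_value = 0.5

def value(coalition):
    if len(coalition) == 0:
        return empty_value
    return float(X[list(coalition)].mean())

exact = shapley_exact(value, len(X))
approx = shapley_antithetic(value, len(X), nsamples=200)
```

Both results sum to the prediction for the full collection minus empty_value, since the marginal contributions telescope along every permutation. The exact scheme needs n! evaluations of the ordering loop, which is why a sampling scheme becomes attractive as the number of instances grows.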

nsamples: int, optional

Number of samples for the Monte-Carlo scheme. It is used only when estimation_method = ‘antithetic’, or when estimation_method = ‘auto’ and the number of instances is > 8. The default is 1000.

n_jobs: int, optional

Number of jobs to run in parallel. -1 means using all processors. See the joblib package documentation for more explanations. The default is 1.

Returns:
1D array of shape (n_instances,)

Shapley value associated with each instance.