radshap.shapley.Shapley
- class radshap.shapley.Shapley(predictor, aggregation, empty_value=0.5)
Compute the Shapley value of every instance in a collection of instances that are aggregated into a single input of a trained predictive algorithm. It uses either an exact enumeration strategy when the number of instances is small (8 or fewer) or an approximate Monte-Carlo scheme with antithetic sampling.
- Parameters:
- predictor: callable (input: 2D array of shape (n_inputs, n_input_features), output: 1D array of shape (n_inputs,))
Trained predictor that returns one single real-value prediction per input.
- aggregation: callable, tuple, list of tuples
Aggregator that transforms an array of n_instances (with each one characterized by n_instance_features) into one single vector of shape (n_input_features) that will be used as an input of the predictor.
To define the aggregator one can use:
a callable that takes as input a 2D array of shape (n_instances, n_instance_features) and returns a 1D array of shape (1, n_input_features).
a tuple (method, subset), where method is a string naming a NumPy aggregating function (e.g. ‘sum’, ‘min’, ‘std’…) and subset is a 1D array selecting the columns/features to which the method is applied (or None to apply it to all columns/features).
a list of tuples [(method_1, subset_1), (method_2, subset_2), …] to define several aggregators. The aggregated features follow the order of the provided list (i.e. [agg_feature_method_1, …, agg_feature_method_2, …]).
- empty_value: float, optional
Prediction made by the algorithm for an input with no instances (i.e. a random prediction). The default is 0.5.
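As an illustration of the (method, subset) convention described above, the following minimal sketch builds an aggregator from a tuple or a list of tuples using plain NumPy. The helper name `build_aggregator` is hypothetical and is not part of radshap; how the library resolves these tuples internally is not stated here, so treat this as one plausible reading of the description.

```python
import numpy as np


def build_aggregator(spec):
    """Hypothetical helper: turn (method, subset) tuples into one callable."""
    # Accept either a single tuple or a list of tuples.
    specs = [spec] if isinstance(spec, tuple) else list(spec)

    def aggregate(X):
        X = np.asarray(X, dtype=float)
        parts = []
        for method, subset in specs:
            # subset=None means "use every column".
            cols = X if subset is None else X[:, subset]
            # Look up the NumPy reduction by name and apply it across instances.
            parts.append(np.atleast_1d(getattr(np, method)(cols, axis=0)))
        # Aggregated features are concatenated in the order of the list.
        return np.concatenate(parts)

    return aggregate


# Two instances described by three features each.
X = np.array([[1.0, 2.0, 3.0],
              [3.0, 6.0, 9.0]])

# Mean of all columns, followed by the max of columns 0 and 2.
agg = build_aggregator([("mean", None), ("max", np.array([0, 2]))])
features = agg(X)
```

Here `features` has five entries: three means followed by two maxima, matching the ordering rule for lists of tuples.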
- Attributes:
- shapleyvalues_: 1D array of shape (n_instances,)
Shapley value associated with each instance.
Notes
For an exact estimation, the sum of the Shapley values of all instances equals the difference between the prediction made for the collection of all instances and the empty/random prediction (Efficiency property).
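The Efficiency property can be checked on a toy problem. The sketch below is a stand-alone exact enumeration over all permutations, written with NumPy only; it does not call radshap, and the names `exact_shapley`, `toy_predictor` and `mean_aggregate` are assumptions made for illustration.

```python
from itertools import permutations

import numpy as np


def exact_shapley(X, predictor, aggregate, empty_value=0.5):
    """Illustrative exact Shapley values by enumerating all permutations."""
    n = len(X)

    def value(indices):
        # Aggregate the selected instances into a single predictor input.
        rows = X[sorted(indices)]
        return float(predictor(aggregate(rows).reshape(1, -1))[0])

    phi = np.zeros(n)
    n_perms = 0
    for perm in permutations(range(n)):
        coalition = []
        previous = empty_value  # prediction for the empty collection
        for i in perm:
            coalition.append(i)
            current = value(coalition)
            phi[i] += current - previous  # marginal contribution of instance i
            previous = current
        n_perms += 1
    return phi / n_perms


def toy_predictor(inputs):
    # Hypothetical model: logistic function of the summed input features.
    return 1.0 / (1.0 + np.exp(-inputs.sum(axis=1)))


def mean_aggregate(rows):
    return rows.mean(axis=0)


X = np.array([[0.2, -1.0], [1.5, 0.3], [-0.4, 0.8]])
phi = exact_shapley(X, toy_predictor, mean_aggregate, empty_value=0.5)

full_prediction = float(toy_predictor(mean_aggregate(X).reshape(1, -1))[0])
efficiency_gap = abs(phi.sum() - (full_prediction - 0.5))
```

In this sketch `efficiency_gap` is zero up to floating-point error, because the marginal contributions along each permutation telescope to the full prediction minus the empty prediction.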
Examples
>>> import numpy as np
>>> import joblib
>>> from radshap.shapley import Shapley
>>>
>>> model = joblib.load("trained_logistic_regression.joblib")
>>> shap = Shapley(predictor=lambda x: model.predict_proba(x)[:, 1], aggregation=('mean', None))
>>> shapvalues = shap.explain(X)  # X: a 2D array of shape (n_instances, n_instance_features)
Methods
explain(X[, estimation_method, nsamples, n_jobs]): Compute the Shapley values for each row of X.
- explain(X, estimation_method='auto', nsamples=1000, n_jobs=1)
Compute the Shapley values for each row of X.
X must correspond to a valid collection of instances (shape (n_instances, n_instance_features)) that will be passed to the aggregator function and then used as an input for the predictor.
- Parameters:
- X: 2D array of shape (n_instances, n_instance_features)
X corresponds to a collection of instances that will be aggregated into a single input.
- estimation_method: str {‘auto’, ‘exact’, ‘antithetic’}, optional
Estimation method for the Shapley values. The default is “auto”.
If estimation_method = ‘exact’, an exact enumeration strategy is used: all permutations of the instances are considered.
If estimation_method = ‘antithetic’, a Monte-Carlo scheme with random permutations is applied. Antithetic sampling is used to reduce the variance of the estimator. In that case the number of samples is set by nsamples.
If estimation_method = ‘auto’, the estimation method is chosen based on the number of instances: a Monte-Carlo scheme is applied if there are more than 8 instances, and an exact scheme otherwise.
- nsamples: int, optional
Number of samples for the Monte-Carlo scheme. It is used only when estimation_method = ‘antithetic’, or when estimation_method = ‘auto’ and the number of instances is > 8. The default is 1000.
- n_jobs: int, optional
Number of jobs to run in parallel. -1 means using all processors. See the joblib package documentation for more explanations. The default is 1.
- Returns:
- 1D array of shape (n_instances,)
Shapley value associated with each instance.
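To make the ‘antithetic’ option more concrete, here is a minimal stand-alone sketch of a Monte-Carlo permutation estimator in which each random permutation is paired with its reverse. This pairing is one common form of antithetic sampling; whether radshap uses exactly this pairing, or counts nsamples in this way, is an assumption. The function `antithetic_shapley` and the toy predictor are hypothetical and do not use the radshap API.

```python
import numpy as np


def antithetic_shapley(X, predictor, aggregate, empty_value=0.5,
                       nsamples=1000, seed=0):
    """Illustrative Monte-Carlo Shapley estimate with reversed-permutation pairs."""
    rng = np.random.default_rng(seed)
    n = len(X)

    def value(indices):
        # Aggregate the selected instances into a single predictor input.
        rows = X[np.sort(indices)]
        return float(predictor(aggregate(rows).reshape(1, -1))[0])

    def marginals(order):
        contrib = np.zeros(n)
        previous = empty_value  # prediction for the empty collection
        for k in range(n):
            current = value(order[:k + 1])
            contrib[order[k]] = current - previous
            previous = current
        return contrib

    total = np.zeros(n)
    n_pairs = max(1, nsamples // 2)  # assumption: one sample = one permutation
    for _ in range(n_pairs):
        order = rng.permutation(n)
        # Antithetic pair: the permutation and its reverse.
        total += marginals(order) + marginals(order[::-1])
    return total / (2 * n_pairs)


def toy_predictor(inputs):
    # Hypothetical model: logistic function of the summed input features.
    return 1.0 / (1.0 + np.exp(-inputs.sum(axis=1)))


data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(10, 3))  # 10 instances, so 'auto' would not pick exact
estimates = antithetic_shapley(X, toy_predictor, lambda rows: rows.mean(axis=0),
                               nsamples=200)
```

Pairing a permutation with its reverse tends to produce negatively correlated marginal contributions for the same instance, which is the intuition behind the variance reduction mentioned for this method.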