ML之shap:基于adult人口普查收入二分類預測數(shù)據(jù)集(預測年收入是否超過50k)利用Shap值對XGBoost模型實現(xiàn)可解釋性案例之詳細攻略
?
相關文章
ML之shap:基于adult人口普查收入二分類預測數(shù)據(jù)集(預測年收入是否超過50k)利用Shap值對XGBoost模型實現(xiàn)可解釋性案例之詳細攻略
ML之shap:基于adult人口普查收入二分類預測數(shù)據(jù)集(預測年收入是否超過50k)利用Shap值對XGBoost模型實現(xiàn)可解釋性案例之詳細攻略實現(xiàn)
基于adult人口普查收入二分類預測數(shù)據(jù)集(預測年收入是否超過50k)利用Shap值對XGBoost模型實現(xiàn)可解釋性案例
1、定義數(shù)據(jù)集
dtypes_len: 15
age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
2、數(shù)據(jù)集預處理
# 2.1、入模特征初步篩選
df.columns?
?14
# 2.2、目標特征二值化
# 2.3、類別型特征編碼數(shù)字化
filt_dtypes_len: 13 [('age', 'float32'), ('workclass', 'category'), ('fnlwgt', 'float32'), ('education_Num', 'float32'), ('marital_status', 'category'), ('occupation', 'category'), ('relationship', 'category'), ('race', 'category'), ('sex', 'category'), ('capital_gain', 'float32'), ('capital_loss', 'float32'), ('hours_per_week', 'float32'), ('native_country', 'category')]
?
# 2.4、分離特征與標簽
df_adult_display
| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
0 | 39 | State-gov | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | 0 |
1 | 50 | Self-emp-not-inc | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | 0 |
2 | 38 | Private | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | 0 |
3 | 53 | Private | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | 0 |
4 | 28 | Private | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | 0 |
5 | 37 | Private | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | 0 |
6 | 49 | Private | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | 0 |
7 | 52 | Self-emp-not-inc | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | 1 |
8 | 31 | Private | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | 1 |
9 | 42 | Private | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | 1 |
df_adult
| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
0 | 39 | 7 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
1 | 50 | 6 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
2 | 38 | 4 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
3 | 53 | 4 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
4 | 28 | 4 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |
5 | 37 | 4 | 14 | 2 | 4 | 5 | 4 | 0 | 0 | 0 | 40 | 39 | 0 |
6 | 49 | 4 | 5 | 3 | 8 | 1 | 2 | 0 | 0 | 0 | 16 | 23 | 0 |
7 | 52 | 6 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 45 | 39 | 1 |
8 | 31 | 4 | 14 | 4 | 10 | 1 | 4 | 0 | 14084 | 0 | 50 | 39 | 1 |
9 | 42 | 4 | 13 | 2 | 4 | 0 | 4 | 1 | 5178 | 0 | 40 | 39 | 1 |
# 2.5、數(shù)據(jù)集整體切分
df_len: 32561 ,train_test_index: 30933
X.shape,y.shape: (30933, 12) (30933,)
X_test.shape,y_test.shape: (1628, 12) (1628,)
#3、模型訓練與推理
# 3.1、數(shù)據(jù)集切分
# 3.2、模型建立并訓練
#?3.3、模型預測
| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | y_val_predi | y_val |
11311 | 29 | 4 | 9 | 4 | 1 | 3 | 2 | 0 | 0 | 0 | 60 | 39 | 0 | 0 |
12519 | 33 | 4 | 10 | 4 | 3 | 1 | 2 | 1 | 8614 | 0 | 40 | 39 | 1 | 1 |
29225 | 27 | 4 | 13 | 4 | 10 | 1 | 4 | 1 | 0 | 0 | 45 | 39 | 0 | 0 |
5428 | 22 | 4 | 9 | 2 | 7 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | 0 | 0 |
2400 | 32 | 7 | 10 | 4 | 1 | 1 | 2 | 0 | 0 | 0 | 40 | 39 | 0 | 0 |
4319 | 45 | 4 | 10 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | 1 | 0 |
26564 | 43 | 4 | 9 | 2 | 6 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | 0 | 0 |
4721 | 60 | 0 | 13 | 2 | 0 | 0 | 4 | 1 | 0 | 0 | 8 | 39 | 0 | 1 |
19518 | 29 | 6 | 9 | 2 | 12 | 0 | 4 | 1 | 0 | 0 | 35 | 39 | 0 | 0 |
25013 | 33 | 4 | 5 | 2 | 6 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | 0 | 0 |
#4、模型特征重要性解釋可視化
#4.1、全局特征重要性可視化
# T1、基于模型本身輸出特征重要性

?XGBR_importance_dict: [('age', 130), ('capital_gain', 125), ('education_num', 86), ('capital_loss', 75), ('hours_per_week', 63), ('relationship', 59), ('marital_status', 52), ('occupation', 52), ('workclass', 20), ('sex', 13), ('native_country', 10), ('race', 6)]
# T2、利用Shap值解釋XGBR模型
利用shap自帶的函數(shù)實現(xiàn)特征貢獻性可視化——特征重要性排序與上邊類似,但并不相同
# (1)、創(chuàng)建Explainer并計算SHAP值

# T2.1、輸出shap.Explanation對象
# T2,2、輸出numpy.array數(shù)組
shap2exp.values.shape (30933, 12)
[[ 0.31074238 -0.16607898 0.5617416 ... -0.04660619 -0.09465054
0.00530914]
[ 0.34912622 -0.16633348 0.65308005 ... -0.06718991 -0.9804511
0.00515459]
[ 0.21971266 0.02263742 -0.299867 ... -0.0583196 -0.09738331
0.00415599]
...
[-0.48140627 0.07019287 -0.30844492 ... -0.04253047 -0.10924102
0.00649792]
[ 0.39729887 -0.2313431 -0.45257783 ... -0.06502013 0.27416423
0.00587647]
[ 0.27594262 0.03170239 0.78293955 ... -0.06743324 0.31613
0.00530914]]
shap2array.shape (30933, 12)
[[ 0.31074238 -0.16607898 0.5617416 ... -0.04660619 -0.09465054
0.00530914]
[ 0.34912622 -0.16633348 0.65308005 ... -0.06718991 -0.9804511
0.00515459]
[ 0.21971266 0.02263742 -0.299867 ... -0.0583196 -0.09738331
0.00415599]
...
[-0.48140627 0.07019287 -0.30844492 ... -0.04253047 -0.10924102
0.00649792]
[ 0.39729887 -0.2313431 -0.45257783 ... -0.06502013 0.27416423
0.00587647]
[ 0.27594262 0.03170239 0.78293955 ... -0.06743324 0.31613
0.00530914]]
shap2exp.values與shap2array,兩個矩陣否相等: True
# (2)、全樣本各特征shap值條形圖可視化

?# shap值高階交互可視化

# (3)、全樣本各特征shap值蜂群圖可視化



?
# (4)、全局特征重要性排序散點圖可視化

?
?
#4.2、局部特征重要性可視化
# (1)、單樣本全特征條形圖可視化
前測試樣本:0
.values =
array([ 0.31074238, -0.16607898, 0.5617416 , -0.58709425, -0.08897061,
-0.6133537 , 0.01539118, 0.04758333, -0.3988452 , -0.04660619,
-0.09465054, 0.00530914], dtype=float32)
.base_values =
-1.3270257
.data =
array([3.900e+01, 7.000e+00, 1.300e+01, 4.000e+00, 1.000e+00, 1.000e+00,
4.000e+00, 1.000e+00, 2.174e+03, 0.000e+00, 4.000e+01, 3.900e+01])

前測試樣本:1
.values =
array([ 0.34912622, -0.16633348, 0.65308005, 0.3069151 , 0.26878497,
0.5229906 , 0.01030679, 0.04531586, -0.15429462, -0.06718991,
-0.9804511 , 0.00515459], dtype=float32)
.base_values =
-1.3270257
.data =
array([50., 6., 13., 2., 4., 0., 4., 1., 0., 0., 13., 39.])

前測試樣本:10
.values =
array([ 0.27578622, 0.02686635, -0.0699547 , 0.2820353 , 0.3097189 ,
0.55229187, -0.03686382, 0.05135565, -0.1607191 , -0.06321771,
0.38190693, 0.02023092], dtype=float32)
.base_values =
-1.3270257
.data =
array([37., 4., 10., 2., 4., 0., 2., 1., 0., 0., 80., 39.])

前測試樣本:20
.values =
array([ 0.31008577, 0.00316932, 1.3133987 , 0.16768128, 0.18239255,
0.6863757 , 0.00508371, 0.05159741, -0.15813455, -0.06736177,
0.31327826, 0.01936885], dtype=float32)
.base_values =
-1.3270257
.data =
array([40., 4., 16., 2., 10., 0., 4., 1., 0., 0., 60., 39.])

# (2)、單轉雙特征全樣本局部獨立圖散點圖可視化


# (3)、雙特征全樣本散點圖可視化

# 4.3、模型特征篩選
# (1)、基于聚類的shap特征篩選可視化

5、模型預測的可解釋性(可主要分析誤分類的樣本)
提供了預測的細節(jié),側重于解釋單個預測是如何生成的。它可以幫助決策者信任模型,并且解釋各個特征是如何影響模型單次的決策。
# ?5.1、力圖可視化分析:可視化單個或多個樣本內各個特征貢獻度并對比模型預測值——探究誤分類樣本
提供了單一模型預測的可解釋性,可用于誤差分析,找到對特定實例預測的解釋。如樣例0所示:
(1)、模型輸出值:5.89;
(2)、基值:base value即explainer.expected_value,即模型輸出與訓練數(shù)據(jù)的平均值;
(3)、繪圖箭頭下方數(shù)字是此實例的特征值。如Age=39;
(4)、紅色則表示該特征的貢獻是正數(shù)(將預測推高的特征),藍色表示該特征的貢獻是負數(shù)(將預測推低的特征)。長度表示影響力;箭頭越長,特征對輸出的影響(貢獻)越大。通過 x 軸上刻度值可以看到影響的減少或增加量。

(1)、單個樣本力圖可視化—對比預測
輸出當前測試樣本:0

mode_exp_value: -1.3270257
<IPython.core.display.HTML object>
輸出當前測試樣本:0
age 29.0
workclass 4.0
education_num 9.0
marital_status 4.0
occupation 1.0
relationship 3.0
race 2.0
sex 0.0
capital_gain 0.0
capital_loss 0.0
hours_per_week 60.0
native_country 39.0
y_val_predi 0.0
y_val 0.0
Name: 11311, dtype: float64
輸出當前測試樣本的真實label: 0
輸出當前測試樣本的的預測概率: 0


輸出當前測試樣本:1

輸出當前測試樣本:1
age 33.0
workclass 4.0
education_num 10.0
marital_status 4.0
occupation 3.0
relationship 1.0
race 2.0
sex 1.0
capital_gain 8614.0
capital_loss 0.0
hours_per_week 40.0
native_country 39.0
y_val_predi 1.0
y_val 1.0
Name: 12519, dtype: float64
輸出當前測試樣本的真實label: 1
輸出當前測試樣本的的預測概率: 1


輸出當前測試樣本:5?

輸出當前測試樣本:5
age 45.0
workclass 4.0
education_num 10.0
marital_status 2.0
occupation 4.0
relationship 0.0
race 4.0
sex 1.0
capital_gain 0.0
capital_loss 0.0
hours_per_week 40.0
native_country 39.0
y_val_predi 1.0
y_val 0.0
Name: 4319, dtype: float64
輸出當前測試樣本的真實label: 0
輸出當前測試樣本的的預測概率: 1


輸出當前測試樣本:7?

輸出當前測試樣本:7
age 60.0
workclass 0.0
education_num 13.0
marital_status 2.0
occupation 0.0
relationship 0.0
race 4.0
sex 1.0
capital_gain 0.0
capital_loss 0.0
hours_per_week 8.0
native_country 39.0
y_val_predi 0.0
y_val 1.0
Name: 4721, dtype: float64
輸出當前測試樣本的真實label: 1
輸出當前測試樣本的的預測概率: 0


(2)、多個樣本力圖可視化
# (2.1)、特征貢獻度力圖可視化,利用深紅色深藍色地圖可視化前 5個預測解釋,可以使用X數(shù)據(jù)集。
# (2.2)、誤分類力圖可視化,肯定要用X_val數(shù)據(jù)集,因為涉及到模型預測。
如果對多個樣本進行解釋,將上述形式旋轉90度然后水平并排放置,得到力圖的變體


# ?5.2、決策圖可視化分析:模型如何做出決策
# (1)、單個樣本決策圖可視化
# (2)、多個樣本決策圖可視化
# (2.1)、部分樣本決策圖可視化
# (2.2)、誤分類樣本決策圖可視化