This analysis is homework for the MI-PDD (data preprocessing) course at FIT, ČVUT. You can also find this work on our faculty’s GitLab.

In short, the main task is to play with balancing and binning to obtain the best results for the binary classification task.

## What I was supposed to do:

2. Use binning (on features of your choice, with your choice of parameters) and comment on its effects on classification performance.
3. Use at least 2 other preprocessing techniques (your choice!) on the data set and comment on the classification results.
4. Run all classification tests at least three times – once for unbalanced original data, twice for balanced data (try at least 2 balancing techniques), compare those results (give a comment).

In :

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns


In :

data = pd.read_csv("default_of_credit_card_clients.csv", sep=';')


# Introduction

Since this is a well-known dataset of customers’ default payments in Taiwan (described also here), I am going to use this a priori knowledge and transform the dataset a bit.

In :

columns = [
"LIMIT_BAL",
"SEX", "EDUCATION", "MARRIAGE",
"AGE",
"PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6",
"BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
"PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
"CLASS"]
print(len(columns), len(data.columns))
data.columns = columns

24 24


## Categorical Features

In :

for col in ["SEX", "EDUCATION", "MARRIAGE"]:
    data[col] = data[col].astype('category')


Now let’s convert those categorical features into dummy variables.

In :

data = pd.get_dummies(data)


In :

for col in ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]:
    print(np.sort(data[col].unique().tolist()))
    data[col + "_DULY"] = pd.Series(data[col] <= 0)

[-2 -1  0  1  2  3  4  5  6  7  8]
[-2 -1  0  1  2  3  4  5  6  7  8]
[-2 -1  0  1  2  3  4  5  6  7  8]
[-2 -1  0  1  2  3  4  5  6  7  8]
[-2 -1  0  2  3  4  5  6  7  8]
[-2 -1  0  2  3  4  5  6  7  8]

• These features represent the “repayment status”. -1 means “paid duly”. I am not sure how to handle the -2 values, so I assume the repayment was made even sooner.
• Therefore I created indicator features to represent timely payment (<= 0).

## Classes overview

In :

# Keyword arguments work across seaborn versions (positional x/y are rejected by newer ones)
sns.barplot(x=data['CLASS'].value_counts().index, y=data['CLASS'].value_counts())
plt.show()
print(f"{data['CLASS'].mean()*100}% of CLASS 1")

22.12% of CLASS 1


We can clearly see that the majority class is 0.

Now let’s split the dataset into train and validation sets.

Note: I have read some articles and answers on Stack Overflow about balancing, and they all say that I should balance only the training set.

In :

train_data, valid_data = train_test_split(data, test_size=.2, stratify=data.CLASS, random_state=42)

# Keyword arguments work across seaborn versions (positional x/y are rejected by newer ones)
sns.barplot(x=train_data.CLASS.value_counts().index, y=train_data.CLASS.value_counts()).set_title('train')
plt.show()
sns.barplot(x=valid_data.CLASS.value_counts().index, y=valid_data.CLASS.value_counts()).set_title('valid')
plt.show()

print(f"y_train: {train_data.CLASS.mean()*100}%, y_test: {valid_data.CLASS.mean()*100}%")

y_train: 22.120833333333334%, y_test: 22.116666666666667%
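
The note about balancing only the training set can be sketched on toy data as follows. This is a minimal illustration, not the code used in this homework: the `toy` frame, its `feature` column and the hand-rolled under-sampling are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the credit data: roughly 20% positives (hypothetical values).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "CLASS": (rng.random(1000) < 0.2).astype(int),
})

# 1) Split first, keeping the class ratio in both parts.
toy_train, toy_valid = train_test_split(
    toy, test_size=0.2, stratify=toy["CLASS"], random_state=42
)

# 2) Balance only the training part (hand-rolled random under-sampling).
n_minority = int(toy_train["CLASS"].sum())
majority = toy_train[toy_train["CLASS"] == 0].sample(n=n_minority, random_state=42)
minority = toy_train[toy_train["CLASS"] == 1]
toy_train_balanced = pd.concat([majority, minority]).sample(frac=1, random_state=42)

print(toy_train_balanced["CLASS"].mean())  # share of class 1 in the balanced training set
print(toy_valid["CLASS"].mean())           # validation set keeps the original ratio
```

The validation part is never resampled, so the scores measured on it still reflect the class ratio the model would meet in practice.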


# Classification on unbalanced (preprocessed) data

• I am going to use Gaussian Naive Bayes, SVM with RBF kernel, Random Forest, Neural Network, K-Nearest Neighbors and Logistic Regression as classification models.
• On the train set, I will do a small grid search for the best parameters, scored by stratified 3-fold cross-validation (the default splitter in GridSearchCV for classifiers).
• Then I will take the best estimator, fitted on the training data, and predict the validation data.
• In the end, you will see the ROC curve and the Precision-Recall curve, with the AUC-ROC and average precision for each model.
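
To make the two metrics concrete, here is a tiny sketch that computes them for four hand-made predictions. The labels and scores are invented for illustration and have nothing to do with the credit data.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Four toy samples: true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC-ROC: fraction of (positive, negative) pairs ranked in the right order.
auc_roc = roc_auc_score(y_true, y_score)

# Average precision: precision at each positive hit, weighted by the gain in recall.
avg_precision = average_precision_score(y_true, y_score)

print(f"AUC-ROC = {auc_roc:.3f}, average precision = {avg_precision:.3f}")
```

Three of the four positive/negative pairs are ordered correctly, which gives an AUC-ROC of 0.75, while the average precision comes out at about 0.833.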

In :

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, get_scorer, roc_curve, average_precision_score, precision_recall_curve, auc


In :

def models_pipeline(data_train, data_val, verbose=False, scale=True, pca=None, data_name="unbalanced"):
    X_train = data_train.drop(columns=["CLASS"])
    y_train = data_train.CLASS

    X_val = data_val.drop(columns=["CLASS"])
    y_val = data_val.CLASS

    # (name, estimator, parameter grid for the grid search)
    classifiers = [
        ("Naive Bayes", GaussianNB(), {'priors': [[1 - y_train.mean(), y_train.mean()]]}),
        ("RBF SVM", SVC(probability=True), {'kernel': ['rbf'], 'C': [1, 5]}),
        ("Random Forest", RandomForestClassifier(n_jobs=-1, random_state=42), {'n_estimators': [100, 300]}),
        ("Neural Net", MLPClassifier(), {'learning_rate_init': [0.001, 0.0001], 'max_iter': [300]}),
        ("K-NN", KNeighborsClassifier(n_jobs=-1), {'n_neighbors': [3, 5, 10]}),
        ("Log Reg", LogisticRegression(n_jobs=-1), {'C': [1, 5, 10]})
    ]

    if scale:
        # Fit the scaler on the training set only, then apply it to both sets
        scaler = StandardScaler()
        scaler.fit(X_train)
        X_train = scaler.transform(X_train)
        X_val = scaler.transform(X_val)
    #     print(X_train.shape)
    if pca is not None and (0 < pca < 1):
        # Keep a fraction `pca` of the original number of features as components
        p = PCA(int(X_train.shape[1] * pca))
        p.fit(X_train)
        X_train = p.transform(X_train)
        X_val = p.transform(X_val)

    #     print(X_train.shape)
    figure = plt.figure(figsize=(3, 3))
    sns.barplot(x=y_train.value_counts().index, y=y_train.value_counts()).set_title('train set class')
    plt.show()

    figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

    results = []
    for name, clf, param in classifiers:
        gscv = GridSearchCV(clf, param_grid=param, scoring=get_scorer('roc_auc'), n_jobs=-1, cv=3)
        gscv.fit(X_train, y_train)
        #     mean_test_ROC = gscv.cv_results_['mean_test_score']
        best_score = gscv.best_score_
        best_estimator = gscv.best_estimator_
        print(f"{name} done. ", end="")

        # Predicted probability of class 1 on the validation set
        y_val_score = best_estimator.predict_proba(X_val).T[1]

        auc_roc = roc_auc_score(y_val, y_val_score)
        fpr, tpr, thresholds = roc_curve(y_val, y_val_score)
        axes[0].plot(fpr, tpr, label=f"AUC = {auc_roc:0.3f}, "+name)

        avg_precision = average_precision_score(y_val, y_val_score)
        precision, recall, pr_thresholds = precision_recall_curve(y_val, y_val_score)

        axes[1].plot(precision, recall, label=f"P = {avg_precision:.3f}, "+name)

        res = {"name": name, "ROC_"+data_name: auc_roc, "P_"+data_name: avg_precision}
        results.append(res)

    print()
    axes[0].set(title='ROC curve of models on validation set', xlabel='False Positive Rate', ylabel='True Positive Rate')
    axes[1].set(title='Precision(P)-Recall curve', xlabel='Precision', ylabel='Recall')
    axes[0].legend(loc='lower right')
    axes[1].legend()
    plt.show()
    r = pd.DataFrame(results).set_index('name')
    return r

• First, let’s run it on unbalanced data, with PCA keeping 2/3 * n_features components.
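
As a rough sketch of what “2/3 * n_features components” means, the snippet below reduces a random matrix with 30 made-up features. The data and the names `X_toy` and `fraction` are assumptions for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 30))   # 200 rows, 30 hypothetical features

fraction = 2 / 3
# Number of components as a fraction of the number of features, truncated to an integer.
n_components = int(X_toy.shape[1] * fraction)

# Scale first, so that no feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X_toy)
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)

print(X_toy.shape, "->", X_reduced.shape)
```

Note that the fraction is converted to an integer count first. Passing a float between 0 and 1 directly as `n_components` means something different in scikit-learn: it asks for enough components to explain that share of the variance.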

In :

models_pipeline(train_data, valid_data, pca=(2/3))


Out:

• We can see that the Neural Net is the leader in both ROC and P-R metrics. I set it to 300 iterations, so it is (over)fitting very well.
• The table at the bottom shows us the best ROC on the training set (mean over all folds of the best estimator), which is almost the same as on the validation set. I think the class imbalance is what causes this.

# Balancing

• I use Random Under Sampler, Tomek Links and Edited Nearest Neighbours for under-sampling, SMOTE for over-sampling and SMOTEENN for a combined approach.
• I let my classification pipeline gather both metrics and plot them for each of the samplers.
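
For intuition about the over-sampling step, the core idea of SMOTE can be sketched in a few lines: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration under my own assumptions (the function name `smote_like_sample` and its parameters are made up, and duplicate points are not handled); the real `imblearn` implementation is more elaborate.

```python
import numpy as np

def smote_like_sample(X_minority, k=3, n_new=5, seed=42):
    """Create synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # Distances from sample i to all minority samples.
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        # k nearest neighbours; position 0 of the sort is the sample itself
        # (assuming there are no duplicate points).
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Four toy minority samples on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_sample(X_min)
print(new_points)
```

Every synthetic point lies between two existing minority samples, so the new data stays inside the region the minority class already occupies.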

In :

from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

resamplers = [
    (None, 'unbalanced'),
    (RandomUnderSampler(random_state=42),'Random under-sampling'),
    (EditedNearestNeighbours(),'ENN'),
    (SMOTE(random_state=42),'SMOTE'),
    (SMOTEENN(random_state=42),'SMOTEENN'),
]

# models_pipeline(train_data, valid_data)

results = pd.DataFrame()

for resampler, description in resamplers:
    X_res = train_data.copy()
    y_res = train_data.CLASS
    print(X_res.shape)
    print(description)
    if resampler is not None:
        X_res = train_data.drop(columns=['CLASS'])
        y_res = train_data.CLASS
        # fit_sample was renamed to fit_resample in newer imbalanced-learn releases
        X_res, y_res = resampler.fit_resample(X_res, y_res)

        X_res = pd.DataFrame(X_res)
        X_res['CLASS'] = y_res
        print(f"class 1: {y_res.mean()*100}%")

    print(X_res.shape)

    # Only the training set is resampled; the validation set stays untouched
    results = pd.concat([results, models_pipeline(X_res, valid_data, data_name=description)], axis=1)
results

(24000, 40)
unbalanced
(24000, 40)

(24000, 40)
Random under-sampling
class 1: 50.0%
(10618, 40)

(24000, 40)
class 1: 23.540105529197888%
(22553, 40)

(24000, 40)
ENN
class 1: 34.16344916344916%
(15540, 40)

(24000, 40)
SMOTE
class 1: 50.0%
(37382, 40)

(24000, 40)
SMOTEENN
class 1: 62.53817750831928%
(21937, 40)
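
The row counts printed above can be checked with a little arithmetic. This is only a sanity check, assuming that random under-sampling shrinks the majority class to the minority size and that SMOTE grows the minority class to the majority size (their default behaviour, as far as I know).

```python
n_train = 24000
minority_share = 0.22120833333333334   # share of class 1 in the training set (printed earlier)

n_minority = round(n_train * minority_share)
n_majority = n_train - n_minority

# Random under-sampling: majority cut down to the minority size.
under_sampled = 2 * n_minority
# SMOTE: minority topped up to the majority size.
over_sampled = 2 * n_majority

print(n_minority, n_majority)        # class sizes in the original training set
print(under_sampled, over_sampled)   # compare with the shapes printed above

# ENN printed 15540 rows; if it kept every minority sample, class 1 makes up this share:
print(n_minority / 15540)
```

The numbers agree with the printed shapes (10618 and 37382 rows), and the ENN share matches the printed 34.16%, which suggests that ENN removed only majority samples here.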


Out:

In :

def plot_results(results):
    prec = results.filter(regex=("^P"))
    roc = results.filter(regex=("^ROC"))
    figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    # Left panel: average precision, right panel: AUC-ROC
    prec.T.plot(ax=axes[0], title='Average Precisions')
    axes[0].set(xlabel='Estimator', ylabel='Average Precision')
    roc.T.plot(ax=axes[1], title='AUC-ROC scores')
    axes[1].set(xlabel='Estimator', ylabel='AUC-ROC')
    for tick in axes[0].get_xticklabels():
        tick.set_rotation(45)
    for tick in axes[1].get_xticklabels():
        tick.set_rotation(45)
    plt.show()

plot_results(results)

• Neural Network and Random Forest are the leaders. They seem to be pretty stable, but surprisingly, they work best on unbalanced data.
• SVM seems to be sensitive to sampling; it works better on the under-sampled training set.

# Binning

• Studied here.
• I will skip Fixed-Width Binning, since it is not as data-driven as Adaptive Binning.
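
The difference between the two approaches can be seen on a small skewed sample. This sketch uses made-up exponential data (the name `skewed` is hypothetical): `pd.cut` makes bins of equal width, while `pd.qcut` makes bins with equal numbers of samples.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(scale=1.0, size=1000))  # hypothetical right-skewed feature

# Fixed-width binning: 4 bins of equal width between min and max.
fixed_width = pd.cut(skewed, bins=4, labels=[1, 2, 3, 4])

# Adaptive (quantile) binning: 4 bins holding the same number of samples.
adaptive = pd.qcut(skewed, q=4, labels=[1, 2, 3, 4])

print(fixed_width.value_counts().sort_index())
print(adaptive.value_counts().sort_index())
```

On skewed data the fixed-width bins end up very unevenly filled, with most samples in the first bin, whereas each quantile bin holds a quarter of the data.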

Let’s take another look at the unique values in our data:

In :

df = data.copy()
data.nunique()


Out:

LIMIT_BAL         81
AGE               56
PAY_0             11
PAY_2             11
PAY_3             11
PAY_4             11
PAY_5             10
PAY_6             10
BILL_AMT1      22723
BILL_AMT2      22346
BILL_AMT3      22026
BILL_AMT4      21548
BILL_AMT5      21010
BILL_AMT6      20604
PAY_AMT1        7943
PAY_AMT2        7899
PAY_AMT3        7518
PAY_AMT4        6937
PAY_AMT5        6897
PAY_AMT6        6939
CLASS              2
SEX_1              2
SEX_2              2
EDUCATION_0        2
EDUCATION_1        2
EDUCATION_2        2
EDUCATION_3        2
EDUCATION_4        2
EDUCATION_5        2
EDUCATION_6        2
MARRIAGE_0         2
MARRIAGE_1         2
MARRIAGE_2         2
MARRIAGE_3         2
PAY_0_DULY         2
PAY_2_DULY         2
PAY_3_DULY         2
PAY_4_DULY         2
PAY_5_DULY         2
PAY_6_DULY         2
dtype: int64

• First I will take a look at the AGE feature. It is already a discrete variable, so binning (categorizing) it should be sm-oo-th (haha, binning in practice).
• Let’s take a look at a histogram describing the distribution of this feature.

In :

fig, ax = plt.subplots()
# data['AGE'].hist(edgecolor='black', grid=False)
sns.distplot(data.AGE)
ax.set_title('Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)


Out:

Text(0, 0.5, 'Frequency')

• As expected, the distribution looks roughly normal but right-skewed (not many people under 21 deal with money and mortgages).
• The peaks around the ages of 30, 35, 40 and 50 are surprising, but I will not discuss them here.

We will divide age into 4 equally populated bins (quartiles) by looking at the quantiles of the data.

In :

quantile_list = [0, .25, .5, .75, 1.]
quantiles = data['AGE'].quantile(quantile_list)
quantiles


Out:

0.00    21.0
0.25    28.0
0.50    34.0
0.75    41.0
1.00    79.0
Name: AGE, dtype: float64

Now let’s take a look at how well it slices the data:

In :

def quant_hist(col):
    quantile_list = [0, .25, .5, .75, 1.]
    quantiles = data[col].quantile(quantile_list)

    fig, ax = plt.subplots(figsize=(4,3))
    sns.distplot(data[col])
    # Mark each quantile with a vertical red line
    for quantile in quantiles:
        qvl = plt.axvline(quantile, color='r')
    ax.legend([qvl], ['Quantiles'], fontsize=10)
    ax.set_title(col+' Histogram with Quantiles',
                 fontsize=12)
    ax.set_xlabel(col, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)

quant_hist('AGE')


This looks nice, so I am going to create a new feature: an indicator of the quartile range to which a given age belongs.

Note: I am going to label the bins in order (1 to 4), since this categorical value is ordinal and I will treat it like any other discrete feature.

In :

quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
quantile_labels = [1, 2, 3, 4]
data['AGE_Q'] = pd.qcut(data['AGE'], q=quantile_list, labels=quantile_labels)

data[['AGE', 'AGE_Q']].iloc[4:9]


Out:

It works like a charm.

In :

quant_hist('LIMIT_BAL')


This feature also looks suitable for the same conversion, so again I am going to create a new ordinal feature.

In :

data['LIMIT_BAL_Q'] = pd.qcut(data['LIMIT_BAL'], q=quantile_list, labels=quantile_labels)
data[['LIMIT_BAL', 'LIMIT_BAL_Q']].iloc[4:9]


Out:

Each feature PAY_AMT[1-6] contains lots of zeros, which pushes the mean and median closer to zero, even after taking the logarithm of the feature:

In :

def quant_hist_log(col):
    figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

    # Left panel: histogram of log(1 + x) with its mean marked
    log_col = np.log1p(data[col])
    log_mean = np.mean(log_col)

    log_col.hist(bins=30, ax=axes[0])
    #     sns.distplot(log_col, ax=axes[0])

    axes[0].axvline(log_mean, color='r')
    axes[0].set_title(col+' Histogram after Log Transform')
    axes[0].set_xlabel(col+'(log scale)', fontsize=12)
    axes[0].set_ylabel('Frequency', fontsize=12)
    axes[0].text(11.5, 450, r'$\mu$='+str(log_mean), fontsize=10)

    # Right panel: histogram of the raw values with the quartiles marked
    quantile_list = [0, .25, .5, .75, 1.]
    quantiles = data[col].quantile(quantile_list)

    data[col].hist(bins=60, ax=axes[1])
    for quantile in quantiles:
        qvl = axes[1].axvline(quantile, color='r')
    axes[1].legend([qvl], ['Quantiles'], fontsize=10)
    axes[1].set_title(col+' Histogram with Quantiles',
                      fontsize=12)
    axes[1].set_xlabel(col, fontsize=12)
    axes[1].set_ylabel('Frequency', fontsize=12)


In :

for n in range(1, 6):
    quant_hist_log('PAY_AMT'+str(n))


Hence using quantiles would not be appropriate here. An indicator of a zero value seems more suitable.

I can imagine these solutions:

• Take only non-zero entries and divide them into quantiles.
• Take only non-zero entries and split them at their (log) mean. This is the option I chose.
• Ignore this problem and don’t use binning on these features.
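
The second option can be sketched on toy payments as follows. This is an illustrative variant under my own assumptions, not the exact function used in this homework: the helper `zero_and_log_mean_bins` and the sample values are made up, and the logarithm is taken only where it is defined, so no divide-by-zero warning appears.

```python
import numpy as np
import pandas as pd

def zero_and_log_mean_bins(values):
    """Bin 0: zero payments; bin 1: non-zero with log <= mean log; bin 2: the rest."""
    values = pd.Series(values, dtype=float)
    nonzero = values > 0
    log_values = np.log(values[nonzero])      # log only where it is defined
    log_mean = log_values.mean()

    bins = pd.Series(0, index=values.index)   # everything starts in the "zero" bin
    bins[nonzero] = np.where(log_values <= log_mean, 1, 2)
    return bins

toy_payments = [0, 0, 10, 100, 1000, 0, 100000]
print(zero_and_log_mean_bins(toy_payments).tolist())
```

For these toy values the mean of the non-zero logs lies between log(100) and log(1000), so the result is `[0, 0, 1, 1, 2, 0, 2]`: zeros get their own bin and the non-zero payments are split into a lower and an upper group.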

In :

def log_mean_division(col, verbose=False):
    # Zeros become -inf under the log; map them back to 0 so they land in the first bin
    log_col = np.log(data[col])
    log_col[log_col == -np.inf] = 0
    log_mean = log_col[log_col != 0].mean()

    bin_ranges = [-np.inf, 0, log_mean, log_col.max()]
    bin_names = [0, 1, 2]
    if verbose:
        display(pd.concat([data[col], pd.Series(log_col), pd.cut(log_col, bins=bin_ranges), pd.cut(log_col, bins=bin_ranges, labels=bin_names)], axis=1).head(10))
    return pd.cut(log_col, bins=bin_ranges, labels=bin_names)

col = 'PAY_AMT1'
cutt = log_mean_division(col)

# pd.concat([pd.cut(log_col, bins=bin_ranges), pd.Series(log_col)], axis=1)

cutt.dtype


Out:

CategoricalDtype(categories=[0, 1, 2], ordered=True)

Looks fine. Let’s add all of these features into the data.

In :

for n in range(1, 6):
    col = 'PAY_AMT'+str(n)
    data['PAY_AMT'+str(n)+'_log_mean'] = log_mean_division(col)

/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: divide by zero encountered in log



In :

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 47 columns):
LIMIT_BAL            30000 non-null int64
AGE                  30000 non-null int64
PAY_0                30000 non-null int64
PAY_2                30000 non-null int64
PAY_3                30000 non-null int64
PAY_4                30000 non-null int64
PAY_5                30000 non-null int64
PAY_6                30000 non-null int64
BILL_AMT1            30000 non-null int64
BILL_AMT2            30000 non-null int64
BILL_AMT3            30000 non-null int64
BILL_AMT4            30000 non-null int64
BILL_AMT5            30000 non-null int64
BILL_AMT6            30000 non-null int64
PAY_AMT1             30000 non-null int64
PAY_AMT2             30000 non-null int64
PAY_AMT3             30000 non-null int64
PAY_AMT4             30000 non-null int64
PAY_AMT5             30000 non-null int64
PAY_AMT6             30000 non-null int64
CLASS                30000 non-null int64
SEX_1                30000 non-null uint8
SEX_2                30000 non-null uint8
EDUCATION_0          30000 non-null uint8
EDUCATION_1          30000 non-null uint8
EDUCATION_2          30000 non-null uint8
EDUCATION_3          30000 non-null uint8
EDUCATION_4          30000 non-null uint8
EDUCATION_5          30000 non-null uint8
EDUCATION_6          30000 non-null uint8
MARRIAGE_0           30000 non-null uint8
MARRIAGE_1           30000 non-null uint8
MARRIAGE_2           30000 non-null uint8
MARRIAGE_3           30000 non-null uint8
PAY_0_DULY           30000 non-null bool
PAY_2_DULY           30000 non-null bool
PAY_3_DULY           30000 non-null bool
PAY_4_DULY           30000 non-null bool
PAY_5_DULY           30000 non-null bool
PAY_6_DULY           30000 non-null bool
AGE_Q                30000 non-null category
LIMIT_BAL_Q          30000 non-null category
PAY_AMT1_log_mean    30000 non-null category
PAY_AMT2_log_mean    30000 non-null category
PAY_AMT3_log_mean    30000 non-null category
PAY_AMT4_log_mean    30000 non-null category
PAY_AMT5_log_mean    30000 non-null category
dtypes: bool(6), category(7), int64(21), uint8(13)
memory usage: 5.6 MB

• Unfortunately, there are negative values in the BILL_AMT[1-6] features, so the log-based binning above cannot be applied and I will leave them as they are.

In :

np.sum(data.BILL_AMT1 <=0)


Out:

2598

In :

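# NaN never compares equal to anything (including itself), so `data == np.nan` would always count zero;
# isna() is the reliable check for missing values
data.isna().sum()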


Out:

LIMIT_BAL            0.0
AGE                  0.0
PAY_0                0.0
PAY_2                0.0
PAY_3                0.0
PAY_4                0.0
PAY_5                0.0
PAY_6                0.0
BILL_AMT1            0.0
BILL_AMT2            0.0
BILL_AMT3            0.0
BILL_AMT4            0.0
BILL_AMT5            0.0
BILL_AMT6            0.0
PAY_AMT1             0.0
PAY_AMT2             0.0
PAY_AMT3             0.0
PAY_AMT4             0.0
PAY_AMT5             0.0
PAY_AMT6             0.0
CLASS                0.0
SEX_1                0.0
SEX_2                0.0
EDUCATION_0          0.0
EDUCATION_1          0.0
EDUCATION_2          0.0
EDUCATION_3          0.0
EDUCATION_4          0.0
EDUCATION_5          0.0
EDUCATION_6          0.0
MARRIAGE_0           0.0
MARRIAGE_1           0.0
MARRIAGE_2           0.0
MARRIAGE_3           0.0
PAY_0_DULY           0.0
PAY_2_DULY           0.0
PAY_3_DULY           0.0
PAY_4_DULY           0.0
PAY_5_DULY           0.0
PAY_6_DULY           0.0
AGE_Q                0.0
LIMIT_BAL_Q          0.0
PAY_AMT1_log_mean    0.0
PAY_AMT2_log_mean    0.0
PAY_AMT3_log_mean    0.0
PAY_AMT4_log_mean    0.0
PAY_AMT5_log_mean    0.0
dtype: float64
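
A quick side note on checking for missing values: comparing with `np.nan` never finds anything, because NaN is not equal to itself. The toy frame below (made-up values) shows the difference to `isna()`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in "a" and two in "b".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

equality_counts = (toy == np.nan).sum().tolist()   # equality never matches NaN
isna_counts = toy.isna().sum().tolist()            # isna() counts the missing entries

print(equality_counts)
print(isna_counts)
```

The equality check reports zero missing values for both columns, while `isna()` finds one and two. For this dataset the conclusion is the same either way, since `data.info()` above already shows 30000 non-null entries in every column.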

In :

for col in data.select_dtypes('category').columns:
    data[col] = data[col].cat.codes


In :

train_data, valid_data = train_test_split(data, test_size=.2, stratify=data.CLASS, random_state=42)


In :

resamplers = [
    (None, 'unbalanced'),
    (RandomUnderSampler(random_state=42),'Random under-sampling'),
    (EditedNearestNeighbours(),'ENN'),
    (SMOTE(random_state=42),'SMOTE'),
    (SMOTEENN(random_state=42),'SMOTEENN'),
]

results_bin = pd.DataFrame()

for resampler, description in resamplers:
    X_res = train_data.copy()
    y_res = train_data.CLASS
    print(X_res.shape)
    print(description)
    if resampler is not None:
        X_res = train_data.drop(columns=['CLASS'])
        y_res = train_data.CLASS
        # fit_sample was renamed to fit_resample in newer imbalanced-learn releases
        X_res, y_res = resampler.fit_resample(X_res, y_res)

        X_res = pd.DataFrame(X_res)
        X_res['CLASS'] = y_res
        print(f"class 1: {y_res.mean()*100}%")

    print(X_res.shape)

    # Accumulate into results_bin (not results), so that every resampler's scores on binned data are kept
    results_bin = pd.concat([results_bin, models_pipeline(X_res, valid_data, data_name=description)], axis=1)
results_bin

(24000, 47)
unbalanced
(24000, 47)

(24000, 47)
Random under-sampling
class 1: 50.0%
(10618, 47)

(24000, 47)
class 1: 23.53801817778763%
(22555, 47)

(24000, 47)
ENN
class 1: 34.16344916344916%
(15540, 47)

(24000, 47)
SMOTE
class 1: 50.0%
(37382, 47)

(24000, 47)
SMOTEENN
class 1: 62.541028446389504%
(21936, 47)


Out:

In :

plot_results(results_bin)

• Binning helped SVM when SMOTEENN was used; however, this model is still not the best one for this dataset.
• The neural network likes a lot of data, so under-sampling causes a small drop in its results.

# Conclusion

• Use binning (on features of your choice, with your choice of parameters) and comment on its effects on classification performance.
  • I used binning on three types of features. On AGE and LIMIT_BAL I used quartile binning; on PAY_AMT I used a zero indicator plus two bins split at the mean of the non-zero (log) values.
• Use at least 2 other preprocessing techniques (your choice!) on the data set and comment on the classification results.
  • I used PCA, scaling and dummy-variable conversion, and I also created indicator features for timely payment. These steps did not seem to have much effect on the classification results.
• Run all classification tests at least three times – once for unbalanced original data, twice for balanced data (try at least 2 balancing techniques), compare those results (give a comment).
  • I ran the classification tests on binned data (with pretty much no change compared to non-binned data). I ran all models with cross-validation and grid search on unbalanced data and on data balanced by random under-sampling, Tomek links removal and ENN under-sampling, by SMOTE over-sampling, and by SMOTEENN as a combination.
  • The results do not seem to vary much.

Notes: I used the same train/validation split for every result to make it reproducible. The validation set remained unbalanced.