Starbucks

Starbucks

Nikde v európskom Starbuckse som nevidel, že by do jednotlivých okienok na pohároch niečo vpisovali. Možno to bolo tým, že som si nikdy neobjednal nič komplikovanejšie ako americano. 

Tu v USA je Starbucks braný viac ako fast-food, akurát s kávou a sem-tam sendvičom. Chodia tu ale všetci od bezdomovcov po vrchné jedno percento. Objednávok je naraz toľko, že sa nemusia zmestiť na náš objednávkový stôl, a ten, čo obsluhuje register, do štvorčekov píše dané objednávky, aby baristi (tak by som nás ale nenazýval) vedeli, čo za špecialitu idú pripravovať. Mimochodom, počas sviatkov a víkendov sú tu v Starbuckse rady až za vchodové dvere. Po dvere je to asi 20-25 ľudí. Počas Dňa Nezávislosti taká rada bola po celý deň a neubúdalo z nej.

V Európe sa človek bojí objednať čo len trošku komplikovanejší nápoj, na Vlaďu v Prahe vždy namosúrene pokukujú, keď si objednáva nízkotučné laté s bezkalorickým vanilkovým sirupom. Tu v USA je to jedna z najľahšich a najfrekventovanejších objednávok s vlastným názvom – chudé vanilkové latté (skinny vanilla latte). 

“Poprosím karamelové obláčikové macchiato so sójovým mliekom, espressom naviac a s extra karamelovým dresingom.” Keďže si takéto niečo nikdy nikto nepýta, určite niektorá z ingrediencií expirovala a v ponuke to nenájdeme. 

Keby ingrediencie aj nemali expiračnú dobu (všetko mi tu príde dostatočne umelé, aby to prežilo aj drahú Zdenu Studenkovú), v našich jazyckoch to všetko znie extrémne smiešne.

Frapučíno vanilkových bôbov = Vanilla Bean Frapuccino

Červenoočko = Red Eye (filtrovaná káva s espressom naviac)

Mangovo-dračí osviežovač = Mango Dragon Refresher

Veľmi bobuľkový ibištekový osviežovač = Very Berry Hibiscus Refresher

Studenovarná ľadová káva so solenou studenou penou = Salted Cold Foam Cold Brew Ice Coffee

Seriózne jahodové frapučíno = Serious Strawberry Frapuccino

Čo ale úplne postráda kreativitu sú názvy ako Ružový nápoj (Pink Drink – jahodový sirup/čaj a kokosové mlieko) Batika (Tie-dye – rôznofarebné prášky –  umelý hnus – s rozmixovaným mliekom a ľadom) a podobne.

Náš zákazník – náš pán: týmto heslom sa tu riadi každá prevádzka, nielen Starbucks, takže keď je niekto nespokojný s tým čo dostal, pokojne to môže vrátiť a my mu to spravíme tak ako chce, prípadne presne tak isto ako to chcel predtým, akurát znova. Alebo znova. A znova. Niektorí ľudia sú tak prieberčiví, že nechcú prijať to, čo si objednali, ani na tretíkrát. Cítia sa špeciálne. 

S týmto prístupom sú tu však riziká v podobe “vychcalých” zákazníkov, ktorí si myslia, že ovládajú vesmír. V jeden pokojný a nie príliš vyťažený deň k nám prišla zákazníčka, ktorá hovorí, že jej vanilkové latté bolo príliš vodové a chce ho spraviť znova. V ruke drží iba telefón a nervózne tam niečo skroluje. Ospravedlním sa a hovorím, že nie je problém urobiť nové (americký kolega za mnou sa ospravedlnil už piatykrát a latté má hotové). Hovorím, že by som rád videl nejaký doklad. Vraví, že mi môže ukázať bankovú transakciu na mobile. Tak sa pýtam či ju môžem vidieť, na čo sa ma spýta či si myslím, že klame. Pozriem sa jej do očí a odpoviem, že áno. Na to sa otočí a bez slova odkráča preč. 

To, že sa musíme v Starbuckse správať ku každému zákazníkovi extrémne milo so sebou prináša aj zopár nevyžiadaných fenoménov. V rámci podpory tzv. “customer connection” (spojenie so zákazníkom) máme za povinnosť sa každého pri výdaji kávy pýtať aký mal deň, čo robí v meste a podobné hovadiny (máme dokonca vylepené nejaké takéto otázky z vnútornej strany výdajného stola – aby sme sa inšpirovali). Na základe našej prívetivosti sa nájdu zákazníci, čo nám to totálne zožerú a majú pocit, že to, že sme na nich milý, znamená, že sme automaticky dobrí kamaráti. Takto ku nám neustále chodil jeden autobusár, ktorý mal pocit, že sa s ním všetci chceme kamarátiť a otravoval nám každý deň jeho rečami. Dokonca si nás popridával na facebooku. Je mi ho trochu ľúto, že nemá lepších kamarátov ako robotníkov v Starbuckse, ktorí na neho *musia* byť milý. Ešteže platil v porovnaní s ostatnými veľké dýška. 

Denný počet kalórií pre muža môjho veku je približne 2000. Naše frapučína majú štandardne 450-550 kalórií. Do každého ide niekoľko dávok takého sirupu, onakého sirupu, taká sladká báza, hentaký sladký prášok… Niektorí to berú ako raňajky alebo večeru, ale netrápi ich, že sú to úplne prázdne kalórie bez živín. Niektorí to zas berú skôr ako dezert alebo len taký zákusok od dobroty po jedle, takže nakoniec ani neviem, čo je horšie. Každopádne frapučína sú tu extrémne populárne, ostatne ako čokoľvek, kde je enormné množstvo cukru. 

Balancing And Binning on “Customers’ Default Payments In Taiwan” Dataset

I did this analysis as a homework on MI-PDD (data preprocessing) course on FIT, ČVUT. You can also find this work on our faculty’s GitLab.

In short, the main task is to play with balancing and binning to obtain the best results for the binary classification task.

What I was supposed to do:

  1. Download the dataset from the course pages (default_of_credit_card_clients.csv).
  2. Use binning (on features of your choice, with your choice of parameters) and comment on its effects on classification performance.
  3. Use at least 2 other preprocessing techniques (your choice!) on the data set and comment on the classification results.
  4. Run all classification tests at least three times – once for unbalanced original data, twice for balanced data (try at least 2 balancing techniques), compare those results (give a comment).

Give comments on each step of your solution, with short explanations of your choices.

In [1]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

data = pd.read_csv("default_of_credit_card_clients.csv", sep=';')

Introduction

Since this is a well-known dataset of customers’ default payments in Taiwan (described also here), I am going to use this apriori knowledge and transform this dataset a bit. In [3]:

columns = [
"LIMIT_BAL", 
"SEX", "EDUCATION", "MARRIAGE",
"AGE",
"PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6",
"BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
"PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
"CLASS"]
print(len(columns), len(data.columns))
data.columns = columns
24 24

Categorical Features

In [4]:

for col in ["SEX", "EDUCATION", "MARRIAGE"]:
    data[col] = data[col].astype('category')

Now let’s convert those categorical data In [5]:

data = pd.get_dummies(data)

Feature Addition

In [6]:

for col in ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]:
    print(np.sort(data[col].unique().tolist()))
    data[col + "_DULY"] = pd.Series(data[col] <= 0)
[-2 -1  0  1  2  3  4  5  6  7  8]
[-2 -1  0  1  2  3  4  5  6  7  8]
[-2 -1  0  1  2  3  4  5  6  7  8]
[-2 -1  0  1  2  3  4  5  6  7  8]
[-2 -1  0  2  3  4  5  6  7  8]
[-2 -1  0  2  3  4  5  6  7  8]
  • These features represent “Repayment status”. -1 means “paid duly”. I am not sure how to handle -2 values, so I assume repayment was even sooner.
  • Therefore I created indicator features to represent soon payment (<=0)

Classes overview

In [7]:

sns.barplot(data['CLASS'].value_counts().index, data['CLASS'].value_counts())
plt.show()
print(f"{data['CLASS'].mean()*100}% of CLASS 1")
22.12% of CLASS 1

We can clearly see, that the majority class is 0.

Now let’s split the dataset into train and validation set

Note: I have read some articles and answeres on SO about balancing, and everytime they say, I should balance only the training set.

In [8]:

train_data, valid_data = train_test_split(data, test_size=.2, stratify=data.CLASS, random_state=42)

sns.barplot(train_data.CLASS.value_counts().index, train_data.CLASS.value_counts()).set_title('train')
plt.show()
sns.barplot(valid_data.CLASS.value_counts().index, valid_data.CLASS.value_counts()).set_title('valid')
plt.show()

print(f"y_train: {train_data.CLASS.mean()*100}%, y_test: {valid_data.CLASS.mean()*100}%")
y_train: 22.120833333333334%, y_test: 22.116666666666667%

Classification on unbalanced (preprocessed) data

  • I am going to use Gaussian Naive Bayes, SVM with RBF kernel, Neural Network, K-Nearest Neighbors and Logistic Regression as classification models.
  • On the train set, I will do a small grid-search for best parameters, which will be provided by stratified (default in GridSearchCV selector) 3-fold cross-validation.
  • Then I will choose the best estimator, fitted with training data and predict the validation data.
  • In the end, you will see ROC curve and Precision-Recall curve, with the average AUC-ROC and Precision for each model.

In [9]:

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, get_scorer, roc_curve, average_precision_score, precision_recall_curve, auc
from sklearn.metrics import plot_roc_curve

In [10]:

def models_pipeline(data_train, data_val, verbose=False, scale=True, pca=None, data_name="unbalanced"):
    X_train = data_train.drop(columns=["CLASS"])
    y_train = data_train.CLASS

    X_val = data_val.drop(columns=["CLASS"])
    y_val = data_val.CLASS

    classifiers = [
        ("Naive Bayes", GaussianNB(), {'priors': [[1 - y_train.mean(), y_train.mean()]]}),
        ("RBF SVM", SVC(probability=True), {'kernel': ['rbf'], 'C': [1, 5]}),
        ("Random Forest", RandomForestClassifier(n_jobs=-1, random_state=42), {'n_estimators': [100, 300]}),
        ("Neural Net", MLPClassifier(), {'learning_rate_init': [0.001, 0.0001], 'max_iter': [300]}),
        ("K-NN", KNeighborsClassifier(n_jobs=-1), {'n_neighbors': [3, 5, 10]}),
        ("Log Reg", LogisticRegression(n_jobs=-1), {'C': [1, 5, 10]})
    ]

    if scale:
        scaler = StandardScaler()
        scaler.fit(X_train)
        X_train = scaler.transform(X_train)
        X_val = scaler.transform(X_val)
#     print(X_train.shape)
    if pca is not None and (pca < 1 and pca > 0):
        p = PCA(int(X_train.shape[1]*pca))
        p.fit(X_train)
        X_train = p.transform(X_train)
        X_val = p.transform(X_val)

#     print(X_train.shape)
    figure = plt.figure(figsize=(3, 3))
    sns.barplot(y_train.value_counts().index, y_train.value_counts()).set_title('train set class')
    plt.show()

    figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

    results = []
    for name, clf, param in classifiers:
        gscv = GridSearchCV(clf, param_grid=param, scoring=get_scorer('roc_auc'), n_jobs=-1, cv=3)
        gscv.fit(X_train, y_train)
    #     mean_test_ROC = gscv.cv_results_['mean_test_score']
        best_score = gscv.best_score_
        best_estimator = gscv.best_estimator_
        print(f"{name} done. ", end="")

        y_val_score = best_estimator.predict_proba(X_val).T[1]

        auc_roc = roc_auc_score(y_val, y_val_score)
        fpr, tpr, thresholds = roc_curve(y_val, y_val_score)
        axes[0].plot(fpr, tpr, label=f"AUC = {auc_roc:0.3f}, "+name)

        avg_precision = average_precision_score(y_val, y_val_score)
        precision, recall, pr_thresholds = precision_recall_curve(y_val, y_val_score)

        axes[1].plot(precision, recall, label=f"P = {avg_precision:.3f}, "+name)

        res = {"name": name, "ROC_"+data_name: auc_roc, "P_"+data_name: avg_precision}
        results.append(res)

    print() 
    axes[0].set(title='ROC curve of models on validation set', xlabel='False Positive Rate', ylabel='True Positive Rate')
    axes[1].set(title='Precision(P)-Recall curve', xlabel='Precision', ylabel='Recall')
    axes[0].legend(loc='lower right')
    axes[1].legend()
    plt.show()
    r = pd.DataFrame(results).set_index('name')
    return r
  • First, let’s run it on unbalanced data, with PCA taking 2/3 * n_features as a number of components.

In [11]:

models_pipeline(train_data, valid_data, pca=(2/3))

Out[11]:

P_unbalancedROC_unbalanced
name
Naive Bayes0.4341090.731373
RBF SVM0.5050740.720788
Random Forest0.5147460.747997
Neural Net0.5336940.766496
K-NN0.4579240.729072
Log Reg0.5062230.741280
  • We can see, that Neural Net is the leader in both ROC and P-R metrics. I set it to 300 iterations, si it is (over)fitting very well.
  • The table on the bottom shows us the best ROC on the training set (mean from all folds of the best estimator), which is almost the same as on the validation set. I think this is the unbalanceness, which is causing it.

Balancing

  • I use Random Under Samper, Tomek Links and Edited Nearest Neighbors for under-sampling; SMOTE for up-sampling and SMOTEEN for a combined approach.
  • I let my classification pipeline gather both metrics and plot them for each of the samplers.

In [12]:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import TomekLinks
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN


resamplers = [
    (None, 'unbalanced'),
    (RandomUnderSampler(random_state=42),'Random under-sampling'),
    (TomekLinks(),'Tomek Links'),
    (EditedNearestNeighbours(),'ENN'),
    (SMOTE(random_state=42),'SMOTE'),
    (SMOTEENN(random_state=42),'SMOTEENN'),
]

# models_pipeline(train_data, valid_data)

results = pd.DataFrame()

for resampler, description in resamplers:
    X_res = train_data.copy()
    y_res = train_data.copy()
    print(X_res.shape)
    print(description)
    if resampler is not None:
        X_res = train_data.drop(columns=['CLASS'])
        y_res = train_data.CLASS
        X_res, y_res = resampler.fit_sample(X_res, y_res)

        X_res = pd.DataFrame(X_res)
        X_res['CLASS'] = y_res
        print(f"class 1: {y_res.mean()*100}%")

    print(X_res.shape)


    results = pd.concat([results, models_pipeline(X_res, valid_data, data_name=description)], axis=1)
results
(24000, 40)
unbalanced
(24000, 40)
(24000, 40)
Random under-sampling
class 1: 50.0%
(10618, 40)
(24000, 40)
Tomek Links
class 1: 23.540105529197888%
(22553, 40)
(24000, 40)
ENN
class 1: 34.16344916344916%
(15540, 40)
(24000, 40)
SMOTE
class 1: 50.0%
(37382, 40)
(24000, 40)
SMOTEENN
class 1: 62.53817750831928%
(21937, 40)

Out[12]:

P_unbalancedROC_unbalancedP_Random under-samplingROC_Random under-samplingP_Tomek LinksROC_Tomek LinksP_ENNROC_ENNP_SMOTEROC_SMOTEP_SMOTEENNROC_SMOTEENN
name
Naive Bayes0.4658000.7293620.4606490.7300150.4625410.7293990.4478230.7296160.4598730.7129680.4362870.713980
RBF SVM0.5017450.7168720.4814240.7458800.5024950.7249960.5002650.7449560.5064850.7276040.4968960.748476
Random Forest0.5380780.7580680.5255790.7611590.5373860.7594620.5288730.7645430.5158630.7483310.5111360.764156
Neural Net0.5395160.7691510.5259110.7661240.5342640.7688440.5258910.7676830.5339690.7673750.5152580.762043
K-NN0.4611900.7260210.4471020.7303510.4688510.7310800.4623910.7363950.4593080.7257560.4619550.738953
Log Reg0.5057280.7435710.5038350.7448270.5050910.7436230.4989400.7433740.5080960.7302820.5039130.733872

In [13]:

def plot_results(results):
    prec = results.filter(regex=("^P"))
    roc = results.filter(regex=("^ROC"))
    figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    prec.T.plot(ax=axes[0], title='Average Precisions')
    axes[0].set(xlabel='Estimator', ylabel='Average Precision')
    roc.T.plot(ax=axes[1], title='AUC-ROC scores')
    axes[1].set(xlabel='Estimator', ylabel='AUC-ROC')
    for tick in axes[0].get_xticklabels():
        tick.set_rotation(45)
    for tick in axes[1].get_xticklabels():
        tick.set_rotation(45)
    plt.show()
plot_results(results)
  • Neural network nand Random Forest are the leaders. They seem to be pretty stable, but surprisingly – they are working the best on unbalanced data.
  • SVM seems to be prone to sampling, works better on the under-sampled training set.

Binning

  • Studied here.
  • I will skip Fixed-Width Binning, since it is not so data-driven, as Adaptive Binning.

Let’s take a look again, at unique values in our data:

In [14]:

df = data.copy()
data.nunique()

Out[14]:

LIMIT_BAL         81
AGE               56
PAY_0             11
PAY_2             11
PAY_3             11
PAY_4             11
PAY_5             10
PAY_6             10
BILL_AMT1      22723
BILL_AMT2      22346
BILL_AMT3      22026
BILL_AMT4      21548
BILL_AMT5      21010
BILL_AMT6      20604
PAY_AMT1        7943
PAY_AMT2        7899
PAY_AMT3        7518
PAY_AMT4        6937
PAY_AMT5        6897
PAY_AMT6        6939
CLASS              2
SEX_1              2
SEX_2              2
EDUCATION_0        2
EDUCATION_1        2
EDUCATION_2        2
EDUCATION_3        2
EDUCATION_4        2
EDUCATION_5        2
EDUCATION_6        2
MARRIAGE_0         2
MARRIAGE_1         2
MARRIAGE_2         2
MARRIAGE_3         2
PAY_0_DULY         2
PAY_2_DULY         2
PAY_3_DULY         2
PAY_4_DULY         2
PAY_5_DULY         2
PAY_6_DULY         2
dtype: int64
  • First I will take a look at AGE feature. It is already a discrete variable, so binning (categorizing) this feature would be sm-oo-th (haha, binning in practice).
  • Let’s take a look at a histogram, describing the distribution of this feature.

In [15]:

fig, ax = plt.subplots()
# data['AGE'].hist(edgecolor='black', grid=False)
sns.distplot(data.AGE)
ax.set_title('Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

Out[15]:

Text(0, 0.5, 'Frequency')
  • An expected normal distribution is right-skewed (since not many people at age <21 deal with money and mortgages so much).
  • Surprising are those peaks around 30th, 35th, 40th and 50th year, but I will not discuss it.

We will divide age into 4 equal “bins” – quartiles, by looking at the quantiles of the data. In [16]:

quantile_list = [0, .25, .5, .75, 1.]
quantiles = data['AGE'].quantile(quantile_list)
quantiles

Out[16]:

0.00    21.0
0.25    28.0
0.50    34.0
0.75    41.0
1.00    79.0
Name: AGE, dtype: float64

Now let’s take a look, how well it slices the data:

In [17]:

def quant_hist(col):
    quantile_list = [0, .25, .5, .75, 1.]
    quantiles = data[col].quantile(quantile_list)

    fig, ax = plt.subplots(figsize=(4,3))
    sns.distplot(data[col])
    for quantile in quantiles:
        qvl = plt.axvline(quantile, color='r')
    ax.legend([qvl], ['Quantiles'], fontsize=10)
    ax.set_title(col+' Histogram with Quantiles', 
                 fontsize=12)
    ax.set_xlabel(col, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
quant_hist('AGE')

This looks nice, therefore I am going to create a new feature – an indicator of the quartile range, where given age belongs.

Note: I am going to name it in order since this categorical value is ordinal and I will treat it as any other discrete feature.

In [18]:

quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
quantile_labels = [1, 2, 3, 4]
data['AGE_Q'] = pd.qcut(data['AGE'], q=quantile_list, labels=quantile_labels)

data[['AGE', 'AGE_Q']].iloc[4:9]

Out[18]:

AGEAGE_Q
4574
5373
6292
7231
8281

It works like a charm.

How about other continuous variables?

In [19]:

quant_hist('LIMIT_BAL')

Looks also plausible for this conversion, so again I am going to create a new ordinal feature.

In [20]:

data['LIMIT_BAL_Q'] = pd.qcut(data['LIMIT_BAL'], q=quantile_list, labels=quantile_labels)
data[['LIMIT_BAL', 'LIMIT_BAL_Q']].iloc[4:9]

Out[20]:

LIMIT_BALLIMIT_BAL_Q
4500001
5500001
65000004
71000002
81400002

Each feature PAY_AMT[1-6] contains lots of zeros, which pushes the mean and median closer to zero, even after making a logarithm of the feature:

In [21]:

def quant_hist_log(col):
    figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

    log_col = np.log1p(data[col])
    log_mean = np.mean(log_col)

    log_col.hist(bins=30, ax=axes[0])
#     sns.distplot(log_col, ax=axes[0])

    axes[0].axvline(log_mean, color='r')
    axes[0].set_title(col+' Histogram after Log Transform')
    axes[0].set_xlabel(col+'(log scale)', fontsize=12)
    axes[0].set_ylabel('Frequency', fontsize=12)
    axes[0].text(11.5, 450, r'$\mu$='+str(log_mean), fontsize=10)

    quantile_list = [0, .25, .5, .75, 1.]
    quantiles = data[col].quantile(quantile_list)

    data[col].hist(bins=60)
    for quantile in quantiles:
        qvl = axes[1].axvline(quantile, color='r')
    axes[1].legend([qvl], ['Quantiles'], fontsize=10)
    axes[1].set_title(col+' Histogram with Quantiles', 
                 fontsize=12)
    axes[1].set_xlabel(col, fontsize=12)
    axes[1].set_ylabel('Frequency', fontsize=12)

In [22]:

for n in range(1, 6):
    quant_hist_log('PAY_AMT'+str(n))

Hence using quantiles would not be appropriate. Using an indicator of zero value would be appropriate.

I can imagine these solutions:

  • Take only non-zero entries and divide them into quantiles
  • Take only non-zero entries and divide it only by its mean (yess)
  • Ignore this problem and don’t use binning on these features.

In [23]:

def log_mean_division(col, verbose=False):
    log_col = np.log(data[col])
    log_col[log_col == -np.inf] = 0
    log_mean = log_col[log_col != 0].mean()

    bin_ranges = [-np.inf, 0, log_mean, log_col.max()]
    bin_names = [0, 1, 2]
    if verbose:
        display(pd.concat([data[col], pd.Series(log_col), pd.cut(log_col, bins=bin_ranges), pd.cut(log_col, bins=bin_ranges, labels=bin_names)], axis=1).head(10))
    return pd.cut(log_col, bins=bin_ranges, labels=bin_names)

col = 'PAY_AMT1'
cutt = log_mean_division(col)

# pd.concat([pd.cut(log_col, bins=bin_ranges), pd.Series(log_col)], axis=1)

cutt.dtype

Out[23]:

CategoricalDtype(categories=[0, 1, 2], ordered=True)

Looks fine. Let’s add all of these features into the data. In [24]:

for n in range(1, 6):
    col = 'PAY_AMT'+str(n)
    data['PAY_AMT'+str(n)+'_log_mean'] = log_mean_division(col)
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: divide by zero encountered in log

In [25]:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 47 columns):
LIMIT_BAL            30000 non-null int64
AGE                  30000 non-null int64
PAY_0                30000 non-null int64
PAY_2                30000 non-null int64
PAY_3                30000 non-null int64
PAY_4                30000 non-null int64
PAY_5                30000 non-null int64
PAY_6                30000 non-null int64
BILL_AMT1            30000 non-null int64
BILL_AMT2            30000 non-null int64
BILL_AMT3            30000 non-null int64
BILL_AMT4            30000 non-null int64
BILL_AMT5            30000 non-null int64
BILL_AMT6            30000 non-null int64
PAY_AMT1             30000 non-null int64
PAY_AMT2             30000 non-null int64
PAY_AMT3             30000 non-null int64
PAY_AMT4             30000 non-null int64
PAY_AMT5             30000 non-null int64
PAY_AMT6             30000 non-null int64
CLASS                30000 non-null int64
SEX_1                30000 non-null uint8
SEX_2                30000 non-null uint8
EDUCATION_0          30000 non-null uint8
EDUCATION_1          30000 non-null uint8
EDUCATION_2          30000 non-null uint8
EDUCATION_3          30000 non-null uint8
EDUCATION_4          30000 non-null uint8
EDUCATION_5          30000 non-null uint8
EDUCATION_6          30000 non-null uint8
MARRIAGE_0           30000 non-null uint8
MARRIAGE_1           30000 non-null uint8
MARRIAGE_2           30000 non-null uint8
MARRIAGE_3           30000 non-null uint8
PAY_0_DULY           30000 non-null bool
PAY_2_DULY           30000 non-null bool
PAY_3_DULY           30000 non-null bool
PAY_4_DULY           30000 non-null bool
PAY_5_DULY           30000 non-null bool
PAY_6_DULY           30000 non-null bool
AGE_Q                30000 non-null category
LIMIT_BAL_Q          30000 non-null category
PAY_AMT1_log_mean    30000 non-null category
PAY_AMT2_log_mean    30000 non-null category
PAY_AMT3_log_mean    30000 non-null category
PAY_AMT4_log_mean    30000 non-null category
PAY_AMT5_log_mean    30000 non-null category
dtypes: bool(6), category(7), int64(21), uint8(13)
memory usage: 5.6 MB
  • Unfortunately, there are negative values in BILL_AMT[1-6] feature, so I will leave it as it is.

In [26]:

np.sum(data.BILL_AMT1 <=0)

Out[26]:

2598

In [27]:

data[data == np.nan].sum()

Out[27]:

LIMIT_BAL            0.0
AGE                  0.0
PAY_0                0.0
PAY_2                0.0
PAY_3                0.0
PAY_4                0.0
PAY_5                0.0
PAY_6                0.0
BILL_AMT1            0.0
BILL_AMT2            0.0
BILL_AMT3            0.0
BILL_AMT4            0.0
BILL_AMT5            0.0
BILL_AMT6            0.0
PAY_AMT1             0.0
PAY_AMT2             0.0
PAY_AMT3             0.0
PAY_AMT4             0.0
PAY_AMT5             0.0
PAY_AMT6             0.0
CLASS                0.0
SEX_1                0.0
SEX_2                0.0
EDUCATION_0          0.0
EDUCATION_1          0.0
EDUCATION_2          0.0
EDUCATION_3          0.0
EDUCATION_4          0.0
EDUCATION_5          0.0
EDUCATION_6          0.0
MARRIAGE_0           0.0
MARRIAGE_1           0.0
MARRIAGE_2           0.0
MARRIAGE_3           0.0
PAY_0_DULY           0.0
PAY_2_DULY           0.0
PAY_3_DULY           0.0
PAY_4_DULY           0.0
PAY_5_DULY           0.0
PAY_6_DULY           0.0
AGE_Q                0.0
LIMIT_BAL_Q          0.0
PAY_AMT1_log_mean    0.0
PAY_AMT2_log_mean    0.0
PAY_AMT3_log_mean    0.0
PAY_AMT4_log_mean    0.0
PAY_AMT5_log_mean    0.0
dtype: float64

In [28]:

for col in data.select_dtypes('category').columns:
    data[col] = data[col].cat.codes

In [29]:

train_data, valid_data = train_test_split(data, test_size=.2, stratify=data.CLASS, random_state=42)

In [30]:

resamplers = [
    (None, 'unbalanced'),
    (RandomUnderSampler(random_state=42),'Random under-sampling'),
    (TomekLinks(),'Tomek Links'),
    (EditedNearestNeighbours(),'ENN'),
    (SMOTE(random_state=42),'SMOTE'),
    (SMOTEENN(random_state=42),'SMOTEENN'),
]

results_bin = pd.DataFrame()

for resampler, description in resamplers:
    X_res = train_data.copy()
    y_res = train_data.copy()
    print(X_res.shape)
    print(description)
    if resampler is not None:
        X_res = train_data.drop(columns=['CLASS'])
        y_res = train_data.CLASS
        X_res, y_res = resampler.fit_sample(X_res, y_res)

        X_res = pd.DataFrame(X_res)
        X_res['CLASS'] = y_res
        print(f"class 1: {y_res.mean()*100}%")

    print(X_res.shape)


    results_bin = pd.concat([results, models_pipeline(X_res, valid_data, data_name=description)], axis=1)
results_bin
(24000, 47)
unbalanced
(24000, 47)
(24000, 47)
Random under-sampling
class 1: 50.0%
(10618, 47)
(24000, 47)
Tomek Links
class 1: 23.53801817778763%
(22555, 47)
(24000, 47)
ENN
class 1: 34.16344916344916%
(15540, 47)
(24000, 47)
SMOTE
class 1: 50.0%
(37382, 47)
(24000, 47)
SMOTEENN
class 1: 62.541028446389504%
(21936, 47)

Out[30]:

P_unbalancedROC_unbalancedP_Random under-samplingROC_Random under-samplingP_Tomek LinksROC_Tomek LinksP_ENNROC_ENNP_SMOTEROC_SMOTEP_SMOTEENNROC_SMOTEENNP_SMOTEENNROC_SMOTEENN
name
Naive Bayes0.4658000.7293620.4606490.7300150.4625410.7293990.4478230.7296160.4598730.7129680.4362870.7139800.4345700.727383
RBF SVM0.5017450.7168720.4814240.7458800.5024950.7249960.5002650.7449560.5064850.7276040.4968960.7484760.4986750.748287
Random Forest0.5380780.7580680.5255790.7611590.5373860.7594620.5288730.7645430.5158630.7483310.5111360.7641560.5125150.764916
Neural Net0.5395160.7691510.5259110.7661240.5342640.7688440.5258910.7676830.5339690.7673750.5152580.7620430.5111570.756952
K-NN0.4611900.7260210.4471020.7303510.4688510.7310800.4623910.7363950.4593080.7257560.4619550.7389530.4535390.744142
Log Reg0.5057280.7435710.5038350.7448270.5050910.7436230.4989400.7433740.5080960.7302820.5039130.7338720.4919820.741910

In [31]:

plot_results(results_bin)
  • Binning helped SVM when SMOTEEN was used, however, this model is still not the best one in game for this dataset.
  • The neural network likes a lot of data, therefore undersampling is causing a small fall in the results.

Conclusion

  • Use binning (on features of your choice, with your choice of parameters) and comment on its effects on classification performance.
    • I used binning on three types of features. On Age and Limit Amount I used quartile binning, on Pay Amount I used just 0 indicator and 2 quantiles (divided by mean without zero).
  • Use at least 2 other preprocessing techniques (your choice!) on the data set and comment on the classification results.
    • I used PCA, Scaling and dummy variables conversion, same as creating new features as indicators for soon payment. It seems it did not have much effect on the classification results.
  • Run all classification tests at least three times – once for unbalanced original data, twice for balanced data (try at least 2 balancing techniques), compare those results (give a comment).
    • I ran classification test on binned data (pretty much no change than on non-binned), Then I ran all models with cross-validation and grid-search on unbalanced data, random under-sampled, removed Tomek-links, ENN under-sampling; SMOTE for up-sampling and SMOTEEN for combination.
    • Results do not seem to vary.

Notes: I used the same train/validation set for any result, to make it reproducible. The validation set remained unbalanced.