Bank Marketing Dataset — MLP Classification

Avarjana Panditha
8 min read · Sep 1, 2022


The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). Summarized details on the dataset are as follows.

  • Data Set Characteristics: Multivariate
  • Attribute Characteristics: Real
  • Number of Instances: 41188
  • Number of Attributes: 20
  • Missing Values: No

Source: UCI Machine Learning Repository (bank-additional-full.csv)

For the complete source code (Python Notebook): Source Code

More information about the dataset attributes

Input variables:
1. age (numeric)
2. job: type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
3. marital: marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)
4. education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
5. default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
6. housing: has a housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
7. loan: has a personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
Related to the last contact of the current campaign:
8. contact: contact communication type (categorical: ‘cellular’, ‘telephone’)
9. month: last contact month of the year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10. day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes the last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)

Social and economic context attributes
16. emp.var.rate: employment variation rate — quarterly indicator (numeric)
17. cons.price.idx: consumer price index — monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index — monthly indicator (numeric)
19. euribor3m: euribor 3-month rate — daily indicator (numeric)
20. nr.employed: number of employees — quarterly indicator (numeric)

Output variable (desired target)
21. y: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)

Following are the steps that I have taken to train the Multi-Layer Perceptron (MLP) model.

  • Preprocessing numerical columns
  • Preprocessing categorical columns
  • Additional: Fixing the class imbalance using SMOTE
  • Split the dataset for training and testing
  • Identify correlations in the training data
  • Principal Component Analysis (PCA) to reduce the feature count
  • Fit the MLP and fine-tune the network
  • Model evaluation

Let’s dive deeper into the above steps.

Preprocessing numerical columns

First, we have to identify the numerical data columns. We could pick them out by hand from the dataset information above, but why bother? Just one line of code can do it for us.

numerical_df = df_original.select_dtypes(include=np.number)

This will provide us with a data frame that contains only the numerical data. When you run info() on the above data frame, it should look like this.

Numerical Data Columns

As recommended in the dataset information, I will drop the “duration” column, since it is not known before a call is made. Then I run describe() on the data frame to understand what’s going on with the data.

Numerical data summary with describe()

Since the dataset is already mostly clean, there is not much to do here. The table shows no all-zero columns or anything else unwanted. Moving on to boxplots, histograms and Q-Q plots to identify outliers and skewness in the data.

Outliers

There seems to be a single outlier value in the cons.conf.idx column. Observe the following graphs:

Cons.conf.idx

This value will not be removed, since, given the existing class imbalance in ‘y’, it can still be useful in the evaluation of the campaign.
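For readers who want to check such an outlier numerically rather than by eye, the usual boxplot rule (points beyond 1.5 times the interquartile range from the quartiles) can be sketched as below. The helper name `iqr_outliers` and the toy values are assumptions made for illustration; they are not taken from the notebook or the real column.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy values standing in for a numerical column (not the real data).
toy = pd.Series([-46.2, -42.7, -41.8, -40.0, -36.4, -36.1, -26.9])
flagged = iqr_outliers(toy)
print(flagged)
```

Whether a flagged value is dropped or kept remains a judgment call, as in the decision above.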

Skewness

After drawing the plots, it can be seen that Age, Campaign and Previous are right-skewed while the Nr.employed is left-skewed.

Age is Right Skewed
Campaign is Right Skewed
Previous is Right Skewed
Nr.employed is Left Skewed

To reduce the skewness before standardizing, we need to apply suitable transformations to the dataset. For that purpose we use x² (for the left-skewed column) and sqrt (for the right-skewed columns). We do not use logarithms because of the 0 values, and we do not use the exponential function either. After applying the transformations, the new diagrams are as follows:

Age after transformation
Campaign after transformation
Previous after transformation
Nr.employed after transformation
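A minimal sketch of the two transformations, assuming a toy data frame with one right-skewed and one left-skewed column. The column names and values are invented for the illustration; in the article the right-skewed columns are age, campaign and previous, and the left-skewed one is nr.employed.

```python
import numpy as np
import pandas as pd

# Toy columns standing in for the real ones (values are made up).
toy = pd.DataFrame({
    "right_skewed": [1, 1, 2, 2, 3, 5, 12, 30],
    "left_skewed": [1, 6, 8, 9, 9, 10, 10, 10],
})

transformed = toy.copy()
transformed["right_skewed"] = np.sqrt(toy["right_skewed"])   # compresses the long right tail
transformed["left_skewed"] = np.square(toy["left_skewed"])   # stretches the upper end

# Sample skewness before and after; values closer to 0 mean a more symmetric shape.
print(toy.skew())
print(transformed.skew())
```

Comparing the two printed skewness tables is a quick numeric check to go with the plots.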

Preprocessing categorical columns

The same method is used to identify the categorical columns in the dataset.

cat_df = df_original.select_dtypes(exclude=np.number)

Following are the categorical columns identified in the whole dataset.

Categorical Data columns

Looking at the number of unique values in each column, most of them have at least 3 unique values; the exceptions are ‘contact’, which has two according to the attribute list above, and ‘y’, which is the target column. So, we will set ‘y’ aside for now and apply One Hot Encoding to all the other columns. The encoding introduces new indicator columns, and we drop the original columns. It transforms the categorical part of the dataset from shape (41188 x 11) into (41188 x 54). Finally, we use Label Encoding to encode the ‘y’ column as 0 and 1.

df_target_bef['y'] = df_target_bef['y'].astype('category').cat.codes

Now we merge the numerical and categorical data frames to get the dataset ready for further analysis. The shape of the dataset is now (41188 x 62).
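The encoding and merge steps can be sketched end to end on a tiny stand-in frame. This is only an illustration: using `pd.get_dummies` for the one-hot step is an assumption (the article does not show which encoder was used), and the three-row frame is invented.

```python
import pandas as pd

# Tiny stand-in for the raw frame; column names follow the dataset description.
df = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "divorced"],
    "contact": ["cellular", "telephone", "cellular"],
    "y": ["no", "yes", "no"],
})

# Label-encode the target: categories are sorted, so 'no' -> 0 and 'yes' -> 1.
target = df["y"].astype("category").cat.codes.rename("y")

numerical = df.select_dtypes(include="number")
categorical = df.select_dtypes(exclude="number").drop(columns="y")

# One indicator column per category value; the original columns are not kept.
encoded = pd.get_dummies(categorical, dtype=int)

# Merge numerical and encoded categorical features side by side.
features = pd.concat([numerical, encoded], axis=1)
print(features.shape)
```

Here one numerical column plus three marital and two contact indicators give six feature columns.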

Additional: Fixing the class imbalance using SMOTE

As I mentioned earlier, there is a class imbalance in the target column of the dataset. We can fix the bias using the Synthetic Minority Over-sampling Technique (SMOTE).

Visible class imbalance in ‘y’

What we are going to do is oversample the minority class to close the class imbalance gap. After balancing the classes the shape of the dataset is as follows.

Now we can move on with the dataset knowing that the class imbalance is fixed, which should help our network classify better.
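To make the idea behind SMOTE concrete, here is a simplified NumPy sketch of its core step: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_sketch` and the toy data are assumptions written for this illustration, not the code used in the notebook; in practice the `SMOTE` class from the imbalanced-learn package, with its `fit_resample` method, is the usual choice.

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, len(X_min), size=n_new)  # random minority points
    picked = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_new)]
    gap = rng.random((n_new, 1))                    # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(90, 2))        # toy majority class
X_minor = rng.normal(3.0, 1.0, size=(10, 2))        # toy minority class

synthetic = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, synthetic])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(synthetic)))
print(np.bincount(y_balanced))
```

After resampling, both classes have the same number of rows, which is what the balanced shape reflects.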

Split the dataset for training and testing

Since the initial transformations are done with the dataset, we can split the dataset into training and testing sets. This is done at this point to avoid any data leaks that might occur after the scaling of the features.

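from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; a fixed random_state keeps the split reproducible
x_train, x_test, y_train, y_test = train_test_split(df_features, df_target, test_size=0.2, random_state=101)
# Reset the shuffled indices so the four pieces line up from 0
x_train = x_train.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)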

The numerical features of both the training and testing sets are then standardized. The one-hot encoded columns are not scaled here. The StandardScaler is fitted on the training data only and then applied to both sets.

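from sklearn.preprocessing import StandardScaler

# x_train_std / x_test_std: the numerical feature columns (one-hot columns stay unscaled)
scaler = StandardScaler()
scaler.fit(x_train_std)  # mean and standard deviation come from the training rows only
x_train_scaled = scaler.transform(x_train_std)
x_test_scaled = scaler.transform(x_test_std)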

Now we can move forward with the training dataset to identify correlations.
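One way to scale only the numerical columns while leaving the one-hot columns untouched is sketched below. The column names, the `numeric_cols` list and the toy values are assumptions for the example; the notebook may organize this step differently.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames: one numerical column and one one-hot column (made-up values).
x_train = pd.DataFrame({"age": [25.0, 35.0, 45.0], "contact_cellular": [1, 0, 1]})
x_test = pd.DataFrame({"age": [55.0], "contact_cellular": [0]})
numeric_cols = ["age"]

# Fit on the training rows only, so no test-set statistics leak into the scaler.
scaler = StandardScaler().fit(x_train[numeric_cols])

x_train_scaled = x_train.copy()
x_test_scaled = x_test.copy()
x_train_scaled[numeric_cols] = scaler.transform(x_train[numeric_cols])
x_test_scaled[numeric_cols] = scaler.transform(x_test[numeric_cols])
print(x_train_scaled)
print(x_test_scaled)
```

The test row is scaled with the training mean and standard deviation, so its value does not have to fall near zero.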

Identify correlations in the training data

I will consider the numerical variables with the target y to evaluate the correlations.

Correlation matrix

We can see that there are strong correlations between a few features (0.96, 0.87, -0.73). I will not remove any dimensions based on these observations at the moment. Instead, I will move on to Principal Component Analysis and use the principal components for the dimensionality reduction.
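Strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below uses synthetic columns and an arbitrary cut-off of 0.7; both are assumptions for illustration and not values from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "feat_a": base,
    "feat_b": base * 2.0 + rng.normal(scale=0.1, size=200),  # nearly collinear with feat_a
    "feat_c": rng.normal(size=200),                           # independent of the others
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle so every pair appears once, then filter by strength.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.7]
print(strong)
```

Each entry of `strong` is indexed by the two column names, which makes it easy to decide later which feature of a pair to drop, if any.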

Principal Component Analysis (PCA) to reduce the feature count

PCA analysis provided the following dimensions,

PCA output

From this I will take 20 components, which cover approximately 89.9% of the variance, for the training of the model.

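from sklearn.decomposition import PCA

# Learn the components from the training data only, then project both sets onto them
pca = PCA(n_components=20)
pca.fit(x_train)
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)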

Since the PCA transformation is done, we can now move on to neural network training.
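The number of components needed for a given share of the variance can be read from the cumulative explained variance ratio. The following sketch runs on synthetic data, so the resulting component count will differ from the 20 used above; the 0.899 threshold simply mirrors the share quoted in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 5 latent directions mixed into 30 observed features, plus a little noise.
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(500, 30))

pca_full = PCA().fit(X)  # keep all components to inspect the full spectrum
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.899  # share of variance to retain; any other threshold works the same way
# First position where the cumulative share reaches the target, converted to a count.
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative[n_components - 1])
```

The printed pair is the smallest number of components that reaches the target and the share of variance they actually cover.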

Fit the MLP and fine-tune the network

We are using the Multi-Layer Perceptron model for the training. After a few trial-and-error sessions, I settled on the following values. There are two hidden layers, with 10 and 3 neurons respectively. The activation function is the logistic function. The initial learning rate is set to 0.01 and the tolerance to 0.0001, with early stopping enabled.

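from sklearn.neural_network import MLPClassifier

# Two hidden layers (10 and 3 neurons), logistic activation, Adam optimizer
mlp = MLPClassifier(hidden_layer_sizes=(10, 3),
                    max_iter=500,
                    activation='logistic',
                    solver='adam',
                    verbose=True,
                    early_stopping=True,       # stop when the validation score stops improving
                    validation_fraction=0.2,   # share of the training data held out for that check
                    tol=0.0001,
                    learning_rate_init=0.01,
                    random_state=466)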
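A network configured this way is then fitted and used for prediction roughly as follows. The sketch runs on synthetic data from `make_classification`, which is an assumption made to keep the example self-contained; the variable names mirror the article, but the data do not.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 20 PCA components and the binary target.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
x_train_pca, x_test_pca, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)

mlp = MLPClassifier(hidden_layer_sizes=(10, 3), activation="logistic", solver="adam",
                    early_stopping=True, validation_fraction=0.2, tol=0.0001,
                    learning_rate_init=0.01, max_iter=500, random_state=466)
mlp.fit(x_train_pca, y_train)

y_pred = mlp.predict(x_test_pca)
# Epochs actually run, best held-out validation accuracy, and final training loss.
print(mlp.n_iter_, mlp.best_validation_score_, mlp.loss_)
```

With early stopping enabled, training usually ends well before `max_iter`, and `best_validation_score_` reports the accuracy on the held-out validation fraction.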

Model evaluation

Following is the confusion matrix after the training is done.

Confusion matrix

There is a considerable number of false negatives. This might be related to the class imbalance that was corrected earlier. Overall, the false negative and false positive counts are considerably smaller than those of the other models I have trained.

The model gave a score of 78.581%, but the recall is 69.254%, which looks troublesome at a glance. Given the dataset, I think it is fair. The training error of the model is 0.455 and the MSE (Mean Squared Error) is 0.2114.
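Metrics of this kind can be computed with scikit-learn as sketched below. The label arrays are tiny made-up examples, not the model's real predictions, and it is an assumption that the notebook used these same functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error, recall_score

# Made-up true labels and predictions for ten test rows.
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = accuracy_score(y_test, y_pred)   # (tp + tn) / all rows
recall = recall_score(y_test, y_pred)       # tp / (tp + fn)
mse = mean_squared_error(y_test, y_pred)    # for hard 0/1 predictions: share of wrong rows

print(tn, fp, fn, tp)
print(accuracy, recall, mse)
```

Recall depends only on the positive rows, so false negatives pull it down even when the overall accuracy looks acceptable.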
