Bank Marketing Dataset — SVM Classification

Avarjana Panditha
7 min read · May 16, 2023


The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). Summarized details of the dataset are as follows:

  • Data Set Characteristics: Multivariate
  • Attribute Characteristics: Real
  • Number of Instances: 41188
  • Number of Attributes: 20
  • Missing Values: No

Source: UCI Machine Learning Repository (bank-additional-full.csv)

For the complete source code (Python Notebook): Source Code

More information about the dataset attributes

Input variables:
1. age (numeric)
2. job: type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
3. marital: marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)
4. education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
5. default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
6. housing: has a housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
7. loan: has a personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
Related to the last contact of the current campaign:
8. contact: contact communication type (categorical: ‘cellular’, ‘telephone’)
9. month: last contact month of the year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10. day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes the last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)

Social and economic context attributes
16. emp.var.rate: employment variation rate — quarterly indicator (numeric)
17. cons.price.idx: consumer price index — monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index — monthly indicator (numeric)
19. euribor3m: euribor 3-month rate — daily indicator (numeric)
20. nr.employed: number of employees — quarterly indicator (numeric)

Output variable (desired target)
21. y: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)

The following are the steps I took to train the model.

  • Preprocessing numerical columns
  • Preprocessing categorical columns
  • Identify correlations in the training data
  • Split the dataset for training and testing
  • Additional: Fixing the class imbalance using SMOTE
  • Principal Component Analysis (PCA) to reduce the feature count
  • Fit the SVM and fine-tune
  • Model evaluation

Let’s dive deeper into the above steps.

Preprocessing numerical columns

First, we need to identify the numerical columns. We could filter them by hand using the dataset information above, but a single line of code does it for us.

numerical_df = df_original.select_dtypes(include=np.number)

This will provide us with a data frame that contains only the numerical data. When you run info() on the above data frame, it should look like this.

Numerical Data Columns

As recommended in the dataset information, I drop the “duration” column, since it is not known before a call is made. Running describe() on the data frame then gives a summary of the data.

Numerical data summary with describe()

Since the dataset is already mostly clean, there is not much to do here: the summary table shows no all-zero columns or other unwanted values. Next, we use boxplots, histograms and Q-Q plots to identify outliers and skewness in the data.

Outliers

There seems to be a single outlier value in the Cons.conf.idx column, as the following graphs show.

Cons.conf.idx

This value is not removed, since it may still be useful in evaluating the campaign given the existing class imbalance in ‘y’.
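As a rough numerical companion to the boxplot, the usual whisker rule (values more than 1.5 × IQR beyond the quartiles) can be checked directly. This is only a sketch: the helper name `iqr_outliers` and the toy values are assumptions for illustration, not the article's code or data.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Return the values that fall outside the boxplot whiskers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]


# Toy values (not taken from the bank dataset): nine ordinary points and one extreme one.
toy_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
outliers = iqr_outliers(toy_values)
print(outliers.tolist())
```

On the real data, the same helper could be applied column by column to the numerical data frame to list candidate outliers before deciding whether to keep them.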

Skewness

After drawing the plots, it can be seen that Age, Campaign and Previous are right-skewed while the Nr.employed is left-skewed.

Age is Right Skewed
Campaign is Right Skewed
Nr.employed is Left Skewed

To reduce the skewness before standardizing, we need to apply suitable transformations to the dataset. For that purpose we use x² for the left-skewed column and the square root for the right-skewed columns. We avoid the logarithm because the data contains 0 values, and for the same reason we do not use the exponential function either. After applying the transformations, the new plots are as follows:

Age after transformation
Campaign after transformation
Previous after transformation
Nr.employed after transformation
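The effect of these two transformations can also be checked numerically with the sample skewness. The sketch below uses synthetic columns (the names `right_skewed` and `left_skewed` and the generated data are assumptions, not the bank data) to show the square root pulling in a right tail and squaring pulling in a left tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=5000)

# Synthetic stand-ins: u**2 has a long right tail, sqrt(u) has a long left tail.
toy = pd.DataFrame({"right_skewed": u ** 2, "left_skewed": np.sqrt(u)})

transformed = pd.DataFrame({
    "right_skewed": np.sqrt(toy["right_skewed"]),  # sqrt for right-skewed data
    "left_skewed": toy["left_skewed"] ** 2,        # x^2 for left-skewed data
})

skew_before = toy.skew()
skew_after = transformed.skew()
print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

A skewness close to 0 after the transformation indicates a more symmetric distribution; on real columns the improvement is usually less clean than in this constructed example.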

Preprocessing categorical columns

The same method is used to identify the categorical columns in the dataset.

cat_df = df_original.select_dtypes(exclude=np.number)

Following are the categorical columns identified in the whole dataset.

Categorical Data columns

Looking at the number of unique values in each column, we can see that most of these columns have 3 or more unique values (the target column ‘y’ has only two). So, we use Label Encoding for all the other columns.

cat_columns = cat_df.columns
for col in cat_columns:
    cat_df[col] = cat_df[col].astype('category').cat.codes

The ‘pdays’ column is converted into a binary flag: 0 if the client was contacted previously (pdays < 999) and 1 if not.

numerical_df['pdays'] = np.where(numerical_df['pdays'] < 999, 0, 1).tolist()

Now we merge the numerical and categorical data frames to get the dataset ready for further analysis. The shape of the dataset is now (41188 x 62).
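To see the encoding and merge steps end to end, here is a small sketch on a made-up three-row frame. The rows and the variable names `num`, `cat` and `merged` are assumptions for illustration; the column-wise merge with `pd.concat` is one reasonable way to do it, not necessarily what the notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the structure of the bank data.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "pdays": [999, 3, 999],
    "job": ["services", "admin.", "retired"],
    "y": ["no", "yes", "no"],
})

num = toy.select_dtypes(include=np.number).copy()
cat = toy.select_dtypes(exclude=np.number).copy()

# 0 = contacted previously, 1 = not contacted (999 is the "never contacted" marker).
num["pdays"] = np.where(num["pdays"] < 999, 0, 1)

# Label encoding: categories are sorted, then replaced by their integer position.
for col in cat.columns:
    cat[col] = cat[col].astype("category").cat.codes

# Column-wise merge; both frames share the same row index.
merged = pd.concat([num, cat], axis=1)
print(merged)
```

Note that label encoding assigns codes in sorted category order, so the integers carry no meaningful ranking for nominal columns such as job.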

Identify correlations in the training data

I will consider the numerical variables with the target y to evaluate the correlations.

Correlation matrix

We can see strong correlations between a few features (0.96, 0.87, -0.73). Most of the dimensionality reduction is left to Principal Component Analysis later on, so I will not remove features based on these observations, with one exception: we drop the two columns ‘euribor3m’ and ‘nr_employed’ because of their high correlation with other features.

df_features = df_features.drop(['euribor3m','nr_employed'], axis=1)
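Reading strongly correlated pairs off a heatmap is easy to get wrong, so it can help to list them programmatically. The following is a minimal sketch on synthetic data; the helper `correlated_pairs`, the 0.8 threshold and the `feature_*` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8):
    """List column pairs whose absolute Pearson correlation reaches the threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            r = corr.loc[first, second]
            if abs(r) >= threshold:
                pairs.append((first, second, round(float(r), 2)))
    return pairs


rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "feature_a": base,
    "feature_b": base + 0.1 * rng.normal(size=300),  # nearly a copy of feature_a
    "feature_c": rng.normal(size=300),               # independent noise
})

pairs = correlated_pairs(toy)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal, which is the same reasoning applied to the dropped columns above.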

Split the dataset for training and testing

Now that the initial transformations are done, we can split the dataset into training and testing sets. The split is done at this point, before scaling, so that no information from the test set leaks into the scaled training features.

x_train, x_test, y_train, y_test = train_test_split(df_features, df_target, test_size = 0.2, random_state = 101)
x_train=x_train.reset_index(drop=True)
x_test=x_test.reset_index(drop=True)
y_train=y_train.reset_index(drop=True)
y_test=y_test.reset_index(drop=True)
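The leakage concern comes down to fitting the scaler on the training portion only and reusing its statistics for the test portion. A minimal sketch, assuming scikit-learn's `StandardScaler` and synthetic data (the `*_toy` and `*_scaled` names are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(101)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101
)

# Fit on the training split only, so test-set statistics never influence the scaling.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(6), X_te_scaled.mean(axis=0).round(3))
```

The scaled test means are close to, but not exactly, zero, which is expected: the test set is treated as unseen data.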

Additional: Fixing the class imbalance using SMOTE

As I mentioned earlier, there is a class imbalance in the target column of the dataset. We can reduce the resulting bias using the Synthetic Minority Over-sampling Technique (SMOTE).

Visible class imbalance in ‘y’

What we are going to do is oversample the minority class to close the class imbalance gap. After balancing the classes, the shape of the dataset is as follows.

Now we can move on with the dataset knowing that the class imbalance is fixed, which should give better classification from our model.
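The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea on toy data, not the implementation used in the notebook; the function `smote_sketch` and all variable names are assumptions. In practice the imbalanced-learn library provides a `SMOTE` class with a `fit_resample` method, and oversampling is normally applied to the training split only.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared distances (fine for small toy data, too costly for large sets).
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                     # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(90, 2))
X_minority = rng.normal(3.0, 0.5, size=(10, 2))

X_synthetic = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority))
X_balanced = np.vstack([X_majority, X_minority, X_synthetic])
y_balanced = np.array([0] * len(X_majority) + [1] * (len(X_minority) + len(X_synthetic)))
print(X_balanced.shape, np.bincount(y_balanced))
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.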

Principal Component Analysis (PCA) to reduce the feature count

PCA produced the following components:

PCA output

From this, I take the first 6 components, which together explain approximately 94.1% of the variance, for training the model.

pca.explained_variance_ratio_[:6].sum()
0.9411865921962747
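Rather than reading the component count off a table, it can be derived from the cumulative explained variance. The sketch below uses scikit-learn's `PCA` on synthetic data; the 0.94 threshold simply mirrors the figure above, and the `*_toy` names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 observed features driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_train_toy = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Fit with all components first to inspect the variance profile.
pca_toy = PCA().fit(X_train_toy)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
n_components = int(np.argmax(cumulative >= 0.94) + 1)

reducer = PCA(n_components=n_components).fit(X_train_toy)
X_train_reduced = reducer.transform(X_train_toy)
print(n_components, cumulative[:n_components].round(3))
```

The fitted `reducer` would then be applied with `transform` to the test features as well, so that both splits live in the same component space.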

Since the PCA transformation is done, we can now move on to training.

Fit the SVM and fine-tune

After some trial and error with the values of C and gamma, I was able to get a reasonable level of classification accuracy with C=1.2 and gamma=0.3. Here we use the RBF kernel rather than the linear kernel.

svc = svm.SVC(kernel='rbf', C=1.2, gamma=0.3).fit(x_train_pca, y_train.values.ravel())
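The values above were found by trial and error. One way to make such a search systematic is a cross-validated grid search; the sketch below uses scikit-learn's `GridSearchCV` on synthetic data. The grid values, the F1 scoring choice and the `*_toy` names are assumptions for illustration, not the settings used in the article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the PCA-reduced training data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Hypothetical candidate values; a real search would be guided by the data.
param_grid = {"C": [0.5, 1.2, 5.0], "gamma": [0.03, 0.3]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=3)
search.fit(X_toy, y_toy)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy is a deliberate choice here, since accuracy alone can look good on imbalanced targets even when the minority class is poorly predicted.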

Model evaluation

Following is the confusion matrix after the training is done.

Confusion matrix

A considerable number of false negatives occurred. This might be due to the class imbalance that was fixed in the dataset. Overall, the false negative and false positive counts are considerably smaller than those of the other models I have trained.

The model gave a score of 81.08%, with the following values for precision and recall:

Model Evaluation

The obtained values are reasonable given the imbalanced classes; precision, for example, is 31%.
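A high overall score alongside low precision is typical for imbalanced targets, and a small worked example makes the relationship concrete. The labels below are made up for illustration (they are not the model's actual predictions), and the sketch assumes the reported score is plain accuracy, which is what `SVC.score` returns.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 90 negatives and 10 positives.
y_true_toy = np.array([0] * 90 + [1] * 10)
# 75 true negatives, 15 false positives, 7 true positives, 3 false negatives.
y_pred_toy = np.array([0] * 75 + [1] * 15 + [1] * 7 + [0] * 3)

cm = confusion_matrix(y_true_toy, y_pred_toy)  # rows: actual, columns: predicted
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true_toy, y_pred_toy)    # (tp + tn) / total
precision = precision_score(y_true_toy, y_pred_toy)  # tp / (tp + fp)
recall = recall_score(y_true_toy, y_pred_toy)        # tp / (tp + fn)

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With only 10 positives among 100 samples, 15 false positives are enough to pull precision down to about a third while accuracy stays above 80%.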
