Predicting Customer Purchase to Improve Bank Marketing Effectiveness

Project Details


Fall 2017


Sandy Wu, Andy Hsu, Wei-Zhu Chen, Samantha Chien





A bank marketing dataset from the UCI Machine Learning Repository was adopted for this project. The dataset describes a Portuguese banking institution and contains records of direct marketing campaign phone calls; the final outcome of each campaign, indicating whether it succeeded, is included in binary format (yes/no). A successful campaign means the customer ultimately subscribed to a term deposit. Our business goal is to identify the elements of a successful campaign so that we can improve marketing effectiveness by targeting the right customers; that is, we want to find which campaign strategies and histories, combined with which kinds of customers, yield high propensities to subscribe to term deposits. By building predictive models that output subscription propensities, we can increase revenue and lower labour costs through more efficient marketing strategies, without harming customer relationships.
The dataset includes 41,188 campaign records containing demographic, credit, current/previous campaign, and social/economic attributes (a total of 19 columns per customer). Campaign attributes include contact type, number of contacts, contact duration, etc. The data mining goal is to predict the last column of the dataset, the outcome of the current campaign (yes/no). This is a predictive classification problem, and supervised models were built to solve it. Furthermore, the subscription propensities output by our predictive models can also be used to rank customers, which makes it possible to capture most of the potential revenue by running direct campaigns on only a small portion of customers (with a corresponding reduction in labour costs).
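The preparation step described above can be sketched as follows. This is a minimal illustration using a tiny made-up stand-in for the campaign records (the column values shown are hypothetical, not taken from the actual data); the real UCI file would instead be loaded with pandas and preprocessed the same way, mapping the yes/no outcome column to 0/1 and one-hot encoding the categorical predictors.

```python
import pandas as pd

# Tiny illustrative stand-in for the campaign records (hypothetical values);
# the real dataset would be read from the UCI CSV file instead.
df = pd.DataFrame({
    "age": [30, 45, 52, 38],
    "job": ["admin.", "technician", "retired", "services"],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
    "y": ["no", "yes", "yes", "no"],  # current-campaign outcome (last column)
})

# Encode the binary outcome, then one-hot encode the categorical predictors.
y = (df["y"] == "yes").astype(int)
X = pd.get_dummies(df.drop(columns="y"))
print(X.shape, int(y.sum()))
```

With the predictors in this numeric form, any of the supervised models discussed below can be trained directly on `X` and `y`.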
Naïve Bayes, logistic regression, decision tree, and random forest were included as our predictive models; a naïve rule that outputs the propensity of success by simply computing the proportion of successful records in the training set was chosen as the benchmark. Naïve Bayes models can produce robust predictions despite their simple architecture, provided the predictors are weakly correlated. Logistic regression models are simple yet have decent predictive power. Decision trees are easy to interpret and give insight into the important features; random forest, an ensemble extension of the decision tree, can produce very good and robust predictions. The models above were implemented and their performance compared using lift curves. A lift curve is constructed by accumulating the recall of true successes (identified by the final column), starting from the customers with the highest subscription propensities as output by a predictive model. If the lift curves show no clear distinction, the model with the highest sensitivity (the ability to identify true subscribers) will be chosen as the final model (preferably also the one requiring the least implementation effort). Random forest was the model finally adopted: it produced the best lift curve, and it achieved an accuracy of 78.96% with a sensitivity of 0.64. This model captures 75% of the true subscribers while contacting only the top 40% of customers ranked by subscription propensity.
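The lift-curve construction described above can be sketched as follows. This is a minimal sketch on synthetic data standing in for the campaign records (the feature values, sample sizes, and random-forest hyperparameters are assumptions for illustration, not the project's actual configuration): train a random forest, score each test customer with a subscription propensity, sort customers by that propensity, and accumulate the recall of true subscribers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign records (hypothetical features).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
# Tie the outcome loosely to the first feature so the model has signal to learn.
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Propensity of subscribing for each test customer.
p = clf.predict_proba(X_te)[:, 1]

# Lift curve: sort customers by descending propensity, then accumulate the
# recall of true subscribers as more customers are contacted.
order = np.argsort(p)[::-1]
cum_recall = np.cumsum(y_te[order]) / y_te.sum()

# Share of subscribers captured by contacting only the top 40% of customers.
top40 = cum_recall[int(0.4 * len(cum_recall)) - 1]
print(f"recall in top 40%: {top40:.2f}")
```

Plotting `cum_recall` against the fraction of customers contacted gives the lift curve itself; comparing the curves of the four models at a fixed contact budget (e.g. the top 40%) is how the final model was selected.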
Last but not least, this dataset contains many categorical columns, most of which have very low correlation with each other. If more ordinal or numerical columns were included, we could obtain better results, leading to further reductions in labour costs and more precise targeting of direct campaigns, lowering the chance of harming precious customer relationships.

Application Area: