Fabian Eckert
8 min read · Feb 3, 2021


This blog post is part of the Udacity Data Scientist Nanodegree Program. The analysis, with the required code, is posted here.

Segmentation of Customers for Arvato Bertelsmann

Source: Pixabay

Introduction

This project contains real-world data from Bertelsmann Arvato Analytics and aims to find new customers for the company using extensive data on the general population. It was created as part of Udacity's 'Data Scientist Nanodegree' course and is its capstone project.

The project consists of three parts:

  1. All data received was analyzed and then cleaned.
  2. We performed a customer segmentation based on unsupervised learning techniques. With our data we were able to describe the relationship between the general population and the company's customers. For the segmentation, a principal component analysis was used to reduce the data dimension, and the average within-cluster distance was used to choose the number of clusters for grouping customers and the general population.
  3. In the last part we built a prediction model based on a supervised learning approach. With train and test data we were able to predict which people could become future customers. For this we tuned the parameters of different classifiers using ROC-AUC. Finally, we uploaded our result to Kaggle to get a score.

Data

Given Files:

  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 369 features (columns).
  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891,211 persons (rows) x 366 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv, 42,962 persons (rows) x 364 features (columns), and Udacity_MAILOUT_052018_TEST.csv, 42,833 persons (rows) x 364 features (columns): Demographics data for people who were targets of a marketing campaign.

Data Analysis

In order to be able to clean the data later, it is important to first understand and look at it. For this purpose, both the general population data and the customer data were examined.
First, we want to examine how many NaN values exist in the individual columns. Therefore, we sort the columns by the number of missing values, looking at the general population data to start with. We can see that six columns have more than 40% missing values.
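This per-column check can be sketched with pandas; the toy frame below is a stand-in for the real AZDIAS data, and the 40% threshold comes from the text:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the general population data
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan, 5],
    "B": [1, 2, 3, np.nan, 5],
    "C": [1, 2, 3, 4, 5],
})

# Share of missing values per column, sorted with the worst columns first
missing_ratio = df.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns with more than 40% missing values are candidates for dropping
to_drop = missing_ratio[missing_ratio > 0.4].index.tolist()
print(to_drop)  # ['A']
```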

Missing Values in columns for general population data

This graph looks similar for the customer data.

Missing Values in columns for general customer data

A histogram helps depict the proportions. Here you can see the frequency distribution of missing values.

Frequency distribution of missing values for general population data
Frequency distribution of missing values for general customers data

The same can be determined for the rows so that rows with many missing values can be deleted later.

General Population Missing Values
Customers Missing Values

In addition, it was investigated whether some columns contain values of different types; these columns must be brought to a uniform type. Furthermore, it can be seen that the data contains some columns with categorical variables.

Data Cleaning

A simple function for cleaning the data was written so that it can be reused.
In this function we performed multiple steps:

  • drop columns where more than 40 percent of values are missing
  • drop rows where more than 25 values are missing
  • drop three columns that exist in the customer data but not in the general population data
  • drop columns with high correlation
  • clean columns with mixed types
  • drop object-type columns that have too many distinct values
  • convert categorical variables into dummy variables
  • impute NaN values with the mode
  • normalize the data
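The core of these steps can be sketched as a reusable function. This is a minimal illustration, not the full pipeline from the project: the column-specific steps (correlated columns, mixed types, the three extra customer columns) depend on the real Arvato schema, so only the generic steps are shown, with the 40%/25 thresholds taken from the list above:

```python
import numpy as np
import pandas as pd

def clean_data(df, col_thresh=0.4, row_thresh=25):
    """Sketch of the generic cleaning steps (thresholds from the text)."""
    # 1. Drop columns with more than 40% missing values
    df = df.loc[:, df.isna().mean() <= col_thresh]
    # 2. Drop rows with more than 25 missing values
    df = df[df.isna().sum(axis=1) <= row_thresh]
    # 3. Convert categorical (object) columns into dummy variables
    cat_cols = df.select_dtypes(include="object").columns
    df = pd.get_dummies(df, columns=list(cat_cols), dtype=float)
    # 4. Impute remaining NaNs with the column mode
    df = df.fillna(df.mode().iloc[0])
    # 5. Normalize: zero mean, unit variance per column
    df = (df - df.mean()) / df.std()
    return df
```

Wrapping the cleaning in one function makes it easy to apply the same treatment to the population, customer and mailout datasets.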

As a result, the population dataset was reduced from 364 to 282 columns and from 891,211 to 750,582 rows.
Accordingly, the customer dataset was reduced from 367 to 309 columns and from 191,652 to 135,043 rows.

Customer Segmentation

Principal component analysis (PCA)

With the help of PCA from the scikit-learn package we were able to reduce the dimensionality of the data. Using the following plot of the explained variance, we were able to reduce the number of components while keeping the directions with high variance.

Here we can see the curve flatten at around 200 components, so we keep the first 200 components for the following steps.
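The same idea can be sketched with scikit-learn's PCA on random stand-in data; instead of eyeballing the plot, a cumulative-variance cutoff (the 90% below is an illustrative assumption, not the project's value) picks the component count:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # stand-in for the cleaned, normalized data

# Fit PCA with all components to inspect the explained variance
pca = PCA()
pca.fit(X)

# Cumulative explained variance; pick the smallest k covering e.g. 90%
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cum_var >= 0.9)) + 1
print(f"{k} components explain {cum_var[k - 1]:.1%} of the variance")

# Re-fit with the chosen number of components and transform the data
X_reduced = PCA(n_components=k).fit_transform(X)
```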

Create Cluster

For clustering we used k-means from the scikit-learn library. We used the elbow method with a maximum of 20 clusters as a boundary to generate a diagram of the average within-cluster distance over the cluster count.

This diagram shows that the curve flattens out from 11 clusters onwards, so we used 11 clusters for the segmentation.
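The elbow-method loop can be sketched as follows; synthetic blob data stands in for the PCA-reduced features, and a smaller maximum of 10 clusters keeps the toy example fast:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a known cluster structure (stand-in for the PCA output)
X, _ = make_blobs(n_samples=600, centers=4, random_state=42)

# Average within-cluster distance (inertia / n_samples) for k = 1..10
scores = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores.append(km.inertia_ / len(X))

# Plotting scores over k shows the "elbow" where the curve flattens;
# for this toy data the drop after k=4 is small
print([round(s, 1) for s in scores])
```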

The k-means model was fitted on the general population data. After that, it was used to predict cluster assignments for both the population data and the customer data.
Comparing both cluster distributions, we get this result:

We learn that cluster 8 is overrepresented among customers, while clusters 1 and 3 are underrepresented. So the difference between customers and the general population is largest in cluster 8 and lowest in cluster 1.
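The over/underrepresentation comparison boils down to the difference in cluster shares between the two datasets; here is a small sketch with hypothetical labels for three clusters:

```python
import numpy as np
import pandas as pd

# Hypothetical predicted cluster labels for both datasets
pop_labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
cust_labels = np.array([0] * 20 + [1] * 10 + [2] * 70)

# Share of each cluster within its dataset
pop_share = pd.Series(pop_labels).value_counts(normalize=True).sort_index()
cust_share = pd.Series(cust_labels).value_counts(normalize=True).sort_index()

# Positive difference: cluster is overrepresented among customers;
# negative difference: underrepresented
diff = (cust_share - pop_share).sort_values(ascending=False)
print(diff)
```

In this toy setup cluster 2 is overrepresented and cluster 0 underrepresented; in the project the same comparison singled out clusters 8, 1 and 3.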

Metrics

For unsupervised learning it is hard to define unambiguous metrics. When building clusters, it is important that items in different clusters differ. In this case, that means the difference between the cluster distributions of the general population and the customers, from which the over- and underrepresented clusters are determined.

Supervised Learning

In this part, an approach based on supervised learning techniques is used. This model helps to predict which people should be selected for the marketing campaign.
For this purpose, a training data set and a test data set were provided. The training dataset, called "MAILOUT_TRAIN", contains around 43,000 individuals, but only 532 people responded, so there is an imbalance issue. That led to ROC-AUC scoring. The cleaning function was used once more for the training data set, which was then split into a training and a testing portion. For the model, different classifiers were compared. We built a pipeline in which we tested a Random Forest Classifier, a Gradient Boosting Regressor and Logistic Regression. Logistic Regression provided the worst results, so only the first two remained. For Random Forest we tested the number of estimators in the range of 10 to 100 and the maximum number of features from 1 to 3. For the Gradient Boosting Regressor we tested the number of estimators from 50 to 200 and the minimum samples split from 2 to 4.

After a long time of fine tuning, the best roc_auc score is 0.76093. This was achieved with the GradientBoostingRegressor using min_samples_split=4 and n_estimators=50.

Metrics

Here we have a class imbalance issue because the number of positive responses is much smaller than the number of non-responses. ROC-AUC is robust to class imbalance, so we use it instead of e.g. the accuracy score.

Refinement

The Gradient Boosting and Random Forest models were tuned by hyperparameter optimization. At the beginning, Logistic Regression with different penalties was also tried, but it delivered a worse score of 0.624. So I decided to choose the Gradient Boosting and Random Forest models for optimization. After testing different parameters, I arrived at the parameters mentioned above.

Kaggle Competition

The last part of this project was uploading the submission to get a score for the model. The built model achieved a score of 0.79006.

Model Evaluation and Validation

Doing k-fold cross-validation with 5 splits with the tuned classifier and its hyperparameters, we can see that validation performance is stable and doesn't fluctuate much. This allows the assumption that the model is robust against small perturbations in the training data. The mean score is 0.759, and across the folds we get 0.748, 0.757, 0.750, 0.736 and 0.807.
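This validation step can be sketched with `cross_val_score`; again toy data stands in for the real mailout set, and a GradientBoostingClassifier with the tuned parameters from above is used for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced toy data standing in for the cleaned mailout training set
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95], random_state=0)

# Classifier with the tuned hyperparameters reported above
clf = GradientBoostingClassifier(n_estimators=50, min_samples_split=4,
                                 random_state=0)

# 5-fold cross-validated ROC-AUC; low spread suggests a robust model
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(scores.round(3), round(scores.mean(), 3))
```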

Justification

Tuning hyperparameters with GridSearchCV takes a long time if your hardware is not designed for such tasks. Because of this, not many classifiers with many hyperparameters were tuned in one pass. Instead, I tested different classifiers with different hyperparameters, so that in the end I can say this is the best result I reached. The Kaggle score is also a good result compared to other scores.

Conclusion

For this project, it was first necessary to analyze the data in order to be able to clean it afterwards. This was a very time-consuming process, as the PCA failed again and again due to gaps in the data, which led to errors. In addition, the amount of data was very large, so computation took an extremely long time. A large part of the data cleaning was understanding and finding all errors and missing values. It was also hard to figure out which methods made the most sense for working with the data afterwards.
After cleaning the data, a customer segmentation was performed, for which a k-means model was trained.
With this model, clusters of population data and customer data could be created and compared to each other. To deal better with the large data set, PCA helped to reduce the dimensions. In this case, it is very subjective to decide how much data to keep based on the variance.
The same applies to the choice of the number of clusters. In the end, we were able to identify the people from the general population dataset who could become part of the mail-order company's main customer base.

For the supervised learning model, a score close to 0.8 could be achieved with the selected model based on the train and test data.
To further improve the result, more models with different fine-tuning could be tried. But this is always a question of time and computer hardware.

Finally, I want to thank Arvato Bertelsmann and Udacity for the data and the interesting project. I also want to thank Kaggle for the possibility to score the model.
