What would you do to keep your business alive in the era of Big Data?

  1. Data cleaning
  2. Exploration and Data Analysis
  3. Feature engineering
  4. Modeling
  5. Model Evaluation and Validation
  6. Justifications
  7. Improvements
  8. Summary

Data Cleaning

  • the feature type: it helps to differentiate numerical data from the rest. Depending on what we have, we get an idea of our limits, but we can also draw a strategy for data wrangling. For instance, string features cannot be used directly by Machine Learning algorithms, so they need extra effort to be transformed into numerical ones.
  • the cardinality: it shows the order of precedence between different features. For instance, there are more sessionIds than userIds, so we can assume that one user can open many sessions. There are fewer locations than users, so a location can cover a large area holding many users…
    Also, we can spot string features that can be encoded when their cardinality is relatively low. Encoding is the first step towards the numerical transformation of data.
  • the number of missing values: it indicates which features are missing data and in what proportions. Too much missing data can make a feature useless. Also, when features are missing values synchronously and consistently, we can understand the reason why.
  • The largest group affects the “artist” name, the “song” name, and the “length” of the song. All these features revolve around the music being played. It happens because users leave the audio player to visit other pages like “Settings” or “Help”. In this case, the missing data are legitimate.
  • The second group affects the “userId”, the “gender”, the “location”, the “lastName”, the “firstName”, and the “userAgent”. This time, all these features gravitate around user information. It happens when users visit pages like “Login” or “Logged Out”. As they come in and out, their activities are still tracked but their identity is not. We do not know who they are, so we cannot deliver any personalized prediction for them. In this case, the missing data are out of scope: we drop these rows.
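The cleaning rule above can be sketched in a few lines. Plain Python dictionaries stand in for log rows here, and the field names (`userId`, `page`, `artist`) follow the dataset described above; in the actual Spark pipeline this would be a DataFrame filter:

```python
# Sample events: anonymous visits carry an empty userId and cannot support
# personalized predictions, so we drop them. Missing song fields on pages
# like "Help" are legitimate and kept.
events = [
    {"userId": "42", "page": "NextSong", "artist": "Muse"},
    {"userId": "",   "page": "Login",    "artist": None},   # anonymous: drop
    {"userId": "42", "page": "Help",     "artist": None},   # legit missing song info
]

def drop_anonymous(rows):
    """Keep only events tied to an identified user."""
    return [r for r in rows if r.get("userId")]

cleaned = drop_anonymous(events)
```

In PySpark, the equivalent would be roughly `df.filter((df.userId.isNotNull()) & (df.userId != ""))`.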

Exploration and Data Analysis

  • genders: male/female. The bar plot below outlines the total number of users of each gender and the proportion of positive (churned) users among them. The churn rate is 23% among female users and 22% among male users. 44% of the users on the platform are female, against 56% male.
  • level of payment: paid/free. The bar plot below outlines the total number of users at each level of payment and the proportion of positive users among them. The churn rate is 19% among users streaming for free and 23% among those who have paid at least once. 28% of users never paid; the other 72% paid at least once.
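Per-group churn rates like the ones above boil down to a grouped ratio. A minimal sketch with a toy user table (the real figures come from the Sparkify dataset; `churned` marks positive users):

```python
from collections import defaultdict

def churn_rate_by(users, key):
    """Share of churned ('positive') users within each group, e.g. gender or level."""
    totals, churned = defaultdict(int), defaultdict(int)
    for u in users:
        totals[u[key]] += 1
        churned[u[key]] += u["churned"]
    return {group: churned[group] / totals[group] for group in totals}

# Toy table: two female users (one churned), two male users (none churned).
users = [
    {"gender": "F", "churned": 1},
    {"gender": "F", "churned": 0},
    {"gender": "M", "churned": 0},
    {"gender": "M", "churned": 0},
]
rates = churn_rate_by(users, "gender")
```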
  1. the week number when the user first became active on the platform. On a time scale, it corresponds to the week number of the minimum recorded timestamp. It varies between week 44 and week 48 (of the year 2018).
  2. the week number when the user left the platform. On a time scale, it corresponds to the week number of the maximum recorded timestamp. It varies between week 44 and week 48 (of the year 2018).
  3. the weekday number when the user churned. On a time scale, it corresponds to the weekday number of the maximum recorded timestamp. It varies between 1 (Monday) and 7 (Sunday).
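Extracting those week and weekday numbers from a raw timestamp is a one-liner with ISO calendar arithmetic. A sketch, assuming the log timestamps are in milliseconds since the epoch, as is common in event logs:

```python
from datetime import datetime, timezone

def week_and_weekday(ts_ms):
    """ISO week number and ISO weekday (1=Monday .. 7=Sunday) of a ms timestamp."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    iso = dt.isocalendar()
    return iso[1], iso[2]  # (week, weekday)

# 2018-11-01 00:00:00 UTC falls in ISO week 44, on a Thursday (weekday 4).
week, weekday = week_and_weekday(1541030400000)
```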

Feature engineering


  • Gradient Boosted Trees (GBT): a classification and regression method using ensembles of weak decision trees.
    The best parameters are: maxIter=25, maxDepth=10.
  • Naive Bayes (NB): a classifier based on Bayes’ theorem with strong (naive) independence assumptions between every pair of features.
    The best parameters are: smoothing=0.5, modelType=gaussian.
  • Linear Support Vector Machine (LSVC): it constructs a hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space for separating negative from positive data.
    The best parameters are: maxIter=15, tol=0.01, regParam=0.1.
  • Multi-Layer Perceptron (MLP): a classifier based on the feedforward artificial neural network. An MLP consists of multiple layers of nodes, each layer fully connected to the next one in the network.
    The best parameters are: layers=[81, 40, 20, 2], maxIter=50, solver=l-bfgs.
  • Accuracy is simply the number of correctly classified examples divided by the total number of examples. It can be useful but does not take into account class imbalance, or the differing costs of false negatives and false positives.
  • Precision is the fraction of true positive examples among the examples that the model classified as positive. In other words, the number of true positives divided by the number of false positives plus true positives.
  • Recall is the fraction of the positive examples that the model classifies as positive. In other words, the number of true positives divided by the number of true positives plus false negatives.
  • F-score is a way of combining the precision and recall of the model. It is defined as the harmonic mean of the model’s precision and recall.
  • AUC can be interpreted as the probability that the model ranks a random positive example more highly than a random negative example. It is quite attractive when dealing with imbalanced data.
  • In the case of targeted advertisement, recall ensures that we reach all our targets, but it does not care whether we also reach others; that is the role of precision. Here, recall is more important because advertisement is not very costly and cannot harm happy users, unless they get annoyed.
  • In the case of discounts, precision is more important. It is not in the company’s interest to grant a discount to just anyone, so we should be sure that the users being offered the discount are the right ones; we risk losing money otherwise. Precision tolerates missing some positive users, as long as each offer we make points to the right users.
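All four point metrics above derive from the same four confusion-matrix counts. A minimal sketch with illustrative counts (the numbers are toy values, not the project's results):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from raw confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)          # of the predicted churners, how many are real
    recall = tp / (tp + fn)             # of the real churners, how many we caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Imbalanced toy counts: few churners, many loyal users.
acc, prec, rec, f1 = classification_metrics(tp=8, fp=2, tn=88, fn=2)
```

Note how accuracy (0.96) looks far better than precision, recall and F1 (all 0.8): exactly the imbalance effect discussed above.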

Model Evaluation and Validation

  • There should be 0s on the anti-diagonal [↗] as it represents the false-positive (FP) and false-negative (FN) predictions. Ideally, these cells should be completely dark.
  • The cells on the main diagonal [↘] represent the true negatives (TN) and true positives (TP). We want them to be brighter. However, the top-left corner will always be much brighter because of the imbalanced nature of our data.
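The matrix being read above can be built in a few lines of plain Python (a sketch; the project itself computes it from Spark predictions, and the row/column orientation here is one common convention):

```python
def confusion_matrix(y_true, y_pred):
    """2x2 matrix [[TN, FP], [FN, TP]] for binary labels (1 = churn)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1  # row = true label, column = predicted label
    return m

m = confusion_matrix([0, 0, 1, 1, 0], [0, 1, 1, 0, 0])
# m[0][0]=TN, m[0][1]=FP, m[1][0]=FN, m[1][1]=TP
```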


  • What is the impact of the OS on the churn rate?
  • What is the impact of the user´s location on the churn rate?
  • What is the Sparkify activity over time?
  • What is the impact of gender or the level of payment on the churn rate?
  • When do users churn their subscriptions?


  • The data exploration part is not exhaustive; there are many more questions to be answered. The more questions we ask, the more potential features we find for Machine Learning refinement.
  • We could imagine building a matrix counting pairwise interactions between users and artists. This could be the first step for building a recommendation engine based on collaborative filtering. We could take the Singular Value Decomposition (SVD) of that matrix and retrieve the latent vectors. Assuming that latent vectors represent music types, we could calculate music type scores for artists and use them as new features. Depending on the preferences of positive users, we can identify some gaps. It would mean that we need more content of that type.
  • We have decided to make real-time predictions, but we could also have used aggregated information from the past, like average session duration or average duration between sessions…
  • The company could also decide to review its data collection model according to its goals. There is still raw data out there that could be collected; the age of the users, for instance, could be a great asset.
  • We could consider using strategies for handling imbalanced datasets like bootstrapping or stratified k-fold cross-validation.
  • Our model detects suspicious events on the platform. It does not yet directly tell whether a user is at risk. We could think of a higher-level logic, or a new model, deciding when suspicious events have to be taken seriously enough to trigger action.
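Among the imbalance-handling strategies mentioned above, stratified k-fold cross-validation is easy to illustrate: each fold must preserve the overall class ratio. A minimal sketch of the fold assignment (plain Python; in practice a library implementation would be used):

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each example to one of k folds so every fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)                 # randomize within each class
        for i, idx in enumerate(indices):
            folds[idx] = i % k               # deal indices round-robin over folds
    return folds

labels = [1] * 10 + [0] * 40                 # 20% positives, like an imbalanced churn set
folds = stratified_folds(labels, k=5)
```

Every fold ends up with exactly 2 positives and 8 negatives, so each validation split sees the same 20% churn rate as the full dataset.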


  • We began this journey with data cleaning. We spotted a large chunk of data where users were going missing.
  • Data exploration was certainly the most challenging part. It was difficult to grasp dependencies in the data as they keep changing over time. Also, the period of observation was too short to characterize positive users, as their number keeps decreasing over time.
  • We used our previous investigation of the data to select interesting raw features. We designed a standalone featurization pipeline for transforming those raw data into Machine Learning material.
  • We built 4 Machine Learning models on top of the featurization pipeline and picked the best one. After a discussion of performance metrics and evaluation outcomes, the Gradient Boosted Trees model wins the prize. It identifies users who are about to churn, and hence gives us the possibility to act beforehand and avoid losing customers.


