These days, many services run on a subscription business model: customers subscribe to a company’s service and pay monthly or annually. Some of those customers are dissatisfied with the service for various reasons (including the UI) and cancel their subscription. We usually call this ‘churn’, and many companies watch the churn rate of their service closely.
I recently analyzed a realistic dataset related to a subscription service in Udacity (Sparkify, the name of the capstone project). In this story, I’ll show what I analyzed and which insights can be drawn from this data. Related links are below.
Part 1. Glimpse Our Data
A. What Does the Data Look Like?
Let’s look at our data first. It lives in the `mini_sparkify_event_data.json` file. Since the full dataset is too big to handle, we work with a fraction of it.
We can see which features are in the data with the `printSchema()` method. There are some demographic columns like `gender` and `location`, but most of the columns are related to each user’s activity in our service. At first glance, the feature named `page` seems to have a major effect on whether a user stays or churns. The data consists of 286,500 rows.
We saw that 8,346 of the 286,500 rows have an empty `userId` value. These are probably the trajectories of guests, so I simply filtered them out. That leaves 278,154 rows to work with.
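In the notebook this is a one-line Spark filter; a minimal plain-Python sketch of the same idea, with a few made-up event rows standing in for the real DataFrame, looks like this:

```python
# Hypothetical event rows; in the real project these come from
# mini_sparkify_event_data.json loaded into a Spark DataFrame.
events = [
    {"userId": "30", "page": "NextSong"},
    {"userId": "",   "page": "Home"},      # guest activity: empty userId
    {"userId": "9",  "page": "Thumbs Up"},
]

# Keep only rows with a non-empty userId, mirroring
# df.filter(df.userId != "") in PySpark.
logged_in = [row for row in events if row["userId"] != ""]

print(len(events), len(logged_in))  # 3 2
```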
I also looked into the features `auth`, `gender`, `level`, `method`, and `page`. As I expected, `page` carries the most varied information among them. The values containing the word ‘Cancel’ are directly related to churn, so I guess they will have to be filtered out in the modeling procedure.
There are 225 unique users in this dataset after filtering out the empty `userId` values. Broken down by `level`, the user counts are:
- paid: 165
- free: 195
- both paid and free: 135
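These three numbers are consistent with the 225 unique users by inclusion–exclusion: users seen as paid, plus users seen as free, minus those seen as both. A one-liner to check:

```python
# Counts from the dataset: users seen at each level, and the overlap.
paid, free, both = 165, 195, 135

# Inclusion-exclusion: |paid ∪ free| = |paid| + |free| - |paid ∩ free|
unique_users = paid + free - both

print(unique_users)  # 225
```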
Each user’s logs look like the above. We can pick the columns we want with the `.select()` method. The user whose `userId` is 30 added the song ‘Passengers’ to a playlist and kept listening to music. With this kind of data, we could even analyze the most beloved songs using each user’s playtime per song.
B. Define Churn
To predict churn with this data, we need to define a new target column. I named it `churn` and defined it as follows: if a specific user ever visited the page named ‘Cancellation Confirmation’, we treat that user as churned. We made a list of those users and added a binary column to our data.
We can see that there are 52 churns among the 225 unique users.
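A plain-Python sketch of this labeling (the notebook does it on the Spark DataFrame; the user IDs and rows below are made up):

```python
# Hypothetical event rows standing in for the Spark DataFrame.
events = [
    {"userId": "30", "page": "NextSong"},
    {"userId": "54", "page": "Cancellation Confirmation"},
    {"userId": "30", "page": "Thumbs Up"},
    {"userId": "54", "page": "Home"},
]

# Any user who ever hit 'Cancellation Confirmation' is labeled a churn.
churned = {r["userId"] for r in events
           if r["page"] == "Cancellation Confirmation"}

# Attach the binary target to every row of that user.
for r in events:
    r["churn"] = 1 if r["userId"] in churned else 0

print(churned)  # {'54'}
```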
Part 2. Exploratory Data Analysis
To predict the target, we should analyze our dataset further. This process is usually called EDA, Exploratory Data Analysis. Through EDA we’ll try to find the features related to our target.
To do that, let’s group each feature by the binary `churn` value. If there’s a clear difference between the two groups, we can say the feature affects our target (I don’t cover every feature here).
Artist
I counted the number of artists and distinct artists played by each user and compared the averages. There is a difference between the two groups: users who stayed tend to listen to a wider variety of artists than churned users.
Auth
From the above, this feature has two values, ‘Logged in’ and ‘Cancelled’. The ‘Cancelled’ value has the same meaning and the same count as the ‘Cancellation Confirmation’ value of the `page` feature, so I decided to drop this feature.
firstName, lastName, gender
Since each user’s name is nearly unique, its distribution differs very little between stayed users and churned users, so we’ll filter these two name features out as well.
For `gender`, I calculated the average churn value of each group. There is also a difference here: female users tend to stick with our service more than male users.
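The comparison boils down to a per-group mean of the user-level churn flag; a sketch with made-up users and flags (not the real figures):

```python
from collections import defaultdict

# user -> (gender, churn flag); values are illustrative only.
users = {"30": ("M", 1), "54": ("F", 0), "9": ("M", 0), "12": ("F", 0)}

totals = defaultdict(lambda: [0, 0])  # gender -> [churn sum, user count]
for gender, churn in users.values():
    totals[gender][0] += churn
    totals[gender][1] += 1

# Average churn flag per gender = churn rate of that group.
rates = {g: c / n for g, (c, n) in totals.items()}

print(rates)  # {'M': 0.5, 'F': 0.0}
```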
length
I calculated the average listening duration of each user. There is also a difference between the two groups: churned users tend to spend less time listening to music in our service than stayed users.
level
Paid users tend to stay more than free users. This is quite a natural result!
location
I drew a bar plot of the average churn rate per location with `seaborn`. There is a locational difference in whether users stay with or leave our service.
page
Since this feature has many categorical values, I made a pivot table of it first; it looks like the image below.
With this pivot table, we examined each value’s average churn rate and distribution. It contains too much information to show here, so I’ll leave a link instead of images (it’s in section 3–9). The results are below.
- There are definitely some differences in page values between churns and non-churns.
- Both groups have their largest value on the `Thumbs Up` page. This is natural.
- Churned users tend to have smaller values than non-churns on almost every page. This could be interpreted as churns using our service more timidly than others; maybe a complex UI or small glitches are the reason.
- Churned users are likely to get more rolled adverts.
- Churned users have smaller values on the `Thumbs Up` and `Thumbs Down` pages, which means they don’t interact with our service as often.
song
Like `Artist`, I counted the number of songs and distinct songs played by each user and compared the averages. It seems that churned users tend to listen to fewer songs than stayed users.
ts
This feature records the time users spend in our service. We can see that churned users usually spend less time than non-churns.
Part 3. Feature Engineering
Usually, after EDA, people create more features to enrich the data and boost model performance. Because a machine cannot extract every latent meaning from the raw data on its own, we should make our data richer. This process is called ‘Feature Engineering’, and we’ll do it in this part.
A. From EDA
We’ll make additional features from the results of the EDA. The list is below.
- `Artist` - count of listened artists & distinct artists
- `gender` - encoded as a label
- `length` - total length of played songs, in seconds
- `level` - last status of level (free or paid)
- `page` - frequency of every value
- `song` - count of played songs & distinct songs
- `sessionId` - the number of sessions each user played
B. Add More Features
I generated more features for modeling, based on my data-wrangling and music-application experience. The list is below. Details are in here, section C-2.
- Average number of songs per session
- Average duration of each session
- Average interval between sessions, in days
- Subscription length, calculated from the registration date
- Ratio of repeated artists among all artists each user listened to
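As an illustration, one plausible reading of the last feature (the repeat-artist ratio) can be computed from a user’s play history like this; the artist names are made up:

```python
# A hypothetical play history for one user.
plays = ["Coldplay", "Muse", "Coldplay", "Radiohead", "Coldplay"]

# Plays of an artist the user had already listened to,
# divided by all plays: a loyalty / repetition signal.
repeats = len(plays) - len(set(plays))
repeat_ratio = repeats / len(plays)

print(repeat_ratio)  # 0.4
```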
C. Check Multicollinearity
If there are correlations between our features, they disturb the model’s classification process (this is called multicollinearity), so we should check the correlation of each feature pair. I used `sns.heatmap` to inspect this briefly.
It seems that `song_cnt`, `song_cnt_uniq`, `artist_cnt`, `artist_cnt_uniq`, `length_sum`, and `pageCount` are highly correlated. Let’s look again with a correlation threshold of 0.95.
We can see that the same features are filtered out by the threshold, so I dropped those columns.
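The threshold-based dropping can be sketched in plain Python (the notebook works on the Spark correlation matrix; the toy columns and the greedy drop-the-second-of-each-pair rule below are my own simplifications):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy feature columns: song_cnt and song_cnt_uniq move together.
features = {
    "song_cnt":      [10, 20, 30, 40],
    "song_cnt_uniq": [ 9, 19, 28, 41],
    "thumbs_up":     [ 1,  5,  2,  7],
}

# Greedily drop the second feature of any pair with |corr| > 0.95.
names = list(features)
dropped = set()
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if a not in dropped and b not in dropped:
            if abs(pearson(features[a], features[b])) > 0.95:
                dropped.add(b)

print(dropped)  # {'song_cnt_uniq'}
```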
Part 4. Modeling
Now we’ll train models on our data and predict whether a user will stay or not. First, we’ll check which model returns the best score on our data. After choosing a model, we would normally tune its parameters via cross validation (CV). However, because of a lack of computing power, I couldn’t run CV (I wrote the code for this task in section D-3). So I chose the best model and predicted churn with the default parameter settings.
A. Fill `null` Values
Before putting the data into models, we should impute `null` values, because some models can’t be computed with missing values.
As we can see, there are no null values in our dataset. If we had missing values, we could easily impute them via the procedures below.
- Use `pyspark.ml.feature.StringIndexer` to encode categorical features into labels
- Use `pyspark.ml.feature.Imputer` to impute numerical features
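A plain-Python sketch of what those two steps do, assuming toy columns (the real code would use the PySpark classes above): `StringIndexer` assigns label indices by descending frequency, and `Imputer` fills gaps with the column mean by default.

```python
# Toy rows: one categorical column, one numeric column with gaps.
genders = ["M", "F", "F", "M", "F"]
lengths = [200.0, None, 180.0, None, 220.0]

# StringIndexer-style: map labels to indices by descending frequency
# (ties broken alphabetically here).
order = sorted(set(genders), key=lambda g: (-genders.count(g), g))
index = {g: i for i, g in enumerate(order)}
gender_idx = [index[g] for g in genders]

# Imputer-style: replace missing numerics with the column mean.
known = [v for v in lengths if v is not None]
mean = sum(known) / len(known)
lengths_filled = [mean if v is None else v for v in lengths]

print(gender_idx)      # [1, 0, 0, 1, 0]
print(lengths_filled)  # [200.0, 200.0, 180.0, 200.0, 220.0]
```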
B. Choose Model with Default Models
After splitting the data into training, validation, and test sets, we built default classifiers with `LogisticRegression`, `RandomForestClassifier`, and `GBTClassifier` from `pyspark.ml.classification` and checked which model returns the best output. Below is a snapshot of this task.
We saw that there are 52 churns among 225 unique users, about 23%. In this case, even a trivial model exceeds 70% accuracy, so that metric becomes meaningless! So I decided to use the F1 score, which is also commonly used for binary classification.
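The imbalance is easy to check: a baseline that always predicts ‘stay’ gets decent accuracy yet an F1 of zero on the churn class (the user counts come from the dataset above):

```python
users, churns = 225, 52

# Accuracy of always predicting the majority class ('stay').
baseline_acc = (users - churns) / users
print(round(baseline_acc, 3))  # 0.769

# That baseline never predicts churn: zero true positives on the
# churn class, so F1 = 0 and accuracy alone is misleading.
tp, fp, fn = 0, 0, churns
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
print(f1)  # 0.0
```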
The table above shows the results from the default models. `LogisticRegression`'s F1 score on the test set is the highest among the three models, so we'll use it for our final output.
C. Hyperparameter Tuning
Because the server memory couldn’t handle the CV task, I only left the code for it, commented out. It can be seen in section D-3.
D. Feature Importance
We finally predicted the output with the default `LogisticRegression` classifier and got an F1 score of around 0.81. At this point, we should look at each feature’s influence on the target value. This is called ‘feature importance’.
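For `LogisticRegression`, a simple proxy for importance is the absolute value of each (standardized) feature’s coefficient. A sketch with made-up feature names and coefficients, not the fitted model’s actual values:

```python
# Hypothetical coefficients paired with feature names.
coefs = {
    "freqThumbsDown": 1.8,
    "freqRollAdvert": 0.9,
    "song_cnt":      -0.1,
    "freqAddFriend":  0.05,
}

# Rank features by coefficient magnitude, ignoring sign.
ranked = sorted(coefs, key=lambda f: abs(coefs[f]), reverse=True)

print(ranked)  # ['freqThumbsDown', 'freqRollAdvert', 'song_cnt', 'freqAddFriend']
```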
We can see that the features generated from `page` have quite a large effect on whether a user stays or not. We could say that the `page` feature definitely captures each user’s characteristics.
We can easily imagine that users who aren’t satisfied with the service push the `Thumbs Down` button more often than others. They also tend to `Downgrade` their subscription.
Additionally, features like `Add Friend` or `freqLogout` show low importance because both stayed and churned users do them about equally. The count values of `song` and `Artist` also have low importance.
Part 5. Findings & Conclusion
We built a churn-detection model with small-to-medium-sized event data. If we can predict whether a user will quit our service, we can reach those users with various remedies before they leave (e.g., give them coupons or send them event push alarms).
We found that the user-activity values from the `page` feature have an enormous effect on users’ breakaway. So, to prevent churn, we should analyze user activity more frequently.
Because of the lack of computing power, we couldn’t run cross validation to obtain the best model. With much stronger computing power, we could try a wider variety of models and techniques (ensemble methods like stacking, etc.) to get the best predictions.