These days, many services run on a subscription business model: customers subscribe to a company’s service and pay monthly or annually. Some of those customers are dissatisfied with the service for various reasons (including the UI) and cancel their subscription. We usually call this ‘churn’, and many companies watch the churn rate of their service closely.
I recently analyzed a realistic dataset related to a subscription service in Udacity (Sparkify, the name of the capstone project). In this story, I’ll show what I analyzed and which insights can be drawn from this data. Related links are below.
Part 1. Glimpse Our Data
A. What Does the Data Look Like?
Let’s look at our data first. It lives in the `mini_sparkify_event_data.json` file. Since the full dataset is too big to handle, we work with a fraction of it.
We can see which features are in the data with the `printSchema()` method. There are some demographic columns like `gender` and `location`, but most of the columns are related to each user’s activity in our service. At first glance, the feature named `page` seems to have a major effect on whether a user stays or churns. The data consists of 286,500 rows.
We saw that 8,346 of the 286,500 rows have an empty `userId` value. These are probably the trajectories of guests, so I simply filtered them out. That leaves 278,154 rows to work with.
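In the notebook this is a one-line Spark filter; a minimal plain-Python sketch of the same idea, with a few made-up event rows standing in for the real DataFrame, looks like this:

```python
# Hypothetical event rows; in the real project these come from
# mini_sparkify_event_data.json loaded into a Spark DataFrame.
events = [
    {"userId": "30", "page": "NextSong"},
    {"userId": "",   "page": "Home"},      # guest activity: empty userId
    {"userId": "9",  "page": "Thumbs Up"},
]

# Keep only rows with a non-empty userId, mirroring
# df.filter(df.userId != "") in PySpark.
logged_in = [row for row in events if row["userId"] != ""]

print(len(events), len(logged_in))  # 3 2
```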
I also looked into the features `auth`, `gender`, `level`, `method`, and `page`. As I expected, `page` carries the most varied information among them. The values containing the word ‘Cancel’ are directly related to churn, so I guess they will have to be filtered out in the modeling procedure.
There are 225 unique users in this dataset after filtering out the empty `userId` values. Broken down by `level`, the user counts are:
- paid: 165
- free: 195
- both paid and free: 135
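These three numbers are consistent with the 225 unique users by inclusion–exclusion: users seen as paid, plus users seen as free, minus those seen as both. A one-liner to check:

```python
# Counts from the dataset: users seen at each level, and the overlap.
paid, free, both = 165, 195, 135

# Inclusion-exclusion: |paid ∪ free| = |paid| + |free| - |paid ∩ free|
unique_users = paid + free - both

print(unique_users)  # 225
```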
Each user’s logs look like the above. We can pick the columns we want with the `.select()` method. The user whose `userId` is 30 added the song ‘Passengers’ to a playlist and kept listening to music. With this kind of data, we could even analyze the most beloved songs using each user’s playtime per song.
B. Define Churn
To predict churn with this data, we need to define a new target column. I named it `churn` and defined it as follows: if a specific user ever visited the page named ‘Cancellation Confirmation’, we treat that user as churned. We made a list of those users and added a binary column to our data.
We can see that there are 52 churns among the 225 unique users.
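A plain-Python sketch of this labeling (the notebook does it on the Spark DataFrame; the user IDs and rows below are made up):

```python
# Hypothetical event rows standing in for the Spark DataFrame.
events = [
    {"userId": "30", "page": "NextSong"},
    {"userId": "54", "page": "Cancellation Confirmation"},
    {"userId": "30", "page": "Thumbs Up"},
    {"userId": "54", "page": "Home"},
]

# Any user who ever hit 'Cancellation Confirmation' is labeled a churn.
churned = {r["userId"] for r in events
           if r["page"] == "Cancellation Confirmation"}

# Attach the binary target to every row of that user.
for r in events:
    r["churn"] = 1 if r["userId"] in churned else 0

print(churned)  # {'54'}
```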
Part 2. Exploratory Data Analysis
To predict the target, we should analyze our dataset further. This process is usually called EDA, Exploratory Data Analysis. Through EDA we’ll try to find the features related to our target.
To do that, let’s group each feature by the binary `churn` value. If there’s a clear difference between the two groups, we can say the feature affects our target (I don’t cover every feature here).
Artist
I counted the number of artists and distinct artists played by each user and compared the averages. There is a difference between the two groups: users who stayed tend to listen to a wider variety of artists than churned users.
Auth
From the above, this feature has two values, ‘Logged in’ and ‘Cancelled’. The ‘Cancelled’ value has the same meaning and the same count as the ‘Cancellation Confirmation’ value of the `page` feature, so I decided to drop this feature.
firstName, lastName, gender
Since each user’s name is nearly unique, its distribution differs very little between stayed users and churned users, so we’ll filter these two name features out as well.
For `gender`, I calculated the average churn value of each group. There is also a difference here: female users tend to stick with our service more than male users.
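The comparison boils down to a per-group mean of the user-level churn flag; a sketch with made-up users and flags (not the real figures):

```python
from collections import defaultdict

# user -> (gender, churn flag); values are illustrative only.
users = {"30": ("M", 1), "54": ("F", 0), "9": ("M", 0), "12": ("F", 0)}

totals = defaultdict(lambda: [0, 0])  # gender -> [churn sum, user count]
for gender, churn in users.values():
    totals[gender][0] += churn
    totals[gender][1] += 1

# Average churn flag per gender = churn rate of that group.
rates = {g: c / n for g, (c, n) in totals.items()}

print(rates)  # {'M': 0.5, 'F': 0.0}
```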
length
I calculated the average listening duration of each user. There is also a difference between the two groups: churned users tend to spend less time listening to music in our service than stayed users.
level
Paid users tend to stay more than free users. This is quite a natural result!
location
I drew a bar plot of the average churn rate per location with `seaborn`. There is a locational difference in whether users stay with or leave our service.
page
Since this feature has many categorical values, I made a pivot table of it first; it looks like the image below.
With this pivot table, we examined each value’s average churn rate and distribution. It contains too much information to show here, so I’ll leave a link instead of images (it’s in section 3–9). The results are below.
- There are definitely some differences in page values between churns and non-churns.
- Both groups have their largest value on the `Thumbs Up` page. This is natural.
- Churned users tend to have smaller values than non-churns on almost every page. This could be interpreted as churns using our service more timidly than others; maybe a complex UI or small glitches are the reason.
- Churned users are likely to get more rolled adverts.
- Churned users have smaller values on the `Thumbs Up` and `Thumbs Down` pages, which means they don’t interact with our service as often.
song
Like `Artist`, I counted the number of songs and distinct songs played by each user and compared the averages. It seems that churned users tend to listen to fewer songs than stayed users.
ts
This feature records the time users spend in our service. We can see that churned users usually spend less time than non-churns.
Part 3. Feature Engineering
Usually, after EDA, people create more features to enrich the data and boost model performance. Because a machine cannot extract every latent meaning from the raw data on its own, we should make our data richer. This process is called ‘Feature Engineering’, and we’ll do it in this part.
A. From EDA
We’ll make additional features from the results of the EDA. The list is below.
- `Artist` - count of listened artists & distinct artists
- `gender` - encoded as a label
- `length` - total length of played songs, in seconds
- `level` - last status of level (free or paid)
- `page` - frequency of every value
- `song` - count of played songs & distinct songs
- `sessionId` - the number of sessions each user played
B. Add More Features
I generated more features for modeling, based on my data-wrangling and music-application experience. The list is below. Details are in here, section C-2.
- Average number of songs per session
- Average duration of each session
- Average interval between sessions, in days
- Subscription length, calculated from the registration date
- Ratio of repeated artists among all artists each user listened to
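As an illustration, one plausible reading of the last feature (the repeat-artist ratio) can be computed from a user’s play history like this; the artist names are made up:

```python
# A hypothetical play history for one user.
plays = ["Coldplay", "Muse", "Coldplay", "Radiohead", "Coldplay"]

# Plays of an artist the user had already listened to,
# divided by all plays: a loyalty / repetition signal.
repeats = len(plays) - len(set(plays))
repeat_ratio = repeats / len(plays)

print(repeat_ratio)  # 0.4
```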
C. Check Multicollinearity
If there are correlations between our features, they disturb the model’s classification process (this is called multicollinearity), so we should check the correlation of each feature pair. I used `sns.heatmap` to inspect this briefly.
It seems that `song_cnt`, `song_cnt_uniq`, `artist_cnt`, `artist_cnt_uniq`, `length_sum`, and `pageCount` are highly correlated. Let’s look again with a correlation threshold of 0.95.
We can see that the same features are filtered out by the threshold, so I dropped those columns.
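The threshold-based dropping can be sketched in plain Python (the notebook works on the Spark correlation matrix; the toy columns and the greedy drop-the-second-of-each-pair rule below are my own simplifications):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy feature columns: song_cnt and song_cnt_uniq move together.
features = {
    "song_cnt":      [10, 20, 30, 40],
    "song_cnt_uniq": [ 9, 19, 28, 41],
    "thumbs_up":     [ 1,  5,  2,  7],
}

# Greedily drop the second feature of any pair with |corr| > 0.95.
names = list(features)
dropped = set()
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if a not in dropped and b not in dropped:
            if abs(pearson(features[a], features[b])) > 0.95:
                dropped.add(b)

print(dropped)  # {'song_cnt_uniq'}
```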
Part 4. Modeling
Now we’ll train models on our data and predict whether a user will stay or not. First, we’ll check which model returns the best score on our data. After choosing a model, we would normally tune its parameters via cross validation (CV). However, because of a lack of computing power, I couldn’t run CV (I wrote the code for this task in section D-3). So I chose the best model and predicted churn with the default parameter settings.
A. Fill `null` Values
Before putting the data into models, we should impute `null` values, because some models can’t be computed with missing values.
As we can see, there are no null values in our dataset. If we had missing values, we could easily impute them via the procedures below.
- Use `pyspark.ml.feature.StringIndexer` to encode categorical features into labels
- Use `pyspark.ml.feature.Imputer` to impute numerical features
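A plain-Python sketch of what those two steps do, assuming toy columns (the real code would use the PySpark classes above): `StringIndexer` assigns label indices by descending frequency, and `Imputer` fills gaps with the column mean by default.

```python
# Toy rows: one categorical column, one numeric column with gaps.
genders = ["M", "F", "F", "M", "F"]
lengths = [200.0, None, 180.0, None, 220.0]

# StringIndexer-style: map labels to indices by descending frequency
# (ties broken alphabetically here).
order = sorted(set(genders), key=lambda g: (-genders.count(g), g))
index = {g: i for i, g in enumerate(order)}
gender_idx = [index[g] for g in genders]

# Imputer-style: replace missing numerics with the column mean.
known = [v for v in lengths if v is not None]
mean = sum(known) / len(known)
lengths_filled = [mean if v is None else v for v in lengths]

print(gender_idx)      # [1, 0, 0, 1, 0]
print(lengths_filled)  # [200.0, 200.0, 180.0, 200.0, 220.0]
```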
B. Choose Model with Default Models
After splitting the data into training, validation, and test sets, we built default classifiers with `LogisticRegression`, `RandomForestClassifier`, and `GBTClassifier` from `pyspark.ml.classification` and checked which model returns the best output. Below is a snapshot of this task.
We saw that there are 52 churns among 225 unique users, about 23%. In this case, even a trivial model exceeds 70% accuracy, so that metric becomes meaningless! So I decided to use the F1 score, which is also commonly used for binary classification.
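The imbalance is easy to check: a baseline that always predicts ‘stay’ gets decent accuracy yet an F1 of zero on the churn class (the user counts come from the dataset above):

```python
users, churns = 225, 52

# Accuracy of always predicting the majority class ('stay').
baseline_acc = (users - churns) / users
print(round(baseline_acc, 3))  # 0.769

# That baseline never predicts churn: zero true positives on the
# churn class, so F1 = 0 and accuracy alone is misleading.
tp, fp, fn = 0, 0, churns
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
print(f1)  # 0.0
```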
The table above shows the results from the default models. `LogisticRegression`'s F1 score on the test set is the highest among the three models, so we'll use it for our final output.
C. Hyperparameter Tuning
Because the server memory couldn’t handle the CV task, I only left the code for it, commented out. It can be seen in section D-3.
D. Feature Importance
We finally predicted the output with the default `LogisticRegression` classifier and got an F1 score of around 0.81. At this point, we should look at each feature’s influence on the target value. This is called ‘feature importance’.
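For `LogisticRegression`, a simple proxy for importance is the absolute value of each (standardized) feature’s coefficient. A sketch with made-up feature names and coefficients, not the fitted model’s actual values:

```python
# Hypothetical coefficients paired with feature names.
coefs = {
    "freqThumbsDown": 1.8,
    "freqRollAdvert": 0.9,
    "song_cnt":      -0.1,
    "freqAddFriend":  0.05,
}

# Rank features by coefficient magnitude, ignoring sign.
ranked = sorted(coefs, key=lambda f: abs(coefs[f]), reverse=True)

print(ranked)  # ['freqThumbsDown', 'freqRollAdvert', 'song_cnt', 'freqAddFriend']
```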
We can see that the features generated from `page` have quite a large effect on whether a user stays or not. We could say that the `page` feature definitely captures each user’s characteristics.
We can easily imagine that users who aren’t satisfied with the service push the `Thumbs Down` button more often than others. They also tend to `Downgrade` their subscription.
Additionally, features like `Add Friend` or `freqLogout` show low importance because both stayed and churned users do them about equally. The count values of `song` and `Artist` also have low importance.
Part 5. Findings & Conclusion
We built a churn-detection model with small-to-medium-sized event data. If we can predict whether a user will quit our service, we can reach those users with various remedies before they leave (e.g., give them coupons or send them event push alarms).
We found that the user-activity values from the `page` feature have an enormous effect on users’ breakaway. So, to prevent churn, we should analyze user activity more frequently.
Because of the lack of computing power, we couldn’t run cross validation to obtain the best model. With much stronger computing power, we could try a wider variety of models and techniques (ensemble methods like stacking, etc.) to get the best predictions.