Machine Learning for Alteryx users: predicting popularity of NYC apartments
Do you know Kaggle and their awesome competitions?
If the answer is yes, you probably know what I am talking about; if the answer is no, you should check it out: you will find tons of datasets on many different topics.
A few months ago, coach Andy asked me to teach predictive analytics at DS5. Because people are usually new to the topic, I cover clustering and linear regression, but this time I also wanted to come up with something new and exciting.
I picked this one: Two Sigma Connect: Rental Listing Inquiries.
In this competition, you will predict how popular an apartment rental listing is based on some variables like number of bedrooms/bathrooms, price, etc.
The data comes from renthop.com, an apartment listing website. These apartments are in New York City.
The target variable, ‘interest_level’, is defined by the number of inquiries a listing received during the time it was live on the site. In other words, an apartment can be classified as low, medium or high interest. The main variables are:
- bathrooms: number of bathrooms
- bedrooms: number of bedrooms
- features: a list of features about this apartment
- price: in USD
- interest_level: this is the target variable. It has 3 categories: ‘high’, ‘medium’, ‘low’
- other variables
What we are going to predict here is the probability of falling into each of these three categories; in other words, it’s a classification problem.
You can find all the theory about supervised learning and classification in machine learning in my previous blog, where I predicted the probability of surviving the Titanic disaster.
Just for a quick recap: in supervised learning, we have a dataset for which we already know what the correct output looks like, with the idea that there is a relationship between the input and the output. The goal of this method is to predict a variable of interest.
In simple English, we have two datasets:
In the first one, for each row (listing), the variable of interest is displayed.
We feed the algorithm/model with this information so that it can learn from the predictors (bathrooms/bedrooms/price…), and when the model is exposed to the new dataset (without the variable of interest, which is what we want to predict), it can independently adapt.
Models learn from previous computations to produce reliable, repeatable decisions and results.
We are classifying a NYC flat’s popularity based on the number of bathrooms, bedrooms, price… I also have a variable that lists all the features belonging to that flat: doorman, elevator, fitness centre, cats allowed… If I just count them, I can get a new variable (count of features).
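As a quick sketch of that idea in Python (the column names here are my own illustration, not taken from the actual workflow):

```python
import pandas as pd

# Toy listings with the features column as lists of amenities.
listings = pd.DataFrame({
    "listing_id": [42, 43, 44],
    "features": [["Doorman", "Elevator", "Cats Allowed"],
                 ["Fitness Center"],
                 []],
})

# New variable: simply count the features each flat has.
listings["feature_count"] = listings["features"].apply(len)
print(listings[["listing_id", "feature_count"]])
```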
But as I love travelling and I like to stay in Airbnb accommodation, what’s one of the most important things in a city?
POSITION is a key variable for me.
If I have latitude and longitude, I can use a good central point of NY (Times Square is central enough, right?) and then calculate the distance from that point, so I can have a new variable: distance in miles.
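A rough Python equivalent of that spatial step, using the haversine formula (the Times Square coordinates and the sample listing below are approximate, and Alteryx’s distance tool may use a slightly different method):

```python
from math import radians, sin, cos, asin, sqrt

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles (haversine formula)."""
    earth_radius_miles = 3958.8
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_miles * asin(sqrt(a))

# Approximate coordinates of Times Square.
TIMES_SQUARE = (40.7580, -73.9855)

# Distance from a hypothetical listing near Brooklyn Heights.
d = distance_miles(40.6958, -73.9936, *TIMES_SQUARE)
print(round(d, 1))
```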
This is what the data preparation workflow looks like.
You can download the data preparation workbook here.
If you’re not familiar with spatial analytics in Alteryx, you can watch some videos from The Information Lab on this channel or read some of the blogs available on the website.
Now let’s run our models!
You can find the workflow here.
First, samples must be created with the Sample tool, as you want to train the predictive models on one subset of the data (the estimation sample) and validate the models on another subset (the validation sample). Anything not assigned to these two goes into the holdout sample.
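Outside Alteryx, the same three-way split could be sketched like this (the 50/25/25 proportions are just an illustration, not the Sample tool’s actual settings):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Pretend we have 10,000 listings; assign each row to one of the three samples.
n = 10_000
draw = rng.random(n)
sample = np.where(draw < 0.5, "estimation",
         np.where(draw < 0.75, "validation", "holdout"))

# Roughly 50% / 25% / 25% of the rows land in each sample.
print({name: int((sample == name).sum())
       for name in ("estimation", "validation", "holdout")})
```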
Just to remind ourselves what we are doing here: in supervised learning, we have a dataset called the training set, containing the low/medium/high responses that constitute the variable of interest, plus the other variables, the predictors (or independent variables), that explain the dependent one. From this dataset, I want to learn how to predict the low/medium/high variable based on some characteristics (the predictors). So, in simple English: a flat that is quite central (a short distance in miles from Times Square), with a reasonable price and 2 bedrooms, might be of high interest to families travelling to NY with children, while another flat in Brooklyn with a doorman could be awesome for young people who take cheap flights, arrive in the evening and can still get the key from the doorman.
What I want to say here is that a mix of predictors explains the variable of interest, and we need to teach that to the algorithm so it can predict a good percentage on the new dataset. Again, we will not predict low/medium/high directly but a percentage for each of the three, and the total will add up to 100%.
I also want to highlight that this is a classification problem with a non-binary variable (binary means 0 or 1, YES or NO), as we are spreading the % among three categories.
What models are we comparing?
Decision tree, Boosted model and Forest model: classes of machine learning methods that predict a target variable using one or more variables that are expected to have an influence on the target variable.
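A minimal scikit-learn sketch of the same three model families — this is an analogy to the Alteryx tools, not their exact implementation, and the synthetic data is a stand-in for the listings:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in for the listings: 4 predictors, 3 interest classes.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "boosted model": GradientBoostingClassifier(random_state=0),
    "forest model": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    # Each row of predict_proba is a low/medium/high-style probability vector.
    proba = model.predict_proba(X[:1])
    print(name, proba.round(2), "sums to", proba.sum().round(2))
```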
How to test goodness of prediction?
We attach the predicted values to the stream (the validation sample from the Sample tool) with the Score tool:
We will have something like this:
For each listing, we have the actual variable of interest (first row, listingID 42, interest level = Low) and a % for each of the possibilities; if you add those numbers, the total will be 1.
Which is the best model? Well, we already have the answer (we know that listingID 42 has low interest), so I first create a new column with the following calculation:
if [Score_low] < [Score_medium] then
(if [Score_medium] < [Score_high] then 'high' else 'medium' endif) else
(if [Score_low] < [Score_high] then 'high' else 'low' endif)
endif
and a second calculation:
if [interest_level] = [label] then 1 else 0 endif
On the first calculation, I create a new column which will be labeled as low/medium/high based on the highest score (in listingID 42, the % of low is the highest, therefore the new column is equal to ‘low’).
The second calculation is 1 if the actual interest level is the same as the predicted one (the level with the highest score) else 0.
Now I can use a summarize tool and sum all the 1, the model with the highest number is the best one.
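The two calculations above amount to taking the class with the highest score and counting the matches. In Python, on a toy scored table with the same column names:

```python
import pandas as pd

# Toy scored output: actual interest level plus the three model scores.
scored = pd.DataFrame({
    "interest_level": ["low", "high", "medium", "low"],
    "Score_low":    [0.70, 0.10, 0.20, 0.40],
    "Score_medium": [0.20, 0.30, 0.50, 0.45],
    "Score_high":   [0.10, 0.60, 0.30, 0.15],
})

# First calculation: label each row with the class that has the highest score.
score_cols = ["Score_low", "Score_medium", "Score_high"]
scored["label"] = (scored[score_cols].idxmax(axis=1)
                   .str.replace("Score_", "", regex=False))

# Second calculation + summarize: 1 where actual == predicted, then sum.
correct = (scored["interest_level"] == scored["label"]).sum()
print(correct)  # 3 of the 4 toy rows match
```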
The forest model is the best: it catches 7,055 correct answers.
Now I can use this model with the new data and predict my variable of interest.
Let’s have a look at the results:
Now if you are a map lover, have fun in Tableau!
You can find the workbook here.