# Taxi demand prediction in New York City

**Objective**:

Predict, as accurately as possible, the number of pickups expected in a given region for a given time period (10-minute interval), where the end user is the taxi driver.

**Constraints:**

> Latency: we expect the ML model to predict demand in the adjoining regions within a few seconds

> Interpretability: not strictly required

> Accuracy: measured as the relative percentage error between predicted and actual pickups

## Data Information

Get the data from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (2016 data). The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC). Every row corresponds to one trip.

**Information on taxis:**

Yellow Taxi: Yellow Medallion Taxicabs

These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.

# ML Problem Formulation

**Time-series forecasting and Regression**

*To find the number of pickups, given location coordinates (latitude and longitude) and time, in the query region and surrounding regions.* To solve this, we use data collected in Jan-Mar 2015 to predict the pickups in Jan-Mar 2016.

# Performance metrics

- Mean Absolute Percentage Error (MAPE) = mean(|actual − predicted| / actual) × 100
- Mean Squared Error (MSE)
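As a quick sketch, the two metrics can be computed as below; `mape` and `mse` are hypothetical helper names for illustration, not code from the notebook.

```python
import numpy as np

def mape(actual, predicted):
    # Mean Absolute Percentage Error, in percent
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted) / actual) * 100

def mse(actual, predicted):
    # Mean Squared Error
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

# toy example: actual vs. predicted pickup counts for three 10-min bins
print(round(mape([100, 50, 200], [90, 55, 210]), 2))  # -> 8.33
print(mse([100, 50, 200], [90, 55, 210]))             # -> 75.0
```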

# 2. Data Cleaning:

In this section, we do univariate analysis and remove outliers/illegitimate values which may be caused by errors in the data.

## 2.1 Pickup Latitude and Pickup Longitude

It is inferred from the source https://www.flickr.com/places/info/2459115 that New York is bounded by the location coordinates (lat, long): (40.5774, -74.15) and (40.9176, -73.7004).

Hence, any coordinates outside this bounding box are discarded, as we are only concerned **with pickups that originate within New York**; the same filter is later applied to dropoffs.

**Observation:-** As seen above, most stray points lie just outside the boundary, but a few fall as far away as South America, Mexico, or Canada.

## 2.2 Dropoff Latitude and Dropoff Longitude

Plotting **dropoff coordinates** which fall outside the bounding box of New York.

**Observation:-** The observations here are similar to those obtained while analysing pickup latitude and longitude

# 3. Trip Durations:

According to NYC Taxi & Limousine Commission regulations, **the maximum allowed trip duration in a 24-hour interval is 12 hours. Trip duration is found by subtracting the pickup time from the dropoff time.**

The timestamps are converted to Unix time so as to get the duration (trip time) and speed; the Unix pickup times are also used later for binning. In our data, time is in the format "YYYY-MM-DD HH:MM:SS"; we convert this string to a Python time object and then into a Unix timestamp.
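A minimal sketch of that conversion (assuming timestamps are treated as UTC; the notebook may instead use local time via `time.mktime`):

```python
import calendar
import datetime

def to_unix(ts_string):
    # parse 'YYYY-MM-DD HH:MM:SS' and return a Unix timestamp (seconds, UTC)
    dt = datetime.datetime.strptime(ts_string, "%Y-%m-%d %H:%M:%S")
    return calendar.timegm(dt.timetuple())

pickup = to_unix("2015-01-15 19:05:39")
dropoff = to_unix("2015-01-15 19:23:42")
trip_time_min = (dropoff - pickup) / 60.0  # trip duration in minutes
print(trip_time_min)  # -> 18.05
```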

## 3.1 **Box plot of trip time**

**Observation:** the skewed box plot shows the presence of outliers with very high values. On the other hand, the 25th, 50th and 75th percentiles are almost negligible, so the box plot is not interpretable; we plot the PDF instead.

**After removing outliers** whose trip time does not lie in the range of 1-720 minutes (12 hours), based on our analysis and the TLC regulations, we plot the trip times again.

```python
# NY authority rule: the maximum trip duration is 12 hours = 720 minutes
frame_with_durations_modified = frame_with_durations[
    (frame_with_durations.trip_times > 1) & (frame_with_durations.trip_times < 720)
]
```

**Observation**: The plot is still not interpretable because the 25th, 50th and 75th percentile values are very low. Next, we plot the PDF.

## 3.2 PDF plot of trip time

PDF of trip times after removing the outliers.

**Observation:**

It is right-skewed, so we take the log, which is roughly normally distributed between -3 and 3, with some tail on the right side.

Let's plot the log values of the trip durations.

The plot looks bell-shaped, but there is a short tail on the left, so let's draw a Q-Q plot to check the distribution.

Q-Q plot to check whether trip times are log-normally distributed.

**Observation**: The log of the trip time lies between -3 and 3 and follows a Gaussian distribution, so we remove the data outside this range.
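A sketch of that check on synthetic data (the distribution parameters are made up for illustration); `scipy.stats.probplot` returns the Q-Q points plus a correlation coefficient `r`, and `r` close to 1 supports log-normality:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
trip_times = rng.lognormal(mean=0.5, sigma=0.7, size=10000)  # synthetic trip times

log_times = np.log(trip_times)

# Q-Q data against a normal distribution; feed (osm, osr) to matplotlib to draw the plot
(osm, osr), (slope, intercept, r) = stats.probplot(log_times, dist="norm")
print(round(r, 3))  # close to 1.0 -> log of trip times is approximately Gaussian

# keep only trips whose log value lies in [-3, 3], as in the analysis above
cleaned = trip_times[(log_times > -3) & (log_times < 3)]
```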

# 4. Speed

Speed is calculated by dividing trip distance by trip time. Let's plot a box plot of the average speed.

The plot is still not interpretable, so we turn to percentile values to find outliers.

We remove further outliers based on the 99.9th percentile value: speed should **lie between 0 and 45.31 mph**. The average speed after cleaning is 12.45 mph.

**-> The average speed in New York is 12.45 miles/hr, so a cab driver can travel about 2 miles per 10 minutes on average. We therefore divide the city into regions based on this observation.**

# 5. Trip Distance

Let's calculate the percentile values of trip distance.

Here the 99.9th percentile is 22.57 miles, and values beyond that rise sharply, so they are outliers. We therefore keep values between 0 and 22.57 miles.

Let's plot the box plot after removing outliers.

The above plot is very interpretable; we can easily read off the 25th, 50th and 75th percentile values.

# 6. Total Fare

Let's calculate the percentile values of the fare data.

**Observation:-** Even the 99.9th percentile value doesn't look like an outlier, as there is not much difference between the 99.8th and 99.9th percentiles, so we move on to graphical analysis.

The plot below shows the sorted fare values, where we look for a sharp increase in order to flag those values as outliers. We then plot the fare amounts excluding the last two values in the sorted data.

**Observation:** A very sharp increase in fare values can be seen

Now, plotting the last three total-fare values, we can observe a sharp increase in the values.


We then plot the last 50 values, excluding the last two. Looking at these values, we again find a drastic increase at around a fare value of 1000.

## Remove all outliers/erroneous points from the above analysis

In total, around 3 percent of the data is removed by the above cleaning; 97 percent of the data is left for further analysis.

# B. Data-preparation

## 1. Clustering/Segmentation

We apply k-means clustering on the pickup coordinates (latitude and longitude) to segment the city into regions. The question is how to choose the value of k in k-means.

- We want the inter-cluster distance to be less than 2 miles, but a region cannot be smaller than 0.5 miles, because then the clusters would become too small (distances are measured from the cluster centers).

We finally choose k = 40, because at that value the clusters stay within about 2 miles of each other while the minimum inter-cluster distance remains around 0.5 miles. The main objective was to find an optimal minimum distance (which roughly estimates the radius of a cluster) between the clusters. If we check 50 clusters, we can observe two clusters that are only 0.3 miles apart from each other.
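The k-selection logic can be sketched as below on synthetic coordinates inside the NYC bounding box (`min_intercluster_miles` and the 69-miles-per-degree conversion are illustrative assumptions, not the notebook's exact code):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
# synthetic pickup coordinates (latitude, longitude) inside the NYC bounding box
coords = np.column_stack([rng.uniform(40.5774, 40.9176, 5000),
                          rng.uniform(-74.15, -73.7004, 5000)])

def min_intercluster_miles(k):
    # fit k-means and return the smallest pairwise distance between cluster
    # centers, crudely converted from degrees to miles (1 degree ~ 69 miles)
    km = MiniBatchKMeans(n_clusters=k, batch_size=10000, random_state=0, n_init=3).fit(coords)
    return pdist(km.cluster_centers_).min() * 69

for k in (30, 40, 50):
    print(k, round(min_intercluster_miles(k), 2))
```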

# so we choose 40 clusters for solving the further problem.

```python
# Getting 40 clusters using MiniBatchKMeans
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000, random_state=0).fit(coords)
frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(
    frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])
```

## 2. Plotting the clusters:

## 3. Time binning

We take the pickup time in regular format, convert it into a Unix timestamp (refer to https://www.unixtimestamp.com), and divide by 600 to get 10-minute (600-second) bin intervals.
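A minimal sketch of that binning (the helper name `pickup_bin` and the period start are assumptions for illustration):

```python
import calendar
import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def pickup_bin(ts_string, period_start="2016-01-01 00:00:00"):
    # map a pickup timestamp to its 10-minute bin index, counted from period_start
    t0 = calendar.timegm(datetime.datetime.strptime(period_start, FMT).timetuple())
    t = calendar.timegm(datetime.datetime.strptime(ts_string, FMT).timetuple())
    return (t - t0) // 600  # 600 seconds = one 10-minute bin

print(pickup_bin("2016-01-01 00:25:10"))  # -> 2 (third 10-minute bin of the period)
```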

# C. Smoothing

When plotting the number of pickups against time bins, some 10-minute bins may have zero pickups. That creates a problem for the ratio feature (division by zero). Smoothing is simply a way to avoid these zeros (e.g. pickups of 100, 0, 50 can be converted to 50, 50, 50).

We first fill in a value of zero for every bin where no pickup data is present: count_values holds the number of pickups that happened in each region for each 10-minute bin, and there is no entry at all for bins with no pickups.

Smoothing is done only for the Jan 2015 data.
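One way to realize the 100, 0, 50 → 50, 50, 50 example is to spread the pickups of each zero run and its non-zero neighbours evenly, preserving the total count; this is a sketch, and the notebook's actual smoothing routine may differ in detail:

```python
def smooth(counts):
    # distribute pickups evenly across each run of zero bins and its non-zero
    # neighbours, preserving the total count: [100, 0, 50] -> [50, 50, 50]
    counts = list(counts)
    i = 0
    while i < len(counts):
        if counts[i] == 0:
            j = i
            while j < len(counts) and counts[j] == 0:
                j += 1                                 # end of the zero run
            lo = i - 1 if i > 0 else i                 # include left neighbour
            hi = j if j < len(counts) else j - 1       # include right neighbour
            share = sum(counts[lo:hi + 1]) / (hi - lo + 1)
            for k in range(lo, hi + 1):
                counts[k] = share
            i = hi + 1
        else:
            i += 1
    return counts

print(smooth([100, 0, 50]))  # -> [50.0, 50.0, 50.0]
```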

# D. Time series and Fourier Transforms

We plot amplitude vs. frequency. Here we find different amplitudes at particular frequencies (e.g. 0 as the DC component, 1/144 per bin for the daily cycle, and a higher-frequency component for the morning and evening peaks).
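A sketch of that analysis on a synthetic signal with one cycle per day (144 ten-minute bins per day); the peak of the amplitude spectrum should land at a frequency of 1/144 cycles per bin:

```python
import numpy as np

rng = np.random.RandomState(0)
t = np.arange(31 * 144)                          # 31 days of 10-minute bins
counts = 50 + 20 * np.sin(2 * np.pi * t / 144)   # daily cycle
counts = counts + rng.normal(0, 2, t.size)       # noise

amplitudes = np.abs(np.fft.fft(counts))
freqs = np.fft.fftfreq(t.size)                   # in cycles per bin

# largest non-DC peak in the positive-frequency half
half = t.size // 2
peak = freqs[1:half][np.argmax(amplitudes[1:half])]
print(peak, 1 / 144)  # the peak sits at the daily frequency
```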

# E. Modelling: Baseline Models

Now we get into modeling in order to forecast the pickup densities for the months of Jan, Feb, and March of 2016, for which we use multiple models with two variations:

- Using **Ratios** of the 2016 data to the 2015 data, i.e. Rt = P2016(t) / P2015(t)
- Using **Previously known values** of the 2016 data itself to predict the future values

## Simple Moving Averages

**For the ratio feature**: The first model used is the Moving Averages model, which uses the previous n values to predict the next value. Using ratio values: Rt = (Rt-1 + Rt-2 + ... + Rt-n) / n

Here Rt-1 = P2016(t-1) / P2015(t-1); we compute this ratio for each of the n previous time steps and then take their average to get Rt.

Here n is a hyperparameter, the window size, which is tuned manually. It is found that a window size of 3 is optimal for simple moving averages over the previous ratio values, therefore Rt = (Rt-1 + Rt-2 + Rt-3) / 3.

Accuracy was assessed using MAPE.

**For previously known values:**

Next, we use the moving average of the 2016 values themselves to predict the future value: Pt = (Pt-1 + Pt-2 + ... + Pt-n) / n. In this case, we take the average of the n previously known values to predict the next one; again, n is the hyperparameter.

For the above, the hyperparameter is the window size (n), which is tuned manually; it is found that a window size of 1 is optimal for moving averages over the previous 2016 values, therefore Pt = Pt-1.
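The moving-average baseline can be sketched as follows (`sma_predict` is a hypothetical helper; with `window=1` it reduces to Pt = Pt-1 as above):

```python
def sma_predict(series, window=3):
    # predict each value as the mean of the previous `window` values
    preds = []
    for i in range(window, len(series)):
        preds.append(sum(series[i - window:i]) / window)
    return preds

pickups = [120, 150, 130, 140, 160, 155]  # toy pickup counts per 10-min bin
print(sma_predict(pickups, window=3))
```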

## Weighted Moving Averages

The Moving Average Model used gave equal importance to all the values in the window used, but we know intuitively that the future is more likely to be similar to the latest values and less similar to the older values. Weighted Averages converts this analogy into a mathematical relationship giving the highest weight while computing the averages to the latest previous value and decreasing weights to the subsequent older ones.

Weighted Moving Averages using ratio values: Rt = (N*Rt-1 + (N-1)*Rt-2 + (N-2)*Rt-3 + ... + 1*Rt-n) / (N*(N+1)/2)

**For the ratio feature:** the hyperparameter is the window size (n), which is tuned manually; it is found that a window size of 5 is optimal for weighted moving averages over the previous ratio values, therefore Rt = (5*Rt-1 + 4*Rt-2 + 3*Rt-3 + 2*Rt-4 + Rt-5) / 15.

**For previously known values:** Weighted Moving Averages using the previous 2016 values: Pt = (N*Pt-1 + (N-1)*Pt-2 + (N-2)*Pt-3 + ... + 1*Pt-n) / (N*(N+1)/2)

For the above, the hyperparameter is the window size (n), which is tuned manually; it is found that a window size of 2 is optimal for weighted moving averages over the previous 2016 values, therefore Pt = (2*Pt-1 + Pt-2) / 3.
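A sketch of the weighted variant (`wma_predict` is a hypothetical helper implementing the formula above; with `window=2` it reproduces Pt = (2*Pt-1 + Pt-2) / 3):

```python
def wma_predict(series, window=2):
    # weighted moving average: weight N for the newest value down to 1 for the oldest
    n = window
    norm = n * (n + 1) / 2
    preds = []
    for i in range(n, len(series)):
        weighted = sum((n - j) * series[i - 1 - j] for j in range(n))
        preds.append(weighted / norm)
    return preds

print(wma_predict([90, 120, 150], window=2))  # -> [110.0]
```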

## Exponential Weighted Moving Averages

https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average

Through weighted averages we have satisfied the idea of giving higher weights to the latest values and decreasing weights to the subsequent ones, but we still do not know the correct weighting scheme, as there are infinitely many ways to assign weights in a non-increasing order, and we still have to tune the window size. To simplify this process we use Exponential Moving Averages, a more logical way of assigning weights that also implies an effective window size.

In exponential moving averages, we use a single hyperparameter alpha (α), a value between 0 and 1; based on the value of α, the weights and the window sizes are configured.

For example, if α = 0.9, then the number of prior values on which the current prediction is based is ~1/(1−α) = 10, i.e. we consider the 10 previous values before predicting the current one. The weights are assigned using 2/(N+1) = 0.18, where N is the number of prior values being considered; hence the latest value is assigned a weight of 0.18, which keeps decreasing exponentially for the subsequent values.

For the ratio feature: R′t = α*Rt-1 + (1−α)*R′t-1, where R′t is the predicted value at time t.

For the previously known values: P′t = α*Pt-1 + (1−α)*P′t-1
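A sketch of the exponential variant (`ewma_predict` is a hypothetical helper; seeding the first prediction with the first observation is an assumption, and the notebook may initialize differently):

```python
def ewma_predict(series, alpha=0.3):
    # P'_t = alpha * P_{t-1} + (1 - alpha) * P'_{t-1}
    preds = [series[0]]  # seed the first prediction with the first observation
    for t in range(1, len(series)):
        preds.append(alpha * series[t - 1] + (1 - alpha) * preds[-1])
    return preds

print(ewma_predict([100, 110, 120], alpha=0.5))  # -> [100, 100.0, 105.0]
```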

## Result:

**Please Note:-** The above comparisons are made using Jan 2015 and Jan 2016 only

From the above matrix, it is inferred that the best forecasting model for our prediction is P′t = α*Pt-1 + (1−α)*P′t-1, i.e. Exponential Moving Averages using the 2016 values.

# F. Feature engineering

- we take the number of pickups that happened in the last five 10-minute bins
- from the baseline models we found that the exponential weighted moving average gives the best error, so we add it at time t as a feature: p′(t) = alpha*p′(t-1) + (1-alpha)*P(t-1)
- we encode each day as Sunday = 0, Monday = 1, Tuesday = 2, Wednesday = 3, Thursday = 4, Friday = 5, Saturday = 6
- latitude
- longitude
- frequency and amplitude at different times of day (peak, morning and evening, etc.) using Fourier transforms

# G. Train-Test Split

Before we start predictions using the tree-based regression models, we take the 3 months of 2016 pickup data and split it such that for every region the first 70% of the data (ordered date-wise) goes to train and the remaining 30% to test.
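That split can be sketched as below (the toy frame and the `split_per_region` helper are illustrative assumptions, not the notebook's code):

```python
import pandas as pd

# hypothetical date-ordered pickup counts for two regions
df = pd.DataFrame({
    "region":   [0] * 10 + [1] * 10,
    "time_bin": list(range(10)) * 2,
    "pickups":  range(20),
})

def split_per_region(df, frac=0.7):
    # for every region, put the first `frac` of time-ordered rows into train
    # and the remainder into test (no shuffling, so time order is preserved)
    train_parts, test_parts = [], []
    for _, g in df.groupby("region"):
        g = g.sort_values("time_bin")
        cut = int(len(g) * frac)
        train_parts.append(g.iloc[:cut])
        test_parts.append(g.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)

train, test = split_per_region(df)
print(len(train), len(test))  # -> 14 6
```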

# H. Implementing model

## Regression Models

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve, GridSearchCV

lr_reg = LinearRegression()
parameters = {'fit_intercept': [True, False], 'normalize': [True, False], 'copy_X': [True, False]}
grid = GridSearchCV(lr_reg, parameters, cv=None)
grid.fit(df_train, tsne_train_output)

print(grid.best_estimator_)
print(grid.best_params_)
```

After running GridSearchCV, we found the optimal parameters.

## Using Random Forest Regressor

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

regr1 = RandomForestRegressor()  # tuned: max_features='sqrt', min_samples_leaf=4, min_samples_split=3, n_estimators=40, n_jobs=-1
param_dist = {"max_depth": [3, None],
              "max_features": ['sqrt', 'log2'],
              "min_samples_split": randint(2, 11),
              "min_samples_leaf": randint(1, 11),
              "n_estimators": [35, 40, 45]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(regr1, param_distributions=param_dist, n_iter=n_iter_search)
random_search.fit(df_train, tsne_train_output)
print(random_search.best_params_)
```

After running the randomized search, we found the optimal parameters.

The feature importances based on the random forest:

Observation: it is found that the pickup count from 2 time steps back is a very important feature, and longitude is the least important.

## Using XgBoost Regressor

Training a hyper-parameter tuned Xg-Boost regressor on our train data.

Find more about XGBRegressor function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#module-xgboost.sklearn

```python
from scipy.stats import randint
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

x_model = xgb.XGBRegressor()
param_dist = {"max_depth": [3, 4, 5],
              'n_estimators': randint(400, 600),
              "min_child_weight": [3, 4, 5, 6],
              "gamma": [0, 0.1, 0.2],
              "colsample_bytree": [0.7, 0.8, 0.9],
              "nthread": [3, 4, 5]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(x_model, param_distributions=param_dist, n_iter=n_iter_search)
random_search.fit(df_train, tsne_train_output)
print(random_search.best_params_)
```

After fitting XGBoost, we found the optimal parameters.

The feature importances based on XGBoost:

Observation: it is found that the pickup count from 4 time steps back is a very important feature, and the one from 2 time steps back is the least important.

# Model Comparison

## Observation:

Random Forest Regression seems to be the best model, with the train MAPE decreasing below 12%. There is no strong sign of overfitting or underfitting, although the Random Forest model seems slightly overfit. All models have a test MAPE in the range of 12.6% to 13.6%. If we want to avoid overfitting, XGBoost is the best model.

# Solution Procedure:

1) Problem formulation
2) Data cleaning
3) EDA
4) Data preparation (clustering/segmentation)
5) Feature engineering (time series and Fourier transforms, frequency and amplitude at different time steps, latitude, longitude, etc.)
6) Modelling: baseline models (simple, weighted and exponential moving averages)
7) Split dataset into train and test
8) Implementing models

Task 1: Incorporate Fourier features into the regression models and measure MAPE.

Task 2: Perform hyper-parameter tuning for the regression models. 2a. Linear Regression: grid search. 2b. Random Forest: random search. 2c. XGBoost: random search.

Task 3: Explore more time-series features using Google search/Quora/Stackoverflow to reduce the MAPE to < 12%.
