Feature Engineering using Featuretools with code
Feature engineering, also known as feature creation, is the process of constructing new features from existing data to train a machine learning model. Typically, feature engineering is a drawn-out manual process, relying on domain knowledge, intuition, and data manipulation. This process can be extremely tedious, and the final features will be limited both by human subjectivity and time. Automated feature engineering aims to help the data scientist by automatically creating many candidate features out of a dataset, from which the best can be selected and used for training.
Fortunately, featuretools is exactly the solution we are looking for. This open-source Python library will automatically create many features from a set of related tables. Featuretools is based on a method known as “Deep Feature Synthesis”.
Deep feature synthesis stacks multiple transformation and aggregation operations (which are called feature primitives in the vocab of featuretools) to create features from data spread across many tables. Like most ideas in machine learning, it’s a complex method built on a foundation of simple concepts. By learning one building block at a time, we can form a good understanding of this powerful method.
First, let’s take a look at our example data. The complete collection of tables is as follows:
- clients: basic information about clients at a credit union. Each client has only one row in this dataframe.
- loans: loans made to the clients. Each loan has only one row in this dataframe, but clients may have multiple loans.
- payments: payments made on the loans. Each payment has only one row, but each loan will have multiple payments.
If we have a machine learning task, such as predicting whether a client will repay a future loan, we will want to combine all the information about clients into a single table. The tables are related (through the client_id and loan_id variables), and we could use a series of transformations and aggregations to do this process by hand. However, we will shortly see that we can instead use featuretools to automate the process.
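To make the manual approach concrete, here is a minimal sketch of the hand-rolled aggregation that featuretools automates. The toy dataframes and column names (loan_amount, joined) are hypothetical stand-ins for the tables described above:

```python
import pandas as pd

# Hypothetical toy versions of the clients and loans tables
clients = pd.DataFrame({'client_id': [1, 2],
                        'joined': ['2012-01-01', '2013-06-15']})
loans = pd.DataFrame({'client_id': [1, 1, 2],
                      'loan_id': [10, 11, 12],
                      'loan_amount': [5000, 3000, 8000]})

# By hand: aggregate loan statistics per client, then join back to clients
loan_stats = loans.groupby('client_id')['loan_amount'].agg(['mean', 'max', 'sum'])
loan_stats.columns = ['mean_loan_amount', 'max_loan_amount', 'total_loan_amount']
clients_features = clients.merge(loan_stats.reset_index(),
                                 on='client_id', how='left')
print(clients_features)
```

Every new statistic (median, count, time since last loan, and so on) would need another line like this; featuretools enumerates such combinations for us.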
Featuretools
Featuretools is an open source library for performing automated feature engineering. It is a great tool designed to fast-forward the feature generation process, thereby giving more time to focus on other aspects of machine learning model building. In other words, it makes your data “machine learning ready”.
Before taking Featuretools for a spin, there are three major components of the package that we should be aware of:
- Entities
- Deep Feature Synthesis (DFS)
- Feature primitives
a) An Entity can be considered as a representation of a Pandas DataFrame. A collection of multiple entities is called an Entityset.
b) Deep Feature Synthesis (DFS) is actually a Feature Engineering method and is the backbone of Featuretools. It enables the creation of new features from single, as well as multiple dataframes.
c) DFS creates features by applying Feature primitives to the Entity-relationships in an EntitySet. These primitives are the often-used methods to generate features manually. For example, the primitive “mean” would find the mean of a variable at an aggregated level.
Implementation of Featuretools
The objective is to build a predictive model to estimate the sales of each product at a particular store. This would help the decision-makers at BigMart to find out the properties of any product or store, which play a key role in increasing the overall sales.
Now we can start using Featuretools to perform automated feature engineering! It is necessary to have a unique identifier feature in the dataset (our dataset doesn’t have any right now). So, we will create one unique ID for our combined dataset. If you notice, we have two IDs in our data — one for the item and another for the outlet. So, simply concatenating both will give us a unique ID.
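A minimal sketch of the ID construction, using a hypothetical three-row slice of the combined BigMart dataframe (the real data has many more rows and columns):

```python
import pandas as pd

# Hypothetical slice of the combined BigMart dataframe
combi = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDA15'],
    'Outlet_Identifier': ['OUT049', 'OUT018', 'OUT018'],
})

# Concatenating the two identifiers yields a unique row-level ID
combi['id'] = combi['Item_Identifier'] + combi['Outlet_Identifier']
print(combi['id'].tolist())
```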
It seems Item_Fat_Content contains only two categories, i.e., “Low Fat” and “Regular” — the rest of them we will consider redundant. So, let’s convert it into a binary variable.
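A sketch of the clean-up, assuming the variant spellings commonly seen in the BigMart data (“LF”, “reg”, “low fat”):

```python
import pandas as pd

# The raw column encodes the same two categories under several spellings
fat = pd.Series(['Low Fat', 'Regular', 'LF', 'reg', 'low fat'])

# Collapse the variants, then map to a 0/1 indicator
fat = fat.replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})
fat_binary = fat.map({'Low Fat': 0, 'Regular': 1})
print(fat_binary.tolist())  # [0, 1, 0, 1, 0]
```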
Next, we will have to create an EntitySet. An EntitySet is a structure that contains multiple dataframes and the relationships between them. So, let’s create an EntitySet and add the dataframe combination to it.
Our data contains information at two levels — item level and outlet level. Featuretools offers a functionality to split a dataset into multiple tables. We have created a new table ‘outlet’ from the BigMart table based on the outlet ID Outlet_Identifier.
As you can see above, it contains two entities — bigmart and outlet. There is also a relationship formed between the two tables, connected by Outlet_Identifier. This relationship will play a key role in the generation of new features.
Now we will use Deep Feature Synthesis to create new features automatically. Recall that DFS uses Feature Primitives to create features using multiple tables present in the EntitySet.
target_entity is nothing but the entity ID for which we wish to create new features (in this case, it is the entity ‘bigmart’). The parameter max_depth controls the complexity of the features being generated by stacking the primitives.
DFS created 29 new features in a matter of seconds; doing this manually would have taken far longer. Featuretools also works on datasets with multiple interrelated tables. In that case, you wouldn’t have to normalize a table, as the multiple tables would already be available.
There is one issue with this dataframe — it is not sorted properly. We will have to sort it based on the id variable from the combi dataframe.
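A minimal sketch of the reordering with `DataFrame.reindex`, using hypothetical ids standing in for the combi dataframe’s id column:

```python
import pandas as pd

# Hypothetical feature matrix whose rows came back in a different order
feature_matrix = pd.DataFrame(
    {'Item_MRP': [48.3, 141.6, 249.8]},
    index=pd.Index(['DRC01OUT018', 'FDA15OUT018', 'FDA15OUT049'], name='id'))

# Order of ids in the original combi dataframe (hypothetical)
original_ids = ['FDA15OUT049', 'DRC01OUT018', 'FDA15OUT018']

# reindex restores the original row order so features line up with the target
feature_matrix = feature_matrix.reindex(original_ids)
print(feature_matrix.index.tolist())
```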
Let’s Build a Model on the Engineered Features
It is time to check how useful these generated features actually are. We will use them to build a model and predict Item_Outlet_Sales. Since our final data (feature_matrix) has many categorical features, I decided to use the CatBoost algorithm. It can use categorical features directly and is scalable in nature. You can refer to this article to read more about CatBoost.
CatBoost requires all the categorical variables to be in the string format. So, we will convert the categorical variables in our data to string first:
The features created by Featuretools are not just random features; they are valuable and useful. Most importantly, the amount of time it saves in feature engineering is incredible. Making our data science solutions interpretable is a very important aspect of performing machine learning. The featuretools package is truly a game-changer in machine learning, even though its applications are understandably still limited in industry use cases.
============Thanks==============
Code: https://github.com/ranasingh-gkp/Feature_engineering_Featuretools
Reference:
https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219