Introduction to Decision Tree Algorithms

Today I’m going to walk through an easy way to understand decision trees. If you follow my other guides you may guess that we’ll be taking a project approach to learning about decision trees, and you’d be right. In this lesson we’re going to build a medical tool that analyzes health data and automatically outputs a recommendation on a user’s health status. First and foremost let’s define a few terms that I’ll discuss in this guide.

  • Decision tree learning is a technique for building a tree of decision rules from data, and it’s one of the most popular tools used by data scientists.
  • Data science is the study of learning from data in order to make better decisions in the future.

Data science is quickly becoming one of the most important sectors in the development community. Therefore it’s important to have a working knowledge of popular big data analysis trends.

Easy Way to Understand Decision Trees

In order to understand decision trees, I think the best approach is to dive right in with an example. A basic decision tree requires a few components.

  1. A definition of the type of data it’s going to analyze.
  2. Training data it can learn trends from.
  3. New data it can use to generate recommendations.

Don’t worry if this seems a little fuzzy; the example we’re going to walk through should clear things up. As usual I’ll walk through this example using the Ruby programming language, however the same principles apply to any other language. I will also be using the Decision Tree RubyGem to process the data.
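If you’d like to follow along, a minimal setup, assuming the library is installed as the decisiontree gem, looks like this:

```ruby
# One-time install from the command line:
#   gem install decisiontree

# Then pull the library into your Ruby script
require 'decisiontree'
```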

Provide Training Data


In order to satisfy the decision tree’s first two requirements, we’re going to supply training data and tell the tree what it’s analyzing. As you can see below, I’ve set up an array of temperature readings, each paired with a status ranging from healthy to dead. I’ve also declared that the attribute we’re looking at is the temperature.
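The guide shows this step as a screenshot that isn’t reproduced here, so the following is a minimal sketch using the decisiontree gem. The temperature values and status labels are illustrative stand-ins chosen to line up with the predictions discussed below; the original screenshot may use different numbers.

```ruby
require 'decisiontree'

# The attribute the tree will analyze
attributes = ['Temperature']

# Training data: each row pairs a temperature reading with a health status
training = [
  [98.1,  'healthy'],
  [98.7,  'healthy'],
  [99.5,  'healthy'],
  [100.8, 'sick'],
  [102.0, 'very sick'],
  [103.0, 'very sick'],
  [107.0, 'dead']
]
```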

Train the Decision Tree


Now that we have training data the algorithm can learn from, we call the train method. This method builds up the decision tree’s knowledge of what is called the domain. In this case, our domain is the health status of users based on their temperature.
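Continuing from the training data above, the training call in the decisiontree gem looks roughly like this; the default label ('sick') and the :continuous flag are my assumptions, the latter because temperature is a numeric value.

```ruby
# Build the tree from the attribute list and the training rows.
# 'sick' is the fallback label and :continuous tells the gem that
# Temperature is numeric, so it should learn threshold-based splits.
dec_tree = DecisionTree::ID3Tree.new(attributes, training, 'sick', :continuous)
dec_tree.train
```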

Generate Predictions Based on Data


With our decision tree trained, it’s now intelligent enough to make predictions. We’ll start by passing in test data, beginning with an easy case that should match the training set. When we pass in a temperature of 98.7, the decision tree properly outputs the value healthy.
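With the sketch above, that first prediction would look something like this (the output formatting is my own):

```ruby
# An easy case: 98.7 appears verbatim in the training data
decision = dec_tree.predict([98.7])
puts "Predicted status: #{decision}"
# => Predicted status: healthy
```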


A More Difficult Prediction


That was an easy one, since 98.7 was already listed in our training data. A decision tree wouldn’t be very helpful if it only gave accurate results when the test data was a perfect match with the training data. Now let’s pass in a temperature of 105.5. After running the program you can see that the decision tree analyzes all of the data and predicts that the user is dead. This result means that the tree is working properly with the knowledge it’s been given.
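Continuing the sketch, the harder prediction is the same call with a value the tree has never seen; with the illustrative training data above it should land in the dead range, though the exact thresholds depend on your training values:

```ruby
# 105.5 never appears in the training data, so the tree has to rely
# on the thresholds it learned between the status labels
decision = dec_tree.predict([105.5])
puts "Predicted status: #{decision}"
# => Predicted status: dead (with the illustrative data above)
```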


Updating the Knowledge Base


Lastly, a decision tree wouldn’t be too useful if it couldn’t continually expand its knowledge base. Here I’ve added the value 106.5 with a status of crazy sick. After building the tree again and running the program, you’ll see that the system has changed its prediction and now reports that the user is crazy sick instead of dead, which means that it’s working properly.
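In the sketch, expanding the knowledge base amounts to appending a new row and rebuilding the tree; with the illustrative values used here, 105.5 should now fall into the new crazy sick range:

```ruby
# Add the new observation, then rebuild the tree so it learns the new label
training << [106.5, 'crazy sick']

dec_tree = DecisionTree::ID3Tree.new(attributes, training, 'sick', :continuous)
dec_tree.train

puts "Predicted status: #{dec_tree.predict([105.5])}"
# => Predicted status: crazy sick (with the illustrative data above)
```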

More Advanced Example


So that was a dead simple example of a decision tree. Now let’s discuss a more complex example that I built last year. I was working with an energy services company that had around a thousand trucks in its fleet. The fleet manager had to manually guesstimate when a truck should be retired. This was an important problem to solve because if a truck was left in the fleet too long, keeping it running would become more expensive than simply buying a new vehicle.

My team built out a decision tree implementation that took dozens of attributes into account (see the sketch after this list), such as:

  • Mileage.
  • Make and model of the vehicle.
  • Maintenance records.
  • Accident history.
  • Along with a number of other fleet criteria.
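
None of the fleet code appears in the guide, so the snippet below is purely a hypothetical illustration of how a multi-attribute version of the same decisiontree setup could be structured. The attribute names, values, and labels are invented for this example, and the real system used far more data and criteria (make and model, maintenance records, and so on).

```ruby
require 'decisiontree'

# Hypothetical numeric fleet attributes; the real criteria and values
# came from the fleet manager's domain expertise
attributes = ['Mileage', 'VehicleAgeYears', 'MaintenanceCostPerYear', 'Accidents']

training = [
  [220_000,  9, 14_500, 2, 'retire'],
  [180_000,  7,  9_800, 1, 'keep'],
  [260_000, 11, 17_200, 3, 'retire'],
  [ 90_000,  3,  4_100, 0, 'keep'],
  [150_000,  6,  7_600, 0, 'keep']
]

dec_tree = DecisionTree::ID3Tree.new(attributes, training, 'keep', :continuous)
dec_tree.train

# Evaluate a truck that's up for review
puts dec_tree.predict([210_000, 8, 13_900, 1])
```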

The Result

The decision tree implementation gave the fleet manager everything he needed to make decisions about his trucks. He was able to input the values for a vehicle he was reviewing for retirement, and the program recommended whether the truck should stay in the fleet. Eventually we also integrated the decision tree into the fleet management software, which enabled automated notifications to the company whenever the tree predicted that a vehicle should be evaluated for retirement.

Machine Learning + Domain Expertise

In addition to being a great exercise in building a decision tree, the experience taught me how effective machine learning can be when it’s combined with domain expertise.

[Diagram: machine learning combined with domain expertise]

This diagram does a great job of showing the key to working with data science. Domain expertise by itself is no longer enough; no matter how much you know about an industry, your mind can’t work through billions of data points to make a decision. At the same time, a data science algorithm is useless if it doesn’t receive accurate information about the domain it’s learning about.

Without the knowledge the fleet manager passed on to our team, we wouldn’t have had a clue how to properly construct the program. However, by combining real-world expertise with a robust machine learning algorithm, we were able to deliver an effective piece of software.

I hope this has given you an easy way to understand decision trees and machine learning algorithms in general. I’ve also included a link in the show notes to the decision tree program we walked through today, so you can download it and play with it yourself.

Resources