
# Regression analysis using gradient boosting regression tree

Technical Articles

Nov 1, 2021
Shoichiro Yokotani, Application Development Expert
AI Platform division

Machine learning algorithms are generally categorized as supervised learning or unsupervised learning. Supervised learning is used to predict output values for given inputs. "Supervised" learning requires a dataset of input-output pairs: the model is trained on these pairs, and the trained model then predicts outputs for new inputs.

Supervised learning is further divided into two types: regression analysis and classification. Regression analysis uses machine learning to find the parameters that determine the relationship between input x and output y, so that y can be predicted from x. Classification analyzes a dataset whose output y is discrete; the output does not have to be continuous as in regression analysis. For example, classification can be used to determine the likelihood that a customer purchases a product based on customer information (annual income, family structure, age, address, etc.).
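The distinction above can be illustrated with a minimal sketch: the same inputs can feed either a regression or a classification model, depending on whether the target is continuous or discrete. The toy data below (age and income-like features) is hypothetical and not from the article's dataset.

```python
# Toy illustration: regression predicts a continuous target,
# classification predicts a discrete class label.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[20, 3.0], [35, 5.5], [50, 8.0], [65, 10.5]]  # hypothetical features
y_cont = [100.0, 210.0, 320.0, 430.0]              # continuous target -> regression
y_disc = [0, 0, 1, 1]                              # discrete target   -> classification

reg = DecisionTreeRegressor().fit(X, y_cont)
clf = DecisionTreeClassifier().fit(X, y_disc)

print(reg.predict([[40, 6.0]]))  # a continuous value
print(clf.predict([[40, 6.0]]))  # a class label, 0 or 1
```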

Beyond the regression/classification distinction, there are various machine learning algorithms based on mathematical and statistical methods. Which algorithm is best for an analysis depends on the nature of the dataset and the complexity of the model, so it is important to understand the characteristics of each algorithm and select accordingly.

This column introduces the following analysis methods.

(1) Supervised learning: regression analysis

(2) Machine learning algorithm: gradient boosting regression tree

Gradient boosting regression trees are based on the idea of an ensemble method built from decision trees. A decision tree uses a tree structure: starting from the root, it branches according to conditions and heads toward the leaves, and the leaf reached is the prediction result. A single decision tree has the disadvantage of overfitting the training data if its hierarchy is too deep. The ensemble method is applied to decision trees as a means of preventing this overfitting: it uses a combination of multiple decision trees rather than a single one. Random forests and gradient boosting are typical such algorithms.
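The overfitting behavior of a single deep tree can be seen directly: on noisy synthetic data (a sketch, not the article's dataset), an unrestricted tree reproduces every training point exactly, while a depth-limited tree gives a smoother fit.

```python
# Sketch: a fully grown decision tree memorizes noisy training data
# (train R^2 = 1.0), while a depth-limited tree cannot.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)  # signal + noise

deep = DecisionTreeRegressor(max_depth=None).fit(X, y)     # unrestricted depth
shallow = DecisionTreeRegressor(max_depth=3).fit(X, y)     # restricted depth

print(deep.score(X, y))     # 1.0: every training point fitted exactly
print(shallow.score(X, y))  # below 1.0: the noise is not memorized
```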

A random forest creates multiple decision trees by randomly resampling the dataset. It prevents overfitting by running the prediction through every individual decision tree and averaging the regression results.
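The averaging step can be verified on synthetic data: scikit-learn's forest prediction is the mean of the predictions of its individual trees (a sketch with default settings, not the article's experiment).

```python
# Sketch: a random forest's prediction equals the average of the
# predictions of its randomized member trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(1)
X = rng.uniform(0, 6, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

x_new = [[2.5]]
mean_of_trees = np.mean([t.predict(x_new)[0] for t in forest.estimators_])
print(forest.predict(x_new)[0], mean_of_trees)  # the two values agree
```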

Gradient boosting, on the other hand, builds decision trees one at a time, with each new tree fitted to correct the errors of the trees before it. Compared to a random forest, the results are more sensitive to the parameter settings used during training. With well-chosen parameters, however, it achieves better test results than a random forest.
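The error-correcting loop can be sketched from scratch. The version below (a simplified illustration, not scikit-learn's or Frovedis's implementation) uses one-split stumps instead of full trees: each round, a stump is fitted to the current residuals and added to the ensemble scaled by learning_rate.

```python
# From-scratch sketch of boosting: each new "tree" (a one-split stump here,
# for brevity) is fitted to the residuals of the ensemble built so far.
def fit_stump(xs, residuals):
    # Find the single threshold split minimizing squared error.
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    for _ in range(n_estimators):
        pred = [base + sum(learning_rate * s(x) for s in stumps) for x in xs]
        residuals = [y - p for y, p in zip(ys, pred)]  # remaining errors
        stumps.append(fit_stump(xs, residuals))        # fit next tree to them
    return lambda x: base + sum(learning_rate * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])  # close to ys
```

With squared loss, fitting a tree to the residuals is exactly a gradient step, which is where the name "gradient boosting" comes from.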

Both random forests and gradient boosting are implemented in scikit-learn, and both learning algorithms can be used in Frovedis as well. There is no parallel processing in the scikit-learn version of gradient boosting, whereas Frovedis can process the creation of individual decision trees in parallel, so training time can be reduced compared to scikit-learn for very large datasets.
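Frovedis is designed to mirror the scikit-learn estimator interface, so switching between the two is largely a matter of changing the import and starting the Frovedis server. A minimal sketch follows; the Frovedis lines are commented out and their module paths and server call are assumptions based on that scikit-learn-compatible API, while the scikit-learn version runs anywhere. The feature values and prices are hypothetical stand-ins for the article's used-car data.

```python
# scikit-learn version (runnable as-is). The Frovedis equivalent differs
# only in the import and server start-up; those lines are assumptions.
from sklearn.ensemble import GradientBoostingRegressor
# from frovedis.exrpc.server import FrovedisServer                 # assumption
# from frovedis.mllib.ensemble import GradientBoostingRegressor   # assumption
# FrovedisServer.initialize("mpirun -np 8 ... frovedis_server")   # assumption

# Hypothetical rows: [year, mileage, displacement] -> price
X = [[2015, 30000, 1.5], [2012, 80000, 1.3],
     [2018, 10000, 2.0], [2010, 120000, 1.8]]
y = [12000.0, 6500.0, 18000.0, 4000.0]

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1).fit(X, y)
print(model.predict([[2016, 25000, 1.5]]))
```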

This column presents a demonstration of gradient boosting regression analysis with Frovedis.

This demo uses Kaggle's used car sales price dataset.

The procedure is as follows.

--Integrating the CSV data stored per brand into one table

--Selecting the year of manufacture, mileage, and engine displacement of the used car as input data

--Selecting the used car price as output data

--Displaying prediction results on both training data and test data while varying the learning_rate and n_estimators parameters

The prediction is then performed again using the parameters that gave the best results. learning_rate controls the degree to which each decision tree corrects the mistakes of the previous decision trees; n_estimators controls the number of decision trees.
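The sweep described above can be sketched as a small grid search, comparing train and test scores to spot overfitting. The data here is a synthetic stand-in with three features, mirroring the article's year/mileage/displacement inputs; the parameter values are illustrative, not the article's.

```python
# Sketch of the parameter sweep: vary learning_rate and n_estimators,
# record train and test R^2, and pick the best test score.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (300, 3))                    # synthetic features
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 300) # third feature is noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for lr in (0.01, 0.1, 0.5):
    for n in (50, 200):
        m = GradientBoostingRegressor(learning_rate=lr, n_estimators=n,
                                      random_state=0).fit(X_tr, y_tr)
        results[(lr, n)] = (m.score(X_tr, y_tr), m.score(X_te, y_te))
        print(lr, n, results[(lr, n)])

best = max(results, key=lambda k: results[k][1])  # best test R^2
print("best (learning_rate, n_estimators):", best)
```

A large gap between train and test score for a setting indicates overfitting; a low train score indicates underfitting (e.g. too few trees at a small learning rate).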

[Figure: used_car_regression]

As shown in this column, varying individual parameters and checking the test accuracy and degree of overfitting is one way to obtain the best learning model, but repeatedly training on large datasets is time consuming.
By using the parallel processing algorithms of SX-Aurora TSUBASA and Frovedis, it is possible to create a high-performance learning model at lower cost.