# Gradient Boosted Decision Tree (GBDT)
Implement a Gradient Boosted Decision Tree (GBDT) with TensorFlow. This example uses the Boston Housing dataset for training. It supports both classification (2 classes: home value of at least $23,000, or not) and regression (raw home value as the target).

- Author: Aymeric Damien
- Project: https://github.com/aymericdamien/TensorFlow-Examples/
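
For readers new to boosting, the following is a minimal NumPy sketch of the general idea for a squared-error loss: each new tree is fitted to the residuals of the current ensemble. It is not TensorFlow's implementation. The function names, the use of depth-1 trees (stumps) on a single feature, and the toy sine data are all assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 tree: one threshold on a 1-D feature, one value per side."""
    best = None
    # Candidate thresholds: every distinct value except the largest, so that
    # both sides of the split are non-empty (x needs >= 2 distinct values).
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosted_stumps(x, y, n_trees=10, learning_rate=1.0):
    base = y.mean()                        # initial constant prediction
    prediction = np.full(len(y), base)
    stumps = []
    for _ in range(n_trees):
        residual = y - prediction          # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, learning_rate

def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Toy data (hypothetical, not the Boston Housing dataset).
x_toy = np.linspace(0.0, 1.0, 50)
y_toy = np.sin(2.0 * np.pi * x_toy)
model = fit_boosted_stumps(x_toy, y_toy)
mse_baseline = np.mean((y_toy - y_toy.mean()) ** 2)
mse_boosted = np.mean((y_toy - predict_boosted(model, x_toy)) ** 2)
print(mse_boosted < mse_baseline)  # True
```

The `n_trees` and `learning_rate` arguments play the same role as the parameters of the same name used in the rest of this example; real GBDT implementations grow deeper trees over many features and add regularization.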

## Boston Housing Dataset

**Link:** https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

**Description:**

The dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston), and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

The data was originally published in Harrison, D. and Rubinfeld, D.L., "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, pp. 81-102, 1978.

*For the full feature list, please see the link above.*

In [1]:
from __future__ import print_function

# Ignore all GPUs (current TF GBDT does not support GPU).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "1"

import tensorflow as tf
import numpy as np
import copy

In [2]:
# Dataset parameters.
num_classes = 2 # Two classes: home value of at least $23,000, or below (see the label conversion below).
num_features = 13 # Number of input features per example.

# Training parameters.
max_steps = 2000
batch_size = 256
learning_rate = 1.0
l1_regul = 0.0
l2_regul = 0.1

# GBDT parameters.
num_batches_per_layer = 1000
num_trees = 10
max_depth = 4
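
As far as the TensorFlow documentation describes it, `n_batches_per_layer` is the number of batches over which statistics are collected before a tree layer is grown. As a rough sanity check on the values above, this sketch computes how many batches one pass over the training data spans. The figure of 404 training examples is an assumption based on the default Keras split of this dataset.

```python
import math

train_size = 404   # assumption: default Keras training split of Boston Housing
batch_size = 256   # same value as in the parameter cell

# Number of batches needed to see every training example once.
batches_per_epoch = math.ceil(train_size / batch_size)
print(batches_per_epoch)  # 2
```

With `num_batches_per_layer = 1000`, each layer therefore aggregates statistics over many repeated passes of this small dataset. Whether a smaller value would train equally well is not tested here.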

In [3]:
# Prepare Boston Housing Dataset.
from tensorflow.keras.datasets import boston_housing
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

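# For classification, build 2 classes: price of at least $23,000 (label 1)
# or below (label 0). Prices in this dataset are expressed in units of $1,000.
def to_binary_class(y):
    # Overwrites y in place, so callers pass a copy to keep the raw prices.
    for i, label in enumerate(y):
        if label >= 23.0:
            y[i] = 1
        else:
            y[i] = 0
    return y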

y_train_binary = to_binary_class(copy.deepcopy(y_train))
y_test_binary = to_binary_class(copy.deepcopy(y_test))
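
The same thresholding can also be written without an explicit loop. The sketch below is one possible vectorized variant; the name `to_binary_class_vectorized` and the sample prices are hypothetical. It returns a new array rather than modifying its input, so no `copy.deepcopy` is needed.

```python
import numpy as np

def to_binary_class_vectorized(y, threshold=23.0):
    """Return a new array with 1.0 where y >= threshold and 0.0 elsewhere."""
    return (np.asarray(y) >= threshold).astype(np.float64)

prices = np.array([15.2, 23.0, 42.3, 22.9])  # made-up prices in units of $1,000
labels = to_binary_class_vectorized(prices)
print(labels)  # [0. 1. 1. 0.]
```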

### GBDT Classifier

In [4]:
# Build the input function.
train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
 x={'x': x_train}, y=y_train_binary,
 batch_size=batch_size, num_epochs=None, shuffle=True)
test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
 x={'x': x_test}, y=y_test_binary,
 batch_size=batch_size, num_epochs=1, shuffle=False)
test_train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
 x={'x': x_train}, y=y_train_binary,
 batch_size=batch_size, num_epochs=1, shuffle=False)
# GBDT models from TF Estimator require the data to be described with feature columns.
feature_columns = [tf.feature_column.numeric_column(key='x', shape=(num_features,))]
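
To make the `num_epochs` and `shuffle` arguments above more concrete, here is a small NumPy sketch of the batching behaviour they describe. It is an illustration only, not how `numpy_input_fn` is implemented; the generator `iterate_batches` and the zero-filled stand-in arrays are hypothetical.

```python
import numpy as np

def iterate_batches(features, labels, batch_size, num_epochs=1, shuffle=False, seed=0):
    """Yield (features, labels) batches; num_epochs=None repeats forever."""
    rng = np.random.default_rng(seed)
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        n = len(features)
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield features[idx], labels[idx]
        epoch += 1

# Stand-in data with the assumed shape of the training split (404 x 13).
x_demo = np.zeros((404, 13))
y_demo = np.zeros(404)

# One pass, in order: the setting used for the evaluation input functions.
eval_batch_sizes = [len(y) for _, y in iterate_batches(x_demo, y_demo, batch_size=256)]
print(eval_batch_sizes)  # [256, 148]
```

With `num_epochs=None` the generator never ends, which is why the training call needs `max_steps` to stop.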

In [5]:
gbdt_classifier = tf.estimator.BoostedTreesClassifier(
 n_batches_per_layer=num_batches_per_layer,
 feature_columns=feature_columns, 
 n_classes=num_classes,
 learning_rate=learning_rate, 
 n_trees=num_trees,
 max_depth=max_depth,
 l1_regularization=l1_regul, 
 l2_regularization=l2_regul
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': ClusterSpec({}), '_model_dir': '/tmp/tmp5h6BoR', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_session_config': allow_soft_placement: true
graph_options {
 rewrite_options {
 meta_optimizer_iterations: ONE
 }
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_session_creation_timeout_secs': 7200, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_experimental_max_worker_delay_secs': None, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_master': ''}


In [6]:
gbdt_classifier.train(train_input_fn, max_steps=max_steps)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp5h6BoR/model.ckpt.
'_Resource' object has no attribute 'name'
INFO:tensorflow:loss = 0.6931475, step = 0
INFO:tensorflow:loss = 0.6931475, ste



In [7]:
gbdt_classifier.evaluate(test_train_input_fn)

INFO:tensorflow:Calling model_fn.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Instructions for updating:
The value of AUC returned by this may race with the update so this is deprected. Please use tf.keras.metrics.AUC instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-07-15T00:50:36Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp5h6BoR/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.56490s
INFO:tensorflow:Finished evaluation at 2020-07-15-00:50:37
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.87376237, accuracy_baseline = 0.63118815, auc = 0.92280567, auc_precision_recall = 0.9104949, average_loss = 0.38236493, global_step = 2000, label/mean = 0.36881188, loss = 0.38619137, precision = 0.8888889, prediction/mean = 0.378958, recall = 0.7516779
'_Resource' object has no at

{'accuracy': 0.87376237,
 'accuracy_baseline': 0.63118815,
 'auc': 0.92280567,
 'auc_precision_recall': 0.9104949,
 'average_loss': 0.38236493,
 'global_step': 2000,
 'label/mean': 0.36881188,
 'loss': 0.38619137,
 'precision': 0.8888889,
 'prediction/mean': 0.378958,
 'recall': 0.7516779}
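
A quick way to read these numbers is to compare `accuracy` with `accuracy_baseline`, the accuracy obtained by always predicting the majority class. The sketch below recomputes that baseline from `label/mean`; the values are copied from the output above, and the variable names are chosen here for illustration.

```python
# Values copied from the training-set evaluation output above.
label_mean = 0.36881188
accuracy = 0.87376237
reported_baseline = 0.63118815

# Always predicting the majority class is right for max(p, 1 - p) of the examples.
majority_baseline = max(label_mean, 1.0 - label_mean)
print(round(majority_baseline, 4))             # 0.6312
print(round(accuracy - majority_baseline, 4))  # 0.2426
```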

In [8]:
gbdt_classifier.evaluate(test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-07-15T00:50:38Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp5h6BoR/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.56883s
INFO:tensorflow:Finished evaluation at 2020-07-15-00:50:38
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.78431374, accuracy_baseline = 0.5588235, auc = 0.8458089, auc_precision_recall = 0.86285317, average_loss = 0.49404, global_step = 2000, label/mean = 0.44117647, loss = 0.49404, precision = 0.87096775, prediction/mean = 0.37467176, recall = 0.6
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: /tmp/tmp5h6BoR/model.ckpt-2000


{'accuracy': 0.78431374,
 'accuracy_baseline': 0.5588235,
 'auc': 0.8458089,
 'auc_precision_recall': 0.86285317,
 'average_loss': 0.49404,
 'global_step': 2000,
 'label/mean': 0.44117647,
 'loss': 0.49404,
 'precision': 0.87096775,
 'prediction/mean': 0.37467176,
 'recall': 0.6}
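
The reported precision, recall, and label mean are enough to reconstruct the confusion matrix on the test set. The sketch below does this arithmetic, assuming the test split holds 102 examples (the default Keras split); that count does not appear in the output above.

```python
# Values copied from the test-set evaluation output above.
n_examples = 102           # assumption: size of the default Keras test split
label_mean = 0.44117647
precision = 0.87096775
recall = 0.6

positives = round(label_mean * n_examples)    # actual class-1 examples
true_pos = round(recall * positives)          # class-1 examples predicted as 1
predicted_pos = round(true_pos / precision)   # all class-1 predictions
false_pos = predicted_pos - true_pos
false_neg = positives - true_pos
true_neg = n_examples - positives - false_pos

print(true_pos, false_pos, false_neg, true_neg)      # 27 4 18 53
print(round((true_pos + true_neg) / n_examples, 4))  # 0.7843
```

The recomputed accuracy matches the reported `accuracy` value, which supports the assumed example count.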

### GBDT Regressor

In [9]:
# Build the input function.
train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
 x={'x': x_train}, y=y_train,
 batch_size=batch_size, num_epochs=None, shuffle=True)
test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
 x={'x': x_test}, y=y_test,
 batch_size=batch_size, num_epochs=1, shuffle=False)
# GBDT models from TF Estimator require the data to be described with feature columns.
feature_columns = [tf.feature_column.numeric_column(key='x', shape=(num_features,))]

In [10]:
gbdt_regressor = tf.estimator.BoostedTreesRegressor(
 n_batches_per_layer=num_batches_per_layer,
 feature_columns=feature_columns, 
 learning_rate=learning_rate, 
 n_trees=num_trees,
 max_depth=max_depth,
 l1_regularization=l1_regul, 
 l2_regularization=l2_regul
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': ClusterSpec({}), '_model_dir': '/tmp/tmpts3Kmu', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_session_config': allow_soft_placement: true
graph_options {
 rewrite_options {
 meta_optimizer_iterations: ONE
 }
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_session_creation_timeout_secs': 7200, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_experimental_max_worker_delay_secs': None, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_master': ''}


In [11]:
gbdt_regressor.train(train_input_fn, max_steps=max_steps)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpts3Kmu/model.ckpt.
'_Resource' object has no attribute 'name'
INFO:tensorflow:loss = 584.82294, step = 0
INFO:tensorflow:loss = 560.2794, step = 0 (0.369 sec)
INFO:tensorflow:loss = 606.68115, step = 0 (0.156 sec)
INFO:tensorflow:loss = 583.2771, step = 0 (0.155 sec)
INFO:tensorflow:loss = 603.4647, step = 0 (0.160 sec)
INFO:tensorflow:loss = 605.8213, step = 0 (0.153 sec)
INFO:tensorflow:loss = 577.5599, step = 0 (0.157 sec)
INFO:tensorflow:loss = 585.297, step = 0 (0.157 sec)
INFO:tensorflow:loss = 545.26074, step = 0 (0.156 sec)
INFO:tensorflow:loss = 597.91046, step = 0 (0.190 sec)
INFO:tensorflow:loss = 600.553



In [12]:
gbdt_regressor.evaluate(test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-07-15T00:50:45Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpts3Kmu/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.24467s
INFO:tensorflow:Finished evaluation at 2020-07-15-00:50:45
INFO:tensorflow:Saving dict for global step 2000: average_loss = 30.202602, global_step = 2000, label/mean = 23.078432, loss = 30.202602, prediction/mean = 22.536291
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: /tmp/tmpts3Kmu/model.ckpt-2000


{'average_loss': 30.202602,
 'global_step': 2000,
 'label/mean': 23.078432,
 'loss': 30.202602,
 'prediction/mean': 22.536291}
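
To put the regression loss on the same scale as the prices, one can take its square root. The sketch below assumes `average_loss` is the mean squared error (the usual default for a regression head) and that targets are in units of $1,000; the numbers are copied from the output above.

```python
import math

average_loss = 30.202602   # copied from the evaluation output above
label_mean = 23.078432     # mean target, in units of $1,000

# Root mean squared error, in the same units as the targets.
rmse = math.sqrt(average_loss)
print(round(rmse, 2))               # 5.5
print(round(rmse / label_mean, 2))  # 0.24
```

Under those assumptions, the typical error is about $5,500, or roughly a quarter of the mean home value in the test split.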