# Create and Load TFRecords

A simple TensorFlow example that converts a dataset to the TFRecord format and then reads it back.

This example uses the Titanic dataset (in CSV format) as a toy dataset: every feature is serialized to the TFRecord format, and an input pipeline that can be used to train models is then built on top of the resulting file.

- Author: Aymeric Damien
- Project: https://github.com/aymericdamien/TensorFlow-Examples/

## Titanic Dataset

The Titanic dataset is a popular ML dataset that lists the passengers aboard the Titanic, along with features such as their age, sex, and class (1st, 2nd, 3rd), and whether each passenger survived the disaster.

It shows that, although luck played a part in surviving the sinking, some groups were more likely to survive than others, such as women, children, and upper-class passengers.

#### Overview
survived|pclass|name|sex|age|sibsp|parch|ticket|fare
--------|------|----|---|---|-----|-----|------|----
1|1|"Allen, Miss. Elisabeth Walton"|female|29|0|0|24160|211.3375
1|1|"Allison, Master. Hudson Trevor"|male|0.9167|1|2|113781|151.5500
0|1|"Allison, Miss. Helen Loraine"|female|2|1|2|113781|151.5500
0|1|"Allison, Mr. Hudson Joshua Creighton"|male|30|1|2|113781|151.5500
...|...|...|...|...|...|...|...|...


#### Variable Descriptions
```
survived        Survived
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
```
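
As a quick illustration of how the columns described above map to Python types, here is a minimal sketch that parses two rows copied from the overview table. It assumes the CSV is comma-separated with a header row matching the table columns. `SAMPLE_CSV` and `parse_row` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import csv
import io

# Header plus two passengers copied from the overview table.
# Assumption: the real file uses the same comma-separated layout.
SAMPLE_CSV = '''survived,pclass,name,sex,age,sibsp,parch,ticket,fare
1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500
'''


def parse_row(row):
    """Convert one CSV row (a list of strings) into typed Python values."""
    survived, pclass, name, sex, age, sibsp, parch, ticket, fare = row
    # Integers for counts and labels, floats for age and fare,
    # plain strings for name, sex and ticket.
    return (int(survived), int(pclass), name, sex, float(age),
            int(sibsp), int(parch), ticket, float(fare))


reader = csv.reader(io.StringIO(SAMPLE_CSV), skipinitialspace=True)
next(reader)  # Skip the header row.
rows = [parse_row(r) for r in reader]
for r in rows:
    print(r)
```

Note that `ticket` stays a string: some ticket numbers contain letters, so casting it to an integer would fail on those rows.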

In [None]:
from __future__ import absolute_import, division, print_function

import csv
import requests
import tensorflow as tf

In [None]:
# Download Titanic dataset (in csv format).
d = requests.get("https://raw.githubusercontent.com/tflearn/tflearn.github.io/master/resources/titanic_dataset.csv")
with open("titanic_dataset.csv", "wb") as f:
    f.write(d.content)

### Create TFRecords

In [None]:
# Generate Integer Features.
def build_int64_feature(data):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[data]))

# Generate Float Features.
def build_float_feature(data):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[data]))

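# Generate String Features.
# BytesList only accepts bytes, so text is encoded (UTF-8) before being stored.
def build_string_feature(data):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[tf.compat.as_bytes(data)]))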

# Generate a TF `Example`, parsing all features of the dataset.
def convert_to_tfexample(survived, pclass, name, sex, age, sibsp, parch, ticket, fare):
    return tf.train.Example(
        features=tf.train.Features(
            feature={
                'survived': build_int64_feature(survived),
                'pclass': build_int64_feature(pclass),
                'name': build_string_feature(name),
                'sex': build_string_feature(sex),
                'age': build_float_feature(age),
                'sibsp': build_int64_feature(sibsp),
                'parch': build_int64_feature(parch),
                'ticket': build_string_feature(ticket),
                'fare': build_float_feature(fare),
            })
    )

In [None]:
# Open dataset file.
with open("titanic_dataset.csv") as f:
    # Output TFRecord file.
    with tf.io.TFRecordWriter("titanic_dataset.tfrecord") as w:
        # Generate a TF Example for each row in our dataset.
        # CSV reader will read and parse all rows.
        reader = csv.reader(f, skipinitialspace=True)
        for i, record in enumerate(reader):
            # Skip header.
            if i == 0:
                continue
            survived, pclass, name, sex, age, sibsp, parch, ticket, fare = record
            # Parse each csv row to TF Example using the above functions.
            example = convert_to_tfexample(int(survived), int(pclass), name, sex, float(age), int(sibsp), int(parch), ticket, float(fare))
            # Serialize each TF Example to string, and write to TFRecord file.
            w.write(example.SerializeToString())
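
For readers curious about what ends up on disk, the following is a sketch of the TFRecord framing as described in the TensorFlow documentation: each record is stored as a little-endian 8-byte length, a 4-byte masked CRC32C of the length, the payload bytes, and a 4-byte masked CRC32C of the payload. The helpers below (`crc32c`, `masked_crc32c`, `frame_record`, `iter_records`) are hypothetical names written in pure Python for illustration only; `tf.io.TFRecordWriter` takes care of all of this for you.

```python
import struct


def _make_table():
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Plain CRC-32C checksum of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in bytearray(data):
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """Masked CRC: rotate right by 15 bits, then add a constant (mod 2**32)."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload):
    """Wrap a serialized payload in the length + checksum framing."""
    length = struct.pack("<Q", len(payload))
    return (length + struct.pack("<I", masked_crc32c(length))
            + payload + struct.pack("<I", masked_crc32c(payload)))


def iter_records(buf):
    """Yield each payload from a bytes buffer of framed records."""
    offset = 0
    while offset < len(buf):
        length_bytes = buf[offset:offset + 8]
        (length,) = struct.unpack("<Q", length_bytes)
        (length_crc,) = struct.unpack("<I", buf[offset + 8:offset + 12])
        if length_crc != masked_crc32c(length_bytes):
            raise ValueError("corrupted length field")
        start = offset + 12
        payload = buf[start:start + length]
        (data_crc,) = struct.unpack("<I", buf[start + length:start + length + 4])
        if data_crc != masked_crc32c(payload):
            raise ValueError("corrupted payload")
        yield payload
        offset = start + length + 4


# Round trip two dummy payloads (stand-ins for serialized Examples).
blob = frame_record(b"first") + frame_record(b"second record")
records = list(iter_records(blob))
print(records)
```

Each record therefore costs 16 bytes of overhead on top of the serialized `Example`, and the checksums let a reader detect truncated or corrupted files.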

### Load TFRecords

In [None]:
# Build features template, with types.
features = {
    'survived': tf.io.FixedLenFeature([], tf.int64),
    'pclass': tf.io.FixedLenFeature([], tf.int64),
    'name': tf.io.FixedLenFeature([], tf.string),
    'sex': tf.io.FixedLenFeature([], tf.string),
    'age': tf.io.FixedLenFeature([], tf.float32),
    'sibsp': tf.io.FixedLenFeature([], tf.int64),
    'parch': tf.io.FixedLenFeature([], tf.int64),
    'ticket': tf.io.FixedLenFeature([], tf.string),
    'fare': tf.io.FixedLenFeature([], tf.float32),
}

In [None]:
# Create TensorFlow session (this cell uses the TF 1.x graph-mode API).
sess = tf.Session()

# Load TFRecord data.
filenames = ["titanic_dataset.tfrecord"]
data = tf.data.TFRecordDataset(filenames)

# Parse features, using the above template.
def parse_record(record):
    return tf.io.parse_single_example(record, features=features)
# Apply the parsing to each record from the dataset.
data = data.map(parse_record)

# Repeat the dataset indefinitely (it restarts once all records are consumed).
data = data.repeat()
# Shuffle data.
data = data.shuffle(buffer_size=1000)
# Batch data (aggregate records together).
data = data.batch(batch_size=4)
# Prefetch batch (pre-load batch for faster consumption).
data = data.prefetch(buffer_size=1)

# Create an iterator over the dataset.
iterator = data.make_initializable_iterator()
# Initialize the iterator.
sess.run(iterator.initializer)

# Get next data batch.
x = iterator.get_next()
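
The effect of `buffer_size` in the shuffle step above can be easier to see in plain Python. The sketch below mimics the documented behaviour of a shuffle buffer: fill a buffer with `buffer_size` elements, then repeatedly emit a random element and replace it with the next input element. `buffered_shuffle` is a hypothetical helper for illustration, not TensorFlow's actual implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a pseudo-random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random slot and refill it with the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)

# With a buffer of one element, the order is unchanged.
in_order = list(buffered_shuffle(range(10), buffer_size=1))
print(in_order)
```

A small buffer only shuffles locally (the first emitted element always comes from the first `buffer_size` inputs), while a buffer at least as large as the dataset gives a uniform shuffle.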

In [None]:
# Fetch three batches and display them.
for i in range(3):
    print(sess.run(x))
    print("")

{'fare': array([ 35.5   ,  73.5   , 133.65  ,  19.2583], dtype=float32), 'name': array(['Sloper, Mr. William Thompson', 'Davies, Mr. Charles Henry',
       'Frauenthal, Dr. Henry William', 'Baclini, Miss. Marie Catherine'],
      dtype=object), 'age': array([28., 18., 50.,  5.], dtype=float32), 'parch': array([0, 0, 0, 1]), 'pclass': array([1, 2, 1, 3]), 'sex': array(['male', 'male', 'male', 'female'], dtype=object), 'survived': array([1, 0, 1, 1]), 'sibsp': array([0, 0, 2, 2]), 'ticket': array(['113788', 'S.O.C. 14879', 'PC 17611', '2666'], dtype=object)}

{'fare': array([ 18.75 , 106.425,  78.85 ,  90.   ], dtype=float32), 'name': array(['Richards, Mrs. Sidney (Emily Hocking)', 'LeRoy, Miss. Bertha',
       'Cavendish, Mrs. Tyrell William (Julia Florence Siegel)',
       'Hoyt, Mrs. Frederick Maxfield (Jane Anne Forby)'], dtype=object), 'age': array([24., 30., 76., 35.], dtype=float32), 'parch': array([3, 0, 0, 0]), 'pclass': array([2, 1, 1, 1]), 'sex': array(['female', 'female', 'fe