# Memmapping

The numpy package makes it possible to memory map large contiguous chunks of binary files as shared memory for all the Python processes running on a given host:

In [1]:
import numpy as np

* Creating a `numpy.memmap` instance with the `w+` mode creates a file on the filesystem and zeros its content. 

In [2]:
# Clean up any existing file from a past session (necessary on Windows)
import os

current_dir = os.path.abspath(os.path.curdir)
mmap_filepath = os.path.join(current_dir, 'files', 'small.mmap')
if os.path.exists(mmap_filepath):
    os.unlink(mmap_filepath)

mm_w = np.memmap(mmap_filepath, shape=10, dtype=np.float32, mode='w+')
print(mm_w)

[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


* This binary file can then be mapped as a new numpy array by all the engines having access to the same filesystem. 
* The `mode='r+'` opens this shared memory area in read write mode:

In [3]:
mm_r = np.memmap('files/small.mmap', dtype=np.float32, mode='r+')
print(mm_r)

[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
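Note that the call above did not pass a `shape`: the raw file stores no shape metadata, so numpy infers a flat length from the file size and the `dtype`. The sketch below illustrates this for a 2D array; the scratch file name `matrix_demo.mmap` and the helper name are made up for the illustration and are not part of the original notebook.

```python
import os
import tempfile

import numpy as np


def reopen_with_and_without_shape():
    """Create a (3, 4) memmap, then reopen it without and with a shape."""
    # Hypothetical scratch file in a temporary directory
    path = os.path.join(tempfile.mkdtemp(), 'matrix_demo.mmap')
    matrix = np.memmap(path, dtype=np.float32, mode='w+', shape=(3, 4))
    matrix.flush()

    # Without a shape, the file is mapped as a flat 1D array
    flat = np.memmap(path, dtype=np.float32, mode='r+')
    # Passing the shape again restores the 2D layout
    shaped = np.memmap(path, dtype=np.float32, mode='r+', shape=(3, 4))
    return flat.shape, shaped.shape


print(reopen_with_and_without_shape())
```

For multi-dimensional data, the shape (and the dtype) therefore have to be known by every process that maps the file.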


In [4]:
mm_w[0] = 42
print(mm_w)

[ 42. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [5]:
print(mm_r)

[ 42. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


* Memory mapped arrays created with `mode='r+'` can be modified, and the modifications are shared with every other process that maps the same file.

In [12]:
mm_r[1] = 43

In [13]:
print(mm_r)

[ 42. 43. 0. 0. 0. 0. 0. 0. 0. 0.]
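The cells above use two mappings inside a single process. As a minimal sketch of the multi-process case, the snippet below starts a second Python interpreter that maps the same file and writes to it, then reads the value back through the parent's mapping. The scratch file `shared_demo.mmap`, the `CHILD_CODE` string and the helper name are assumptions made for this illustration.

```python
import os
import subprocess
import sys
import tempfile

import numpy as np

# Code run by the child interpreter: map the same file and modify one element.
CHILD_CODE = """
import sys
import numpy as np
view = np.memmap(sys.argv[1], dtype=np.float32, mode='r+')
view[1] = 43
view.flush()
"""


def demo_shared_write():
    """Let a separate process write to a file that the parent has mapped."""
    # Hypothetical scratch file, kept separate from files/small.mmap
    path = os.path.join(tempfile.mkdtemp(), 'shared_demo.mmap')
    parent_view = np.memmap(path, dtype=np.float32, mode='w+', shape=10)
    parent_view.flush()

    # Run the child and wait for it to finish
    subprocess.run([sys.executable, '-c', CHILD_CODE, path], check=True)

    # The parent's existing mapping reflects the child's write
    return float(parent_view[1])


print(demo_shared_write())
```

No data is copied between the two processes here: both map the same pages of the OS disk cache.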


### Memmap Operations

Memmap arrays generally behave very much like regular in-memory numpy arrays:

In [14]:
print(mm_r.sum())
print("sum={0}, mean={1}, std={2}".format(mm_r.sum(), 
 np.mean(mm_r), np.std(mm_r)))

85.0
sum=85.0, mean=8.5, std=17.0014705657959
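The reason memmap arrays behave like regular arrays is that `numpy.memmap` is a subclass of `numpy.ndarray`. The following sketch (using a hypothetical scratch file and helper name) shows that `np.asarray` returns a plain `ndarray` view over the same buffer, so writes through the view still reach the mapped file.

```python
import os
import tempfile

import numpy as np


def as_plain_array():
    """Wrap a memmap in a plain ndarray view that shares the same buffer."""
    # Hypothetical scratch file in a temporary directory
    path = os.path.join(tempfile.mkdtemp(), 'view_demo.mmap')
    mm = np.memmap(path, dtype=np.float32, mode='w+', shape=10)

    plain = np.asarray(mm)  # base-class view, no data is copied
    plain[0] = 42           # the write goes to the shared buffer

    return (isinstance(mm, np.ndarray),
            type(plain) is np.ndarray,
            float(mm[0]),
            bool(np.shares_memory(mm, plain)))


print(as_plain_array())
```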


Before allocating more data let us define a couple of utility functions from the previous exercise (and more) to monitor what is used by which engine and what is still free on the cluster as a whole:

* Let's allocate an 80 MB memmap array:

In [15]:
# Clean up any existing file from a past session (necessary on Windows)
import os
if os.path.exists('files/big.mmap'):
    os.unlink('files/big.mmap')

np.memmap('files/big.mmap', shape=10 * int(1e6), dtype=np.float64, mode='w+')

memmap([ 0., 0., 0., ..., 0., 0., 0.])

No significant memory was used in this operation: we only asked the OS to allocate the buffer on the hard drive and to maintain a virtual memory area as a cheap reference to this buffer.
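The size of the buffer on disk is simply the number of items times the item size of the dtype (10 million `float64` values of 8 bytes each give 80 MB). The sketch below checks this relation on a smaller, hypothetical scratch file; the helper name `allocated_file_size` is an assumption made for this illustration.

```python
import os
import tempfile

import numpy as np


def allocated_file_size(n_items, dtype=np.float64):
    """Return (actual file size, expected size) in bytes for a new memmap."""
    # Hypothetical scratch file in a temporary directory
    path = os.path.join(tempfile.mkdtemp(), 'size_demo.mmap')
    buffer = np.memmap(path, dtype=dtype, mode='w+', shape=n_items)
    buffer.flush()

    expected = n_items * np.dtype(dtype).itemsize
    return os.path.getsize(path), expected


# One million float64 values: 8 bytes each
print(allocated_file_size(1000000))
```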

Let's open new references to the same buffer from all the engines at once:

In [17]:
%time big_mmap = np.memmap('files/big.mmap', dtype=np.float64, mode='r+')

CPU times: user 393 µs, sys: 577 µs, total: 970 µs
Wall time: 773 µs


In [18]:
big_mmap

memmap([ 0., 0., 0., ..., 0., 0., 0.])

* Let's trigger an actual load of the data from the drive into the in-memory disk cache of the OS. This can take some time depending on the speed of the hard drive (on the order of 100 MB/s to 300 MB/s, hence roughly 0.3 s to 0.8 s for this 80 MB dataset):

In [19]:
%time np.sum(big_mmap)

CPU times: user 39.4 ms, sys: 89.6 ms, total: 129 ms
Wall time: 602 ms


memmap(0.0)

* Now that the data is in the disk cache, the same operation is served from memory and runs much faster:

In [20]:
%time np.sum(big_mmap)

CPU times: user 16.6 ms, sys: 2.2 ms, total: 18.8 ms
Wall time: 16.3 ms


memmap(0.0)

### Example of practical use of this approach

This strategy is well suited to loading the read-only datasets of machine learning problems, especially when the same data is reused over and over by concurrent processes, as is the case for learning curve analysis or grid search (**Hyperparameter Optimisation** & **Model Selection**).

This is of great importance for multiple **embarrassingly** parallel processes (like **Grid Search**), because all of them can map the same file instead of each holding a private copy of the data.
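For read-only datasets, the file can be mapped with `mode='r'`, which protects the shared data against accidental writes from any worker. The following is a minimal sketch under that assumption; the scratch file `dataset_demo.mmap`, the array shape and the helper name are invented for the illustration.

```python
import os
import tempfile

import numpy as np


def demo_readonly():
    """Write a small dataset once, then map it read-only."""
    # Hypothetical scratch file standing in for a training set
    path = os.path.join(tempfile.mkdtemp(), 'dataset_demo.mmap')
    data = np.memmap(path, dtype=np.float64, mode='w+', shape=(100, 5))
    data[:] = 1.0
    data.flush()
    del data  # drop the writable mapping

    readonly = np.memmap(path, dtype=np.float64, mode='r', shape=(100, 5))
    try:
        readonly[0, 0] = 2.0
        write_rejected = False
    except ValueError:
        # numpy refuses to assign into a read-only array
        write_rejected = True

    return write_rejected, float(readonly.sum())


print(demo_readonly())
```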

## Memmapping Nested Numpy-based Data Structures with Joblib

**joblib** is a utility library that older releases of the **sklearn** package bundled as `sklearn.externals.joblib`; it is also distributed as a standalone package. Among other things it provides tools to serialize objects that comprise large numpy arrays and reload them as memmap-backed data structures.

To demonstrate it, let's create an arbitrary Python data structure involving numpy arrays:

In [21]:
import numpy as np

class MyDataStructure(object):

    def __init__(self, shape):
        self.float_zeros = np.zeros(shape, dtype=np.float32)
        self.integer_ones = np.ones(shape, dtype=np.int64)

data_structure = MyDataStructure((3, 4))
data_structure.float_zeros, data_structure.integer_ones

(array([[ 0., 0., 0., 0.],
 [ 0., 0., 0., 0.],
 [ 0., 0., 0., 0.]], dtype=float32), array([[1, 1, 1, 1],
 [1, 1, 1, 1],
 [1, 1, 1, 1]]))

We can now persist this datastructure to disk:

In [22]:
# On scikit-learn >= 0.23, use `import joblib` instead (sklearn.externals.joblib was removed)
from sklearn.externals import joblib
joblib.dump(data_structure, 'files/data_structure.pkl')

['files/data_structure.pkl',
 'files/data_structure.pkl_01.npy',
 'files/data_structure.pkl_02.npy']

In [23]:
!ls -l files/data_structure*

-rw-r--r-- 1 valerio staff 267 Jul 21 10:17 files/data_structure.pkl
-rw-r--r-- 1 valerio staff 176 Jul 21 10:17 files/data_structure.pkl_01.npy
-rw-r--r-- 1 valerio staff 128 Jul 21 10:17 files/data_structure.pkl_02.npy


A memmapped copy of this datastructure can then be loaded:

In [24]:
memmaped_data_structure = joblib.load('files/data_structure.pkl', 
 mmap_mode='r+')
memmaped_data_structure.float_zeros, memmaped_data_structure.integer_ones

(memmap([[ 0., 0., 0., 0.],
 [ 0., 0., 0., 0.],
 [ 0., 0., 0., 0.]], dtype=float32), memmap([[1, 1, 1, 1],
 [1, 1, 1, 1],
 [1, 1, 1, 1]]))
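The same round trip can be sketched with the standalone `joblib` package, assuming it is installed. This version uses a plain dictionary instead of the `MyDataStructure` class, and a hypothetical scratch file and helper name. Recent joblib releases may write a single file rather than the separate `.npy` files listed above, so the sketch only checks the loaded object.

```python
import os
import tempfile

import joblib  # assumption: the standalone joblib package is installed
import numpy as np


def dump_and_memmap():
    """Persist a nested structure and reload its arrays as memmaps."""
    data = {
        'float_zeros': np.zeros((3, 4), dtype=np.float32),
        'integer_ones': np.ones((3, 4), dtype=np.int64),
    }
    # Hypothetical scratch file in a temporary directory
    path = os.path.join(tempfile.mkdtemp(), 'data_structure.pkl')

    # No compression here: memory mapping does not apply to compressed files
    joblib.dump(data, path)

    # The arrays inside the loaded dictionary are backed by the file
    return joblib.load(path, mmap_mode='r')


loaded = dump_and_memmap()
print(type(loaded['float_zeros']), type(loaded['integer_ones']))
```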