123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254 |
- ========================
- The Roxie Memory Manager
- ========================
- ************
- Introduction
- ************
- This memory manager started life as the memory manager which was only used for the Roxie engine. It had several
- original design goals:
- * Support link counted rows. (When the last reference is released the row is freed.)
- * Be as fast as possible on allocate and deallocate of small rows.
- * Allow rows serialized from slaves to be used directly without being cloned first.
- * Allow the memory used by a single query, or by all queries combined, to be limited, with graceful recovery.
- * Isolate roxie queries from one another, so that one query can't bring
- down all the rest by allocating too much memory.
- * Guarantee all the memory used by a query is freed when the query finishes, reducing the possibility of memory leaks.
- * Predictable behaviour with no pathogenic cases.
- (Note that efficient usage of memory does not appear on that list - the expectation when the memory
- manager was first designed was that Roxie queries would use minimal amounts of memory and speed was
- more important. Some subsequent changes e.g., Packed heaps, and configurable bucket sizes help mitigate that.)
- **************
- Main Structure
- **************
- The basic design is to reserve (but not commit) a single large block of memory in the virtual address space. This
- memory is subdivided into "pages". (These are not the same as the os virtual memory pages. The memory manager pages
- are currently defined as 1Mb in size.)
- The page bitmap
- ===============
- The system uses a bitmap to indicate whether each page from the global memory has been allocated. All active
- IRowManager instances allocate pages from the same global memory space.
- To help reduce fragmentation allocations for single pages are fulfilled from one end of the address space, while
- allocations for multiple pages are fulfilled from the other.
- IRowManager
- ===========
- This provides the primary interface for allocating memory. The size of a requested allocation is rounded up to the
- next "bucket" size, and the allocation is then satisfied by the heap associated with that bucket size. Different
- engines can specify different bucket sizes - an optional list is provided to setTotalMemoryLimit. Roxie tends to use
- fewer buckets to help reduce the number of active heaps. Thor uses larger numbers since it is more important to
- minimize the memory wasted.
- Roxie uses a separate instance of IRowManager for each query. This provides the mechanism for limiting how much
- memory a query uses. Thor uses a single instance of an IRowManager for each slave/master.
- Heaps
- =====
- Memory is allocated from a set of "heaps - where each heap allocates blocks of memory of a single size. The heap
- exclusively owns a set of heaplet (each 1 page in size), which are held in a doubly linked list, and sub allocates
- memory from those heaplets.
- Information about each heaplet is stored in the base of the page (using a class with virtual functions) and the
- address of an allocated row is masked to determine which heap object it belongs to, and how it should be linked/released
- etc. Any pointer not in the allocated virtual address (e.g., constant data) can be linked/released with no effect.
- Each heaplet contains a high water mark of the address within the page that has already been allocated (freeBase),
- and a lockless singly-linked list of rows which have been released (r_block). Releasing a row is non-blocking and
- does not involve any spin locks or critical sections. However, this means that empty pages need to be returned to
- the global memory pool at another time. (This is done in releaseEmptyPages()).
- When the last row in a page is released a flag (possibleEmptyPages) is set in its associated heap.
- * This is checked before trying to free pages from a particular heap, avoiding waiting on a lock and traversing
- a candidate list.
- Any page which *might* contain some spare memory is added to a lockless spare memory linked list.
- * Items are popped from this list when a heap fails to allocate memory from the current heaplet. Each item is checked
- in turn if it has space before allocating a new heaplet.
- * The list is also walked when checking to see which pages can be returned to the global memory. The doubly linked
- heaplet list allows efficient freeing.
- Each allocation has a link count and an allocator id associated with it. The allocator id represents the type of
- the row, and is used to determine what destructor needs to be called when the row is destroyed. (The count for a
- row also contains a flag in the top bit to indicate if it is fully constructed, and therefore valid for the
- destructor to be called.)
- Huge Heap
- =========
- A specialized heap is used to manage all allocations that require more than one page of memory. These allocations
- are not held on a free list when they are released, but each is returned directly to the global memory pool.
- Allocations in the huge heap can be expanded and shrunk using the resizeRow() functions - see below.
- Specialised Heaps:
- ==================
- Packed
- ------
- By default a fixed size heaps rounds the requested allocation size up to the next bucket size. A packed heap
- changes this behaviour and it is rounded up to the next 4 byte boundary instead. This reduces the amount of
- memory wasted for each row, but potentially increases the number of distinct heaps.
- Unique
- ------
- By default all fixed size heaps of the same size are shared. This reduces the memory consumption, but if the
- heap is used by multiple threads it can cause significant contention. If a unique heap is specified then it
- will not be shared with any other requests. Unique heaps store information about the type of each row in the heaplet
- header, rather than per row - which reduces the allocation overhead for each row.
- (Note to self: Is there ever any advantage having a heap that is unique but not packed??)
- Blocked
- -------
- Blocked is an option on createFixedRowHeap() to allocate multiple rows from the heaplet, and then return the
- additional rows on subsequent calls. It is likely to reduce the average number of atomic operations required for each
- row being allocated, but the heap that is returned can only be used from a single thread because it is not thread safe.
- Scanning
- --------
- By default the heaplets use a lock free singly linked list to keep track of rows that have been freed. This requires
- an atomic operation for each allocation and for each free. The scanning allocator uses an alternative approach. When
- a row is freed the row header is marked, and a row is allocated by scanning through the heaplet for rows that have
- been marked as free. Scanning uses atomic store and get, rather than more expensive synchronized atomic operations,
- so is generally faster than the linked list - provided a free row is found fairly quickly.
- The scanning heaps have better best-case performance, but worse worse-case performance (if large numbers of rows
- need to be scanned before a free row is found). The best-case tends to be true if only one thread/activity is
- accessing a particular heap, and the worse-case if multiple activities are accessing a heap, particularly if the rows
- are being buffered. It is the default for thor which tends to have few active allocators, but not for roxie, which
- tends to have large numbers of allocators.
- Delay Release
- -------------
- This is another varation on the scanning allocator, which further reduces the number of atomic operations. Usually
- when a row is allocated the link count on the heaplet is increased, and when it is freed the link count is
- decremented. This option delays decrementing the link count when the row is released, by marking the row with
- a different free flag. If it is subsequently reallocated there is no need to increment the link count. The
- downside is that it is more expensive to check whether a heaplet is completely empty (since you can no longer rely on
- the heaplet linkcount alone).
- ****************
- Dynamic Spilling
- ****************
- Thor has additional requirements to roxie. In roxie, if a query exceeds its memory requirements then it is terminated.
- Thor needs to be able to spill rows and other memory to disk and continue. This is achieved by allowing any process
- that stores buffered rows to register a callback with the memory manager. When more memory is required these callbacks
- are called to free up memory, and allow the job to continue.
- Each callback can specify a priority - lower priority callbacks are called first since they are assumed to have a
- lower cost associated with spilling. When more memory is required the callbacks are called in priority order until
- one of them succeeds. The can also be passed a flag to indicate it is critical to force them to free up as much memory
- as possible.
- Complications
- =============
- There are several different complications involved with the memory spilling:
- * There will be many different threads allocating rows.
- * Callbacks could be triggered at any time.
- * There is a large scope for deadlock between the callbacks and allocations.
- * It may be better to not resize a large array if rows had to be evicted to resize it.
- * Filtered record streams can cause significant wasted space in the memory blocks.
- * Resizing a multi-page allocation is non trivial.
- Callback Rules
- ==============
- Some rules to follow when implementing callbacks:
- * A callback cannot allocate any memory from the memory manager. If it does it is likely to deadlock.
- * You cannot allocate memory while holding a lock if that lock is also required by a callback.
- Again this will cause deadlock. If it proves impossible you can use a try-lock primitive in the callback,
- but it means you won't be able to spill those rows.
- * If the heaps are fragmented it may be more efficient to repack the heaps than spill to disk.
- * If you're resizing a potentially big block of memory use the resize function with the callback.
- Resizing Large memory blocks
- ============================
- Some of the memory allocations cover more than one "page" - e.g., arrays used to store blocks of rows. (These
- are called huge pages internally, not to be confused with operating system support for huge pages...) When
- one of these memory blocks needs to be expanded you need to be careful:
- * Allocating a new page, copying, updating the pointer (within a cs) and then freeing is safe. Unfortunately
- it may involve copying a large chunk of memory. It may also fail if there isn't memory for the new and old
- block, even if the existing block could have been expanded into an adjacent block.
- * You can't lock, call a resize routine and update the pointer because the resize routine may need to allocate
- a new memory block- that may trigger a callback, which could in turn deadlock trying to gain the lock.
- (The callback may be from another thread...)
- * Therefore the memory manager contains a call which allows you to resize a block, but with a callback
- which is used to atomically update the pointer so it always remains thread safe.
- Compacting heaps
- ================
- Occasionally you have processes which read a large number of rows and then filter them so only a few are still
- held in memory. Rows tend to be allocated in sequence through the heap pages, which can mean those few remaining
- rows are scattered over many pages. If they could all be moved to a single page it would free up a significant
- amount of memory.
- The memory manager contains a function to pack a set of rows into a smaller number of pages: IRowManager->compactRows().
- This works by iterating through each of the rows in a list. If the row belongs to a heap that could be compacted,
- and isn't part of a full heaplet, then the row is moved. Since subsequent rows tend to be allocated from the same
- heaplet this has the effect of compacting the rows.
- *************
- Shared Memory
- *************
- Much of the time Thor doesn't uses full memory available to it. If you are running multiple Thor processes
- on the same machine you may want to configure the system so that each Thor has a private block of memory,
- but there is also a shared block of memory which can be used by whichever process needs it.
- The ILargeMemCallback provides a mechanism to dynamically allocate more memory to a process as it requires it.
- This could potentially be done in stages rather than all or nothing.
- (Currently unused as far as I know... the main problem is that borrowing memory needs to be coordinated.)
- **********
- Huge pages
- **********
- When OS processes use a large amount of memory, mapping virtual addresses to physical addresses can begin to
- take a significant proportion of the execution time. This is especially true once the TLB is not large enough to
- store all the mappings. Huge pages can significantly help with this problem by reducing the number of TLB entries
- needed to cover the virtual address space. The memory manager supports huge pages in two different ways:
- Huge pages can be preallocated (e.g., with hugeadm) for exclusive use as huge pages. If huge pages are enabled
- for a particular engine, and sufficient huge pages are available to supply the memory for the memory manager, then
- they will be used.
- Linux kernels from 2.6.38 onward have support for transparent huge pages. These do not need to be preallocated,
- instead the operating system tries to use them behind the scenes. HPCC version 5.2 and following takes advantage
- of this feature to significantly speed memory access up when large amounts of memory are used by each process.
- Preallocated huge pages tend to be more efficient, but they have the disadvantage that the operating system currently
- does not reuse unused huge pages for other purposes e.g., disk cache.
- There is also a memory manager option to not return the memory to the operating system when it is no longer
- required. This has the advantage of not clearing the memory whenever it is required again, but the same disadvantage
- as preallocated huge pages that the unused memory cannot be used for disk cache. We recommend this option is
- selected when preallocated huge pages are in use - until the kernel allows them to be reused.
- **************************
- Global memory and channels
- **************************
- Changes in 6.x allow Thor to run multiple channels within the same process. This allows data that is constant
- for all channels to be shared between all slave channels - a prime example is the rhs of a lookup join. For
- the queries to run efficiently the memory manager needs to ensure that each slave channel has the same amount
- of memory - especially when memory is being used that is shared between them.
- createGlobalRowManager() allows a single global row manager to be created which also provides slave row managers
- for the different channels via the querySlaveRowManager(unsigned slave) method.
|