KASKADE 7 development version
Kaskade::NumaThreadPool Class Reference

Implementation of thread pools suitable for parallelization of (more or less) memory-bound algorithms (not only) on NUMA machines. More...

#include <threading.hh>

Detailed Description

Implementation of thread pools suitable for parallelization of (more or less) memory-bound algorithms (not only) on NUMA machines.

This class maintains two thread pools satisfying different needs.

The threads in the first (global) pool can be moved by the operating system freely between nodes and CPUs, and should be used (by submitting tasks via run) whenever there is no particular need to execute the task on a particular NUMA node, i.e. if the task is not memory-bandwidth-bound or works on data that is not located on a particular NUMA node. The number of global threads defaults to twice the number of available CPUs (such that all CPUs can be busy even if some threads wait on a mutex), but is at least 4 unless limited to a smaller number on construction of the thread pool, see instance.

The threads in the second (NUMA) pool are pinned to nodes (the OS may still move them between CPUs on the same node), and should be used (by submitting tasks via runOnNode) whenever the tasks are memory-bandwidth-bound and the data resides on a particular NUMA node. Since these threads are locked to their nodes, the operating system cannot migrate them away; this keeps each thread close to its data and ensures local memory access. On the other hand, on multi-user machines it can lead to several threads competing for the same CPU while other CPUs are idle. Use the node-locked threads only if locality of memory access is of top priority.

Relying on the first-touch policy for controlling memory locality is not guaranteed to keep threads and data close to each other. First, the allocator may re-use memory blocks previously touched and released by a different thread, leading to remote data access. Second, the operating system may decide to move a thread to a different node without knowledge of which thread is memory-bound or compute-bound.

Caveat: When submitting tasks recursively to the task pool, it is easy to create a deadlock. While the top-level tasks wait for the completion of the lower-level tasks, the lower-level tasks are not processed, as all threads in the pool are occupied by the top-level tasks. Take care not to submit more recursive tasks than there are worker threads available.

Definition at line 292 of file threading.hh.

Public Member Functions

System information
int nodes () const
 Reports the number of NUMA nodes (i.e., memory interfaces/CPU sockets). More...
 
int cpus () const
 Reports the total number of CPUs (usually a multiple of nodes). More...
 
int runningOnGlobalQueue () const
 Reports how many worker threads are running to work on the global task queue. More...
 
int cpus (int node) const
 Reports the number of CPUs on the given node (usually the same for all nodes). More...
 
int maxCpusOnNode () const
 Reports the maximal number of CPUs on one node. More...
 
bool isSequential () const
 Returns true if tasks are executed sequentially. Sequential execution can be enforced by calling NumaThreadPool::instance(1) at program start. Note that tasks are nevertheless executed in a std::packaged_task context, which means that exceptions are caught and rethrown in std::future::get(). Thus, stack context is lost when debugging even for sequential execution. More...
 
Task submission
Ticket run (Task &&task)
 Schedules a task to be executed on an arbitrary CPU. More...
 
Ticket runOnNode (int node, Task &&task)
 Schedules a task to be executed on a CPU belonging to the given NUMA node. More...
 
NUMA-aware memory management
Kalloc & allocator (int node)
 Returns the allocator used for the given node. More...
 
void * allocate (size_t n, int node)
 Allocates memory on a specific node. More...
 
void deallocate (void *p, size_t n, int node)
 Frees a chunk of memory previously allocated. More...
 
size_t alignment (int node) const
 Reports the alignment size of allocator at given NUMA node. More...
 
void reserve (size_t n, size_t k, int node)
 Tells the allocator to prepare for subsequent allocation of several memory blocks of same size. More...
 

Static Public Member Functions

static NumaThreadPool & instance (int maxThreads=std::numeric_limits< int >::max())
 Returns a globally unique thread pool instance. More...
 

Member Function Documentation

◆ alignment()

size_t Kaskade::NumaThreadPool::alignment ( int  node) const

Reports the alignment size of allocator at given NUMA node.

◆ allocate()

void * Kaskade::NumaThreadPool::allocate ( size_t  n,
int  node 
)

Allocates memory on a specific node.

Note: this is comparatively slow and should only be used for allocating large chunks of memory to be managed locally. The memory has to be released by a subsequent call to deallocate.
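
A hedged sketch of the intended allocate/deallocate pairing (depends on the Kaskade library; node 0 and the chunk size are illustrative assumptions):

```cpp
#include "threading.hh"  // Kaskade::NumaThreadPool

void allocateExample()
{
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance();

    size_t const bytes = 1 << 20;                // a large chunk: allocate() is
    void* p = pool.allocate(bytes, /*node=*/0);  // slow, so use big blocks only

    // ... fill and use the memory from threads pinned to node 0 ...

    pool.deallocate(p, bytes, 0);  // size and node must match the allocation
}
```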

◆ allocator()

Kalloc & Kaskade::NumaThreadPool::allocator ( int  node)

Returns the allocator used for the given node.

◆ cpus() [1/2]

int Kaskade::NumaThreadPool::cpus ( ) const
inline

Reports the total number of CPUs (usually a multiple of nodes).

This is based on what std::thread::hardware_concurrency() returns, and is not guaranteed to match the hardware actually available. Precisely, it is max(1, hardware_concurrency()).

Definition at line 327 of file threading.hh.

Referenced by Kaskade::VariationalFunctionalAssembler< F, SparseIndex, BoundaryDetector, QuadRule >::assemble(), Kaskade::NumaBCRSMatrix< Entry, Index >::conjugation(), Kaskade::parallelFor(), and Kaskade::PatchDomainDecompositionPreconditioner< Space, m, StorageTag, SparseMatrixIndex >::PatchDomainDecompositionPreconditioner().

◆ cpus() [2/2]

int Kaskade::NumaThreadPool::cpus ( int  node) const
inline

Reports the number of CPUs on the given node (usually the same for all nodes).

Definition at line 343 of file threading.hh.

◆ deallocate()

void Kaskade::NumaThreadPool::deallocate ( void *  p,
size_t  n,
int  node 
)

Frees a chunk of memory previously allocated.

◆ instance()

static NumaThreadPool & Kaskade::NumaThreadPool::instance ( int  maxThreads = std::numeric_limits< int >::max())
static

Returns a globally unique thread pool instance.

On the very first call of this method, the singleton thread pool is created, with the number of global (unpinned) threads limited by the given maxThreads. Later calls with a different value of maxThreads do not change the number of threads. If the number of global threads shall be limited throughout, call this method at the very start of the program.

Parameters
maxThreads	an upper bound for the number of global threads to create.

As it makes little sense to have multiple thread pools fight for physical resources, a single instance should be employed.

Referenced by Kaskade::JacobiPreconditionerDetail::DiagonalBlock< Entry, row, col >::apply(), Kaskade::VariationalFunctionalAssembler< F, SparseIndex, BoundaryDetector, QuadRule >::assemble(), Kaskade::GridManagerBase< Grd >::cellRanges(), Kaskade::NumaBCRSMatrix< Entry, Index >::conjugation(), Kaskade::JacobiPreconditionerDetail::DiagonalBlock< Entry, row, col >::DiagonalBlock(), Kaskade::gridIterate(), Kaskade::NumaCRSPattern< Index >::NumaCRSPattern(), Kaskade::NumaCRSPatternCreator< Index >::NumaCRSPatternCreator(), Kaskade::parallelFor(), Kaskade::parallelForNodes(), Kaskade::PatchDomainDecompositionPreconditioner< Space, m, StorageTag, SparseMatrixIndex >::PatchDomainDecompositionPreconditioner(), Kaskade::TransferData< Space, CoarseningPolicy >::TransferData(), and Kaskade::NumaCRSPatternCreator< Index >::~NumaCRSPatternCreator().
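
A hedged usage sketch (it depends on the Kaskade library and cannot run standalone; the debugging motivation is illustrative):

```cpp
#include "threading.hh"  // Kaskade::NumaThreadPool

int main()
{
    // The first call fixes the thread limit; later instance() calls
    // ignore their maxThreads argument and return the same pool.
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance(1);

    // pool.isSequential() now returns true: the remainder of the program
    // executes tasks sequentially, e.g. for easier debugging.
}
```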

◆ isSequential()

bool Kaskade::NumaThreadPool::isSequential ( ) const
inline

Returns true if tasks are executed sequentially. Sequential execution can be enforced by calling NumaThreadPool::instance(1) at program start. Note that tasks are nevertheless executed in a std::packaged_task context, which means that exceptions are caught and rethrown in std::future::get(). Thus, stack context is lost when debugging even for sequential execution.

Definition at line 362 of file threading.hh.

Referenced by Kaskade::parallelFor().

◆ maxCpusOnNode()

int Kaskade::NumaThreadPool::maxCpusOnNode ( ) const
inline

Reports the maximal number of CPUs on one node.

Definition at line 351 of file threading.hh.

◆ nodes()

int Kaskade::NumaThreadPool::nodes ( ) const
inline

Reports the number of NUMA nodes (i.e., memory interfaces/CPU sockets).

◆ reserve()

void Kaskade::NumaThreadPool::reserve ( size_t  n,
size_t  k,
int  node 
)

Tells the allocator to prepare for subsequent allocation of several memory blocks of same size.

Parameters
n	the requested size of the memory blocks
k	the number of memory blocks that will be requested
node	the NUMA node on which the blocks will be requested

Use this as a hint that allows the allocator to improve its performance.
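
A hedged sketch of the hint's intended use (depends on the Kaskade library; node 0, the block size, and the count are illustrative assumptions):

```cpp
#include <vector>
#include "threading.hh"  // Kaskade::NumaThreadPool

void reserveExample()
{
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance();

    size_t const blockSize = 4096;
    size_t const count = 100;
    pool.reserve(blockSize, count, /*node=*/0);  // hint: 100 blocks of 4 KiB upcoming

    std::vector<void*> blocks;
    for (size_t i = 0; i < count; ++i)
        blocks.push_back(pool.allocate(blockSize, 0));

    for (void* p : blocks)
        pool.deallocate(p, blockSize, 0);  // size and node must match the allocation
}
```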

◆ run()

Ticket Kaskade::NumaThreadPool::run ( Task &&  task)

Schedules a task to be executed on an arbitrary CPU.

Parameters
task	the task object to be executed.
Returns
a waitable ticket. Call wait() on the ticket in order to wait for the task to be completed.

Note that waiting for tasks blocks a thread. Waiting for other tasks within a task may use up threads from the thread pool and lead to deadlocks.

Referenced by Kaskade::parallelFor().
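
A hedged usage sketch (depends on the Kaskade library; it assumes Task is constructible from a lambda and that Ticket lives in namespace Kaskade):

```cpp
#include <atomic>
#include "threading.hh"  // Kaskade::NumaThreadPool

void runExample()
{
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance();

    std::atomic<int> counter{0};
    Kaskade::Ticket t = pool.run([&counter] { ++counter; });  // any CPU may run this
    t.wait();  // blocks the calling thread until the task has completed
}
```

Note that the wait() happens outside the pool: submitting and waiting from within another pooled task risks the deadlock described above.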

◆ runningOnGlobalQueue()

int Kaskade::NumaThreadPool::runningOnGlobalQueue ( ) const
inline

Reports how many worker threads are running to work on the global task queue.

Definition at line 335 of file threading.hh.

◆ runOnNode()

Ticket Kaskade::NumaThreadPool::runOnNode ( int  node,
Task &&  task 
)

Schedules a task to be executed on a CPU belonging to the given NUMA node.

Parameters
node	the number of the node on which to execute the task. 0 <= node < nodes().
task	the task object to be executed. The object has to live at least until the call to its call operator returns.
Returns
a waitable ticket. Call wait() on the ticket in order to wait for the task to be completed.

Note that waiting for tasks blocks a thread. Waiting for other tasks within a task may use up threads from the thread pool and lead to deadlocks.

Referenced by Kaskade::parallelForNodes().
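
A hedged sketch of one task per node (depends on the Kaskade library; it assumes Task is constructible from a lambda and that Ticket lives in namespace Kaskade):

```cpp
#include <vector>
#include "threading.hh"  // Kaskade::NumaThreadPool

void runOnNodeExample()
{
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance();

    std::vector<Kaskade::Ticket> tickets;
    for (int n = 0; n < pool.nodes(); ++n)  // one task pinned to each node
        tickets.push_back(pool.runOnNode(n, [n] { /* work on node-n data */ }));

    for (auto& t : tickets)  // the lambdas outlive their invocations,
        t.wait();            // since we wait before leaving scope
}
```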


The documentation for this class was generated from the following file:

threading.hh