KASKADE 7 development version
Kaskade::NumaThreadPool Class Reference

Implementation of thread pools suitable for parallelization of (more or less) memory-bound algorithms (not only) on NUMA machines. More...

#include <threading.hh>

Detailed Description

Implementation of thread pools suitable for parallelization of (more or less) memory-bound algorithms (not only) on NUMA machines.

This class maintains two thread pools satisfying different needs.

The threads in the first (global) pool can be moved by the operating system freely between nodes and CPUs, and should be used (by submitting tasks via run) whenever there is no particular need to execute the task on a particular NUMA node, i.e. if the task is not memory-bandwidth-bound or works on data that is not located on a particular NUMA node. The number of global threads defaults to twice the number of available CPUs (such that all CPUs can be busy even if some threads wait on a mutex), but is at least 4 unless limited to a smaller number on construction of the thread pool, see instance.

The threads in the second (NUMA) pool are pinned to nodes (the OS may still move them between CPUs on the same node), and should be used (by submitting tasks via runOnNode) whenever the tasks are memory-bandwidth-bound and the data resides on a particular NUMA node. Since these threads are locked to their nodes, the operating system cannot migrate them away; this keeps each thread close to its data and ensures local memory access. On the other hand, on multi-user machines it can lead to several threads competing for the same CPU while other CPUs are idle. Use the node-locked threads only if locality of memory access is of top priority.

Relying on the first-touch policy for controlling memory locality is not guaranteed to keep threads and data close to each other. First, the allocator may re-use memory blocks previously touched and released by a different thread, leading to remote data access. Second, the operating system may decide to move a thread to a different node without knowledge of which thread is memory-bound or compute-bound.

Caveat: When submitting tasks recursively to the task pool, it is easy to create a deadlock. While the top-level tasks wait for the completion of the lower-level tasks, the lower-level tasks are not processed, as all threads in the pool are occupied by the top-level tasks. Take care not to submit more recursive tasks than there are worker threads available.

Definition at line 292 of file threading.hh.

Public Member Functions

System information
int nodes () const
 Reports the number of NUMA nodes (i.e., memory interfaces/CPU sockets). More...
 
int cpus () const
 Reports the total number of CPUs (usually a multiple of nodes). More...
 
int runningOnGlobalQueue () const
 Reports how many worker threads are running to work on the global task queue. More...
 
int cpus (int node) const
 Reports the number of CPUs on the given node (usually the same for all nodes). More...
 
int maxCpusOnNode () const
 Reports the maximal number of CPUs on one node. More...
 
bool isSequential () const
 Returns true if tasks are executed sequentially. Sequential execution can be enforced by calling NumaThreadPool::instance(1) at program start. Note that tasks are nevertheless executed in a std::packaged_task context, which means that exceptions are caught and rethrown in std::future::get(). Thus, stack context is lost when debugging even for sequential execution. More...
 
Task submission
Ticket run (Task &&task)
 Schedules a task to be executed on an arbitrary CPU. More...
 
Ticket runOnNode (int node, Task &&task)
 Schedules a task to be executed on a CPU belonging to the given NUMA node. More...
 
NUMA-aware memory management
Kalloc & allocator (int node)
 Returns the allocator used for the given node. More...
 
void * allocate (size_t n, int node)
 Allocates memory on a specific node. More...
 
void deallocate (void *p, size_t n, int node)
 Frees a chunk of memory previously allocated. More...
 
size_t alignment (int node) const
 Reports the alignment size of allocator at given NUMA node. More...
 
void reserve (size_t n, size_t k, int node)
 Tells the allocator to prepare for subsequent allocation of several memory blocks of same size. More...
 

Static Public Member Functions

static NumaThreadPool & instance (int maxThreads=std::numeric_limits< int >::max())
 Returns a globally unique thread pool instance. More...
 

Member Function Documentation

◆ alignment()

size_t Kaskade::NumaThreadPool::alignment ( int  node) const

Reports the alignment size of allocator at given NUMA node.

◆ allocate()

void * Kaskade::NumaThreadPool::allocate ( size_t  n,
int  node 
)

Allocates memory on a specific node.

Note: this is comparatively slow and should only be used for allocating large chunks of memory to be managed locally. The memory has to be released by a subsequent call to deallocate.
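
A hedged sketch of the intended allocate/deallocate pairing (depends on the Kaskade library; node 0 and the chunk size are illustrative assumptions):

```cpp
#include "threading.hh"  // Kaskade::NumaThreadPool

void allocateExample()
{
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance();

    size_t const bytes = 1 << 20;                // a large chunk: allocate() is
    void* p = pool.allocate(bytes, /*node=*/0);  // slow, so use big blocks only

    // ... fill and use the memory from threads pinned to node 0 ...

    pool.deallocate(p, bytes, 0);  // size and node must match the allocation
}
```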

◆ allocator()

Kalloc & Kaskade::NumaThreadPool::allocator ( int  node)

Returns the allocator used for the given node.

◆ cpus() [1/2]

int Kaskade::NumaThreadPool::cpus ( ) const
inline

Reports the total number of CPUs (usually a multiple of nodes).

This is based on what std::thread::hardware_concurrency() returns, and is not guaranteed to match the hardware actually available. Precisely, it is max(1, hardware_concurrency()).

Definition at line 327 of file threading.hh.

Referenced by Kaskade::VariationalFunctionalAssembler< F, SparseIndex, BoundaryDetector, QuadRule >::assemble(), Kaskade::NumaBCRSMatrix< Entry, Index >::conjugation(), Kaskade::parallelFor(), and Kaskade::PatchDomainDecompositionPreconditioner< Space, m, StorageTag, SparseMatrixIndex >::PatchDomainDecompositionPreconditioner().

◆ cpus() [2/2]

int Kaskade::NumaThreadPool::cpus ( int  node) const
inline

Reports the number of CPUs on the given node (usually the same for all nodes).

Definition at line 343 of file threading.hh.

◆ deallocate()

void Kaskade::NumaThreadPool::deallocate ( void *  p,
size_t  n,
int  node 
)

Frees a chunk of memory previously allocated.

◆ instance()

static NumaThreadPool & Kaskade::NumaThreadPool::instance ( int  maxThreads = std::numeric_limits< int >::max())
static

Returns a globally unique thread pool instance.

On the very first call of this method, the singleton thread pool is created, with the number of global (unpinned) threads limited by the given maxThreads. Later calls with a different value of maxThreads do not change the number of threads. If the number of global threads shall be limited throughout, call this method at the very start of the program.

Parameters
maxThreads	an upper bound for the number of global threads to create.

As it makes little sense to have multiple thread pools fight for physical resources, a single instance should be employed.

Referenced by Kaskade::JacobiPreconditionerDetail::DiagonalBlock< Entry, row, col >::apply(), Kaskade::VariationalFunctionalAssembler< F, SparseIndex, BoundaryDetector, QuadRule >::assemble(), Kaskade::GridManagerBase< Grd >::cellRanges(), Kaskade::NumaBCRSMatrix< Entry, Index >::conjugation(), Kaskade::JacobiPreconditionerDetail::DiagonalBlock< Entry, row, col >::DiagonalBlock(), Kaskade::gridIterate(), Kaskade::NumaCRSPattern< Index >::NumaCRSPattern(), Kaskade::NumaCRSPatternCreator< Index >::NumaCRSPatternCreator(), Kaskade::parallelFor(), Kaskade::parallelForNodes(), Kaskade::PatchDomainDecompositionPreconditioner< Space, m, StorageTag, SparseMatrixIndex >::PatchDomainDecompositionPreconditioner(), Kaskade::TransferData< Space, CoarseningPolicy >::TransferData(), and Kaskade::NumaCRSPatternCreator< Index >::~NumaCRSPatternCreator().
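
A hedged usage sketch (it depends on the Kaskade library and cannot run standalone; the debugging motivation is illustrative):

```cpp
#include "threading.hh"  // Kaskade::NumaThreadPool

int main()
{
    // The first call fixes the thread limit; later instance() calls
    // ignore their maxThreads argument and return the same pool.
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance(1);

    // pool.isSequential() now returns true: the remainder of the program
    // executes tasks sequentially, e.g. for easier debugging.
}
```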

◆ isSequential()

bool Kaskade::NumaThreadPool::isSequential ( ) const
inline

Returns true if tasks are executed sequentially. Sequential execution can be enforced by calling NumaThreadPool::instance(1) at program start. Note that tasks are nevertheless executed in a std::packaged_task context, which means that exceptions are caught and rethrown in std::future::get(). Thus, stack context is lost when debugging even for sequential execution.

Definition at line 362 of file threading.hh.

Referenced by Kaskade::parallelFor().

◆ maxCpusOnNode()

int Kaskade::NumaThreadPool::maxCpusOnNode ( ) const
inline

Reports the maximal number of CPUs on one node.

Definition at line 351 of file threading.hh.

◆ nodes()

int Kaskade::NumaThreadPool::nodes ( ) const
inline

Reports the number of NUMA nodes (i.e., memory interfaces/CPU sockets).

◆ reserve()

void Kaskade::NumaThreadPool::reserve ( size_t  n,
size_t  k,
int  node 
)

Tells the allocator to prepare for subsequent allocation of several memory blocks of same size.

Parameters
n	the requested size of the memory blocks
k	the number of memory blocks that will be requested
node	the NUMA node on which the blocks will be requested

Use this as a hint that allows the allocator to improve its performance.
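
A hedged sketch of the hint's intended use (depends on the Kaskade library; node 0, the block size, and the count are illustrative assumptions):

```cpp
#include <vector>
#include "threading.hh"  // Kaskade::NumaThreadPool

void reserveExample()
{
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance();

    size_t const blockSize = 4096;
    size_t const count = 100;
    pool.reserve(blockSize, count, /*node=*/0);  // hint: 100 blocks of 4 KiB upcoming

    std::vector<void*> blocks;
    for (size_t i = 0; i < count; ++i)
        blocks.push_back(pool.allocate(blockSize, 0));

    for (void* p : blocks)
        pool.deallocate(p, blockSize, 0);  // size and node must match the allocation
}
```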

◆ run()

Ticket Kaskade::NumaThreadPool::run ( Task &&  task)

Schedules a task to be executed on an arbitrary CPU.

Parameters
task	the task object to be executed.
Returns
a waitable ticket. Call wait() on the ticket in order to wait for the task to be completed.

Note that waiting for tasks blocks a thread. Waiting for other tasks within a task may use up threads from the thread pool and lead to deadlocks.

Referenced by Kaskade::parallelFor().
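
A hedged usage sketch (depends on the Kaskade library; it assumes Task is constructible from a lambda and that Ticket lives in namespace Kaskade):

```cpp
#include <atomic>
#include "threading.hh"  // Kaskade::NumaThreadPool

void runExample()
{
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance();

    std::atomic<int> counter{0};
    Kaskade::Ticket t = pool.run([&counter] { ++counter; });  // any CPU may run this
    t.wait();  // blocks the calling thread until the task has completed
}
```

Note that the wait() happens outside the pool: submitting and waiting from within another pooled task risks the deadlock described above.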

◆ runningOnGlobalQueue()

int Kaskade::NumaThreadPool::runningOnGlobalQueue ( ) const
inline

Reports how many worker threads are running to work on the global task queue.

Definition at line 335 of file threading.hh.

◆ runOnNode()

Ticket Kaskade::NumaThreadPool::runOnNode ( int  node,
Task &&  task 
)

Schedules a task to be executed on a CPU belonging to the given NUMA node.

Parameters
node	the number of the node on which to execute the task. 0 <= node < nodes().
task	the task object to be executed. The object has to live at least until the call to its call operator returns.
Returns
a waitable ticket. Call wait() on the ticket in order to wait for the task to be completed.

Note that waiting for tasks blocks a thread. Waiting for other tasks within a task may use up threads from the thread pool and lead to deadlocks.

Referenced by Kaskade::parallelForNodes().
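
A hedged sketch of one task per node (depends on the Kaskade library; it assumes Task is constructible from a lambda and that Ticket lives in namespace Kaskade):

```cpp
#include <vector>
#include "threading.hh"  // Kaskade::NumaThreadPool

void runOnNodeExample()
{
    Kaskade::NumaThreadPool& pool = Kaskade::NumaThreadPool::instance();

    std::vector<Kaskade::Ticket> tickets;
    for (int n = 0; n < pool.nodes(); ++n)  // one task pinned to each node
        tickets.push_back(pool.runOnNode(n, [n] { /* work on node-n data */ }));

    for (auto& t : tickets)  // the lambdas outlive their invocations,
        t.wait();            // since we wait before leaving scope
}
```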


The documentation for this class was generated from the following file:

threading.hh