This patch adds automatic multi-threaded execution for rank-1 operations on large arrays.
Patch: parallel-rank-execution.patch
J's rank operator (`"`) applies a verb to each cell of an array. When thousands of cells are processed, that work can be spread across multiple CPU cores. This patch adds automatic parallel execution when an operation spans enough cells to justify the overhead:
```j
NB. Without parallelisation: processes cells sequentially
expensive_op"1 large_matrix   NB. 10000 rows × single thread

NB. With parallelisation: divides work across available cores
expensive_op"1 large_matrix   NB. 10000 rows ÷ 4 threads = ~4x faster
```

The patch modifies `jtrank1ex` in `jsrc/cr.c` to check whether parallel execution would be beneficial:
```c
const I PARALLEL_THRESHOLD = 1000;       // Minimum cells to justify overhead

if (mn >= PARALLEL_THRESHOLD &&          // Enough work
    nthreads > 1 &&                      // Multiple threads available
    !(jt->uflags.trace & TRACEDB) &&     // Not debugging
    !(state & ZZFLAGVIRTWINPLACEX)) {    // Not inplace-conflicting
  // Use parallel execution
} else {
  // Fall back to sequential
}
```

Cells are divided evenly among the available threads:
```
Thread 0:   cells [0 .. N/T)
Thread 1:   cells [N/T .. 2N/T)
...
Thread T-1: cells [(T-1)N/T .. N)
```

where N = total cells and T = thread count.
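The splitting code itself is not part of the excerpt above; the following is a minimal, self-contained sketch of one common way to compute those bounds, including the remainder case when N is not a multiple of T. The names (`chunk_bounds` and its parameters) are illustrative, not taken from the patch.

```c
#include <stdio.h>

/* Illustrative sketch: compute the half-open cell range [start, end)
   assigned to thread t when n cells are split across nthreads workers.
   Earlier threads absorb the remainder, so chunk sizes differ by at most 1. */
static void chunk_bounds(long long n, int nthreads, int t,
                         long long *start, long long *end) {
    long long base = n / nthreads;   /* minimum cells per thread */
    long long rem  = n % nthreads;   /* leftover cells to distribute */
    *start = t * base + (t < rem ? t : rem);
    *end   = *start + base + (t < rem ? 1 : 0);
}

int main(void) {
    long long s, e;
    for (int t = 0; t < 4; ++t) {    /* e.g. 10000 cells over 4 threads */
        chunk_bounds(10000, 4, t, &s, &e);
        printf("Thread %d: cells [%lld .. %lld)\n", t, s, e);
    }
    return 0;
}
```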
Each worker thread receives a context structure:
```c
typedef struct {
    A  virtw;      // Virtual block template
    A  fs;         // Verb to apply
    AF f1;         // Function pointer
    I  wk;         // Bytes per cell
    I  wcn;        // Atoms per cell
    I  rr;         // Rank
    I  mn;         // Total cell count
    I  wf;         // Frame length
    I  state;      // Processing flags
    J  jtfg;       // J thread context
    A  *results;   // Output array for chunk results
} rank1_parallel_ctx;
```

After all workers complete:

1. Results from each chunk are collected.
2. Chunk results are merged in the correct order.
3. The final result has the expected frame + cell shape.
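The worker routine itself is not shown in the excerpt. As a simplified, self-contained sketch of the general pattern only (plain C with pthreads and stand-in types, not jsource's `A`/`AF`/`I` types or its threading API): each worker applies the verb's function pointer to every cell in its assigned range and writes into a disjoint slice of the output.

```c
/* Compile with: cc -pthread worker_sketch.c */
#include <pthread.h>
#include <stdio.h>

/* Illustrative sketch only: `long` stands in for a cell and `f1` for the
   verb's function pointer; none of these names come from the patch. */
typedef long (*CellFn)(long);

typedef struct {
    CellFn f1;        /* function applied to each cell                     */
    const long *in;   /* input cells                                       */
    long  *out;       /* outputs; each worker writes only [start, end)     */
    long   start, end;
} worker_ctx;

static void *worker_main(void *arg) {
    worker_ctx *c = arg;
    for (long i = c->start; i < c->end; ++i)
        c->out[i] = c->f1(c->in[i]);   /* no locking: output ranges are disjoint */
    return NULL;
}

static long square(long x) { return x * x; }

int main(void) {
    enum { N = 1000, T = 4 };
    long in[N], out[N];
    for (long i = 0; i < N; ++i) in[i] = i;

    pthread_t tid[T];
    worker_ctx ctx[T];
    for (int t = 0; t < T; ++t) {
        ctx[t] = (worker_ctx){ square, in, out, t * (long)N / T, (t + 1) * (long)N / T };
        pthread_create(&tid[t], NULL, worker_main, &ctx[t]);
    }
    for (int t = 0; t < T; ++t) pthread_join(tid[t], NULL);

    printf("out[%d] = %ld\n", N - 1, out[N - 1]);   /* 998001 */
    return 0;
}
```

Because the output slices are disjoint, no synchronisation is needed beyond the final join; the real implementation operates on J array cells via virtual blocks rather than plain integers.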
The implementation ensures thread safety through:
```c
// Allocate results array for each thread
A resultsA; GAT0(resultsA, BOX, nthreads, 1);
pctx.results = AAV(resultsA);

// Each worker allocates its own intermediate results
results = (A*)malloc(chunk_size * sizeof(A));

// Final merge copies data to output array
MC(zzdata + start_cell * cellsize, CAV(pctx.results[t]), chunk_cells * cellsize);
```
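The `MC` line above copies a single thread's chunk. A hedged sketch of what the surrounding merge loop looks like, written in plain C with illustrative names rather than the patch's actual variables:

```c
#include <string.h>

/* Illustrative sketch (not jsource code): copy each thread's chunk back into
   the final output buffer in thread order, so cells land in their original
   positions. */
static void merge_chunks(char *out, char **chunk_data,
                         const long *chunk_start, const long *chunk_cells,
                         int nthreads, long cellsize) {
    for (int t = 0; t < nthreads; ++t)
        memcpy(out + chunk_start[t] * cellsize,
               chunk_data[t],
               (size_t)(chunk_cells[t] * cellsize));
}
```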
| Condition | Reason |
|---|---|
| `mn >= 1000` | Parallelisation overhead not worthwhile for small arrays |
| `nthreads > 1` | Need multiple workers to benefit |
| `!(trace & TRACEDB)` | Debugging requires sequential execution for predictable output |
| `!ZZFLAGVIRTWINPLACEX` | In-place operations could cause race conditions |
J’s thread pool must be initialised for parallel execution to occur. Check your J configuration for thread pool settings.
For debugging, compile with `DEBUG_PARALLEL_RANK` defined:

```c
#define DEBUG_PARALLEL_RANK
```

This enables diagnostic output:
```
PARALLEL DEBUG: mn=10000, nthreads=4, threshold=1000, trace=0, inplace=0
PARALLEL DEBUG: Taking parallel execution path with 4 threads
```
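The logging statements themselves are not part of the excerpt; a minimal, self-contained sketch of the kind of guarded diagnostic that could produce output like the above (the values here are dummies standing in for the real `mn`, `nthreads`, and flags):

```c
#include <stdio.h>

#define DEBUG_PARALLEL_RANK          /* normally passed via CFLAGS */

int main(void) {
    long mn = 10000, threshold = 1000;
    int  nthreads = 4, trace = 0, inplace = 0;

#ifdef DEBUG_PARALLEL_RANK
    fprintf(stderr,
            "PARALLEL DEBUG: mn=%ld, nthreads=%d, threshold=%ld, trace=%d, inplace=%d\n",
            mn, nthreads, threshold, trace, inplace);
    fprintf(stderr,
            "PARALLEL DEBUG: Taking parallel execution path with %d threads\n",
            nthreads);
#endif
    return 0;
}
```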
The operations that benefit most are those applied to many independent cells:
```j
NB. Good candidates for parallelisation:
expensive_calc"1 big_matrix   NB. Each row processed independently
normalize"2 tensor            NB. Each plane normalised independently
hash"1 strings                NB. Each string hashed independently

NB. Less benefit (already fast):
+/"1 matrix                   NB. Sum is very fast per cell
```

To apply the patch:

```sh
cd /path/to/jsource
git apply parallel-rank-execution.patch
```

To enable parallel execution diagnostics during development:

```sh
# Add to your build flags:
CFLAGS="-DDEBUG_PARALLEL_RANK" make
```

Potential extensions to this parallelisation approach:
result.h)