Parallel Execution Patches for J Source

This patch adds automatic multi-threaded execution for rank-1 operations on large arrays.

Patch: parallel-rank-execution.patch


Overview

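J’s rank operator (") applies a verb to cells of an array. When the verb is applied to thousands of independent cells, the work can be split across multiple CPU cores. This patch adds automatic parallel execution when all of the following hold: the array has at least 1000 cells, more than one worker thread is available, debug tracing is off, and the operation does not conflict with in-place execution.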

Example Speedup

NB. Without parallelisation: processes cells sequentially
expensive_op"1 large_matrix   NB. 10000 rows × single thread

NB. With parallelisation: divides work across available cores
expensive_op"1 large_matrix   NB. 10000 rows ÷ 4 threads = up to ~4x faster

How It Works

Automatic Detection

The patch modifies jtrank1ex in jsrc/cr.c to check if parallel execution would be beneficial:

const I PARALLEL_THRESHOLD = 1000;  // Minimum cells to justify overhead

if (mn >= PARALLEL_THRESHOLD &&     // Enough work
    nthreads > 1 &&                 // Multiple threads available
    !(jt->uflags.trace & TRACEDB) && // Not debugging
    !(state & ZZFLAGVIRTWINPLACEX)) { // Not inplace-conflicting
    // Use parallel execution
} else {
    // Fall back to sequential
}
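
The check above can be isolated into a small predicate, which makes each condition easy to test on its own. This is a minimal sketch rather than the patch's code: the integer type alias, the flag bit values, and the name should_parallelise are placeholders chosen for illustration, since the real definitions live in the J headers.

```c
#include <stdbool.h>
#include <stdint.h>

typedef intptr_t I;   /* stand-in for J's integer type (assumption) */

/* Placeholder bit masks; the real values are defined in the J source. */
#define TRACEDB              ((I)1 << 0)
#define ZZFLAGVIRTWINPLACEX  ((I)1 << 1)

enum { PARALLEL_THRESHOLD = 1000 };   /* minimum cells to justify overhead */

/* Hypothetical helper: true only when all four conditions hold. */
static bool should_parallelise(I mn, I nthreads, I trace, I state)
{
    return mn >= PARALLEL_THRESHOLD          /* enough work */
        && nthreads > 1                      /* multiple threads available */
        && !(trace & TRACEDB)                /* not debugging */
        && !(state & ZZFLAGVIRTWINPLACEX);   /* not in-place-conflicting */
}
```

Keeping the decision in one function means the sequential fallback is taken whenever any single condition fails, which matches the if/else structure shown above.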

Work Distribution

Cells are divided evenly among available threads:

Thread 0: cells [0 .. N/T)
Thread 1: cells [N/T .. 2N/T)
...
Thread T-1: cells [(T-1)N/T .. N)

Where N = total cells, T = thread count.
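
The boundaries above can be computed with integer arithmetic. The sketch below assumes the boundary for thread t is floor(t*N/T), which is one common reading of the scheme and not necessarily how the patch rounds. The type cell_range and the function chunk_for_thread are hypothetical names.

```c
#include <stddef.h>

/* Half-open range of cell indices [start, end) assigned to one thread. */
typedef struct {
    size_t start;
    size_t end;
} cell_range;

/* Hypothetical helper: range for thread t when n_cells are split among
   n_threads. Boundaries are floor(t * N / T), so chunk sizes differ by at
   most one and the last chunk ends exactly at N even when T does not
   divide N. Assumes n_threads > 0 and that (t + 1) * n_cells does not
   overflow size_t. */
static cell_range chunk_for_thread(size_t n_cells, size_t n_threads, size_t t)
{
    cell_range r;
    r.start = (t * n_cells) / n_threads;
    r.end   = ((t + 1) * n_cells) / n_threads;
    return r;
}
```

With N = 10000 and T = 4, each thread receives 2500 cells. With N = 10 and T = 4, the ranges are [0, 2), [2, 5), [5, 7) and [7, 10), so no cell is skipped or processed twice.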

Worker Context

Each worker thread receives a context structure:

typedef struct {
    A virtw;      // Virtual block template
    A fs;         // Verb to apply
    AF f1;        // Function pointer
    I wk;         // Bytes per cell
    I wcn;        // Atoms per cell
    I rr;         // Rank
    I mn;         // Total cell count
    I wf;         // Frame length
    I state;      // Processing flags
    J jtfg;       // J thread context
    A *results;   // Output array for chunk results
} rank1_parallel_ctx;
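
To illustrate how a shared context of this kind is typically consumed, here is a simplified sketch using POSIX threads. It is not the patch's implementation: the patch runs on J's thread pool and operates on J arrays, whereas this example uses plain doubles, and every name in it (demo_ctx, demo_task, demo_worker, demo_run, row_sum) is hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

typedef double (*cell_fn)(const double *cell, size_t atoms);

/* Simplified stand-in for rank1_parallel_ctx (illustrative fields only). */
typedef struct {
    const double *data;   /* input cells, stored contiguously */
    size_t atoms;         /* atoms per cell (compare wcn) */
    size_t n_cells;       /* total cell count (compare mn) */
    size_t n_threads;     /* number of workers */
    cell_fn fn;           /* function applied to each cell (compare f1) */
    double *out;          /* one result slot per cell */
} demo_ctx;

/* Per-worker argument: the shared read-only context plus a worker index. */
typedef struct {
    const demo_ctx *ctx;
    size_t index;
} demo_task;

/* Each worker derives its own cell range and writes only to its own
   slots in out, so workers never write to the same location. */
static void *demo_worker(void *arg)
{
    const demo_task *task = arg;
    const demo_ctx *c = task->ctx;
    size_t start = (task->index * c->n_cells) / c->n_threads;
    size_t end   = ((task->index + 1) * c->n_cells) / c->n_threads;
    for (size_t i = start; i < end; ++i)
        c->out[i] = c->fn(c->data + i * c->atoms, c->atoms);
    return NULL;
}

#define DEMO_MAX_THREADS 16

/* Starts the workers and waits for all of them.
   Returns 0 on success, -1 if the thread count is unsupported or a
   thread could not be created (some cells are then left unprocessed). */
static int demo_run(const demo_ctx *ctx)
{
    pthread_t threads[DEMO_MAX_THREADS];
    demo_task tasks[DEMO_MAX_THREADS];
    size_t started = 0;
    int rc = 0;

    if (ctx->n_threads == 0 || ctx->n_threads > DEMO_MAX_THREADS)
        return -1;

    for (; started < ctx->n_threads; ++started) {
        tasks[started].ctx = ctx;
        tasks[started].index = started;
        if (pthread_create(&threads[started], NULL,
                           demo_worker, &tasks[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (size_t t = 0; t < started; ++t)
        pthread_join(threads[t], NULL);
    return rc;
}

/* Example cell function: sum of the atoms in one cell. */
static double row_sum(const double *cell, size_t atoms)
{
    double s = 0.0;
    for (size_t i = 0; i < atoms; ++i)
        s += cell[i];
    return s;
}
```

The context is shared and read-only, while each worker gets a small task record carrying its index. That split is the reason a single structure can be handed to every worker without locking.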

Result Assembly

After all workers complete:

  1. Results from each chunk are collected
  2. Chunk results are merged in correct order
  3. Final result has the expected frame + cell shape
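
The merge step can be sketched as a sequence of copies into one output buffer, in thread order. This example works on raw byte buffers and not on J arrays, and the names chunk_result and merge_chunks are hypothetical. It mirrors the role of the copy shown in the Memory Management excerpt without reproducing it.

```c
#include <stddef.h>
#include <string.h>

/* One worker's output: a contiguous buffer holding n_cells results. */
typedef struct {
    const unsigned char *data;
    size_t n_cells;
} chunk_result;

/* Hypothetical helper: copies chunk results into dst in thread order.
   Returns the number of cells written, or (size_t)-1 if dst has room
   for fewer cells than the chunks contain. */
static size_t merge_chunks(unsigned char *dst, size_t dst_cells,
                           const chunk_result *chunks, size_t n_chunks,
                           size_t cell_size)
{
    size_t start_cell = 0;
    for (size_t t = 0; t < n_chunks; ++t) {
        if (chunks[t].n_cells == 0)
            continue;                          /* nothing to copy */
        if (chunks[t].n_cells > dst_cells - start_cell)
            return (size_t)-1;                 /* would overrun dst */
        memcpy(dst + start_cell * cell_size,
               chunks[t].data,
               chunks[t].n_cells * cell_size);
        start_cell += chunks[t].n_cells;       /* next chunk starts here */
    }
    return start_cell;
}
```

Because the destination offset is the running total of cells already copied, the output order depends only on the chunk index and not on the order in which workers happen to finish.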


Technical Details

Thread Safety

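The implementation avoids shared mutable state between workers: each worker allocates its own intermediate results and writes its chunk result to its own slot in the results array, and in-place operations are excluded from the parallel path.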

Memory Management

// Allocate results array for each thread
A resultsA; GAT0(resultsA, BOX, nthreads, 1);
pctx.results = AAV(resultsA);

// Each worker allocates its own intermediate results
results = (A*)malloc(chunk_size * sizeof(A));

// Final merge copies data to output array
MC(zzdata + start_cell * cellsize, CAV(pctx.results[t]), chunk_cells * cellsize);

Conditions for Parallel Execution

Condition               Reason
mn >= 1000              Parallelisation overhead not worthwhile for small arrays
nthreads > 1            Need multiple workers to benefit
!(trace & TRACEDB)      Debugging requires sequential execution for predictable output
!ZZFLAGVIRTWINPLACEX    In-place operations could cause race conditions

Usage

Enabling Thread Pool

J’s thread pool must be initialised for parallel execution to occur. Check your J configuration for thread pool settings.

Verifying Parallel Execution

For debugging, compile with DEBUG_PARALLEL_RANK defined:

#define DEBUG_PARALLEL_RANK

This enables diagnostic output:

PARALLEL DEBUG: mn=10000, nthreads=4, threshold=1000, trace=0, inplace=0
PARALLEL DEBUG: Taking parallel execution path with 4 threads

Operations That Benefit

Operations applied to many independent cells:

NB. Good candidates for parallelisation:
expensive_calc"1 big_matrix      NB. Each row processed independently
normalize"2 tensor               NB. Each plane normalised independently
hash"1 strings                   NB. Each string hashed independently

NB. Less benefit (already fast):
+/"1 matrix                      NB. Sum is very fast per cell

Limitations

  1. Monadic rank-1 only: Currently only parallelises monadic functions applied at rank 1
  2. Thread pool required: J must have worker threads initialised
  3. Overhead: For very fast operations, parallelisation overhead may exceed benefits
  4. Memory: Each thread allocates intermediate storage
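
The overhead trade-off in item 3 can be made concrete with a rough cost model. The sketch below assumes a fixed startup-and-merge overhead and a uniform per-cell cost. Both are simplifications, the function names are hypothetical, and the numbers in the usage note are invented for illustration, not measured.

```c
#include <stddef.h>

/* Estimated time for sequential execution: every cell costs per_cell. */
static double sequential_time(size_t n_cells, double per_cell)
{
    return (double)n_cells * per_cell;
}

/* Estimated time for parallel execution: a fixed overhead plus the time
   of the largest chunk, which holds ceil(n_cells / n_threads) cells.
   Assumes n_threads > 0. */
static double parallel_time(size_t n_cells, size_t n_threads,
                            double per_cell, double overhead)
{
    size_t largest_chunk = (n_cells + n_threads - 1) / n_threads;
    return overhead + (double)largest_chunk * per_cell;
}
```

Under this model, 10000 cells at 1 time unit each with an overhead of 50 units take 10000 units sequentially and 2550 units on 4 threads, a speedup of about 3.9x. If each cell costs only 0.001 units, the sequential run takes about 10 units while the parallel run takes about 52.5, so the parallel path is slower.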

Applying the Patch

cd /path/to/jsource
git apply parallel-rank-execution.patch

Building with Debug Output

To enable parallel execution diagnostics during development:

# Add to your build flags:
CFLAGS="-DDEBUG_PARALLEL_RANK" make

Future Work

Potential extensions to this parallelisation approach: