This patch adds automatic multi-threaded execution for rank-1 operations on large arrays.
Patch: parallel-rank-execution.patch
J's rank operator (`"`) applies a verb to each cell of an array. When thousands of cells are processed, that work can be spread across multiple CPU cores. This patch adds automatic parallel execution when an operation spans enough cells to justify the overhead:
```j
NB. Without parallelisation: processes cells sequentially
expensive_op"1 large_matrix   NB. 10000 rows × single thread

NB. With parallelisation: divides work across available cores
expensive_op"1 large_matrix   NB. 10000 rows ÷ 4 threads = ~4x faster
```

The patch modifies `jtrank1ex` in `jsrc/cr.c` to check whether parallel execution would be beneficial:
```c
const I PARALLEL_THRESHOLD = 1000;       // Minimum cells to justify overhead

if (mn >= PARALLEL_THRESHOLD &&          // Enough work
    nthreads > 1 &&                      // Multiple threads available
    !(jt->uflags.trace & TRACEDB) &&     // Not debugging
    !(state & ZZFLAGVIRTWINPLACEX)) {    // Not inplace-conflicting
  // Use parallel execution
} else {
  // Fall back to sequential
}
```

Cells are divided evenly among the available threads:
```
Thread 0:   cells [0 .. N/T)
Thread 1:   cells [N/T .. 2N/T)
...
Thread T-1: cells [(T-1)N/T .. N)
```

where N = total cells and T = thread count.
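The splitting code itself is not part of the excerpt above; the following is a minimal, self-contained sketch of one common way to compute those bounds, including the remainder case when N is not a multiple of T. The names (`chunk_bounds` and its parameters) are illustrative, not taken from the patch.

```c
#include <stdio.h>

/* Illustrative sketch: compute the half-open cell range [start, end)
   assigned to thread t when n cells are split across nthreads workers.
   Earlier threads absorb the remainder, so chunk sizes differ by at most 1. */
static void chunk_bounds(long long n, int nthreads, int t,
                         long long *start, long long *end) {
    long long base = n / nthreads;   /* minimum cells per thread */
    long long rem  = n % nthreads;   /* leftover cells to distribute */
    *start = t * base + (t < rem ? t : rem);
    *end   = *start + base + (t < rem ? 1 : 0);
}

int main(void) {
    long long s, e;
    for (int t = 0; t < 4; ++t) {    /* e.g. 10000 cells over 4 threads */
        chunk_bounds(10000, 4, t, &s, &e);
        printf("Thread %d: cells [%lld .. %lld)\n", t, s, e);
    }
    return 0;
}
```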
Each worker thread receives a context structure:
```c
typedef struct {
    A  virtw;      // Virtual block template
    A  fs;         // Verb to apply
    AF f1;         // Function pointer
    I  wk;         // Bytes per cell
    I  wcn;        // Atoms per cell
    I  rr;         // Rank
    I  mn;         // Total cell count
    I  wf;         // Frame length
    I  state;      // Processing flags
    J  jtfg;       // J thread context
    A  *results;   // Output array for chunk results
} rank1_parallel_ctx;
```

After all workers complete:

1. Results from each chunk are collected.
2. Chunk results are merged in the correct order.
3. The final result has the expected frame + cell shape.
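The worker routine itself is not shown in the excerpt. As a simplified, self-contained sketch of the general pattern only (plain C with pthreads and stand-in types, not jsource's `A`/`AF`/`I` types or its threading API): each worker applies the verb's function pointer to every cell in its assigned range and writes into a disjoint slice of the output.

```c
/* Compile with: cc -pthread worker_sketch.c */
#include <pthread.h>
#include <stdio.h>

/* Illustrative sketch only: `long` stands in for a cell and `f1` for the
   verb's function pointer; none of these names come from the patch. */
typedef long (*CellFn)(long);

typedef struct {
    CellFn f1;        /* function applied to each cell                     */
    const long *in;   /* input cells                                       */
    long  *out;       /* outputs; each worker writes only [start, end)     */
    long   start, end;
} worker_ctx;

static void *worker_main(void *arg) {
    worker_ctx *c = arg;
    for (long i = c->start; i < c->end; ++i)
        c->out[i] = c->f1(c->in[i]);   /* no locking: output ranges are disjoint */
    return NULL;
}

static long square(long x) { return x * x; }

int main(void) {
    enum { N = 1000, T = 4 };
    long in[N], out[N];
    for (long i = 0; i < N; ++i) in[i] = i;

    pthread_t tid[T];
    worker_ctx ctx[T];
    for (int t = 0; t < T; ++t) {
        ctx[t] = (worker_ctx){ square, in, out, t * (long)N / T, (t + 1) * (long)N / T };
        pthread_create(&tid[t], NULL, worker_main, &ctx[t]);
    }
    for (int t = 0; t < T; ++t) pthread_join(tid[t], NULL);

    printf("out[%d] = %ld\n", N - 1, out[N - 1]);   /* 998001 */
    return 0;
}
```

Because the output slices are disjoint, no synchronisation is needed beyond the final join; the real implementation operates on J array cells via virtual blocks rather than plain integers.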
The implementation ensures thread safety through:
```c
// Allocate results array for each thread
A resultsA; GAT0(resultsA, BOX, nthreads, 1);
pctx.results = AAV(resultsA);

// Each worker allocates its own intermediate results
results = (A*)malloc(chunk_size * sizeof(A));

// Final merge copies data to output array
MC(zzdata + start_cell * cellsize, CAV(pctx.results[t]), chunk_cells * cellsize);
```
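The `MC` line above copies a single thread's chunk. A hedged sketch of what the surrounding merge loop looks like, written in plain C with illustrative names rather than the patch's actual variables:

```c
#include <string.h>

/* Illustrative sketch (not jsource code): copy each thread's chunk back into
   the final output buffer in thread order, so cells land in their original
   positions. */
static void merge_chunks(char *out, char **chunk_data,
                         const long *chunk_start, const long *chunk_cells,
                         int nthreads, long cellsize) {
    for (int t = 0; t < nthreads; ++t)
        memcpy(out + chunk_start[t] * cellsize,
               chunk_data[t],
               (size_t)(chunk_cells[t] * cellsize));
}
```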
| Condition | Reason |
|---|---|
| `mn >= 1000` | Parallelisation overhead not worthwhile for small arrays |
| `nthreads > 1` | Need multiple workers to benefit |
| `!(trace & TRACEDB)` | Debugging requires sequential execution for predictable output |
| `!ZZFLAGVIRTWINPLACEX` | In-place operations could cause race conditions |
J’s thread pool must be initialised for parallel execution to occur. Check your J configuration for thread pool settings.
For debugging, compile with `DEBUG_PARALLEL_RANK` defined:

```c
#define DEBUG_PARALLEL_RANK
```

This enables diagnostic output:
```
PARALLEL DEBUG: mn=10000, nthreads=4, threshold=1000, trace=0, inplace=0
PARALLEL DEBUG: Taking parallel execution path with 4 threads
```
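The logging statements themselves are not part of the excerpt; a minimal, self-contained sketch of the kind of guarded diagnostic that could produce output like the above (the values here are dummies standing in for the real `mn`, `nthreads`, and flags):

```c
#include <stdio.h>

#define DEBUG_PARALLEL_RANK          /* normally passed via CFLAGS */

int main(void) {
    long mn = 10000, threshold = 1000;
    int  nthreads = 4, trace = 0, inplace = 0;

#ifdef DEBUG_PARALLEL_RANK
    fprintf(stderr,
            "PARALLEL DEBUG: mn=%ld, nthreads=%d, threshold=%ld, trace=%d, inplace=%d\n",
            mn, nthreads, threshold, trace, inplace);
    fprintf(stderr,
            "PARALLEL DEBUG: Taking parallel execution path with %d threads\n",
            nthreads);
#endif
    return 0;
}
```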
The operations that benefit most are those applied to many independent cells:
```j
NB. Good candidates for parallelisation:
expensive_calc"1 big_matrix   NB. Each row processed independently
normalize"2 tensor            NB. Each plane normalised independently
hash"1 strings                NB. Each string hashed independently

NB. Less benefit (already fast):
+/"1 matrix                   NB. Sum is very fast per cell
```

To apply the patch:

```sh
cd /path/to/jsource
git apply parallel-rank-execution.patch
```

To enable parallel execution diagnostics during development:

```sh
# Add to your build flags:
CFLAGS="-DDEBUG_PARALLEL_RANK" make
```

Potential extensions to this parallelisation approach:
result.h)