secp256k1 Point-Addition Challenge with Lumbda Attack
secp256k1 Point-Addition Challenge — Lumbda Attack
Upstream challenge: ecdsa.fail (Eigen Labs · Google Quantum AI lineage)
Substrate: lumbda.com
What This Attack Does
A quantum-reversible attack on a secp256k1 point addition, written entirely in lumbda — a Lisp/Scheme derivative carrying four implementation tiers: Python bytecode VM, C tree-walker + JIT, C + x86_64 JIT, & pure x86_64 assembly. Score equals Toffoli count × peak qubit width. Lower score wins. Every factor of two off a Toffoli count or qubit shaved off peak width multiplies straight through Shor's algorithm into a factor of two off our resource estimate for cracking secp256k1 — a curve that protects Bitcoin & Ethereum.
Where This Work Lives in Our Stack
ecdsa.fail (an Eigen Labs challenge descended from Google Quantum AI's Securing Elliptic Curve Cryptocurrencies against Quantum Vulnerabilities) accepts Rust submissions only — Rust defines a contract format, not our research substrate. Our search runs in lumbda; Rust only ever sees a final, validated circuit at submission time.
Four reasons for lumbda over Rust:
- Cross-tier validation. Same
.lspruns on Python VM, C tree-walker, C + JIT, & x86_64 assembly. Cross-tier byte-identity catches arithmetic drift Rust's single-target build cannot. - Distributed fan-out. Lumbda's TCP socket primitive + S-expression portal format shards candidate evaluation across our fleet — spare CPU on GPU hosts (3090-ai, 4090-ai/ai, cammy, guile) becomes search budget Rust would force us to glue together by hand.
- GPU dispatch lives in our substrate.
bendwire on port 8320 ships an ops binary path + batch count to a GPU host's lumbda worker, which spawns CUDAbend-cuda, returns Σ Clifford & Σ Toffoli via portal. Same protocol fans out across all four lumbda implementation tiers. - CUDA path remains open territory. Growing lumbda's tier ladder feeds back into every other lumbda workload.
Rust stays for two purposes: reproduce upstream SOTA baseline once (archived), & translate our final lumbda circuit into upstream's submission format.
Phase B — Reversible Arithmetic
Twelve steps after Roetteler et al. (2017), modular arithmetic over secp256k1's prime field, lumbda-native:
- Cuccaro reversible n-bit adder + modular addition, subtraction, doubling, halving.
- Litinski schoolbook & Solinas reduction modular multiplication.
- Fermat, textbook Bernstein-Yang (B-Y), refined B-Y, & DIALOG_GCD modular inversion.
- Real point-add bundling each primitive into our quantum-reversible EC add.
Cross-tier validation runs on Python tier locally. C-tier emits run on 3090-ai bare-metal — 24 cores, 62 GB RAM, no virtualization tax. Heavy emit left neoblanka after three prior crashes cemented per-tier escalation rules. See lumbda's CLAUDE shard on asm memory discipline.
Lever Stack
Each variant flips a different combination of Phase B substrate flags:
- mod-mul:
litinski(schoolbook) |solinas(reduction) - mod-inv:
fermat|by-textbook(textbook B-Y) |by-refined(refined B-Y) |by-dialog-gcd(DIALOG_GCD) - ancilla pool: per-width LIFO free-list recycling qubit IDs at our streaming-emit sink (additive — no Phase B primitive changes)
Cartesian product = 4 inv × 2 mul = 8 variants; refined-B-Y under Fermat dispatch collapses (refined path never runs without B-Y dispatch).
DIALOG_GCD ports three core algorithmic knobs from HEAD (production
inversion at mod.rs lines 31052-31257):
- D1 — smooth width envelope:
ideal = N − step × SLOPE + MARGINper Kaliski step. Engages once step crosses ~37 (HEAD-prod knobs). - D2 — truncated comparator window: caps comparator bits below textbook 2n. Buys 5–7 % Toffoli per pass at probe widths.
- D3 — ``ACTIVE_ITERATIONS`` cap below 2n.
Lever Ranking Reads Cleanly Across Widths
Every (a, b) pair where a beats b at p=5 also beats b at p=11 & p=251 — monotone columns across our entire Pareto chain. A small-width screen picks our winner without spending production-scale budget.
Σ Toffoli per variant (sigma Toffoli, sum of Toffoli ops) at p=5 · p=11 · p=251 widths:
- v-fermat-schoolbook: 9,054 · 21,336 · 167,984
- v-fermat-solinas: 7,182 · 16,056 · 84,208
- v-by-text-schoolbook: 5,982 · 12,376 · 45,104
- v-by-text-solinas: 5,646 · 11,704 · 40,176
- v-by-ref-schoolbook: 3,482 · 8,104 · 29,104
- v-by-ref-solinas: 3,146 · 7,432 · 24,176
Knob attribution at p=251:
- Refined-B-Y delta saves a width-quadratic constant independent of mod-mul choice: −16,000 Σ Toffoli regardless (45,104 − 29,104 = 40,176 − 24,176).
- Solinas under Fermat compounds: Fermat runs ~log₂(p) mod-muls so a faster mod-mul stacks. −83,776 Σ Toffoli (50 %).
- Solinas under refined-B-Y fades: B-Y carries mod-mul-light arithmetic so mul-knob impact shrinks. −4,928 Σ Toffoli (17 %).
Toffoli growth runs sub-quadratic up our width ladder: n+1=9 → n+1=32 yields 10.1× Toffoli versus naive n² = 15× prediction. Width-truncated Kaliski iterations carry fixed-cost overhead that amortizes as n grows.
GPU Mesh + Bend Wire
bend ships an ops binary path + batch count to whichever
fleet worker holds a warm CUDA context. Wire port 8320
spells BEND:
8 ~= B (implied infinity B flattened)
3 ~= E (backward)
2 ~= N (rotated 90 degrees)
0 ~= D (flattened)
Three nodes carry GPU workers: 3090-ai (Ampere sm_86), 4090-ai/ai
(Ampere sm_89), cammy (Pascal sm_61). Same ops.bin files
dispatched across all three return Σ Toffoli byte-for-byte
identical — our distributed mesh runs from one canonical lumbda
source.
Per-architecture GPU wall (p=251, 1,024 shots):
Wall time per variant on each GPU architecture:
- v-fermat-schoolbook: 290.1 ms (3090) · 149.2 ms (4090) · 531.7 ms (P40)
- v-by-ref-solinas: 26.5 ms (3090) · 22.0 ms (4090) · 64.0 ms (P40)
- 4090 (sm_89) runs 1.94× faster than 3090 on heaviest circuit, shrinking to 1.20× on lightest. Larger circuits amortize kernel-launch overhead.
- P40 (sm_61) runs ~2× slower than 3090 across our variant chain. Older silicon still serves our mesh at a different throughput tier.
Coordinator (ecdsa/scripts/coordinator-mesh.py)
bin-packs candidates greedy by (predicted Σ Toffoli × per-host speed
factor). Per-host factors: 3090 = 1.00, 4090 = 0.51, P40 = 1.84. First
6-variant p=251 mesh sweep landed 6/6 clean.
GPU wall scales linearly with Toffoli count — no kernel-level fudge factor. Same lever, same ~10× speedup across three independent measurements that all agree.
Closed-Loop Research Engine
Every piece of an end-to-end pipeline sits live:
lever-variants.lspexposes(lever-permutations)for arbitrary knob cartesiansemit-stream.lspstreams Phase B gates straight to a QECCOPS1 binary via lumbda's file-port primitive — constant RAM regardless of body sizebenddispatches anops.binpath to a fleet workerbend-cudareturns Σ Clifford & Σ Toffoli via portaldispatch-sweep.pyauto-resumes against new.binfiles
Production Score vs HEAD
Full secp256k1 width (n+1=257, p = 2²⁵⁶ − 2³² − 977), bend dispatch
on 3090-ai:8320, byte-identity-cpu-gpu true, status PASS.
Static circuit Toffoli count (Σ CCX from QECCOPS1 binary parser)
reproduces bend-dispatched avg Toffoli byte-for-byte at every variant
landed — shot-averaging noise washes out:
Current verified champion: vecD-on (dispatcher PASS ·
byte-identity true · classical-shadow 2000/2000 random + 35/35
structured PASS at production width n+1=257) — raw-pa-architecture +
round84-xtail walk-square + iters=129 + cmp=1 + cadd-direct-window=0 +
vecC *red-tmp-borrowed* + vecD
*dgcd-raw-pa-q1-tmp-into-u* versus HEAD:
- peak qubits: 2,573 (us) · 1,285 (HEAD) — 2.002× over
- avg Toffoli: 920,523 (us) · 1,384,984 (HEAD) — us 33.5 % UNDER HEAD (1.505× better)
- score (avg Toffoli × qubits): 2.3685e+09 (us) · 1.780e+09 (HEAD)
- Δ vs HEAD: +33 % (gap 1.33×)
Toffoli-axis race won. Our gate count dropped under
HEAD's count by a third; full remaining 1.33× gap traces to qubit
envelope alone (Bennett reacquire_vec substrate cleanup).
At HEAD's 1,285 qubits with our 920,523 avg Toffoli our score lands
1.18e+09 = gap 0.665× — we would beat HEAD by
1.505×.
Net session compression (2026-06-10 → 2026-06-11):
2.1156e+10 / 2.3685e+09 = 8.93× over a single session.
Verified ladder:
- 2.1156e+10
night17-m1-s1000000K=0 (prior session, gap 11.89×) - 1.7019e+10
kcorrw0-w0-pm2K=2 + champion stack (gap 9.56×) - 3.9308e+09
raw-pa-architecture— flag*dgcd-raw-pa-architecture* #troutesreal-point-add!through HEAD's 5-phase architecture (raw-quotient · round84-xtail squaring · raw-ipmul · mod-sub-const · mod-neg-inplace · mod-add-const) over Roetteler 12-step + 4 mod-inverse calls (gap 2.21×) - 3.5921e+09
raw-pa-architecture-walksquare+*round84-xtail-walk-square* #t(gap 2.02×) - 3.5630e+09
walk-w0+*cadd-direct-window* 0(gap 2.00×) - 3.5571e+09
wkb-cmp3+ cmp=3 (gap 1.997× — first sub-2×) - 3.5455e+09
champ3-cmp1+ cmp=1 (gap 1.992×) - 3.2235e+09
vecC-on+*red-tmp-borrowed*— aliases red-tmp onto idle lam under raw-pa contract. Qubits 2,830 → 2,573 exactly as projected (−257). Gap 1.81×, first sub-1.9× - 2.3685e+09
vecD-on+*dgcd-raw-pa-q1-tmp-into-u*— rebinds raw-pa q1tmponto kaliskiuregister (idle during apply-bitvector phase), eliminating alloc/free/swap Toffolis. avg Toffoli 1,252,817 → 920,523 (−26.5 %). Qubits flat at 2,573. Toffoli count crosses below HEAD. Current verified champion, gap 1.33×
raw-pa-architecture path correctness false-alarm + resolution:
small-width sim-tier verifier reported wrong (Rx, Ry) at n+1=18,
p=131071 across 12 diverse input tuples — looked like an algorithmic
Ry-lane port defect. Root cause traced to
squaring-sub-from-acc-schoolbook-lowq-shift22! primitive
carrying an n ≥ 22 hardcoded Solinas-shift assumption. At n+1=18 (n=17)
our shift22 lands undefined-behavior, primitive emits garbage
independent of raw-pa correctness. Production-width sim status
leaked-ancilla reflects our Bennett
reacquire_vec substrate gap (cleanup), not wrong output —
Toffoli + qubit measurements stay valid as gate-budget proxies. raw-pa
champion line vindicated; small-width sim-tier verifier requires HEAD's
shift22 assumption documented as out-of-scope for n < 22.
raw-pa-architecture dispatch unblocked via a Phase 2 alloc fix:
round84-emit-fused-square-xtail! passed bare gensyms
(r84-row, r84-pad) to
squaring-sub-from-acc-schoolbook-lowq-shift22!, which
expects caller-allocated registers per its contract
(squaring.lsp:210). Inner schoolbook gates emitted
gate-cx / ccx into registers never declared in
our circuit hash table → hash-table-ref: missing key at
first gate access. Fix wraps each cond branch with alloc! +
matching free! per width (n+1+1 for row, 1 for
pad, n+1+1 for carries). Walk-square branch needs no
row/pad — primitive allocates its own ctrl-copy register internally.
Subagent ports landed in parallel (sweep-047 maj2 substrate, sweep-048 mod_add_qq_fast_from_zero, sweep-050 cadd-nbit-const-direct-trunc-fast PRODUCTION-BEATING −25 % alone, sweep-051 body-carry-trunc + binder-notch, sweep-052 K2-pair-codec primitive byte-mirror, sweep-053 cmp-lt-with-cin + Algorithm 11 + Fiat-Shamir stub, sweep-054 apply-phase substrate skeleton = Gidney 2025 venting, sweep-055 split-EEA skeleton). HEAD primitive-parity ledger: 19 PORTED, 4 PORTED-DORMANT, 2 PORTED-SKELETON, 1 PORTED-STUB. See ecdsa/runs/PRIMITIVE-PARITY.tsv + ecdsa/runs/SOTA-BASELINE.md.
Earlier baseline ladder (DIALOG_GCD + m=154 only, no pseudo-Mersenne,
no Lane B, iters=399) landed score 7.9140e+10 (gap
44.46×). 8 lever ports + iters-knob descent + K=2 substrate +
raw-pa-architecture + vecC *red-tmp-borrowed* + vecD
*dgcd-raw-pa-q1-tmp-into-u* compress gap 44.46× →
1.33× within a single sustained sweep cycle.
HEAD reference: upstream ecdsafail-challenge
origin/main at commit 2dcf00d (2026-06-08
18:33 UTC, jackylee0424, binder-notch map + tail-nonce
retune) → score 1,891,349,200 ≈ 1.891e+09
(1,454,884 avg Toffoli × 1,300 peak qubits). Predecessor
a4db9a5 (2026-06-08 17:00 UTC, SamrendraS,
binder-notch extra + tail-nonce retune) lands on the same K2-pair-codec
score floor that 73f4f48 opened on 2026-06-08 05:42 UTC.
Prior frontier a66b042 (jackylee0424,
2026-06-06, score 1.968e+09) retired after K2-pair 6→3 CCX
core encoder shipped. Earlier baseline 2.402e+09 retired in
sweep-026 once upstream scan caught HEAD's promoted frontier moving.
Re-scan snapshot lives at
ecdsa/runs/lumbda-sweep-043/HEAD-RESCAN-2026-06-08.md.
Production-width emit lands via lumbda C tier on 3090-ai bare-metal,
~322 MB peak RSS, ~5.2 GB binary. File-port streaming writes 56-byte op
records straight to disk via fwrite without an intermediate
string buffer.
Bend dispatch on 3090-ai:8320 validates
byte-identity-cpu-gpu true on every variant landed — CUDA
simulator gate-by-gate agrees with CPU reference. Single
bend-cuda binary built locally on host via
cuda/Makefile with sm_86.
Width-envelope full-saturation finding — at production width
(n+1=257) the release step (margin-1)×1000/slope reaches
the textbook 2n=512 iter count only at
m=154/s=300. Below that margin, partial saturation leaves
the ancilla pool with multiple width buckets, each ratcheting its own
max(qubit_id). At m=154 every Kaliski iter
operates at the same maximum width, ancilla pool sees ONE width bucket,
& peak qubits hit the structural minimum at 4,886 — down from 16,370
at m=38 (−70 %). Score collapses 3.990e+11 → 1.230e+11 (−69 %) in a
single margin-axis re-bracket.
Truncated comparator finding — HEAD's
*dgcd-compare-bits* knob (56 bits at n=256 in HEAD-prod,
~22 % of width) carries real headroom. cmp=4 (~1.6 % of width at
n+1=257) lands as our production minimum after classical-sim probe-width
verification.
cmp=2 caught as algorithmically broken — bend dispatch reports
byte-identity-cpu-gpu true & status PASS, but
classical-sim verify at n+1∈{18, 32} reveals out=0 plus
leaked ancilla. Bend only validates CPU & GPU simulator agreement on
a wrong circuit, not algorithmic correctness vs. expected mod-inverse.
Same defect class re-emerges below at
DIALOG_GCD_ACTIVE_ITERATIONS.
DIALOG_GCD active-iters cap — non-monotonic correctness. Setting
iters below textbook 2n produces wrong mod-inverse output
at specific iter values, sporadically. K-correction (classical-replay
backward sweep using K = classical-mod-inv(p, r_on_1)) depends on
r_on_1 arithmetic that breaks at certain iters.
dgcd-resolve-iters gates r_on_1 != 0, which is
necessary not sufficient.
Probe data:
- n+1=18 (textbook 34): iters ∈ {15: OK, 16: WRONG, 17: WRONG, 18: OK, 25, 26, 27: OK}
- n+1=32 (textbook 62): iters ∈ {10, 12: OK, 15: WRONG, 16, 24, 28, 32, 48, 49, 50: OK}
HEAD ships iters=399 at n=256 (textbook × 0.776) — a
tuned value that survives upstream's test suite. We adopt
iters=399 as our production safe value & retract
earlier iters-cap exploration that relied solely on bend PASS without
classical-sim verification.
HOST_GATED measurement-clear port — HEAD's
mod.rs:24788-24791 pattern (Hadamard + Measure + Reset,
classical-feedback CZ) substitutes our ccx-mask uncompute pass. Per-call
save: (n+1) Toffoli replaced by 2(n+1) Cliffords (1 HMR + 1 CZ per bit).
At n+1=257 production: 833,172 Toffoli saved (matches rng_ops exactly —
one HMR-RNG per replaced CCX). Net score 4.126e+11 → 3.990e+11 (−3.31
%).
QECCOPS1 wire format always supported HMR (kind 12), CZ (kind 9),
push-cond (kind 15), pop-cond (kind 16); our lumbda emit pipeline gap
was at gates.lsp — added Tier-1 + Tier-2 emit primitives.
Classical-sim verification at three probe widths confirms HMR
substitution produces correct mod-inverse output matching control
(non-HMR) baseline.
Pseudo-Mersenne mod-double — Schrottenloher 2026/1128 (arXiv
2606.02235, posted 2026-06-01) Algorithm 7 port (sweep-030). Paper
title: Optimized Point Addition Circuits for Elliptic Curve Discrete
Logarithms; author André Schrottenloher (Univ Rennes, Inria, CNRS,
IRISA). Reference implementation: Qarton,
gitlab.inria.fr/capsule/qarton-projects/ec-point-addition. secp256k1's
prime takes pseudo-Mersenne form p = 2²⁵⁶ − 2³² − 977, so
our off-Mersenne residue f = 2³² + 977 = 4294968273 fits in
33 bits. Control mod-double-inplace! emits an
add-const at full n+1 width followed by a
csub-const at full n+1 width — 4n Toffoli per
call. New mod-double-inplace-pseudo-mersenne! emits a
single cadd-const at width
lsbs = padding + bit-length(f) = 30 + 33 = 63, controlled
on our MSB slot of a pre-double value — 2(lsbs − 1) = 124 Toffoli per
call. Per-call saving at production width: 87.9 %. Full-stack production
result: avg Toffoli 21,917,696 → 16,197,312 (−26.1 %).
Peak qubits 4,886 unchanged — algorithm reuses existing scratch
(carry-in slot, parity flag), allocates zero new ancilla.
Probe-width verification (classical-sim,
verify-pseudo-mersenne-double.lsp):
- n+1=5, p=13: Toffoli 16 → 4 (−75.0 %), exhaustive 13 inputs; pmersenne matches expected output on 11 / 13, misses on a measured 2-input strip predicted by paper's non-exact zone (size ≈ f).
- n+1=9, p=251: Toffoli 32 → 12 (−62.5 %), exhaustive 251 inputs; matches on 247 / 251, misses on a 4-input strip (predicted 5).
- n+1=18, p=131,071 (Mersenne): Toffoli 68 → 30
(−55.9 %), 1000 random inputs match 1000 / 1000. Mersenne case has
f = 1so our non-exact strip shrinks to{p − 1}. - n+1=32, p=2³¹ − 1 (Mersenne): Toffoli 124 → 58 (−53.2 %), 1000 random inputs match 1000 / 1000.
Non-exact zone scales as f / p; at secp256k1's
f / p ≈ 2³³ / 2²⁵⁶ ≈ 2⁻²²³, mispredict probability sits
below any 9024-shot harness can ever sample. Bend dispatch on
3090-ai:8320 returns byte-identity-cpu-gpu true, status
PASS.
Borrowed-carries finding — Cuccaro adder's HMR-uncompute variant
(HEAD mod.rs:1097-1144, cuccaro_add_fast)
replaces n CCX carry-uncompute ops per add with n HMR + n
classical-conditioned CZ. Algorithmic saving: ~n Toffoli per add. Where
a caller stages our scratch carries register changes
everything:
- Per-call allocation (sweep-016, rolled back):
classical-sim verify PASSes at four probe widths. Production-width peak
qubits explode. Each call's fresh
carriesopens a new ancilla bucket our pool never recycles. - Shared host-allocation (sweep-016b): alloc once at host scope, reuse across Kaliski iters. Peak qubits stabilize, net score lands +1.4 % WORSE — long-lived shared register competes with downstream ancilla pool reuse patterns.
- Borrowed-from-caller (sweep-017b, lands): caller
passes its own scratch region as
carries-reg+carries-offset. No fresh bucket, no long-lived shared register. Production score 1.230e+11 → 1.186e+11 (gap 49.4×). Extension to mod-add!/mod-sub! call sites (sweep-017c) drives score to 1.1513e+11 (gap 47.9×). - cmp-lt-into-fast (sweep-017d, queued for production
emit): HEAD's
cmp_lt_into_fastatmod.rs:3643-3693— HMR-uncompute of our comparator carry chain. Dispatches from mod-add!/mod-sub! under our same*cuccaro-use-borrowed*flag. Borrowed-carries discipline carries through to our comparator backward sweep.
Recurring lesson — probe widths cannot surface peak-qubit regressions that only manifest under our production-width ancilla pool's allocation history. Cross-tier byte-identity validates arithmetic correctness; structural-allocation cost requires end-to-end production emit to surface.
ops_hash vs trace_hash — what byte-identity actually proves
byte-identity-cpu-gpu true (returned by bend dispatch)
compares an ops_hash — a fingerprint of our emitted QECCOPS1
binary, byte-for-byte identical between CPU reference & CUDA
bend-cuda. ops_hash equality proves our two execution paths
consume an identical op stream & agree gate-by-gate on the simulator
state encoding. ops_hash equality does not prove
algorithmic correctness — see cmp=2 & active-iters cap
above, where bend returns PASS + byte-identity true on a circuit whose
mod-inverse output is wrong.
trace_hash — a fingerprint of execution-residual state
(ancilla populations, measurement outcomes, accumulated phase after
uncompute) across our simulator's full run — would surface the
residual-path class of defect ops_hash cannot. Our sim framework
computes a residual fingerprint internally, but score artifacts only
emit ops_hash. sweep-042 adds trace_hash to our score-table schema so
every promoted production .bin carries both fingerprints alongside
status PASS & classical-sim verification. Until then, algorithmic
correctness rides on classical-sim probes
(verify-mod-inv-output.lsp,
verify-pseudo-mersenne-double.lsp) at sub-production widths
plus the cross-tier ops_hash agreement.
What HEAD Still Owns
Gap of 1.33× under champion vecD-on decomposes:
- Peak qubits: 2,573 vs 1,285 = 2.002× over. Bennett
alloc-tmp route per raw-quotient / raw-ipmul call carries a transient
2Nqubit envelope HEAD'sreacquire_vecprimitive avoids. Allocator trace under our prior champion (2,830 qubits) confirmed peak concurrent live width = 2,827 (3-qubit gap from peak qubit id) — pool fragmentation negligible. Vector C*red-tmp-borrowed*retired one 257-wide bucket (−257 qubits exactly as projected). Working set, not fragmentation, drives our remaining qubit count. Vector B*allocator-slot-shrink-on-narrow*queued, predicted −256 next. - avg Toffoli: 920,523 vs 1,384,984 — under HEAD by 33.5 %
(1.505× better). Vector D
*dgcd-raw-pa-q1-tmp-into-u*rebinding q1tmponto idle kaliskiuregister dropped avg Toffoli 1,252,817 → 920,523 (−26.5 %), pushing our gate count decisively under HEAD's frontier. Toffoli-axis race won; full remaining gap traces to qubits alone. Matching HEAD's 1,285 qubits at our 920,523 avg Toffoli would put us at score1.18e+09— gap0.665×(1.505× ahead).
Qubit Attack Path
Allocator-trace decomposition of our prior champion's 2,830 qubit
peak (op_seq 514,667 — alloc of
rawpa-q1-app-mask-step0):
- 2 input registers (
tx,ty) — 257 each - 3 caller-scope point-add scratch (
lam,tmp,red-tmp) — 257 each - 4 kaliski-style state (
u,v_w,r,s) — 257 each - 1 HOST_GATED mask (
shared-mask) — 257 - 2 K-correction scratch (
mc-tmp,mc-pow) — 257 each - 1 mod-inverse output (
pa-tx-inv) — 257 - 1 m-history — 129
- 3 width-1 ancillas (
cin,flag,f)
Sum = 2,830 ✓ (full trace at
ecdsa/runs/lumbda-rawpa-stress/qubit-attack-plan.md).
Five attack vectors, sequenced by difficulty:
Vector A · *dgcd-raw-pa-share-mask-name* (low) —
Predicted −257 qubits. Forward tobitvector + apply bitvector phases use
2 mask-name families (fwd-mask, app-mask)
whose lifetimes never overlap. Pool keys allocations by NAME so each
family claims a separate 257-wide slot. Sharing one name = pool recycles
one slot. Shipped, no-op.
*allocator-aggressive-fold* already coalesces via
range-pool fold, not by-name slot reuse. Cells vecA-on +
vecA-off produced byte-identical emit (qubits = 2,830
both).
Vector B · *allocator-slot-shrink-on-narrow* (medium) —
Predicted −256 qubits. Targets a 256-qubit fragmentation hole at base
1,288..1,543: pa-c1y (width 257) freed at op 9,188, then
g0 (width 1) recycles into our same slot at op 9,190,
leaving 256 qubits dormant for ~6.22 M ops. Splits a wide free slot when
a narrow alloc lands (width ≤ ½ slot). Shipped in our
lumbda/emit-stream.lsp pool layer. Dispatcher
result pending.
Vector C · *red-tmp-borrowed* (medium) — Predicted −257
qubits. red-tmp lives our entire emit even though usage
clusters inside mod-reduce! callsites. Implementation
aliases red-tmp onto our lam register, which
sits idle under raw-pa-architecture per its contract documentation at
lumbda/dgcd-raw-pa.lsp:1428 (lam-reg UNUSED —
dialog-log subsumes slope λ). Same gate sequence, one fewer 257-wide
register opened. Shipped + verified ``vecC-on`` — qubits 2,830 →
2,573 exactly as projected (−257). Score 3.5455e+09 → 3.2235e+09 (−9.1
%), gap 1.81×.
Vector D · *dgcd-raw-pa-q1-tmp-into-u* (medium) —
Originally tagged as a qubit vector; lands as a Toffoli
vector. rawpa-q1-tmp (apply-bitvector product
accumulator) allocates at op 513,638 — exactly when
rawpa-q1-u (kaliski u register) becomes idle (bitvector
lives in dialog-log post-tobitvector). Rebinds tmp =
u in apply-bitvector phase + restores u from
target before load-p-bits unload at step 9. Allocator pool
already reuses our 257-wide slot for tmp+u under aggressive-fold, so
qubits held flat; alloc/free/swap gate sequence elided entirely.
Shipped + verified ``vecD-on`` — qubits flat 2,573 (no change vs
vecC), avg Toffoli 1,252,817 → 920,523 (−26.5 %). Score 3.2235e+09 →
2.3685e+09, gap 1.33×. avg Toffoli now sits UNDER HEAD's 1,384,984 by
33.5 %.
Vector E · g1 width audit (parked) — Initially flagged as defensive
over-allocation. Subagent traced bare g1 register name in
our allocator trace back to (gensym 'rawpa-tmp-add) at
lumbda/dgcd-raw-pa.lsp:1456 — lumbda C-tier
bi_gensym (builtins.c:2254) drops its prefix
argument + emits sequential g0/g1/g2… regardless of source
intent. Width n+1 = 257 = Cuccaro carries lane for mod-add!
(contract documented dgcd-raw-pa.lsp:727), threaded through
16+ downstream callsites. Narrowing would corrupt every mod-add inside
our 3 raw-* calls. Genuinely structural defensive allocation, not
over-allocation. (Bonus side-finding: lumbda C-tier gensym prefix-drop
merits separate substrate-cleanup ticket — would make every future
allocator-trace report carry meaningful register names.)
Status after C + D landed: 2,830 → 2,573 qubits, avg Toffoli 1.257 M
→ 920 K. Vector B remains queued — landing it would carry 2,573 → 2,317
qubits (−10 %); score at flat Toffoli
2.3685e+09 × (2317/2573) = 2.13e+09 = gap
1.20×. Beyond B, every additional 257-wide register
retired (one Bennett reacquire_vec port + future
caller-scope tightening) multiplies straight through our already-won
Toffoli budget. HEAD parity sits inside 5 more 257-wide buckets.
Lever-ranking finding (recurring): lever combinatorics under raw-pa
stack saturated at our prior 3.5455e+09 ceiling across 30+ HEAD-route
flag variants tested individually + paired (raw-quotient-terminal-reuse,
raw-ipmul-terminal-reuse, raw-ipmul-clear-p-residual,
host-reverse-raw-block, compressed-block-lifecycle, composite-scratch,
compressed-log-u-high-runway, borrow-zero-raw-future, …) — single shared
upstream change captures whatever these flags claim, not additive. Same
pattern earlier verified champion saw at 1.7019e+10. Breakthrough past
3.5455e+09 ceiling came from direct substrate edits (vecC
register-aliasing + vecD register-rebinding), not knob-tuning. Future
gap closure runs through more substrate work — Bennett
reacquire_vec port, vector B slot-shrink-on-narrow,
additional caller-scope tightening.
Lumbda Upstream Wins
Production emit at full secp256k1 width surfaced lumbda C-tier defects & gaps. Each landed upstream alongside our score run:
Precise NaN-box garbage collection
Lumbda C tier shipped with GC_disable() called
unconditionally after GC_INIT() — every allocation leaked
by design, an upstream-documented stopgap because Boehm's conservative
pointer scan cannot see through lumbda's NaN-box Value layout (heap
pointers live in low 48 bits, tag bits in upper mantissa, raw word never
looks like a heap address).
Our fix registers a custom Boehm mark kind whose mark proc walks
8-byte words in mixed mode: when QNAN (quiet not-a-number) bits set AND
tag identifies a pointer-bearing slot, extract our low-48 pointer; else
fall through to raw-pointer validation. A
GC_set_push_other_roots callback decodes NaN-boxed Values
on our C stack via setjmp anchor. GC_disable
deleted.
Measured: alloc-test (1 M cons pair
allocations, dropped each iteration) ran 0.6 s leaky / 156 MB pre-fix;
0.37 s flat / 4 MB post-fix. Tests pass 88/88 c-test,
4/4 regression-named-let-leak (very test that motivated
GC_disable), 410/410 functional, zoe-favorites across all
four tiers.
Binary-safe port primitives
QECCOPS1 op records carry 0x00 bytes throughout. Five C-tier defects silently corrupted any binary write:
bi_get_output_string&bi_write_string(file-port path) usedstrlen-based string functions — body truncated at first null byte. Switched to known-lengthfwrite&make_string.bi_write_charignored string-port destinations entirely & wrote to stdout. Added PORT_STRING branch routing throughport_write_str.string-refon bytes 0x80–0xFF returned char with codepoint −1 becauses->data[idx]sign-extended as signedchar. Cast throughunsigned char.pack_u64_slottreated bignum slots as raw fixnums —*no-slot*(sentinel 2⁶⁴ − 1) packed as raw pointer bytes instead of 0xFF…FF. Added IS_BIGNUM branch extracting low 64 magnitude bits.
New file I/O primitives
Four primitives mirrored across C tier & Python tier:
open-binary-output-file path— opensw+bso a caller can seek back to rewrite a header.port-set-position! port offset— fseek absolute offset on a file port.append-binary-file path data— opensab, fwrite-s bytes through.append-port-to-binary-file path port— streams a string-output port's buffer directly to disk without materializing(get-output-string port).
Plus port_write_str growth shifted from 2× to 1.5× past
256 MB — realloc transient peak (old + new) at 2× needs 24 GB for 8→16
GB; 1.5× bounds peak at 2.5×.
emit-stream file-port refactor
Previous emit accumulated body in a string-output port, materialized
via get-output-string, then
(string-append header body) before
write-binary-file. Peak RAM hit 3× body size — a 6 GB body
needed ~18 GB transient.
New path opens an open-binary-output-file, reserves 16
placeholder bytes for header, streams every gate straight to disk, then
port-set-position! 0 + write-string to rewrite
QECCOPS1 magic + n_ops u64 LE once n_ops carries a final value.
Constant RAM regardless of body size — 89.58 M ops /
4.7 GB body landed at 322 MB peak RSS.
Net upstream impact: lumbda C tier went from "GC disabled, leaks every alloc, fails on any binary write with embedded 0x00" to "precise NaN-box tracing + binary-safe ports + constant-RAM file emit" — usable for any future multi-GB binary workload, not solely our ecdsa pipeline.
Pareto Frontier We Aim To Beat
Full frontier as upstream README reports it (2026-06), plus our landing point — Toffoli (avg/shot) · peak qubits · score:
- HEAD production (jackylee0424 ``2dcf00d``, 2026-06-08): 1,454,884 Toffoli @ 1,300 qubits → 1.891e+09
- HEAD predecessor (SamrendraS ``a4db9a5``, 2026-06-08 17:00 UTC): same 1.891e+09 K2-pair-codec score floor (binder-notch + tail-nonce retune lane)
- HEAD K2-pair-codec opener (``73f4f48``, 2026-06-08 05:42 UTC): 1.891e+09 first landing
- HEAD prior frontier (jackylee0424 a66b042, 2026-06-06): 1,503,355 Toffoli @ 1,309 qubits → 1.968e+09 — retired
- Google low-gate: 2,100,000 Toffoli @ 1,425 qubits → 3.0e+09
- Google low-qubit: 2,700,000 Toffoli @ 1,175 qubits → 3.2e+09
- Textbook initial: 3,942,753 Toffoli @ 2,715 qubits → 1.07e+10
- DIALOG_GCD + borrowed-everywhere (sweep-017c): 23,563,264 Toffoli @ 4,886 qubits → 1.1513e+11
- + pseudo-Mersenne mod-double (sweep-030): 16,197,312 Toffoli @ 4,886 qubits → 7.9140e+10
Google's two private points each sit under 1,500 peak qubits — 1,175 & 1,425 respectively. HEAD production already passes both (1,300 qubits / 1.891e+09 score) — a public submission beat Google's private frontier before our entry landed. Our active race runs HEAD-vs-us; Google's points sit as a historical landmark, not a live target.
Our peak qubits sit at 2,573 vs HEAD's 1,285 — 2.00× over. Toffoli count 920,523 runs under HEAD's 1,384,984 by 33.5 % (1.505× better). Qubit-envelope reduction owns full remaining gap closure path. See Qubit Attack Path above for vectors A-E, B queued next, ~5 more 257-wide buckets to HEAD parity.
Session total since sweep-003 baseline (4.368e+13, our
entry score): 18,442× score reduction shipped, gap
closed from 23,099× behind HEAD's current promoted frontier
(+2,309,900 %) down to 1.33× behind
(+33 %).
Our lumbda-tier validation gates each variant against byte-identical cross-tier portals before we ever spend Rust submission budget.
Why Track Publicly
A challenge of this shape rewards aggressive, asymmetric search. Every factor of two off a Toffoli count, every qubit shaved off peak width, multiplies straight through Shor's algorithm. Documenting our path publicly — what we tried, what saturated, what surprised us — contributes intellectual capital to a commons that historically locked this work behind paid lab pages.
This page tracks public-facing progress only. Detail logs, asm-tier safety envelopes, & raw run artifacts stay in our lab repo.
Related Pages
- foxhop gpu-mesh — our 3-GPU pool hardware, bench numbers, & coexistence with our LLM services (qwen, Hermes, ollama, speech).
- lumbda.com/bend — bend primitive catalog; wire protocol; CUDA form list.
Remarkbox
Comments