DEMON: Diffusion Engine for Musical Orchestrated Noise

Streaming Diffusion Engine for Real-Time Music Generation

Abstract

DEMON is a streaming diffusion engine for music, built on ACE-Step v1.5 (2 B turbo and 5 B XL turbo). A ring buffer of in-flight generations, each carrying its own per-slot timestep schedule, is advanced by one batched decoder forward pass per tick; after warmup, at full pipeline depth, every tick produces a finished song latent. Every solver-side parameter (per-frame source preservation, velocity scaling, ODE noise injection, classifier-free guidance, x0-target morphing, channel gain) accepts a scalar or a per-frame curve at the latent's 25 Hz frame resolution, and a shared-mutable-curve lane propagates parameter changes to every in-flight slot on the next tick, independent of pipeline depth. Native TensorRT engines cover 60, 120, and 240 s song lengths; longer songs route through a sliding 60 s window. The decoder runs under TensorRT with a refit-enabled engine that hot-swaps LoRAs without rebuild.
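
The "scalar or per-frame curve" convention can be illustrated with a small helper. This is a minimal sketch, not DEMON's API: `frames_for`, `resolve_curve`, and `one_minus_t_cubed` are hypothetical names, and the 1 − t³ shape is borrowed from the demo titles on the assumption that t is the normalized position across the frames. Only the 25 Hz frame rate comes from the text.

```python
from __future__ import annotations

from typing import List, Sequence, Union

FRAME_RATE_HZ = 25  # latent frame rate stated in the abstract (40 ms per frame)

# A parameter value is either one scalar or one value per latent frame.
Curve = Union[float, Sequence[float]]


def frames_for(seconds: float) -> int:
    """Number of latent frames covering `seconds` of audio at 25 Hz."""
    return round(seconds * FRAME_RATE_HZ)


def resolve_curve(value: Curve, n_frames: int) -> List[float]:
    """Expand a scalar to a constant curve, or check a curve's length."""
    if isinstance(value, (int, float)):
        return [float(value)] * n_frames
    curve = [float(v) for v in value]
    if len(curve) != n_frames:
        raise ValueError(f"curve has {len(curve)} frames, expected {n_frames}")
    return curve


def one_minus_t_cubed(n_frames: int) -> List[float]:
    """Illustrative curve 1 - t^3, with t running from 0 to 1 over the frames."""
    if n_frames == 1:
        return [1.0]
    return [1.0 - (i / (n_frames - 1)) ** 3 for i in range(n_frames)]


if __name__ == "__main__":
    n = frames_for(60)  # frame count for a 60 s song
    constant = resolve_curve(0.5, n)            # scalar form
    shaped = resolve_curve(one_minus_t_cubed(n), n)  # per-frame form
    print(n, constant[0], shaped[0], shaped[-1])
```

Under this convention a 60 s song has 1500 frames, so a per-frame curve for it carries 1500 values, one per 40 ms of audio.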

Metric	depth = 1	depth = 4	depth = 8
Throughput (gen/s)	8.9	11.3	12.3
Per-tick latency	14.0 ms	42.8 ms	81.1 ms
Submission-time parameter convergence	112 ms	471 ms	649 ms
Shared-curve latency	1 tick	1 tick	1 tick
Per-frame control resolution	25 Hz (40 ms)
VAE windowed decode (3 s)	7 ms
LoRA refit	1.2 s, no engine rebuild
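
The rows of the table can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only the figures reported above; the function names are illustrative, and the interpretation of the ratios is a reading of the numbers rather than something the page states.

```python
# Figures copied from the table above, keyed by pipeline depth.
TICK_MS = {1: 14.0, 4: 42.8, 8: 81.1}
CONVERGENCE_MS = {1: 112.0, 4: 471.0, 8: 649.0}
THROUGHPUT_GEN_PER_S = {1: 8.9, 4: 11.3, 8: 12.3}


def ticks_per_generation(depth: int) -> float:
    """Average number of ticks that elapse per finished generation."""
    ms_per_generation = 1000.0 / THROUGHPUT_GEN_PER_S[depth]
    return ms_per_generation / TICK_MS[depth]


def ticks_to_converge(depth: int) -> float:
    """Submission-time parameter convergence expressed in ticks."""
    return CONVERGENCE_MS[depth] / TICK_MS[depth]


def shared_curve_latency_ms(depth: int) -> float:
    """Shared-curve latency is one tick, so it equals the per-tick latency."""
    return TICK_MS[depth]


if __name__ == "__main__":
    for depth in (1, 4, 8):
        print(
            depth,
            round(ticks_per_generation(depth), 2),
            round(ticks_to_converge(depth), 2),
            shared_curve_latency_ms(depth),
        )
```

At depth 8 the first ratio comes out at roughly one tick per finished generation, which agrees with the abstract's statement that every tick yields a finished latent at full pipeline depth. The one-tick shared-curve latency corresponds to 14.0, 42.8, and 81.1 ms, against 112, 471, and 649 ms for submission-time convergence.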

DEMON is the runtime, control surface, and acceleration layer; the diffusion model is ACE-Step v1.5, released by the ACE-Step team under MIT. The engine maintains a ring buffer of in-flight generations at staggered denoising stages. Crucially, each in-flight slot carries its own denoise scalar and its own timestep schedule: one batched decoder forward pass per tick advances slots that are simultaneously at different stages of different schedules. Native TensorRT engines cover 60, 120, and 240 s song lengths; longer songs route through a sliding 60 s window that advances at chunk boundaries.
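
The per-slot scheduling idea can be sketched with a toy pipeline. Everything here is an assumption made for illustration: `Slot`, `RingPipeline`, and `toy_forward` are hypothetical names, a float stands in for a latent tensor, and the "decoder" simply adds the timestep to the latent. What the sketch does reflect from the text is that each slot owns its schedule and that one batched call per tick advances all slots, whatever stage each is at.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Slot:
    request_id: int
    schedule: List[float]  # this slot's own timesteps
    latent: float = 0.0    # stand-in for a latent tensor
    step: int = 0          # index of the next timestep to run

    @property
    def done(self) -> bool:
        return self.step >= len(self.schedule)


BatchedForward = Callable[[List[float], List[float]], List[float]]


class RingPipeline:
    def __init__(self, depth: int, forward_batch: BatchedForward) -> None:
        self.slots: List[Optional[Slot]] = [None] * depth
        self.forward_batch = forward_batch

    def submit(self, slot: Slot) -> bool:
        """Place a request in the first empty slot; False if the ring is full."""
        if not slot.schedule:
            raise ValueError("schedule must contain at least one timestep")
        for i, occupant in enumerate(self.slots):
            if occupant is None:
                self.slots[i] = slot
                return True
        return False

    def tick(self) -> List[Slot]:
        """Advance every in-flight slot by one step with a single batched call."""
        active = [s for s in self.slots if s is not None]
        if not active:
            return []
        latents = [s.latent for s in active]
        timesteps = [s.schedule[s.step] for s in active]  # one timestep per slot
        updated = self.forward_batch(latents, timesteps)
        for slot, latent in zip(active, updated):
            slot.latent = latent
            slot.step += 1
        finished = [s for s in active if s.done]
        # Free the slots whose schedules are exhausted.
        self.slots = [None if (s is not None and s.done) else s for s in self.slots]
        return finished


def toy_forward(latents: List[float], timesteps: List[float]) -> List[float]:
    """Placeholder for the batched decoder pass: adds each timestep to its latent."""
    return [x + t for x, t in zip(latents, timesteps)]


if __name__ == "__main__":
    pipe = RingPipeline(depth=2, forward_batch=toy_forward)
    pipe.submit(Slot(1, [1.0, 0.5]))          # two-step schedule
    pipe.submit(Slot(2, [0.75, 0.5, 0.25]))   # three-step schedule
    for _ in range(3):
        print([s.request_id for s in pipe.tick()])
```

Because the timestep is looked up per slot inside `tick`, slots with schedules of different lengths and values share the same batch, and a freed slot can be refilled without resetting the others.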

Two control lanes coexist. Submission-time parameters (text conditioning, source audio, denoise) enter the pipeline when a new request is submitted and reach the next emptied slot within one tick; they then take effect over that slot's remaining schedule. Step-time parameters (per-frame source preservation, x0-target morph, velocity scaling, ODE noise injection, channel gain, guidance, CFG rescale, APG momentum, DCW scalers) live in a shared mutable registry that every slot reads on every forward pass; writing to that registry takes effect on the next tick for every in-flight slot at once, regardless of pipeline depth. The decoder runs under TensorRT with refit enabled, so LoRA deltas are written into the live engine without a rebuild. The paper covers the SDE derivation behind the per-frame source-preservation curve, the windowed-decode receptive-field analysis, and the TensorRT precision recipe.
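
The difference between the two lanes comes down to copy versus reference. In the hedged sketch below, `SubmissionParams`, `SharedCurves`, and `InFlight` are hypothetical names, not DEMON's classes: submission-time values are frozen into the slot when it is created, while step-time values are read through a single shared object on every pass.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SubmissionParams:
    """Submission-time lane: fixed for the lifetime of one slot."""
    prompt: str
    denoise: float


class SharedCurves:
    """Step-time lane: one mutable registry read by every slot on every pass."""

    def __init__(self, **values: float) -> None:
        self._values: Dict[str, float] = dict(values)

    def set(self, name: str, value: float) -> None:
        self._values[name] = value

    def get(self, name: str) -> float:
        return self._values[name]


@dataclass
class InFlight:
    params: SubmissionParams  # snapshot taken at submit time
    curves: SharedCurves      # reference to the shared registry, not a copy


def step_inputs(slots: List[InFlight]) -> List[Tuple[str, float]]:
    """Gather what one forward pass would read for each in-flight slot."""
    return [(s.params.prompt, s.curves.get("guidance")) for s in slots]


if __name__ == "__main__":
    curves = SharedCurves(guidance=3.0)
    slots = [
        InFlight(SubmissionParams("ambient", 0.6), curves),
        InFlight(SubmissionParams("techno", 0.8), curves),
    ]
    print(step_inputs(slots))
    curves.set("guidance", 7.0)  # one write, seen by every slot on the next pass
    print(step_inputs(slots))
```

A write to the registry changes what every slot reads on its next pass, which is why its latency does not grow with pipeline depth, whereas a new prompt only affects slots created after it is submitted.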

@article{acestep2026,
  title   = {ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author  = {Gong and others},
  journal = {arXiv preprint arXiv:2602.00744},
  year    = {2026}
}

Bird demo with hand control

Live control on the 5 B model

Genre transfer via LoRA

Agentic control via MCP

Streaming pipeline does not degrade quality

Per-frame source preservation on a 1 − t³ curve

Per-tick scalar denoise on the same 1 − t³ curve

Heterogeneous per-slot scheduling vs. global reset

Per-frame latent morph between two cover variants
