Ribbon renderer data corruption on AMD GPUs

On AMD GPUs, the ribbon renderer is susceptible to memory corruption when under stress. This manifests as a performance hit and/or visible rendering glitches, where the ribbon’s triangles “explode” or expand to infinity.

This behavior does not happen with ribbon renderers that use CPU sim stages, or on GPUs from other manufacturers.

Included are step-by-step instructions to reproduce this problem on a vanilla UE 5.6 installation, as well as a sample DxDiag output from one of our local AMD GPU workstations.

Steps to Reproduce
  1. Using a workstation equipped with an AMD GPU, unzip the provided Niagara System (NS_GPU_Ribbons) into the content folder of an Unreal 5.6 distribution. Place an instance of this system in an empty level and observe the normal behavior of the emitters.
  2. Open the Niagara System. Find the emitter labeled “Spawn_Per_Frame_Ribbon” at the very bottom, inside a red comment box; switch off its SpawnRate module and switch on its SpawnPerFrame module. Save the system. Observe that the emitters in the level eventually stutter and render erratically.
  3. Switch off SpawnPerFrame and switch SpawnRate back on. The emitters may go back to normal behavior. Change the spawn rate to 100 and observe the same erratic behavior again.
  4. Switch all ribbons to CPU Sim. Repeat the experiment and observe that they always behave normally.
  5. Reproduce this experiment on a workstation with a non-AMD GPU. Observe that the emitters always behave normally.

Another important detail: the emitter’s Spawn Per Frame module seems to trigger this problem much faster than any other spawning method (see repro steps above).

(Attaching files once again just in case, since I don’t see them in the original message.)

Hey Dan,

Have you tested this on consoles which use AMD GPUs at all to see if it repros?

If it’s PC AMD specific and there are no -d3ddebug issues / -rhivalidation warnings then it’s likely going to be something I need to contact AMD about.

I can ask QA to take a stab at repro’ing also just to confirm it’s not driver version specific.

And one last question: what GPUs have you tested this with?

Thanks,

Stu

Hey Stu!

I had some time this week to investigate this further, and I think I figured out what’s going on.

The issue happens in FNiagaraRibbonAggregationStepCS (NiagaraRibbonAggregationStep.usf), and it’s caused by a mismatch between the size of the FRibbonAccumulationValues struct in the shader (NiagaraRibbonCommon.ush) and the stride of the TransientAccumulation buffers.

The shader structure is correct. Each ribbon renderer determines the correct stride via GetAccumulationStructSize and sets the right permutation vector defs. However, if you have two consecutive ribbon renderers with different strides, and the first allocated transient buffer happens to be big enough to fit the data of the second, the buffers won’t get re-initialized for the second renderer, and its stride will be wrong.
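
To make the mechanism concrete, here is a rough paraphrase of the buffer-reuse check as I read it (not verbatim engine code; the elided body and the comments are mine, the identifiers are the same ones used in the workaround below):

```cpp
// Paraphrased sketch of the check in FNiagaraRibbonGPUInitComputeBuffers::InitOrUpdateBuffers.
// The element count depends on the renderer's feature set, so the stride differs between
// ribbon renderers (e.g. 24 bytes with full ribbon IDs vs 20 bytes without, as below).
const uint32 AccumulationBufferStructSize =
	GetAccumulationStructSize(bWantsMultiRibbon, bWantsTessellation, bWantsTessellationTwist) * sizeof(float);

// The transient buffers are only (re)created when they are too small in raw bytes.
// A buffer left over from a previous renderer that happens to be big enough is reused
// as-is, keeping its old stride, even if the current renderer expects a different one.
if (TransientAccumulation[0].NumBytes < AccumulationBufferStructSize * NeededSize)
{
	// (re)allocate TransientAccumulation with the new stride
}
```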

Example:

Ribbon Renderer A: Using full ribbon IDs, stride: 24

Ribbon Renderer B: Not using them, stride: 20

After A, TransientAccumulation is already large enough to fit B, so it won’t be re-initialized (InitOrUpdateBuffers); its stride remains 24. On AMD cards, you’ll get the wrong transient buffer offsets for B, and your ribbons will explode :frowning:

One work-around is this:

```cpp
// In FNiagaraRibbonGPUInitComputeBuffers::InitOrUpdateBuffers

const uint32 AccumulationBufferStructSize = GetAccumulationStructSize(bWantsMultiRibbon, bWantsTessellation, bWantsTessellationTwist) * sizeof(float);

if (TransientAccumulation[0].NumBytes < (AccumulationBufferStructSize * NeededSize)
	// FIX: Account for stride differences
	|| (TransientAccumulation[0].Buffer->GetStride() != AccumulationBufferStructSize))
{
	// ... existing buffer (re)initialization, unchanged ...
}
```

I’ve tested this on our AMD workstation, and everything seems to be working fine now. You can also test it in a vanilla distribution by verifying that things break using the original repro above, and then applying this patch.

I’m going to run some diagnostics on my NVidia workstation; I wonder if NVidia handles this buffer’s memory differently.

Yeah, running with -rhivalidation on my NVidia workstation throws the expected size mismatch error as well. Somehow, NVidia GPUs don’t seem aggravated by it. The patch fixes the error there too.

Perhaps a better work-around for now, rather than re-allocating buffers, is to pay a bit more memory and use a constant-size FRibbonAccumulationValues structure:

```hlsl
struct FRibbonAccumulationValues
{
	float RibbonDistance;
	uint  SegmentCount;
	uint  MultiRibbonCount;
	float TessTotalLength;
	float TessAvgSegmentLength;
	float TessAvgSegmentAngle;
	float TessTwistAvgAngle;
	float TessTwistAvgWidth;
};
```

We then make AccumulationBufferStructSize always be 32.
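
For clarity, here is a minimal sketch of the corresponding host-side check under this approach (the elided body and the comments are mine; the identifiers are the ones already discussed above):

```cpp
// Sketch only, assuming the constant-size struct above: with every renderer using the full
// 8-float FRibbonAccumulationValues, the stride is the same for all of them, so the plain
// "is the existing buffer big enough" check is safe again.
const uint32 AccumulationBufferStructSize = 8 * sizeof(float); // always 32 bytes

if (TransientAccumulation[0].NumBytes < AccumulationBufferStructSize * NeededSize)
{
	// (re)allocate the transient buffers with the 32-byte stride
}
```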

A bit ugly, but works.

We have been experiencing corruption on ribbons as well in 5.6 when using different renderers with different tessellation configurations.

Our current workaround is to use a different set of compute buffers (FNiagaraRibbonGPUInitComputeBuffers) in FNiagaraGpuRibbonsDataManager::GenerateAllGPUData() for each configuration, replacing the ComputeBuffer member with a new ComputeBuffersPerConfig set.

This adds the following line in FNiagaraGpuRibbonsDataManager::GenerateAllGPUData():

auto& ComputeBuffers = GetComputeBuffersForConfig(RendererToGen.Renderer->GenerationConfig);

We are using HasRibbonIDs(), WantsAutomaticTessellation(), and HasTwist() to select the compute buffers, but maybe, as suggested here, it is simply a matter of having one FNiagaraRibbonGPUInitComputeBuffers for every size of FRibbonAccumulationValues.
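
For reference, a rough sketch of the kind of per-configuration lookup we mean (the key struct, hashing, and layout here are illustrative and not our actual patch; only FNiagaraRibbonGPUInitComputeBuffers, GetComputeBuffersForConfig, and the three accessors above come from the discussion):

```cpp
// Hypothetical key describing a ribbon renderer's feature combination.
struct FRibbonComputeBufferConfigKey
{
	bool bHasRibbonIDs = false;
	bool bAutomaticTessellation = false;
	bool bHasTwist = false;

	bool operator==(const FRibbonComputeBufferConfigKey&) const = default;

	friend uint32 GetTypeHash(const FRibbonComputeBufferConfigKey& Key)
	{
		// Pack the three flags into a small hash; 8 possible configurations in total.
		return (Key.bHasRibbonIDs ? 4u : 0u) | (Key.bAutomaticTessellation ? 2u : 0u) | (Key.bHasTwist ? 1u : 0u);
	}
};

// Replaces the single ComputeBuffer member (in the real patch this lives on the data manager).
TMap<FRibbonComputeBufferConfigKey, FNiagaraRibbonGPUInitComputeBuffers> ComputeBuffersPerConfig;

template <typename TGenerationConfig>
FNiagaraRibbonGPUInitComputeBuffers& GetComputeBuffersForConfig(const TGenerationConfig& GenerationConfig)
{
	const FRibbonComputeBufferConfigKey Key{
		GenerationConfig.HasRibbonIDs(),
		GenerationConfig.WantsAutomaticTessellation(),
		GenerationConfig.HasTwist()
	};
	// One FNiagaraRibbonGPUInitComputeBuffers per feature combination, created on demand.
	return ComputeBuffersPerConfig.FindOrAdd(Key);
}
```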

The best solution would be the one using the least memory, so let us know if there is a fix that can keep the size to a minimum while still allowing the buffers to be reused.

Uriel

Hi,

Catching up after the summer break, thanks for getting to the bottom of this.

I can see a benefit to either direction. Did you happen to measure the performance with / without permutations, Dan? Part of me wonders about not using a structured buffer, or having separate views on the buffer instead.

Thanks,

Stu

I tested out your method, Dan; this seems like the simplest approach, and I didn’t notice any changes in performance (I didn’t remove any permutations, just fixed the size like you did).

An alternative might be to change to a ByteAddressBuffer instead, but I think I’m going to follow this for now.

We do want to add a single pass compute path (i.e. keep things in group shared) for lower particle counts, or make a new ribbon renderer which is more GPU centric, so that might be a time when we revisit this.

Thanks,

Stu

My fix is submitted in CL 44042760 (UE5 Main). I’ve basically moved the struct into a shared header and made it constant size.

Thanks for finding and figuring this out.

Thanks,

Stu

Hi, Uriel!

I thought of using feature-fitted buffers, but then you have to manage more RHI resources, and don’t get to re-use some of the memory you’ve already allocated.

In the end, we went with just using a full FRibbonAccumulationValues struct, because it is simpler, the divergence is minimal, and you only waste a bit of ‘padding’ space when you don’t have the full feature set (this waste is less than having to allocate another full buffer for a ribbon that uses a different feature set).

The change is simple:

  1. On NiagaraRibbonCommon.ush, make FRibbonAccumulationValues a constant size structure (remove the permutation defs)
  2. At InitOrUpdateBuffers for FNiagaraRibbonGPUInitComputeBuffers, make AccumulationBufferStructSize = 8 * sizeof(float)

Give this a try; it might be all you need as well.

Stu will give us professional and final advice later, of course.

As an addendum, in our use case each of our Niagara Systems has a mix of ribbon renderers with a mix of features. If your case is renderers with the same features in each NS, then I’d agree that a feature-fitted buffer is probably best.

Hey, Stu!

Welcome back.

Right. Other than verifying that AMD GPUs were happy (and they were, no RHI/D3D errors or stalls), I didn’t measure performance in too much detail because I was expecting it to be very similar, albeit with some slightly larger buffers. The buffer sizes I did measure, and in our case the “waste” would go from 6% to 25% of the buffer’s size (worst case) in a NS with about 6 ribbon renderers with a full range of different features. Mind you, we’re talking about a buffer of 75 ribbons, which amounts to roughly 2400 bytes (total buffer size), so quite small when put in perspective.

The fitted buffer idea is a very good one too, if we also want to optimize storage, but I think we’d have to do this more globally to reap the best benefits. At this point a GPU centric system like you describe becomes the better option, but of course it means larger changes.

Thank you both for the ideas and discussion!

Dan

Thanks for your help, Stu!