Been trying to optimize parallax occlusion mapping, here is some progress, come discuss! (need help!)

anonymous_user_4a8d4e3f · December 18, 2016, 9:31am

I love CG, but I haven't studied it systematically, so if you see mistakes or even blunders, feel free to correct me.

1. Tangent is better!
There are two key parameters in POM: HeightRatio (texture height relative to texture width) and precision (the size of the biggest lump you want to see clearly). Precision is basically the step size: everything that falls between two steps is treated as linear, so we can compute an accurate ray intersection. The step size is derived from the step count, so computing the step count is important.
Unreal uses a cosine and max/min steps to emulate the step count, but the step count actually needed rises as a tangent curve, so in some cases this wastes computation.

Here is a bad case for demonstration:

You can see part 2 is actually more accurate than part 1. That is because of the difference in rising rate between cosine and tangent: if you want to see part 2 clearly, part 1 gets more computing power than it needs, and part 1 is the bigger area on screen.
No need to worry about tan() increasing the instruction count: we have a free sine, and tan() can be reused in many cases later, like pixel depth offset, so in the end it actually saves some instructions.
As you can see on the graph, the reason we compute a step count rather than use the step size directly is that a constant step size makes the step count go sky high when the camera looks at a small angle,
so the best solution is like this (performance-wise):

and the material graph:

Here using tan to calculate PDO avoids a vec3 distance(), but the result is weird and I don't know why. Just scale it down before use and you'll be good. I also think using opacity mask would work too, and be cheaper, if you don't need an accurate intersection. I haven't tested position yet; it's not very often used. I also don't think calculating shadows is very practical. I see that we have contact shadows now, which should be better.
A comparison:

2. Idea about “RippleRayPOM”.
POM is expensive because it takes loops or steps, but most steps just hit the air and are wasted, so I came up with an idea. Imagine this: in texture space, for every pixel, grow a hemisphere from the top to mask out as much empty space as possible, until it collides with the heightmap geometry, then store the radius in another channel. When ray tracing, read the value in that channel; if the ray is inside that hemisphere, it can jump in whatever direction it is heading until it leaves the sphere, knowing that it won't miss anything, then read another hemisphere. Since we need to look up the texture anyway, why not use the other channels?
So the whole idea is to use simple geometry (shapes that have a radius) to block out empty space and reduce the steps that hit the air. A cylinder is better: it takes two parameters, or two channels, to define, but it adapts to different shapes more nicely, especially since in POM the texture space is narrow vertically.
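A minimal C++ sketch of the cylinder jump described above, under stated assumptions: the cylinder read at a texel is centred on that texel, spans from the top of the height volume down to `bottom`, and `tan_angle` is the height the ray drops per unit of UV distance travelled. The name `SafeDrop` and its parameters are made up for illustration and are not from the original material.

```cpp
#include <algorithm>
#include <cassert>

// Height the ray may drop in one jump without leaving the empty cylinder.
// The ray sits on the cylinder axis, so it leaves either through the side
// wall (after `radius` of UV travel) or through the bottom, whichever is first.
inline float SafeDrop(float ray_height, float radius, float bottom, float tan_angle)
{
    assert(tan_angle > 0.f && radius >= 0.f);
    if (ray_height <= bottom) {
        return 0.f; // ray is already below the cylinder, nothing to skip
    }
    const float drop_to_wall = radius * tan_angle;    // exits through the side
    const float drop_to_bottom = ray_height - bottom; // exits through the floor
    return std::min(drop_to_wall, drop_to_bottom);
}
```

The caller would still fall back to a normal fixed step whenever the returned drop is smaller than one step, otherwise the trace could stall near the surface.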

I wrote a small program to generate this kind of map and here is the result. It looks like the shape, but clearly it isn't working as expected. I decided to give up; all this is just beyond my knowledge. If you are interested, please check it out and tell me if I misunderstand something. I'll upload details later.

Material graph. The code in the text editor can't compile, so I inlined the function manually in the custom node:

3. I want to cut out the part that is being looked at from a narrow angle and leave a silhouette using opacity mask. I tried clamping ray length and ray height but it didn't work, and I don't know why. I'll upload details later.

4. Some questions
. Is it better to store tan() in a texture? If so, it would be easier to emulate steps more controllably. Is there a one-dimensional single-channel texture?
. How much does a texture lookup cost, generally speaking: dependent/independent lookup, mip lookup, channel lookup, continuous lookup?
. Would using many independent if statements in HLSL affect performance? Dependent ones?
. What is causing the twinkling when looking at POM at a narrow angle?

Deathrey · December 18, 2016, 11:30am

Good stuff!
I just want to add that it would be very nice to see some metrics alongside these experiments.

This is what I have to say:

Complicating the max steps calculation to better re-balance the distribution of detail might not be worth it. Your calculation seems more expensive than a simple dot.
Your idea about minimizing waste from tracing empty space is not very far from relaxed cone step mapping.

I’ve never tried that, but perhaps using a LUT might be better, provided that it is pretty low res and you are using it to replace quite complex math.

Yep. By a pretty good margin actually.

Temporal Anti-Aliasing most likely.

anonymous_user_4a8d4e3f · December 18, 2016, 12:24pm

I did this out of interest. Actually I'm very new to HLSL so I can't give very scientific measurements, but later on I'll post the whole mechanics and algorithm.
It takes 4 instructions to compute the tangent, and it saves at least a vec3 dot and a vec3 distance, which is at least 6 instructions (information from Google). Tan is also crucial for the RippleRay trace. The big part is that tan saves unnecessary texture lookups. I didn't calculate how many, but it's not a few. Think of this: the original method uses 8 steps minimum even at a straight angle, where the image is flat.
I'll look into cone step mapping. Can't believe someone did this before. Now I do think a cone is better than a cylinder, maybe a cone combined with a cylinder.

RyanB · December 18, 2016, 7:21pm

I did experiments using the sphere DF-at-surface method a while back and posted some performance results. Its fairly promising, just requires additional texture processing per asset and turned out to not be of much benefit until the step count was fairly high, which meant it didn’t help me make it run any faster on PS4 (where we limit 16-32 MAX, and the real savings kicked in after 64). Of course I am sure that could change with the right tweaks:

https://forums.unrealengine.com/showthread.php?49169-POM-material&p=392889&viewfull=1#post392889

Also looks like you are using my debug shader from the contentexamples POM level to show the steps debug. There is actually another mode kind of hidden in that debug material called “distance field cone slices” that performs a different sphere test based on incoming angle. It is kind of like “relaxed cone step mapping” but it instead rejects samples based on incoming angle.

Here is a basic image of the concept (I think you can actually make the shader in content examples display this but I cant remember what the param was called):

In the start it shows just a single sphere of distance to nearest point from surface, but then it shows the effect of extending each sphere based on the minimum point it could hit at that incoming angle which can be far more aggressive. This is using 4 cones.

For the single-sphere only (like the thread link) you can generate that texture using the gpu in realtime pretty easily, but the DF-cone-slice requires searching almost the entire texture for each pixel so I did it in c++ using the “composite texture” feature. And it took over 5 minutes to process the 512x512 texture since I wasn’t doing anything using multi threading. I think now that we have a better interface for writing directly to render targets using BPs, I would try to redo it all using materials using an iterative approach that would be much faster.

This method was actually way faster than the first test I did, but required additional V4 of uncompressed texture data, and the speedup was very content dependent. Ie, in this example heightmap with lots of big, low-frequency negative space it worked well. For the noisy rocks example like the physics rock-pile I made for the kite demo above, it did not save very much because the ray was typically hitting stuff quickly anyways (which means the cone-spheres never cast far) But again, I probably didn’t find the most optimal implementation. I was only checking the cone slices once at the beginning and using that to determine an advanced starting position, but then degenerating to a linear search.

The problem with continuing to use the DF samples is that:
A) DF sphere is only truly valid at surface, which means to take full advantage in loop means doing ray-sphere intersection which saves little extra
B) Full 3D DF can accelerate further but means you need an actual 3d volume texture storing the DF at every Z value.
C) You make the inner loop more expensive when it may not always help.

anonymous_user_4a8d4e3f · December 19, 2016, 1:23am

“DF sphere is only truly valid at surface”
Actually I want the ray to leave the sphere if it is anywhere inside it, not just at the sphere origin, but I can't figure out a simple algorithm to do that, so I changed to a cylinder, which is easy to calculate intersections for and adapts better to empty space, especially the upper space where the ray travels at a narrow angle.

" in this example heightmap with lots of big, low-frequency negative space it worked well."
Yes, I was considering this case in the first place. I think it's a general case, like a sci-fi scene.

“turned out to not be of much benefit until the step count was fairly high”
I don't follow. Wouldn't it be better, because inside the sphere it jumps over more steps, and outside the sphere it carries on just like normal POM?

“but required additional V4 of uncompressed texture data”
Aren't there 2 or 3 channel slots besides the heightmap? Can it be 3 cones? Since we have to look up at that UV anyway, can we just grab the other channels along the way?

I sense that you haven't gone through all my subjects. I'll post details later; hope you can take a look.

RyanB · December 19, 2016, 2:01am

If you store the distances (using any method) using a 2d plane, at the surface, the distances will only be valid from the surface. Of course you can still continue to use them, but the effective usefulness quickly dies off beyond the 1st iteration, so I was bringing that up as a point to say, these kinds of acceleration techniques often work better just done once or twice completely outside of the main loop.

Re: cylinder vs spheres, it’s hard to say. The cylinder may appear better for some content; the sphere may work better for others. You could build an ideal heightmap where either outperforms. Cones are probably best, especially Relaxed Cones, which Deathrey posted a link about.

If you store the distances as a volume texture, you can accelerate the same amount each step. This is the basis for using true sphere tracing acceleration.
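For reference, a minimal C++ sketch of the sphere tracing loop this refers to. The distance lookup is passed in as a function; `PlaneDistance` is only a hypothetical stand-in for sampling a volume texture, and all names here are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct Vec3 { float x, y, z; };

// Marches along origin + dir * t, advancing each step by the stored distance
// to the nearest surface. Returns the hit distance t, or -1 if nothing was
// hit within max_steps or max_t. `dir` is assumed to be normalized.
inline float SphereTrace(Vec3 origin, Vec3 dir,
                         const std::function<float(Vec3)>& distance,
                         int max_steps, float max_t, float epsilon)
{
    assert(max_steps > 0 && epsilon > 0.f);
    float t = 0.f;
    for (int i = 0; i < max_steps && t < max_t; ++i) {
        const Vec3 p{origin.x + dir.x * t, origin.y + dir.y * t, origin.z + dir.z * t};
        const float d = distance(p);
        if (d < epsilon) {
            return t; // close enough to the surface
        }
        t += d; // the empty sphere of radius d around p can be skipped entirely
    }
    return -1.f;
}

// Hypothetical distance field: distance to the plane z = 0.
inline float PlaneDistance(Vec3 p) { return p.z; }
```

Because the distance is valid at every point in the volume, each step gets the full skip, which is what a 2D map stored at the surface cannot offer after the first iteration.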

Did you look at the link with the profile gpu comparisons I posted showing the timing of each step count, with and without the sphere acceleration? When you are only doing 16 or so steps, its already pretty fast. Doing a bit of math to save a fraction of those steps ends up not mattering much because in this low step count, the overhead of the shader itself still matters and the iterations are cheap. Yes, you can add some math and maybe save 4-8 lookups, but that likely won’t show up in profiling.

Once we are doing more than 64 iterations, the cost of the raw iterations dwarfs the other costs, so the extra work to skip some steps helps more. Its just that when you have a shader that isn’t doing much work, going to a bunch of effort to optimize it doesn’t always pay off.

RyanB · December 19, 2016, 3:34am

To show why I started looking at a method that worked better as a first step, here is an example of using sphere traces to accelerate. Imagine a heightmap where the value is always black for simplicity. Using sphere distance, the first iteration can start at:

This acceleration is almost free because you can simply accelerate the ray by TraceVector * the radius (normalized for the height of course). But to take the next optimized step like this:

requires more math to calculate correctly because you can’t just use the radial distance. It ends up being a sphere intersection instead of just a multiply and add. So it has some overhead. I ended up trying some conservative approximation instead that mostly worked. I can’t remember exactly what it was, but I basically shortened the acceleration by some linear factor instead of calculating it correctly.
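To make the extra cost concrete, here is a C++ sketch of the exact version: the distance a ray must travel to leave a sphere when it starts somewhere inside it rather than at the centre. This is the standard ray-sphere quadratic, not the approximation mentioned above, and the names are illustrative.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Distance along the normalized direction `d` from point `p` (inside the
// sphere) to the sphere boundary. `c` is the sphere centre, `r` its radius.
inline float SphereExitDistance(Vec3 p, Vec3 d, Vec3 c, float r)
{
    const Vec3 m{p.x - c.x, p.y - c.y, p.z - c.z};
    const float b = Dot(m, d);
    const float cc = Dot(m, m) - r * r; // negative or zero while p is inside
    assert(cc <= 0.f);
    // Larger root of t^2 + 2bt + cc = 0; the discriminant is non-negative
    // here because cc <= 0.
    return -b + std::sqrt(b * b - cc);
}
```

When `p` is the centre this collapses to `r`, which is the cheap multiply-and-add case; anywhere else it costs two dot products and a square root per step.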

Your idea to encode steps using tangent instead sounds promising. I will need to test it at some point and see how it interacts with tempAA and other things.

I suggest you recreate your setup using the “preview shader complexity” setup in the POM node to see how many steps this is actually taking visualized by color.

In your example comparing the image of min/max at 8-32 and the precision/max at 0.01 and max 32 may be a bit misleading, since it may end up doing closer to the max steps for more of the screen area.

anonymous_user_4a8d4e3f · December 19, 2016, 3:42am

Detail:

1. Tangent steps
The best step size is the smallest UV distance over which the heightfield in between can be considered relatively linear. If the step size keeps getting smaller it won't look any better, but it will take more texture lookups. That means the step count it takes to look “just” good (one step more, more waste; one step less, more distortion) is fixed. Its relation to camera angle is: steps equals the height of the texture divided by tan(camera angle), which gives the UV distance, then divided by the step size. But using cosine it will be: steps = (maxsteps - minsteps) * cos + minsteps. That means even when looking at a straight angle it takes minsteps steps (usually the average step count is half of the theoretical count) where it only needs one step. So to speak, if you make your material look good at 45 degrees, then every area above 45 degrees uses more steps than it needs, and that is the bigger area on screen. In all, in the graph above, all steps that fall between the two curves are potential waste.
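The two formulas can be compared with a small C++ sketch. It assumes the camera angle is measured from the surface plane, in radians, and it adds a max-steps clamp so the tangent version cannot blow up at grazing angles. `TangentSteps` and `CosineSteps` are illustrative names, and `CosineSteps` follows the formula as written in this post rather than the engine's exact node.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Step count from ray geometry: the ray covers height_ratio / tan(angle) in
// UV space before it reaches the bottom, and each step covers step_size.
inline int TangentSteps(float angle, float height_ratio, float step_size, int max_steps)
{
    assert(step_size > 0.f && max_steps >= 1);
    // Written as cos / sin so that looking straight down gives a distance of ~0.
    const float s = std::max(std::sin(angle), 1e-4f);
    const float uv_distance = height_ratio * std::cos(angle) / s;
    const float steps = std::ceil(uv_distance / step_size);
    return static_cast<int>(std::min(std::max(steps, 1.f), static_cast<float>(max_steps)));
}

// The min/max blend: steps = (max_steps - min_steps) * cos(angle) + min_steps.
inline int CosineSteps(float angle, int min_steps, int max_steps)
{
    const float t = std::cos(angle);
    return static_cast<int>(std::lround(min_steps + (max_steps - min_steps) * t));
}
```

With a height ratio of 0.05 and a step size of 0.02, looking straight down needs a single tangent step, while the cosine blend with min 8 and max 32 still spends 8.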

2. Tangent steps material
After the camera vector is transformed to tangent space, its z component is sin(camera angle), so after 4 instructions we get tan(camera angle).
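A C++ sketch of that conversion, assuming a normalized camera vector and an angle measured from the surface plane. On a GPU this maps to roughly a multiply, a subtract, a reciprocal square root and a multiply, which is likely where the count of 4 comes from; that mapping is an assumption and not something measured here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// z is the tangent-space Z of the normalized camera vector, i.e. sin(angle).
// tan = sin / cos, with cos = sqrt(1 - sin^2).
inline float TanFromSin(float z)
{
    assert(z >= 0.f && z <= 1.f);
    const float cos_sq = std::max(1.f - z * z, 1e-8f); // avoid dividing by zero when looking straight down
    return z / std::sqrt(cos_sq);
}
```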


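float rh = 1;    // ray height, starts at the top of the height volume
float prh = 1;   // previous ray height
float th = 0;    // texture height at the current sample
float pth = 0;   // previous texture height
float ri = 0;    // intersection ratio within the last step

for (int i = 0; i < Steps + 1; i++)
{
    th = Texture.SampleLevel(TextureSampler, UV, 0)[HeightMapChannel];
    if (rh <= th)
    {
        // ray is beneath the surface: treat the last segment as linear and step back to the crossing
        ri = (th - rh) / max(prh - pth + th - rh, 1e-5);
        UV += UVStep * ri;
        break;
    }
    prh = rh;
    pth = th;
    rh -= HeightStep;
    UV -= UVStep;
}

// UV (float2), ri and rh make four components, so the node output has to be a float4
return float4(UV, ri, rh);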

I don't compute distance here; I use tan() from before.

3. RippleStepsPOM
The nature of a heightfield is that the higher you go, the emptier it is, so geometry that is bigger at the top and smaller at the bottom is suitable: cone, hemisphere, truncated cone, cone under a cylinder. A sphere isn't so good, because texture space is usually narrow vertically and a sphere will collide with the heightfield very quickly. Personally I think a cone under a cylinder can block the most space, and it only needs 3 parameters to define. At first I thought the best cylinder is the one with the biggest volume, which is harder to compute. Then I realized the upper space is more valuable to block, because this is where the ray travels at a narrow angle (I think even at a narrow angle the ray won't go far in UV space, and is more likely to stop at a higher part of the heightfield). So I just compute the cylinder's radius as the average radius from the top down to the heightfield underneath. It is affected more by the upper space because higher is emptier, but it also takes the lower space into account. Then I use this radius to grow the cylinder downward until it collides. The collision test is rather complex, so I just make it stop when it collides.



void Simulate()
{
	// iterate over every pixel (LookUp, ReWrite and the constants are defined elsewhere in the generator)
	for (float v = 0.f; v <1.f; v += PixelWidth)
	{
		cout << v;
		for (float u = 0.f; u < 1.f; u += PixelWidth)
		{
			//calculate radius
			float ch=LookUp(u, v, 0.f, 0.f, HeightMapChannel);//central height
			float pr = 0.f;//collect sums of radius at every height step
			//iterate pixel as growing circle
			for (float d = TextureHeightRatio; d >ch; d-= PixelHeight/ SimulationPrecision)
			{
				for (float r = PixelWidth; r <=1; r+=PixelWidth )
				{
					bool f = false;
					for(float a=0.f;a<2* PI;a+=0.1)//a+=PixelWidth/r
					{
						float uo = r*cos(a);
						float vo = r*sin(a);
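						// if it collides or has grown to the max radius, add the radius to the sum
						// (r == 1 is unreliable with an accumulated float, so test for the last ring instead)
						if (r + PixelWidth > 1 || d < LookUp(u, v, uo, vo, HeightMapChannel)) { pr += r; f = true; break; }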
					}
					if(f) {break;}
				}
			}
			//calculate average radius,at least bigger than smallest step size
			pr = pr *(PixelHeight / SimulationPrecision) / (TextureHeightRatio - ch);
			pr = pr>TexturePrecision?pr:TexturePrecision;
			ReWrite(u, v, 0.f, 0.f, ProbeRadiusChannel, pr);

			//calculate cylinder depth by growing downwards
			for(float d= TextureHeightRatio;d>=0.f; d -= PixelHeight / SimulationPrecision)
			{
				bool f=false;
				//search for a collision within the radius
				for (float r = TexturePrecision/2; r <= pr- TexturePrecision / 2; r += PixelWidth)
				{
					for(float a=0.f;a<2* PI;a+= 0.1)
					{
						float uo = r*cos(a);
						float vo = r*sin(a);
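						// d == 0.f is rarely hit exactly with an accumulated float, so test for the last step instead
						if (LookUp(u, v, uo, vo, HeightMapChannel) > d || d - PixelHeight / SimulationPrecision < 0.f)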
						{
							ReWrite(u, v, 0.f, 0.f, ProbeDepthChannel, d/ TextureHeightRatio);
							f = true;
							break;
						}
					}
					if (f) { break; }
				}
				if (f) { break; }
			}
		}
	}
}

material



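float3 tex;       // current texel: height, probe radius, probe depth
float rh = 1;     // ray height
float prh = 1;    // previous ray height
float th = 0;     // texture height
float pth = 0;    // previous texture height
float hs;         // height step
float ph;         // peek height: ray height above the cylinder bottom
float2 uvo = 0;   // UV offset applied by the last step

// advance the ray downward by height offset ho
// (a custom node cannot declare functions, so this has to be inlined by hand)
void update(float ho)
{
    prh = rh;
    pth = th;
    rh -= ho;
    uvo = float2(ho * Tanx, ho * Tany); // use the height offset to compute the UV offset
    UV -= uvo;
}

for (int i = 0; i < MaxSteps; i++)
{
    tex = Texture.SampleLevel(TextureSampler, UV, 0);
    th = tex[HeightChannel];
    // ray is inside the surface: step back along the last offset to the linear intersection
    if (rh <= th) { UV += uvo * (th - rh) / max(prh - pth + th - rh, 1e-5); break; }
    if (rh < tex[ProbeDepthChannel]) // ray is below the cylinder, so move forward by the step size
    {
        hs = StepSize * Tan;
        if (hs < rh) { update(hs); continue; }
        update(rh); break; // the step would go beneath the ground
    }
    ph = rh - tex[ProbeDepthChannel];
    hs = tex[ProbeRadiusChannel] * Tan;
    if (hs < ph) { update(hs); continue; }             // ray leaves through the cylinder wall
    if (ph / Tan > StepSize) { update(ph); continue; } // ray leaves through the cylinder bottom
    hs = StepSize * Tan;
    if (hs < rh) { update(hs); continue; }
    update(rh); break; // the step would go beneath the ground
}

return UV;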

RyanB · December 19, 2016, 4:23am

I haven’t checked it all, but you mentioned your PDO results looked odd, and to scale them down to avoid issues. PDO shouldn’t really be scaled down since the intersection won’t match.

Technically, the camera Z in tangent space is the Cosine, not the sine. Maybe that is affecting your math. Looks like you are treating it as the sine at the end.

edit I see, i think you are mentally rotating the problem by 90 degrees and defining angle from the plane rather than the normal. I think I see what you mean there. So you are only using tan to correct the partial amount from the last step, rather than calculating the whole value using it.

This makes me realize I could already remove the distance with 0 simply by solving it using cosine, but chances are the compiler is already catching that. It is usually good about optimizing distance calculations.

anonymous_user_4a8d4e3f · December 19, 2016, 8:49am

material uasset

RyanB · December 19, 2016, 8:02pm

Just in case you are waiting, I am not going to be looking at the example (just don’t have time sadly). I am happy to continue to provide feedback on the concepts though. I think the next test you should do is to test using the debug shader complexity so we can see how many steps are actually being taken across the screen for the two approaches. If you can show a big savings with your method, I will investigate it more.

FWIW, I was able to remove 3 instructions from the cost of POM PDO by removing the Distance(0, OffsetVector) and instead doing RayHeight.Z / Cosine. It doesn’t seem necessary to use tan to get that optimization but you can rewrite those kind of slopes intersection problems to use any of the trig functions.
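A small C++ sketch of why the two forms agree, assuming the offset vector lies along the normalized camera ray and that `cosine` is the Z component of that normalized ray in tangent space. The function names are illustrative and not taken from the POM material function.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Reference: full vector length, the equivalent of Distance(0, OffsetVector).
inline float LengthBySqrt(Vec3 v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Same length from one component: if v = dir * t with |dir| = 1 and
// dir.z = cosine, then v.z = t * cosine, so t = v.z / cosine.
inline float LengthByCosine(float offset_z, float cosine)
{
    assert(cosine != 0.f);
    return offset_z / cosine;
}
```

For the ray direction (1/3, 2/3, 2/3) and an offset of (0.5, 1, 1), both functions give 1.5.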

anonymous_user_4a8d4e3f · December 20, 2016, 12:37am

No, I've decided to quit. It's too hard for me. I'll learn from the basics when I have the chance in the future.

Just one more thought: an earlier comment reminded me of this: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch18.html
I see people have been using binary search for tracing. Wouldn't that be better? Within 10 steps we can narrow the step size to uvdist/2^11 ≈ 0.0005*uvdist, way more accurate for most cases. Just a thought.
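A minimal C++ sketch of that bisection, under simplifying assumptions: the ray is parameterized by t in [0, 1], its height is 1 - t, and the heightfield along the ray is supplied as a function. `BisectHit` and `height_at` are illustrative names.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Bisects the ray segment t in [0, 1]. The ray is assumed to be above the
// surface at t = 0 and at or below it at t = 1, with a single crossing.
inline float BisectHit(const std::function<float(float)>& height_at, int iterations)
{
    assert(iterations >= 0);
    float lo = 0.f; // ray known to be above the surface here
    float hi = 1.f; // ray known to be at or below the surface here
    for (int i = 0; i < iterations; ++i) {
        const float mid = 0.5f * (lo + hi);
        const float ray_height = 1.f - mid;
        if (ray_height <= height_at(mid)) {
            hi = mid; // crossing is in the first half
        } else {
            lo = mid; // crossing is in the second half
        }
    }
    return 0.5f * (lo + hi); // midpoint is within 1 / 2^(iterations + 1) of the crossing
}
```

Bisection alone can jump over thin features when the ray crosses the surface more than once, which is why it is usually run as a refinement after a coarse linear or cone-stepped search has bracketed the first hit.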

edit: maybe you can also use the free sine to lerp min/max steps.
I also moved some of the PDO computation outside the custom node, which may be better since a custom node won't be folded.

RyanB · December 20, 2016, 1:08am

Trying out a binary search is not a bad idea at all. It remains to be seen at what point it would be faster than a linear search. My initial guess is that it will be faster at getting a ‘good’ result but slightly slower at getting an ‘ok’ result because of slightly more overhead, but that could be totally wrong. I am sure there is some direct comparison out there. There has been a ton of research on these kinds of methods.

anonymous_user_a28c835c · December 22, 2016, 8:55am

There was interesting research on QuadTree Displacement Mapping a while ago.

Also old comparison with POM etc.

Could give some new ideas.

Deathrey · December 22, 2016, 1:12pm

Sadly QTDM seems to shine at iteration counts and depths that are higher than what we can realistically use in a game.

anonymous_user_fbe2d247 · December 22, 2016, 6:59pm

Those performance comparisons are over 7 years old. With modern GPUs the ALU/tex ratio is much higher and the cost of dependent texture fetches and branches is smaller. It would be interesting to see new comparisons that match visual quality.