Road to intel embree v3.9.0 and beyond #20

syoyo · 2020-03-02T17:24:35Z

Merge neon-fix to embree-aarch64 master(v3.6.1)
- Will solve Noise(wrong intersection result) in curve test #17 and verify segfault on aarch64 linux #16
Create aarch64-v3.8.0 branch from intel embree v3.8.0 (the most recent version as of writing this issue, 2nd March 2020)
Merge neon-fix to aarch64-v3.8.0
Merge @maikschulze 's Fixing verify / Fixing EMBREE_RAY_PACKETS=OFF RenderKit/embree#275 and Fixing verify / RayMasksTest & SphereFilterMultiHitTest RenderKit/embree#276
- 672a488
- Will solve verify fails to pass with internal tasking system #18
Fix verify segfault: verify NEON seg faults #22
Run tests, then merge aarch64-v3.8.0 to master
Merged v3.9.0 from Intel Embree

Note

Some work is still required to improve the NEON code path, so the neon-fix branch will stay alive even after embree-aarch64 is synced with intel embree v3.8.0.


syoyo · 2020-03-08T15:17:41Z

All done by this commit: 4433618

maikschulze · 2020-03-28T10:56:07Z

Hi,

I've made a comparative test between the older Embree 3.5 / 3.6 state and the current commit 4433618 for Embree 3.8.

I obtain the following results for the different platforms:
Measurements.pdf

I could not identify any quality regression over previous aarch64 versions. The timings look acceptable to me. I refrain from arguing that any particular feature is x percent faster, because of the very high speed variation in the x64 results. I would not expect such major performance changes in a mature code base. I'm unsure whether this is caused by new compiler settings, new compiler versions, lower ambient temperature on the mobile test hardware, or improvements in the code. Because of these high variations, I don't put much trust in my aarch64 timings.

I would argue that we require better testing, as we discussed in #17 to reduce the uncertainty here.

I did run into one compilation issue that is specific to iOS:
The BUILD_IOS-enabled code in bvh_intersector_hybrid.cpp contains an ambiguity in the vbool<4> conversion, which resulted in a compiler error. I avoided this by disabling the two affected regions in the cpp file with preprocessor guards.

The above-linked PDF file contains two remarks. I have noticed that some results look worse than on x64 and would argue we should investigate a bit. Please note that these issues have occurred in earlier Embree aarch64 already. I have just not noticed them earlier.

There's a hole in the displacement geometry test only on aarch64.

[Screenshots: Windows x64, Android aarch64, iOS aarch64, and a comparison of Windows (left) to iOS (right)]

There's a hole in the curve geometry only on aarch64.

[Screenshots: Windows x64, Android aarch64, iOS aarch64, and a comparison of Windows (left) to iOS (right)]

Since these two artifacts are not located in the axis center and do not seem to occur on x64, I think it's worth taking a closer look. I will try to reproduce this in the original Embree tutorials with a particular camera setup. This should allow us to debug the problematic pixels.

syoyo · 2020-03-28T14:41:20Z

@maikschulze Awesome!

As you may already know, I have added Windows CI build support for Github Actions: https://github.com/lighttransport/embree-aarch64/actions/runs/52064026

So it's ready to run regression tests once your code is available.

maikschulze · 2020-03-29T15:50:52Z

Hi,
I've dug a bit deeper into the hole in the displacement test. The test can be reproduced with Embree's 'displacement_geometry' tutorial when setting the camera as follows:

Tutorial()
      : TutorialApplication("displacement_geometry",FEATURE_RTCORE) 
    {
      /* set default camera */
      camera.from = Vec3fa(0.811794400, 1.00000000, 1.95984364);
      camera.to   = Vec3fa(0.0f,0.0f,0.0f);
    }

I've tried different approaches to rcp, sqrt and rsqrt precision but could not get rid of the artifact this way. I've then created higher-resolution images, up to 16K, with a high likelihood of contrast against the occluded back face by using color = abs(Ng).

It turns out that these holes / artifacts also appear on x64. On both x64 and arm they consistently appear on the primitive edges. It's best to zoom in closely. The first x64 artifact appears in the second level (top row), whereas the arm artifact appears in the first level (bottom row):

My data is available here:
https://drive.google.com/open?id=1Wy1lBMk_WAZClEYvsmQA9p-fv3MRO-yf

This makes me wonder whether we are seeing the following known issue:
https://software.intel.com/en-us/forums/embree-photo-realistic-ray-tracing-kernels/topic/805343

maikschulze · 2020-03-29T16:17:31Z

Thank you very much for the CI setup, @syoyo ! I will take a look at it and hopefully soon find the time to set up some public tests.

syoyo · 2020-03-29T16:45:49Z

> I've dug a bit deeper into the hole in the displacement test.

@maikschulze The hole may be a bug in Embree, but it may also be inherent to the displacement algorithm itself (e.g. the normal may face the opposite direction due to an insufficient sampling rate).

In the latter case, we could implement reference displacement algorithm in NanoRT (https://github.com/lighttransport/nanort/tree/master/examples/vdisp ) and verify how it goes.

For the former case, I'd recommend the following steps to debug (this is what I did when debugging Embree's curve intersector):

- Dump the BVH: https://github.com/lighttransport/embree-aarch64/blob/master/test/curve/main.cc#L311
- Add a conditional breakpoint that triggers when the program enters a pixel where the artifact happens, then print the ray's intersection state during step execution (or add debug printf calls in suspicious places)

This is a brute force way and time consuming, but should work well.
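
The "debug printf for one pixel" step can be sketched as below. This is only an illustration: `DebugPixel`, `debug_pixel`, `debug_active`, `DebugPixelScope` and `DEBUG_TRACE` are hypothetical names and are not part of Embree or embree-aarch64. The idea is to raise a per-thread flag while the suspicious pixel is rendered, so that prints placed deep inside the intersector stay silent for every other pixel.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helpers for illustration; none of these names exist in Embree.
struct DebugPixel { int x; int y; };

// Pixel to trace. (-1, -1) means tracing is disabled.
inline DebugPixel& debug_pixel() {
  static DebugPixel p = {-1, -1};
  return p;
}

// Per-thread flag: true only while the traced pixel is being rendered.
inline bool& debug_active() {
  thread_local bool active = false;
  return active;
}

// RAII guard to place around the per-pixel render call.
struct DebugPixelScope {
  DebugPixelScope(int x, int y) {
    assert(x >= 0 && y >= 0);
    debug_active() = (x == debug_pixel().x && y == debug_pixel().y);
  }
  ~DebugPixelScope() { debug_active() = false; }
};

// printf-style trace that prints nothing unless the traced pixel is active.
#define DEBUG_TRACE(...) \
  do { if (debug_active()) std::fprintf(stderr, __VA_ARGS__); } while (0)
```

In a renderer one would set `debug_pixel()` to the artifact's coordinates, create a `DebugPixelScope scope(x, y);` at the top of the per-pixel function, and place `DEBUG_TRACE(...)` calls in the code under suspicion. A conditional breakpoint on `debug_active()` serves the same purpose inside a debugger.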

maikschulze · 2020-03-30T11:42:45Z

Hi Syoyo,

I've looked at the noise function and its effect on the surface curvature. It does not seem to create any overlap. The sampling density is sufficient, normals are not rotated inwards. The artifacts appear even when lowering the noise displacement to a very small factor, effectively resulting in a sphere-like subdivision.

On x64 such a hole is reproducible with this ray:


RTCIntersectContext context;
rtcInitIntersectContext(&context);

RTCRayHit rayhit;
rayhit.ray.org_x = 0.811794400;
rayhit.ray.org_y = 1.00000000;
rayhit.ray.org_z = 1.95984340;

rayhit.ray.dir_x = -0.186146364;
rayhit.ray.dir_y = -0.291051656;
rayhit.ray.dir_z = -0.938423395;

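rayhit.ray.time = 0.0;
rayhit.ray.tnear = 0.0;
rayhit.ray.tfar = 100000.0;
rayhit.ray.mask = -1;
rayhit.ray.flags = 0;
rayhit.hit.geomID = RTC_INVALID_GEOMETRY_ID; // must be set before the query; Embree only writes the hit fields on a hit

rtcIntersect1(g_scene,&context,&rayhit); // misses prim 4 at tfar=1.5 and hits prim 0 at tfar=3 instead.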

The surface's front face that should be hit is prim 4, which is indeed hit for any slight variation in ray origin or direction. Instead, for this particular ray it's missed and the back face prim 0 is hit.

This setup enters:
GridSOAIntersector1::intersect

which then enters:
PlueckerIntersector1::intersect

Here, prim 4 is discarded because it fails to pass this test:

const vfloat<M> eps = float(ulp)*abs(UVW);
vbool<M> valid = (min(U,V,W) >= -eps) | (max(U,V,W) <= eps);
if (unlikely(none(valid))) return false;

I can imagine that Loader::gather is the cause here (called from GridSOAIntersector1::intersect) and may create a tiny gap between neighboring grid patches. I don't believe there's much I can do here with my limited understanding of the algorithm's numerics. Increasing the PlueckerIntersector1::intersect::eps if fed by GridSOAIntersector1 seems necessary but hacky to me if the factor is not properly determined.
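
To make the effect of the tolerance concrete, here is a scalar sketch of a Pluecker-style edge test modeled on the three lines quoted above. It is a simplified reading, not Embree's implementation: the `V3` type, the helper functions and the `eps_scale` parameter are hypothetical, and the way U, V and W are formed is my assumption about the surrounding code. With `eps_scale = 1` the tolerance is `ulp * abs(UVW)`; a larger value widens the band in which a ray that falls marginally outside an edge is still accepted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

struct V3 { float x, y, z; };
static V3 sub(V3 a, V3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 add(V3 a, V3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static V3 cross(V3 a, V3 b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static float dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns true when the ray (org, dir) passes the edge test of triangle
// (p0, p1, p2). eps_scale is a hypothetical knob, not an Embree parameter.
bool pluecker_edge_test(V3 org, V3 dir, V3 p0, V3 p1, V3 p2,
                        float eps_scale = 1.0f) {
  assert(eps_scale >= 0.0f);
  // Move the triangle into a frame with the ray origin at zero.
  const V3 v0 = sub(p0, org), v1 = sub(p1, org), v2 = sub(p2, org);
  const V3 e0 = sub(v2, v0), e1 = sub(v0, v1), e2 = sub(v1, v2);
  // Signed edge functions; all share one sign when the ray is inside.
  const float U = dot(cross(e0, add(v2, v0)), dir);
  const float V = dot(cross(e1, add(v0, v1)), dir);
  const float W = dot(cross(e2, add(v1, v2)), dir);
  const float UVW = U + V + W;
  const float ulp = std::numeric_limits<float>::epsilon();
  const float eps = eps_scale * ulp * std::fabs(UVW);
  const float lo = std::min(U, std::min(V, W));
  const float hi = std::max(U, std::max(V, W));
  return (lo >= -eps) || (hi <= eps);
}
```

A ray that misses an edge by more than the tolerance is rejected by both neighboring primitives if their shared edge does not line up exactly, which is the kind of gap suspected above. Raising `eps_scale` hides such a gap but, as noted, is hard to justify without a derived bound.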

maikschulze · 2020-03-30T15:02:02Z

Hi,

for the hole in the curve I have better news. While it is quite prevalent on arm, I fail to find holes on x64 curves. Moreover, as a sanity check, I was able to fix the holes quickly by greatly improving the rcp and rsqrt precision.

Firstly, for arm the hole is reproducible in the 'interpolation' scene with this ray:

RTCIntersectContext context;
rtcInitIntersectContext(&context);

RTCRayHit rayhit;
rayhit.ray.org_x = 6.00000000;
rayhit.ray.org_y = 0.00000000;
rayhit.ray.org_z = -4.50000000;

rayhit.ray.dir_x = -0.651344419;
rayhit.ray.dir_y = 0.682293713;
rayhit.ray.dir_z = 0.332002729;

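rayhit.ray.time = 0.0;
rayhit.ray.tnear = 0.0;
rayhit.ray.tfar = 100000.0;
rayhit.ray.mask = -1;
rayhit.ray.flags = 0;
rayhit.hit.geomID = RTC_INVALID_GEOMETRY_ID; // must be set before the query; Embree only writes the hit fields on a hit

rtcIntersect1(g_scene, &context, &rayhit);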

When rendering with this camera:

camera.from = Vec3fa(6, 0.00000000, -4.5);
camera.to   = Vec3fa(0.0f,0.0f,1.0f);
width = 1024;
height = 1024;

and removing everything but the curve with this shading: color = (Vec3fa(1.0) + Ng) / Vec3fa(2.0);

I get these results:
[Screenshots: aarch64, x64, a comparison of x64 (left) and aarch64 (right), aarch64 stabilized, and a comparison of x64 (left) to a 'stabilized' aarch64 (right)]

The stabilized version contains the change of rcp and rsqrt functions like this:

Vec3fa rcp  ( const Vec3fa& a )
{
	const Vec3fa r = _mm_rcp_ps(a.m128);
	const Vec3fa res = _mm_mul_ps(r,_mm_sub_ps(vfloat4(2.0f), _mm_mul_ps(r, a)));
	return res;
}

Vec3fa rsqrt( const Vec3fa& a )
{
	__m128 r = _mm_rsqrt_ps(a.m128);
	return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(1.5f),r), _mm_mul_ps(_mm_mul_ps(_mm_mul_ps(a, _mm_set1_ps(-0.5f)), r), _mm_mul_ps(r, r)));
}

and four Newton-Raphson iterations inside _mm_rcp_ps and _mm_rsqrt_ps for sanity. I will try to narrow it down a bit more.

maikschulze · 2020-03-30T15:34:16Z

The issue is related to the five rsqrt functions (color, vec2, vec3, vfloat4, math):

Vec3fa rsqrt( const Vec3fa& a )
{
	__m128 r = _mm_rsqrt_ps(a.m128);
	return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(1.5f),r), _mm_mul_ps(_mm_mul_ps(_mm_mul_ps(a, _mm_set1_ps(-0.5f)), r), _mm_mul_ps(r, r)));
}

The issue appears when returning r right away and skipping the second line. @syoyo , do you happen to know what this formula does?

Luckily, two Newton-Raphson iterations suffice. I will create a commit with the stabilizations in my fork. If we're not perfectly confident that we can skip the similar epilogue in the rcp functions, I argue we should re-add it there as well. While we still cannot guarantee water tightness without digging deep into the numerical differences between SSE and NEON, I'm sure it will help with robustness.
I guess it's worth re-evaluating performance as well.

syoyo · 2020-03-30T16:10:04Z

@maikschulze Good catch on the curve!

It looks like ARM NEON's rsqrte and rcpe have less accuracy (6~8 bits?) than SSE2 (~12 bits?), so we'll need 2x ~ 3x more rounds of Newton-Raphson iteration to get the same level of accuracy as SSE2.

https://qiita.com/sanmanyannyan/items/4d06b00078dd4abc4225

I'm not sure about the rsqrt equation.
Standard Newton-Raphson iterations for rsqrt and rcp should be enough.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/CIHDIACI.html

I think using improved rsqrt and rcp also solves the displacement issue.
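
The iteration-count argument can be checked with a small scalar experiment. The sketch below is a stand-in, not a measurement of real hardware: `coarse_rcp` simulates an estimate instruction by truncating the exact float reciprocal to a chosen number of mantissa bits, and the 8-bit and 12-bit figures are the tentative numbers from this thread. All function names are hypothetical. Each Newton-Raphson round for the reciprocal squares the relative error, so the number of correct bits roughly doubles: 12 bits reach float precision after one round, 8 bits need two.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Simulate a low-precision reciprocal estimate by keeping only `bits`
// mantissa bits of the exact float result. This is a stand-in and not the
// real output of NEON's rcpe or SSE's rcpps.
float coarse_rcp(float a, int bits) {
  assert(bits >= 1 && bits <= 23);
  float r = 1.0f / a;
  std::uint32_t u;
  std::memcpy(&u, &r, sizeof u);
  u &= ~((std::uint32_t(1) << (23 - bits)) - 1u);  // clear the low bits
  std::memcpy(&r, &u, sizeof u);
  return r;
}

// One Newton-Raphson round for f(r) = 1/r - a:  r' = r * (2 - a * r).
float rcp_refine(float a, float r) { return r * (2.0f - a * r); }

// Relative error of r as an approximation of 1/a, measured in double.
double rcp_error(float a, float r) {
  return std::fabs(double(a) * double(r) - 1.0);
}
```

For a = 3 the truncated estimates have relative errors of 2^-10 (8 bits kept) and 2^-12 (12 bits kept). One round brings them to 2^-20 and 2^-24 respectively, and only a second round brings the 8-bit start down to float rounding level.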

syoyo · 2020-03-31T05:50:23Z

Vec3fa rsqrt( const Vec3fa& a )
{
	__m128 r = _mm_rsqrt_ps(a.m128);
	return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(1.5f),r), _mm_mul_ps(_mm_mul_ps(_mm_mul_ps(a, _mm_set1_ps(-0.5f)), r), _mm_mul_ps(r, r)));
}

The second line is exactly the equation of one round of Newton-Raphson iteration https://en.wikipedia.org/wiki/Fast_inverse_square_root#Newton's_method
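
Written out in scalar form, the correspondence looks as follows. This is a sketch under stated assumptions: `coarse_rsqrt` fakes a low-precision estimate by truncating the exact result to 8 mantissa bits (it is not the real output of `_mm_rsqrt_ps` or NEON's rsqrte), and all function names are hypothetical. The textbook step r' = r * (1.5 - 0.5 * a * r * r) and the arrangement used in the quoted return statement, 1.5 * r + ((a * -0.5) * r) * (r * r), are the same polynomial in r.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in for a low-precision hardware estimate: exact 1/sqrt(a) with the
// mantissa truncated to 8 bits. Not the real output of rsqrte or rsqrtps.
float coarse_rsqrt(float a) {
  assert(a > 0.0f);
  float r = 1.0f / std::sqrt(a);
  std::uint32_t u;
  std::memcpy(&u, &r, sizeof u);
  u &= 0xFFFF8000u;  // keep sign, exponent and the top 8 mantissa bits
  std::memcpy(&r, &u, sizeof u);
  return r;
}

// Textbook Newton-Raphson round for f(r) = 1/r^2 - a.
float rsqrt_refine(float a, float r) {
  return r * (1.5f - 0.5f * a * r * r);
}

// The same step, arranged like the quoted return statement:
// 1.5*r + ((a * -0.5) * r) * (r * r)
float rsqrt_refine_quoted(float a, float r) {
  return 1.5f * r + ((a * -0.5f) * r) * (r * r);
}

// Relative error of r as an approximation of 1/sqrt(a), measured in double.
double rsqrt_error(float a, float r) {
  return std::fabs(double(r) * std::sqrt(double(a)) - 1.0);
}
```

For a = 3 the truncated estimate is off by about 2e-3. One round reduces that to about 6e-6 and a second round reaches float rounding level, which is consistent with the observation above that two iterations suffice when the starting estimate is coarse.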

syoyo · 2020-03-31T06:05:40Z

Filed the rcp/rsqrt issue as #24

syoyo · 2020-05-01T12:23:13Z

@maikschulze Synced embree-aarch64 master with Intel embree v3.9.0. This may resolve some of the issues.

syoyo · 2020-07-21T12:31:03Z

We are now at v3.11.0 and v3.9.0 has become outdated, so I'm closing this issue.

syoyo mentioned this issue Mar 31, 2020

Improve the accuracy of minps, maxps, rcp, rsqrt in NEON #24

Closed

syoyo changed the title ~~Road to intel embree v3.8.0 and beyond~~ Road to intel embree v3.9.0 and beyond May 1, 2020

syoyo closed this as completed Jul 21, 2020
