Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Road to intel embree v3.9.0 and beyond #20

Closed
7 tasks done
syoyo opened this issue Mar 2, 2020 · 14 comments
Closed
7 tasks done

Road to intel embree v3.9.0 and beyond #20

syoyo opened this issue Mar 2, 2020 · 14 comments

Comments

@syoyo
Copy link
Contributor

syoyo commented Mar 2, 2020

Note

There are still some amount of work required for Improving NEON code path. So neon-fix branch will continue to alive even after syncing embree-aarch64 with intel embree v3.8.0.

@syoyo
Copy link
Contributor Author

syoyo commented Mar 8, 2020

All done by this commit: 4433618

@maikschulze
Copy link
Contributor

Hi,

I've made a comparitive test between the older Embree 3.5 / 3.6 state and the current commit 4433618 for Embree 3.8.

I obtain the following results for the different platforms:
Measurements.pdf

I could not identify any quality regression over previous aarch64 versions. The timings look acceptable to me. I refrain from arguing that any particular feature is x percent faster, because of the very high speed variation in the x64 results. I would not expect such major performance changes in the matured code basis. I'm unsure whether this is caused by new compiler settings, new compiler versions, lower ambient temperature on the mobile test hardware, or improvements in the code. Because of these high variations, I don't put much trust into my aarch64 timings.

I would argue that we require better testing, as we discussed in #17 to reduce the uncertainty here.

I did run into one compilation issue that is specific to IOS:
The BUILD_IOS-enabled code in bvh_intersector_hybrid.cpp contains an ambiguity for vbool<4> conversion which resulted in a compiler error. I avoided this by def-disabling the respective two areas in the cpp file.

The above-linked PDF file contains two remarks. I have noticed that some results look worse than on x64 and would argue we should investigate a bit. Please note that these issues have occurred in earlier Embree aarch64 already. I have just not noticed them earlier.

  1. There's a hole in the displacement geometry test only on aarch64.

Windows x64:
win_x64_OK

Android aarch64:
android_aarch64_NOK

iOS aarch64:
ios_aarch64_NOK

Comparing Windows (left) to iOS (right):
comparison_win_vs_ios

  1. There's a hole in the curve geometry only on aarch64.

Windows x64:
win_x64_OK

Android aarch64:
android_aarch64_NOK

iOS aarch64:
ios_aarch64_NOK

Comparing Windows (left) to iOS (right):
comparison_win_vs_ios

Since these two artifacts are not located in the axis center and do not seem to occur on x64, I think it's worth to take a closer look. I will try to repro this in the original Embree tutorials with a particular camera setup. This should allow us to debug the problematic pixels.

@syoyo
Copy link
Contributor Author

syoyo commented Mar 28, 2020

@maikschulze Awesome!

As you may already know, I have added Windows CI build support for Github Actions: https://github.com/lighttransport/embree-aarch64/actions/runs/52064026

So its ready to run regression tests once your code is available.

@maikschulze
Copy link
Contributor

Hi,
I've dug a bit deeper into the hole in the displacement test. The test can be reproduced with Embree's 'displacement_geometry' tutorial when setting the camera the following:

Tutorial()
      : TutorialApplication("displacement_geometry",FEATURE_RTCORE) 
    {
      /* set default camera */
      camera.from = Vec3fa(0.811794400, 1.00000000, 1.95984364);
      camera.to   = Vec3fa(0.0f,0.0f,0.0f);
    }

I've tried different approaches at rcp, sqrt and rsqrt precision but could not get rid off the artifact this way. I've then created higher resoluted images, up to 16K with a high contrast likelihood against the occluded backfase using color = abs(Ng).

It turns out that these holes / artifacts also appear in x64. In both x64 and arm they are consistently appearing on the primitive edges. It's best to zoom in closely. The first x64 artifact appears in the second level (top row) whereas the arm artifact appear in the first level (bottom row):
comparison_zoom

My data is available here:
https://drive.google.com/open?id=1Wy1lBMk_WAZClEYvsmQA9p-fv3MRO-yf

This makes me wonder whether we are seeing the following known issue:
https://software.intel.com/en-us/forums/embree-photo-realistic-ray-tracing-kernels/topic/805343

@maikschulze
Copy link
Contributor

Thank you very much for the CI setup, @syoyo ! I will take a look at it and hopefully soon find the time to set up some public tests.

@syoyo
Copy link
Contributor Author

syoyo commented Mar 29, 2020

I've dug a bit deeper into the hole in the displacement test.

@maikschulze The hole may be a bug in Embree, but may be inherent to displacement algorithm itself(e.g. the normal may face opposite direction due to insufficient sampling rate).

In the latter case, we could implement reference displacement algorithm in NanoRT (https://github.com/lighttransport/nanort/tree/master/examples/vdisp ) and verify how it goes.

For the former case, I'd recommend the following step to debug(what I did for debugging curve intersector of Embree)

  1. Dump BVH : https://github.com/lighttransport/embree-aarch64/blob/master/test/curve/main.cc#L311
  2. Add conditional breakpoint when the program enters a pixel where artifact happens, then print ray's intersection state during step exectution(or add debug printf to suspicious place)

This is a brute force way and time consuming, but should work well.

@maikschulze
Copy link
Contributor

maikschulze commented Mar 30, 2020

Hi Syoyo,

I've looked at the noise function and its effect on the surface curvature. It does not seem to create any overlap. The sampling density is sufficient, normals are not rotated inwards. The artifacts appear even when lowering the noise displacement to a very small factor, effectively resulting in a sphere-like subdivision.

On x64 such a hole is reproducible with this ray:


RTCIntersectContext context;
rtcInitIntersectContext(&context);

RTCRayHit rayhit;
rayhit.ray.org_x = 0.811794400;
rayhit.ray.org_y = 1.00000000;
rayhit.ray.org_z = 1.95984340;

rayhit.ray.dir_x = -0.186146364;
rayhit.ray.dir_y = -0.291051656;
rayhit.ray.dir_z = -0.938423395;

rayhit.ray.time = 0.0;
rayhit.ray.tnear = 0.0;
rayhit.ray.tfar = 100000.0;
rayhit.ray.mask = -1;

rtcIntersect1(g_scene,&context,&rayhit); // misses prim 4 at tfar=1.5 and hits prim 0 at tfar=3 instead.

The surface's front face that should be hit is prim 4, which is indeed hit for any slight variation in ray origin or direction. Instead, for this particular ray it's missed and the back face prim 0 is hit.

This setup enters:
GridSOAIntersector1::intersect

which then enters:
PlueckerIntersector1::intersect

Here, prim 4 is discarded because it fails to pass this test:

const vfloat<M> eps = float(ulp)*abs(UVW);
vbool<M> valid = (min(U,V,W) >= -eps) | (max(U,V,W) <= eps);
if (unlikely(none(valid))) return false;

I can imagine that Loader::gather is the cause here (called from GridSOAIntersector1::intersect) and may create a tiny gap between neighboring grid patches. I don't believe there's much I can do here with my limited understanding of the algorithm's numerics. Increasing the PlueckerIntersector1::intersect::eps if fed by GridSOAIntersector1 seems necessary but hacky to me if the factor is not properly determined.

@maikschulze
Copy link
Contributor

maikschulze commented Mar 30, 2020

Hi,

for the hole in the curve I have better news. While it is quite prevalent on arm, I fail to find holes on x64 curves. Moreover, I was able to fix the holes quickly by highly improving the rcp and rsqrt precision as a sanity check.

Firstly, for arm the hole is reproducible in the 'interpolation' scene with this ray:

RTCIntersectContext context;
rtcInitIntersectContext(&context);

RTCRayHit rayhit;
rayhit.ray.org_x = 6.00000000;
rayhit.ray.org_y = 0.00000000;
rayhit.ray.org_z = -4.50000000;

rayhit.ray.dir_x = -0.651344419;
rayhit.ray.dir_y = 0.682293713;
rayhit.ray.dir_z = 0.332002729;

rayhit.ray.time = 0.0;
rayhit.ray.tnear = 0.0;
rayhit.ray.tfar = 100000.0;
rayhit.ray.mask = -1;

rtcIntersect1(g_scene, &context, &rayhit);

When rendering with this camera:

camera.from = Vec3fa(6, 0.00000000, -4.5);
camera.to   = Vec3fa(0.0f,0.0f,1.0f);
width = 1024;
height = 1024;

and removing everything but the curve with this shading: color = (Vec3fa(1.0) + Ng) / Vec3fa(2.0);

I get these results:
aarch64:
color_arm_1k

x64:
color_x64_1k

Comparison of x64 (left) and aarch64 (right) yields:
image

aarch64 stabilized:
color_arm_1k_stabilized

Comparing x64 (left) to a 'stabilized' aarch64 (right) yields:
image

The stabilized version contains the change of rcp and rsqrt functions like this:

Vec3fa rcp  ( const Vec3fa& a )
{
	const Vec3fa r = _mm_rcp_ps(a.m128);
	const Vec3fa res = _mm_mul_ps(r,_mm_sub_ps(vfloat4(2.0f), _mm_mul_ps(r, a)));
	return res;
}

Vec3fa rsqrt( const Vec3fa& a )
{
	__m128 r = _mm_rsqrt_ps(a.m128);
	return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(1.5f),r), _mm_mul_ps(_mm_mul_ps(_mm_mul_ps(a, _mm_set1_ps(-0.5f)), r), _mm_mul_ps(r, r)));
}

and four Newton-Raphson iterations inside _mm_rcp_ps and _mm_rsqrt_ps for sanity. I will try to narrow it down a bit more.

@maikschulze
Copy link
Contributor

The issue is related to the five rsqrt functions (color, vec2, vec3, vfloat4, math):

Vec3fa rsqrt( const Vec3fa& a )
{
	__m128 r = _mm_rsqrt_ps(a.m128);
	return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(1.5f),r), _mm_mul_ps(_mm_mul_ps(_mm_mul_ps(a, _mm_set1_ps(-0.5f)), r), _mm_mul_ps(r, r)));
}

The issue appears when returning r right away and skipping the second line. @syoyo , do you happen to know what this formula does?

Two Newton-Raphson iterations suffice luckily. I will create a submit with the stabilizations in my fork. If we're not perfectly confident that we can skip the similar epilogue in the rcp functions, I argue we should re-add this there as well. While we still cannot guarantee water tightness without digging deep into numerical differences between SSE and NEON, I'm sure it will help with robustness.
I guess it's worth to re-evaluate performance as well.

@syoyo
Copy link
Contributor Author

syoyo commented Mar 30, 2020

@maikschulze Good catch! > curve

It looks ARM NEON's rsqrte and rcpe has less accuracy(6~8bits?) than SSE2(~12bits?), so we'll need 2x ~ 3x more rounds of Newton-Raphon iteration to get the same level of accuracy of SSE2.

https://qiita.com/sanmanyannyan/items/4d06b00078dd4abc4225

I'm not sure > rsqrt equation.
We'll be enough to use standard Newton-Raphson iterations for rsqrt and rcp

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/CIHDIACI.html

I think using improved rsqrt and rcp also solves the displacement issue.

@syoyo
Copy link
Contributor Author

syoyo commented Mar 31, 2020

Vec3fa rsqrt( const Vec3fa& a )
{
	__m128 r = _mm_rsqrt_ps(a.m128);
	return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(1.5f),r), _mm_mul_ps(_mm_mul_ps(_mm_mul_ps(a, _mm_set1_ps(-0.5f)), r), _mm_mul_ps(r, r)));
}

The second line is exactly the equation of one round of Newton-Raphson iteration https://en.wikipedia.org/wiki/Fast_inverse_square_root#Newton's_method

@syoyo
Copy link
Contributor Author

syoyo commented Mar 31, 2020

Filed an rcp/rsqrt issue to #24

@syoyo syoyo changed the title Road to intel embree v3.8.0 and beyond Road to intel embree v3.9.0 and beyond May 1, 2020
@syoyo
Copy link
Contributor Author

syoyo commented May 1, 2020

@maikschulze Synched embree-aarch64 master with Intel embree v3.9.0. This would solve some issue.

@syoyo
Copy link
Contributor Author

syoyo commented Jul 21, 2020

We are now v3.11.0 and v3.9.0 become old. So close the issue

@syoyo syoyo closed this as completed Jul 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants