Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Android: Runtime bug - ANR in com.ansca.corona.Controller - new insight from Google Play #643

Open
appfamily opened this issue Oct 27, 2023 · 7 comments

Comments

@appfamily
Copy link

Describe the bug
Play Store has been reporting ANRs deep inside Solar2d for years. It has been discussed in forums, such as here but without conclusive results.

There is a new built-in analysis of ANRs in the Play Store and perhaps it can shed light on the root cause? See below

To Reproduce
Steps to reproduce the behavior:

  • Not reproducible on demand, but some insights discussed in the forum post linked above.

Target platform and where build was made:

  • Device: onn 100071485
  • OS: Android 12 (SDK 31)
  • Solar2d 2023.3696

Additional context
image

whole test project or build.settings
Available on demand

@vegasfun
Copy link

Similar issue:
ANR-img-1
ANR-img-2

@pouwelsjochem
Copy link
Contributor

Not sure if it's 100% related, but definitely could be: https://discuss.cocos2d-x.org/t/is-there-any-solution-of-anr/43080.

@pouwelsjochem
Copy link
Contributor

Another interesting topic: https://issuetracker.google.com/issues/263307511 it's from a native Android app, but a bit below is an example project can be found & a video showing what happens when the ANR pops up

@clang-clang-clang
Copy link
Contributor

I have the demo, steps, and screen recording to reproduce this ANR. @Shchvova

Use the Lua code as below or modified sample - Fishies to reproduction the Controller.stop() ANR with Solar2D 3699:

  --[[
      This is a demonstration for reproducing an ANR issue with
      Android Controller.stop()/start().
      Triggering the Android lifecycle while Lua is processing heavy 
      work (as shown below) can replicate the problem.
  ]]
  
  local sum = 0
  local co = coroutine.create(function ()
	  for i = 1, 100000 do
  
		  sum = sum + i
  
		  local start_time = os.clock()
		  while os.clock() - start_time < 0.1  do
		  end
  
		  coroutine.yield(sum)
	  end
  end)
  
  timer.performWithDelay(10, function()
	  for i = 1, 100 do
		  if co and coroutine.status(co) ~= "dead" then
			  coroutine.resume(co)
		  end
	  end
  end, 0)

The steps are as follows (Using the timeline of screen recording):

  1. 00:00:01 Start by testing the app as usual.
  2. 00:00:13 Enter the game scenes. Due to heavy computation, there is only one frame update at 22 and 32 seconds, which is not significant.
  3. 00:00:33 Exit or kill the app.
  4. 00:00:35 Manually restart the app from the desktop. Android Studio's first re-run won't reproduce this ANR.
  5. 00:00:37 Lock the screen by pressing the power button before fully entering the game scenes. Wait for at least 5 seconds.
  6. 00:00:44 Unlock the screen.
  7. 00:00:45 Seeing the ANR pop-up, use adb bugreport to retrieve it from the phone. This reproduced Controller.stop() ANR.
  8. 00:00:50 Exit or kill the app.
  9. 00:01:00 Manually restart the app from the desktop.
  10. 00:01:01 Exit to the desktop before fully entering the game scenes.
  11. 00:01:04 Re-enter the app.
  12. 00:01:06 Lock the screen by pressing the power button before fully entering the game scenes. Wait for at least 5 seconds.
  13. 00:01:16 See whether the ANR pop-up appears or not, as the app may have been killed by the system. This reproduced Controller.start() ANR (sometimes together with Controller.stop() ANR).

The screen recording:

output.mp4

@clang-clang-clang
Copy link
Contributor

I tested 3699 with a simple patch (either a draft or proof of concept) and the ANR disappeared. However, I'm not familiar with the entire locking process, so I'm not sure about potential effects.

platform/android/sdk/src/com/ansca/corona/graphics/opengl/GLSurfaceView.java

        public void surfaceDestroyed() {
            synchronized(sGLThreadManager) {
                if (LOG_THREADS) {
                    Log.i("GLThread", "surfaceDestroyed tid=" + getId());
                }
                mHasSurface = false;
                sGLThreadManager.notifyAll();
-                while((!mWaitingForSurface) && (!mExited)) {
-                    try {
-                        sGLThreadManager.wait();
-                    } catch (InterruptedException e) {
-                        Thread.currentThread().interrupt();
-                    }
-                }
+//                while((!mWaitingForSurface) && (!mExited)) {
+//                    try {
+//                        sGLThreadManager.wait();
+//                    } catch (InterruptedException e) {
+//                        Thread.currentThread().interrupt();
+//                    }
+//                }
            }
        }
platform/android/sdk/src/com/ansca/corona/CoronaActivity.java

    private void requestResumeCoronaRuntime() {
...
		// Start/resume the Corona runtime.
-		fController.start();
+		new Thread(fController::start).start();

...
    private void requestSuspendCoronaRuntime() {
		// Suspend the Corona runtime.
		if (fController != null) {
-			fController.stop();
+			new Thread(fController::stop).start();
		}

I hope the information helpful. Thank you.

@pouwelsjochem
Copy link
Contributor

Interesting @clang-clang-clang, is there a reason why you've use coroutines for the example? Wouldn't the ANR trigger otherwise? That would explain why not everyone is seeing the high ANR rate.

I'm not quite sure this is the right solution though. With my limited knowledge of locking in Java, simply removing the wait in surfaceDestroyed() could result in possible crashes in other methods which think everything is correctly destroyed, while the "destroy" is not actually finished yet.

I did spend some of last week on the ANR issue as well, and have actually come up with a different solution by looking at the libgdx library which also uses Android's GLSurfaceView.java. They had similar issues on the onPause state and avoid it by using two different methods.

  1. Killing the app when onPause takes longer than 4000ms (ANR triggers at 5000ms)
  2. Using an optimised updateRuntimeState() where not everything is locked/synchronized all of the time, just the moments when controller is actually used.

Start of last weekend I've integrated these solutions in my own fork of Solar2D (https://github.com/pouwelsjochem/solar2d/blob/40f0134d8ee7447373d0900257909f36fd863ba3/platform/android/sdk/src/com/ansca/corona/Controller.java#L222) and slowly rolled that out to my users. Our ANR rate of this last build decreased by roughly 90%, from ~1% to ~0.1% (over 14K sessions in the last 24 hours). I believe Vlad is going to incorporate my changes into the main repo soon.

@clang-clang-clang
Copy link
Contributor

Great news. Glad to hear there's a solution.

Since I can't reproduce the online situation, I can only approximate the ANR by simulating the calculated pressure.
Just keeping the GLThread busy and triggering the lifecycle can cause an ANR.

As for the 'patch' (it's a draft or proof of concept), it can't been production yet. I suspect that the sGLThreadManager also has a locking issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants