
Inference time using Interpreter API on Android inconsistent and 10–50 times slower than same tflite model on iOS #66025

Open
jakubdolejs opened this issue Apr 18, 2024 · 11 comments
Labels
Android comp:lite TF Lite related issues type:performance Performance Issue type:support Support issues

Comments

@jakubdolejs

Issue type

Performance

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

2.15.0

Custom code

Yes

OS platform and distribution

No response

Mobile device

Google Pixel 4a running Android 13

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I'm running inference on a YOLOv8-based tflite model on Android using the Interpreter API. I noticed that the first 30 or so calls to the Interpreter.run() function take much longer than the subsequent calls. The difference is quite marked, starting at about 3500 ms per run and dropping to about 500 ms.

I thought perhaps it was something about the input data, so I ran the same call with the same input 100 times in a loop. Same behaviour: the first handful of inference runs take around 3 seconds, slowly speeding up to about 500–700 ms by the 100th iteration.

I wanted to find out whether there is a specific combination of the interpreter options causing this behaviour so I wrote a test matrix initialising interpreters with different options:

  • Using GPU delegate
    • Using Google Play Services runtime
      • Using model with precision reduced from float32 to float16
    • Using bundled runtime
      • Using model with precision reduced from float32 to float16
  • Using NNAPI delegate
    • Using Google Play Services runtime
      • Using model with precision reduced from float32 to float16
    • Using bundled runtime
      • Using model with precision reduced from float32 to float16
  • Using CPU with XNNPACK
    • Using Google Play Services runtime
      • Using model with precision reduced from float32 to float16
    • Using bundled runtime
      • Using model with precision reduced from float32 to float16
  • Using CPU without XNNPACK
    • Using Google Play Services runtime
      • Using model with precision reduced from float32 to float16
    • Using bundled runtime
      • Using model with precision reduced from float32 to float16

There doesn't seem to be any difference: whichever combination runs first takes a suspicious amount of time for the first handful of inference runs. Sometimes the time never decreases and all the inference runs for a given configuration take a very long time (~3 seconds).

I'm including the code using the bundled runtime. The Play Services runtime times were in line with the bundled runtime.
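
For context, a minimal sketch of how the Play Services (system) runtime can be selected instead of the bundled one. This is an assumption about the setup rather than the exact test code, and it presumes the play-services-tflite-java dependency; the function name is illustrative:

import android.content.Context
import com.google.android.gms.tasks.Tasks
import com.google.android.gms.tflite.java.TfLite
import org.tensorflow.lite.InterpreterApi
import org.tensorflow.lite.InterpreterApi.Options.TfLiteRuntime

// Sketch: initialize the TFLite runtime shipped with Google Play Services and
// ask InterpreterApi to use it instead of the bundled runtime.
fun playServicesInterpreterOptions(context: Context): InterpreterApi.Options {
    // Blocking wait for brevity; in an app this would be awaited asynchronously.
    Tasks.await(TfLite.initialize(context))
    return InterpreterApi.Options().setRuntime(TfLiteRuntime.FROM_SYSTEM_ONLY)
}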

The device (Google Pixel 4a) is used only for development. There are no other apps installed aside from the test app and whatever was pre-installed on the phone. The device wasn't connected to the internet while running the test.

iOS comparison

In comparison, version 2.14.0 of TfLite for Swift (latest available on CocoaPods) using the CoreML delegate runs inference on the same input using the same model in 70ms on iPhone 12.

Standalone code to reproduce the issue

fun testInferenceSpeed() {
    val context = InstrumentationRegistry.getInstrumentation().context
    val assetManager = context.assets
    // Input serialized as a float array in JSON
    val jsonFile = "face_on_iPad_001.jpg-flat.json"
    assetManager.open(jsonFile).use { inputStream ->
        val json = inputStream.bufferedReader().use { it.readText() }
        val floatArray = Json.decodeFromString<FloatArray>(json)
        // Models – float32 and float16
        val models = arrayOf("ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite", "ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite")
        val options = arrayOf("gpu", "nnapi", "cpu", "xnnpack")
        for (model in models) {
            assetManager.open(model).use { modelInputStream ->
                // Copy the model from assets to the cache directory
                val modelFile = File(context.cacheDir, model)
                modelFile.outputStream().use { outputStream ->
                    modelInputStream.copyTo(outputStream)
                }
                for (option in options) {
                    val interpreterOptions = InterpreterApi.Options()
                    val compatibilityList = CompatibilityList()
                    when (option) {
                        "gpu" -> {
                            compatibilityList.use {
                                if (it.isDelegateSupportedOnThisDevice) {
                                    interpreterOptions.addDelegate(
                                        GpuDelegate(
                                            it.bestOptionsForThisDevice
                                        )
                                    )
                                }
                            }
                        }
                        "nnapi" -> {
                            if (android.os.Build.VERSION.SDK_INT >= android.os.Build.VERSION_CODES.P) {
                                interpreterOptions.addDelegate(NnApiDelegate())
                                interpreterOptions.useNNAPI = true
                            }
                        }
                        "cpu" -> {
                            interpreterOptions.numThreads =
                                Runtime.getRuntime().availableProcessors()
                            interpreterOptions.useXNNPACK = false
                        }

                        "xnnpack" -> {
                            interpreterOptions.numThreads =
                                Runtime.getRuntime().availableProcessors()
                            interpreterOptions.useXNNPACK = true
                        }
                        else -> throw IllegalArgumentException("Unknown option: $option")
                    }
                    InterpreterApi.create(modelFile, interpreterOptions)
                        .use { interpreterApi ->
                            val times = mutableListOf<Long>()
                            for (i in 0 until 100) {
                                interpreterApi.allocateTensors()
                                val input = FloatBuffer.wrap(floatArray)
                                val output =
                                    FloatBuffer.allocate(5 * 8400).also { it.rewind() }
                                val time = measureTimeMillis {
                                    interpreterApi.run(input, output)
                                }
                                times.add(time)
                            }
                            Log.d(
                                TAG,
                                "Model: $model, Option: $option, Inference times (ms): [${times.map { it.toString()+"ms" }.joinToString()}], Average inference time: ${times.average()} ms"
                            )
                        }
                }
            }
        }
    }
}

Relevant log output

Model: ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite, Option: gpu, Inference times (ms): [2502ms, 3011ms, 2987ms, 2723ms, 3529ms, 4245ms, 3387ms, 4510ms, 4133ms, 4034ms, 4015ms, 3307ms, 3207ms, 3240ms, 2718ms, 2978ms, 2985ms, 3357ms, 2751ms, 2969ms, 2942ms, 3028ms, 2916ms, 3029ms, 4428ms, 2727ms, 4982ms, 4320ms, 3211ms, 2980ms, 4010ms, 3239ms, 2712ms, 3974ms, 3994ms, 3999ms, 3997ms, 3047ms, 3687ms, 3744ms, 2972ms, 2944ms, 3709ms, 3936ms, 3971ms, 3998ms, 3315ms, 4495ms, 3285ms, 4655ms, 2758ms, 3307ms, 4880ms, 4912ms, 3599ms, 2750ms, 2004ms, 2643ms, 3383ms, 3372ms, 1664ms, 3297ms, 2969ms, 1714ms, 2834ms, 3381ms, 1764ms, 2303ms, 1715ms, 3314ms, 3379ms, 1434ms, 3221ms, 2842ms, 1783ms, 1784ms, 1418ms, 1618ms, 1400ms, 1777ms, 1960ms, 1962ms, 1471ms, 2355ms, 2883ms, 1494ms, 2806ms, 2281ms, 2482ms, 2915ms, 1504ms, 2772ms, 3376ms, 1753ms, 3300ms, 1748ms, 2584ms, 3377ms, 3384ms, 1648ms], Average inference time: 3021.08 ms
Model: ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite, Option: nnapi, Inference times (ms): [2288ms, 2105ms, 1637ms, 2280ms, 2085ms, 1695ms, 1634ms, 1759ms, 1637ms, 2006ms, 2210ms, 2018ms, 2050ms, 1979ms, 1698ms, 2201ms, 2105ms, 1989ms, 2040ms, 1966ms, 2034ms, 1970ms, 2031ms, 1970ms, 2033ms, 1968ms, 2034ms, 1966ms, 1763ms, 2160ms, 2077ms, 1987ms, 2040ms, 1966ms, 2033ms, 1859ms, 2106ms, 1993ms, 2041ms, 1965ms, 1826ms, 2117ms, 2073ms, 1979ms, 2041ms, 1969ms, 1632ms, 2109ms, 2212ms, 2024ms, 1362ms, 1284ms, 1970ms, 1806ms, 1212ms, 1800ms, 1231ms, 1452ms, 1465ms, 1128ms, 1185ms, 1519ms, 1246ms, 1824ms, 1224ms, 1719ms, 1234ms, 1964ms, 1133ms, 1973ms, 1689ms, 1241ms, 1890ms, 1194ms, 1187ms, 1108ms, 1089ms, 1091ms, 1086ms, 1084ms, 958ms, 1021ms, 1009ms, 999ms, 964ms, 1025ms, 1041ms, 980ms, 850ms, 1082ms, 1091ms, 976ms, 960ms, 1021ms, 1019ms, 991ms, 958ms, 850ms, 1008ms, 873ms], Average inference time: 1614.26 ms
Model: ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite, Option: cpu, Inference times (ms): [1445ms, 1504ms, 1364ms, 1337ms, 1383ms, 1350ms, 1364ms, 1365ms, 1354ms, 1413ms, 1403ms, 1310ms, 1336ms, 1823ms, 1355ms, 1728ms, 1450ms, 1492ms, 1383ms, 1274ms, 1370ms, 1251ms, 1719ms, 1800ms, 1539ms, 1546ms, 1722ms, 1390ms, 1394ms, 1330ms, 1338ms, 1373ms, 1362ms, 1424ms, 1604ms, 1316ms, 1431ms, 1313ms, 1381ms, 1265ms, 1449ms, 1663ms, 1354ms, 1372ms, 1358ms, 1419ms, 1356ms, 1355ms, 1310ms, 1430ms, 1346ms, 1304ms, 1405ms, 1315ms, 1816ms, 1320ms, 1397ms, 1311ms, 1393ms, 1345ms, 1416ms, 1375ms, 1370ms, 1373ms, 1274ms, 1365ms, 1433ms, 1362ms, 1352ms, 1304ms, 1351ms, 1337ms, 1438ms, 1401ms, 1369ms, 1365ms, 1633ms, 1670ms, 1396ms, 1657ms, 1367ms, 1404ms, 1373ms, 1439ms, 1387ms, 1371ms, 1339ms, 1411ms, 1416ms, 1370ms, 1483ms, 1389ms, 1341ms, 1402ms, 1320ms, 1370ms, 1424ms, 1479ms, 1520ms, 1308ms], Average inference time: 1414.73 ms
Model: ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite, Option: xnnpack, Inference times (ms): [1159ms, 1131ms, 1130ms, 1130ms, 1130ms, 1131ms, 1130ms, 1122ms, 1130ms, 1130ms, 1130ms, 1131ms, 1130ms, 1130ms, 1130ms, 1130ms, 1130ms, 1129ms, 1131ms, 1130ms, 1130ms, 1131ms, 1131ms, 1130ms, 1130ms, 1130ms, 1130ms, 1132ms, 1130ms, 1130ms, 1130ms, 1130ms, 1131ms, 1130ms, 1130ms, 1130ms, 1130ms, 1131ms, 1130ms, 1130ms, 1130ms, 1130ms, 1130ms, 1130ms, 1130ms, 1131ms, 1129ms, 1130ms, 1131ms, 1130ms, 1129ms, 1129ms, 1131ms, 1130ms, 1130ms, 1129ms, 1131ms, 1130ms, 1130ms, 1129ms, 1130ms, 1130ms, 1131ms, 1130ms, 1129ms, 1129ms, 1130ms, 1130ms, 1130ms, 1129ms, 1130ms, 1134ms, 1129ms, 1131ms, 1130ms, 1129ms, 1130ms, 1130ms, 1130ms, 1131ms, 1129ms, 1131ms, 1130ms, 1129ms, 1130ms, 1130ms, 1130ms, 1130ms, 1130ms, 1131ms, 1129ms, 1130ms, 1130ms, 1130ms, 1130ms, 1130ms, 1131ms, 1129ms, 1131ms, 1129ms], Average inference time: 1130.3 ms
Model: ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite, Option: gpu, Inference times (ms): [418ms, 714ms, 771ms, 622ms, 817ms, 814ms, 785ms, 813ms, 810ms, 812ms, 591ms, 812ms, 812ms, 812ms, 815ms, 662ms, 811ms, 812ms, 815ms, 624ms, 810ms, 807ms, 809ms, 811ms, 813ms, 814ms, 810ms, 813ms, 809ms, 809ms, 784ms, 810ms, 810ms, 809ms, 809ms, 770ms, 775ms, 812ms, 811ms, 804ms, 787ms, 809ms, 811ms, 810ms, 663ms, 816ms, 809ms, 812ms, 601ms, 809ms, 811ms, 808ms, 810ms, 809ms, 810ms, 816ms, 811ms, 810ms, 675ms, 809ms, 811ms, 810ms, 624ms, 808ms, 808ms, 813ms, 812ms, 811ms, 810ms, 816ms, 810ms, 809ms, 810ms, 812ms, 809ms, 660ms, 811ms, 806ms, 810ms, 808ms, 808ms, 812ms, 811ms, 820ms, 809ms, 809ms, 814ms, 813ms, 812ms, 811ms, 812ms, 817ms, 809ms, 810ms, 809ms, 811ms, 810ms, 589ms, 812ms, 812ms], Average inference time: 786.15 ms
Model: ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite, Option: nnapi, Inference times (ms): [1156ms, 1128ms, 1127ms, 1127ms, 1127ms, 1127ms, 1127ms, 1128ms, 1127ms, 1127ms, 1127ms, 1127ms, 1128ms, 1127ms, 1126ms, 1127ms, 1129ms, 1128ms, 1128ms, 1128ms, 1128ms, 1129ms, 1128ms, 1127ms, 1128ms, 1127ms, 1128ms, 1127ms, 1127ms, 1128ms, 1127ms, 1127ms, 1128ms, 1127ms, 1128ms, 1128ms, 1127ms, 1128ms, 1128ms, 1127ms, 1127ms, 1128ms, 1128ms, 1128ms, 1127ms, 1129ms, 1128ms, 1127ms, 1129ms, 1127ms, 1128ms, 1127ms, 1127ms, 1128ms, 1130ms, 1126ms, 1127ms, 1127ms, 1127ms, 1127ms, 1128ms, 1127ms, 1127ms, 1127ms, 1127ms, 1130ms, 1128ms, 1127ms, 1127ms, 1129ms, 1127ms, 1127ms, 1128ms, 1127ms, 1127ms, 1127ms, 1127ms, 1128ms, 1128ms, 1127ms, 1128ms, 1127ms, 1127ms, 1127ms, 1127ms, 1127ms, 1127ms, 1129ms, 1127ms, 1127ms, 1127ms, 1123ms, 1127ms, 1128ms, 1127ms, 1127ms, 1127ms, 1128ms, 1126ms, 1128ms], Average inference time: 1127.71 ms
Model: ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite, Option: cpu, Inference times (ms): [1293ms, 1412ms, 1377ms, 1389ms, 1452ms, 1516ms, 1465ms, 1520ms, 1476ms, 1383ms, 1373ms, 1440ms, 1557ms, 1592ms, 1405ms, 1328ms, 1385ms, 1342ms, 1356ms, 1348ms, 1743ms, 1693ms, 1603ms, 1329ms, 1391ms, 1356ms, 1441ms, 1439ms, 1316ms, 1309ms, 1305ms, 1556ms, 1467ms, 1641ms, 1385ms, 1420ms, 1352ms, 1342ms, 1584ms, 1272ms, 1332ms, 1388ms, 1327ms, 1311ms, 1446ms, 1699ms, 1380ms, 1692ms, 1779ms, 1335ms, 1389ms, 1598ms, 1441ms, 1441ms, 1340ms, 1363ms, 1435ms, 1360ms, 1407ms, 1321ms, 1447ms, 1422ms, 1362ms, 1474ms, 1366ms, 1390ms, 1622ms, 1723ms, 1386ms, 1438ms, 1412ms, 1352ms, 1650ms, 1679ms, 1432ms, 1742ms, 1469ms, 1291ms, 1403ms, 1446ms, 1419ms, 1416ms, 1395ms, 1280ms, 1491ms, 1644ms, 1297ms, 1314ms, 1391ms, 1429ms, 1379ms, 1755ms, 1505ms, 1551ms, 1662ms, 1396ms, 1317ms, 1409ms, 1366ms, 1360ms], Average inference time: 1444.19 ms
Model: ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite, Option: xnnpack, Inference times (ms): [1158ms, 1127ms, 1128ms, 1127ms, 1128ms, 1128ms, 1128ms, 1127ms, 1127ms, 1127ms, 1128ms, 1127ms, 1131ms, 1127ms, 1129ms, 1128ms, 1126ms, 1127ms, 1127ms, 1126ms, 1127ms, 1128ms, 1128ms, 1127ms, 1127ms, 1130ms, 1128ms, 1128ms, 1128ms, 1127ms, 1127ms, 1127ms, 1127ms, 1127ms, 1128ms, 1127ms, 1129ms, 1126ms, 1128ms, 1129ms, 1127ms, 1128ms, 1128ms, 1128ms, 1129ms, 1127ms, 1128ms, 1128ms, 1129ms, 1128ms, 1127ms, 1128ms, 1127ms, 1128ms, 1127ms, 1127ms, 1127ms, 1127ms, 1128ms, 1127ms, 1127ms, 1128ms, 1127ms, 1127ms, 1128ms, 1127ms, 1128ms, 1127ms, 1128ms, 1128ms, 1127ms, 1128ms, 1127ms, 1128ms, 1126ms, 1127ms, 1128ms, 1127ms, 1127ms, 1127ms, 1128ms, 1130ms, 1127ms, 1127ms, 1128ms, 1128ms, 1127ms, 1128ms, 1127ms, 1128ms, 1127ms, 1127ms, 1127ms, 1128ms, 1127ms, 1125ms, 1128ms, 1128ms, 1127ms, 1128ms], Average inference time: 1127.84 ms
@google-ml-butler google-ml-butler bot added type:performance Performance Issue type:support Support issues labels Apr 18, 2024
@sawantkumar sawantkumar added comp:lite TF Lite related issues Android labels Apr 19, 2024
@sawantkumar

sawantkumar commented Apr 19, 2024

Hi @jakubdolejs

There could be a number of reasons behind the performance difference between the Pixel 4a and the iPhone 12. When you use the Core ML delegate on the iPhone, inference runs on the NPU, which is much faster than the GPU on the Pixel 4a. Could you also benchmark your model on the Pixel 4a using the TensorFlow Lite profiler? It gives detailed information about model execution, such as how many partitions of the model are created before execution and how many layers fall back to the CPU when the GPU delegate is used. Also, the Pixel 4a's GPU is not optimised for fp32 calculations, only for fp16 operations, so that could be the culprit behind the poor GPU performance with fp32. Please share the TensorFlow Lite profiler results once you have benchmarked your model on the Pixel.
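
(Added for illustration, not part of the original comment.) One common way to run the TFLite benchmark tool on-device is roughly as follows; this is only a sketch, the benchmark_model binary has to be built or downloaded separately, and the exact flag set depends on the build:

# Sketch: push the benchmark binary and model, then profile with the GPU delegate
adb push benchmark_model /data/local/tmp/
adb shell chmod +x /data/local/tmp/benchmark_model
adb push ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite /data/local/tmp/
adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite \
    --num_threads=4 \
    --use_gpu=true \
    --enable_op_profiling=true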

@sawantkumar sawantkumar added the stat:awaiting response Status - Awaiting response from author label Apr 19, 2024
@jakubdolejs
Author

Thank you @sawantkumar. I'll try the profiler and upload the results here.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Apr 19, 2024
@jakubdolejs
Author

Hello @sawantkumar,

I ran the benchmark tool with different options on the float32 and float16 models. Please see the attached results. The file names ending with gpu are from runs that had the --use_gpu flag set to true. The ones ending with nnapi had the --use_nnapi flag set to true. The commands used to invoke the tests are included in the txt files.

Please let me know if you see anything unexpected in the results.

fp16_gpu.txt
fp16_nnapi.txt
fp16.txt
fp32_gpu.txt
fp32_nnapi.txt
fp32.txt

@sawantkumar

Hello @jakubdolejs,

I've reviewed the log files, and everything appears as expected, except for the discrepancies noted in the files fp32_gpu.txt and fp16_gpu.txt. While the average latency GPU numbers from the TFLite profiler seem almost identical for both fp16 and fp32 models, the logs from your Android code indicate a clear difference between fp32 and fp16 GPU numbers. To facilitate a more accurate comparison, could you also profile your models on an iPhone 12 using TFLite Profiler for iOS?

Regarding the inconsistency in inference numbers during the first few runs on the Pixel 4a, could you integrate a few warm-up loops in your Android code before benchmarking and let me know the results? Please feel free to reach out if you encounter any difficulties during this process.

@jakubdolejs
Author

Thank you @sawantkumar. I'll try the iOS app and report back. I really appreciate you helping me through this.

@jakubdolejs
Author

jakubdolejs commented Apr 19, 2024

Hi @sawantkumar,

Here are the benchmarks from iOS (iPhone 12 mini). It looks like the app runs the inference on the UI thread; for all the models I get this warning in the log: "This method should not be called on the main thread as it may lead to UI unresponsiveness." I redacted those messages from the log output for brevity.

FP16 Model on CPU

INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Inter-run delay (seconds): [-1]
INFO: Num threads: [4]
INFO: Benchmark name: [arc_psd_001_fp16_benchmark]
INFO: Min warmup runs: [1]
INFO: Graph: [/private/var/containers/Bundle/Application/1D088489-D83E-4A30-B2A8-26180514520A/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite]
INFO: Input layers: [in0]
INFO: Input shapes: [1,640,640,3]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model /private/var/containers/Bundle/Application/1D088489-D83E-4A30-B2A8-26180514520A/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite
Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
WARN: Tensor # 0 is named inputs_0 but flags call it in0
Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 22.4319
INFO: Initialized session in 150.857ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=3 first=206791 curr=185620 min=185245 max=206791 avg=192552 std=10069

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.

INFO: count=50 first=189002 curr=222573 min=189002 max=222573 avg=208663 std=9484

INFO: Inference timings in us: Init: 150857, First inference: 206791, Warmup (avg): 192552, Inference (avg): 208663
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=87.2656 overall=159.392

FP16 model with CoreML delegate:

INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Inter-run delay (seconds): [-1]
INFO: Num threads: [4]
INFO: Benchmark name: [arc_psd_001_fp16_benchmark]
INFO: Min warmup runs: [1]
INFO: Graph: [/private/var/containers/Bundle/Application/50D22DE8-EDBA-4763-8298-7187B6D7FD12/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite]
INFO: Input layers: [in0]
INFO: Input shapes: [1,640,640,3]
INFO: Use CoreML: [1]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model /private/var/containers/Bundle/Application/50D22DE8-EDBA-4763-8298-7187B6D7FD12/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite
Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
WARN: Tensor # 0 is named inputs_0 but flags call it in0
coreml_version must be 2 or 3. Setting to 3.
INFO: COREML delegate created.
CoreML delegate: 215 nodes delegated out of 384 nodes, with 15 partitions.
INFO: CoreML delegate: 215 nodes delegated out of 384 nodes, with 15 partitions.
INFO: Explicitly applied COREML delegate, and the model graph will be partially executed by the delegate w/ 13 delegate kernels.
Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 22.4319
INFO: Initialized session in 2540.03ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.

Note that when running the FP16 model with the CoreML delegate I got an EXC_BAD_ACCESS error here:

TfLiteBenchmarkTfLiteModelRunWithArgs(benchmark, argc, argv.data());

That's why the log is truncated.

FP32 model on CPU:

INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Inter-run delay (seconds): [-1]
INFO: Num threads: [4]
INFO: Benchmark name: [arc_psd_001_fp32_benchmark]
INFO: Min warmup runs: [1]
INFO: Graph: [/private/var/containers/Bundle/Application/6731B3C4-1818-4B08-977B-9D7C0C8DBD81/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite]
INFO: Input layers: [in0]
INFO: Input shapes: [1,640,640,3]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model /private/var/containers/Bundle/Application/6731B3C4-1818-4B08-977B-9D7C0C8DBD81/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite
Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
WARN: Tensor # 0 is named inputs_0 but flags call it in0
Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 44.7677
INFO: Initialized session in 195.673ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=3 first=200505 curr=187763 min=185514 max=200505 avg=191261 std=6600

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.

INFO: count=50 first=190545 curr=222754 min=190545 max=223493 avg=210610 std=9386

INFO: Inference timings in us: Init: 195673, First inference: 200505, Warmup (avg): 191261, Inference (avg): 210610
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=44.5 overall=116.626

FP32 model with CoreML delegate:

INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Inter-run delay (seconds): [-1]
INFO: Num threads: [4]
INFO: Benchmark name: [arc_psd_001_fp32_benchmark]
INFO: Min warmup runs: [1]
INFO: Graph: [/private/var/containers/Bundle/Application/8336E5F7-0CCF-4333-9BD9-4CC385A1B930/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite]
INFO: Input layers: [in0]
INFO: Input shapes: [1,640,640,3]
INFO: Use CoreML: [1]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model /private/var/containers/Bundle/Application/8336E5F7-0CCF-4333-9BD9-4CC385A1B930/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite
Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
WARN: Tensor # 0 is named inputs_0 but flags call it in0
coreml_version must be 2 or 3. Setting to 3.
INFO: COREML delegate created.
CoreML delegate: 215 nodes delegated out of 253 nodes, with 15 partitions.
INFO: CoreML delegate: 215 nodes delegated out of 253 nodes, with 15 partitions.
INFO: Explicitly applied COREML delegate, and the model graph will be partially executed by the delegate w/ 13 delegate kernels.
Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 44.7677
INFO: Initialized session in 3285.41ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=12 first=66600 curr=41255 min=40902 max=66600 avg=43699.3 std=6953

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.

INFO: count=50 first=40886 curr=41157 min=40386 max=41937 avg=41173.8 std=325

INFO: Inference timings in us: Init: 3285407, First inference: 66600, Warmup (avg): 43699.3, Inference (avg): 41173.8
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=230.595 overall=303.407

@jakubdolejs
Author

Hi @sawantkumar,

I ran a test on the Pixel 4a with the different model combinations. I ran 50 iterations but this time I included a warmup of 10 inference runs. The first few runs are still very slow. Is this to be expected?

How do you recommend handling the warmup in production? The app I'm building will need to run inference on a few images at a time, but it shouldn't take 3 seconds per image.

Here is the test function that produced the results in this CSV file:

@Test
fun testInferenceSpeed() {
    val context = InstrumentationRegistry.getInstrumentation().context
    val assetManager = context.assets
    // Input serialized as a float array in JSON
    val jsonFile = "face_on_iPad_001.jpg-flat.json"
    assetManager.open(jsonFile).use { inputStream ->
        val json = inputStream.bufferedReader().use { it.readText() }
        val floatArray = Json.decodeFromString<FloatArray>(json)
        // Models – float32 and float16
        val models = mapOf(Pair("fp32", "ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite"), Pair("fp16","ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite"))
        val options = arrayOf("gpu", "nnapi", "cpu", "xnnpack")
        val table = mutableMapOf<String,Array<Long>>()
        val runCount = 50
        val warmupRunCount = 10
        for (model in models.entries) {
            assetManager.open(model.value).use { modelInputStream ->
                // Copy the model from assets to the cache directory
                val modelFile = File(context.cacheDir, model.value)
                modelFile.outputStream().use { outputStream ->
                    modelInputStream.copyTo(outputStream)
                }
                for (option in options) {
                    val interpreterOptions = InterpreterApi.Options()
                    val compatibilityList = CompatibilityList()
                    when (option) {
                        "gpu" -> {
                            compatibilityList.use {
                                if (it.isDelegateSupportedOnThisDevice) {
                                    interpreterOptions.addDelegate(
                                        GpuDelegate(
                                            it.bestOptionsForThisDevice
                                        )
                                    )
                                }
                            }
                        }
                        "nnapi" -> {
                            if (android.os.Build.VERSION.SDK_INT >= android.os.Build.VERSION_CODES.P) {
                                interpreterOptions.addDelegate(NnApiDelegate())
                                interpreterOptions.useNNAPI = true
                            }
                        }
                        "cpu" -> {
                            interpreterOptions.numThreads =
                                Runtime.getRuntime().availableProcessors()
                            interpreterOptions.useXNNPACK = false
                        }

                        "xnnpack" -> {
                            interpreterOptions.numThreads =
                                Runtime.getRuntime().availableProcessors()
                            interpreterOptions.useXNNPACK = true
                        }
                        else -> throw IllegalArgumentException("Unknown option: $option")
                    }
                    InterpreterApi.create(modelFile, interpreterOptions)
                        .use { interpreterApi ->
                            for (i in 0 until warmupRunCount) {
                                interpreterApi.allocateTensors()
                                val input = FloatBuffer.wrap(floatArray)
                                val output =
                                    FloatBuffer.allocate(5 * 8400).also { it.rewind() }
                                interpreterApi.run(input, output)
                            }
                            val times = mutableListOf<Long>()
                            for (i in 0 until runCount) {
                                interpreterApi.allocateTensors()
                                val input = FloatBuffer.wrap(floatArray)
                                val output =
                                    FloatBuffer.allocate(5 * 8400).also { it.rewind() }
                                val time = measureTimeMillis {
                                    interpreterApi.run(input, output)
                                }
                                times.add(time)
                            }
                            table.getOrPut("${model.key}-${option}") { times.toTypedArray() }
                        }
                }
            }
        }
        var csv = table.keys.map { "\"$it\"" }.joinToString(",")
        val rowCount = table.values.map { it.size }.min()
        for (i in 0 until rowCount) {
            csv += "\n"
            csv += table.keys.map { table[it]!![i].toString() }.joinToString(",")
        }
        File(context.cacheDir, "inference_speed.csv").outputStream().use { fileOutputStream ->
            OutputStreamWriter(fileOutputStream).use { outputStreamWriter ->
                outputStreamWriter.write(csv)
            }
        }
    }
}

@sawantkumar

Hi @jakubdolejs,

Apologies for the delay; I wasn't available over the weekend. After analyzing the iOS numbers, it's evident that the Core ML delegate on the iPhone 12 Mini outperforms the GPU delegate on the Pixel 4a by approximately 7x for fp32 models. Additionally, the iPhone 12 Mini's CPU executes models roughly 2x faster than the Pixel 4a's CPU. These results clearly indicate that the iPhone 12 Mini offers faster model execution both on CPU and GPU compared to the Pixel 4a.

However, if you're aiming to maximize performance on your Pixel device, consider utilizing its DSP. Please ensure third-party access to the DSP is permitted on the Pixel phone, then optimize performance using SNPE provided by Qualcomm.

@sawantkumar

Also, regarding handling GPU warm-up runs in production: from my experience, the first few inference runs with the Android TFLite GPU delegate can be slower because of initialization overhead. When you run inference the first few times, TensorFlow Lite needs to initialize various components, such as loading the model, allocating memory, and setting up the GPU context, and this initialization can take some time, causing the first few inferences to be slower. To handle this in production you can perform the GPU warm-up runs during the app's startup, for example an inference loop of 50 or 100 iterations on dummy data (see the sketch below). Please let me know if you have any further issues or questions.
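
(Added for illustration, not part of the original comment.) A minimal sketch of such a startup warm-up, assuming an interpreter created elsewhere with the desired delegate options and the 1x640x640x3 input / 5x8400 output shapes used in this issue; the function name and parameters are illustrative:

import java.nio.FloatBuffer
import org.tensorflow.lite.InterpreterApi
import kotlin.concurrent.thread

// Sketch: run a fixed number of dummy inferences on a background thread at app startup
// so the delegate initialization cost is paid before the first real inference.
// Note: the interpreter must not be used for real inference concurrently with this loop.
fun warmUpInterpreter(interpreter: InterpreterApi, iterations: Int = 50) {
    thread(name = "tflite-warmup") {
        val dummyInput = FloatBuffer.allocate(1 * 640 * 640 * 3) // zero-filled input is enough for warm-up
        val output = FloatBuffer.allocate(5 * 8400)
        repeat(iterations) {
            dummyInput.rewind()
            output.rewind()
            interpreter.run(dummyInput, output)
        }
    }
}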

@sawantkumar sawantkumar added the stat:awaiting response Status - Awaiting response from author label Apr 23, 2024
@jakubdolejs
Author

Hello @sawantkumar,

I've done some more testing and profiling. I built an Android app that lets me change between the FP16 and FP32 models and toggle the different options. Here are my findings:

  • The initial slowdown only happens on some devices (e.g., Pixel 4a, Elo touch). On others (Pixel 6, Galaxy Tab S6 Lite) inference runs at a consistent speed from start to finish with any given options.
  • The initial slowdown happens regardless of which delegates are used.
  • I tried using the same model converted to NCNN and it runs at consistent speed on any device. The speed is comparable to the TfLite model after the "warmup".
  • The slower devices I mentioned, like the Pixel 4a, take about 3500 ms per inference for the first 50 or so iterations, after which the speed improves to about 400–500 ms per inference. Even this is not consistent: sometimes inference keeps running slowly at over 3000 ms even after hundreds of iterations. For comparison, the faster devices run inference at roughly 300 ms from the get-go.
  • On iOS, the story is slightly different. It takes about 2 seconds to load the model but afterwards the inference runs consistently at about 70 ms (with the CoreML delegate).

From using NCNN I can see that even the underpowered devices don't require a warmup to run at acceptable speeds. I believe there may be a bug in TfLite. It shouldn't take 3 minutes to "warm up".

Would you like me to file a separate issue with a bug report or can you escalate this one?

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Apr 25, 2024
@sawantkumar

Hi @jakubdolejs ,

When it comes to speed and performance, NCNN is generally considered to be faster than TFLite in many scenarios, so your results are somewhat expected. However, I will replicate the issue on my available Pixel phone using TFLite and get back to you. I don't think there is a need to file a separate issue yet.
