Inference time using Interpreter API on Android inconsistent and 10–50 times slower than same tflite model on iOS #66025
Comments
Hi @jakubdolejs, there could be a number of reasons behind the performance difference between the Pixel 4a and the iPhone 12. When you use the Core ML delegate on the iPhone, inference runs on the Neural Engine (NPU), which is much faster than the Pixel 4a's GPU. Could you also benchmark your model on the Pixel 4a using the TFLite profiler? It will give you detailed information about your model's execution, such as how many partitions of the model are created before execution and how many layers fall back to the CPU when using the GPU delegate. Also, the Pixel 4a's GPU is not optimised for fp32 calculations, only for fp16 operations, so that could be the culprit behind the poor GPU performance with fp32. Please share the TFLite profiler results once you benchmark your model on the Pixel.
Thank you @sawantkumar. I'll try the profiler and upload the results here.
Hello @sawantkumar, I ran the benchmark tool with different options on the float32 and float16 models. Please see the attached results (e.g., fp16_gpu.txt); the file names indicate the model precision and the delegate used. Please let me know if you see anything unexpected in the results.
Hello @jakubdolejs, I've reviewed the log files, and everything appears as expected except for the discrepancies noted in the files fp32_gpu.txt and fp16_gpu.txt. While the average GPU latency numbers from the TFLite profiler are almost identical for the fp16 and fp32 models, the logs from your Android code indicate a clear difference between the fp32 and fp16 GPU numbers. To make a more accurate comparison, could you also profile your models on the iPhone 12 using the TFLite profiler for iOS? Regarding the inconsistency in inference numbers during the first few runs on the Pixel 4a, could you integrate a few warm-up loops into your Android code before benchmarking and let me know the results? Please feel free to reach out if you encounter any difficulties during this process.
Thank you @sawantkumar. I'll try the iOS app and report back. I really appreciate you helping me through this.
Hi @sawantkumar, here are the benchmarks from iOS (iPhone 12 mini). It looks like the app runs the inference on the UI thread. For all the models I get the same warning in the log.

FP16 model on CPU:

FP16 model with CoreML delegate:

Note that when running the FP16 model with the CoreML delegate I hit the error at line 133 in commit 4615e17, which is why the log is truncated.

FP32 model on CPU:

FP32 model with CoreML delegate:
Hi @sawantkumar, I ran a test on the Pixel 4a with the different model combinations. I ran 50 iterations, but this time I included a warm-up of 10 inference runs. The first few runs are still very slow. Is this to be expected? How do you recommend handling the warm-up in production? The app I'm building will need to run inference on a few images at a time, but it shouldn't take 3 seconds per image. Here is the test function that produced the results in the attached CSV file:

@Test
fun testInferenceSpeed() {
val context = InstrumentationRegistry.getInstrumentation().context
val assetManager = context.assets
// Input serialized as a float array in JSON
val jsonFile = "face_on_iPad_001.jpg-flat.json"
assetManager.open(jsonFile).use { inputStream ->
val json = inputStream.bufferedReader().use { it.readText() }
val floatArray = Json.decodeFromString<FloatArray>(json)
// Models – float32 and float16
val models = mapOf(
    "fp32" to "ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite",
    "fp16" to "ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite"
)
val options = arrayOf("gpu", "nnapi", "cpu", "xnnpack")
val table = mutableMapOf<String,Array<Long>>()
val runCount = 50
val warmupRunCount = 10
for (model in models.entries) {
assetManager.open(model.value).use { modelInputStream ->
// Copy the model from assets to the cache directory
val modelFile = File(context.cacheDir, model.value)
modelFile.outputStream().use { outputStream ->
modelInputStream.copyTo(outputStream)
}
for (option in options) {
val interpreterOptions = InterpreterApi.Options()
val compatibilityList = CompatibilityList()
when (option) {
"gpu" -> {
compatibilityList.use {
if (it.isDelegateSupportedOnThisDevice) {
interpreterOptions.addDelegate(
GpuDelegate(
it.bestOptionsForThisDevice
)
)
}
}
}
"nnapi" -> {
if (android.os.Build.VERSION.SDK_INT >= android.os.Build.VERSION_CODES.P) {
interpreterOptions.addDelegate(NnApiDelegate())
interpreterOptions.useNNAPI = true
}
}
"cpu" -> {
interpreterOptions.numThreads =
Runtime.getRuntime().availableProcessors()
interpreterOptions.useXNNPACK = false
}
"xnnpack" -> {
interpreterOptions.numThreads =
Runtime.getRuntime().availableProcessors()
interpreterOptions.useXNNPACK = true
}
else -> throw IllegalArgumentException("Unknown option: $option")
}
InterpreterApi.create(modelFile, interpreterOptions)
.use { interpreterApi ->
for (i in 0 until warmupRunCount) {
interpreterApi.allocateTensors()
val input = FloatBuffer.wrap(floatArray)
val output =
FloatBuffer.allocate(5 * 8400).also { it.rewind() }
interpreterApi.run(input, output)
}
val times = mutableListOf<Long>()
for (i in 0 until runCount) {
interpreterApi.allocateTensors()
val input = FloatBuffer.wrap(floatArray)
val output =
FloatBuffer.allocate(5 * 8400).also { it.rewind() }
val time = measureTimeMillis {
interpreterApi.run(input, output)
}
times.add(time)
}
table.getOrPut("${model.key}-${option}") { times.toTypedArray() }
}
}
}
}
var csv = table.keys.map { "\"$it\"" }.joinToString(",")
val rowCount = table.values.map { it.size }.min()
for (i in 0 until rowCount) {
csv += "\n"
csv += table.keys.map { table[it]!![i].toString() }.joinToString(",")
}
File(context.cacheDir, "inference_speed.csv").outputStream().use { fileOutputStream ->
OutputStreamWriter(fileOutputStream).use { outputStreamWriter ->
outputStreamWriter.write(csv)
}
}
}
}
Hi @jakubdolejs, apologies for the delay; I wasn't available over the weekend. After analyzing the iOS numbers, it's evident that the Core ML delegate on the iPhone 12 Mini outperforms the GPU delegate on the Pixel 4a by approximately 7x for fp32 models. Additionally, the iPhone 12 Mini's CPU executes models roughly 2x faster than the Pixel 4a's CPU. These results clearly indicate that the iPhone 12 Mini offers faster model execution on both CPU and GPU compared to the Pixel 4a. However, if you're aiming to maximize performance on your Pixel device, consider utilizing its DSP. Please ensure third-party access to the DSP is permitted on the Pixel phone, then optimize performance using SNPE provided by Qualcomm.
Also, regarding handling GPU warm-up runs in production: in my experience, the first few inference runs on the Android TFLite GPU delegate can be slower because of initialization overhead. When you run inference for the first few times, TensorFlow Lite needs to initialize various components, such as loading the model, allocating memory, and setting up the GPU context. This initialization process can take some time, causing the first few inferences to be slower. To handle this in production, you can perform the GPU warm-up runs during the app's startup, for example an inference loop of 50 or 100 iterations on dummy data. Please let me know if you have any further issues or questions.
Hello @sawantkumar, I've done some more testing and profiling. I built an Android app that lets me switch between the FP16 and FP32 models and toggle the different options. Here are my findings:
From using NCNN I can see that even underpowered devices don't require a warm-up to run at acceptable speeds. I believe there may be a bug in TFLite; it shouldn't take 3 minutes to "warm up". Would you like me to file a separate issue with a bug report, or can you escalate this one?
Hi @jakubdolejs, when it comes to speed and performance, NCNN is generally considered to be faster than TFLite in many scenarios, so your results are somewhat expected. However, I will replicate the issue on my available Pixel phone using TFLite and get back to you. I don't think there is a need to file a separate issue yet.
Issue type
Performance
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
2.15.0
Custom code
Yes
OS platform and distribution
No response
Mobile device
Google Pixel 4a running Android 13
Python version
No response
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
I'm running inference on a YOLOv8-based tflite model on Android using the Interpreter API. I noticed that the first 30 or so calls to the Interpreter.run() function take much longer than the subsequent calls. The difference is quite marked, starting at about 3,500 ms per run and ending at about 500 ms. I thought perhaps it was something about the input data, so I ran the same call with the same input 100 times in a loop. Same behaviour: the first handful of inference runs take around 3 seconds, slowly speeding up to about 500–700 ms by the 100th iteration.
I wanted to find out whether a specific combination of the interpreter options causes this behaviour, so I wrote a test matrix initialising interpreters with different options. There doesn't seem to be any difference: whichever combination runs first takes a suspicious amount of time for the first handful of inference runs. Sometimes the time never decreases and all the inference runs for a given configuration take a very long time (~3 seconds).
I'm including the code that uses the bundled runtime. The times with the Play Services runtime were in line with the bundled runtime.
The device (Google Pixel 4a) is used only for development. There are no other apps installed aside from the test app and whatever was pre-installed on the phone. The device wasn't connected to the internet while running the test.
iOS comparison
In comparison, version 2.14.0 of TFLite for Swift (the latest available on CocoaPods) using the CoreML delegate runs inference on the same input with the same model in 70 ms on an iPhone 12.
Standalone code to reproduce the issue
Relevant log output