We're trying to reproduce NPU acceleration with various models on Android. Using the benchmark utility, we can obtain the same results as the TensorFlow Lite on Android User's Guide. However, when we train those same models on our own data for our custom application, many of the resulting model's operations fall back to the CPU, and we no longer achieve acceptable execution times.
Is there a sample application or demo that walks through the whole process, from training a model to running it and measuring its performance? We are interested in both NPU usage with quantized models and GPU acceleration with float16 models.