I have a custom-trained Keras MobileNetV1 model, and I'm running it on the i.MX 8M Plus EVK, targeting its NPU.
I have quantized it to TFLite int8 with both PTQ and QAT, in both cases converting with:
converter._experimental_disable_per_channel = True
With PTQ I get 3.5 ms inference, matching the published i.MX 8M Plus NPU benchmarks. I understand that with this flag PTQ uses per-tensor quantization, which is why it reaches the NXP benchmark speed, but it does lose some accuracy compared with the original float model.
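For reference, this is roughly my PTQ conversion. The model name, input size, and representative dataset below are placeholders; my real calibration data is preprocessed images from my training set:

```python
import numpy as np
import tensorflow as tf

# float_model: my custom-trained Keras MobileNetV1 (name is a placeholder)
def representative_dataset():
    # Placeholder calibration data; in practice I feed ~100 preprocessed training images.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(float_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter._experimental_disable_per_channel = True  # force per-tensor weight scales

with open("mobilenet_ptq_int8.tflite", "wb") as f:
    f.write(converter.convert())
```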
My QAT model, which has clearly better accuracy than the PTQ one, runs about 2x slower, at 7 ms per inference.
My understanding is that TensorFlow's QAT (the tfmot default 8-bit scheme) uses per-channel quantization for conv weights by default.
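For completeness, my QAT flow is essentially the standard tfmot one; this is a sketch where the optimizer, epochs, and datasets stand in for my actual fine-tuning setup:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the float model with fake-quant ops (tfmot default 8-bit scheme).
qat_model = tfmot.quantization.keras.quantize_model(float_model)
qat_model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(train_ds, validation_data=val_ds, epochs=5)  # fine-tuning details omitted

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter._experimental_disable_per_channel = True  # same flag as for PTQ
tflite_qat = converter.convert()
```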
My question: is there a way to force per-tensor quantization for QAT, either before fine-tuning or during TFLite conversion?
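In case it helps frame the question: from the tfmot docs it looks like one could override the per-layer QuantizeConfig so that the weight quantizer uses per_axis=False, roughly like this untested sketch. It only touches plain Conv2D here; DepthwiseConv2D would need its own config that swaps layer.depthwise_kernel, and I haven't verified that the converter then actually emits per-tensor scales:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer
MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer

class PerTensorConvQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    """Quantize Conv2D kernels per-tensor (per_axis=False) instead of per-channel."""

    def get_weights_and_quantizers(self, layer):
        return [(layer.kernel, LastValueQuantizer(
            num_bits=8, symmetric=True, narrow_range=False, per_axis=False))]

    def get_activations_and_quantizers(self, layer):
        return [(layer.activation, MovingAverageQuantizer(
            num_bits=8, symmetric=False, narrow_range=False, per_axis=False))]

    def set_quantize_weights(self, layer, quantize_weights):
        layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
        layer.activation = quantize_activations[0]

    def get_output_quantizers(self, layer):
        return []

    def get_config(self):
        return {}

def annotate(layer):
    # Annotate plain Conv2D layers only; DepthwiseConv2D would need a separate
    # config that replaces layer.depthwise_kernel instead of layer.kernel.
    if isinstance(layer, tf.keras.layers.Conv2D) and not isinstance(
            layer, tf.keras.layers.DepthwiseConv2D):
        return tfmot.quantization.keras.quantize_annotate_layer(
            layer, quantize_config=PerTensorConvQuantizeConfig())
    return layer

annotated = tf.keras.models.clone_model(float_model, clone_function=annotate)
with tfmot.quantization.keras.quantize_scope(
        {"PerTensorConvQuantizeConfig": PerTensorConvQuantizeConfig}):
    qat_per_tensor = tfmot.quantization.keras.quantize_apply(annotated)
```

Is something like this the intended/supported way to get per-tensor QAT, or is there a simpler converter-side switch that I'm missing?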