Hello Xu,
Your global work size has to be an exact multiple of your local work-group size in every dimension; yours is not, so you get this error. Error -54 is CL_INVALID_WORK_GROUP_SIZE: the values in your "globalWorkSize" array are not evenly divisible by the corresponding values in your "localWorkSize" array, and that's it. Remember that the global work size is the total number of threads along each dimension, not the number of "blocks" of threads along that dimension.
If you want a local size of {16, 4, 4}, you will need to use a global size of {16*a, 4*b, 4*c}, where a, b and c are positive integers.
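One common way to satisfy this is to round each dimension of your problem size up to the next multiple of the local size before enqueueing the kernel. Here is a minimal sketch in plain C; the helper names round_up and pad_global_size are my own, not part of the OpenCL API, and I am assuming you know your problem size per dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round `global` up to the next multiple of `local`.
   `local` must be non-zero. */
size_t round_up(size_t global, size_t local)
{
    assert(local > 0);
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}

/* Hypothetical helper: pad every dimension of the problem size so that
   global[d] is evenly divisible by local[d]. */
void pad_global_size(size_t dims, const size_t *problem,
                     const size_t *local, size_t *global)
{
    for (size_t d = 0; d < dims; ++d)
        global[d] = round_up(problem[d], local[d]);
}
```

For example, a problem size of {100, 30, 10} with a local size of {16, 4, 4} pads to a global size of {112, 32, 12}, which you would then pass as the global_work_size argument of clEnqueueNDRangeKernel. The padding launches extra threads, so the kernel should compare get_global_id() against the real problem size and return early when it is out of range.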
On your GPU you will also generally want the total work-group size (the product of the local sizes) to be a multiple of 32; 16 x 4 x 4 = 256 meets that.
Regards