All,
New update.
I switched from transferring 1k blocks to 32kByte blocks and now I am seeing a difference:

DTM to OCRAM transfer of 32k taking about 513us.
I notice that shortly after the transfer starts the FlexSPI activity stops; the code is spinning, waiting for the DMA controller to flag that it has completed, and that wait loop will be cached at this point, so there is no reason to access the QSPI Flash chip.

QSPI Flash to OCRAM transfer of 32k taking about 657us (some 144us, or +28% longer).
The interesting difference is that during the wait the QSPI Flash is being accessed, which presumably means that the data is being retrieved so that the DMA controller can then perform/complete the transfer.
This is more in line with expectations, but the effect is not visible with short transfers. Also, the speed reduction is not large.
In the memory to memory transfer case a data rate of around 65MByte/s is seen and in the QSPI to memory it drops to about 48MBytes/s.
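For reference, the data rates quoted here follow from block size divided by elapsed time. The following is a minimal sketch in C, assuming a block of 32000 bytes (the exact byte count behind "32k" is not stated, and 32768 would give slightly higher figures). The function name is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: convert a measured transfer into MByte/s.
 * Bytes per microsecond is numerically equal to MByte/s when
 * 1 MByte is taken as 10^6 bytes. */
static double throughput_mbyte_per_s(uint32_t bytes, uint32_t elapsed_us)
{
    if (elapsed_us == 0u) {
        return 0.0; /* avoid division by zero on a bad measurement */
    }
    return (double)bytes / (double)elapsed_us;
}
```

Calling `throughput_mbyte_per_s(32000u, 513u)` gives a little over 62, which is in the same region as the rounded memory to memory DMA figure above.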
Now I am questioning the speed of memory to memory DMA transfers, so I repeated the test, but instead of doing long word DMA transfers I did byte for byte CPU (memcpy()) transfers:

OCRAM to OCRAM takes 240us (133MBytes/s)

QSPI Flash to OCRAM takes 527us (60MByte/s)
Again it can be seen that data from the QSPI Flash is accessed during the process (via the FlexSPI bus).
Since I have 125MHz FlexSPI speed this doesn't seem unrealistic.
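As a rough cross-check, the theoretical peak of the flash bus can be estimated from the clock rate. The sketch below assumes four data lines (quad SPI) and single data rate, neither of which is stated explicitly above, and it ignores command and address overhead. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: peak payload rate of a serial flash bus in bytes/s.
 * data_lines:          4 for quad SPI (assumption)
 * transfers_per_clock: 1 for single data rate, 2 for double data rate */
static uint32_t flash_bus_peak_bytes_per_s(uint32_t clock_hz,
                                           uint32_t data_lines,
                                           uint32_t transfers_per_clock)
{
    /* 64-bit intermediate so the bit rate cannot overflow 32 bits */
    uint64_t bits_per_s = (uint64_t)clock_hz * data_lines * transfers_per_clock;
    return (uint32_t)(bits_per_s / 8u);
}
```

Under those assumptions, `flash_bus_peak_bytes_per_s(125000000u, 4u, 1u)` returns 62500000, i.e. 62.5MByte/s, so a measured rate of about 60MByte/s would be close to the ceiling of the bus.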
But it does show that it is faster to do a simple memcpy() than to do a memory to memory transfer via DMA. At the moment I use the same code that was used in Kinetis projects, where it had a good speed advantage in comparison with memcpy(), so this looks like it needs to be investigated: is it a restriction that can be overcome in the i.MX RT?
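A comparison like this can be repeated with a small timing harness. The following is only a sketch and not the code used for the measurements above: it times any copy routine through a function pointer, using the standard C `clock()` as a stand-in timer (on the target a hardware cycle counter would be the natural replacement). The word-wise copy loop is just a placeholder where a DMA driver call would go, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Any copy routine under test has this shape. */
typedef void (*copy_fn_t)(void *dst, const void *src, size_t len);

/* CPU copy via the C library. */
static void copy_memcpy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Long word copy loop; placeholder for a DMA transfer in this sketch.
 * Both buffers are assumed to be 32-bit aligned. */
static void copy_words(void *dst, const void *src, size_t len)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    size_t words = len / sizeof(uint32_t);
    size_t tail = len % sizeof(uint32_t);

    for (size_t i = 0; i < words; i++) {
        d[i] = s[i];
    }
    /* copy any remaining bytes that do not fill a whole word */
    memcpy((unsigned char *)dst + (len - tail),
           (const unsigned char *)src + (len - tail), tail);
}

/* Returns the average time per copy in microseconds. */
static double time_copy_us(copy_fn_t copy, void *dst, const void *src,
                           size_t len, unsigned repeats)
{
    if (repeats == 0u) {
        return 0.0;
    }
    clock_t start = clock();
    for (unsigned i = 0; i < repeats; i++) {
        copy(dst, src, len);
    }
    clock_t end = clock();
    double total_us = (double)(end - start) * 1e6 / (double)CLOCKS_PER_SEC;
    return total_us / (double)repeats;
}
```

Running `time_copy_us()` once with `copy_memcpy` and once with the DMA-backed routine, over the same source and destination buffers, gives directly comparable figures. Averaging over several repeats reduces the influence of timer resolution.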
Regards
Mark