I did get this working eventually...
Here are some things I learned along the way that may help if someone else has similar trouble.
Don't worry about forcing the actual code to run from internal RAM; I don't think it makes any difference. I modified the USB host MSD FreeRTOS SDK sample to run XIP from external QSPI flash and it was fine, as is my own app, which is considerably larger.
If using FreeRTOS, ensure configAPPLICATION_ALLOCATED_HEAP is set to 1 and configured appropriately, i.e. you have to pick one of the heap_x.c options and then declare the statically allocated ucHeap array in your app code (without the static keyword, since heap_x.c refers to it as extern), like this:
/* Allocate the memory for the heap. */
#if defined(configAPPLICATION_ALLOCATED_HEAP) && (configAPPLICATION_ALLOCATED_HEAP)
USB_DMA_NONINIT_DATA_ALIGN(USB_DATA_ALIGN_SIZE) uint8_t ucHeap[configTOTAL_HEAP_SIZE];
#endif
Note that the USB_DMA_NONINIT_DATA_ALIGN macro places the array in a section you choose, as per the notes in the article I mentioned in the original post. For me this was non-cached external SDRAM, because I used a generous 1MB for the FreeRTOS heap, which was too big to fit in internal memory. Having the FreeRTOS heap in non-cacheable memory is critical. configAPPLICATION_ALLOCATED_HEAP had been disabled in my application because I didn't intend to use the heap in this way.
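For anyone wondering what a placement macro like this boils down to, here is a rough sketch using plain GCC attributes. It is not the SDK's actual definition: APP_NONCACHEABLE_ALIGNED, APP_DMA_ALIGN, the "NonCacheable" section name and the 64-byte alignment are all assumptions for illustration, and the section only ends up in non-cacheable memory if your linker script and MPU setup map it there.

```c
#include <stdint.h>

/* Normally provided by FreeRTOSConfig.h; defined here only so the sketch
 * compiles on its own. 1MB matches the heap size mentioned in the post. */
#ifndef configTOTAL_HEAP_SIZE
#define configTOTAL_HEAP_SIZE (1024u * 1024u)
#endif

/* Hypothetical alignment; the real value comes from the SDK (USB_DATA_ALIGN_SIZE). */
#define APP_DMA_ALIGN 64

/* Hypothetical stand-in for the SDK macro: put the variable in a named
 * section and align it. The section name must match your linker script. */
#if defined(__GNUC__) && defined(__ELF__)
#define APP_NONCACHEABLE_ALIGNED(n) __attribute__((section("NonCacheable"), aligned(n)))
#else
/* Fallback for toolchains without ELF section attributes: alignment only. */
#define APP_NONCACHEABLE_ALIGNED(n) _Alignas(n)
#endif

/* Not declared static: the FreeRTOS heap implementation refers to it as extern. */
APP_NONCACHEABLE_ALIGNED(APP_DMA_ALIGN) uint8_t ucHeap[configTOTAL_HEAP_SIZE];
```

After building, the linker map file is the quickest way to confirm that ucHeap really landed in the region you intended.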
Using the tightly coupled internal memory (SRAM_ITC, SRAM_DTC) is better, as there will never be cache issues (but there will be if using SRAM_OC, unless that region is configured as non-cacheable). I had assumed part of the problem was that internal memory had to be used for performance reasons, for code as well as data, but it seems that data coherency is the main issue that needs sorting. This is why you can get away with using, say, external SDRAM for all data as long as it's in a non-cacheable region.
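Since coherency seems to be the real issue, a cheap debugging aid is to check that a buffer actually lies inside the non-cacheable region before handing it to the USB stack. The sketch below is only an illustration: the function names are made up, and APP_NONCACHE_BASE / APP_NONCACHE_SIZE are hypothetical values that you would replace with the ones from your own linker script and MPU configuration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region: take the real base and size from your linker
 * script / MPU setup. These numbers are placeholders only. */
#define APP_NONCACHE_BASE ((uintptr_t)0x81E00000u)
#define APP_NONCACHE_SIZE ((size_t)0x00200000u)

/* Returns true if [addr, addr + len) lies entirely inside
 * [base, base + size). Written so that the arithmetic cannot overflow. */
static bool app_range_in_region(uintptr_t addr, size_t len,
                                uintptr_t base, size_t size)
{
    if (addr < base || len > size) {
        return false;
    }
    return (addr - base) <= (size - len);
}

/* Convenience wrapper for the (hypothetical) non-cacheable region above. */
static bool app_is_noncacheable(const void *p, size_t len)
{
    return app_range_in_region((uintptr_t)p, len,
                               APP_NONCACHE_BASE, APP_NONCACHE_SIZE);
}
```

On target this could be wrapped in configASSERT(app_is_noncacheable(buf, len)) wherever a buffer is passed to the USB stack, so a misplaced buffer shows up as an assert rather than as intermittent corrupted transfers.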
I guess that was probably enough to get things working, but I did additionally modify my linker scripts to put all .data for *_usb_*.o modules into SRAM_DTC just because I thought it was a good idea. I'm not sure this is needed however.
Another problem I had, which I think is a C/C++ interop issue (I didn't resolve it, only worked around it), came when I tried putting my USBHost FreeRTOS task into a C++ class (actually a singleton). I got lots of weird linker errors along the lines of:
"relocation truncated to fit: R_ARM_PREL31 for .text bla bla"
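I think this was related to something in one area of memory referencing code in another area that is further away than a 31-bit relative offset can reach (R_ARM_PREL31 is a 31-bit PC-relative relocation, used as far as I know by the C++ exception unwind tables). I worked around this by just making the code plain 'C' instead of 'C++'. The only consequence for me was that I wasn't able to use my fancy debug logging functionality, which is C++ based.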