i.MX8M Plus caching problem

heinzhjb · 04-25-2023

We have written a Linux kernel driver that uses DMA transfers (MIPI, NXP CSI, ISI).
The target memory for the DMA transfers is reserved memory declared in the device tree.
Because copying this buffer to userspace was slow, we remapped it as cacheable (write-back) memory:

void *Image = memremap(physmem.start, physmem.size, MEMREMAP_WB);

This improved the performance tremendously.
Unfortunately, we now suffer from caching issues on parts of the transfer. Sometimes, when the buffer is copied to userspace, a 64-byte chunk contains 0xff instead of the correct data. We transfer around 1.1 MBytes per transfer, and occasionally a few cache lines hold incorrect data.

Which flush/invalidate cache functions can be called to guarantee cache coherency?
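Since cache maintenance operates on whole cache lines, it can be useful to check how a DMA buffer lines up with line boundaries. The following is a minimal userspace sketch, not a diagnosis of the problem above. It assumes a 64-byte line (matching the 64-byte chunks reported) and uses the 1228*921-byte buffer size and the 0x7EE00000 base address that appear later in this thread. The helper names are made up for illustration; inside the kernel one would rather use ALIGN() and dma_get_cache_alignment().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte data cache lines, as the 64-byte symptom suggests. */
#define CACHE_LINE 64u

/* Round addr down to the start of the cache line that contains it. */
uintptr_t line_floor(uintptr_t addr, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0); /* power of two */
    return addr & ~((uintptr_t)line - 1);
}

/* Round addr up to the next cache line boundary (unchanged if aligned). */
uintptr_t line_ceil(uintptr_t addr, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return (addr + line - 1) & ~((uintptr_t)line - 1);
}

/* Number of cache lines touched by the byte range [start, start + len). */
size_t lines_touched(uintptr_t start, size_t len, size_t line)
{
    if (len == 0)
        return 0;
    return (line_ceil(start + len, line) - line_floor(start, line)) / line;
}

/* Non-zero when both ends of the range sit on cache line boundaries. */
int range_is_line_aligned(uintptr_t start, size_t len, size_t line)
{
    return line_floor(start, line) == start &&
           line_ceil(start + len, line) == start + len;
}
```

With these helpers, range_is_line_aligned(0x7EE00000, 1228 * 921, CACHE_LINE) returns 0: the buffer start is aligned, but the length is not a multiple of 64, so the last cache line is only partly covered by the buffer.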

heinzhjb · 05-03-2023

Hello Dhruvit,

thank you for the answer.

void page_cache_release(struct page *page);
I have the physical address (reserved in the device tree) and the kernel virtual address obtained from memremap, but I don't have page structs. I don't know how to get them, or whether this function has to be called for every page in the range.

Generic DMA layer:

I'm not sure whether the CSI is a DMA engine that is covered by the generic DMA layer.

Coherent: Because the memory is uncached, I guess there wouldn't be any speed improvement.
Streaming: This API is probably the correct choice for this problem. But the CSI driver that performs the MIPI transfer doesn't set up the buffer, so functions like dma_sync_single_for_cpu/dma_sync_single_for_device either crash or have no effect at all.

I've had a look at the example in the link you referenced. It seems to me that unmapping before reading and accessing the physical memory directly wouldn't bring any speed improvement. So I guess I should use the sync functions without unmapping the memory.

I will try again to set up the memory similar to the example and use the sync functions, although I don't have a PCI dev struct for the CSI and sensor driver. I'm working in the sensor interrupt driver.

Thanks
Bernd

Dhruvit · 05-09-2023

Hi @heinzhjb,

I hope you are doing well.

->One may want to try dma_sync_single_for_cpu/dma_sync_single_for_device to ensure cache coherency before and after each transfer, or consider dma_cache_sync or dma_cache_inv to ensure that all cache lines associated with the DMA buffer are synchronized with main memory.

"Will try again to set up the memory similar to the example and use sync functions."
->Sure, it will help you.

I hope this information helps!

Thanks & Regards,
Dhruvit Vasavada

heinzhjb · 05-24-2023

Thank you Dhruvit,

I have tried the DMA streaming API. This seems to work:

void *vmem = kmalloc(1228*921, GFP_KERNEL);
dma_addr_t pmem = dma_map_single(dev_intern, vmem, 1228*921, DMA_FROM_DEVICE);

With this configuration, I can use dma_sync_single_for_device/cpu() functions and it seems to work well.
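The hand-off between device and CPU described here can be sketched as follows. This is only an illustration of the streaming API's ownership rule, assuming the buffer was already mapped with dma_map_single() as shown above; struct frame_buf and the two frame_* helpers are hypothetical names, not part of any NXP driver. The #else branch contains userspace stand-ins that mimic the kernel signatures so the sketch also compiles outside a kernel tree; in a driver only the <linux/dma-mapping.h> branch applies.

```c
#ifdef __KERNEL__
#include <linux/dma-mapping.h>
#else
/*
 * Userspace stand-ins (assumption: present only so this sketch builds
 * outside the kernel). They record who "owns" the buffer instead of
 * performing real cache maintenance.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
struct device;
typedef uint64_t dma_addr_t;
enum dma_data_direction { DMA_BIDIRECTIONAL, DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_NONE };
enum buf_owner { OWNER_CPU, OWNER_DEVICE };
enum buf_owner current_owner = OWNER_CPU;

void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
                                size_t size, enum dma_data_direction dir)
{
    (void)dev; (void)addr; (void)dir;
    assert(size > 0);
    current_owner = OWNER_DEVICE;
}

void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
                             size_t size, enum dma_data_direction dir)
{
    (void)dev; (void)addr; (void)dir;
    assert(size > 0);
    current_owner = OWNER_CPU;
}
#endif

/* Hypothetical bookkeeping for one capture buffer. */
struct frame_buf {
    struct device *dev; /* device that was passed to dma_map_single() */
    void *vaddr;        /* kernel virtual address, e.g. from kmalloc() */
    dma_addr_t dma;     /* handle returned by dma_map_single() */
    size_t size;        /* mapped length in bytes */
};

/* Hand the buffer to the hardware before a capture is (re)started. */
void frame_give_to_device(struct frame_buf *fb)
{
    dma_sync_single_for_device(fb->dev, fb->dma, fb->size, DMA_FROM_DEVICE);
}

/* Reclaim the buffer for the CPU after the frame-done interrupt,
 * and only then read it or copy it to userspace. */
void *frame_take_for_cpu(struct frame_buf *fb)
{
    dma_sync_single_for_cpu(fb->dev, fb->dma, fb->size, DMA_FROM_DEVICE);
    return fb->vaddr;
}
```

The point of wrapping the two calls is that the CPU never touches the buffer between frame_give_to_device() and frame_take_for_cpu(); reads in that window may see stale cache contents.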

But I need more memory than I can allocate with kmalloc. We have a reserved memory region specifically for the DMA buffer:
ar1820_reserved: ar1820@7EE00000 {
    no-map;
    reg = <0 0x7EE00000 0 0x1200000>;
};
How can I create a virtual mapping of that buffer for use with dma_map_single()?
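For reference, the reg property above can be decoded by hand. The sketch below is plain userspace C and assumes the node sits under a parent with #address-cells = <2> and #size-cells = <2>, which is what the four-cell reg entry suggests; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two 32-bit device tree cells (high, low) into one 64-bit value. */
uint64_t dt_cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical container for a decoded reg entry. */
struct mem_region {
    uint64_t base; /* physical start address */
    uint64_t size; /* length in bytes */
};

/* Decode <addr_hi addr_lo size_hi size_lo>.
 * Assumption: #address-cells = <2> and #size-cells = <2>. */
struct mem_region decode_reg(const uint32_t reg[4])
{
    struct mem_region r;

    r.base = dt_cells_to_u64(reg[0], reg[1]);
    r.size = dt_cells_to_u64(reg[2], reg[3]);
    return r;
}

/* First address past the region (exclusive end). */
uint64_t region_end(const struct mem_region *r)
{
    return r->base + r->size;
}

/* How many buffers of frame_bytes each fit into the region. */
uint64_t frames_that_fit(const struct mem_region *r, uint64_t frame_bytes)
{
    assert(frame_bytes > 0);
    return r->size / frame_bytes;
}
```

Under that assumption, the region starts at 0x7EE00000, is 0x1200000 bytes (18 MiB) long, and ends just below 0x80000000. It would hold 16 buffers of 1228*921 bytes each if they were packed back to back.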

Alternatively, I tried dma_alloc_attrs(dev_intern, DMASIZE, GFP_KERNEL | GFP_DMA, 0) instead of kmalloc to allocate the buffer for dma_map_single(), but dma_map_single() always failed with a "rejecting DMA map of vmalloc memory" error.

I need around 20 MBytes of DMA buffer. What can I use?

Thanks

Dhruvit · 05-26-2023

Hi @heinzhjb,

I hope you are doing well.

->Please make sure to use dma_map_single() to create a virtual mapping of the reserved memory region. One can also use the dma_mmap_coherent() function.
->This function will create a virtual mapping of the specified memory region and allocate a buffer of the specified size from the DMA memory pool.

->One can use the dma_alloc_coherent() function to allocate a DMA-coherent buffer. This function will allocate a buffer of the specified size and make it accessible to the DMA engine.

I hope this information helps!

Thanks & Regards,
Dhruvit Vasavada

heinzhjb · 05-26-2023

Thanks Dhruvit,

I thought dma_alloc_coherent() is used to reserve uncached memory?
I need cached memory to speed up copying this buffer to userspace.
That is why I wanted to use the streaming DMA API.

Regards
Bernd

Dhruvit · 05-29-2023

Hi @heinzhjb,

I hope you are doing well.

->No, the dma_alloc_coherent() function allocates a DMA-coherent buffer. This function will allocate a buffer of the specified size and make it accessible to the DMA engine.

dma_set_mask_and_coherent()
->This will set the mask for both the streaming and coherent APIs together.
->The setup for streaming mappings is performed via a call to dma_set_mask().

->The setup for consistent allocations is performed via a call to dma_set_coherent_mask().

Please refer to the DMA section at the location given below.
<linux_src_code>/Documentation/devicetree/bindings/dma/

I hope this information helps!

Thanks & Regards,
Dhruvit Vasavada

Dhruvit · 04-27-2023

Hi @heinzhjb,

I hope you are doing well.

->Regardless of whether the pages have been changed, they must be freed from the page cache, or they stay there forever. The call to use is:

void page_cache_release(struct page *page);
->This call should, of course, be made after the page has been marked dirty if need be.

->Please refer to the sections "The Generic DMA Layer" and "Coherent DMA mappings" at:
https://www.oreilly.com/library/view/linux-device-drivers/0596005903/ch15.html

This information will help you.

Thanks & Regards,
Dhruvit Vasavada