Thank you!
I have one related question.
My setup:
An LS1046 acting as root complex reads data from a PCIe peripheral device that acts as an endpoint. The region I'm reading from is memory-mapped through a BAR of the endpoint. The endpoint supports PCIe Gen1 (2.5 GT/s).
When I issue a PCIe read request of 8 bytes (without a DMA controller), the peripheral device seems to return more than one completion packet.
I assume this because my benchmarks show a difference of at least 300 nanoseconds between requesting 4 bytes and requesting 8 bytes.
If only a single completion packet were returned, the difference should be just a few ns, since only 4 more bytes of payload have to be transferred.
But if the response to the read request were split into more than one completion packet, the additional per-packet overhead would explain a difference of 300 ns or more.
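In case it matters, this is roughly how such a measurement could look; a minimal sketch, assuming the endpoint's BAR is exposed through sysfs as resource0 (the device address 0000:01:00.0, the offset, and the iteration count are placeholders, and whether the 64-bit load really leaves the core as a single 8-byte read request depends on the CPU and interconnect):

```c
/* Minimal sketch of a BAR read-latency comparison (illustration only).
 * Assumes the endpoint's BAR0 is at least one page and is exposed as
 * sysfs resource0; device address and offset are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                  O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint8_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    struct timespec t0, t1;
    uint64_t sink = 0;
    enum { ITER = 100000 };

    /* 4-byte reads: one aligned 32-bit load per iteration. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITER; i++)
        sink += *(volatile uint32_t *)(bar + 0x0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long long ns4 = (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                    + (t1.tv_nsec - t0.tv_nsec);

    /* 8-byte reads: one aligned 64-bit load per iteration. Whether this
     * actually becomes a single 8-byte read TLP on the link is exactly
     * what I'm not sure about. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITER; i++)
        sink += *(volatile uint64_t *)(bar + 0x0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long long ns8 = (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                    + (t1.tv_nsec - t0.tv_nsec);

    printf("4B: %lld ns/read, 8B: %lld ns/read (sink=%llu)\n",
           ns4 / ITER, ns8 / ITER, (unsigned long long)sink);
    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}
```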
The max payload size of both the root port and the endpoint is 256 bytes; the max read request size of both is 512 bytes. The read completion boundary (RCB) of the root port is 128 bytes and of the endpoint 64 bytes (I make sure that the read request doesn't cross this alignment boundary).
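To document my assumption about the alignment: my reading of the RCB rule is that completions other than the first must start on an RCB-aligned address and all but the last must end on one, so a request that stays inside a single RCB block must be answered with one completion. A tiny sketch of that accounting (my interpretation, names are mine):

```c
/* Sketch: upper bound on how many completion TLPs a completer may
 * legally return for a memory read, under my reading of the RCB rule.
 * Illustration only, not taken from the spec verbatim. */
#include <stdio.h>

static unsigned max_completions(unsigned long addr, unsigned len,
                                unsigned rcb)
{
    unsigned long first_block = addr / rcb;              /* RCB block of first byte */
    unsigned long last_block  = (addr + len - 1) / rcb;  /* RCB block of last byte  */
    return (unsigned)(last_block - first_block + 1);
}

int main(void)
{
    /* 8-byte read inside one 64-byte RCB block: one completion. */
    printf("%u\n", max_completions(0x1000, 8, 64)); /* -> 1 */

    /* 8-byte read crossing a 64-byte boundary: up to two. */
    printf("%u\n", max_completions(0x103C, 8, 64)); /* -> 2 */
    return 0;
}
```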
For comparison:
There's no time difference between requesting 1 byte and requesting 4 bytes, which makes sense, since the minimum payload a completion packet carries is one DW (4 bytes).
Do you have an idea what else could influence this behaviour?
Is there anything else I have to consider when creating that read request?
I tried two different endpoint devices and observed the same behaviour.