lspci output on iMX95EVK as PCIe RC
Please take a good look at the snippet above. It is taken from the console of iMX95 after executing 'lspci' on a specific PCIe device [iMX8MM as PCIe EP] that gets enumerated as BDF [Bus Device Function] 01:00.0. This blog attempts to demystify the "Memory at" info in the lspci output. We will discuss what this address is, why it is used and its relevance in the PCIe world. This blog will focus on the following agenda:
1. PCIe parent and child relationship in the Linux device tree
2. What is CPU and PCIe address space and the need for address space translation?
3. Assigning resources to a PCIe device in Linux
4. How is address space translation carried out in the Linux PCI subsystem?
PCIe parent and child relationship in Linux Device Tree
In the Linux device tree, the PCIe parent and child relationship defines how the PCIe Root Complex and Endpoints are positioned in the system. A PCIe parent node in the device tree represents a PCIe controller (Root Complex / host bridge). Taking reference from a PCIe node present in the device tree source of iMX95:-
pcie@4c300000 {
        compatible = "fsl,imx95-pcie";
        reg = <0x00 0x4c300000 0x00 0x10000
               0x00 0x4c360000 0x00 0x20000
               0x00 0x60100000 0x00 0xfe00000>;
        reg-names = "dbi", "atu", "config";
        #address-cells = <0x03>;
        …
};
pcie@4c300000 represents a DesignWare PCIe controller Root Complex, which is a parent to the devices/bridges that will be connected to it.
-- 'compatible' property identifies the specific PCIe controller. Its corresponding driver resides in drivers/pci/controller/dwc/pci-imx6.c
-- 'reg' property specifies the memory mapped registers of the PCIe controller.
Child nodes under the PCIe RC represent devices on the PCIe bus. They can be fixed-function devices like Wi-Fi, Ethernet or NVMe, or they can be PCIe bridges which can in turn have more devices connected to them. Taking reference from 'arch/arm64/boot/dts/freescale/imx95.dtsi':
pcie_4ca00000: pcie@4ca00000 {
        compatible = "pci-host-ecam-generic";
        reg = <0x0 0x4ca00000 0x0 0x100000>;
        /* Must be 3. */
        … …

        enetc_port0: ethernet@0,0 {
                compatible = "fsl,imx95-enetc";
                reg = <0x000000 0 0 0 0>;
                clocks = <&scmi_clk IMX95_CLK_ENET>,
                         <&scmi_clk IMX95_CLK_ENETREF>;
                clock-names = "ipg_clk", "enet_ref_clk";
                nvmem-cells = <&eth_mac0>;
                nvmem-cell-names = "mac-address";
                status = "disabled";
        };
};

ethernet@0,0 is a PCIe device at bus 0, device 0, function 0. It is a child of the PCIe RC which is memory mapped at 0x4ca00000.
These child devices/bridges can either be dynamically discovered using PCI enumeration or statically described in the device tree, as seen in the snippet above, in which the "ethernet@0,0" entry statically tells the RC that an ethernet child device is connected to it. These child nodes are nested within the PCI parent node of the device tree, as seen in the example above.
What is CPU and PCIe address space and the need for address space translation?
CPU address space is the system's physical memory map as seen by the processor. Example of the CPU physical address space viewed by the Cortex-A55 on iMX95:-

Start address   End address   Module
0x48000000      0x4812FFFF    GIC Programming registers
0x4AA00000      0x4AAFFFFF    Neutron SRAM
0x4AC10000      0x4AC1FFFF    Camera domain block control
0x4E080000      0x4E08FFFF    DDR Controller

This address space is a kind of global system view which is managed by system firmware/OS. These addresses are fixed by hardware design. On the other hand, the PCIe address space is local to the PCI bus and managed by the PCIe subsystem. The addresses in this space are dynamically assigned. An example of a PCIe address space could look like the following:-
0x00000000 - 0x0FFFFFFF
0x10000000 - 0x1FFFFFFF
0x20000000 - 0x2FFFFFFF
It is evident from the above explanation that the CPU and PCIe address spaces operate in separate and independent address domains, so the CPU cannot access the space of a PCIe device unless a translation mechanism is in place. In one of the upcoming sections we will get to that as well, but please spare a few minutes and ponder the question below:-
Question: Why do you need separate address spaces for CPU and PCIe?
Answer: One of the major reasons is modularity. We have separate spaces so that PCIe devices can be designed independently of the CPU architecture. The same card will work in different systems, and the CPU always retains the flexibility of remapping the PCIe space as and when needed. Also, separate address spaces prevent devices from accessing arbitrary system memory.
Based on the discussion in this section, it is evident that the PCIe address space is inherently different from the CPU address space and, truth be told, that has its advantages. Therefore we need an entity to translate to/from these address spaces. Here comes the 'iATU' - Internal Address Translation Unit. On iMX SoCs, these hardware units are responsible for carrying out the address translation. They are part of the Synopsys DesignWare PCIe controller and provide programmable address translation windows for inbound and outbound transactions. For readers who are uninitiated on inbound and outbound transactions in PCIe, please spare some time to go through this technical blog -
Understanding PCIe Outbound/Inbound windows with a use-case - NXP Community
Note:- Address translation simply ensures that the CPU can access a PCIe device's memory and vice-versa.
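To make the idea of a translation window concrete before we dive into the kernel code, here is a minimal C sketch of the arithmetic an outbound window performs. It is purely illustrative (the structure and function are my own, not a kernel or DesignWare API): a CPU address that falls inside the window is converted to a PCI bus address by preserving its offset into the window; the inbound direction on the Endpoint is simply the mirror image.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative outbound window: CPU range [cpu_base, cpu_base + size)
 * is translated to PCI range [pci_base, pci_base + size). */
struct xlate_window {
        uint64_t cpu_base;
        uint64_t pci_base;
        uint64_t size;
};

/* Translate a CPU address to a PCI bus address if it hits the window. */
static bool cpu_to_pci(const struct xlate_window *w, uint64_t cpu_addr,
                       uint64_t *pci_addr)
{
        if (cpu_addr < w->cpu_base || cpu_addr >= w->cpu_base + w->size)
                return false;                 /* address is outside this window */
        *pci_addr = w->pci_base + (cpu_addr - w->cpu_base);
        return true;
}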
By now, the readers should have a basic picture of PCIe address translation. Before jumping into the nitty-gritty of this translation in the Linux PCI subsystem, let's discuss how resources are assigned to a PCIe device.
Assigning resources to a PCIe device in Linux
PCIe devices do not have a direct CPU instruction interface, so they communicate through memory-mapped regions. Devices need memory for DMA operations or for MSI/MSI-X interrupts. Different devices have different needs, so a PCIe resource could be an MMIO region where device registers are mapped or a memory region needed for DMA transfers. In Linux, the pci_assign_resource function of the PCI subsystem is responsible for assigning I/O and memory resources to PCIe devices during system initialisation, after the PCIe devices have been enumerated. It is called for all the devices on a PCI bus and assigns resources based on each device's requirements. But how does the PCI subsystem in Linux figure out what resources the PCIe devices need? Every PCIe device has a configuration space defined by the PCIe specification. This includes:
BARs [Base Address Registers] - To indicate what type of resource [I/O or memory] the device needs and the size of the resource.
Capabilities - To advertise device capabilities such as MSI interrupts, ASPM low-power states etc.
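Before going further, here is a minimal, hedged sketch of how configuration space is actually read from a driver running on the RC. The demo function name is a placeholder of my own; pci_read_config_dword and the PCI_BASE_ADDRESS_0 offset are the standard Linux interfaces for accessing a device's config space:

#include <linux/pci.h>

/* Illustrative only: read the raw value of BAR0 from a device's
 * configuration space. PCI_BASE_ADDRESS_0 is the standard offset (0x10). */
static int demo_read_bar0(struct pci_dev *pdev)
{
        u32 bar0;
        int ret;

        ret = pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0, &bar0);
        if (ret)
                return -EIO;

        dev_info(&pdev->dev, "BAR0 raw value: 0x%08x\n", bar0);
        return 0;
}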
Reading the BARs of the PCIe device will tell us what kind and size of resources are needed by the device.
// To determine the size of a resource from the BAR of a PCIe device:-
Step-1: Write all 1's to the target BAR register.
Step-2: Read back the value and clear the lower 4 bits (for a memory BAR) or 2 bits (for an I/O BAR), as these bits encode the resource type and attributes and are not part of the size calculation.
Step-3: Perform a bitwise NOT on the value and add 1 to it.
Step-4: The resulting value indicates the size.
Taking an example to understand this: let's assume that after reading back the value in Step-2 above, the BAR returns 0xFFFFF000. The lower 4 bits are already cleared. In Step-3 we perform a bitwise NOT on the value -> ~(0xFFFFF000) = 0x00000FFF. Adding 1 to it: 0x00000FFF + 1 = 0x00001000. The obtained value 0x1000 = 4096 bytes, meaning the BAR requires a 4KB memory region.
// To determine the type of resource from the BAR of a PCIe device:-
A Base Address Register (BAR) in PCI configuration space encodes the following in its low bits:
Bit 0 → Resource type:
1 = I/O space
0 = Memory space
Bits 1–2 → Addressing type (for memory BARs):
00 = 32-bit
10 = 64-bit
Bit 3 → Prefetchable flag
Interpreting the value 0xFFFFF000, we get:-
Bit 0 = 0 → Memory space
Bits 1–2 = 00 → 32-bit address
Bit 3 = 0 → Non-prefetchable
Upper bits → Base address (after masking)
pci_read_bases [drivers/pci/probe.c] in the Linux PCI subsystem is responsible for figuring out the BAR size and type requirements during device enumeration.
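To tie the sizing steps and the bit decoding together, here is a small self-contained C sketch that runs the same arithmetic on a raw BAR value. It is not the kernel's pci_read_bases; it only mirrors the logic described above, and it assumes the write of all 1's and the read-back have already been done:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Decode a memory BAR value read back after writing all 1's to it. */
static void decode_mem_bar(uint32_t bar)
{
        bool is_io     = bar & 0x1;           /* bit 0: 1 = I/O, 0 = memory          */
        uint32_t width = (bar >> 1) & 0x3;    /* bits 2:1: 00 = 32-bit, 10 = 64-bit  */
        bool prefetch  = bar & 0x8;           /* bit 3: prefetchable flag            */
        uint32_t size  = ~(bar & ~0xFu) + 1;  /* mask attribute bits, NOT, then +1   */

        printf("type: %s, width: %s, prefetchable: %s, size: 0x%x bytes\n",
               is_io ? "I/O" : "memory",
               width == 0x2 ? "64-bit" : "32-bit",
               prefetch ? "yes" : "no",
               size);
}

int main(void)
{
        decode_mem_bar(0xFFFFF000);   /* prints a 4KB (0x1000) non-prefetchable 32-bit memory BAR */
        return 0;
}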
Needless to say, the above sequence of writing to the Endpoint's BAR and identifying the size and type of the resource is executed on the PCIe RC. We have the following setup:-
iMX95 <------> iMX8MM
[RC]            [EP]
After the PCIe RC knows the size of the BAR that is required, the pci_assign_resource function allocates a memory range and then sets up translation from this memory range to the PCIe address space. We started this blog with a snippet that shows the following lspci log:-
Referring to the above, please note that the RC driver has allocated 0x910100000 - 0x910110000 as the non-prefetchable memory address range, size = 64KB. This memory address range lies in the PCIe1 outbound space memory mapped on the iMX95 SoC:-
The range 0x910100000 - 0x910110000 will be mapped to the PCIe address space of the Endpoint. This essentially means that if the CPU generates any address within this range [inclusive of start and end address], a PCIe TLP will be sent by the PCIe controller on the RC to the Endpoint on the bus. It could be a read or a write to the memory of the Endpoint. The address to read/write is decided by the address space translation. We shall discuss in detail how this translation is exercised in the Linux kernel in the next section.
How is address space translation carried out in Linux PCI Subsystem?
We start with some important questions:-
Where is the range 0x910100000 - 0x910110000 specified?
How does the kernel know that it has to map the PCIe1 outbound space and not the PCIe2 outbound space, or any other address space for that matter?
Like all good things in Linux, this also starts with a device tree binary. A dtb is passed by U-Boot to the kernel so that it can get the hardware description of our board. Since we are using Toradex's Verdin iMX95 EVK board as the Root Complex, this is the dtb that we are using - imx95-19x19-verdin-adv7535.dtb. I will be attaching a working dtb with this blog so that the readers can use it if needed. This dtb includes arch/arm64/boot/dts/freescale/imx95.dtsi. Let's have a look at a particular pcie node of interest:-
The 'ranges' property is the answer to the questions asked earlier in this section. This property defines the address translation rules between the parent's (CPU) address space and the child (PCI) address space.
Note:- This blog focuses only on 'ranges' property since it is relevant to our discussion. So the readers are advised to look elsewhere if they want to understand other device-tree properties of the PCIe node. Let's decode the ranges property :
It has the following format:-
<PCI address>    <CPU address>    <PCI size>
   3 cells          2 cells          2 cells
So one entry will have 7 cells. In our dtsi we have 2 entries: the 1st is for I/O space translation and the 2nd is for memory space translation. Referring to the second entry:-
0x82000000 0x0 0x10000000   0x9 0x10000000   0x0 0x10000000
|-------PCI address------|  |-CPU address-|  |--PCI size--|
The above gives us the following info:- MEM space, non-prefetchable
<
0x82000000 0x00 0x10000000   // PCI address: 0x10000000
0x09 0x10000000              // CPU/system address: 0x910000000
0x00 0x10000000              // Size: 256MB
>;
0x82000000 = 1000 0010 0000 0000 0000 0000 0000 0000
Bit 31 (1) → Non-relocatable region
Bit 30 (0) → Prefetchable = No (0 means non-prefetchable)
Bit 29 (0) → Not an aliased / below-1MB region
Bits 25–24 (10) → Address space type = 32-bit memory space
So, 0x82000000 means:
PCI memory space
Non-prefetchable
32-bit address space
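As a quick aside, the same decoding can be expressed in a few lines of C. The masks follow the standard PCI device-tree binding for the first ('phys.hi') cell of a ranges entry; the function itself is only illustrative, not kernel code:

#include <stdint.h>
#include <stdio.h>

/* Decode the flags (phys.hi) cell of a PCI 'ranges' entry. */
static void decode_pci_flags(uint32_t phys_hi)
{
        uint32_t space   = (phys_hi >> 24) & 0x3; /* 00=config, 01=I/O, 10=32-bit mem, 11=64-bit mem */
        int prefetchable = (phys_hi >> 30) & 0x1; /* p bit */
        int non_reloc    = (phys_hi >> 31) & 0x1; /* n bit */

        static const char *names[] = { "config", "I/O", "32-bit memory", "64-bit memory" };

        printf("%s space, %sprefetchable, %srelocatable\n",
               names[space],
               prefetchable ? "" : "non-",
               non_reloc ? "non-" : "");
}

int main(void)
{
        decode_pci_flags(0x82000000);   /* 32-bit memory, non-prefetchable */
        decode_pci_flags(0x81000000);   /* I/O space */
        return 0;
}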
Note:- For those of you wondering why the lspci output mentions [size=64K] while the dts says 256MB: 256MB is the maximum address space available to PCIe devices behind this controller. It is up to the Endpoint device how large an address space it requires, and it gets allocated accordingly.
Similarly, the I/O space translation is created from the 1st entry in 'ranges':-
<
0x81000000 0x00 0x00   → PCI I/O address: 0x00000000
0x00 0x6ff00000        → CPU/system address: 0x6ff00000
0x00 0x100000          → Size: 1MB
>;
We observe the same in the dmesg output of the iMX95 Verdin EVK Linux console:-
So the MEM space mapping is from CPU addresses 0x910000000 - 0x91fffffff translated to PCIe addresses 0x10000000 - 0x1fffffff. It is only fair that we mention the driver code that uses the 'ranges' property. The 'ranges' property gets parsed in "pci_parse_request_of_pci_ranges -> devm_of_pci_get_host_bridge_resources" of "drivers/pci/of.c".
devm_of_pci_get_host_bridge_resources, for each range, automatically manages the memory allocated for these resources. It ensures that the resources are freed when the device is detached or the driver is removed.
We now have the answer to what the CPU and PCI address ranges are and why they look the way they do. But in the lspci output you see 0x910100000 and not 0x910000000, which is the intended start of the range as per the dtb. Why is that? To answer this, we need to go back to PCIe device enumeration. During enumeration, the Linux PCI driver determined the BAR resources as discussed earlier, and the PCI core may assign addresses keeping alignment requirements in mind. That is why the EP's BAR0 was assigned the PCI bus address 0x10100000, at a 1MB [0x100000] offset from 0x10000000. Keeping the device tree PCI translation window in mind, 0x10100000 translates to 0x910100000.
This translation doesn't happen on its own. The device tree binary just describes the translation window: the CPU address space to translate to and the PCI address space to translate from. The actual translation is done via the iATU. This is set up in the dw_pcie_iatu_setup function of drivers/pci/controller/dwc/pcie-designware-host.c, which creates the outbound window using the dw_pcie_prog_outbound_atu function.
Translation is configured on the RC successfully, but there is still something missing... the inbound window! Without an inbound window on the Endpoint, i.e. the iMX8MM, the writes/reads to 0x910100000 would be meaningless. On the iMX8MM we are using the PCI Endpoint test driver, which is quite popular in the Linux community, and I would urge the readers to visit this page if they want more info - 9. PCI Endpoint Framework — The Linux Kernel documentation. The pci_epc_map_addr function in drivers/pci/endpoint/pci-epc-core.c creates the inbound window by mapping the PCI address [0x10100000] to a physical address in the EP's memory. That's how the reads and writes go through. If there's no inbound window configured, something like this unfolds in case of a read:-
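With both the outbound window on the RC and the inbound window on the EP in place, the complete chain for the addresses used in this blog can be summarised in a small sketch. The outbound values below come from the 'ranges' entry and the BAR0 assignment discussed above; the EP-side physical address is a placeholder, since the real value depends on what pci_epc_map_addr was given on the iMX8MM:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* Outbound window on the RC, taken from the 'ranges' entry */
        uint64_t cpu_base = 0x910000000ULL;   /* CPU/system address of the window */
        uint64_t pci_base = 0x010000000ULL;   /* PCI bus address of the window    */

        /* Inbound window on the EP (placeholder value for illustration):
         * PCI address of BAR0 -> a local DDR buffer on the iMX8MM */
        uint64_t bar0_pci = 0x010100000ULL;
        uint64_t ep_phys  = 0xB8000000ULL;    /* hypothetical EP-side buffer address */

        uint64_t cpu_addr = 0x910100000ULL;   /* address the RC's CPU accesses */

        /* Step 1: RC iATU outbound translation (CPU -> PCI bus address) */
        uint64_t pci_addr = pci_base + (cpu_addr - cpu_base);

        /* Step 2: EP iATU inbound translation (PCI bus -> EP physical) */
        uint64_t ep_addr = ep_phys + (pci_addr - bar0_pci);

        printf("CPU 0x%llx -> PCI 0x%llx -> EP physical 0x%llx\n",
               (unsigned long long)cpu_addr,
               (unsigned long long)pci_addr,
               (unsigned long long)ep_addr);
        return 0;
}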
So now everything is set up. Translation windows are configured in the PCI drivers and you are at the Linux console. The following sequence unfolds when the CPU issues a memory read:-
In case of memory writes:-
The following happens on the Endpoint: -
The beauty is that this entire translation happens transparently in hardware - your driver just reads/writes to the CPU address, and the PCI host controller handles all the translation automatically!
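To see what "just reads/writes to the CPU address" looks like from a driver's point of view, here is a minimal, hedged sketch of how an RC-side kernel driver typically accesses an Endpoint BAR. The function name and register offset are placeholders; pci_ioremap_bar, readl and writel are the standard kernel interfaces involved:

#include <linux/pci.h>
#include <linux/io.h>

#define DEMO_REG_OFFSET   0x0   /* hypothetical register offset inside BAR0 */

static int demo_access_bar0(struct pci_dev *pdev)
{
        void __iomem *base;
        u32 val;

        /* Map BAR0 (the CPU-side window, e.g. 0x910100000) into kernel VA space */
        base = pci_ioremap_bar(pdev, 0);
        if (!base)
                return -ENOMEM;

        writel(0x12345678, base + DEMO_REG_OFFSET);   /* goes out as a PCIe Memory Write TLP */
        val = readl(base + DEMO_REG_OFFSET);          /* goes out as a PCIe Memory Read TLP  */
        dev_info(&pdev->dev, "read back 0x%08x\n", val);

        iounmap(base);
        return 0;
}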
-- How do we test the Address Translation?
To test reads and writes, we can either make some changes in the driver itself or use the devmem5 user-space binary. We are going to make minor driver-side changes on the iMX8MM and use devmem5 on the RC. The iMX8MM is the PCIe Endpoint and we are using the endpoint test driver to configure it as such. If you want to do the same, please follow this blog -
Enabling PCIe End-point framework on iMX95 torradex board and iMX8MM EVK - NXP Community On the contrary if you want to make iMX95 as RC and iMX8MM Endpoint, feel free to follow this blog - How to configure iMX95EVK as PCIe Endpoint and test it using PCIe Endpoint Test Framework - NXP Community
Two things we are going to do next:-
1. On the iMX8MM EP, we are going to write some random values in drivers/pci/endpoint/pci-epf-core.c. Make the following changes in the pci_epf_alloc_space function (a sketch of the kind of change is shown after this list):-
'space' is the virtual address and 'phys_addr' is the contiguous physical address backing it. Please note that this is a crude way to test the translation; there are better ways to do it. Build the kernel after the changes and boot the board with it. Make the iMX8MM an Endpoint using the PCI Endpoint Test Framework.
2. On the iMX95 Verdin EVK [PCIe RC], we are going to read the address 0x910100000 using devmem5 to verify that we can observe the same data on the RC.
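For reference, here is a hedged sketch of the kind of change meant in step 1. The helper below is hypothetical (my own naming); it is the sort of thing you could call from pci_epf_alloc_space right after the backing memory has been allocated, and the test pattern value is arbitrary:

#include <linux/kernel.h>
#include <linux/types.h>

/* Hypothetical helper: fill the start of the BAR backing memory with a
 * recognizable pattern so it can be spotted from the RC via devmem5.
 * 'space' is the kernel virtual address returned by the allocation,
 * 'phys_addr' is the contiguous physical address that the inbound iATU
 * window will eventually point at. */
static void demo_fill_test_pattern(void *space, phys_addr_t phys_addr)
{
        u32 *p = space;
        int i;

        for (i = 0; i < 8; i++)
                p[i] = 0xDEADBEEF + i;

        pr_info("pci-epf: test pattern written, backing phys addr %pa\n", &phys_addr);
}

If everything is wired up correctly, reading 0x910100000 with devmem5 on the iMX95 RC should then return the first word of that pattern (0xDEADBEEF).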
That's it for today. This was a long blog and if you feel overwhelmed by the details, please feel free to drop in the DMs or comments so that I can try to make it easier. Until next time! Gaurav Sharma