LX2160A PCIe Latency and Performance Test
-----------------------------------------

1. Description

This guide demonstrates the steps of the PCIe latency and performance test.
The setup is as given below:

PCIe End Point device 0 <== PCIe link ==> PCIe Root (Host) device <== PCIe link ==> PCIe End Point device 1

lat_mem_rd and qdma_demo run on PCIe End Point device 0. The test has been done
with the following hardware:
- PCIe Root (host): LX2160ARDB
- PCIe End Point: Advantech iNIC ESP2120 card (LX2160A)
- PCIe link: x8 Gen3

2. Assumptions

- The Advantech iNIC ESP2120 card supports two physical functions (PF0 and PF1).
- We use PF0 as EP0, and PF1 as EP1.
- The Advantech iNIC (LX2160A) is plugged into the LX2160ARDB.
- The EP ATU outbound window is configured from the console.
- The 001-support-pcie-latency-test.patch has been applied to lmbench-3.0-a9 and
  the lmbench tool recompiled.
- qdma_demo is available in the iNIC kernel rootfs by default.

3. Steps of PCIe Latency and Performance Test

End Point steps
===============

a) At the u-boot prompt of the iNIC, run the commands below to enable the memory
   space of PF0 and PF1 by setting up their Command registers:

   => mw 3600004 00100007
   => mw 3608004 00100007

b) Start the Linux kernel by running "boot".

c) Referring to section "22.6.1.3.3.9 Outbound Programming Example" of the
   LX2160A Reference Manual (LX2160ARM) Rev0, check that the PF0 outbound
   region 0 window is set up as follows: 0x9000000000 -> 0, size 4 GB, type MEM.
   Then set up the PF0 outbound region 2 window, 0x9200000000 -> 0xa000000000,
   size 4 GB, type MEM, using mtool (a tool similar to devmem2); a C sketch of
   the same window setup follows these End Point steps:

   $ ./mtool w.l 3600900 2
   $ ./mtool w.l 360090c 0
   $ ./mtool w.l 3600910 92
   $ ./mtool w.l 3600914 ffffffff
   $ ./mtool w.l 3600918 0
   $ ./mtool w.l 360091c a0
   $ ./mtool w.l 3600904 0
   $ ./mtool w.l 3600908 80000000

d) Run the command below to test the PCIe EP to EP latency:

   root@localhost:~# ./lat_mem_rd -P 1 -t 1m
   "stride=64
   0.00049 1519.006
   0.00098 1518.840
   0.00195 1518.519
   0.00293 1518.508
   0.00391 1525.525
   0.00586 1526.050
   0.00781 1525.304
   0.01172 1529.972
   0.01562 1525.989
   0.02344 1518.448
   0.03125 1523.981
   0.04688 1519.011
   0.06250 1525.317
   0.09375 1524.872
   0.12500 1524.585
   0.18750 1523.763
   0.25000 1527.060
   0.37500 1523.535
   0.50000 1522.805
   0.75000 1524.173
   1.00000 1524.024

e) Run the qdma_demo application to test the throughput.
   NOTE: At least 2 cores are required to run the test: one core is used for
   printing results/stats, the other cores for running the test.

   $ export DPDMAI_COUNT=48
   $ ./dynamic_dpl.sh dpmac.3
   $ export DPRC=dprc.2
   root@localhost:~# ./qdma_demo -c 0x8001 -- --pci_addr=0x924fa00000 --packet_size=1024 --test_case=mem_to_pci
   EAL: Detected 16 lcore(s)
   EAL: Detected 1 NUMA nodes
   EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
   fslmc: Skipping invalid device (power)
   EAL: Selected IOVA mode 'VA'
   EAL: Probing VFIO support...
   EAL: VFIO support initialized
   PMD: dpni.3: netdev created
   qdma_parse_long_arg: PCI addr 924fa00000
   qdma_parse_long_arg: Pkt size 2048
   qdma_parse_long_arg:test case mem_to_pci
   qdma_demo_validate_args: Stats core id - 0
   test packet count 4
   Rate:79.99982 cpu freq:2000 MHz
   Spend :2000.000 ms
   Local bufs, g_buf 0x175382000, g_buf1 0x17537f000
   [0] job ptr 0x17537e000
   [15] job ptr 0x17537d000
   core id:0 g_vqid[0]:0 g_vqid[16]:1
   core id:15 g_vqid[15]:2 g_vqid[31]:3
   test memory size: 0x2000 packets number:4 packet size: 2048
   Local mem phy addr: 0 addr1: 0 g_jobs[0]:0x17537e000
   Using long format to test: packet size 2048 Bytes, MEM_TO_PCI
   Master coreid: 0 ready, now!
   Processing coreid: 15 ready, now!
   =>Time Spend :4000.008 ms rcvd cnt:12386304 pkt_cnt:0
   Rate: 50734.199 Mbps OR 3096.570 Kpps
   processed on core 15 pkt cnt: 12386304
   =>Time Spend :4000.000 ms rcvd cnt:12451840 pkt_cnt:0
   Rate: 51002.742 Mbps OR 3112.960 Kpps
   processed on core 15 pkt cnt: 12451840
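For reference, the region 2 window setup from step c) can also be expressed as
one small C program instead of eight individual mtool writes. The sketch below
is illustrative only: the base address 0x3600000 and every value written are
taken from step c), but the register offsets and their names (viewport, control
1/2, base, limit, target) assume the standard DesignWare viewport-based iATU
layout at offset 0x900, which this guide does not spell out; this is not the
source of mtool.

    /*
     * Hypothetical sketch: program the PF0 outbound region 2 window
     * 0x9200000000 -> 0xa000000000, size 4 GB, type MEM, through /dev/mem,
     * writing the same values as the mtool commands in step c).
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PF0_DBI_BASE      0x3600000UL  /* PF0 register block used in steps a) and c) */

    /* Assumed standard DesignWare viewport-based iATU register offsets */
    #define ATU_VIEWPORT      0x900        /* region index (bit 31 = 0 -> outbound) */
    #define ATU_CR1           0x904        /* region control 1: type (0 = MEM)      */
    #define ATU_CR2           0x908        /* region control 2: bit 31 = enable     */
    #define ATU_LOWER_BASE    0x90c
    #define ATU_UPPER_BASE    0x910
    #define ATU_LIMIT         0x914
    #define ATU_LOWER_TARGET  0x918
    #define ATU_UPPER_TARGET  0x91c

    static void reg_write(volatile uint8_t *dbi, uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(dbi + off) = val;
    }

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) {
            perror("open /dev/mem");
            return 1;
        }
        volatile uint8_t *dbi = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, PF0_DBI_BASE);
        if (dbi == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }

        reg_write(dbi, ATU_VIEWPORT,     0x2);        /* select outbound region 2   */
        reg_write(dbi, ATU_LOWER_BASE,   0x0);        /* CPU base 0x92_0000_0000    */
        reg_write(dbi, ATU_UPPER_BASE,   0x92);
        reg_write(dbi, ATU_LIMIT,        0xffffffff); /* 4 GB window                */
        reg_write(dbi, ATU_LOWER_TARGET, 0x0);        /* PCIe target 0xa0_0000_0000 */
        reg_write(dbi, ATU_UPPER_TARGET, 0xa0);
        reg_write(dbi, ATU_CR1,          0x0);        /* type MEM                   */
        reg_write(dbi, ATU_CR2,          0x80000000); /* enable the region          */

        munmap((void *)dbi, 0x1000);
        close(fd);
        return 0;
    }

In this reading, writing 0x2 to the viewport with bit 31 clear selects outbound
region 2, the limit of 0xffffffff relative to the base gives the 4 GB size, and
setting bit 31 of control register 2 enables the window.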
Root complex (host) steps
=========================

a) Boot the LX2 to the Linux prompt.

b) Run 'lspci -v' to find the address of the BAR whose memory is the target
   memory for the test:

   $ lspci -v
   0001:01:00.0 Ethernet controller: Freescale Semiconductor Inc Device 8d80 (rev 20)
           Flags: bus master, fast devsel, latency 0, IRQ 255
           Memory at a047000000 (32-bit, non-prefetchable) [size=8K]
           Memory at a047002000 (32-bit, non-prefetchable) [size=8K]
           Memory at a047800000 (64-bit, prefetchable) [size=2M]
           Memory at a057c00000 (64-bit, prefetchable) [size=64K]
           [virtual] Expansion ROM at a046000000 [disabled] [size=16M]
           Capabilities: [40] Power Management version 3
           Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit+
           Capabilities: [70] Express Endpoint, MSI 00
           Capabilities: [b0] MSI-X: Enable- Count=256 Masked-
           Capabilities: [100] Advanced Error Reporting
           Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
           Capabilities: [158] #19
           Capabilities: [178] Single Root I/O Virtualization (SR-IOV)
           Capabilities: [1b8] Address Translation Service (ATS)

   0001:01:00.1 Power PC: Freescale Semiconductor Inc Device 8d80 (rev 20)
           Flags: bus master, fast devsel, latency 0, IRQ 255
           Memory at a047104000 (32-bit, non-prefetchable) [size=8K]
           Memory at a047106000 (32-bit, non-prefetchable) [size=8K]
           Memory at a04fa00000 (64-bit, prefetchable) [size=2M]
           Memory at a058010000 (64-bit, prefetchable) [size=64K]
           Capabilities: [40] Power Management version 3
           Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit+
           Capabilities: [70] Express Endpoint, MSI 00
           Capabilities: [b0] MSI-X: Enable- Count=256 Masked-
           Capabilities: [100] Advanced Error Reporting
           Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
           Capabilities: [178] Single Root I/O Virtualization (SR-IOV)
           Capabilities: [1b8] Address Translation Service (ATS)

   Take the address of the BAR you want to use in the test. Here we use
   0xa04fa00000, a PF1 BAR, as the test address; it corresponds to address
   0x924fa00000 on the PF0 side. So for the qdma_demo test we pass
   0x924fa00000 as pci_addr, and we also use the address 0x924fa00000 in file
   lib_mem.c for the latency test.
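The correspondence between the two addresses comes from the outbound region 2
window programmed in the End Point steps (base 0x9200000000, target
0xa000000000, size 4 GB): an RC-side address inside the target range maps back
to window base plus offset on the PF0 side. The minimal sketch below only
restates that arithmetic with values taken from this guide; the helper name
pci_to_pf0_addr is made up for illustration.

    /* Sketch: map an RC-side BAR address back to the PF0-side address that
     * falls inside the outbound region 2 window programmed in step c). */
    #include <stdint.h>
    #include <stdio.h>

    #define OB2_CPU_BASE   0x9200000000ULL  /* outbound region 2 base (PF0 side) */
    #define OB2_PCI_TARGET 0xa000000000ULL  /* outbound region 2 target (PCIe)   */
    #define OB2_SIZE       0x100000000ULL   /* 4 GB window                       */

    static uint64_t pci_to_pf0_addr(uint64_t pci_addr)
    {
        if (pci_addr < OB2_PCI_TARGET || pci_addr >= OB2_PCI_TARGET + OB2_SIZE)
            return 0;  /* not covered by this window */
        return OB2_CPU_BASE + (pci_addr - OB2_PCI_TARGET);
    }

    int main(void)
    {
        uint64_t bar = 0xa04fa00000ULL;  /* PF1 BAR as reported by lspci */

        /* Prints 0x924fa00000, the pci_addr passed to qdma_demo */
        printf("PF0-side address for BAR 0x%llx: 0x%llx\n",
               (unsigned long long)bar,
               (unsigned long long)pci_to_pf0_addr(bar));
        return 0;
    }

Running it prints 0x924fa00000 for the PF1 BAR at 0xa04fa00000, which is the
value passed to qdma_demo and used in lib_mem.c above.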