Hi All,
Yes, as Bill mentioned, after we increased the CPU clock frequency (and hence the platform clock), we observed memory issues which showed up even at normal temperatures. I am not sure anymore how I obtained the latency settings, but I remember it was kind of a mix and match of previous values which proved to be working at that time.
I just dug a bit deeper into it; here is a table of the values we have seen so far. All values are tag/data latency in register format (add 1 to get cycles; the field order is write/read/setup, exactly as seen in the registers at 0x40006108/0x4000610c, see the ARM L2C-310 docs).
| Reset value | Linux 3.0 BSP | Vanilla Linux |
| 0x111/0x222 | 0x132/0x132 | 0x111/0x000 |
- The reset value was obtained by reading the register state before the L2 is used (using U-Boot without L2 cache support). I am not sure whether these are really reset values or whether the boot ROM writes them.
- The Linux 3.0 BSP values seem to have been introduced in July 2012. The initial version of the cache initialization function mxc_init_l2x0 looks an awful lot like the version from arch/arm/mach-mx6/mm.c, including the data/tag latency values... It seems these values have simply been copied over from the i.MX6...
- The vanilla Linux values were introduced in May 2013 with the first commit of the vf610.dtsi device tree file, which was then merged in Linux 3.11. The values seem to be an attempt to use the reset values, but the tag and data latencies have been mixed up, and the offset of 1 clock (register values vs. device tree values, which are in cycles) has not been accounted for...
So far, I would say the reset value is probably the best choice. According to these values, it would take 2 cycles to access the tags and 3 cycles to access data...
The application note AN4947 ("Understanding Vybrid Architecture") actually provides some more insight into the L2 cache architecture. Chapter 3.10 says:
During the first phase of L2 access, the tags are accessed in parallel and hit/miss is determined. The tag arrays are built in the L2 cache controller. No wait states are necessary
Looking at the values available for the L2C-310, I would translate that into "0b000 - 1 cycle of latency, there is no additional latency", for all three tag latency values (write/read/setup).
Furthermore, the manual says:
During the second phase of a L2 access, a single data array structure is accessed:
a) L2 read cache hit: the L2 data array provides one line of data to the L2 cache controller at the same time. For Vybrid, it takes three clock cycles to read or write the L2 data array structure (upper GRAM on platform frequency). Together, four clocks Cortex-A5 latency
Again, looking at the options in the L2C-310, I would translate that into "0b010 - 3 cycles of latency" for the data write/read latencies. I am not sure about the data setup latency...
So this suggests a value of 0x000/0x22X... This more or less aligns with the reset value; only the tag latencies are smaller.
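For reference, if one went with the reset values (which the suggestion above roughly matches for the data side), the vf610.dtsi fragment would presumably look something like this. This is only a sketch: it assumes the standard PL310 bindings, where the latencies are given in cycles in read/write/setup order (i.e. register field value + 1), and the `&L2` label is illustrative:

```dts
/* Sketch only: program the observed reset latencies via the
 * generic PL310 bindings instead of the current vanilla values.
 * Tag 0x111 -> 2 cycles each field, data 0x222 -> 3 cycles each. */
&L2 {
	arm,tag-latency = <2 2 2>;	/* read write setup, in cycles */
	arm,data-latency = <3 3 3>;
};
```

Since all fields within each reset register are equal, the read/write/setup ordering question conveniently does not matter here.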
However, this sentence in the L2C-310 manual (under RAM clocking and latencies) makes my interpretation questionable:
The programmed RAM latency values refer to the actual clock driving the RAMs, that is, the slow clock if the RAMs are clocked slower than the cache controller.
The latencies in "Understanding Vybrid Architecture" are in controller clock cycles... Given that the programmed latencies refer to RAM latencies, the values from "Understanding Vybrid Architecture" are probably not applicable here... So I do not know what the right values are. Clearly, 1 clock cycle for the data latency, as it is in the mainline kernel, seems to be too low.
Maybe the authors of the above-mentioned application note can comment on that, jiri-b36968 or Rastislav Pavlanin?
--
Stefan