mx28 auart1 cpu high load


nicusorhuhulea
Contributor II

Hello,

I have a custom platform (kernel version 5.4.188) with an i.MX287 CPU. I'm using AUART1 to communicate with an MCU, which sends some data every 4 ms. This setup puts the CPU load at around 20%, as reported by the 'top' cmd.

This setup was working fine with ~0-1% CPU load on an older kernel (2.6.35).

I have attached a test app (serialTest.c) to exercise the CPU-MCU communication. On 5.4.188 the CPU load is ~20%, and on 2.6.35 it is ~1%.
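The attachment isn't reproduced in this post, but the test app is essentially the following kind of loop (a sketch, not the exact file; the device node /dev/ttyAPP1 and the baud rate are assumptions for this board):

#define _DEFAULT_SOURCE         /* for cfmakeraw() */
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/ttyAPP1", O_RDONLY | O_NOCTTY);
	if (fd < 0) { perror("open"); return 1; }

	struct termios tio;
	tcgetattr(fd, &tio);
	cfmakeraw(&tio);                /* raw mode: no line-discipline processing */
	cfsetispeed(&tio, B115200);     /* assumed baud rate */
	cfsetospeed(&tio, B115200);
	tio.c_cc[VMIN]  = 1;            /* return as soon as any data is buffered */
	tio.c_cc[VTIME] = 0;
	tcsetattr(fd, TCSANOW, &tio);

	unsigned char buf[256];
	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));
		if (n <= 0)
			break;
		printf("%zd\n", n);     /* print bytes returned per read() */
	}
	close(fd);
	return 0;
}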

The dts bits from 5.4.188 on the custom board: 

&auart1 {
	pinctrl-names = "default";
	pinctrl-0 = <&auart1_2pins_a>;
	status = "okay";
};
The custom board dts includes imx28.dtsi and uses this node: https://elixir.bootlin.com/linux/v5.4.188/source/arch/arm/boot/dts/imx28.dtsi#L324

When running the serialTest.c app, this is how the data looks on 5.4.188:

~# ./serialTest
15
15
15

I'm getting 15 bytes because I've changed the FIFO level select from ONE_QUARTER (interrupt when at least 4 of the 16 FIFO entries are filled) to SEVEN_EIGHTHS (at least 14 of 16 entries), which reduced the CPU load from ~20% to ~11%.
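Back-of-the-envelope (assuming ~15 bytes really arrive per 4 ms burst, i.e. 250 bursts/s, and that the receive-timeout interrupt picks up the tail of each burst): with the ONE_QUARTER trigger the RX interrupt fires at bytes 4, 8 and 12, plus one timeout interrupt for the remainder, so roughly 4 interrupts per burst (~1000/s). With SEVEN_EIGHTHS it fires once at byte 14 plus the timeout, so roughly 2 per burst (~500/s). Halving the interrupt rate lines up with the load dropping from ~20% to ~11%.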

When running the serialTest.c app, this is how the data looks on 2.6.35:

~# ./serialTest
45
30
54
30
45
30
45
30
54

So, on 2.6.35 the read syscall returns much more data per call. From my understanding this comes from the uart memory buffer (a circular buffer): once the 16-byte hw FIFO holds at least ONE_QUARTER, the interrupt triggers and the CPU copies the data from the FIFO into the circular buffer; later the read syscall fetches the data from that circular buffer.

What else have I tried?

I checked whether there is a difference in the scheduler clock, since there were some modifications there to solve certain latency issues. I saw that drivers/clocksource/mxs_timer.c runs in Fixed Count mode, while on 2.6.35 it runs in Match Count mode. So I switched to Match Count mode to be aligned with 2.6.35, but there was no performance improvement.

I tried to make the read syscall return only when more than 30 bytes are available (just to be aligned with the 2.6.35 behaviour) by using termios and modifying the VMIN field. The returned data is indeed around the VMIN value, but there is still no significant improvement (I observed a ~2% improvement, reaching ~9% CPU load).
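Concretely, the termios change was along these lines (a fragment, reusing the fd/tio setup from the sketch above; the exact values are illustrative):

	tio.c_cc[VMIN]  = 30;   /* block until ~30 bytes are buffered ... */
	tio.c_cc[VTIME] = 1;    /* ... or 100 ms pass after the last byte */
	tcsetattr(fd, TCSANOW, &tio);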

So, are there any other pointers on this? Whatever I try, everything narrows down to that hw FIFO buffer. How is it that on 2.6.35 I get 0-1% CPU consumption, while on 5.4.188 I get ~11% (and that only after changing the FIFO level select, whereas on 2.6.35 there is no FIFO level modification, it just runs with the default)?

Note:

2.6.35 was not using DMA, and on 5.4.188 DMA is conditioned on the existence of the RTS/CTS signals, which our custom platform does not have.
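For completeness: on boards that do route the RTS/CTS pins, the 5.4 mxs-auart driver can set up DMA when the rtscts property is present. Hypothetically (not applicable here, and the pin group name needs checking against imx28.dtsi), that would look something like:

&auart1 {
	pinctrl-names = "default";
	pinctrl-0 = <&auart1_pins_a>;	/* a pin group that routes RTS/CTS */
	uart-has-rtscts;		/* lets the driver enable RX/TX DMA */
	status = "okay";
};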


thank you

1 Solution
nicusorhuhulea
Contributor II

The reason for the high CPU load is a change in the tty design. The old 2.x kernel used a delay mechanism plus a 'direct' path to transmit data to user space (flush_to_ldisc). This transfer produces the largest share of the CPU consumption, but there is a second source of high load: the hw FIFO buffer, which interrupts the CPU whenever at least 4 of its 16 entries are filled. The latter forces the CPU to handle data too frequently, as the MCU sends data every 4 ms. Changing the FIFO level select to ease the interrupt requests on the CPU works, cutting the CPU usage in half. The schedule_delayed_work part of the 2.x kernel has to be handled on the application side through a delay. The user will not lose any data, because the CPU moves the data from the hw FIFO buffer into a memory (ring) buffer, and flush_to_ldisc then takes the data out of that memory buffer to user space.

3 Replies
nicusorhuhulea
Contributor II

It seems that this is due to a design change in the tty subsystem. On 2.6.35, when the mxs-auart.c driver is ready to send data to user space, it calls tty_flip_buffer_push, where there is a condition: it either uses schedule_delayed_work or, depending on a flag (low_latency, which no longer exists on 5.x), calls flush_to_ldisc, which pushes the data without delay.

On 2.6.35 the schedule_delayed_work path is taken, so the data is scheduled to be pushed out of the buffer later, depending on the jiffies, and this lowers the CPU load. This is the reason why the read syscall returns more data, e.g. 45 bytes, 30 bytes, 54 bytes, ...
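For reference, the 2.6.35 implementation (drivers/char/tty_buffer.c) looks roughly like this (paraphrased from memory, so check the actual tree):

void tty_flip_buffer_push(struct tty_struct *tty)
{
	unsigned long flags;

	spin_lock_irqsave(&tty->buf.lock, flags);
	if (tty->buf.tail != NULL)
		tty->buf.tail->commit = tty->buf.tail->used;
	spin_unlock_irqrestore(&tty->buf.lock, flags);

	if (tty->low_latency)
		flush_to_ldisc(&tty->buf.work.work);	  /* push to ldisc immediately */
	else
		schedule_delayed_work(&tty->buf.work, 1); /* defer by one jiffy */
}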

On 5.x the data is pushed via tty_schedule_flip (https://elixir.bootlin.com/linux/v5.4.188/source/drivers/tty/tty_buffer.c#L405) without any delay. This can create the misleading impression that the data is sent directly from the 16-byte hw FIFO without being stored in the uart memory buffer (something like a ring buffer). Hence the high CPU load.

So it seems this is a deliberate design change: the tty subsystem was restructured to remove everything related to schedule_delayed_work and to let the tty layer handle data at very high rates. The delay that the 2.6.35 kernel applied now has to be implemented in user space, by using VMIN from termios or by adding a delay between reads (and by increasing the FIFO level select on the driver side, in mxs-auart.c).
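The "delay between reads" variant is just something like this (a sketch, reusing fd/buf from the test app above):

	for (;;) {
		usleep(8000);   /* let about two 4 ms bursts accumulate in the tty buffer */
		ssize_t n = read(fd, buf, sizeof(buf));
		if (n <= 0)
			break;
		printf("%zd\n", n);
	}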


Bio_TICFSL
NXP TechSupport

Hello,

Good to know. Since the EVK BSP is 2.6.35, there has been no upgrade on our side, so this is useful to know.

thanks
