LS1012A - system timer issues

andreykorolyov · ‎07-26-2017

Hello,

While working with the LS1012A-RDB I've observed quite odd behavior, possibly related to system timer instability: roughly one in every few 'dmesg' scrolls stalls over an ssh connection, while the same output performs relatively well on the serial console. The first suspect was frequency scaling affecting the pfe clock, but fixing the SoC CPU frequency didn't change anything; the stalls continued to appear, so I tried to dig deeper.

After a quick look, it appears that a) the system timer does not advance steadily over time, which affects TCP connections on local links in the manner described, e.g. the system loses a few TCP packets due to a timer (?) stall, and b) the system timer is considerably slower than it should be, losing tens of seconds within a few minutes of near-idle with the Linux kernel loaded. More interesting, an active Rx/Tx job through the pfe interface can significantly reduce the variation: the larger the flow bandwidth, the lower the observable drift. The execution time of 'sleep' in the SDK's U-Boot also seems to be about *30* percent longer than a known-to-be-good sleep with the same argument.

Unfortunately the NTP daemon seems to diverge with any sane drift setting, so the issue is a clear candidate for a sw/hw fix.

I am using the LS1012A reference board from NXP with the latest SDK 2.0-1703, in case this is relevant. Any suggestions for improving the situation are highly welcome.

johanderycke · ‎01-24-2019

andreykorolyov: Did you ever get a solution for this?
We have the same issue:

time ssh gateway sleep 100
real 2m5.245s
user 0m0.100s
sys 0m0.010s

johanderycke · ‎02-04-2019

This fixed it for me:
armv8: configs: ls1012a: correct the generic timer frequency · ARM-software/u-boot@b584510 · GitHub
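
For readers wondering why a wrong generic timer frequency would stretch every interval by the same factor, here is a minimal sketch of the arithmetic. The frequencies below are made-up illustrative values, not the ones from the linked commit or from the LS1012A; only the relationship between them matters. It assumes the OS converts seconds to counter ticks using the declared frequency while the hardware counts at a different actual rate.

```python
def real_duration(requested_s, declared_hz, actual_hz):
    """Wall-clock seconds that pass while the OS believes requested_s elapsed.

    The OS waits for requested_s * declared_hz ticks; the hardware needs
    ticks / actual_hz real seconds to deliver them.
    """
    ticks = requested_s * declared_hz
    return ticks / actual_hz


# Hypothetical values chosen only to illustrate a constant stretch factor.
DECLARED_HZ = 1_000_000   # frequency the firmware reports to the OS (assumed)
ACTUAL_HZ = 800_000       # rate the counter really runs at (assumed)

for requested in (100, 200):
    print(requested, real_duration(requested, DECLARED_HZ, ACTUAL_HZ))
```

With these assumed values every requested interval takes 1.25 times longer in real time, regardless of its length, which is the kind of proportional error a frequency mismatch would produce.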

Pavel · ‎02-19-2018

If you are using 100 Mbps Ethernet with SDK 1703, you can see really poor TCP (scp) performance. This problem is resolved in the LSDK with kernel 4.4 and up.

This problem happens if Fast Ethernet (100Mbps) is used. There is no similar problem if 1G or 10G Ethernet is used.

Have a great day,
Pavel Chubakov

-----------------------------------------------------------------------------------------------------------------------
Note: If this post answers your question, please click the Correct Answer button. Thank you!
-----------------------------------------------------------------------------------------------------------------------

justinjonas · ‎12-08-2017

Has anyone found a solution to this problem? We are having the same issue.

Pavel · ‎07-28-2017

Do you send these commands from your PC to the LS1012a board?

Do you send these commands from the LS1012a board to some board?

What is incorrect in this result?

Have a great day,
Pavel Chubakov

andreykorolyov · ‎08-03-2017

Have you been able to reproduce the problem? This thread has been marked as 'Answered', which is certainly premature for now.

andreykorolyov · ‎07-28-2017

Yes, the issuing host is an x86 laptop with a presumably stable clock source. As one can see, the time elapsed on the board for the 'sleep' command, which relies on the system timer, is something like 30% longer than the real interval one would expect. If there is a specific snippet of userspace code that could take a measurement in the same manner, for (maybe) a better understanding of the situation, please share it here.
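
A host-side measurement of this kind could be sketched as follows. This is only an illustration, not an NXP-provided tool: the helper names `timed_run` and `stretch_factor` are hypothetical, and the script assumes that the host's monotonic clock is trustworthy and that ssh access to the board is already set up.

```python
import subprocess
import sys
import time


def timed_run(cmd):
    """Run cmd and return the host-side elapsed time in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start


def stretch_factor(cmd, requested_s):
    """Ratio of host-measured duration to the duration the target was asked to sleep."""
    return timed_run(cmd) / requested_s


# Example call against a board (address is a placeholder, not a real host):
#   stretch_factor(["ssh", "ls1012.board.address", "sleep", "100"], 100)
# Local self-check that needs no board: time a short sleep in a child process.
local = timed_run([sys.executable, "-c", "import time; time.sleep(0.2)"])
print(f"local 0.2 s sleep measured as {local:.3f} s")
```

The measured time includes ssh connection setup, so longer sleeps give a cleaner ratio because the fixed overhead becomes a smaller share of the total.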

Pavel · ‎07-28-2017

What incorrect behavior do you see if this command is used?

Have a great day,
Pavel Chubakov

andreykorolyov · ‎07-28-2017

Here it goes:

time ssh -lroot 192.168.11.125 sleep 100

real 2m5.840s
user 0m0.050s
sys 0m0.007s

time ssh -lroot 192.168.11.125 sleep 200

real 4m10.897s
user 0m0.047s
sys 0m0.007s
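
To put the two runs above on a common scale, the 'real' values can be converted to seconds and divided by the requested sleep. A small sketch; `parse_real` is a hypothetical helper that only handles the `NmS.SSSs` form shown in this post.

```python
import re


def parse_real(text):
    """Convert a `time` value such as '2m5.840s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if m is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# (requested sleep in seconds, 'real' value reported above)
runs = [(100, "2m5.840s"), (200, "4m10.897s")]

for requested, real in runs:
    measured = parse_real(real)
    print(f"sleep {requested}: {measured:.3f} s, ratio {measured / requested:.4f}")
```

Both ratios come out close to 1.25, i.e. as seen from the host the remote sleep takes roughly a quarter longer than requested in each run.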

keestrommel · ‎08-18-2017

The observed behavior can be explained by a bad Ethernet connection, without any time problem. Assume that the time is correct and sleep returns accurately after 100 (or 200) seconds, but the TCP packet that tells the ssh host that the sleep has finished gets lost and is retransmitted successfully a few seconds later. The ssh command then takes longer even though there is no time problem at all.

I once had an Ethernet problem (on a completely different board) and drew the same wrong conclusion that there was a time problem when there wasn't any :)

If you have a laptop connected to the reference board, make sure that you disable the EEE (Energy Efficient Ethernet) option of the adapter in use.
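
One way to check the retransmission explanation against the numbers posted in this thread is to split the measured durations into a fixed overhead and a part proportional to the requested sleep. The sketch below assumes a simple straight-line model and uses only the two ssh measurements from the thread (125.840 s for sleep 100, 250.897 s for sleep 200); two points are far too few for any statistical claim, so treat the result as a rough indication.

```python
def fit_line(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# (requested seconds, host-measured seconds) taken from the ssh runs in this thread
slope, intercept = fit_line((100, 125.840), (200, 250.897))
print(f"proportional factor: {slope:.3f}, fixed overhead: {intercept:.2f} s")
```

A delay caused by a single lost packet at the end of the session would be expected to show up mostly in the fixed term. For these two runs the fixed term is under a second, while most of the extra time grows with the requested duration.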

andreykorolyov · ‎08-24-2017

Thanks Kees,

In the answer just below I've referenced, as an example, the misbehavior that also affects long-term measurements via the serial console, so the divergence is real and not tied to the behavior of a specific subsystem such as the pfe clock. Interestingly, the Linux kernel from the binary evaluation packages (both NAS and Home Router for LS1012A) does not exhibit timer divergence at all.

Pavel · ‎07-28-2017

Please send the command sequence for reproducing this incorrect behavior on the NXP board.

Have a great day,
Pavel Chubakov

andreykorolyov · ‎07-28-2017

Something like 'time ssh ls1012.board.address sleep 100' should provide a good example.

Pavel · ‎07-28-2017

We have not encountered incorrect behavior of the LS10xx timer. The clock source is stable. An unstable clock source usually causes incorrect SDRAM behavior and board hangs.

The system uses the interrupt from the system timer as its clock source.

Usually an application can achieve an accuracy of approximately 10 ms using system time.

This is because the kernel (Linux or Windows) requires a timer for servicing internal tasks.

There is also Real Time Linux.

See the following page:

https://en.wikipedia.org/wiki/RTLinux

Have a great day,
Pavel Chubakov

andreykorolyov · ‎07-28-2017

Thanks, that's strange. I should say that the issue is present on both the reference board and our alpha-version custom boards, so this is certainly not a single-board PLL/resonator failure. With the custom boards out of scope, what could you suggest for the reference board from NXP? The increasing lag can be shown on the reference binaries or the reference SDK, so the issue certainly should be addressed back to NXP. The board switches have been set to their defaults of course, in case this matters.

Pavel · ‎07-27-2017

Look at the following pages about testing and using the system timer:

http://elinux.org/Kernel_Timer_Systems

https://stackoverflow.com/questions/240058/1ms-resolution-timer-under-linux-recommended-way

https://stackoverflow.com/questions/3124852/linux-timerfd-accuracy

http://man7.org/linux/man-pages/man2/timer_create.2.html

Test the timer on your board using these pages.

Have a great day,
Pavel Chubakov

andreykorolyov · ‎07-28-2017

Thank you Pavel,

I'm not sure you understood my question correctly. I am aware of timer testing/usage techniques; the question instead describes *huge* timer instability, which in itself is almost certainly unwanted behavior. As you can see, there is only one clocksource available on this platform and it does something it is not intended to do: it distorts time accuracy to a degree far from even RTC-level stability. Instability as large as *one third* of the real clock is a bug, and this must be fixed on the NXP side by some means.
