M54455EVB - Linux kernel boots very slowly

2,207 Views
dmarks_ls
Senior Contributor I

OK, I have a factory-fresh M54455EVB.  I booted the box as-is (using "bootm 0" to launch the kernel), and everything ran fine.  I then installed the new M54455 LTIB (2010-09-19) and got it set up on our 64-bit Ubuntu 11.04 box.  (That took some hacking: I had to install RPM, edit sudoers, and edit Ltibutils.pm so it would locate libm.so and libz.so correctly.)  I changed the configuration from M54451EVB to M54455EVB, kept the settings that were there, and added a few packages (iptables, dropbear, lrzsz, etc.).  I had a uImage and rootfs.jffs2 built in about 15 minutes.
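For anyone hitting the same host issues, the prep went roughly like this (a sketch from memory; package names and the sudoers rpm path may differ on your install):

  # 64-bit Ubuntu 11.04 host prep for LTIB (sketch)
  $ sudo apt-get install rpm build-essential zlib1g-dev
  $ sudo visudo
  # then add a line like (rpm path per your LTIB install):
  #   <user> ALL = NOPASSWD: /usr/bin/rpm, /opt/ltib/usr/bin/rpm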

 

I then followed the instructions in "ColdFire M54455EVB BSP User's Manual, Rev 1.3" on how to upgrade the U-Boot image and then flash a new kernel and RFS to the device, no trouble.  I updated the bootargs to include an explicit console, saved that, and then booted the kernel.
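For reference, the bootargs change was along these lines at the U-Boot prompt (the console device and MTD partition number here are assumptions; adjust for your setup):

  => setenv bootargs console=ttyS0,115200 root=/dev/mtdblock2 rootfstype=jffs2 rw
  => saveenv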

 

The kernel booted quite quickly, spitting out a bunch of new stuff about testing DES, DES3, AES, MD5, SHA1, etc.  It registered usbcore, usbhid, ALSA, TCP cubic, protocol families 17 and 15, UDP, and TCP.  Then it just sat there.  Several seconds later, it mounted root (jffs2).  A couple of minutes later, it started BusyBox v1.11.2.  Over the course of the next 30 minutes, it went through about 25 more lines of bootup sequence.  Right now, it's generating the 1024-bit RSA key for dropbear.  It might be done by morning.

 

For whatever reason, right around the moment it begins mounting the root filesystem, the system slows down by a factor of 100 or more.  It's still running... I'm confident that the RSA key will eventually be generated, even if it takes a few days.

 

Any idea why the kernel would suddenly slow down like that, especially given that the old U-Boot and old kernel (flashed at the factory) booted and ran just fine?

14 Replies
1,355 Views
TomE
Specialist II

If it was an old PC I'd guess it had run out of memory and was "thrashing" to the swap disk. But I doubt if an embedded system would be set up that way.

 

Read this one for a problem where Linux ran slow on a ColdFire because it had TOO MUCH memory, but in that case it was only about 3 times slower, not "glacial" like yours is:

 

https://community.freescale.com/thread/83658

 

If it ever does come up, run "top", just in case a user process is slowing you down. You won't be that lucky. Something in the kernel (or one of the modules) is doing something stupid.

 

Try removing features or modules - the last thing before it mounted root was starting the network, so get rid of that first. Otherwise, if mounting root is what slowed it down, there's something wrong with the root filesystem setup or drivers. Re-order some of the startups and try to find out which one did it.

 

If you can run it with a debug pod, then simply halt and continue it 10 times or so and see where it is spending all its time. The PC should match something in System.map.
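If you sample a PC that way, something like this finds the enclosing symbol (GNU awk assumed for strtonum; the PC value is just an example):

  # System.map lines look like "4003a100 T do_page_fault"; print the
  # last symbol at or below the sampled PC
  $ awk -v pc=0x4003a1c4 'strtonum("0x" $1) <= strtonum(pc) { sym = $3 } END { print sym }' System.map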

 

Let us know what you find.

 

Tom

 

1,355 Views
dmarks_ls
Senior Contributor I

OK, update... I tried a "default" install of LTIB, changing only the settings to select the M54455EVB, flashed both kernel and RFS, and it ran slow.  I removed some of the optional packages and stuff from the kernel, like ALSA (we don't need sound).  Still slow.  I then selected an NFS-only build, flashed the kernel, and pointed the target at the NFS server.  Booted fine, but still slow.  So that probably eliminates the theory that it was the mounting of the JFFS2 filesystem that was introducing the slowness.  NFS seemed to come up quickly enough, but it started running slowly just before the "VFS: Mounted root (nfs filesystem) on device 0:11" message appeared.

 

So, this isn't going away easily.  I'll have to start really stripping down the kernel; hopefully that will shed some light on what's going on.

1,355 Views
TomE
Specialist II

If you have a debug pod, then just attach it and tell it to run. When it goes slow (or after it is up), just HALT it and look at the program counter. That's all you need to know. No need for telling the debugger about the symbol table or anything. Then match the PC (or a bunch of them - get a sample) to the contents of System.map.

 

The other thing to do is to search deep in /proc. There might be some useful statistics being kept under /proc/driver, /proc/sys and a bunch of other places that might reveal what is going wrong. Just do an "ls -R /proc" and look for anything interesting.
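In particular, /proc/interrupts is worth snapshotting twice and comparing (assuming your BusyBox build includes diff):

  $ cat /proc/interrupts > /tmp/irq.1 ; sleep 5 ; cat /proc/interrupts > /tmp/irq.2
  $ diff /tmp/irq.1 /tmp/irq.2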

 

I still suspect your network. Check /proc/net, netstat and ifconfig.

 

Tom

 

1,355 Views
JimDon
Senior Contributor III

Tom,

Just wondering how you would use a JTAG to debug Linux?

I'd really like to know about a tool that can do this. I assume you have tried this....

1,355 Views
TomE
Specialist II

We have a debug pod here for our Linux-based product, but I've never had to use it (yet).

 

I'd only use this for low-level kernel or boot problems. For the case of the problem in this thread, the question is "where is the CPU spending 99% of its time?" and that should be easily answered by just stopping the CPU and seeing where it is - not in a GUI-based IDE, but just reading the program counter and then reading the Linux "System.map" file to find what function it is in.

 

Applications can be debugged with gdb running on the target (or more likely gdb-server, as gdb itself is huge and not friendly when it doesn't have access to the sources).
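The usual cross-debug setup is something like this ("myapp", the port, and the cross-gdb name are placeholders; gdbserver has to be present in the target rootfs):

  # on the target
  $ gdbserver :2345 ./myapp

  # on the host
  $ m68k-linux-gnu-gdb ./myapp
  (gdb) target remote <target-ip>:2345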

 

Tom

 

1,355 Views
JimDon
Senior Contributor III

I see.

So you are recommending using the debug pod for a boot or kernel problem, but you have no idea how that would be done?

I hope you prove me wrong, but as far as I know, unless you buy a very expensive setup (like one from Wind River), I don't know that it can be done in practice.

If you do figure it out and get it to work, please explain the details of how to get it to work...

1,355 Views
ChrisJohns
Contributor I

Hi Jim,

 

You need to look at the kernel as an embedded system and handle GDB and the BDM connection in the same manner. You should be able to debug a kernel with a simple BDM pod and not need expensive tools. Greg Ungerer used a simple parallel-port BDM device and gdb for the uClinux work. Having said this, I have not done it with an MMU-based kernel, and it has been years since I played with a Linux kernel at this level. The key point is getting the symbols gdb uses to match the image in memory; then it should just work.

 

In simple terms this means starting gdb with the kernel as an ELF file. Reset the target (a BDM command), then hit continue and let the kernel load. The symbols in the ELF file should match the loaded kernel. Hit ^C, set a breakpoint, continue, and do something to hit your breakpoint.

 

Some complications are stepping over traps, the virtual address space of running applications, and kernel loadable modules.

 

1,355 Views
JimDon
Senior Contributor III

Chris,

 

Interesting, but still no how to.

The only problem I see there is that GDB depends on a sprite running on the target. Since the sprite is basically a user-space app that loads the target app and communicates the debug information over the selected transport, the kernel needs to be running already.

Second, you can't start the kernel until U-Boot has initialised the hardware, so the debugger would have to know how to do that as well. Just breaking into the kernel at some random place with no symbols is really not that useful.

 

My point is really this - why would you recommend that someone try a technique when you have no idea how to actually do it, or even whether it is reasonably possible? If I were posting, I would not consider it ethical to do so, but that's just me. I have a good reason for this: mainly, the OP will just waste time attempting it, to no avail. Second, if the OP asked me how to do it, I would have no answer.

 

Again, I really hope someone can come up with a how-to on using a JTAG to trace U-Boot and/or the kernel with symbols. That would be fantastic.

 

1,355 Views
ChrisJohns
Contributor I

> Interesting, but still no how to.

 

On the host, with an m68k cross gdb and a suitable .gdbinit script that connects to the BDM gdb server as per the documentation, then:

 

 $ m68k-gdb kernel.elf

 

where kernel.elf is the kernel executable containing all the debug data. This is the normal procedure for using BDM and GDB with the gdb-server on a host. The BDM project explains using the BDM gdbserver on its web site.
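The .gdbinit I have in mind is along these lines (how you attach depends on how your BDM gdb was built - check the BDM project documentation; the port is a placeholder):

  # .gdbinit sketch for kernel debugging over BDM; the kernel ELF is
  # already loaded via the command line above
  target remote localhost:2345   # or the BDM build's native target command

From the gdb prompt: continue, let the boot monitor load and start the kernel, hit ^C when it bogs down, and use "info registers pc" and "backtrace" against the kernel symbols.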

 

> The only problem I see there is that GDB depends on a sprite running on the target.

 

With BDM, the bdm server runs on the host and connects directly into the CPU. Your use of JTAG is a little confusing to me. The ColdFire uses BDM, which is not JTAG. Are you talking about another debugger, or BDM?

 

If you are debugging applications, then a native gdb-server running on the target is the correct tool, and what you state is correct. For the kernel, using the debug module of the CPU can be done.

 

> Since the sprite is basically a user space app that loads the target app and communicates

> the debug information over the selected transport, the kernel needs to be already running.

 

The BDM hardware is in the processor, and your gdb and the BDM gdb-server run on the same host machine. You treat the kernel as a bare-metal embedded application. You cannot debug the user-space applications.

 

> Second you can't start the kernel until uboot has initialised the hardware, so the debugger would

> have to know how to do this as well.

 

BDM works from reset. If you do not touch the PC but just continue, the debugger will run the boot monitor as if it were not present at all. If this boot monitor loads a kernel, and the symbol table you have loaded into gdb matches that kernel, you can look at the kernel's source and inspect its variables.

 

> Just breaking into the kernel at some random place with no symbols is really not that useful.

 

Agreed, so you need to load the kernel symbols. GDB will not allow you to switch address spaces, so you cannot see what an application is doing. Also, kernel modules are a problem because GDB does not know the vma of a loaded module.

 

> My point is really this - why would you recommend that someone try a technique that you have no

> idea how to actually do it, or even if it is reasonably possible?

 

But I do.

 

> If I were posting I would not consider it ethical to do so, but that's just me. I have a good reason

> for this, mainly the OP will just waste time attempting to do such, to no avail. Second, when the

> OP asked me how to do that I would have no answer.

 

I have no idea what you are talking about or why.


> Again, I really hope someone can come up with a how-to on using a JTAG

> to trace uboot and or the kernel with symbols. That would be fantastic.

 

Do you mean BDM, or do tools exist that allow debugging over JTAG?


1,355 Views
dmarks_ls
Senior Contributor I

Well, I can guarantee you it's not the network.  I removed all network support from the kernel and the support packages, flashed the kernel and rootfs, rebooted, and got the same results (slow boot right around the time the rootfs is mounted).  I kept removing stuff from the kernel, and whaddaya know, I eventually got the thing to boot and run at "normal" speed.  I even re-enabled networking and did an NFS boot, and that booted normally as well.

 

Now, I haven't yet figured out the exact combination of circumstances, or the one magic option in the kernel, that stops the boot from bogging down.  I'm still working on that.  Now that I know that it's not the network, I'm back to doing NFS boots, which speeds up the process considerably.

 

However, I have uncovered a potential issue regarding silicon revision.  I created a separate thread for this, since it's a broad question, but I do wonder if the fact that my M54455EVB (rev D) has Rev 1 silicon on it is contributing to my problems.

 

I'll report back when I figure out exactly what it is that I have to disable to get a working boot.  At least I now know there is some combination that works.

1,355 Views
dmarks_ls
Senior Contributor I

OK, found the culprit, I think.  My initial instincts were correct: the target was in interrupt hell.  After realizing that /proc/interrupts very nicely summarizes all of the interrupt activity in the system, I took a few snapshots.  And the culprit is...

 

M5445X   182: 4075542054 pata_fsl

Apparently the PATA driver was going ape.  Yes, that's 4+ billion interrupts.  I checked again a minute or two later, and found that the count had rolled over to 0 and restarted.
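For the record, a crude way to gauge the rate (the counters look to be 32-bit, hence the rollover):

  # two samples 10 seconds apart; subtract the counts and divide by 10 for IRQs/sec
  $ grep pata_fsl /proc/interrupts ; sleep 10 ; grep pata_fsl /proc/interrupts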

 

So, I went digging into the kernel config, and found the selection for:

 <*> Serial ATA (prod) and Parallel ATA (experimental) drivers  --->

 

I turned that option off, recompiled, and rebooted... and for some reason, my NFS boot now failed; it couldn't mount the rootfs.  So I turned the option back on, recompiled, rebooted... and it runs fast and fine.  And if I cat /proc/interrupts now, the pata_fsl driver isn't even listed.  So no more interrupt hell, I guess.

 

To be absolutely sure I'd ID'ed the root cause, I wiped LTIB completely clean (including the /opt/freescale stuff), reinstalled it, and did a full rebuild... and something has now gone quite wrong in LTIB, as it refuses to build properly... apparently "microwindows" won't build (everything built fine prior to the wipe).  So I have to figure out what's gone wrong there.  But it appears the PATA driver was the culprit.

1,355 Views
dmarks_ls
Senior Contributor I

Yeah, microwindows was just refusing to build, so I turned that package off, and now I have a fast-booting kernel over NFS.  I'm fairly certain I'll be able to build a flash-based image and make that work, too.

 

So, in summary, here's the problem encountered, analysis, and workaround (really, that's about all I can call it at this point).

 

Problem: When building a new installation of LTIB 2010-09-19 for the M54455EVB and booting the resulting build, the kernel slows down by a factor of 100 or more when it reaches the point of mounting the root filesystem, whether from flash (JFFS2) or NFS.

 

Analysis: The PATA driver goes nuts, generating millions of interrupts per second and slowing the system to a crawl.

 

Workaround: Disable, then re-enable, the PATA driver in the kernel.  To accomplish this, the following steps must be followed (tested using an NFS-based build/rootfs):

  1. In the kernel menuconfig, under "Device drivers", disable the option "Serial ATA (prod) and Parallel ATA (experimental) drivers".  Rebuild.
  2. Boot the target.  Observe that the target kernel panics when trying to mount the rootfs, because it cannot open the root device "nfs".
  3. Re-enable the "Serial ATA / Parallel ATA" option.  Rebuild.
  4. Boot the target.  Observe that the PATA driver does not announce itself during kernel boot, and that the kernel boots at normal speed.  Also, "pata_fsl" no longer appears in /proc/interrupts (quick checks sketched below).
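The quick sanity checks from step 4, run on the target (grep exits non-zero on no match, hence the echo):

  $ dmesg | grep -i pata                        # should print nothing
  $ grep pata_fsl /proc/interrupts || echo "no pata_fsl interrupts"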

Don't ask me why, but this works.  This is starting with a fresh, complete reinstall of ltib-cflinux-20100919 for M54455EVB, built on Ubuntu 11.04 64-bit.  The only hack I had to make was to bin/Ltibutils.pm, so it would recognize that libm.so and libz.so are indeed present on the host system (in the 64-bit directories).

1,356 Views
JWW
Contributor V

drodgers,

 

Thanks for the detailed write-up.  I'll ask the apps team to take a look and see if there is something we need to tweak in the next maintenance release, which I believe is coming this fall.

 

-JWW

 

1,355 Views
dmarks_ls
Senior Contributor I

Well, it did finish coming up overnight and dropped me at the BusyBox prompt.  And I was able to run "top", which contended that 50-90% of CPU time was tied up in the system, as opposed to user processes.  It seems that everything still runs; it's just stupid slow.

 

This smacks of something being stuck in interrupt hell, servicing a gazillion interrupts per second, leaving very little time for actual work.

 

I'm not exactly sure how to run a debug pod in parallel with a U-Boot/kernel boot sequence; if I could, I'm sure it would be pretty easy to see where the CPU is spending most of its time.

 

I think what I'm gonna do is wipe and reinstall LTIB from the DVD, make only the choices necessary to select the M54455EVB (the M54451EVB is selected by default), build that, and see if the default setup still causes problems.  If so, I guess I'll have to start backing out packages until either it stops slowing down or there are none left.  In that case, I might have to start looking at the kernel configuration.

 

Dumb question... I assume that LTIB automatically applies all relevant patches to the kernel, i.e. there's nothing special I'm supposed to do to create a working kernel, is there?
