Intermittent Hangup on Custom Hardware

steveanderson · ‎05-04-2015

A new and interesting bug has landed on my lap, and I am fishing for ideas to trap it beyond those I list at the bottom of this note:

I THINK this is more MX6 than Android or Linux, but it is a 'fun' bug to try and trap - very elusive so far.

Unfortunately it occurs on custom hardware so it won't help anyone else to repro the problem.

I am told that the same versions, from the same software base, built for the SabreSD do not exhibit the problem - and I am taking that as an important clue.

The target system has an iMX6q and is using Android version 4.2.2:

- U-Boot 2009.08-00690-gee2c5a6

- Linux version 3.0.35-06147-g1b17dab

The problem is intermittent and related to playing media (either audio or video) in a repetitive loop.

It appears to happen when one iteration of the loop tries to start up before the media player has finished shutting down the last instance, but this is mostly speculation based upon the observations of others.

The symptom is pretty much that it stops dead, presenting a black screen - then the watchdog timer resets the system.

However, when I say intermittent, I mean that running a 1 second video or audio clip back to back it will take 2-8 hours to produce the bug by accident.

Therefore it is not something I can just sit there and watch; several hours is far beyond the span over which a typical person can watch effectively.

I have modified the watchdog driver to provide the watchdog pre-timeout ISR with a stack dump, and it all works well.

(tested by making a 30 second timeout with 10 second refresh and a 28-second pre-timeout interrupt).

I have that ISR call dump_stack() and for the purposes of development and testing it all works swimmingly well.

I further instrumented the ISR to indicate by LEDs when the ISR is triggered...

I then set the watchdog for 30-second timeout and 10-second refresh with a 5-second pre-timeout interrupt...
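The two watchdog configurations above differ only in when the pre-timeout interrupt fires relative to the refresh. A minimal sketch of that arithmetic follows. It assumes the i.MX WDOG counts in 0.5-second steps, so that the WCR[WT] field is 2 x timeout - 1 and the WICR[WICT] field is 2 x pre-timeout; those encodings should be checked against the reference manual. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: summarises one watchdog configuration. */
struct wdog_plan {
    uint8_t  wt;                 /* assumed WCR[WT] encoding: 0.5 s steps, 0 = 0.5 s   */
    uint8_t  wict;               /* assumed WICR[WICT] encoding: 0.5 s steps, 0 = 0 s  */
    unsigned isr_after_s;        /* seconds after a refresh at which the ISR fires     */
    bool     fires_when_healthy; /* true if the ISR fires even with on-time refreshes  */
};

/* timeout_s: watchdog timeout, pre_s: pre-timeout lead, refresh_s: refresh period.
 * Assumes 1 <= timeout_s <= 128 and pre_s < timeout_s. */
struct wdog_plan wdog_plan_make(unsigned timeout_s, unsigned pre_s, unsigned refresh_s)
{
    struct wdog_plan p;

    p.wt = (uint8_t)(timeout_s * 2u - 1u);
    p.wict = (uint8_t)(pre_s * 2u);
    /* The interrupt is raised pre_s seconds before the timeout would expire. */
    p.isr_after_s = timeout_s - pre_s;
    /* If that moment comes before the next refresh, the ISR fires on every cycle. */
    p.fires_when_healthy = p.isr_after_s < refresh_s;
    return p;
}
```

With (30, 28, 10) the interrupt fires 2 seconds after each refresh, so the ISR runs on every cycle, which is what makes it a useful self-test. With (30, 5, 10) it fires 25 seconds after the last refresh, so it can only run once refreshes have stopped.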

However, when the bug manifests itself, there is no stack dump... and no LED showing the ISR is even called.

Since there is also no illegal instruction trap, or access violation trap, or anything like that, I do conclude that the system is not running off the rails...

Since the pre-timeout ISR provides no stack dump, and the ISR is apparently not called, I also conclude that interrupts are disabled at the time of the failure...

I don't think an emulator will help me much, a reset would kill the register values I need - and the attention span thing becomes a problem to manually trap it...

Any emulator suggestions, am I missing an obvious idea here?

I am just beginning to explore making the interrupt an FIQ instead of an IRQ.

In this case the FIQ is for its non-maskable property - speed is not an issue, I know the system is about to die, and whole seconds have gone by with no activity...

Any pointers or examples? Any potential problems I am missing calling dump_stack() during an FIQ?

One idea suggested was to instrument the player code around startup and shutdown, to see which drivers may be invoked but not returned from...

Am I missing some other relatively obvious course of action?

-- no content change, just trying to get rid of the assumed answered, which it wasn't...

steveanderson · ‎05-15-2015

Does anyone have any guidance for implementing an FIQ in 3.0.35 on an iMX6q?

(Please look over my original post)

Synopsis: I am having a hangup with media (audio only or audio/video) in which interrupts do not work.

The hangup is on custom hardware using an iMX6q and Kernel 3.0.35 for Android.

We get a watchdog reset, but the pre-timeout interrupt is not hit when the bug manifests (hence the conclusion that interrupts are disabled)

I am wanting to use an FIQ instead of IRQ for the watchdog pre-timeout interrupt, so it is not maskable.

My Quick-and-Dirty FIQ simply has an invalid instruction, which should cause a panic and stack dump by way of the invalid instruction trap.

However, making use of FIQ is not well documented anywhere I have found.

I am pretty much coming up with nothing in my searches for examples, and am starting into the ARM documentation.

The value CONFIG_FIQ is active in my .config file, but calling mxc_set_irq_fiq causes a kernel panic.

This goes into arch/arm/plat-mxc/irq-common.c

The problem here is that chip->set_irq_fiq is not NULL, but is also not valid.

This element is part of struct mxc_irq_chip in irq-common.h of the same folder.

This structure is only used, it seems, in avic.c or tzic.c.

Now AVIC and TZIC are not well described anywhere that I have found.

TZIC is Trust Zone Interrupt Controller

AVIC is (apparently) Advanced Virtual Interrupt Controller

But that is about all I have found so far - MXC_TZIC is part of the iMX5 config, but (?) not relevant for iMX6.

CONFIG_ARM_GIC is also activated in my .config file.

It seems that if FIQ is active and TZIC is not then AVIC should be automatically selected?

Does anyone have some helpful guidance for this?

igorpadykov · ‎05-16-2015

for fiq example one can look at

i.MX6 AVB Demo Implementation

also kernel 3.0.35 is quite old and already removed from i.MX6

product page

i.MX6Q|i.MX 6Quad Processors|Quad Core|Freescale

it is recommended to use latest L3.14.28_1.0.0_iMX6QDLS_BUNDLE

as it significantly improves system stability compared with 3.0.35

and includes all latest patches, which may not be included in 3.0.35

~igor

steveanderson · ‎05-20-2015

I have identified what seem to be the relevant elements of this example, and reviewed them against the relevant ARM GIC documentation.

I observe that, by default, Linux runs with security options inactive.

I observe that, by default, all interrupts are group 0, and that Linux uses the IRQ signal for group 0 interrupts.

The first issue, then, is that only group 0 interrupts can signal the FIQ handler.

From the code example, patch 1 does most of the key modifications in the GIC driver code.

As a first step and sanity check, I have chosen to do the following:

Reassign all interrupts as group 1 and only enable group 1 interrupts.

I am not yet using any FIQ, nor enabling group 0 as either IRQ or FIQ.

This change is implemented entirely within the GIC module in arch/arm/common/gic.c (details below)

There is a register array base also for which the offset is added as a define macro in arch/arm/include/asm/hardware/gic.h

My expectation would be that this change is transparent, and everything in the Kernel world should be oblivious to this change.

Unfortunately, this is not the case... Relatively early in the boot process, just after the GPT interrupt is enabled, everything stops...

code fragments thus far:

in gic.h I added:

#define GIC_DIST_GROUP_SET 0x080

-- Note that the code example used GIC_DIST_SECURE_SET, but I found that to be a misnomer.

Since these bits change what the ARM documentation refers to as the 'interrupt group' GIC_DIST_GROUP_SET made more sense.

in gic_dist_init() the following is added:

/* SDA-SQS - Set all irq to Group 1 mode
** By default, since no FIQ is used, all can be group 0 since all are the same anyway...
** Wanting to use an FIQ we will make all group 0 interrupts use FIQ as part of this test...
** To do that we first need to make all the 'normal' interrupts use group 1...
*/
for (i = 0; i < gic_irqs; i += 32)
	writel_relaxed(0xffffffff, base + GIC_DIST_GROUP_SET + i * 4 / 32);

-- Note that this is added just after initialization of the GIC_DIST_ENABLE_CLEAR registers
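The loop above writes one 32-bit group register per 32 interrupt IDs. The same indexing can be used to find the single register and bit for one interrupt, for example to move just that interrupt back to group 0 later. This is a stand-alone sketch of the arithmetic only, not the driver code; the function names are hypothetical and interrupt ID 112 in the note below is used purely as an example value.

```c
#include <stdint.h>

#define GIC_DIST_GROUP_SET 0x080u   /* distributor offset, as defined in the post */

/* Byte offset (from the distributor base) of the group register holding `irq`.
 * For irq values that are multiples of 32 this equals the loop's `i * 4 / 32`. */
uint32_t gic_group_reg_offset(uint32_t irq)
{
    return GIC_DIST_GROUP_SET + (irq / 32u) * 4u;
}

/* Mask selecting the bit for `irq` inside that register (1 = group 1). */
uint32_t gic_group_bit(uint32_t irq)
{
    return 1u << (irq % 32u);
}

/* Given the current register value, return the value that puts `irq`
 * in group 0 and leaves the other 31 interrupts unchanged. */
uint32_t gic_group_clear(uint32_t reg_value, uint32_t irq)
{
    return reg_value & ~gic_group_bit(irq);
}
```

For ID 112 this gives offset 0x08C and bit 16, so a register holding 0xffffffff would be rewritten as 0xfffeffff.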

and at the end of that procedure:

writel_relaxed(2, base + GIC_DIST_CTRL);

-- Note that the original code wrote '1' which is enable for group 0, while '2' is enable for group 1 interrupts.

in gic_cpu_init() the following is changed:

writel_relaxed(2, base + GIC_CPU_CTRL);

-- Note that the original code wrote '1' which is enable for group 0, while '2' is enable for group 1 interrupts.

-- Also since I am not yet using FIQ I have not set any of the other bits in this control register.

All other things being equal, it seems this should be a transparent set of changes.

Have I missed something obvious?
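For reference, the values written to the CPU interface control register can be composed from named bits rather than bare numbers. The sketch below uses the bit layout of the secure view of that register as described in the ARM GIC architecture documentation (EnableGrp0, EnableGrp1, AckCtl, FIQEn); the macro and function names are hypothetical. One possibility worth checking against that documentation: with AckCtl clear, a secure read of the interrupt acknowledge register may return the spurious ID 1022 for a pending group 1 interrupt, which would be consistent with a stall soon after the first interrupt is enabled. That is an assumption, not a confirmed diagnosis.

```c
#include <stdint.h>

/* Hypothetical names for the CPU interface control bits (secure view). */
#define GICC_CTLR_ENABLE_GRP0 (1u << 0)  /* forward group 0 interrupts            */
#define GICC_CTLR_ENABLE_GRP1 (1u << 1)  /* forward group 1 interrupts            */
#define GICC_CTLR_ACKCTL      (1u << 2)  /* secure IAR read may ack group 1       */
#define GICC_CTLR_FIQEN       (1u << 3)  /* signal group 0 as FIQ instead of IRQ  */

/* Build a control value from four yes/no choices (non-zero means "set"). */
uint32_t gicc_ctlr_value(int grp0, int grp1, int ackctl, int fiqen)
{
    uint32_t v = 0;

    if (grp0)
        v |= GICC_CTLR_ENABLE_GRP0;
    if (grp1)
        v |= GICC_CTLR_ENABLE_GRP1;
    if (ackctl)
        v |= GICC_CTLR_ACKCTL;
    if (fiqen)
        v |= GICC_CTLR_FIQEN;
    return v;
}
```

The value 2 used above corresponds to gicc_ctlr_value(0, 1, 0, 0). Enabling both groups with group 0 routed to FIQ gives 0xB, and adding AckCtl gives 0xF.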

steveanderson · ‎05-29-2015

Update - I have been able to update the GIC support software to allow what is implied by CONFIG_FIQ being active, and I have implemented a simple FIQ support to help trap the pernicious bug which started this discussion.

I am just a little alarmed, though, that as the system is currently written, using an FIQ is simply impossible by design.

The example given above is a fair example of a patch to allow one and only one specific interrupt source to signal an FIQ instead of an IRQ.

I have written a more general solution because I can easily think of other places where I might want to use FIQ instead of IRQ.

Once I clean all the debugging junk from my code I will share the changes in the event someone may find it useful.

This still only applies to the 3.0.35 Kernel, and is only tested in that kernel for iMX6 and patched for Android.

However, this example did lead almost directly to the information I actually needed to resolve the FIQ implementation.

The modified system is running now to try and reproduce the original failure, which the FIQ should be able to trap - under the presumption that, while IRQs are clearly disabled when the hang-up bug manifests, FIQs are usually not disabled when IRQs are. That was more work than should have been necessary for what seems to me, as an engineer, to be a feature the OS should support. Instead it appears to be supported while being practically designed against - which is the worst of both worlds.

steveanderson · ‎06-01-2015

I am now simply stumped... Over the weekend I managed to get my bug to surface with the modified Kernel...

I have configured the watchdog pre-timeout interrupt to use an FIQ, and I have tested it by making the interrupt happen to make sure I get the stack and context dump I want in the 'real' failure mode... That all works...

However, when the failure happens - even with the FIQ in the design - there is just a cessation of activity, no FIQ.

Each time the watchdog is refreshed I check the CPSR and both IRQ and FIQ are always enabled.
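The CPSR check described here comes down to two mask bits: bit 7 (I) masks IRQs and bit 6 (F) masks FIQs, and a set bit means the interrupt is disabled. A minimal sketch of decoding a saved CPSR value follows. The bit macros mirror the kernel's PSR_I_BIT and PSR_F_BIT values; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PSR_F_BIT (1u << 6)  /* FIQ mask bit: set = FIQs disabled */
#define PSR_I_BIT (1u << 7)  /* IRQ mask bit: set = IRQs disabled */

/* True if the given CPSR value has IRQs masked. */
bool cpsr_irq_masked(uint32_t cpsr)
{
    return (cpsr & PSR_I_BIT) != 0;
}

/* True if the given CPSR value has FIQs masked. */
bool cpsr_fiq_masked(uint32_t cpsr)
{
    return (cpsr & PSR_F_BIT) != 0;
}
```

Note that the CPSR is per core, and that a check made at refresh time samples the state only while the refresh path is still running. On a quad-core part it says nothing about the mask bits on the other cores, nor about the state at the moment of the hang.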

This is very confusing to me. FIQ are only disabled in a very small number of places - and only on the way to a shutdown or reset...

I stubbed the macro to disable FIQ with a note to the log and a stack dump.

I also added code to trap if something tries to mask my FIQ in the GIC.

Neither of these things is happening, and yet the pre-timeout FIQ is somehow not happening.

So:

1) Does anyone have an idea what might be going on here?

2) Any good guidance for getting an emulator hooked up for Kernel Debugging?

jingyuzhou · ‎10-06-2019

Hi Steve,

I am debugging a watchdog timeout reboot issue on i.mx6 (possibly caused by some hardware lockup), and I would like to try out your idea to use FIQ to trigger the stack dump on the lockup CPU.

Did you test, with your FIQ enabled, whether it can actually serve as a non-maskable interrupt? Say, set your WDT FIQ affinity to CPU1 and then disable IRQs on CPU1 to see if the WDT interrupt still comes up?

Do you mind sharing your patch so that I can continue trying out this idea?

matej_kupljen · ‎09-10-2019

Steve,

I am also trying to use FIQ on iMX6 platform, but my FIQ handler is never executed.

Basically I described my problem here:

SSI-AC97 and WM9712 integration

I have not added any special code by myself, I just use what is already enabled. And the driver for SSI in AC97 mode has an option to turn on the stream filter, which uses FIQ handler for this.

If I change the code to use IRQ handler instead, I can see that isr is called, however FIQ is not.

I tracked the problem down to arch/arm/mach-imx/irq-common.c in function:

int mxc_set_irq_fiq(unsigned int irq, unsigned int type)

Basically I can see that exirq->set_irq_fiq is not set and I believe that is why I cannot set this to be an FIQ handler.

I am using Linux kernel 4.14.39. I know you used an older kernel, but can you share the changes you made to enable FIQ?

Thanks and BR,

Matej

steveanderson · ‎09-10-2019

You’re waking the old neurons now.

I didn’t so much port the code, as review the differences and patch it.

The original e-mail still applies though, even if I dig out my files for that client, I don’t recall how far I got in the end.

I do recall dropping the FIQ before I was happy with it, and before it really helped me trap the problem.

The principal problem on that board was DRAM out of cal. FIQ would not have helped me with that.

They still had intermittent problems, but were happy with much lower frequency of problem.

Still happy to help shake out ideas.

Steve

savedstesen · ‎01-30-2023

I'm having a problem similar to yours and I wonder - How did you get to the conclusion that it was the DRAM out of calibration?

matej_kupljen · ‎09-11-2019

Hi Steve,

thank you for the answer, even if it was a long time ago.

Like you said in the e-mail, the problem is that on the iMX6 platform, the code for enabling FIQs is not available, while on iMX27 and iMX35 it is.

I tracked this down to arch/arm/mach-imx/irq-common.c to function mxc_set_irq_fiq, where the call to irq_get_chip_data returns NULL. This function then enables FIQ for specified IRQ number. The code for iMX27/35 is in arch/arm/mach-imx/avic.c.

I'll try to add the code for iMX6, but I need to read (a lot) of documentation for interrupt controller in iMX6.

If I find anything, I'll post the solution here.

Thank you again for answering an old thread.

BR,

Matej

JFIDAHO · ‎04-19-2021

I am working on implementing the FIQ for an embedded system. I have been all through the GIC v2 Arch Manual, have all the registers (GROUP, PRIORITY, CPU TARGET) set up and nominally working. I have a test IRQ that works,

but when I actually activate it as FIQ, the system hangs. I presume that's because I have no way to assign it to the FIQ, so the system doesn't know where to route the FIQ interrupt. Because this is an embedded system, and the FIQ IRQ needs no system services, I am happy to re-route the FIQ exception vector to my routine. (I did this on the iMX35). But I can't over write that vector (0xFFFF001C), it generates a page fault.

How can I write to the exception vector table to assign my own FIQ handler?

Thanks.
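For reference, the address quoted above is the FIQ slot of the high exception vector table: eight 4-byte entries starting at 0xFFFF0000. The sketch below only reproduces that address arithmetic; the enum and function names are hypothetical. Mainline ARM Linux also has helpers in arch/arm/kernel/fiq.c (claim_fiq(), set_fiq_handler()) meant for installing an FIQ handler without writing the vector page directly. Whether they are present and usable in a particular BSP is an assumption to verify.

```c
#include <stdint.h>

#define VECTORS_HIGH_BASE 0xFFFF0000u  /* high-vectors base address */

/* ARM exception vector slots in table order; each slot is one 4-byte instruction. */
enum arm_vector {
    VEC_RESET = 0,
    VEC_UNDEF,
    VEC_SVC,
    VEC_PREFETCH_ABORT,
    VEC_DATA_ABORT,
    VEC_RESERVED,
    VEC_IRQ,
    VEC_FIQ
};

/* Address of the given vector slot when high vectors are in use. */
uint32_t arm_vector_address(enum arm_vector v)
{
    return VECTORS_HIGH_BASE + (uint32_t)v * 4u;
}
```

The FIQ slot is last in the table, which is why an FIQ handler can be placed directly at that address instead of branching elsewhere.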

steveanderson · ‎05-18-2015

It will take some time to look more deeply into the referenced topic and analyze it for the parts useful to me. Thanks. I had actually looked briefly into this article earlier. I dismissed it because it was a very specific patch to set up a very specific FIQ, and while it does still seem to be that, I may have been too hasty to dismiss that example in the search for code to allow using ANY IRQ as an FIQ if that was desired... I have a strong instinct to look for code that can be applied beyond the special case - but the fact is that just now I have only a special case anyway.

I also understand about the somewhat aged version...

For this project I am constrained to looking to fix the deployed version of the software.

There is another project where someone is porting a newer version of Android.

On yet different projects I have ported 3.10.17 and am working on 3.14.28 for Linux platforms on the same base hardware set.

An aside to this - the BSP for 3.10.17 was my first work in the professional world with the Yocto build environment. In that project I ported the very same 3.0.35 Kernel adaptations into the BSP for the Daisy release. The greatest complexity I ran into was the pinmux settings for the Device Tree, since 3.0.35 did not use a device tree.

Starting to use the BSP released in April has been no walk in the park. Someone 'cleaned up' what they considered 'unused' pinmux settings in the device tree. My response to this is that "Of course they are unused in the basic Kernel!" The basic Kernel is developed for marketing test platforms which are not necessarily representative of the real world. But I can't just use the old, working, device tree... Now, looking over the changes on the whole, I personally agree with the structural changes to the DT Sources, and target hardware projects should not depend on generic pinmux settings (even though, porting 3.10.17, I used what was available, or at least that which worked for me) because the fact is that any combination is possible in specific target hardware, and every combination cannot possibly be provided - and should not be provided - in the generic distribution. This change forced me to go through the schematics, and the applications of the custom Module I am working with, and define RELEVANT pinmux settings for current designs using the Module.

Further aside: I have noticed 'out there' comments about how 'almost nothing in the Kernel uses FIQ anymore - we should remove it from the code.'

It is a GOOD thing that not much in the Kernel uses the FIQ. The current design seems to say 'There can be ONLY ONE' about the FIQ, and that being the case, if anyone wants to design custom hardware, and has a need for the FIQ, it is there, possibly supported in the Kernel, and available for their use...

Just because the BSP does not need it does not in any way mean that the target audience does not need it... Nor that the BSP should not provide a way to use it... Imagine if the Chip manufacturers decided to remove non-maskable interrupts since only a few customers need it... Then they guarantee the loss of those customers, and possibly future customers... This is a marketing decision, of course, and sometimes it happens.

One reason people continue to use older versions of the Kernel is that someone changed APIs, or 'cleaned up' code that a specific target application depended on. Companies may go to great lengths to avoid the expense of porting a new Kernel and debugging all the consequences. Sure it has the latest patches and bug fixes... But in the end, it is swapping a well known set of bugs (which probably don't affect their current release) for an unknown set of bugs which might not be so benign to their end product.

But back to the real topic at hand:

I am certainly marking that as a helpful answer. If it helps me to solve my issue I will also mark it as the correct answer and share the relevant changes and notes -- so that someone else might find something to lead to their own answer.

igorpadykov · ‎05-05-2015

Hi Steve

this may be caused by memory fragmentation, one can look

at attached file sect.6.3 Known issues and limitations for multimedia

for video playback issues. Also may be useful below

Long running vpu task with memory leak bug on imx6

Best regards

igor


steveanderson · ‎05-15-2015

I am not sure what happened to my reply from before I went on vacation... I have been on vacation for a week, and somebody decided to mark this topic as assumed answered, and that is very far from the truth. I don't know why some things get marked presumed answered while others go for months without being marked that way, but it is one of the most frustrating aspects of this community. I promise I will let you know if you have led me to the correct answer, or even if you may have accidentally answered the issue directly and succinctly - which is rare, but I believe possible...

The information in the given references, while well intentioned, does not appear to apply to the problem I am exploring.

1) The problems reported in those references all have a kernel panic and stack dump.

--- If I had a stack dump I would know where my system was hanging up.

2) The problems referenced have to do with the VPU, but my problem occurs with Audio Only media.

--- If you can explain how a VPU memory leak leads to an audio-only playback deadlock problem of some kind I will look more closely.

3) I am not able to get a stack dump given the procedure I outlined above.

My conclusion is that interrupts are disabled when the deadlock is happening.

This conclusion is based on the fact that if the execution was 'off in the weeds' there would be access violations or some other traps, leading to kernel panic and stack dumps. Contrary to that indication, everything appears to think it is under control...

So, what I need is some expert advice on how to trap this problem, and let me offer some thoughts I have already considered which any expert is welcome to critique in the event I missed something...

Use of an emulator to debug the Kernel in this case is not likely to be revealing. The problem is of such a low frequency that human attention is challenged for manually trapping this problem as it happens. I could trap on the watchdog reset, but by that point critical registers are already modified by the reset and I am unlikely to find where I was stuck before the reset.

What does appear promising is using an FIQ to trap the bug. Since an FIQ is not maskable, if I made the watchdog timer interrupt an FIQ instead of an IRQ, then I would get my interrupt and have a chance to trap the bug... My problem here is that I am finding pretty much ZERO helpful documentation for how to make use of the FIQ functionality in 3.0.35 -- and I am certainly open to some expert guidance.

Thanks.

Note - By saying thanks here I am simply being polite and acknowledging the helpful intent of any response, not that it actually helped lead me to a solution.

Once again, I promise to give credit where it is due, when it is due.

