The issues on "mmc0: cqhci: timeout for tag 0"

jack-cap
Contributor III

Hi NXP team,

We have recently run into a rather troublesome problem.

After flashing the Android P9.0.0_2.1.1_AUTO image, the system boots up.

When we run adb shell and then execute the "su" command in that shell, it hangs for a long time and prints the following:

[ 83.014933] mmc0: cqhci: timeout for tag 0
[ 83.019053] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
[ 83.025498] mmc0: cqhci: Caps: 0x0000310a | Version: 0x00000510
[ 83.031944] mmc0: cqhci: Config: 0x00001001 | Control: 0x00000000
[ 83.038391] mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
[ 83.044834] mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
[ 83.051283] mmc0: cqhci: TDL base: 0xffffc000 | TDL up32: 0x00000000
[ 83.057729] mmc0: cqhci: Doorbell: 0xffb1ffff | TCN: 0x00000000
[ 83.064175] mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00400000
[ 83.070619] mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00011000
[ 83.077067] mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000800
[ 83.083511] mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
[ 83.089959] mmc0: cqhci: Resp idx: 0x0000000d | Resp arg: 0x00000000
[ 83.096405] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[ 83.102853] mmc0: sdhci: Sys addr: 0xff4e0000 | Version: 0x00000002
[ 83.109295] mmc0: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000008
[ 83.115745] mmc0: sdhci: Argument: 0x00018000 | Trn mode: 0x00000033
[ 83.122188] mmc0: sdhci: Present: 0x01fd8008 | Host ctl: 0x00000030
[ 83.128634] mmc0: sdhci: Power: 0x00000002 | Blk gap: 0x00000080
[ 83.135080] mmc0: sdhci: Wake-up: 0x00000008 | Clock: 0x0000000f
[ 83.141527] mmc0: sdhci: Timeout: 0x0000008f | Int stat: 0x00000000
[ 83.147972] mmc0: sdhci: Int enab: 0x107f4000 | Sig enab: 0x107f4000
[ 83.154418] mmc0: sdhci: AC12 err: 0x00000000 | Slot int: 0x00000502
[ 83.160867] mmc0: sdhci: Caps: 0x07eb0000 | Caps_1: 0x8000b407
[ 83.167311] mmc0: sdhci: Cmd: 0x00000d1a | Max curr: 0x00ffffff
[ 83.173759] mmc0: sdhci: Resp[0]: 0x00000000 | Resp[1]: 0xffc003ff
[ 83.180203] mmc0: sdhci: Resp[2]: 0x328f5903 | Resp[3]: 0x00d07f01
[ 83.186648] mmc0: sdhci: Host ctl2: 0x00000088
[ 83.191098] mmc0: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0xffff5408
[ 83.197542] mmc0: sdhci: ============================================
[ 83.205431] mmc0: running CQE recovery

Sometimes opening the camera app triggers the same issue. Do you have any suggestions?

Hoping for your reply.

Additional information:

Software platform: Android P9.0.0_2.1.1_AUTO (4.14.98 kernel)

Hardware platform: i.MX 8QuadMax for PMIC 1.7v

Best regards,

Thanks
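
For anyone reading the register dump above: a minimal sketch (not from NXP) of decoding the per-tag bitmask registers. It assumes the usual CQHCI meaning, where each bit of "Doorbell" and "Dev Pend" corresponds to one of the 32 task slots (tags). The helper name set_bits is made up for illustration, and the values are copied from the dump.

```python
def set_bits(value, width=32):
    """Return the bit positions that are set in a register value."""
    return [bit for bit in range(width) if value & (1 << bit)]


# Register values copied from the first dump in this thread.
doorbell = 0xFFB1FFFF  # "Doorbell": assumed one bit per task slot (tag)
dev_pend = 0x00400000  # "Dev Pend": assumed tasks queued in the device

busy_tags = set_bits(doorbell)
idle_tags = [tag for tag in range(32) if tag not in busy_tags]

print("doorbell bits set:  ", len(busy_tags))
print("doorbell bits clear:", idle_tags)
print("device-pending tags:", set_bits(dev_pend))
```

The same helper can be applied to "Dev queue" and "TCN" when comparing dumps from different boards; it only lists bits and makes no claim about which tag actually stalled.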

martinetd
Contributor IV

Hi all -- forgive me for highlighting you as it's been a while -- @jack-cap @edcloudcycle @philippe_schenk @brandon_shibley @quentin_schulz

We have the same issue here, and it's not fixed in later kernels (we tried 5.4.70-2.3.2 and 5.10.35-2.0.0).

I've debugged it a bit and honestly it looks like an MMC firmware bug that I'd like to report to Micron: basically, the MMC stops responding after a flush from time to time for us.

It'd be a great help if you could tell me what brand/model of MMC you use, so we can confirm it's related, as I can't reproduce it at all with the Kingston MMC in the EVK.

The performance hit of disabling cqhci is pretty big, so I'd rather it didn't come to that... Linux properly requeues failing requests and it doesn't fail too often for us, so we can probably live with the bug, but it's quite annoying.

edcloudcycle
Contributor III

Hi @martinetd,

Thanks for following up on this. I think the feature is disabled in the latest NXP BSP, so we are not seeing the error anymore.

We are using a SOM: nxp-imx-8m-mini-nano. The marking on the device is OPA2D JZ133, with a logo of an M with an ellipse like an orbit around it. @brandon_shibley will know exactly.

Thanks

Ed

brandon_shibley
Contributor II

edcloudcycle wrote: "We are using a SOM: nxp-imx-8m-mini-nano , the marking on the device is OPA2D JZ133 with a logo of an M with an ellipse like an orbit around it. @brandon_shibley will know exactly."

A Micron MTFC series eMMC is used on the Toradex Verdin iMX8M Mini.  We disabled cqhci in our BSP to resolve the instability.

martinetd
Contributor IV

Thanks! We're using the same series of eMMC from Micron, so that's exactly what I wanted to hear; it could be that we just have the same kind of hardware bug, but if only boards with that model have the problem I think we're on the right track.

I contacted them yesterday and will follow up if we make progress.

Regarding the feature being disabled, I can't see anything in the git logs of either NXP's 5.4.70_2.3.2 or 5.10.35-2.0.0 -- so I assume it's a Toradex patch...

Right, it looks like http://git.toradex.com/cgit/linux-toradex.git/commit/?h=toradex_5.4-2.3.x-imx&id=fd33531be843566c59a... -- that isn't in NXP's tree.

martinetd
Contributor IV

Just to follow up on this:

- Micron has been able to reproduce the problem internally, but hasn't provided us with any fix yet. At least the problem might get fixed in future versions.

- Meanwhile, upstream Linux recently disabled CQHCI support (in 5.16-rc3), and this has already been backported to 5.10.83; a similar patch from someone else is also included in lf-5.10.72-2.2.0... so it's all been swept under the rug and the problem should not show up again for anyone right now.

I'm hoping the next eMMCs we get don't show the problem and CQHCI can be added back with just a few models blacklisted as quirks, as I feel the performance difference is noticeable enough, but I guess time will tell...

ChenJun945
Contributor III

When will CQHCI be re-enabled? Has Micron solved the problem? If not, when will it be solved? For high performance, CQHCI should be enabled...

martinetd
Contributor IV

As far as I know, only this particular MMC is affected. We've tried a newer version (also a Micron chip) and CQHCI works well there, so we've re-enabled CQHCI at the host level and added a quirk for the MMC in our kernel:

https://github.com/atmark-techno/linux-5.10-at/commit/407e28eb648e8605d44fc53cf02c17984ff224ed

Unfortunately there's no way of listing all the MMCs that have the problem, so each vendor should take the time to check for themselves whether their MMC behaves correctly...
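
To illustrate the idea of a per-card quirk (this is not the code from the commit linked above, which is a kernel C patch): a minimal sketch of a deny-list lookup keyed on the card's manufacturer ID and product name, which is roughly how card quirk tables are matched. The names CardId, NO_CMDQ_QUIRKS and cmdq_allowed are hypothetical, the "EXAMPLE*" product name is a placeholder rather than a real affected part, and 0x13 is assumed to be Micron's manufacturer ID, so check it against your own card before relying on it.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class CardId:
    manfid: int  # manufacturer ID from the card's CID register
    name: str    # product name from the card's CID register


# Hypothetical deny-list. The name pattern is a placeholder, not a real part.
NO_CMDQ_QUIRKS = [
    (0x13, "EXAMPLE*"),  # 0x13: assumed to be Micron's manufacturer ID
]


def cmdq_allowed(card, quirks=NO_CMDQ_QUIRKS):
    """Return False when the card matches a deny-list entry, else True."""
    for manfid, name_pattern in quirks:
        if card.manfid == manfid and fnmatchcase(card.name, name_pattern):
            return False
    return True


print(cmdq_allowed(CardId(0x13, "EXAMPLE1")))  # matches the placeholder entry
print(cmdq_allowed(CardId(0x13, "OTHER")))     # same vendor, different part
print(cmdq_allowed(CardId(0x70, "EXAMPLE1")))  # different vendor
```

The point of matching on both fields is that command queuing stays enabled for every other card from the same vendor, which fits the observation that a newer Micron part works fine.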

edcloudcycle
Contributor III

Also, do you know what gets disabled? Is it some power saving feature?

Thanks

Ed

philippe_schenk
Contributor IV

It's the CQHCI feature, i.e. the command queuing present on newer eMMC chips. I once did some benchmarking on our Apalis iMX8, and it showed that for transfers <= 32k the speed increase with CQHCI is 60-80%.

But that doesn't gain you anything if it is not stable.

edcloudcycle
Contributor III

Thanks Philippe, that is trickier for us as I am trying to use vanilla Torizon. Did you have any luck getting them to include this change in their build?

Tagging: @andrecurvello-tx 

 

philippe_schenk
Contributor IV

TorizonCore is based on the regular Toradex BSP. That means the change is now also in TorizonCore.

However, unfortunately the issue is still in the Torizon 5.1.0 quarterly release.
The TorizonCore 5.2.0 April monthly release has the fix in for sure.

Best Regards,
Philippe

edcloudcycle
Contributor III

 

Thanks @philippe_schenk, I will follow up with the Torizon team; we are trying the new quarterly with the new BSP today. I noticed @brandon_shibley on this thread. Maybe they have done some testing of this?

quentin_schulz
Contributor I

Any update on this?

We have the same issue (for other tags, though that probably does not matter) on 5.4.70_2.3.0 on an i.MX8MM.

Best regards,
Quentin

brandon_shibley
Contributor II

Hi @joanxie,

We're also looking for some clarification on this.  Is this addressed with 5.4.70_2.3.0 or will the issue be resolved in a later release?

philippe_schenk
Contributor IV

Hi @joanxie 

We again stumbled on this issue on the L5.4.24-2.1.0 release. We are hesitant to just disable this feature. So I ask again: what is the status of this issue? Can we track the progress on this somewhere?

Thanks and Best Regards,
Philippe Schenker

joanxie
NXP TechSupport

Please disable CQHCI and give it a try:

 

--- a/drivers/mmc/host/sdhci-esdhc-imx.c
+++ b/drivers/mmc/host/sdhci-esdhc-imx.c
@@ -246,7 +246,7 @@ struct esdhc_soc_data {
        .flags = ESDHC_FLAG_USDHC | ESDHC_FLAG_STD_TUNING
                        | ESDHC_FLAG_HAVE_CAP1 | ESDHC_FLAG_HS200
                        | ESDHC_FLAG_HS400 | ESDHC_FLAG_HS400_ES
-                       | ESDHC_FLAG_CQHCI
+               //      | ESDHC_FLAG_CQHCI
                        | ESDHC_FLAG_STATE_LOST_IN_LPMODE
                        | ESDHC_FLAG_CLK_RATE_LOST_IN_PM_RUNTIME,
 };
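
After booting a kernel with the flag removed, it can help to confirm from user space that command queuing is really off. The following is a minimal sketch, assuming the running kernel exposes a cmdq_en attribute in the card's sysfs directory (not every kernel version does) and that the card shows up as mmc0:0001; both the path and the helper name cmdq_enabled are assumptions to adapt to the board.

```python
from pathlib import Path


def cmdq_enabled(card_dir):
    """Return True/False from the card's cmdq_en attribute, or None if unreadable."""
    attr = Path(card_dir) / "cmdq_en"
    try:
        return attr.read_text().strip() == "1"
    except OSError:
        # Attribute missing (older kernel, different path) or not readable.
        return None


if __name__ == "__main__":
    # Assumed path; the card address after the colon differs between boards.
    print(cmdq_enabled("/sys/bus/mmc/devices/mmc0:0001"))
```

A result of None only means the attribute could not be read, so it says nothing either way about whether CQHCI is in use.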

philippe_schenk
Contributor IV

Hi joanxie

We are facing the same issue on our Verdin i.MX8MM module. Using a kernel with CQHCI disabled as you suggest seems to make things better.

What is the status of this issue? Can we track the progress on this somewhere? Is it maybe already solved in the new L5.4.24 release?

Best Regards,

Philippe

philippe_schenk
Contributor IV

Hello,

Is there any answer from NXP's side on this issue?

Best Regards,

Philippe Schenker

cc joanxie

joanxie
NXP TechSupport

It seems that the development team is still working on this. I suggest that you disable it. This issue will not cause a system hang, just some dump logs; our usdhc driver will then retry and pass, so it does not cause any real read/write operation issue.

leonlin1
Contributor II

Hi joanxie,

We encounter a similar error, shown below, and it causes a kernel panic followed by a system reboot. Our board is an i.MX8QP using the NXP Android 9.0.0_2.0.1 BSP; the Linux version is 4.14.98. Can the patch for this issue also be applied to the BSP we currently use? Thanks~

              

[ 9430.110443] mmc0: cqhci: timeout for tag 31
[ 9430.114641] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
[ 9430.121089] mmc0: cqhci: Caps:      0x0000310a | Version:  0x00000510
[ 9430.127534] mmc0: cqhci: Config:    0x00001001 | Control:  0x00000000
[ 9430.133979] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000006
[ 9430.140425] mmc0: cqhci: Int sig:   0x00000006 | Int Coal: 0x00000000
[ 9430.146872] mmc0: cqhci: TDL base:  0xffffb000 | TDL up32: 0x00000000
[ 9430.153318] mmc0: cqhci: Doorbell:  0x00000000 | TCN:      0x80000000
[ 9430.159764] mmc0: cqhci: Dev queue: 0x00000004 | Dev Pend: 0x00000000
[ 9430.166210] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00011000
[ 9430.172656] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000004
[ 9430.179102] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
[ 9430.185548] mmc0: cqhci: Resp idx:  0x00000000 | Resp arg: 0x00000004
[ 9430.191996] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[ 9430.198442] mmc0: sdhci: Sys addr:  0xf9bdc000 | Version:  0x00000002
[ 9430.204888] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000060
[ 9430.211333] mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000023
[ 9430.217780] mmc0: sdhci: Present:   0x01fd8008 | Host ctl: 0x00000030
[ 9430.224225] mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000080
[ 9430.230671] mmc0: sdhci: Wake-up:   0x00000008 | Clock:    0x0000000f
[ 9430.236603] healthd: battery l=85 v=3 t=35.0 h=2 st=2 c=400 fc=4000000 cc=32 chg=a
[ 9430.237125] mmc0: sdhci: Timeout:   0x0000008f | Int stat: 0x00000000
[ 9430.251140] mmc0: sdhci: Int enab:  0x107f4000 | Sig enab: 0x107f4000
[ 9430.257585] mmc0: sdhci: AC12 err:  0x00000000 | Slot int: 0x00000502
[ 9430.264032] mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x8000b407
[ 9430.270478] mmc0: sdhci: Cmd:       0x00000018 | Max curr: 0x00ffffff
[ 9430.276923] mmc0: sdhci: Resp[0]:   0x00000004 | Resp[1]:  0x3452395f
[ 9430.283369] mmc0: sdhci: Resp[2]:   0x42475546 | Resp[3]:  0x00150100
[ 9430.289815] mmc0: sdhci: Host ctl2: 0x00000008
[ 9430.294264] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xffff0808
[ 9430.300710] mmc0: sdhci: ============================================
[ 9430.307222] mmc0: running CQE recovery
[ 9430.313604] mmc0: running CQE recovery
[ 9430.317610] mmc0: running CQE recovery
[ 9430.321522] print_req_error: I/O error, dev mmcblk0, sector 17305144
[ 9430.327983] Aborting journal on device mmcblk0p12-8.
[ 9430.333101] mmc0: running CQE recovery
[ 9430.337311] mmc0: running CQE recovery
[ 9430.341650] mmc0: running CQE recovery
[ 9430.345738] print_req_error: I/O error, dev mmcblk0, sector 17285120
[ 9430.352390] Buffer I/O error on dev mmcblk0p12, logical block 1081344, lost sync page write
[ 9430.360888] JBD2: Error -5 detected when updating journal superblock for mmcblk0p12-8.
[ 9430.369573] mmc0: running CQE recovery
[ 9430.373844] mmc0: running CQE recovery
[ 9430.378008] mmc0: running CQE recovery
[ 9430.382169] print_req_error: I/O error, dev mmcblk0, sector 8634368
[ 9430.382496] healthd: battery l=85 v=3 t=35.0 h=2 st=2 c=400 fc=4000000 cc=32 chg=a
[ 9430.388537] Buffer I/O error on dev mmcblk0p12, logical block 0, lost sync page write
[ 9430.404020] EXT4-fs error (device mmcblk0p12): ext4_journal_check_start:61: Detected aborted journal
[ 9430.413235] EXT4-fs (mmcblk0p12): Remounting filesystem read-only
[ 9430.419385] EXT4-fs (mmcblk0p12): previous I/O error to superblock detected
[ 9430.426534] mmc0: running CQE recovery
[ 9430.430877] mmc0: running CQE recovery
[ 9430.435017] mmc0: running CQE recovery
[ 9430.439173] print_req_error: I/O error, dev mmcblk0, sector 8634368
[ 9430.445556] Buffer I/O error on dev mmcblk0p12, logical block 0, lost sync page write
[ 9430.453693] Kernel panic - not syncing: EXT4-fs panic from previous error
[ 9430.453693]
[ 9430.461974] CPU: 2 PID: 4550 Comm: .into.stability Not tainted 4.14.98-00003-gab881e3c854e-dirty #4
[ 9430.471027] Hardware name: Trimble GX10 (DT)
[ 9430.475304] Call trace:
[ 9430.477769] [<ffff00000808b2fc>] dump_backtrace+0x0/0x414
[ 9430.483179] [<ffff00000808b724>] show_stack+0x14/0x1c
[ 9430.488245] [<ffff000008fa178c>] dump_stack+0x90/0xb0
[ 9430.493310] [<ffff0000080db29c>] panic+0x140/0x2ac
[ 9430.498118] [<ffff000008384c50>] __ext4_abort+0x164/0x168
[ 9430.503524] [<ffff00000833715c>] ext4_journal_check_start+0x84/0x8c
[ 9430.509794] [<ffff000008337274>] __ext4_journal_start_sb+0x3c/0x158
[ 9430.516069] [<ffff000008359764>] ext4_dirty_inode+0x30/0x68
[ 9430.521647] [<ffff0000082d38cc>] __mark_inode_dirty+0x4c/0x474
[ 9430.527482] [<ffff0000083591a0>] ext4_setattr+0x3b4/0x948
[ 9430.532888] [<ffff0000082c0c4c>] notify_change2+0x388/0x3dc
[ 9430.538467] [<ffff00000829a108>] chmod_common+0xa4/0x128
[ 9430.543789] [<ffff00000829b720>] SyS_fchmodat+0x40/0xb8
[ 9430.549017] Exception stack(0xffff00000a8f3ec0 to 0xffff00000a8f4000)
[ 9430.555466] 3ec0: 00000000ffffff9c 00000000c0993b40 00000000000001b0 0000000000000000
[ 9430.563301] 3ee0: 00000000e8a13380 00000000ff90fcb0 00000000c0993b40 000000000000014d
[ 9430.571137] 3f00: 0000000000000000 00000000e89e5000 0000000012c8c660 0000000012f80010
[ 9430.578973] 3f20: 00000000000001b0 00000000ff90fc88 00000000e4865d6d 0000000000000000
[ 9430.586809] 3f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 9430.594645] 3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 9430.602481] 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 9430.610317] 3fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 9430.618155] 3fc0: 00000000eb4dcca0 00000000400e0010 00000000ffffff9c 000000000000014d
[ 9430.625990] 3fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 9430.633828] [<ffff000008083ac0>] el0_svc_naked+0x34/0x38
[ 9430.639151] SMP: stopping secondary CPUs
[ 9430.643080] Kernel Offset: disabled
[ 9430.646569] CPU features: 0x180200c
[ 9430.650058] Memory Limit: none
[ 9430.653126] Rebooting in 5 seconds..
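
To tell the benign case (a timeout, a recovery, then the retry succeeds) from the one in this log (recoveries followed by block-layer I/O errors), it can help to count the relevant messages in a captured kernel log. A minimal sketch, with the message patterns taken from the lines above; the names PATTERNS and summarize are made up for illustration, and the "Buffer I/O error" follow-up lines are deliberately not counted.

```python
import re
from collections import Counter

# Patterns based on the kernel messages quoted in this thread.
PATTERNS = {
    "timeout": re.compile(r"mmc\d+: cqhci: timeout for tag (\d+)"),
    "recovery": re.compile(r"mmc\d+: running CQE recovery"),
    "io_error": re.compile(r"I/O error, dev mmcblk\d+"),
}


def summarize(lines):
    """Count timeouts, recoveries and block I/O errors; collect the timed-out tags."""
    counts = Counter()
    tags = []
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[key] += 1
                if key == "timeout":
                    tags.append(int(match.group(1)))
    return counts, tags


sample = [
    "[ 9430.110443] mmc0: cqhci: timeout for tag 31",
    "[ 9430.307222] mmc0: running CQE recovery",
    "[ 9430.313604] mmc0: running CQE recovery",
    "[ 9430.321522] print_req_error: I/O error, dev mmcblk0, sector 17305144",
    "[ 9430.352390] Buffer I/O error on dev mmcblk0p12, logical block 1081344, lost sync page write",
]

counts, tags = summarize(sample)
print(dict(counts), tags)
```

A log where io_error stays at zero would match the "retry and pass" behaviour described earlier in the thread; any non-zero count means requests actually failed.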
