Hi, for some months we have had a problem with the T1024 CPU, and the issue is random, so it is difficult to dig deeper.
I know that this post is long but it's necessary...
Thank you in advance.
PROBLEM DESCRIPTION:
We use the T1024NSE7KQA CPU inside our industrial PLC. The CPU SerDes connects to a 4-port Ethernet phyter in 100BASE-T mode (more details below).
Sometimes at PLC startup, and only on some units, the SerDes doesn’t work and the Ethernet connection fails. At other times, on the same units, the SerDes works and the Ethernet connection has no issues (once the PLC is running, the SerDes link between CPU and phyter keeps working without issues).
The problem is determined only by the startup phase.
The CPU drives RESET_REQ_B: 1->0 after the startup sequence; as a consequence our on-board logic drives PORESET_B: 1->0 to the CPU and a CPU restart happens. After that restart, the CPU again drives RESET_REQ_B: 1->0, so the cycle repeats.
The CPU reset repeats forever, until we power down the device.
OTHER PROBLEMS (RELATED?)
For the sake of completeness some other units show different problems (but we don’t know if related to THIS problem):
CONFIGURATION DESCRIPTION:
The phyter is a Microsemi VSC8514XMK-11. The SerDes protocol is QSGMII. The only SerDes lane used is lane 0, and it’s configured as “qs.m1-4” (the others are PCI Express, not used).
The CPU RCW data are:
aa 55 aa 55 01 0e 01 00 //preamble row
08 10 00 0a 00 00 00 00 // RCW begin...
00 00 00 00 00 00 00 00
6a 80 00 03 00 40 00 12
fc 02 f0 00 21 00 20 00
00 00 00 00 00 00 00 00
00 00 00 00 00 03 2a 00
20 00 01 00 14 26 5a 00
00 00 00 00 00 00 00 06 // ...RCW end
08 13 80 40 18 f8 8a 01 // End command row
The CPU clocks are all generated with on-board clock generator (cod. CDCI6214RGET) as below:
The SerDes signals are AC-coupled.
During Ethernet initialization, the firmware touches only the SerDes registers listed below:
out_be32(& pSerDes->QSGMIICR1, 0); //set MDEV_PORT=0
[...]
Value = in_be32(& pSerDes->PLL2RSTCTL);
Value &= ~0x20;
out_be32(& pSerDes->PLL2RSTCTL, Value); // Disable PLL2
[...]
Value = in_be32(& pSerDes->LN1GCR0);
Value &= ~0x88000000;
Value |= 0x00180000;
out_be32(& pSerDes->LN1GCR0, Value); // Disable the Lane 1
Value = in_be32(& pSerDes->LN2GCR0);
Value &= ~0x88000000;
Value |= 0x00180000;
out_be32(& pSerDes->LN2GCR0, Value); // Disable the Lane 2
Value = in_be32(& pSerDes->LN3GCR0);
Value &= ~0x88000000;
Value |= 0x00180000;
out_be32(& pSerDes->LN3GCR0, Value); // Disable the Lane 3
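The three lane read-modify-write sequences above differ only in the register they target, so they could be folded into one helper. What follows is a minimal sketch, not the actual boot code: the masks are copied from the snippet above, the names `lane_gcr0_disable_value` and `disable_lane` are hypothetical, and a plain volatile pointer stands in for the in_be32()/out_be32() accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Masks copied from the register writes above; their field meaning
 * should be checked against the LNnGCR0 description in the CPU RM. */
#define LN_GCR0_CLEAR_MASK 0x88000000u
#define LN_GCR0_SET_MASK   0x00180000u

/* Pure helper: compute the value to write back to LNnGCR0. */
static uint32_t lane_gcr0_disable_value(uint32_t current)
{
    return (current & ~LN_GCR0_CLEAR_MASK) | LN_GCR0_SET_MASK;
}

/* Hypothetical wrapper: on the target this would go through
 * in_be32()/out_be32() instead of dereferencing the pointer. */
static void disable_lane(volatile uint32_t *ln_gcr0)
{
    assert(ln_gcr0 != NULL);
    *ln_gcr0 = lane_gcr0_disable_value(*ln_gcr0);
}
```

Keeping the value computation separate from the register access makes the bit manipulation checkable on a host PC without the hardware.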
WHEN:
The problem occurs intermittently, with a probability of 5% to 50% depending on the unit;
it occurs only if the PLC has been powered down for hours (1 h to 3 h, depending on the unit), so a power cycle always temporarily "resolves" the issue.
WHO: only some units (~30%) have the problem described above.
WHY: But why does the CPU assert RESET_REQ_B? Our latest findings are...
In our software we didn't check that SerDesx_PLL1RSTCTL[RST_DONE]=1 before proceeding with the configuration sequence. We have now added to the SW the check described in the NOTE below (sec. 4.7.1, CPU RM):
NOTE
After completing reset, software should check the SerDesx_PLL1RSTCTL[RST_DONE] field to make sure that
each active SerDes PLL on the device has locked. Transactions or packet data cannot be transferred through the targeted lane(s) of the SerDes interface if the PLL associated with the lane(s) does not lock properly.
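The check described in the NOTE could be sketched as a bounded polling loop. This is a minimal sketch under stated assumptions, not the code in our boot: the RST_DONE bit position (0x40000000) is an assumption to be verified against the PLLnRSTCTL register description in the CPU RM, the function name is hypothetical, and a plain volatile pointer replaces in_be32().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: RST_DONE bit position; verify against the PLLnRSTCTL
 * register description in the CPU RM before use. */
#define PLL_RSTCTL_RST_DONE 0x40000000u

/* Poll a PLL reset-control register until RST_DONE is set or the
 * poll budget is exhausted. Returns true if RST_DONE was seen. */
static bool serdes_pll_wait_rst_done(const volatile uint32_t *rstctl,
                                     unsigned max_polls)
{
    assert(rstctl != NULL);
    for (unsigned i = 0; i < max_polls; i++) {
        if (*rstctl & PLL_RSTCTL_RST_DONE)
            return true;
    }
    return false;
}
```

A bounded loop is used so that a PLL that never locks leads to a reportable error path instead of a silent hang; the right poll budget depends on the boot timing and is left open here.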
We were able to log some CPU registers a few seconds after the problem occurs: we found that SerDesx_PLL1RSTCTL[RST_DONE]=0, so the problem is with PLL1.
We could identify the reason for the HRESET_REQ transition after interrupting the HW path on our board that drives the PORESET_B CPU input: in this case we read:
After reading PLL1RSTCTL=0x264745A7 we tried to reset the CPU SerDes PLL by setting PLL1RSTCTL=0x864745A7, but with no luck: afterwards the register stays forever at PLL1RSTCTL[RST_DONE]=0, PLL1RSTCTL[ERR]=0.
Moreover, we know that our board does not follow the whole design checklist (e.g. CPU SerDes module supply filtering and CPU supply decoupling), but we are trying to avoid scrapping the whole board design and to understand which modification could solve the issue for future production.
QUESTIONS:
“After POR completion the RST_DONE=0 is considered a fatal error and must be corrected by either way:
1) provide correct reference clock for the PLL
2) disable the unused PLL in the RCW.”
Thank you so much for clarification.
Giacomo Gasparini.
Sorry for the technical issue.
There was no indication in the CRM that you have updated the issue, so the Technical Case was closed automatically.
Please provide additional information:
1) what is the selected RCW source?
2) U-Boot log as textual attachment
3) check POR behaviour of the cfg_rcw_src signals which could be disturbed by U27 having bus hold on data inputs
Dear Ufedor,
1) Our current intended RCW source configuration (see schematics page 6) is:
cfg_rcw_src[0 to 8]=0 0 0 1 0 1 1 1 1
We want 16-bit NOR flash, 32-bit addressability, IFC_AD[0]=LSB, ALE before CS.
2) Sorry, but we don’t use U-Boot; we use our proprietary boot, which doesn’t report any CPU power-on phases. Which information do you think would help to debug the issue? We’ll add it to our boot file.
3) U27 has its enable driven by SAMPLE_CFG_B=1.8V, so it should never be enabled when the CPU is powered on. But to be sure I removed U27, and next Monday I’ll have completed the test: I’ll update you on this point.
Thanks and have a nice weekend,
Giacomo.
2) It could be helpful to compare DCFG_CCSR_RCWSRn registers values in proper and failing cases.
4) Check supplies ramp rate - refer to the QorIQ T1024, T1014 Data Sheet, Table 8. Power supply ramp rate.
5) Can you provide binary image of the RCW being used?
Dear Ufedor,
sorry for the late answer.
I checked, and without U27 the CPU keeps requesting the reset at startup, so U27 is not the culprit.
About your new questions:
2) we added a check of DCFG_CCSR_RCWSRn and the results are below:
08 10 00 0A 00 00 00 00 00 00 00 00 00 00 00 00
6A 80 00 03 00 40 00 12 FC 02 F0 00 21 00 20 00
00 00 00 00 00 00 00 00 00 00 00 00 00 03 2A 00
20 00 01 00 14 26 5A 00 00 00 00 00 00 00 00 06
So we can rule out a misread RCW as the cause of the issue;
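The comparison between the programmed RCW and the DCFG_CCSR_RCWSRn readback was done by eye here; it could also be done in the boot code on every startup. A minimal sketch, assuming both images are already available as byte buffers (how the readback buffer is filled is not shown, and the function name `rcw_first_mismatch` is hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first differing byte, or -1 if the
 * expected RCW and the readback are identical over len bytes. */
static long rcw_first_mismatch(const uint8_t *expected,
                               const uint8_t *readback, size_t len)
{
    assert(expected != NULL && readback != NULL);
    for (size_t i = 0; i < len; i++) {
        if (expected[i] != readback[i])
            return (long)i;
    }
    return -1;
}
```

Logging the returned index on a failing startup would show directly whether a misread RCW is involved.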
4) After measurement with oscilloscope, the power-on max ramp rates are:
• +VDD=+VDDC=SENSEVDDC=1.0V rail max ramp rate: 1330 V/s;
• +VDD_DDR3=1.35V rail max ramp rate: 1150 V/s;
• +1.8V rail max ramp rate: 2920 V/s;
• +1.2V rail max ramp rate: 5770 V/s;
• +2.5V rail max ramp rate: 2928 V/s;
• +3.3V rail max ramp rate: 3400 V/s;
(attached you'll find related startup waveform)
5) binary image of RCW used is below:
10101010 01010101 10101010 01010101 00000001 00001110 00000001 00000000 //preamble
00001000 00010000 00000000 00001010 00000000 00000000 00000000 00000000 // RCW ...
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
01101010 10000000 00000000 00000011 00000000 01000000 00000000 00010010
11111100 00000010 11110000 00000000 00100001 00000000 00100000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000011 00101010 00000000
00100000 00000000 00000001 00000000 00010100 00100110 01011010 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110 //...RCW
00001000 00010011 10000000 01000000 00011000 11111000 10001010 00000001 // End command
NOTE: We think it is better to power down the unused SD1_REF_CLK2_P/N to avoid emissions. So, based on CPU erratum A-011367, we want to change RCW[173] from 0 to 1 to be sure that PLL2 will also be powered down. We are also testing this configuration.
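For reference, the RCW[173] change can be expressed as a single-bit edit on the 64 RCW bytes. This is a minimal sketch, assuming that RCW bit 0 is the most significant bit of the first RCW byte after the preamble (big-endian bit numbering); the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RCW_BYTES 64u  /* 512-bit RCW, i.e. the 8 x 8-byte rows listed earlier */

/* Read RCW[bit]; ASSUMPTION: bit 0 is the MSB of the first RCW byte. */
static int rcw_get_bit(const uint8_t rcw[RCW_BYTES], unsigned bit)
{
    assert(bit < RCW_BYTES * 8u);
    return (rcw[bit / 8u] >> (7u - bit % 8u)) & 1u;
}

/* Set or clear RCW[bit] using the same bit numbering. */
static void rcw_set_bit(uint8_t rcw[RCW_BYTES], unsigned bit, int value)
{
    assert(bit < RCW_BYTES * 8u);
    uint8_t mask = (uint8_t)(0x80u >> (bit % 8u));
    if (value)
        rcw[bit / 8u] |= mask;
    else
        rcw[bit / 8u] &= (uint8_t)~mask;
}
```

Under that numbering RCW[173] lives in RCW byte 21, which is 0x40 in the image above, so the bit currently reads 0 and setting it would give 0x44. The four bytes after the end command look like a checksum over the image, so they would presumably need to be regenerated after the change.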
We have a doubt: we’ve seen that in our configuration SerDes_PLLnCR1[PLLBW_SEL]=1, which seems to be the recommended setting for QSGMII. Do you advise changing it to 0 to decrease the PLL bandwidth and PLL noise immunity, or should we keep it as recommended?
Thank you for support!
Giacomo
5) Excuse me, but what was requested is a binary image (.bin, .img), not the '01' textual representation.
6) Ensure that RESET_REQ_B is not pulled low during POR.
sorry for the late answer, but it took time to find the event trigger on the oscilloscope.
5) attached you’ll find the binary image currently used in all our units.
6) the only items that could pull it down are a logic gate (U41) and the FPGA (U34): we removed the U41 logic gate (if faulty it could behave unexpectedly) but the unit keeps resetting. Moreover, the FPGA is of the volatile-memory type and is not yet programmed in this phase (it is programmed 0.9s after CPU power-up), so it cannot pull down RESET_REQ_B.
Below you'll find the waveforms of RESET_REQ_B and ASLEEP during a normal startup and during a reset-event startup.
Thank you for reply.
Giacomo.
Please confirm that:
1) AVDD_SD1_PLLn filters are implemented exactly as shown in the QorIQ T1024 Family Design Checklist, Table 4. Power design system-level checklist.
2) TRST_B is pulsed low during POR.
Hi Ufedor,
1) The AVDD_SD1_PLLn supply filters are on schematics page 3 (and shown in the picture below). I summarize the filter on our standard failing boards:
As seen, our implementation seems to fulfill the requirement of CPU datasheet sec. 4.2.2, shown below:
I think our standard implementation is NOT OK on the following points:
When I understood this problem, I modified a currently failing unit as follows:
After this modification the problem is still not solved.
But I have a residual doubt about the datasheet's meaning: our current AVDD_SD1_PLL1, AVDD_SD1_PLL2 signals are supplied from the board-level supply +VDD_DDR3=1.35V, NOT from X1VDD as the datasheet seems to require; but we are not sure of this requirement (“AVDD_SDn_PLLn should be a filtered version of XnVDD”), as we copied the Evaluation board design. Does NXP intend the board-level 1.35V to supply X1VDD and then X1VDD to supply S1VDD, or can we use the board-level 1.35V to supply both X1VDD and S1VDD?
2) The COP interface is not used during normal product use; we use it only once, during initial system programming. For this reason, during the acquisitions below the COP connector (J1) was not connected to an external TAP interface.
So during the measurement COP_JTAG_TRST_3.3V_B was tied to 0V through a pull-down (R26). Even though it should be a pull-up according to fig. 80 of the datasheet, we don’t know whether this could cause problems for the CPU in normal operation...
Since it is an input of logic gate U35, it causes TRST_B=JTAG_TRST_B_1.8V to be 0V.
The TRST_B=JTAG_TRST_B_1.8V waveform is the C3 channel in the following photos.
Normal startup | Failing startup (CPU RESET_REQ asserted) |
Normal/correct startup (C1=24V trigger, C2=RESET_REQ_B, C3=TRST_B) | Forced CPU reset startup (C1=24V trigger, C2=RESET_REQ_B, C3=TRST_B). The forced reset was reproduced artificially by powering down PLL1, since the real problem is rare and random. |
Below you'll also find the mapping between the CPU reset logic in our schematics and fig. 80 of the CPU datasheet.
Thank you.
Giacomo
2) COP_SRESET_B should be connected to the HRESET_B through open-drain gate because the processor drives HRESET_B low during POR sequence and contention with the TAP signal is possible.
3) Please check that SerDes reference clocks amplitudes comply with the Data Sheet specifications.
4) What is the PORESET_B assertion duration after all supplies are stable?
5) How is PORESET_B generated when RESET_REQ_B is asserted? How long is it asserted in this case?
Dear Ufedor,
I've done some measurements.
3) The SD1_REF_CLK1_P/N waveform is shown in the figures below and has V_pkpk=1.24V. It is obtained from an AC-LVPECL waveform with V_pkpk=1.8V, further reduced to 70% with a 22 ohm series resistor to comply with the CPU requirements (V_pkpk < 0.8V).
4) the PORESET_B assertion duration after all supplies are stable is 213ms (both when the system starts up normally, see fig. 1, and when it fails to start up, see fig. 2).
5) the PORESET_B related logic is:
if [(EN_HRESET_REQ_B=0 and HRESET_REQ_B=0) or FPGA_HRESET_REQ_B=0] then:
    CPU_HRESET_B=0 for 200ms;
if PS_RESET_B=0 then:
    CPU_HRESET_B=0;
if CPU_HRESET_B=0 or COP_HRESET_3.3V_B=0 then:
    PORESET_B=0;
As explained below, we identified the exact condition that activates PORESET_B=0 in a failing startup: it is the first condition above (EN_HRESET_REQ_B=0 and HRESET_REQ_B=0), which was marked in bold in the original post.
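The PORESET_B logic listed above can also be written as a small combinational model, which makes it easy to check which input combination pulls PORESET_B low. This is only a sketch of the logic as described in this post: the 200ms pulse stretching and all other timing are not modelled, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Active-low board signals: true = logic 1 (deasserted). */
struct reset_inputs {
    bool en_hreset_req_b;    /* GPIO that enables the CPU reset request */
    bool hreset_req_b;       /* reset request from the CPU */
    bool fpga_hreset_req_b;  /* reset request from the FPGA */
    bool ps_reset_b;         /* voltage monitor output */
    bool cop_hreset_3v3_b;   /* COP/JTAG connector reset */
};

/* CPU_HRESET_B goes low on an enabled CPU request, an FPGA request,
 * or a supply reset (timing not modelled). */
static bool cpu_hreset_b(const struct reset_inputs *in)
{
    bool reset_request = (!in->en_hreset_req_b && !in->hreset_req_b)
                         || !in->fpga_hreset_req_b;
    bool supply_reset = !in->ps_reset_b;
    return !(reset_request || supply_reset);
}

/* PORESET_B goes low when CPU_HRESET_B or the COP reset is low. */
static bool poreset_b(const struct reset_inputs *in)
{
    return cpu_hreset_b(in) && in->cop_hreset_3v3_b;
}
```

In this model a low HRESET_REQ_B alone does not reset the CPU; PORESET_B only goes low once EN_HRESET_REQ_B is also low, which matches the delay between the two events described below.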
What happens during a normal startup (see fig. 1):
It takes about 213ms from the time all supplies are stable (in the picture I show only the VDD_DDR3=1.35V supply [trace C1, fig. 1,2] because it’s the last supply to stabilize) to the first PORESET_B: 0->1 POR transition [trace C2, fig. 1,2]. This 213ms delay is caused by voltage monitor U31 (STM6905), which keeps its output PS_RESET_B=0 during this time; moreover DS2 keeps CPU_HRESET=0 [trace C4, fig. 1,2], causing U37 to keep PORESET_B=0 after initial stabilization.
Until this moment HRESET_REQ_B [trace C3, fig. 1,2] has not been asserted.
Now, 213ms after all supplies are stable, voltage monitor U31 releases PS_RESET_B, then DS2 releases CPU_HRESET_B(*) [trace C4, fig. 1,2], so they both have a transition 0->1; at the same time U37 releases PORESET_B [trace C2, fig. 1,2].
(*) at this moment CPU_HRESET_B has already been released by supervisor U39 (STM6322), which has a shorter timeout of 200ms.
At 948ms after the supplies are stable we see an EN_HRESET_REQ_B=1->0 transition (as in fig. 6): this is the GPIO used to enable the CPU reset request.
From this moment there are no other transitions on the CPU_HRESET_B, PORESET_B and HRESET_REQ_B signals. In this case the system works normally.
What happens during a failing system startup (see fig. 2,3):
It takes about 213ms from the time all supplies are stable (in the picture I show only the VDD_DDR3=1.35V supply [trace C1, fig. 1,2] because it’s the last supply to stabilize) to the first PORESET_B: 0->1 POR transition [trace C2, fig. 1,2]. This 213ms delay is caused by voltage monitor U31 (STM6905), which keeps its output PS_RESET_B=0 during this time; moreover DS2 keeps CPU_HRESET=0 [trace C4, fig. 1,2], causing U37 to keep PORESET_B=0 after initial stabilization.
Until this moment HRESET_REQ_B [trace C3, fig. 1,2] has not been asserted.
(#)
Now, 213ms after all supplies are stable, voltage monitor U31 releases PS_RESET_B, then DS2 releases CPU_HRESET_B(*) [trace C4, fig. 1,2], so they both have a transition 0->1; at the same time U37 releases PORESET_B [trace C2, fig. 1,2].
(*) at this moment CPU_HRESET_B has already been released by supervisor U39 (STM6322), which has a shorter timeout of 200ms.
The first difference in fig. 2 (compared to fig. 1) is that, 213+3ms after the supplies are stable (only for the first event) or 3ms after this PORESET_B: 0->1 transition, the CPU(**), already out of the POR state, drives the first HRESET_REQ_B: 1->0 transition [trace C3, fig. 2,3].
(**) we assume that it is the CPU that drives the HRESET_REQ_B transition, because HRESET_REQ_B is connected to:
- CPU -> we assume it drives this transition;
- FPGA -> the FPGA doesn’t drive the signal, as it is not programmed at this moment;
- U41 -> U41 doesn’t drive the signal because, after we cut its input EN_HRESET_REQ_B and added a pull-up there, we could still measure the same transition on HRESET_REQ_B on the CPU side.
948ms after the supplies are stable (only for the first event) or 734ms after this PORESET_B: 0->1 transition there is the EN_HRESET_REQ_B=1->0 transition (as in fig. 4), as in the normal startup case: since HRESET_REQ_B=0 (as described above), the U41-U40-U39-U37 logic chain leads to a PORESET_B: 1->0 transition and the CPU enters an unwanted POR state for the first time (see fig. 2,3).
Side note: we ruled out an unstable supply voltage as the cause of the PORESET_B: 1->0 transition (because PS_RESET_B=1 is stable the whole time).
948ms+0.3ms after the supplies are stable (only for the first event) or 734ms+0.3ms after this PORESET_B: 0->1 transition we see an EN_HRESET_REQ_B=1->0 transition as in fig. 4, probably due to the CPU POR state.
The waveform explanation continues with fig. 5:
after a further 190ms, in other words 948+190ms after the supplies are stable (only for the first event) or 734ms+190ms after this PORESET_B: 0->1 transition, as in fig. 5, we see the first PORESET_B: 0->1 transition and the CPU exits the unwanted POR state for the first time.
(#)
From this moment the waveforms repeat periodically between the (#) marks with T=926ms.
NOTE: if we artificially shut down SD1_REF_CLK1, the system fails at EVERY startup in exactly the same way as described above.
Do you have all the information needed to reply? We are in a hurry because we need to proceed with a new production batch.
One question on which we need a fast reply: is it OK to supply AVDD_SD1_PLL1 from our board-level +VDD_DDR3=1.35V rail (as on the current board), or should we supply it from the X1VDD CPU balls (e.g. AC12)? It's not clear from the manual.
Thank you,
Giacomo.
------------------------
Figures:
Normal startup:
FIG.1: Normal startup (C1=VDD_DDR3, C2=PORESET_B, C3=HRESET_REQ_B, C4=CPU_HRESET_B)
FIG. 6: Normal startup (C1=VDD_DDR3, C2=PORESET_B, C3=EN_HRESET_REQ_B, C4=U41_OUT)
Wrong startup – first unwanted POR event (“CPU reset”):
FIG. 2: Wrong startup - first unwanted POR event (C1=VDD_DDR3, C2=PORESET_B, C3=HRESET_REQ_B, C4=CPU_HRESET_B)
FIG. 3: Wrong startup - first unwanted POR event (C1=VDD_DDR3, C2=PORESET_B, C3=HRESET_REQ_B, C4=CPU_HRESET_B)
FIG. 4: Wrong startup - first unwanted POR event (C1=VDD_DDR3, C2=PORESET_B, C3=EN_HRESET_REQ_B, C4=U41_OUT)
Wrong startup – following unwanted POR events (“CPU reset”):
FIG. 5: Wrong startup - further unwanted POR periodical events (C1=VDD_DDR3, C2=PORESET_B, C3=HRESET_REQ_B, C4=CPU_HRESET_B)
SerDes reference clock amplitude looks greater than specified in the processor's Data sheet:
HRESET_B should not be driven externally during POR - only PORESET_B.
You wrote:
> is it ok to supply AVDD_SD1_PLL1 from our board-level +VDD_DDR3=1.35V rail
It is possible to use any 1.35V supply if all requirements and specifications of the processor's Data Sheet are fulfilled.
Dear Ufedor,
I really appreciate your speed in answering.
I'll reply point by point:
" SerDes reference clock amplitude looks greater than specified in the processor's Data sheet:"
Sorry if I wasn't clear, but our 1.26V signal was SD1_REF_CLK_P - SD1_REF_CLK_N, i.e. the differential peak-to-peak voltage (Vod_pkpk), so it is double the single-ended signal SD1_REF_CLK_P - GND (Vod_pk). So it should be correct, as its single-ended equivalent is Vod_pk=0.63V < 0.8V as the datasheet requires.
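The amplitude arithmetic can be written out explicitly. This is a minimal sketch, assuming (as this reply does) that the 0.8V figure is a limit on the peak value and that the peak is half of the differential peak-to-peak value; values are in millivolts to avoid floating-point rounding, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Differential peak voltage from differential peak-to-peak, in mV. */
static int diff_peak_mv(int diff_pkpk_mv)
{
    assert(diff_pkpk_mv >= 0);
    return diff_pkpk_mv / 2;
}

/* ASSUMPTION: the limit is expressed as a peak value, which is this
 * post's reading of the datasheet, not a confirmed specification. */
static bool refclk_amplitude_ok(int diff_pkpk_mv, int limit_peak_mv)
{
    return diff_peak_mv(diff_pkpk_mv) < limit_peak_mv;
}
```

With the measured 1260mV differential peak-to-peak this gives 630mV, which is below an 800mV limit; the unattenuated 1800mV source would give 900mV and fail the same check.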
"HRESET_B should not be driven externally during POR - only PORESET_B."
We don't drive HRESET_B externally: on our boards this signal is named SRESET_1.8V_B and may come:
- from FPGA (during reset event the FPGA is not yet operational); or
- from JTAG TAP connector (J1, pin11) as 3.3V signal further translated to 1.8V with level translator (U12);
If I depopulate R124 (adding a pull-up at pin 1) this is the result: it shows that the HRESET_B CPU signal is not pulled low externally.
Perhaps our signal names caused some confusion? In my previous post I talk about "CPU_HRESET_B" and "HRESET_REQ_B", but these signals belong to the logic related to PORESET_B, not to HRESET_B. Our schematics on pages 2 and 4 clarify this.
" It is possible to use any 1.35V supply if all requirements and specifications of the processor's Data Sheet are fulfilled."
So the datasheet sentence "AVDD_SD1_PLL1n should be a filtered version of X1VDD" actually means that " AVDD_SD1_PLL1n should be a filtered version of X1VDD source voltage"? In this case our supply architecture is compliant.
Do you have any other advice?
P.S.: we are re-designing the SerDes differential traces: what are the maximum acceptable crosstalk limits (NEXT, FEXT) for QSGMII? We looked at the QSGMII spec, but it only covers SDD11, SDD22 and SCC22 and says nothing about crosstalk.
Thank you so much,
Giacomo.
Please check noise amplitude on the PLL supply.
Please create a separate Community question concerning the QSGMII interface layout.
Dear Ufedor,
here are the voltages measured on the AVDD_SD1_PLL1 supply: the positive terminal was R63, pin 2 and the negative terminal was the local GND near CPU ball AA16. I measured at the input filter, not on the balls as advised in the datasheet.
I measured with a low-capacitance 1GHz FET probe, and the worst-case waveforms show that we are within 1.35V +/-65mV:
This is the worst case, with CLK_OUT=ON.
The other unit, measured with CLK_OUT=OFF, shows less noise, as below:
Dear Ufedor,
do you have any other suggestion about the issue? The noise on PLL supply seems ok...
Thank you,
Giacomo.
I have no further ideas concerning points to check remotely.