Application: each suite in a multi-storey hotel has its own control unit that handles lighting, air conditioning, access control (door locks) and other features involving interaction with hotel guests.
The control unit implements TCP-IP/Ethernet networking, to permit exchange of data on an on-going basis with a central server in the hotel. Network is closed (not accessible from the Internet), but there can be several hundred control units in the system. These form local area networks, typically on a per-floor basis. Thus in general there are two hops between control unit and central server, one to the LAN router and one from router to central server.
Control unit uses an MC9S12XDT512 processor interfaced to a 10Mb/s Ethernet controller, Microchip type ENC28J60. The interface uses the SPI2 port of the processor, which in the 80-pin device does not have the Slave Select line bonded to an external pin. The SS function for the ENC28J60 therefore uses a GPIO port line (PTS.3).
Problem: when reading data from the ENC28J60 via SPI2 (which can involve several hundred bytes in one operation), the monitoring loop which tests SPTEF and the one that tests SPIF appear to 'hang', showing all the appearances of waiting for a flag that never becomes set.
A block of data is read from the ENC28J60 using the following subroutine. The number of bytes to be read is passed in Accum D, and the starting address of a receiving buffer in HCS12 RAM is passed in Index Reg X:
read_mem:   PSHD                          ; Save byte count
rdmem01:    LDAB    SPI2SR                ; Check SPTEF
            BITB    #mSPI2SR_SPTEF
            BEQ     rdmem01               ; Keep looping until set
            BCLR    PTS, mSS2_EN          ; Assert ENC28J60 CS line
            NOP                           ; Delay 2 T(bus)
            NOP
            MOVB    #RBM_CMD, SPI2DR      ; Send Read Memory command byte
rdmem02:    LDAB    SPI2SR                ; Check SPIF
            BITB    #mSPI2SR_SPIF
            BEQ     rdmem02               ; Keep looping until set
            LDAB    SPI2DR                ; Read returned byte, discard
rdmem03:    LDAB    SPI2SR                ; Check SPTEF
            BITB    #mSPI2SR_SPTEF
            BEQ     rdmem03               ; Keep looping until set
            CLR     SPI2DR                ; Send null byte
rdmem04:    LDAB    SPI2SR                ; Check SPIF
            BITB    #mSPI2SR_SPIF
            BEQ     rdmem04               ; Keep looping until set
            MOVB    SPI2DR, 1,X+          ; Get returned byte, copy to
                                          ; receiving memory
            DECW    0, SP                 ; Decrement counter and loop
            BNE     rdmem03
            LEAS    2, SP                 ; Adjust stack
            BSET    PTS, mSS2_EN          ; De-assert CS line
            RTS
The above issues a 'read memory' command (RBM_CMD) to the ENC28J60, after which the latter supplies data from consecutive locations of its own memory buffer as SPI clocks are generated.
The NOPs to delay 2 bus clock periods just after the ENC28J60 CS line is asserted are included because there is a certain amount of decoding logic (as well as a 5V/3.3V level translator) between the PTS.3 pin and the actual enable pin of the ENC28J60, and the device Slave Select should be stable before SPI transactions begin. Processor bus clock frequency is 39.3216 MHz (period = 25.4 nsec).
Diagnostic instructions to turn a row of four PCB LEDs on in certain combinations were inserted before and after each of the loops that test SPTEF and SPIF, and these indicate the point at which the program gets stuck.
Tests consist of attaching about 30 control units to a local network via Ethernet switches. A computer running the central server software (as well as DHCP server software) is also on the LAN. On power-up, each control unit successfully acquires a DHCP assigned IP address from the server and begins normal operation. Normal operation consists of relatively low volume traffic, each control unit receiving a poll from the server at 20-30 second intervals. After about 10 minutes the control units begin to go off line and cease to respond to Ping messages. After 1-2 hours, more than half of the units have ceased to operate, and in all cases the diagnostic LEDs indicate hanging at one or other of the test loops for SPTEF and SPIF.
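Whatever the root cause, a flag-polling loop with an iteration limit turns a silent hang into an error that can be reported. The following is a minimal sketch in C rather than assembler, assuming the status register can be passed in as a pointer. The function name `spi_wait_flag` and the poll limit are hypothetical; the masks follow the usual SPIxSR layout (SPIF in bit 7, SPTEF in bit 5).

```c
#include <stdbool.h>
#include <stdint.h>

#define SPIF_MASK  0x80u   /* SPIF:  transfer complete (bit 7 of SPIxSR) */
#define SPTEF_MASK 0x20u   /* SPTEF: transmit buffer empty (bit 5 of SPIxSR) */

/* Poll *status until any bit in mask is set, giving up after max_polls reads.
   Returns true if the flag was seen, false if the limit was reached.
   On real hardware, status would point at the memory-mapped SPI2SR. */
bool spi_wait_flag(const volatile uint8_t *status, uint8_t mask, uint32_t max_polls)
{
    while (max_polls-- > 0u) {
        if ((*status & mask) != 0u) {
            return true;
        }
    }
    return false;
}
```

On a false return, the caller can de-assert the CS line, log the event and re-initialise SPI2 instead of spinning forever.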
Further points:
(i) SPI clock was originally set at the maximum of half bus clock frequency. But I later noticed that this must be derated above 35 MHz, so I reduced SPI clock frequency to one-sixth of bus clock (approx 6.5 MHz). This made no discernible difference to the behaviour.
(ii) The SPI0 port of the same processor is also being used for something. Its SPI clock frequency is lower (approx 1.6 MHz), but otherwise operates in pretty well exactly the same way from a firmware point of view. No problems have come to light with this however. On the face of it, the next step should be to reduce SPI2 clock further, even though it is now well within spec.
(iii) The processor executes a very short (~1usec) interrupt routine every 10msec. I tried disabling this during execution of the above subroutine, but this made no difference.
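The SPI clock figures in points (i) and (ii) can be checked with a little arithmetic. This sketch assumes the usual S12 SPI baud-rate relation, divisor = (SPPR + 1) * 2^(SPR + 1). The function names are hypothetical, and the actual SPPR/SPR values used in the firmware are not stated in the post.

```c
#include <stdint.h>

/* Baud-rate divisor for the given SPPR (0..7) and SPR (0..7) field values,
   assuming divisor = (SPPR + 1) * 2^(SPR + 1). */
uint32_t spi_divisor(uint8_t sppr, uint8_t spr)
{
    return (uint32_t)(sppr + 1u) << (spr + 1u);
}

/* Resulting SPI clock in Hz for a given bus clock. */
uint32_t spi_clock_hz(uint32_t bus_hz, uint8_t sppr, uint8_t spr)
{
    return bus_hz / spi_divisor(sppr, spr);
}
```

With a 39.3216 MHz bus clock, SPPR = 2 and SPR = 0 give a divisor of 6 and an SPI clock of 6.5536 MHz, matching the "approx 6.5 MHz" quoted above. A divisor of 24 (SPPR = 2, SPR = 2) would give 1.6384 MHz, which is consistent with the figure quoted for SPI0, although that setting is only a guess.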
Are there known issues with the SPI ports of the MC9S12X series, in particular when operating at higher frequencies? If not, can anyone spot a weakness in the above code or make any further suggestions?
When the CPU hangs, did you verify that the SPI2 configuration is still the same as after initialization? What if some uninitialized pointer, a pointer overflow or something like that, leads to a write to the SPI2 control registers, possibly disabling SPI2?
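One way to carry out this check is to keep a copy of the values written at initialization and compare the live registers against it whenever a wait loop times out. The sketch below is only illustrative: the struct `spi_config_t` and the function `spi_config_intact` are hypothetical, and on the target the `live` pointer would refer to the memory-mapped SPI2CR1, SPI2CR2 and SPI2BR registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the SPI2 configuration registers. */
typedef struct {
    uint8_t cr1;   /* SPI2CR1 */
    uint8_t cr2;   /* SPI2CR2 */
    uint8_t br;    /* SPI2BR  */
} spi_config_t;

/* Return true if the live registers still match the values that were
   written at initialization. */
bool spi_config_intact(const spi_config_t *expected,
                       const volatile spi_config_t *live)
{
    return live->cr1 == expected->cr1 &&
           live->cr2 == expected->cr2 &&
           live->br  == expected->br;
}
```

A false result while the program is stuck in a flag loop would point to a stray write rather than to an SPI hardware fault.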
Two months down the line -
Thanks, Kef, for your suggestion. You were very much on the right track, in that SPI (and other) configuration registers were sometimes being corrupted. But discovering why this was happening was quite another matter.
The underlying problem lay with the Ethernet controller, Microchip type ENC28J60 - or rather, in the way I was using it. It's pretty clear that this device is a mask programmed processor of some sort, therefore on exit from Reset, it executes an initialisation routine of its own. After that, it's ready for the host processor (MC9S12XDT512 in my case) to configure it for the application by writing to its various registers. The ENC28J60 datasheet specifies a 300 usec minimum delay following exit from reset before attempting to write to these registers (a later engineering note recommends extending this to 1 msec). Of course Juggins was only giving it 50 usec.
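To put the delays in terms of the 39.3216 MHz bus clock quoted earlier, the number of bus cycles for a given delay can be worked out as below. This is a sketch only; `delay_cycles` is a hypothetical helper, and it rounds up so that the delay is never shorter than requested.

```c
#include <stdint.h>

/* Number of bus clock cycles needed to cover at least delay_us microseconds,
   rounded up to a whole cycle. */
uint32_t delay_cycles(uint32_t bus_hz, uint32_t delay_us)
{
    uint64_t product = (uint64_t)bus_hz * delay_us;
    return (uint32_t)((product + 999999u) / 1000000u);
}
```

At 39.3216 MHz, 50 usec is 1967 cycles, 300 usec is 11797 cycles and 1 msec is 39322 cycles, so the original delay was about one sixth of the datasheet minimum.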
Result was that the pointers defining the start and finish of the circular buffer for receiving Ethernet packets (up to 8K available in the ENC28J60) were not being properly set, since the commands issued via the SPI interface would have been ignored if the device was still doing its initialisation.
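A cheap guard against this kind of silent failure is to read each pointer back after writing it and refuse to continue if the values differ. The sketch below uses a simulated device, since the real SPI register access routines are not shown in the thread. `fake_enc_t`, `fake_write16` and `set_rx_window` are all hypothetical names, and the simulated device simply drops writes until its `ready` flag is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated controller: ignores register writes while still initialising. */
typedef struct {
    bool     ready;   /* false until the device has finished its own start-up */
    uint16_t erxst;   /* receive buffer start pointer */
    uint16_t erxnd;   /* receive buffer end pointer */
} fake_enc_t;

/* Write a 16-bit register; the write is lost if the device is not ready. */
static void fake_write16(fake_enc_t *dev, uint16_t *reg, uint16_t value)
{
    if (dev->ready) {
        *reg = value;
    }
}

/* Set the receive buffer window, then read both pointers back.
   Returns true only if both values were actually accepted. */
bool set_rx_window(fake_enc_t *dev, uint16_t start, uint16_t end)
{
    fake_write16(dev, &dev->erxst, start);
    fake_write16(dev, &dev->erxnd, end);
    return dev->erxst == start && dev->erxnd == end;
}
```

On the target, a false return would be the cue to wait longer and retry the configuration, rather than carrying on with an undefined receive buffer.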
When the ENC28J60 places a received packet in the circular buffer, it precedes it with three 2-byte words: Next Packet Pointer, which gives the next location beyond the last one occupied by the present packet; byte count for the present packet; and two bytes containing various status bits. After reading and storing these quantities, a firmware function running on the host processor copies the packet content to a buffer reserved in HCS12 memory. In my application, this is a 512-byte buffer located in paged memory at 0x1000-11FF.
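The six header bytes described above could be unpacked as follows. This is a sketch that assumes each 2-byte quantity arrives low byte first, which is how I understand the ENC28J60 to present them; the type `rx_header_t` and the function `parse_rx_header` are hypothetical.

```c
#include <stdint.h>

/* The three 2-byte quantities that precede each received packet. */
typedef struct {
    uint16_t next_packet;  /* Next Packet Pointer */
    uint16_t byte_count;   /* byte count for the present packet */
    uint16_t status;       /* status bits */
} rx_header_t;

/* Combine two bytes, low byte first, into a 16-bit value. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Unpack the 6 header bytes read from the start of a packet. */
rx_header_t parse_rx_header(const uint8_t raw[6])
{
    rx_header_t h;
    h.next_packet = le16(&raw[0]);
    h.byte_count  = le16(&raw[2]);
    h.status      = le16(&raw[4]);
    return h;
}
```

The `byte_count` field is the value that needs checking before it is used as a copy length.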
Trouble was, due to the start and finish points of the circular buffer not being set properly (for the reason given above), the firmware function that copies the packet would fetch a totally erroneous value for byte count. In some cases this would exceed 0x1000 (4K), therefore data was being dumped into the entire 4K block of paged memory, and spilling over into the fixed page at 0x2000. This is where the heap for the networking firmware is located, therefore several global data structures pertaining to TCP and HTTP processes were being overwritten with incoming packet data.
The result of a TCP or HTTP process using wildly erratic data is of course virtually unpredictable. But by using various highly tortuous methods, I was able to track execution of the program and to identify (among other things) cases of the program writing to the register block at address zero onwards. One effect of this was, as you suggested, corruption of an SPI configuration register.
The moral of the tale, I guess, is that when using a complex device such as the ENC28J60, which itself represents a firmware controlled sub-system, read the bloody datasheet!
Hello bdp,
I think that the main issue here is to never trust the validity of any data received from an external device, and to provide a sanity check whenever the data bounds are limited. In this case, a simple bitwise AND of the packet size with $0FFF would keep the data within the buffer, whatever the circumstances.
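The two kinds of check might look like this in C. It is a sketch only: `mask_length`, `length_ok` and `RX_BUF_SIZE` are hypothetical names, with the 512-byte size taken from the buffer described earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_BUF_SIZE 512u   /* receiving buffer at 0x1000-0x11FF */

/* Masking approach: force the count below 4096, so that a copy starting
   at 0x1000 cannot run past 0x1FFF. */
uint16_t mask_length(uint16_t count)
{
    return count & 0x0FFFu;
}

/* Range-check approach: accept the packet only if it fits the buffer. */
bool length_ok(uint16_t count)
{
    return count <= RX_BUF_SIZE;
}
```

The mask keeps the copy inside the 4K block, but a count between 513 and 4095 would still overrun a 512-byte buffer, so the explicit range check is the safer of the two when the packet can simply be discarded.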
For a linear buffer, another method that might also be implemented would be to control the buffer pointer so it can never exceed the upper limit. It could potentially return to the start of the buffer and over-write the earlier data.
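A pointer that wraps back to the start might be sketched like this. `ring_t`, `ring_put` and the small `RING_SIZE` are hypothetical and chosen only to keep the example short.

```c
#include <stdint.h>

#define RING_SIZE 8u   /* illustrative size; a real buffer would be larger */

typedef struct {
    uint8_t  data[RING_SIZE];
    uint16_t head;             /* index of the next location to write */
} ring_t;

/* Store one byte and advance the index, wrapping to the start at the end
   of the buffer so that a write can never land outside data[]. */
void ring_put(ring_t *r, uint8_t byte)
{
    r->data[r->head] = byte;
    r->head = (uint16_t)((r->head + 1u) % RING_SIZE);
}
```

Writing more bytes than the buffer holds overwrites the oldest data, but nothing outside the buffer is touched.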
A further approach might be to use a FIFO (circular) buffer at the outset.
Your Ethernet problem would not be solved, but at least the problem would be limited to the sub-routines handling the received data.
Regards,
Mac
Hi bigmac
Thanks for your comments. Very sound advice, and coupled with my recent experiences, will no doubt help transform me into a shrewder and more prudent programmer in the future. I am primarily a designer of electronic hardware, but like many of my generation of electronic engineers, have become increasingly involved in programming in later life.
This project has involved me, single-handed, in the development of a restricted spec TCP-IP package from scratch - the most ambitious embedded software exercise I have yet undertaken. It's ironic though that the cause of the most elusive problem in the de-bugging phase was nothing to do with that, but a simple case of taking something for granted at the hardware level when I should have known better. Bitter experience indeed, and costing me many hours and days of frustration, but they say that experience is cheap at any price!
Regards,
BDP