SGPIO SPI master emulation, bits reversed?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Zuofu on Sat Sep 21 17:52:49 MST 2013
Hi,

According to AN11210 (SGPIO SPI master emulation) Figure 3 (and almost every schematic of the SPI standard I've seen), data is to be shifted out MSB first. However, from what I can tell from the example code running on an LPC4337, data written using the SGPIO_spiWrite(...) function is shifted out LSB first. This is consistent with the datasheet description of the REGx slice data register, which describes it as being right shifted (LSB first). Of course, the example code is a loopback test, so it would work regardless of whether the LSB or the MSB is shifted first, but clearly this doesn't work for general devices which use the MSB-first convention.

Is the diagram in the document a mistake, or did I do something wrong? Is there a way to change the shifting behavior of the SGPIO port? I would guess not due to the post referenced here: http://www.lpcware.com/content/forum/reversing-bits-sgpio-output

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Wed Oct 09 13:33:00 MST 2013
Hi John.

I suggest that you keep that part of the code as highly optimized as possible, since sometimes an interrupt may 'hit' the execution flow and disturb it by just 2 clock cycles too many.
What I'm saying is that I usually keep as much free CPU-time as possible, so these things do not occur.
Loading values once and keeping them in a register *is* a very good optimization instead of loading the values from memory each time.
E.g. you do save one clock cycle by "not working" instead of "working", so why do the extra work? Remember, CPUs very much like to do as little as possible. =)

One clock cycle can make a difference, especially if it's right on the edge of making things fail. Depending on how much free CPU time you have left, it could also mean that the M0 core would get less hot.

I do not know if you can use the DMA in this case; if you can, I think you'll benefit from making the DMA swap the endianness (as suggested earlier in this thread); this would save you a few clock cycles as well.

...I once tried writing cycle-accurate code, where one clock cycle out of 20 million would make a huge difference, but that's a rare case (in that particular case I also knew exactly at which clock cycle an interrupt would occur - code ran for 6 months without interruptions and without failing).
Still, if you are running the M0 core on a kHz frequency, a single clock cycle is expensive - but I guess only you know exactly what can be allowed and what can't. ;)

If you're not using the DMA, but you're reading the ADC values directly by the M0 core, it would probably pay to do the reversal as soon as you've read the ADC-value into a register, so you don't have to store the value then read it and finally store it again; that's two clock cycles; plus the one clock cycle from saving a constant value in a register (perhaps more than one). For each clock cycle you optimize, you gain a higher percentage proportionally. Eg. from 100 to 99 doesn't appear to be much, but from 18 to 17 is almost 6%. What I'm saying is, that the closer you get to zero, the more it pays to make the code spend just a single clock cycle less.

Also - quite important... IF you are currently running the reversal code from Flash memory, try copying the routine to SRAM and execute it from there.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by JohnR on Wed Oct 09 06:18:39 MST 2013
Hi Pacman,

I have looked at the disassembly for the reversal code and there is only a little that can be done to speed it up. Assigning registers to hold raw_val and the constants would save a number of memory accesses but I don't think that would reduce the overall time by more than a few percent.

In my case, I was collecting blocks of 100 ADC samples and then doing the reversals. Doing the calculations for a conversion while the following conversion cycle is waiting for the new SPI data works. The conversion rate is slowed of course but overall the collection of data is more even.

John

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Tue Oct 08 06:48:24 MST 2013

Quote: starblue
As far as I can tell RBIT is not an option for Cortex M0, only in Cortex M3 and up.
At least I couldn't find it in any of the ARM Cortex M0 documents.

Unfortunately, it seems you are correct. At first, I did search infocenter.arm.com, but I didn't notice that I was in the Cortex-M3 document. :/

So we gotta blame ARM for this. ;)

Still, it would save a lot of clock-cycles if NXP could implement a bit-reversal address, where you store one 8/16/32-bit word and then read it somewhere else as reversed. That's two clock-cycles. Yes, I'm dreaming and it won't help us getting a solution right now.

...unless you sacrifice 64 GPIO pins on your chip, heh, and do the hardwiring yourself, but I guess that's not a very interesting idea.
You'd probably have to do it as two 16-bit values, though, because there is no complete GPIO port with all 32 bits available.

Until now, I think John's approach is worth looking deeper into. Disassembling and optimizing the disassembly could perhaps gain an extra clock-cycle maybe even more (depending on how good a job the compiler did).

lpcware · ‎06-15-2016

Content originally posted in LPCWare by starblue on Mon Oct 07 06:07:12 MST 2013
> Yes, but rbit is optional.

As far as I can tell RBIT is not an option for Cortex M0, only in Cortex M3 and up.
At least I couldn't find it in any of the ARM Cortex M0 documents.

Do you have different information?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Zuofu on Tue Oct 01 04:31:13 MST 2013
Very interesting. Thanks for the info. One thing I was looking at is that it is possible to use the DMA to swap endian-ness (but not bit order), so it is possible to save one level of swapping if DMA is used. The individual bits in a byte still need to be swapped (perhaps by the look-up table method)...
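
To make the distinction concrete, here is a minimal sketch in plain C (the function names are made up for illustration and are not from any NXP library). It shows that a full 32-bit bit reversal can be split into a byte swap, which is the part an endianness swap covers, plus a bit reversal inside each byte, which still has to be done in software:

```c
#include <stdint.h>

/* Swap the four bytes of a word (the effect of an endianness swap). */
uint32_t swap_bytes32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Reverse the bit order inside one byte, one bit at a time. */
uint8_t reverse_byte(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}

/* Reverse the bits of each byte, leaving the byte positions alone. */
uint32_t reverse_bits_in_each_byte(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(x >> (8 * i));
        r |= (uint32_t)reverse_byte(b) << (8 * i);
    }
    return r;
}

/* Full 32-bit reversal = byte swap followed by per-byte bit reversal. */
uint32_t reverse_bits32(uint32_t x)
{
    return reverse_bits_in_each_byte(swap_bytes32(x));
}
```

For example, 0x12345678 byte-swapped is 0x78563412, while fully bit-reversed it is 0x1E6A2C48. The per-byte loop here is only for clarity; on the M0 a table lookup would likely replace it.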

lpcware · ‎06-15-2016

Content originally posted in LPCWare by JohnR on Mon Sep 30 14:32:14 MST 2013
I have just timed three bit reversal schemes using the LPC4350 M4 processor on 14 bit ADC data.
The function __rbit() was not used as the code will normally be run by M0.

The first, the bit-by-bit method,

uint32_t v;     // reverse the low 16 bits of this
uint32_t t;     // t will have the reversed bits of v
uint32_t i;

t = v & 1;      // seed t with bit 0 of v; otherwise t starts uninitialized and bit 0 is lost
for (i = 15; i; i--)
{
  t <<= 1;
  v >>= 1;
  t |= v & 1;
}
t &= 0xFFFF;

This executes in 10.4 microseconds.

The second was based on the code I posted earlier. This took 1.4 microseconds.

The third was based on reversal by a look-up table. This took 1.25 microseconds.

At present the whole twin ADCs/DAC conversion cycle takes only some 960 nanoseconds without doing any bit reversal, which if included would halve the data rate.

Now, if only __rbit() were available on M0 this problem would just disappear!

John

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Fri Sep 27 14:15:46 MST 2013
> You should ask ARM, as it is their core

Yes, but rbit is optional. :)

The microcontroller manufacturer can freely choose whether or not to include this functionality.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by starblue on Fri Sep 27 08:35:08 MST 2013
I reverse the complete 32-bit word one byte at a time using an 8-bit table,
as a drop-in for RBIT.

In the application there are 2 words with 2 values in each word
(I read all four channels of the AD7367 as fast as possible, 2 in parallel).

I don't think I tried the bit-twiddling method,
as the table method was fast enough and memory is not that scarce
(the code for the M0 and the table is in internal RAM, so access is fast).
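
For readers who want to try this approach, here is a minimal sketch of a table-based 32-bit reversal. It is not starblue's actual code: the table is generated at compile time with the macro trick from the Bit Twiddling Hacks page, and the names BIT_REV_TABLE and rbit32_table are made up for illustration.

```c
#include <stdint.h>

/* 256-entry table: BIT_REV_TABLE[b] is b with its 8 bits reversed.
   The macros expand to all 256 entries at compile time. */
#define R2(n) (n), (n) + 2 * 64, (n) + 1 * 64, (n) + 3 * 64
#define R4(n) R2(n), R2((n) + 2 * 16), R2((n) + 1 * 16), R2((n) + 3 * 16)
#define R6(n) R4(n), R4((n) + 2 * 4), R4((n) + 1 * 4), R4((n) + 3 * 4)
static const uint8_t BIT_REV_TABLE[256] = {
    R6(0), R6(2), R6(1), R6(3)
};

/* Reverse all 32 bits: each byte is reversed through the table and
   placed at the mirrored byte position. */
uint32_t rbit32_table(uint32_t x)
{
    return ((uint32_t)BIT_REV_TABLE[x & 0xFFu] << 24) |
           ((uint32_t)BIT_REV_TABLE[(x >> 8) & 0xFFu] << 16) |
           ((uint32_t)BIT_REV_TABLE[(x >> 16) & 0xFFu] << 8) |
           (uint32_t)BIT_REV_TABLE[(x >> 24) & 0xFFu];
}
```

The table costs 256 bytes. Whether it beats the mask-and-shift method depends on where the table and the code live, which matches the remark above about keeping both in internal RAM.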

lpcware · ‎06-15-2016

Content originally posted in LPCWare by JohnR on Fri Sep 27 07:02:02 MST 2013
Hi Starblue,

How do you do the reversal?

John.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by starblue on Fri Sep 27 06:38:26 MST 2013
> I would really like to get an answer from NXP as to why _RBIT is not supported in the M0 core.

You should ask ARM, as it is their core.

BTW, I also reverse bits on the M0, from an AD7367 ADC.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Fri Sep 27 05:09:04 MST 2013
Hacker's Delight got them from somewhere else. I think this source was among the first: Bit Twiddling Hacks

If you work on a block of data, I can see how the code might be optimized by the compiler.
Obviously you've gone beyond the standard speed-optimization and tested everything you could find.
In the above mentioned link, there are a few ways involving 64-bit multiplication. This might be worth trying too.
The one with modulus will most likely be too slow.
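
For reference, a minimal sketch of the division-free 64-bit multiply variant from that page, wrapped in a function (the name reverse_byte_mul64 is made up here). It reverses one byte with two multiplies, a mask, and a shift:

```c
#include <stdint.h>

/* Reverse the bits of one byte using 64-bit multiplication, no division.
   The first multiply makes shifted copies of the byte, the mask picks one
   bit from each copy so the selected bits appear in reversed order, and the
   second multiply gathers them into bits 32..39 of the product. */
uint8_t reverse_byte_mul64(uint8_t b)
{
    uint64_t spread = ((uint64_t)b * 0x80200802ULL) & 0x0884422110ULL;
    return (uint8_t)((spread * 0x0101010101ULL) >> 32);
}
```

The Cortex-M0 has no single instruction for a 64-bit multiply, so the compiler will most likely expand this into several instructions or a library call. Treat it as something to benchmark, not as a known win.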

If you choose to solve the problem with external components, then it might get complicated if using 8-bit serial-to-parallel and parallel-to-serial shift registers. That is... if you needed 32 bits. But you could solve the problem partly in hardware by using just one 8-bit reverser and then swap the bytes using software.

(I wish there were some cheap SPLDs for these kinds of tasks; or even better, that there were an area in the ARM chips which was user-configurable from within the microcontroller, i.e. without using a computer; I think it would do wonders to have simple GAL/PAL functionality mixed with the ARM).

I'm almost off-topic now. It might be best if the experts from NXP could put in some words on how to send the highbits first.
Personally I wouldn't mind if the specs for SGPIO was altered, so from the next chips, the highbit would go first.
It could be solved on the silicon by having two FIFO addresses, one for highbit first, one for lowbit first (to allow for the DMA to send bits reversed too), but I don't know if that's a good idea. It would probably not be too difficult to implement.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by JohnR on Thu Sep 26 06:33:59 MST 2013
Hi Pacman,

At an earlier stage, I did implement the bit reversal using a look-up table but, from memory as I can't find the actual data, it was not faster than the method in my post, which I should have credited to Hacker's Delight (http://www.hackersdelight.org/hdcodetxt/reverse.c.txt).

It is such a shame that the SGPIO module which is so versatile in many ways does not have the ability to reverse the bit-stream output from the registers.

I have looked at the possibility of adding some logic between the ADC and the SGPIO input to reverse the bit-stream. As the 14-bit data serial data are read in MSB first, the previous set of data is pushed out LSB first.

John.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Wed Sep 25 10:36:03 MST 2013
Hi John.

If you at some point operate on a single byte, perhaps you could bit-reverse using a 256-byte table ?

revbyte = revtab[byte];

It could be slightly faster on the M0 than using the shifts as the AND operations read the mask from memory anyway.
-But it all depends ... If you need a full 32-bit value swapped, you'll have to bitshift anyway.
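
A minimal sketch of how such a table could be filled at start-up, assuming a table in RAM is acceptable. revtab is the name used above; revtab_init and reverse_byte_lut are hypothetical helper names.

```c
#include <stdint.h>

/* revtab[b] holds b with its 8 bits in reversed order. */
static uint8_t revtab[256];

/* Fill the table once at start-up. */
void revtab_init(void)
{
    for (unsigned value = 0; value < 256; value++) {
        uint8_t reversed = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (value & (1u << bit)) {
                /* bit 0 maps to bit 7, bit 1 to bit 6, and so on */
                reversed |= (uint8_t)(0x80u >> bit);
            }
        }
        revtab[value] = reversed;
    }
}

/* One table lookup per byte once the table is initialized. */
uint8_t reverse_byte_lut(uint8_t byte)
{
    return revtab[byte];
}
```

Call revtab_init() once before the first lookup. A const table placed in Flash would avoid the start-up cost, but may be slower to read than RAM.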

Unfortunately I haven't found any place where the DMA can do the bit-swapping; it would be real useful.
It seems I'll have to forget about 8-channel I2S in my app. if the bits can't be shifted out MSB first. :/

lpcware · ‎06-15-2016

Content originally posted in LPCWare by JohnR on Mon Sep 23 05:20:36 MST 2013
Hi,

The lack of an RBIT instruction in M0 is a real pain as so often SPI data are sent MSB first. This is my current version for reversing a pair of 14-bit input data, Value[0] and Value[1].

// reverse the data for both signal and reference channels
raw_val = (Value[0] & 0xFFF80000) | (((Value[1]&0xFFF80000))>>16);
// swap odd and even bits
raw_val = ((raw_val >> 1) & 0x55555555) | ((raw_val & 0x55555555) << 1);
// swap consecutive pairs
raw_val = ((raw_val >> 2) & 0x33333333) | ((raw_val & 0x33333333) << 2);
// swap nibbles ...
raw_val = ((raw_val >> 4) & 0x0F0F0F0F) | ((raw_val & 0x0F0F0F0F) << 4);
// swap bytes
//raw_val  = ((raw_val >> 8) & 0x00FF00FF) | ((raw_val & 0x00FF00FF) << 8);
raw_val = __REV(raw_val);

integrate[0] += ((raw_val>>0x01) & 0x3FFF);
integrate[1] += ((raw_val>>0x10) & 0x3FFF);

The code is slow but I have not been able to find anything faster for the M0 processor. I don't want to use M4 as, in my case, the floating point and DSP instructions are needed to process the integrated data. Like you I would really like to get an answer from NXP as to why _RBIT is not supported in the M0 core. My guess is that it must use a lot of silicon?
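
For anyone who wants to test the mask-and-shift idea off-target, here is a portable sketch of the same technique without the CMSIS __REV intrinsic. The function names are made up for illustration, and reverse_low_bits is a hypothetical helper for narrower fields such as a 14-bit sample; it does not reproduce the exact packing used in the code above.

```c
#include <stdint.h>

/* Reverse all 32 bits with masks and shifts only. */
uint32_t rbit32_masks(uint32_t x)
{
    /* swap odd and even bits */
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    /* swap consecutive pairs */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    /* swap nibbles */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    /* swap bytes within each halfword, then swap the halfwords;
       together these two steps reverse the byte order, like __REV */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    x = (x >> 16) | (x << 16);
    return x;
}

/* Reverse only the low `width` bits of x (width must be 1..32).
   Bits of x above `width` are ignored. */
uint32_t reverse_low_bits(uint32_t x, unsigned width)
{
    return rbit32_masks(x) >> (32u - width);
}
```

For a 14-bit value, reverse_low_bits(0x0001, 14) gives 0x2000, i.e. bit 0 moves to bit 13.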

John.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Zuofu on Sun Sep 22 21:35:45 MST 2013
LSB/MSB here is with regards to the bit order, not the endian-ness.

I've confirmed that the hardware will shift out from the right by looking at the SPI example (AN11210). In fact, careful reading of page 9 of AN11210 agrees with this:

Quote:
In CPHA mode 1, the LSB bit of the slice is immediately applied to the output when the data register is written, before the first edge.

Of course, this conflicts with the images given in Figures 2 and 3 of the same document (AN11210), which show:
(image: http://www.lpcware.com/system/files/sgpio_spi_1.png)

In addition, this agrees with the user manual UM10503, on page 357 (as you mentioned) which says:

Quote:
At an active shift clock data is right shifted; captured data is shifted in at bit 31, and register data is shifted out from bit 0.

I suspect that this is never a problem in any of the examples because all of the example applications use a loop-back test. However, once you have real data, this becomes an actual problem. In my application, the SPI is connected to an FPGA, so it is fairly straightforward for me to do bit-reversal in hardware, but there are many applications where this is not the case. As you mention, most serial protocols expect the master to shift out the MSB first, so I am not sure why the decision was made to (apparently) hard-wire the shift direction to be right shift only.
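
The behaviour described in the quoted text can be modelled on a PC. Below is a minimal sketch in plain C; it is a software model only, not SGPIO driver code, and all names are made up for illustration. It shifts right, outputs bit 0, and shifts captured data in at bit 31, which shows why a value has to be bit-reversed before it is written if the device on the other end expects MSB first, and why MSB-first input arrives bit-reversed.

```c
#include <stdint.h>

/* Software model of one slice data register, per the quoted description. */
typedef struct {
    uint32_t reg;
} slice_model_t;

/* One shift clock: bit 0 is the output, then the register shifts right
   and the captured input bit enters at bit 31. */
unsigned slice_shift(slice_model_t *s, unsigned din)
{
    unsigned dout = s->reg & 1u;
    s->reg = (s->reg >> 1) | ((uint32_t)(din & 1u) << 31);
    return dout;
}

/* Clock out nbits of value and pack them in wire order: the first bit on
   the wire becomes the most significant bit of the result, which is how an
   MSB-first receiver would assemble it. */
uint32_t slice_send(uint32_t value, unsigned nbits)
{
    slice_model_t s = { value };
    uint32_t received = 0;
    for (unsigned i = 0; i < nbits; i++) {
        received = (received << 1) | slice_shift(&s, 0);
    }
    return received;
}

/* Feed nbits of wire_bits into the model MSB first (nbits must be 1..32)
   and return what software would read from the top nbits of the register. */
uint32_t slice_capture(uint32_t wire_bits, unsigned nbits)
{
    slice_model_t s = { 0 };
    for (unsigned i = nbits; i-- > 0; ) {
        slice_shift(&s, (wire_bits >> i) & 1u);
    }
    return s.reg >> (32u - nbits);
}
```

In this model slice_send(0xA1, 8) yields 0x85, the bit-reversed byte, and sending 0x85 yields 0xA1. Likewise slice_capture(0xA1, 8) yields 0x85, so captured MSB-first data needs the same reversal.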

With the M4 core, this isn't much of a problem because of the __RBIT instruction to perform bit-reversal. However, as the other poster mentioned, this causes problems with the M0 core, which does not have the __RBIT instruction. It would be helpful to get a clarification from NXP to confirm the hard-wired nature of the shift direction as well as maybe some insight as to why this is the case.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Sun Sep 22 12:38:19 MST 2013
I find this a very good and important question and I agree with you, though I haven't looked at any code.

First of all... Are we talking about Least/Most Significant Byte, or least/most significant bit?

If looking in the UM10503, on page 357 I read:

Quote:
The bit order is optimized for MSB first. Interfaces that require LSB first should use a software instruction (RBIT) to reverse the bit order (not supported by the ARM Cortex-M0).

The above looks very good to me and very clear (unless MSB/LSB means byte order, but that would be a complete disaster), BUT as you say, if I look at page 366 (Section 18.6.4), REG[0...15], I see the following:

Quote:
At an active shift clock data is right shifted; captured data is shifted in at bit 31, and register data is shifted out from bit 0.

Table 220:

Quote:
Symbol REG: At each active shift clock the register shifts right; loading REG(31) with data captured from DIN(n) and DOUT(n) is set to REG(0).

That is a bit confusing.

I think that if it's not possible to output the msb (most significant bit) first, SGPIO would be more or less useless, because:
- I2C sends msb first.
- I2S (and S/PDIF) sends msb first (I2S is also big endian).
- SPI sends msb first.
- UART is the exception: it normally sends lsb first.
- SD/MMC cards (which are basically SPI) send msb first.
(any other serial interfaces we need to emulate ?)

In other words: There would be very, very little benefit in shifting out the least significant bit first.
If NXP tested emulating one of the above mentioned interfaces (I do expect that they did), then it should be possible to shift out msb first.

It would be ideal to have an option to control the shift-direction (a single bit in a control-register), but I have not found anything.

If, on the other hand, you have trouble with big/little endian, you should be able to set up the DMA to do the byte-reversing.

In the UM10503, there's an example (no code, though) on how the SGPIO can be used for extending the I2S from 2 channels to 8 channels.
If that is tested and it works using the DMA, then the most significant bit simply must go first.

I still haven't been able to send code to my 43xx, so unfortunately I cannot verify anything.

LPC43xx