MMDVSQ (Memory-Mapped Divide and Square Root)

mjbcswitzerland
Specialist V

Hi All

The KL28 includes an MMDVSQ and today I did a few tests of its performance, looking at how it can possibly be used to generally speed up 'standard' code.

First of all, this module is a (small) co-processor dedicated to performing integer square root and integer divide/remainder calculations, which NXP adds to some selected Cortex-M0+ based processors that don't have these instructions in the Cortex core, in order to give them a bit more calculating performance in applications that rely on such calculations.

These are some tests of the calculation times measured on a KL28 running at 48MHz (not its top speed), compared with the time taken when the processor performs the same calculation using traditional code.
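
For reference, measurements like these can be made with the Cortex-M0+ SysTick counter; the following is just a generic sketch of one way to do it (the variable names are illustrative only and this is not necessarily the exact method used for the figures below):

// Time a single call using SysTick, which counts down at the core clock (48MHz here)
SysTick->LOAD = 0x00ffffff;                                               // maximum 24 bit reload value
SysTick->VAL  = 0;                                                        // clear the current count
SysTick->CTRL = (SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk);   // run from the processor clock
ulStart = SysTick->VAL;
ulResult = fnIntegerSQRT(ulInput);                                        // the call being measured
ulStop = SysTick->VAL;
ulTime_ns = (((ulStart - ulStop) * 1000) / 48);                           // cycles to ns at 48MHz (includes the subroutine call overhead)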

MMDVSQ integer square root
0          0.77us
1          0.78us
2          0.77us
9          0.78us
100        0.77us
1000       0.92us
10000      0.92us
100000     0.92us
1000000    0.92us
10000000   1.07us
100000000  1.07us
0xffffffff 1.07us

These figures are just to show the slight dependency on the input value that it needs to calculate on. There is no reference to a library square root since that uses floating point rather than integers, which is not very interesting as a comparison, and a slight overhead due to a subroutine call is included in the measured times. The times are however interesting in comparison to the integer divides below, because the integer square root is obviously efficient...

Next are some measurements of calculating the quotient of an integer division (that is, the rounded-down divide result):

MMDVSQ signed divide quotient
1/1 0.52us
0x7fffffff / 3  0.52us
0x7fffffff / 0x7fffffff  0.83us
2536 / 8827634  0.62us
63 / 32  0.64us

and in comparison to traditional code doing the same:

1/1 1.29us
0x7fffffff / 3  6.45us
0x7fffffff / 0x7fffffff  1.13us
2536 / 8827634  0.52us
63 / 32  1.96us

Interestingly, the traditional code is slightly faster in the case where the result is 0, but overall the MMDVSQ is faster - up to a few times faster, depending on the numbers involved.

The calculation of the remainder is compared next, bearing in mind that this is the result of a modulo calculation.

MMDVSQ signed divide remainder
1/1 0.64us
0x7fffffff / 3  0.96us
0x7fffffff / 0x7fffffff  0.96us
2536 / 8827634  0.75us
63 / 32  0.64us

in comparison to traditional code calculation:

1/1 1.77us
0x7fffffff / 3  6.92us
0x7fffffff / 0x7fffffff  1.60us
2536 / 8827634  0.95us
63 / 32  2.44us

The MMDVSQ  improves performance in all cases.

Considering general purpose code, the question is how useful it would be to make use of the MMDVSQ.
The following is an example of something that is often done in embedded code - the method used to calculate register and bit locations in the NVIC based on an interrupt ID - and similar code is probably found in many locations in an embedded project.

ptrIntSet += (iInterruptID / 32);            // move to the interrupt enable register in which this interrupt is controlled
*ptrIntSet = (0x01 << (iInterruptID % 32));  // enable the interrupt

After adding functions to make use of the MMDVSQ (as sub-routines or in-lined) this code can now be replaced by

ptrIntSet += (fnFastUnsignedIntegerDivide(iInterruptID, 32));        // move to the interrupt enable register in which this interrupt is controlled
*ptrIntSet = (0x01 << (fnFastUnsignedModulo(iInterruptID, 32)));     // enable the interrupt
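
The fnFastUnsignedIntegerDivide()/fnFastUnsignedModulo() implementations are not listed here; purely as a rough idea of what such a wrapper can look like, here is a minimal sketch using the MMDVSQ "fast start" flow (the register and bit-field names are taken from the MMDVSQ chapter of the KL reference manuals rather than from the project code, so check them against the KL28 header before use):

static unsigned long fnFastUnsignedIntegerDivide(unsigned long ulDividend, unsigned long ulDivisor)
{
    MMDVSQ_CSR  = MMDVSQ_CSR_USGN;                     // unsigned divide with quotient result
    MMDVSQ_DEND = ulDividend;                          // load the dividend
    MMDVSQ_DSOR = ulDivisor;                           // loading the divisor starts the calculation (fast start mode)
    while ((MMDVSQ_CSR & MMDVSQ_CSR_BUSY) != 0) {}     // wait for the co-processor to complete
    return MMDVSQ_RES;                                 // read the quotient
}

static unsigned long fnFastUnsignedModulo(unsigned long ulDividend, unsigned long ulDivisor)
{
    MMDVSQ_CSR  = (MMDVSQ_CSR_USGN | MMDVSQ_CSR_REM);  // unsigned divide with remainder result
    MMDVSQ_DEND = ulDividend;
    MMDVSQ_DSOR = ulDivisor;                           // loading the divisor starts the calculation
    while ((MMDVSQ_CSR & MMDVSQ_CSR_BUSY) != 0) {}
    return MMDVSQ_RES;                                 // read the remainder
}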

The result is that this particular calculation (63 / 32 is the representative reference in the benchmark measurements above) no longer takes typically 70ns to execute but instead around 1us - some 14x longer!

The result therefore shows that using the MMDVSQ in this way for many typical embedded code tasks is not of interest, since it greatly reduces efficiency.

Explanation of the limitation:
The reason for this is that the compiler will not perform an integer divide or remainder calculation at all when the divisor is a constant power of two. Instead it performs the operation with a much more efficient shift (for the divide) or mask (for the remainder). The MMDVSQ will always perform a real division and so cannot profit from this potential.
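
As a hypothetical illustration of what the compiler does with a constant power-of-two divisor (assuming iInterruptID holds a non-negative value):

ulRegisterOffset = (iInterruptID / 32);      // compiled as (iInterruptID >> 5) - a single shift, no divide
ulBitPosition    = (iInterruptID % 32);      // compiled as (iInterruptID & 0x1f) - a single mask, no remainder calculation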

The only locations where it makes sense to use the MMDVSQ routines are those where the divisor is a variable or a fixed value that is not a power of two. In these cases it is mostly more efficient, as shown by the comparisons.

Although there are usually such locations in general project code (analog oriented rather than digital) they tend to be rather less dominant than the reference case type.

Therefore the MMDVSQ can be used to increase code efficiency if used carefully, but it is not a blanket solution for increasing the efficiency of all "mod" and "div" usage, where it can instead have a degrading effect!

Regards

Mark

P.S. To be absolutely fair to the MMDVSQ, when the reference case uses a volatile variable with the value 32 instead of a fixed value (forcing real integer divides), the MMDVSQ does win: the time goes down from typically 1.5us to around 1.0us...


Kinetis: http://www.utasker.com/kinetis.html
Kinetis KL28: http://www.utasker.com/kinetis/FRDM-KL28Z.html


bobpaddock
Senior Contributor III

In security systems, a constant execution time is actually more important than the fastest execution time.

Variability can lead to side channel timing attacks.

Would the square root be beneficial for calculating the square root of the sum of the squares? This comes up often in accelerometer projects.


mjbcswitzerland
Specialist V

Bob

Also, SW based square root calculations have an execution time that depends on the input value, because the number of iterations needed to approximate the result is not always the same.
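
As an example, a typical Newton-style integer square root (just a generic sketch and not the routine that was benchmarked above) iterates a different number of times depending on the input value:

static unsigned long fnSoftwareSQRT(unsigned long ulInput)
{
    unsigned long ulEstimate = ulInput;
    unsigned long ulNext;
    if (ulInput < 2) {
        return ulInput;                              // 0 and 1 are their own square roots
    }
    ulNext = ((ulEstimate + (ulInput / ulEstimate)) / 2);
    while (ulNext < ulEstimate) {                    // the number of iterations (and so the execution time) depends on the input value
        ulEstimate = ulNext;
        ulNext = ((ulEstimate + (ulInput / ulEstimate)) / 2);
    }
    return ulEstimate;                               // floor of the square root
}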

This is also seen in the SW integer divide results, which vary by a factor of 10 in time depending on the input values. With the MMDVSQ the 'jitter' is smaller and so it would be somewhat safer in this respect, although still not constant.

To achieve constant timing, a HW timer can be used to start the calculation and to return the result after a fixed time (longer than the worst-case calculation duration). The KL28 also has the TSTMR (Time Stamp Timer Module), which can be used to synchronise such operations to a 1us resolution.

Eg.

disable_int();
x = TSTMR0_L;                       // read the time stamp value (low word first)
(void)TSTMR0_H;                     // the high word is also read to complete the access
y = (x + 3);                        // the us count value at which the result is to be returned

while (TSTMR0_L == x) {             // wait for the next us boundary
    (void)TSTMR0_H;
}
(void)TSTMR0_H;

result = fnIntegerSQRT(input);      // takes between approx. 0.7us and 1.1us

while (TSTMR0_L != y) {             // wait for the us match boundary
    (void)TSTMR0_H;
}

enable_int();

return result;

This will give 1us of jitter due to the synchronisation of the clock to the instruction, but the result will always take 2us to be returned, irrespective of its value.

If your sums of squares are integers and you need the RMS as an integer, the MMDVSQ will do it. The maximum summed square input is limited to 0xffffffff and the maximum RMS result is 0xffff.
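
As a rough outline of that flow (the fnIntegerRMS() name and sample buffer are for illustration only, using the routines referred to above):

unsigned long fnIntegerRMS(const signed short *ptrSamples, unsigned long ulSampleCount)
{
    unsigned long ulSumOfSquares = 0;
    unsigned long i;
    for (i = 0; i < ulSampleCount; i++) {
        ulSumOfSquares += (unsigned long)((signed long)ptrSamples[i] * ptrSamples[i]); // accumulate the squares (the sum must not exceed 0xffffffff)
    }
    return fnIntegerSQRT(fnFastUnsignedIntegerDivide(ulSumOfSquares, ulSampleCount));  // mean of the squares, then square root (maximum result 0xffff)
}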

For floating point RMS I use the CMSIS arm_sqrt_f32().

For accelerometer to velocity to displacement measurements I use CMSIS arm_cfft_f32() to perform the integration in the frequency domain (and remove DC offsets).

Regards

Mark
