How to Squeeze Every Last Drop of Performance Out of SIMULINK Models with AMMCLIB

Daniel_Popa · ‎03-21-2018

This article shows how to use the powerful features of the Automotive Math and Motor Control Library (AMMCLIB) to speed up applications designed to run on NXP microprocessors. We're going to walk through a simple but relevant use case that makes a strong argument for using AMMCLIB whenever possible. If you are interested in more details, please refer to the attached document.

The Problem ...

While preparing the last 2 modules of the 3-Phase PMSM Control Workshop with NXP's Model-Based Design Toolbox, I ran into a performance issue with the model I was trying to build. After adding a couple of new blocks needed for the sensorless observer, I noticed that the motor was not moving anymore, the FreeMASTER recorder was crashing the application, and all sorts of other strange behavior showed up.

So I went back to a previous working Simulink model and started adding the new Simulink blocks one by one, checking the behavior after each. I soon found the reason, and I was a bit surprised: the time spent processing the motor control fast loop computations was far greater than expected.

To find the culprit I started to profile the application using the dedicated MBDT Profiler block. Soon after, I had a clear picture of the processing requirements of the various subsystems. Overall, the entire FOC calculations (PI current regulators, PARK/CLARKE coordinate transformations, SVM, PWM updates, SINE/COSINE) took about 40 microseconds, but one simple operation took 20.8 microseconds - yes - half as long as all the others combined.

In the Simulink model I'm using single-precision floating-point operations, but at the end of the computation I need to cast these values to an unsigned 32-bit data type, since that is the input required by the FTM block.

That particular cast from single to unsigned integer was killing the application because of the fmodf() call in the generated code. The generated code looks like this:

/* Product: '<S50>/TransPerMin' */
 rtb_Sum = fmodf(truncf(FOC_SC_DWork.Merge_b[0] * 1000.0F), 4.2949673E+9F);
 FOC_SC_DWork.TransPerMin[0] = rtb_Sum < 0.0F ? (uint32_T)-(int32_T)(uint32_T)
 -rtb_Sum : (uint32_T)rtb_Sum;
 rtb_Sum = fmodf(truncf(FOC_SC_DWork.Merge_b[1] * 1000.0F), 4.2949673E+9F);
 FOC_SC_DWork.TransPerMin[1] = rtb_Sum < 0.0F ? (uint32_T)-(int32_T)(uint32_T)
 -rtb_Sum : (uint32_T)rtb_Sum;
 rtb_Sum = fmodf(truncf(FOC_SC_DWork.Merge_b[2] * 1000.0F), 4.2949673E+9F);
 FOC_SC_DWork.TransPerMin[2] = rtb_Sum < 0.0F ? (uint32_T)-(int32_T)(uint32_T)
 -rtb_Sum : (uint32_T)rtb_Sum;

At this point you may think ...

So what? It is a call to a standard floating-point function, and processors with an FPU should handle it just fine, right?

Well, think again! Here is a nice article about this topic: Know Your FPU, 2006 Edition

The Solution ...

In order to go further with the model development I had 2 options:

1. change the whole algorithm to a fixed-point implementation: that means I would have to redo all the examples and give up the FPU capabilities of the microcontroller
2. find a way to resolve the conversion issue by avoiding the generation of the fmodf() function call

Personally, I consider that the days when you had to stick with fixed point to implement algorithms are gone - nowadays there are plenty of microcontrollers capable of dealing with floating-point numbers, and let's not forget we are using a model-based approach where we should focus on design rather than specific implementations. Therefore the only acceptable option at this point was #2.

AMMCLIB provides specific blocks to convert between various data type formats. These blocks are available within the MLIB subset.

The Proof ...

Using these blocks, I started to do some investigations.

First I built a simple model to verify the conversion results and the generated code. The goal of this test is to see if we can get rid of the fmod() function call while preserving good conversion precision.

The code generated for this model looks like this:

C-code for SIMULINK Data Type Conversion single2int

/* DataTypeConversion: '<Root>/SIMULINK Data Type Conversion' */
 tmp = (real32_T)fmod((real32_T)floor(rtb_Add), 4.2949673E+9F);
 rtb_SIMULINKDataTypeConversion = tmp < 0.0F ? -(int32_T)(uint32_T)-tmp :
 (int32_T)(uint32_T)tmp;
/* DataStoreWrite: '<Root>/Data Store Write SIMULINK' */
 SIMULINK_vs_AMMCLIB_DW.S = rtb_SIMULINKDataTypeConversion;

C-code for AMMCLIB MLIB Conversion single2int

/* S-Function (MLIB_Convert_SF_F32FLT): '<S1>/MLIB_Convert_SF' */
 rtb_SIMULINKDataTypeConversion = MLIB_Convert_F32FLT(rtb_Add,4.66E-10F);
/* DataStoreWrite: '<Root>/Data Store Write AMMCLIB' */
 SIMULINK_vs_AMMCLIB_DW.A = rtb_SIMULINKDataTypeConversion;

And ... voila! Mission accomplished. With the AMMCLIB MLIB block, no expensive floating-point library function call is needed.
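For intuition about what a scaled conversion like this may compute, here is a portable stand-in. It is only a sketch under assumptions and not the AMMCLIB implementation: I assume the second argument is a scale factor that maps the input into the fractional range [-1, 1), and that the result is that fraction expressed in signed 32-bit counts with saturation. The name convert_f32_scaled is hypothetical; check the User's Guide for the actual behavior of MLIB_Convert_F32FLT.

```c
#include <stdint.h>

/* Hypothetical portable stand-in, NOT the AMMCLIB implementation.
 * Assumption: `scale` maps `in` into the fractional range [-1, 1) and the
 * result is that fraction expressed in signed 32-bit (Q1.31) counts. */
static int32_t convert_f32_scaled(float in, float scale)
{
    float frac = in * scale;

    if (!(frac < 1.0f)) {       /* too large (or NaN): clamp high */
        return INT32_MAX;
    }
    if (frac <= -1.0f) {        /* too small: clamp low */
        return INT32_MIN;
    }
    return (int32_t)(frac * 2147483648.0f);  /* 2^31 counts per unit */
}
```

In this sketch, a scale of exactly 2^-31 returns the integer part of the input, and out-of-range inputs saturate instead of wrapping, which is why no modulo operation is needed. Note that the constant 4.66E-10F in the generated code is close to, but slightly larger than, 2^-31 (about 4.6566E-10).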

How Fast Is It ?

Using MBDT Profiler we can easily check the number of ticks/time needed to execute each of the methods. Code generated to measure the performance looks like this:

/* user code (Output function Body) */
 {
     /* Start of Profile Code */
     uint32_t tmp1;
     uint32_t tmp2;
     tmp1 = profiler_get_cnt();
    /* Start Profiling This Function.*/
    
    /* S-Function (MLIB_Convert_SF_F32FLT): '<S3>/MLIB_Convert_SF' */
     localB->MLIB_Convert_SF = MLIB_Convert_F32FLT(rtu_In1,4.66E-10F);
    /* user code (Output function Trailer) */
    
    /* Profile Code : Compute function execution time in us. */
     tmp2 = profiler_get_cnt();
     profile_buffer[1] = gt_pf(tmp1, tmp2);
    /* End of Profile Code */
 }
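The article does not show how gt_pf() turns the two counter readings into an elapsed count. A minimal sketch of one common approach is given here, assuming a free-running 32-bit counter that counts down and reloads to 0xFFFFFFFF; both the assumption and the name elapsed_counts_down are mine, and the real helper may differ.

```c
#include <stdint.h>

/* Hypothetical helper; the real gt_pf() may be implemented differently.
 * Assumption: the timer is a free-running 32-bit DOWN counter that
 * reloads to 0xFFFFFFFF after reaching zero. */
static uint32_t elapsed_counts_down(uint32_t start, uint32_t end)
{
    /* Unsigned subtraction is modulo 2^32, so a single wraparound
     * between the two readings is absorbed automatically. */
    return start - end;
}
```

With readings of 1000 and 796 this yields 204 counts, and it still gives the right answer when the counter wraps once between the two reads.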

The profiler results for both methods are shown below in this FreeMASTER scope capture. As can be seen, the MLIB conversion (green) is roughly 9 times faster than the standard Simulink data type conversion (red): 22 profiler counts versus 204.

The MBDT Profiler measures the number of LPIT counts needed to execute a particular piece of code. The source clock of the LPIT is the bus clock divided by 2, which gives a 40 MHz clock, so one count corresponds to 25 ns.

Converting the counts to microseconds, we get:

Single2Integer Conversion Method	Profiler Counts	Time (microseconds)
Standard Simulink Data Type Conversion	204	5.1
AMMCLIB MLIB Conversion	22	0.55
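
The conversion behind the table is just the count divided by the timer clock. A small sketch, with counts_to_us as an illustrative name of my own and the 40 MHz clock passed in as a parameter:

```c
#include <stdint.h>

/* Convert profiler counts to microseconds for a given timer clock.
 * counts_to_us is an illustrative helper, not part of MBDT. */
static double counts_to_us(uint32_t counts, double timer_hz)
{
    return (double)counts * 1.0e6 / timer_hz;
}
```

For the values above, counts_to_us(204, 40.0e6) gives about 5.1 and counts_to_us(22, 40.0e6) gives about 0.55, matching the table.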

Conclusions

If you wish to speed up your computations and squeeze more performance out of your NXP microcontroller, try using AMMCLIB, since it provides a set of optimized functions specially designed for embedded applications. For more details please visit: Automotive Math and Motor Control Library Set|NXP

gramirezv · ‎11-09-2018

Hi dumitru-daniel.popa‌,

Thank you for this example, it shows the advantages of using the math libraries for the S32K.

I'm trying to implement some simple floating-point operations, and figuring out which blocks to use is quite a challenge. One thing I'd like to point out is that the block descriptions and the help provided within Simulink are almost nonexistent and just painful to read.

As you mention, following the link you provided in your post: Automotive Math and Motor Control Library Set|NXP

gets you to the AMMCLIB site, and from the downloads page you can download the libraries and, more importantly, the User's Guide.

The User's Guide SK32K14XMCLUG.pdf is a great resource (I'm not sure if I'm allowed to attach it here, which is why I pointed out how to get it), especially because each function comes with its Declaration, Arguments, Implementation details and a Code Example in C.

The names of the Simulink blocks match the function names in the User's Guide, so you can search for the block name to find the details of each function, and check the Typedefs reference for a description of the data types.

Best regards,

Gustavo