PowerQuad Coproc for Incremental FIR calculations?

mzielinski · ‎05-05-2021

Hi,

For the past couple of days I've been trying to figure out how to use the PowerQuad coproc of my RT600 to perform FIR calculations in blocks. I see that the API has PQ_FIR and PQ_FIRIncrement functions, but they don't work how I need them to. In the examples, it's shown how to use PQ_FIR on the first half of a buffer of input data, and then how to run PQ_FIRIncrement on the second half of the same buffer of input data. (In a second example, the arm_fir_init_f32 and arm_fir_f32 functions are used directly, but in essentially the same way: the first half of the buffer is processed, then the second half).

But if you have data being read continuously from an ADC, this doesn't work, since DMA periodically places new blocks of data in the same buffer rather than appending new data to it. In many hours of experimentation I haven't been able to run arm_fir_f32 on a buffer with pSrc = &buffer[0], reload the buffer with new data, run arm_fir_f32 again with pSrc = &buffer[0], and get output that is correct, i.e. the same as if I had run arm_fir_f32 once on a single large block of data.

I suspect that this implementation of arm_fir_f32 is not properly keeping track of the filter state through the state struct in the arm_fir_instance_f32 parameter to arm_fir_f32. Instead it's offloading everything to the PowerQuad, and state is just being kept implicitly in the regs or temporary scratch space of the PowerQuad.
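For reference, the behaviour I'm expecting (filter state carried from one call to the next, so the same input buffer can be reloaded and reused) can be sketched in plain C. This is only a minimal illustration under my own assumptions, not the CMSIS-DSP or NXP SDK code: sketch_fir_t, sketch_fir_init, sketch_fir_block and SKETCH_MAX_TAPS are hypothetical names, and the sketch assumes the input and output buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_TAPS 32u /* hypothetical upper bound for this sketch */

/* Hypothetical streaming FIR: remembers the last (num_taps - 1) inputs. */
typedef struct {
    const float *taps;                  /* taps[0] multiplies the newest sample */
    size_t num_taps;
    float history[SKETCH_MAX_TAPS - 1]; /* previous inputs, oldest first */
} sketch_fir_t;

void sketch_fir_init(sketch_fir_t *f, const float *taps, size_t num_taps)
{
    assert(num_taps >= 1 && num_taps <= SKETCH_MAX_TAPS);
    f->taps = taps;
    f->num_taps = num_taps;
    memset(f->history, 0, sizeof(f->history));
}

/* Computes out[n] = sum over k of taps[k] * x[n - k], where samples from
 * before the start of this block are taken from the saved history. */
void sketch_fir_block(sketch_fir_t *f, const float *in, float *out, size_t len)
{
    const size_t hist_len = f->num_taps - 1;

    for (size_t n = 0; n < len; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < f->num_taps; k++) {
            if (k <= n) {
                acc += f->taps[k] * in[n - k];
            } else {
                /* (k - n) samples before the block start */
                acc += f->taps[k] * f->history[hist_len - (k - n)];
            }
        }
        out[n] = acc;
    }

    /* Keep the newest hist_len samples of (old history followed by in). */
    if (len >= hist_len) {
        memcpy(f->history, in + (len - hist_len), hist_len * sizeof(float));
    } else {
        memmove(f->history, f->history + len, (hist_len - len) * sizeof(float));
        memcpy(f->history + (hist_len - len), in, len * sizeof(float));
    }
}
```

With this kind of state handling, filtering one long block and filtering two half-length blocks through the same reloaded buffer give the same output, which is the property I can't reproduce with the PowerQuad version.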

Any advice would be greatly appreciated,

-Mike

jingpan · ‎05-07-2021

Hi mzielinski,

You can let the DMA work in ping-pong mode. It fills the first buffer with ADC data; when that buffer is full, it fills the second buffer, and then the first buffer again.

arm_fir_instance_f32 only keeps an offset address. This offset tells the PowerQuad how to find the real data address, and you can see that it always increases. The other parameters are saved in the PowerQuad.

Regards,

Jing

mzielinski · ‎05-07-2021

Hi jingpan,

Thanks for your reply. Using DMA in ping-pong mode is exactly what I'm doing. Have you seen the code I posted in reply to my original post? The arrays dataForFIR1 and dataForFIR2 are exactly like the first and second buffers you mention. My problem is that I can't run arm_fir_f32 on the first buffer, then the second, then the first, etc., and get the correct answer. Please take another look at the code I posted. The whole problem is that the official ARM implementation of arm_fir_f32 works 100% correctly and as I would expect: with it I can run it on the first buffer, then the second, then the first, etc., and get the correct answer. But the NXP PowerQuad-accelerated version does not work the same way.

If the powerquad version doesn't work the same way as the official arm version and doesn't give the correct answer, it shouldn't be called arm_fir_f32 and claim to implement the standard CMSIS-DSP interface.

mzielinski · ‎05-06-2021

Hi again,

I wanted to add some more info. I took a look at the current standard implementations of the ARM CMSIS FIR functions available on github. These do not use any specialized hardware or coprocessors. I took a look specifically at arm_fir_init_f32 and arm_fir_f32 available here:

https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/DSP/Source/FilteringFunctions/arm_fir_ini...
https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/DSP/Source/FilteringFunctions/arm_fir_f32...

I added the code here manually into the SDK library in my example project as arm_fir_init_f32_native and arm_fir_f32_native. This snippet from my example project shows there is a definite difference in the functionality of the native ARM and accelerated NXP implementations:

/* Float FIR */
static void arm_fir_f32Example(void)
{
    float32_t dataForFIR1[FIR_INPUT_LEN/2];
    float32_t dataForFIR2[FIR_INPUT_LEN/2];
    float32_t taps[FIR_TAP_LEN];
    float32_t FIRRef[FIR_INPUT_LEN];
    float32_t FIRResult[FIR_INPUT_LEN] = {0};
    float32_t state[FIR_INPUT_LEN + FIR_TAP_LEN - 1];
    arm_fir_instance_f32 fir;
    uint32_t i;

    for (i = 0; i < FIR_INPUT_LEN/2; i++)
    {
        dataForFIR1[i] = firInput[i] * 100.0f;
    }
    for (i = 0; i < FIR_INPUT_LEN/2; i++)
    {
        dataForFIR2[i] = firInput[i + FIR_INPUT_LEN/2] * 100.0f;
    }

    for (i = 0; i < FIR_TAP_LEN; i++)
    {
        taps[i] = firTaps[i] * 100.0f;
    }

    for (i = 0; i < FIR_INPUT_LEN; i++)
    {
        FIRRef[i] = firRef[i] * 10000.0f;
    }

    memset(FIRResult, 0, sizeof(FIRResult));
    memset(state, 0, sizeof(state));

    arm_fir_init_f32_native(&fir, FIR_TAP_LEN, taps, state, FIR_INPUT_LEN/2);
    arm_fir_f32_native(&fir, dataForFIR1, &FIRResult[0],               FIR_INPUT_LEN/2);
    arm_fir_f32_native(&fir, dataForFIR2, &FIRResult[FIR_INPUT_LEN/2], FIR_INPUT_LEN/2);

    for (i = 0; i < ARRAY_SIZE(FIRRef); i++)
    {
/*
 * This assert succeeds. The result matches the reference.
 */
        EXAMPLE_ASSERT_TRUE(fabs(FIRRef[i] - FIRResult[i]) < 0.0001);
    }

    memset(FIRResult, 0, sizeof(FIRResult));
    memset(state, 0, sizeof(state));

    arm_fir_init_f32(&fir, FIR_TAP_LEN, taps, state, FIR_INPUT_LEN/2);
    arm_fir_f32(&fir, dataForFIR1, &FIRResult[0],               FIR_INPUT_LEN/2);
    arm_fir_f32(&fir, dataForFIR2, &FIRResult[FIR_INPUT_LEN/2], FIR_INPUT_LEN/2);

    for (i = 0; i < ARRAY_SIZE(FIRRef); i++)
    {
/*
 * This assert fails. The first FIR_INPUT_LEN/2 entries (calculated by the first
 * arm_fir_f32 call) do match, but the next FIR_INPUT_LEN/2 entries (calculated by
 * the second arm_fir_f32 call) do not match.
 */
        EXAMPLE_ASSERT_TRUE(fabs(FIRRef[i] - FIRResult[i]) < 0.0001);
    }
}

Based on this, I would dare to say that the current PowerQuad-based FIR implementation in the NXP SDK is not actually compliant with ARM CMSIS. Assuming I'm not making a silly mistake, I would like to see this acknowledged as a bug in the SDK. I'm attaching my entire project below.

Thanks,

-Mike

james_fan · ‎05-18-2021

Hi Mzielinski

My test conclusion is that, for an incremental FIR calculation, PowerQuad needs the address space that pSrc points into to be contiguous. pSrc here is the first argument of this function:

static void _arm_fir_increment(const void *pSrc,
                               uint32_t srcLen,
                               const void *pTap,
                               uint16_t tapLen,
                               void *pDst,
                               uint32_t offset,
                               uint32_t elemSize)
{
    POWERQUAD->INABASE = ((uint32_t)(const uint32_t *)pSrc) + (offset * elemSize);
    POWERQUAD->INBBASE = (uint32_t)(const uint32_t *)pTap;
    POWERQUAD->LENGTH  = (((uint32_t)tapLen & 0xFFFFUL) << 16U) + (srcLen & 0xFFFFUL);
    POWERQUAD->OUTBASE = ((uint32_t)(uint32_t *)pDst) - (offset * elemSize);
    POWERQUAD->MISC    = offset;
    POWERQUAD->CONTROL = (CP_FIR << 4U) | PQ_FIR_INCREMENTAL;
}

To prove this, I made a test case based on your attachment and changed it as follows:

1. Added float32_t dataForFIR0[FIR_INPUT_LEN / 2];

2. The contents of dataForFIR0 and dataForFIR2 are the same.

3. Changed POWERQUAD->INABASE = ((uint32_t)(const uint32_t *)pSrc) - (offset * elemSize); to POWERQUAD->INABASE = ((uint32_t)(const uint32_t *)pSrc) + (offset * elemSize);

In this way, the address and content of POWERQUAD->INABASE become continuous. The result of this calculation is correct.

Therefore, the conclusion is that when PowerQuad is doing an incremental FIR calculation, the memory addressed through pSrc needs to be contiguous; otherwise the calculation result is wrong. This appears to be a hardware limitation of PowerQuad. In addition, as you said, the FIR calculation interface provided for PowerQuad cannot be claimed to be consistent with CMSIS-DSP.

I'm not sure whether there is a workaround at the software level. It seems difficult. In any case, I will create a Jira ticket for the SDK.
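One generic software approach that might be worth trying is an overlap-save style staging buffer: keep the last (tapLen - 1) input samples directly in front of each new block so the filter always sees contiguous input, run a one-shot (non-incremental) FIR over the whole staged region, and discard the first (tapLen - 1) outputs. The sketch below shows the idea in plain C only. It has not been run on PowerQuad, and the thread does not confirm that a one-shot PowerQuad FIR call can stand in for one_shot_fir here. All names (stage_fir_t, stage_fir_init, stage_fir_block, one_shot_fir, STAGE_MAX_TAPS, STAGE_MAX_BLOCK) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STAGE_MAX_TAPS  32u  /* hypothetical limits for this sketch */
#define STAGE_MAX_BLOCK 256u
#define STAGE_LEN (STAGE_MAX_TAPS - 1u + STAGE_MAX_BLOCK)

/* Stand-in for any one-shot FIR that treats input before in[0] as zero. */
void one_shot_fir(const float *taps, size_t num_taps,
                  const float *in, float *out, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < num_taps && k <= n; k++) {
            acc += taps[k] * in[n - k];
        }
        out[n] = acc;
    }
}

typedef struct {
    size_t num_taps;
    float stage[STAGE_LEN];   /* [history | new block], contiguous */
    float scratch[STAGE_LEN]; /* raw filter output, including transient */
} stage_fir_t;

void stage_fir_init(stage_fir_t *s, size_t num_taps)
{
    assert(num_taps >= 1 && num_taps <= STAGE_MAX_TAPS);
    s->num_taps = num_taps;
    memset(s->stage, 0, sizeof(s->stage));
}

void stage_fir_block(stage_fir_t *s, const float *taps,
                     const float *in, float *out, size_t len)
{
    const size_t hist_len = s->num_taps - 1;
    assert(len <= STAGE_MAX_BLOCK);

    /* Place the new block right after the saved history. */
    memcpy(s->stage + hist_len, in, len * sizeof(float));

    /* Filter the staged region from a zero state. */
    one_shot_fir(taps, s->num_taps, s->stage, s->scratch, hist_len + len);

    /* The first hist_len outputs are start-up transient; drop them. */
    memcpy(out, s->scratch + hist_len, len * sizeof(float));

    /* The newest hist_len staged samples become the next history. */
    memmove(s->stage, s->stage + len, hist_len * sizeof(float));
}
```

The cost is one extra copy of each block plus (tapLen - 1) discarded outputs per call, so it only makes sense when the block length is much larger than the tap count.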

Attachment is my code example.

James Fan
