LPSPI bugs: LPSPI_MasterInit replacement

davenadler · ‎03-29-2024

I've started recoding the LPSPI driver, as I've had no response from NXP and it looks like NXP is no longer supporting FSL. So far I've replaced only LPSPI_MasterInit. Still to do: rework the transfer API to remove the unnecessarily long delays between transferred 32-bit words, which have been complained about on this forum but never corrected in FSL. The code below performs all delay calculations at compile time and produces extremely compact code for run-time LPSPI setup.

I'd really appreciate any constructive comments!

Requirements for Efficient Support of LPSPI

correct timing parameters must be generated (unlike buggy FSL LPSPI_MasterInit)
all timing calculations must be done at compile time (timing settings are static; for efficient code size and speed, calculations must not be done at run time).
compile-time calculations must produce complete register values ready to load.
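
One way to make the compiler enforce the second and third requirements is to bind each result to a constexpr object and assert on it, so that any calculation that cannot be done at compile time becomes a build error. This is only a minimal sketch: `cyclesForDelay` and `kInitialDelayCycles` are hypothetical names for illustration and are not part of the driver posted below.

```cpp
#include <cstdint>

// Hypothetical helper (illustration only): round a delay up to whole clock cycles.
constexpr std::uint32_t cyclesForDelay(std::uint32_t delayNsec, std::uint32_t periodNsec) {
    return (delayNsec + periodNsec - 1U) / periodNsec;
}

// A constexpr object must be initialized by a constant expression, so this line
// fails to compile if the calculation cannot be done at compile time.
constexpr std::uint32_t kInitialDelayCycles = cyclesForDelay(100000U, 484U);

// static_assert documents the expected result and re-checks it in every build.
static_assert(kInitialDelayCycles == 207U, "100us at 484ns per cycle rounds up to 207 cycles");
```

The 484ns period and 100us delay are the ND120 figures quoted in the example comments further down.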

Example of Code Using New Initialization Classes

Application code must be simple and efficient. The following compiles to just a handful of instructions (replacing FSL's nested functions, which iterate and still get wrong results):

// example LPSPI setups...
BMP581_times.Set_CCR_and_TCR(LPSPI4, BMP581_TCR_initial.TCR);
// or
ND120_times.Set_CCR_and_TCR(LPSPI4, ND120_TCR_initial.TCR);

The clock calculation code must be simple, readable, and compile-time only. The following generates only 3 words in flash for the CCR class and 1 word for the TCR class (no run-time calculations or calculation code):

// Example constant calculations showing exact timing results of calculated CCR values:
// ND120 has *REALLY SLOW* SPI. Big nuisances:
// - 6us clock period is 166.666kHz clock (i.e. 48us for an 8-bit byte)
// - inter-byte delay of 52us is needed for minimum byte cycle time of 100us
// - 100us delay required from CS assertion to first clock
// - 20us delay required after last clock until de-asserting CS
/// SPI timing information for ND120 pressure sensor
constexpr static LPSPI_timeCalc_T ND120_times({
	.LPSPIrootClockHz=BOARD_BOOTCLOCKRUN_LPSPI_CLK_ROOT,
	.maxClockHz=166666U, // 6us per bit
	.initialDelayNsec=         100000U,
	.delayBetweenTransfersNsec= 52000U, // (100us min byte cycle time - 8*6us) = 52us
	.finalDelayNsec=            20000U
});
// ND120 times: CCR = 0x29ce6a0b, prescaleExponent=5, SPI clock signal=158931Hz
// ...   scaled clock period=484ns, delays=100188ns,52272ns,20328ns
constexpr static LPSI_TCR_T ND120_TCR_initial({
	// Clock SPI mode '01' (CPOL = 0, CPHA = 1)
	.CPOL_SCK_Inactive_High = 0,
	.CPHA_SCK_Capture_trailing_edge = 1,
	.PRESCALE_exponent = ND120_times.prescaleExponent,
	.PCS_number = 0,
	.CONT_Continuous_Transfer = 1,
	.CONTC_Continuing_Command = 0,
	.Frame_Size = 8
});

/// SPI timing information for BMP581 pressure sensor
constexpr static LPSPI_timeCalc_T BMP581_times({
	.LPSPIrootClockHz=BOARD_BOOTCLOCKRUN_LPSPI_CLK_ROOT,
	.maxClockHz=12000000U,
	.initialDelayNsec=40U, // Missing from manual bst-bmp581-ds004.pdf figure 7 / table 18
                           // From Bosch tech support: the minimal T_setup_csb value is 40ns.
	.delayBetweenTransfersNsec=0U,
	.finalDelayNsec=40U	// T_hold_csb = 40ns
});
// BMP581 times: CCR = 0x02020004, prescaleExponent=0, SPI clock signal=11111111Hz
// ...   scaled clock period=15ns, delays=45ns,30ns,45ns
constexpr static LPSI_TCR_T BMP581_TCR_initial({
	// Clock SPI mode '11' (CPOL = 1, CPHA = 1)
	.CPOL_SCK_Inactive_High = 1,
	.CPHA_SCK_Capture_trailing_edge = 1,
	.PRESCALE_exponent = BMP581_times.prescaleExponent,
	.PCS_number = 1, // 1-2-3 for the three BMP581
	.CONT_Continuous_Transfer = 0, // BMP581 IO uses a single frame of 8*nBytes length
	.CONTC_Continuing_Command = 0,
	.Frame_Size = 8 // overridden during IO
});
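
The prescale exponents quoted in the comments above (5 for ND120, 0 for BMP581) can be reproduced in isolation. The sketch below is not the posted class: it assumes a 66 MHz LPSPI root clock (the post does not give the value of BOARD_BOOTCLOCKRUN_LPSPI_CLK_ROOT; 66 MHz is simply consistent with the 15ns and 484ns periods quoted), it uses a plain loop in place of std::countl_zero so it also builds before C++20, and `bitWidth`, `prescaleFor` and `kAssumedRootHz` are hypothetical names.

```cpp
#include <cstdint>

// ASSUMPTION: 66 MHz root clock; the real value comes from the board clock setup.
constexpr std::uint32_t kAssumedRootHz = 66000000U;

// Number of significant bits in v (0 for v == 0); a loop stands in for std::countl_zero.
constexpr std::uint32_t bitWidth(std::uint32_t v) {
    std::uint32_t n = 0U;
    while (v != 0U) { ++n; v >>= 1U; }
    return n;
}

// Smallest prescale exponent that lets the largest cycle count fit in an 8-bit CCR field.
constexpr std::uint32_t prescaleFor(std::uint32_t rootHz, std::uint32_t maxClockHz,
                                    std::uint32_t maxDelayNsec) {
    const std::uint32_t periodNsec  = 1000000000U / rootHz;      // un-scaled clock period
    const std::uint32_t delayCycles = maxDelayNsec / periodNsec; // longest delay in cycles
    const std::uint32_t divisor     = rootHz / maxClockHz;       // SCK divisor at prescale 0
    const std::uint32_t largest     = (delayCycles > divisor) ? delayCycles : divisor;
    const std::uint32_t bits        = bitWidth(largest);
    return (bits > 8U) ? (bits - 8U) : 0U;
}

// ND120: the 100us initial delay dominates (6666 cycles, 13 bits), so the exponent is 5.
static_assert(prescaleFor(kAssumedRootHz, 166666U, 100000U) == 5U, "ND120 prescale");
// BMP581: the clock divisor of 5 already fits in 8 bits, so no prescaling is needed.
static_assert(prescaleFor(kAssumedRootHz, 12000000U, 40U) == 0U, "BMP581 prescale");
// Scaled clock period for ND120, as quoted in the comment above.
static_assert(1000000000U / (kAssumedRootHz >> 5U) == 484U, "ND120 scaled period");
```

Since PRESCALE is a three-bit field (TCR bits 29-27), a further static_assert that the exponent is at most 7 would be a cheap guard for very slow devices.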

Implementation of Compile-Time Constant Calculation Classes

/// \file LPSPI_driver.hpp
/// \brief
/// Classes to aid replacement of severely buggy NXP LPSPI driver for
/// i.MX RT processors.
/// - Compute timing values needed for a specific SPI device at compile-time
/// - Create a TCR value more easily at compile-time
/// - inline function to initialize (or re-initialize) timing for a specific SPI device
/// - Implementation below completely replaces severely buggy LPSPI_MasterInit
///
/// ToDo SPI: Provide replacements for FSL LPSPI transfer API, especially
///   eliminate absurd 10usec delay between 4-byte chunks of BMP581, etc.
/// \author Dave Nadler
/// \copyright MIT License

/*
 * The MIT License (MIT)
 *
 * Copyright (c) 2024 Dave Nadler (www.nadler.com)
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
 *
 */

#include <bit> // std::countl_zero
#include <algorithm> // std::max
#include <stdio.h> // printf (diagnostic only)
#include <stdint.h> // uint32_t

#include "MIMXRT1024.h" // CMSIS-style register definitions
#include "MIMXRT1024_features.h" // CPU specific feature definitions

/// Parameter structure supports named arguments to LPSPI_timeCalc_T ctor below...
struct LPSPI_timeCalc_params {
    uint32_t LPSPIrootClockHz;          ///< root clock for all LPSPI modules as configured in clocks setup
    uint32_t maxClockHz;                ///< maximum transfer rate for this device
    uint32_t initialDelayNsec;          ///< delay from CS to first clock
    uint32_t delayBetweenTransfersNsec; ///< delay between bytes
    uint32_t finalDelayNsec;            ///< delay from last clock to de-asserting CS
};

/// LPSPI_timeCalc_T calculates all timing parameters needed for an SPI device
/// at compile-time (CCR clock control register value and required clock prescale).
/// The **ONLY** thing placed in ROM is 3 words of timing information;
/// no timing calculation code is placed in ROM nor executed at runtime.
/// Small executable members include:
/// - Set_CCR_and_TCR  Write the timing parameters to an LPSPI module, and
/// - Printf           Print diagnostic output for debugging only
class LPSPI_timeCalc_T {
  public:
    uint32_t CCR; ///< register image ready for loading into CCR, containing delays and clock divisor (but not prescaler value)
    uint32_t prescaleExponent; ///< set in TCR, not CCR...
    uint32_t LPSI_clockPeriodNsec; ///< saved for diagnostics only
    constexpr LPSPI_timeCalc_T(const LPSPI_timeCalc_params &p)
    {
        uint32_t baseLPSIclockPeriodNsec = 1000000000U/p.LPSPIrootClockHz; // base clock period before any scaling...
        // All delays and clock divisor in CCR work from prescaled (divided-down) clock...
        // Find maximum required delay in nanoseconds
        uint32_t maxDelayNsec = std::max(std::max(p.initialDelayNsec,p.delayBetweenTransfersNsec),p.finalDelayNsec);
        // Max delay as a multiple of the period of the un-scaled LPSPI root clock
        uint32_t maxDelayCycles = maxDelayNsec / baseLPSIclockPeriodNsec;
        // Clock divisor is also in units of prescaled-clock cycles and required to fit in 8 bits.
        uint32_t clockDivisor = p.LPSPIrootClockHz/p.maxClockHz; // assuming prescale is 0
        // Find maximum value (in units of prescaled-clock cycles) required to fit in 8 bits
        uint32_t maxUnscaledCyclecount = std::max(maxDelayCycles,clockDivisor);
        uint8_t leadingZeroBits = std::countl_zero(maxUnscaledCyclecount);
        // Leading zero bit count of a uint32 must be >= 24 for a value to fit in 8 bits.
        // Otherwise, the number of bits to shift and fit delay in 8 bits is the minimum required exponent.
        prescaleExponent = (leadingZeroBits>=24) ? 0 : (24-leadingZeroBits);
        LPSI_clockPeriodNsec = 1000000000U/(p.LPSPIrootClockHz >> prescaleExponent);
        // precise lambda rounding function ensures minimal delays...
        auto CCRdelayValue = [this](uint32_t delayNsec, uint32_t CCR_offset) {
            uint32_t delay = (delayNsec+LPSI_clockPeriodNsec-1)/LPSI_clockPeriodNsec;
            if(delay>=CCR_offset) delay-=CCR_offset; // Actual delay is 1-2 cycles more than CCR value, but don't go negative
            return delay;
        };
        CCR =
            ((clockDivisor>>prescaleExponent)-1)          << LPSPI_CCR_SCKDIV_SHIFT |  // More precise rounding here could yield faster clock...
            CCRdelayValue(p.delayBetweenTransfersNsec,2)  << LPSPI_CCR_DBT_SHIFT    |  // delay between continuous frames
            CCRdelayValue(p.initialDelayNsec,1)           << LPSPI_CCR_PCSSCK_SHIFT |  // delay from CS to first clock
            CCRdelayValue(p.finalDelayNsec,1)             << LPSPI_CCR_SCKPCS_SHIFT ;  // delay from last clock to de-asserting CS
    }
    /// Set up for a specific SPI device: Set CCR and TCR (disables module first, and re-enables after)
    /// TCR value should be constructed using prescaleExponent
    inline void Set_CCR_and_TCR(LPSPI_Type * pLPSPI, uint32_t TCR_) const {
        pLPSPI->CR &= ~LPSPI_CR_MEN_MASK; // CCR write is not permitted with module enabled
        pLPSPI->CCR = CCR;
        pLPSPI->CR |= LPSPI_CR_MEN_MASK; // re-enable module to put above into effect
        pLPSPI->TCR = TCR_; // don't write TCR with module disabled
    }
    /// Printf member to aid debug (show actual delays and SPI clock values computed)...
    void Printf(const char* label) const {
        // uint32_t values are cast to unsigned long to match the %lx/%lu conversions
        printf("%s times: CCR = 0x%08lx, prescaleExponent=%lu, SPI clock signal=%luHz\n",
                label, (unsigned long)CCR, (unsigned long)prescaleExponent,
                (unsigned long)((1000000000U/LPSI_clockPeriodNsec)/(((CCR&LPSPI_CCR_SCKDIV_MASK)>>LPSPI_CCR_SCKDIV_SHIFT)+2))  );
        printf("...   scaled clock period=%luns, delays=%luns,%luns,%luns \n",
                (unsigned long)LPSI_clockPeriodNsec,
                (unsigned long)((((CCR&LPSPI_CCR_PCSSCK_MASK)>>LPSPI_CCR_PCSSCK_SHIFT)+1)*LPSI_clockPeriodNsec),
                (unsigned long)((((CCR&LPSPI_CCR_DBT_MASK   )>>LPSPI_CCR_DBT_SHIFT   )+2)*LPSI_clockPeriodNsec),
                (unsigned long)((((CCR&LPSPI_CCR_SCKPCS_MASK)>>LPSPI_CCR_SCKPCS_SHIFT)+1)*LPSI_clockPeriodNsec)
        );
    }
};

/// LPSI_TCR_params provides a named-parameter argument list to LPSI_TCR_T ctor below.
/// TCR fields we will never ever ever use are not parameterized.
struct LPSI_TCR_params {
    /// 31  CPOL    Clock Polarity
    ///     The Clock Polarity field is only updated when PCS negated.
    ///     See Figure 43-2.
    ///     - 0b - The inactive state value of SCK is low
    ///     - 1b - The inactive state value of SCK is high
    uint8_t CPOL_SCK_Inactive_High;

    /// 30  CPHA    Clock Phase
    ///     The Clock Phase field is only updated when PCS negated.
    ///     See Figure 43-2.
    ///     - 0b - Captured. Data is captured on the leading edge of SCK and changed on the following edge of SCK
    ///     - 1b - Changed. Data is changed on the leading edge of SCK and captured on the following edge of SCK
    uint8_t CPHA_SCK_Capture_trailing_edge;

    /// 29-27    PRESCALE
    ///     Prescaler Value (exponent; clock is divided by 2^PRESCALE)
    ///     For all SPI bus transfers, the Prescaler value applied to the clock configuration register.
    ///     The Prescaler Value field is only updated when PCS negated.
    uint8_t PRESCALE_exponent;

    // 26 Reserved

    /// 25-24    PCS  Peripheral Chip Select
    ///     Configures the peripheral chip select used for the transfer. The Peripheral Chip Select field is only
    ///     updated when PCS negated.
    ///     - 00b - Transfer using PCS[0]
    ///     - 01b - Transfer using PCS[1]
    ///     - 10b - Transfer using PCS[2]
    ///     - 11b - Transfer using PCS[3]
    uint8_t PCS_number;

    // 23  LSBF  LSB First
    //     - 0b - Data is transferred MSB first
    //     - 1b - Data is transferred LSB first

    // 22  BYSW  Byte Swap
    //     Byte swap swaps the contents of [31:24] with [7:0] and [23:16] with [15:8] for each transmit data word
    //     read from the FIFO and for each received data word stored to the FIFO (or compared with match
    //     registers).
    //     - 0b - Byte swap is disabled
    //     - 1b - Byte swap is enabled

    /// 21  CONT  Continuous Transfer
    ///     - In Master mode, CONT keeps the PCS asserted at the end of the frame size, until a command
    ///       word is received that starts a new frame.
    ///     - In Slave mode, when CONT is enabled, LPSPI only transmits the first FRAMESZ bits; after which
    ///       LPSPI transmits received data (assuming a 32-bit shift register) until the next PCS negation.
    ///     - 0b - Continuous transfer is disabled
    ///     - 1b - Continuous transfer is enabled
    uint8_t CONT_Continuous_Transfer;

    /// 20  CONTC  Continuing Command
    ///     In Master mode, the CONTC bit allows the command word to be changed within a continuous transfer.
    ///     - The initial command word must enable continuous transfer (CONT = 1),
    ///     - the continuing command must set this bit (CONTC = 1),
    ///     - and the continuing command word must be loaded on a frame size boundary.
    ///     For example, if the continuous transfer has a frame size of 64-bits, then a continuing command word
    ///     must be loaded on a 64-bit boundary.
    ///     - 0b - Command word for start of new transfer
    ///     - 1b - Command word for continuing transfer
    uint8_t CONTC_Continuing_Command;

    // 19  RXMSK   Receive Data Mask
    //     When set, receive data is masked (receive data is not stored in receive FIFO).
    //     - 0b - Normal transfer
    //     - 1b - Receive data is masked

    // 18  TXMSK   Transmit Data Mask
    //     When set, transmit data is masked (no data is loaded from transmit FIFO and output pin is tristated).
    //     In Master mode, the TXMSK bit initiates a new transfer which cannot be aborted by another command word;
    //     the TXMSK bit is cleared by hardware at the end of the transfer.
    //     - 0b - Normal transfer
    //     - 1b - Mask transmit data

    // 17-16  WIDTH    Transfer Width
    //     Configures between serial (1-bit) or parallel transfers. For half-duplex parallel transfers, either Receive
    //     Data Mask (RXMSK) or Transmit Data Mask (TXMSK) must be set.
    //     - 00b - 1 bit transfer
    //     - 01b - 2 bit transfer
    //     - 10b - 4 bit transfer
    //     - 11b - Reserved

    // 15-12  Reserved

    /// 11-0  FRAMESZ    Frame Size
    ///     Configures the frame size in number of bits equal to (FRAMESZ + 1).
    ///     - The minimum frame size is 8 bits
    ///     - The minimum word size is 2 bits; a frame size of 33 bits is not supported.
    ///     - If the frame size is larger than 32 bits, then the frame is divided into multiple words of 32-bits;
    ///       each word is loaded from the transmit FIFO and stored in the receive FIFO separately.
    ///     - If the size of the frame is not divisible by 32, then the last load of the transmit FIFO and store of the
    ///       receive FIFO contains the remainder bits. For example, a 72-bit transfer consists of 3 words: the
    ///       1st and 2nd words are 32 bits, and the 3rd word is 8 bits.
    uint16_t Frame_Size; ///< frame size in bits; 16-bit type because FRAMESZ is a 12-bit field
};

/// Construct a constant TCR from a named parameter list
class LPSI_TCR_T {
  public:
    uint32_t TCR;
    constexpr LPSI_TCR_T(const struct LPSI_TCR_params &p) {
      TCR =
        LPSPI_TCR_CPOL(p.CPOL_SCK_Inactive_High)        |
        LPSPI_TCR_CPHA(p.CPHA_SCK_Capture_trailing_edge)|
        LPSPI_TCR_PRESCALE(p.PRESCALE_exponent)         |
        LPSPI_TCR_PCS(p.PCS_number)                     |
        LPSPI_TCR_CONT(p.CONT_Continuous_Transfer)      |
        LPSPI_TCR_CONTC(p.CONTC_Continuing_Command)     |
        LPSPI_TCR_FRAMESZ(p.Frame_Size-1)               ;
    }
};
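
For anyone who wants to sanity-check register images without the NXP headers, the sketch below decodes the ND120 CCR value quoted in the example comments and packs the matching TCR by hand. Treat it as an illustration under stated assumptions rather than part of the driver: the CCR field positions are my reading of the reference manual and should be compared with the LPSPI_CCR_*_SHIFT macros, the TCR bit numbers are the ones in the LPSI_TCR_params comments, the +1/+2 offsets follow the Printf member above, and `ccrField`, `delayNsec` and `packTcr` are hypothetical names.

```cpp
#include <cstdint>

// ASSUMPTION: CCR field positions (SCKDIV 7:0, DBT 15:8, PCSSCK 23:16, SCKPCS 31:24);
// compare with the LPSPI_CCR_*_SHIFT macros in the device header.
constexpr unsigned kSckdivShift = 0U, kDbtShift = 8U, kPcssckShift = 16U, kSckpcsShift = 24U;

// Extract one 8-bit CCR field.
constexpr std::uint32_t ccrField(std::uint32_t ccr, unsigned shift) {
    return (ccr >> shift) & 0xFFU;
}

// Effective delay in ns: field value plus `offset` prescaled cycles, as in Printf.
constexpr std::uint32_t delayNsec(std::uint32_t ccr, unsigned shift,
                                  std::uint32_t offset, std::uint32_t periodNsec) {
    return (ccrField(ccr, shift) + offset) * periodNsec;
}

// ND120 values quoted in the post: CCR = 0x29ce6a0b, scaled clock period = 484ns.
constexpr std::uint32_t kNd120Ccr = 0x29ce6a0bU;
static_assert(delayNsec(kNd120Ccr, kPcssckShift, 1U, 484U) == 100188U, "CS to first clock");
static_assert(delayNsec(kNd120Ccr, kDbtShift,    2U, 484U) ==  52272U, "between transfers");
static_assert(delayNsec(kNd120Ccr, kSckpcsShift, 1U, 484U) ==  20328U, "last clock to CS");
static_assert(ccrField(kNd120Ccr, kSckdivShift) + 2U == 13U, "SCK is prescaled clock / 13");

// Pack a TCR image using the bit numbers from the LPSI_TCR_params comments.
constexpr std::uint32_t packTcr(std::uint32_t cpol, std::uint32_t cpha, std::uint32_t prescale,
                                std::uint32_t pcs, std::uint32_t cont, std::uint32_t contc,
                                std::uint32_t frameBits) {
    return ((cpol & 1U) << 31) | ((cpha & 1U) << 30) | ((prescale & 7U) << 27) |
           ((pcs & 3U) << 24) | ((cont & 1U) << 21) | ((contc & 1U) << 20) |
           ((frameBits - 1U) & 0xFFFU);
}

// ND120 settings from the post: mode 01, prescale 5, PCS0, continuous, 8-bit frames.
static_assert(packTcr(0U, 1U, 5U, 0U, 1U, 0U, 8U) == 0x68200007U, "ND120 TCR image");
```

The same checks can be written against the real classes, for example a static_assert on ND120_times.CCR, which turns the values in the comments into something the build verifies.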

Hope this helps folks!
Best Regards, Dave

jrnTransAct · ‎03-29-2024

I applaud your efforts!
Why C++ though? Not a criticism, I'm genuinely curious about that decision.

I too wish NXP was at times more active in these forums. Do other vendors do a better job at supporting their HALs?

------
Jim Norton
Sr. Firmware Engineer - TransAct Technologies, Inc.

davenadler · ‎03-29-2024

Thanks, unfortunately I'm forced to do this work to work around serious FSL bugs.

C++ because, especially since C++20, you can write relatively 'normal' code and have it evaluated at compile time (so no run-time code). Read the LPSPI_timeCalc_T ctor and you'll see this would be truly painful in C (it can be done with macros, but give it a try and you'll see it's not so easy). Anyway, I've been doing embedded in C++ for close to 30 years!

Many vendors have poor HAL and support; a couple of other vendors are banned at one of our customers because of problems like this. We've already had one customer give up on NXP and another ask me to start looking for alternatives. Buggy stuff like FSL costs serious time and $ and hurts time-to-market. We need to be spending time and $ on our applications and not debugging and rewriting drivers.

Thanks again!
Best Regards, Dave
