Skip navigation

LPC Microcontrollers

11 Posts authored by: Eli Hughes Employee

I wanted to take a quick break from some of the PowerQuad articles to show off a neat library that works well with the LPC55S69.      One of the design features of the Mini-Monkey experiment was a 240x240 Pixel IPS display.     I feel that the LPC55S69 is a good fit for small, low active power embedded graphics applications.  It has quite a bit of internal SRAM to store a framebuffer and has lots of processing power to composite a scene on a small display.  In some of my previous articles,  we use this display to show static images as well as displaying time series data from a built in MEMs microphone.   I ran across a twitter user “The Performance Whisper” who had recently released a lightweight and efficient  animated GIF decoder.     I *really* wanted to give this library a try and decided to port it to the Mini-Monkey.


Here it is in action:




You can find the original library here:


 You can read more about the origins and design of the GIF library here:


A 'Low Memory' GIF decoder 


While the library is targeting the Arduino platform, the core decoder is written in C and can be easily ported.    My port can be found here:


For this demonstration, I embedded the GIF files in internal flash.   It would be straightforward to add some SPI flash to store larger animations.   The LPC55S69 also has SDIO interfaces so you could also use an SD card or eMMC read files from a file system.  I will have more to say on embedded graphics on the LPC55S69 in the future.   In the meantime, check out these additional LPC55S69 resources.

I had some design updates for “Rev B” of my Mini-Monkey design that I wanted to get in the "queue" for testing.  For the next revision, I wanted to try PCB:NG for the board fabrication and assembly.  PCB:NG is an “on-demand” PCB assembly service focused on turnkey prototypes via simple a web interface.    The pricing looked attractive and it appeared that the Mini-Monkey fit within their standard design rules.    The Mini-Monkey design uses an NXP LPC55S69 microcontroller that is in a 0.5mm pitch VFBGA98 package.   NXP offers guidance on how to use this device with low-cost design rules and I thought this would be a great test for PCB:NG.   I had success with Rev A at Macrofab and thought I would give PCB:NG a shot.


Getting your design uploaded is straightforward with PCB:NG.   You can upload your Gerber files and get a preview of the PCB.  As you move through the process, the web interface will give you an updated price:


Figure 1: PCB:NG Gerber Upload


The online PCB:NG interface includes a Design For Manufacture (DFM) check.  The check is exhaustive and includes all the common DFM rules such as trace width, clearance, drill hits etc.  In my case, I had some features that violated minimum solder mask slivers and copper to board outline clearances.  The online tool allows you to “ignore” DFM violations that may not be an issue.   I was able to look through all the violations and mark which ones were of no concern.


Once the Gerber files are uploaded, you can add your parts and as well as the pick/place data.  The PCB:NG interface will show you part pricing and availability as soon your Bill of Materials (BOM) is uploaded.   You have the option to mark parts as Do Not Place (DNP) if you do not want them populated.   In my case, I had 2 components on the Mini-Monkey BOM (a battery and a display) that I did not include as they required some manual assembly steps that I was going to perform once I had the units in hand.



Figure 2 : PCB:NG BOM Upload


Along with the BOM, you must  upload XYRS placement data.    The XYRS data can be combined in the spreadsheet file used for the BOM.  The PCB:NG viewer will also show you where it thinks all the placements are and can make manual adjustments if necessary.



Figure 3 : PCB:NG Part Placement Interface




I had placed my order on 2020-06-10.   Throughout the process, PCB:NG sent email updates when materials were in house,  when production started, etc.    I did have to send in a note that one of the parts (a MEMs microphone) was sensitive to cleaning  processes.    I received a response the same day noting the exception (PCB:NG uses a no-clean process) and they would add the part to their internal database of exceptions.


I had placed the order when they were in the middle of some equipment upgrades.   When I checked the price a few day ago I found that it was lower ($ 380 vs $496) after the new process upgrades.    I consider the service a huge value given that they handle some potentially difficult parts.   Getting the BGA packages microcontroller and the LGA packaged MEMs soldered professionally was well worth the price.   The boards shipped out 2020-06-29.  It was a bit longer than the published lead time but communication during the process was good.  I think I caught the team in the middle of some equipment upgrades which may have delayed things a bit.   PCB:NG took some extra time to get me photos from the X-tay inspection of the BGA and LGA parts.     Getting these photos was well worth the wait!



Figure 4: LPC55S69 VFBGA98 Post Assembly X-ray - View 1



Figure 5: LPC55S69 VFBGA98 Post Assembly X-ray – View 2



Figure 6: MEMS Microphone (LGA) Post Assembly X-ray


As you can see of the X-ray images, the solder joints were good.   It was also cool seeing the via structures in the PCB and bond wires in the IC packages.   You can even see tiny little via structures in the VFBGA98 package itself.   How did the build turn out?   Here is a video of the Mini-Monkey Rev B:


Mini-Monkey Rev B w/ PCB:NG 


PCB:NG was also kind enough to show the Mini-Monkeys getting setup for placement:


Mini-Monkey Fabrication at PCB:NG 


Final Thoughts


The experience with PCB:NG was excellent.   The boards turned out a great and I was able to test all my changes quickly.    Having someone else handle part procurement and assembly is a huge value to me as it allows me to focus on other aspects of the design such as firmware develop for the board bring-up.    One possible improvement with the online PCB:NG interface would be to be able to submit ODB++ or IPC-2581 data.  These formats bake in more information and could really streamline design upload.       I will certainly be using PCB:NG in the future for my prototypes.    The on-demand model is helpful, especially when you are busy and need to get some help accelerating your development efforts.   


Onward to Revision C!    I think I may add eMMC storage and improve the battery circuit.   If you want to see the current raw design files,   they are available on BitBucket in Altium Designer format.




Test Software: 


I'll be posting more updates on the Mini-Monkey as new revisions are complete.   In the mean time, here are other articles and videos related to the LPC55S69. Cheers!

In my last article, we examined a common time domain filter called the Biquad and how it could be computed using the LPC55S69 PowerQuad engine.        We will now turn our attention to another powerful component of the PowerQuad, the “Transform Engine”.      The PowerQuad transform engine can compute a Fast Fourier Transform (FFT) in both a power and time efficient manner leaving your main CPU cores to handle other tasks.  


Before we look at the implementation on the LPC55S69, I want to illustrate what exactly an FFT does to a signal.   The meaning of the data is often glossed over or even worse yet, explained in purely mathematical terms without a description of *context*.    I often hear descriptions like “transform a signal from the time domain to the frequency domain”.   While these types of descriptions are accurate, I think that many do not get an intuitive feel for what the numbers are *mean*.    I remember my 1st course in ordinary differential equations.  The professor was explaining the Laplace transform (which is more generalize case of the Fourier Transform) and I asked the question “What does s actually mean in a practical use case?”.


Figure 1. Laplace Transform. What is s?


My professor was a brilliant mathematician and a specialist in complex analysis.  He could explain the transform from 3 different perspectives with complete mathematical rigor.       Eventually we both got frustrated and he said “s will be a large number complex number in engineering applications”.   It turned out the answer was simple in terms of the electrical engineering problems we were solving.  After many years and using Laplace again in Acoustic Grad School it made sense but at the time it was magical.       I hope to approach the FFT a bit differently and you will see it is simpler than you may think.     While I cannot address all the aspect of using FFT’s in this article, I hope it gives you a different perspective from a “getting started” perspective.


Rulers, Protractors, and Gauges


One of my favorite activities is wood working.   I am not particularly skilled in the art, but I enjoy using the tools, building useful things, and admiring the beauty of the natural product.  I often tell people that “getting good” at wood working is all about learning how to measure, gauge and build the fixtures to carry out an operation.      When you have a chunk of wood, one of the most fundamental operations is to measure its length against some fixed standard.      Let us begin with a beautiful chunk of Eastern Hemlock:

Figure 2.  A 12” x 3” x 26” Piece of rough sawn eastern hemlock. 


One of the first things we might want to do to with this specimen is use some sort of standard gauge to compare it to:


Figure 3.  Comparing our wood against a reference.


We can pick a standard unit and compare our specimen to a scale based upon that unit.   In my case the unit was “inches” of length, but it could be anything that helps up solve our problem at hand.  Often you want to pick a unit and coordinate system that scales well to the problem at hand.     If we want to measure circular “things”, we might use a protractor as it makes understanding the measurement easier.    The idea is to work in a system that is in the “coordinate system” of your problem.     It makes sense to use a ruler to measure a “rectangular” piece of wood.


What does this have to do with DSP and Fourier transforms?   I hope to show you that a Fourier Transform (and its efficient discrete implementation, the FFT) is simplly just a set of gauges that can be used to understand a time domain signal.   We can then use the PowerQuad hardware to carryout out the “gauging”.  For the sake of this discussion, let us consider a time domain signal such as this:


Figure 4.   An example time domain signal


This particular signal is a bit more complex than the simple sine wave used in previous articles.     How exactly would we “gauge” this signal?   Amplitude?  Frequency?  Compute statistics such as variance?    Most real-world signals can have quite a bit of complexity, especially when they are tied to some physical process.    For example, if we are a looking at the result of some sort of vibration measurement, the signal could look very complicated as there are many physical processes contributing to the shape.   In vibration analysis, the physical “things” we are examining with move and vibrate according to well understood physics.     The physics show that the systems can be modeled with even order differential equations.       This means always be we can write the behavior of the system over time as the sum of sinusoidal oscillations.        So, what would be a good gauge to use to examine our signal?    Well, we could start with a cosine wave at some frequency of interest:


Figure 5.   Gauging our signal against a cosine wave.


Choosing a cosine signal as a reference gauge can simplify the problem as we can easily identify the properties of our unit of measure,  i.e. its frequency and amplitude.      We can fix the amplitude and frequency of our reference and then compare it to our signal.     If we do our math correctly, we can get a number that indicates how well correlated our input signal is to a cosine wave of a particular frequency and a unit amplitude.     So, how exactly do we perform this correlation?     It turns out to be a simple operation.    If we think about the input signal and our reference gauge as discrete arrays of numbers (i.e. vectors),  we compute the dot-product between them:


Figure 6.   Computing the correlation between a test signal and our feference gauge.


The operation is straightforward.    Both your signal input and the “gauge” has the same number of samples.   Multiply the elements of each array  together and add up the results.    Using an array like notation, the input is represented by “x[n]” and the gauge is represented by “re[n]” where n is an index in the array:


Output = x[0]*re[0]  +  x[1]*re[1]  + x [2]*re[2] + .  .  .


What we end up with is a single number (scalar).   Its magnitude is proportional to how well correlated out signal is to the particular gauge we are using.      As a test, you could write some code and use a cosine wave as your input signal.  The test code could adjust the frequency of input and as the frequency of the input gets closer to the frequency of the gauge, the magnitude of the output would go up.


As you can see the math here is just a bunch of multiplies and adds, just like the IIR filter from our last article.     There is one flaw however with this approach.      There is special case of the input where the output will be zero.   If the signal input is a cosine wave of the *exact* frequency as the gauge and  is 90 degrees phase shifted with respect to the reference gauge, we would get a zero output.    


Figure 7.  A special case of our reference gauge that would render zero output.


This is not desirable as we can see that input is correlated our reference gauge, it just is shifted a in time.  There is a simple fix and we can even use our piece of hemlock lumber to illustrate.


Figure 8.  Gauging along a different side of the lumber.


In Figure 3, I showed a ruler along the longest length of the wood.     We can also rotate the ruler and measure along the shorter side. It is the same gauge, just used a different way. Imagine that board was only 1” wide but 24” long.  I could ask an assistant to use a ruler and measure the board. Which of those two numbers is “correct”?    The assistant could report to me either of those numbers and be technically correct.   We humans generally assume length to be the longer side of a rectangular object but there is nothing special about that convention.  In figure 6, we were only measuring along 1 “side” of the signal.   It is possible to get a measurement that is zero (or very small) while have a signal that looks very similar to the gauge (like in figure 7).   We can fix this by “rotating” our ruler similar to figure 8 and measure along the both ”sides” of the signal.


Figure 9.  Using two reference gauges.  One is “rotated” 90 degrees.


In figure 9, I added another “gauge” labeled “A” in purple.  The original gauge is labeled “B”.    The only difference between the two gauges is that B is phase shifted by 90 degrees.   This is equivalent to rotating my ruler in figure 8 and measuring the “width” of my board.     In figure 9, I am showing 3 of the necessary multiply/add operations but you would carry out the multiple/add for all points in the signal. Writing it out:


B = x[0]*Re[0]  +  x[1]*Re[1]  + x [2]*Re[2] +  .  .  .

A = x[0]*Im[0]  +  x[1]*Im[1]  + x [2]*Im[2] +  .  .  .


In this new formulation we get a pair of numbers A,B for our output.   Keep in mind that we are gauging our input against a *single* frequency of reference signals at a unit amplitude.     This is analogous to measuring the length and width of our block of wood.      Another way of thinking about it is that we now have a measuring tool that evaluates along 2 axes which are “orthogonal”.     It is almost like a triangle square.

Figure 10.  A two-axis gauge.


Once we have our values A & B, it is typical to consider them as a single complex number


Output = B + iA


The complex output gives us a relative measure of how we are correlated to our reference gauges. To get a relative amplitude, simply compute the magnitude:


||Output|| = sqrt(A^2 + B^2)


You could even extract the phase:


Phase = arctan(B/A)


It common to think about the output in “polar” form (magnitude/phase).  In vibration applications you typical want understand the magnitude of the energy at different frequency components of a signal. There are applications in communications, such as orthogonal frequency domain multiplexing (OFDM), where you work directly the with real and imaginary components. 


I previously stated that the correlation we were performing is essentially a vector dot product operation.    The dot product shows up in many applications. One of which is dealing with vectors of length 2 where we use the following relationship:



The interesting point here is that the dot product is a simple way of getting a relationship of the angle and magnitude between two vectors and b.    It is easy to think about a and b as vectors on a 2d plane, but the relationship extends to vectors of any length.  For digital data, we work with discrete samples, so we define everything in terms of the dot product.   We are effectively using this operation to compute magnitudes and find angles between “signals”.   In the continuous time world, there is the concept of the inner-product space.   It is the “analog” equivalent of the dot product and underpins the mathematical models for many physical systems.  


At this point we could stop and have a brute force technique of comparing a signal against a single frequency reference.    If we want to determine if a signal had a large component of a particular frequency, we could tailor our reference gauges to the *exact* frequency we are looking for.  The next logical step is to compare our signal against a *range* of reference gauges of different frequencies:


Figure 11: Using a range of reference gauges at different frequencies.


In Figure 11, I show four different reference gauges at frequencies that have an integer multiple relationship. There is no limit to the number of frequencies you could use.     With this technique, we can now generate a “spectrum” of outputs at all the frequencies of interest for a problem.    This operation has a name:  the Discrete Fourier Transform (DFT).    One way of writing the operation is:

Figure 12. The Discrete Fourier Transform (DFT)


N is the number of samples in the input signal.


k is the frequency of the cosine/sine reference gauges.     We can generate a “frequency” spectrum by computing DFT over a range of “k” values.  It is common to use a linear spacing in when selecting the frequencies.   For example, if your sample rate is 48KHz and you are using N=64 samples, it is common (we will see why later) to use 64 reference gauges spaced at (48000/64)Hz apart. 

The “Fast Fourier Transform”


The Fast Fourier Transform is a numerically efficient method of computing the DFT.  It was developed by J. W. Cooley and John Tukey in 1965 as a method of performing the computation with a fewer adds and multiplies as compared to the direct implementation shown in Figure 11.    The development of the FFT was significant as we can do our number crunching much more efficiently by imposing a few restrictions on the input.  There are a few practical constraints that need to be considered when using an FFT implementation


  1. The length of your input must be a power of 2.   i.e. 32, 64, 128, 256.
  2. The “bins” of the output are spaced in frequency by the sample rate of your signal divided by the number of samples in the input.     As an example, if you have a 256-point signal sampled at 48Khz, the array of outputs corresponds to frequencies spaced at 187Hz.      In this case the “bins” would correlate to 0Hz, 187.5Hz, 375 Hz, etc.  You cannot have arbitrary input lengths or arbitrary frequency spacing in the output.
  3. When the input the FFT/DFT are “real numbers” (i.e. samples from an ADC), the array of results exhibits a special symmetry.   Consider an input array of 256 samples.    The FFT result will be 256 complex numbers.   The 2nd half of the output are a “mirror” (complex conjugates) of the 1st half.      This means that for a 256-sample input, you get 128 usable “bins” of information. Each bin has a real and imaginary component.  Using our example in #2, the bins would be aligned to 0Hz, 187.5Hz, 375Hz, all the way up to one half of our sample rate (24KHz).


You can read more details about how the FFT works as well as find plenty of instructional videos on the web. Fundamentally, the algorithm expresses the DFT of signal length N recursively in terms of two DFTs of size N/2.  This process is repeated until you cannot divide the intermediate results any further.   This means you must start with a power of 2 length.  This particular formulation is called the Radix-2 Decimation in Time (DIT) Fast Fourier Transform.  The algorithm gains its speed by re-using the results of intermediate computations to compute multiple DFT outputs.  The PowerQuad uses a formulation called “Radix-8” but the same principles apply.


Using the PowerQuad FFT Engine


The underlying math to a DFT/FFT boils down to multiplies and adds along with some buffer management.   The implementation can be pure software, but this algorithm is a perfect use case for a dedicated coprocessor.    The good news is that once you understand the inputs and outputs of a DFT/FFT, using the PowerQuad is quite simple and you can really accelerate your particular processing task. The best way to get started with using the PowerQuad FFT is to look at the examples in the SDK.     There is an example project called “powerquad_transform” which has examples that test the PowerQuad hardware.


Figure 13.  PowerQuad Transform examples in the MCUXpresso SDK for the LPC55S69


In the file powerquad_transform.c, there are several functions that will test the PowerQuad engine in its different modes.     For now, we are going to focus on the function PQ_RFFTFixed16Example(void).



This example will set up the PowerQuad to accept data in a 16-bit fixed point format.      To test the PowerQuad, a known sequence of input and output data is used to verify results.     The first thing I would like to point out is that the PowerQuad transform engine is used fixed point/integer processing only.   If you need floating point, you will need to convert beforehand.  This is possible with the matrix engine in the PowerQuad.      I personally only every use FFTs with fixed point data most of my source data comes right from analog to digital converter data.       Because of the processing gain of the FFT, I have never seen any benefit of using a floating-point format for FFTs other than some ease of use for the programmer.   Let us look at the buffers used in the example:



Notice that the input data length FILTER_INPUT_LEN (which is 32 samples).   The arrays used to store the outputs are twice the length.     Remember that an FFT will produce the same number of *complex* samples in the output as there are samples for the input.     Since our input sample are real values (scalars) and the outputs have real/imaginary components, it follows that we 2x the length to storage the result.     I stated before that one of implications of the FFT with real valued inputs is that we have a mirror spectrum with complex conjugate pairs.   Focusing on the reference for testing the FFT output in the code:



The 1st pair 100,0 corresponds to the 1st bin which is a “DC” or 0Hz component.  It should always have a “zero” for the imaginary component.    The next bins can be paired up with bins from the opposite end of the data:


76,-50  <->   77,49

29,-62   <->   29, 61

-1, -34   <->  -1,33


These are the complex conjugate pairs exhibiting mirror symmetry.   You can see that they are not quite equal.  We will see why in a moment.  After all the test data in initialized, there is a data structure used to initialize the PowerQuad:


One of the side effects of computing an FFT is that you get gain at every stage of the process.   When using integers, it is possible to get clipping/saturation and the input needs to be downscaled to ensure the signal down not numerically overflow during the FFT process.     The macro FILTER_INPUTA_PRESCALER is set to “5”.    This comes from the length of the input being 32 samples or 2^5.     The core function of the Radix-2 FFT is to keep splitting the input signal in half until you get to a 2-point DFT.  It follows that we need to downscale by 2^5 as we can possible double the intermediate results at each stage in the FFT.       The PowerQuad uses a Radix-8 algorithm, but the need for downscaling is effectively the same.      I believe that some of the inaccuracy we saw in the complex conjugates pairs the test data was from the combination of an input array values that are numerical small and the pre-scale setting.   Note that the pre-scaling is a built in hardware function of the PowerQuad.


The PowerQuad needs an intermediate area to work from.  There is a special 16KB region starting at address 0xe0000000 dedicated to the PowerQuad.   The PowerQuad has a 128-bit interface to this region so it is optimal to use this region for the FFT temporary working area.  You can find more details about this private RAM in AN12292 and   AN12383


Once you configure the PowerQuad, the next step is to tell the PowerQuad the input and result data is stored with the function PQ_transformRFFT().



Notice in the implementation of the function, all that is happening is setting some more configuration registers over the AHB bus and kicking off the PowerQuad with a write to the CONTROL register. In the example code, the CPU blocks until the PowerQuad is finished and then checks the results.    It is important to point out that in your own application,  you do not have to block until the PowerQuad is finished.      You could setup an interrupt handler to flag completion and do other work with the general purpose M33 core.  Like I stated in my article on IIR filtering with the PowerQuad,   the example code is a good place to start but there are many opportunities to optimize your particular algorithm.   Example code tends to include additional logic to check function arguments to make the initial experience better.    Always take the time look through the code to see where you can remove boilerplate that might not be useful.


Parting Thoughts


  • The PowerQuad includes a special engine for computing Fast Fourier Transforms.
  • The FFT is an efficient implementations of the Discrete Fourier Transform.   This process just compares a signal against a known set of reference gauges (Sines and Cosines)
  • The PowerQuad has a private region to do its intermediate work.  Use it for best throughput.
  • Also consider the memory layout and AHB connections of where your input and output data lives.  There may be additional performance gains by making sure you input DSP data is in a RAM block that is on a different port than RAM used in your application for general purpose task.  This can help with contention when different processes are accessing data.   For example, SRAM0–3 are all on different AHB ports.   You might consider locating you input/output data in SRAM3 and having your general-purpose data in SRAM0-2.   Note:   You still need to use 0xE0000000 for the PowerQuad TEMP configuration for its intermediate working area.


At this point you can begin looking through the example transform code.  Also make sure to read through  AN12292 and   AN12383 for more details.       While there are more nuances and details to FFT and “frequency domain” processing, I will save those for future articles.    Next time I hope to show some demos of the PowerQuad FFT performance on the Mini-Monkey and illustrate some other aspect of the PowerQuad.      Until then, check out some of the additional resources below on the LPC55S69.


Additional LPC55S69 resources:

In my last article,   we starting discussing the PowerQuad engine in the LPC55S69 as well as the concept of data in the “time domain”.    Using the Mini-Monkey board, we showed the function of collecting a bucket of data over time.     I chose to use a microphone as a data source as it is easy to visualize and understand.      You can now easily imagine replacing the microphone with *anything* that changes over time.     In this article we are going to look at some common algorithms for processing data in the time domain.  In particular, we will look at the “Dual Biquad IIR” engine in the LPC55S69 PowerQuad. An IIR biquad is a commonly used building block as it is possible to configure the filter for many common filtering use cases.   This article is not intended to review all of the DSP theory behind IIR filter implementations but I do want to highlight some key points and the PowerQuad implementation.


Digital Filtering with Embedded Microcontrollers


When sampling data “live”, one can imagine data being continuously recorded at a known rate.      A time domain filter will accept this input data and output a new signal that is modified in some way.


Figure 1.   Filtering In the Time Domain


The concept here is that the output of the filter is just another time domain signal.   You may choose to do further processing on this new signal or output to a Digital Analog Converter (DAC).      If we are thinking in terms of “sine waves”, a digital filter adjusts the amplitude and phase of the input signal.     As we apply different frequency inputs (or a sum of different frequencies), the filter attenuates or gains to the sinusoidal components.   So, how does one compute a digital filter?   It is quite simple.   Let us start with a simple case.  :

Figure 2.   Sample by Sample Filter Processing using a History of the Input


One operation we perform is to *mix* the most recent input sample with samples we have previously recorded.    The result of this operation is our next *output* sample. The name of this filter configuration is an FIR or Finite Impulse Response filter.      One way to write this algorithm is to use a “c array style” notation and difference equations.


x[n]       The current input

x[n-1]     Our previous input

y[n-2]     An input from 2 sample ago 

y[n]       Our next output


Figure 2 could be written as


y[n] = b0*x[n] + b1*x[n-1]  + b2*x[n-2]


All we are doing is multiplying our input sample and its history by constant coefficients and then adding them up.    We are multiplying then accumulating! The constants b0, b1 and b2 control the frequency response of the filter.    By choosing these numbers correctly, we can attenuate “high” frequencies (low pass filter), attenuate low frequencies (high pass filter), or perform some combination of the two (band pass filter).     We can also use more samples from the input history.    For example, instead of just using the previous 3 samples, one could use 128 samples.      A filter of this type (FIR) can require quite a bit of time history to get precise control over its frequency response.         The code to implement this structure is simple but can be very CPU intensive as you need to do the multiply and adds for *every* sample at your signal sample rate.  


There is an adjustment we can make to figure 2 that can allow for tighter control over our frequency response without having to use a long time history.


Figure 3.   Sample by Sample Filter Processing using a History of the Input and Output


The key difference between figure 2 and figure 3 is that we can also mix in previous filter *outputs* to generate the output signal.     Adding this “feedback” can yield some interesting properties and is the root of another class of digital filters called IIR (Infinite Impulse Response filters).    


y[n] = b0*x[n] + b1*x[n-1]  + b2*x[n-2]  + a1*y[n-1] + a2*y[n-2]


One of the primary advantages of this approach that you need fewer coefficients than an FIR filter structure to get a desired frequency response.   There are always trade-offs when using IIR filters vs. FIR filters so be sure to read up on the differences.    The example I showed in figure 3 is called a “biquad”.     A biquad filter is a common filter building block that can be easily cascaded to construct larger filters.       There are several reasons to use a biquad structure, one of which being that there are many design tools that can generate the coefficients for all of the common use cases.      Several years ago, I built a tool around a set of design equations that were useful for audio filtering.


Figure 4.  An IIR Biquad Filter Design Tool.


At the time I made the tool shown in figure 4, I was using biquad filter structures for tone controls on a guitar effects processor.     The frequency and phase response plots where designed to show frequencies of interest of an electric guitar pickup.  There are lots of options for coming up with coefficients and numerous libraries to help.  For example, you could use Python:


In my guitar effects project, I embedded the filter design equations in my C code so I could recompute coefficients dynamically!


Using the PowerQuad IIR Biquad Engines


The PowerQuad in the LPC55S69 has dedicated hardware to compute IIR biquad filters.      Like an FIR filter, the actual code to implement a biquad filter is straightforward.  An IIR filter may be simple to code but can use quite a bit of CPU time to crunch through all the multiply and accumulate operations.     The PowerQuad is available to free up the CPU from performing the core computational component of the biquad computation.       A good starting point for using the PowerQuad IIR biquad engine is to use the MCUXpresso SDK.     It is important to note that the SDK will be a starting point.  The SDK code is written to cover as many use cases as possible and to demonstrate the different functions of the PowerQuad.  It can be helpful to read through the source code and decide which pieces you need to extract for your own application.   DSP code often requires some hand tuning and optimization for a particular use case.    The PowerQuad is connected via the AHB bus and the Cortex-M33 co-processor interface.  Let’s take a look at the SDK source code to see how you the IIR engine works.


Using the “Import SDK Examples” wizard in MCUXpresso, you will find PowerQuad examples under driver_examples PowerQuad


Figure 5.  Selecting the PowerQuad Digital Filter Example


The powerquad_filter project has quite a few examples of the different filter configurations.  We are going to focus on a floating point biquad example as a starting point.  In the file powerquad_filter.c, there are several test functions that will demonstrate a basic filter setup.   I am using LPC55S69 SDK 2.7.1 and there is function around line 455 (Note the spelling mistake PQ_VectorBiqaudFloatExample).


Figure 6.  Vectorized Floating Point IIR Filter Function


The 1st important point to note is that PowerQuad computes IIR filters using “Direct Form II”.        In the previous figures I showed the filter using “Direct Form I”.      When one is 1st introduced to IIR filters, “Direct Form I” is the natural starting point as it is the clearest and most straightforward implementation.    It is possible however to re-arrange the flow of multiplies and adds and get the same arithmetic result.


Figure 7.  IIR Direct Form II



When using "Direct Form II", we do not need to store history of both inputs and outputs.  Instead, we store an intermediate computation which is labeled v[n].         During the computation of the filter, the intermediate history v[n] must be saved.    We will refer these intermediate values as the filter “state”.  To setup the PowerQuad for IIR filter operation, there are handful of registers on the AHB bus where the state and coefficients are stored. In the SDK examples, the state of the filter is initialized with PQ_BiquadRestoreInternalState().       


Figure 8.   Restoring/Initializing Filter State


Once the PowerQuad IIR engine is initialized,  data samples can be processed through the filter.   Let us take a look at the function PQ_VectorBiqaudDf2F32() in fsl_powerquad_filter.c


Figure 9.   Vectorized IIR Filter Implementation.


This function is designed to process longer blocks of input samples, ideally in multiples 8.       Note that many of the SDK examples are designed make it simple to get started but could be easily tuned to remove operations that may be not applicable in your application code.  For example, the modulo operation to determine if the input block is a multiple of 8 is something that could be easily removed to save CPU time.         In your application, you have complete control over buffer sizes and can easily optimize and remove unnecessary operations.  The actual computation of the filter can be observed in the code block that processes the 1st block of samples.


Figure 10.  Transfering Data to the IIR Engine with the ARM MCR Coprocessor Instruction


Data is transferred to the PowerQuad with the MCR instruction.   This instruction transfers data from an CPU register to an attached co-processor (the PowerQuad in this case).  The PowerQuad does the work of crunching through the Direct Form II IIR structure.    While it take some CPU intervention to move data into the PowerQuad,   the PowerQuad is much more efficient at the multiply and adds for the filter implementation.


To get the result, the MRC instruction is used.   MRC moves data from a co-processor to a CPU register.


Figure 11.  Retrieving the IIR Filter result with the MRC instruction.


Further down in PQ_VectorBiquadDf2F32(), there is assembly code tuned to inject data in blocks of 8 samples.    Looking at PQ_Vector8BiquadDf2F32():


Figure 12.  Vectorized Data Insertion into the PowerQuad.


Notice all the MCR/MRC functions to transfer data in and out of the biquad engine.    All the other instructions are “standard” ARM instructions to get data into the registers that feed coprocessor.  Take some time to run the examples in the SDK.  They are structured to inject a known sequence to verify correct filter operation.    Now that you have seen some the of the internals,  you can use the pieces you need from the SDK to implement your signal processing chain.


Some take-aways


  • The PowerQuad can help accelerate biquad filters. There are 2 separate biquad engines built into the PowerQuad.
  • The PowerQuad IIR functions are configured through registers on the AHB bus and the actual input/output samples transferred through the Cortex M33 coprocessor interface.
  • The SDK samples are a good starting point to see how configure and transfer data to the PowerQuad.  There are optimization opportunities for your particular application so be sure to inspect all of the code.
  • If you need more than two biquad filters, you will need to preserve the “state” of the filter.  This can be a potentially expensive operation if you are constantly saving/restoring state.  In this case you will want to consider processing longer blocks of data.
  • You may not need to save the entire “state” of the filter.   For example, if the filter coefficients are the same for all of the your filters,  all you need to save and restore is v[n].
  • While the PowerQuad can speed up (6x) the core IIR filter processing,  you still need the CPU to setup the PowerQuad and feed in samples.   Consider using one the extra Cortex M33 cores in the LPC55S69 to do your data shuffling.


You now have a head start on performing time domain filtering with the LPC55S69 PowerQuad.    We examined IIR filters, which have lots of applications in audio and sensor signal processing, but the PowerQuad can also accelerate FIR filters.  Next time we are going to dive a litter deeper with some frequency domain processing with the PowerQuad transform engine.      The embedded transform engine can accelerate processing of Fast Fourier Transforms *significantly*. Stay tuned for more embedded signal processing goodness!


Additional LPC55S69 resources:

Built into the LPC55S69 is a powerful coprocessor called the “PowerQuad”.   In this article we are going to introduce the PowerQuad and some interesting use cases.   Over the next several weeks we will look at using some of the different processing elements in the PowerQuad using the “Mini-Monkey” board.   

Figure 1:  NXP PowerQuad Signal Processing Engine


The PowerQuad is a dedicated hardware unit that runs in parallel to the main Cortex M33 cores inside the LPC55S69. By using the PowerQuad to work in parallel to the main CPU,  it is possible to implement sophisticated signal processing algorithms while leaving your main CPU(s) available to do other tasks such as communication and IO.    This is a very import use case in distributed sensor systems and the Industrial Internet of the Things (IIOT).  Over the next several weeks, I am going to show some practical aspects of using the PowerQuad in some various applications.     I feel it is a very good fit for many tightly embedded applications need a combination of the general-purpose processing, IO, and dedicated signal processing while maintain a very low active power profile.


Embedded Systems, Sensors and Signal Processing


Before we get started, I think it is helpful to review some concepts and explain why some of the functions of the PowerQuad are useful.   Even though many engineers may have learned about Digital Signal Processing (DSP)  in college or university, there is often little connection to real hardware and code.    Many introductions to DSP begin with formal explanations (i.e. heavy math!).   While this formalism is important for developing the underlying algorithms, it is easy to get lost when trying to make something work.   As an example, one of core algorithms to many DSP applications is the Fast Fourier Transform.  It can be difficult for one to understand to how use at a black box level software if all you have ever worked with was the mathematical formalism.    Being able to link the formalizing with real application is where real magic can happen!  In these upcoming articles, I will break down what is actually happening in the code so it is a bit easier to use the PowerQuad hardware.


For an overwhelming majority of sensor and industrial IOT applications, we encounter “time series” data.  By time series, all we mean is that we take some sort of measurement at a constant interval and put the recorded data into a bucket.      We might process this data one sample at a time as it comes in or wait to our bucket fills up to a level before working with the information.  A key feature here is that we have some measurement (temperature, pressure, voltage level) that is captured fixed rate.     What we end up with is a data set that spans some amount of “time”.   We do not have infinite resolution in the measurement “amplitude”  nor can we take measurements infinitely fast. For example, if we take voltage readings over time, our “step” size might be 1milli-second with 1milli-volt resolution in our amplitude.  The details of how fast and with how much precision is application dependent.  


Figure 2:   A Time Series Cartoon


In Figure 3,   notice that the "dots" are not connected to indicate that we have a discrete set of data.   Many times we fill in the space between the dots on a chart to get a better visualiztion of the signal but what we have to work with is a discrete bucket of data.


Let’s take a look at an example using the LPC55S69 on the “Mini-Monkey”.  The Mini-Monkey circuit has a digital microphone connected via an I2S interface to the MCU and a 240x240 pixel display connection via SPI.     Using the display, we can visualize the time series (my voice).  As a demonstration,  I grabbed of a bucket of 256 samples from the microphone via the I2S interface and rendered raw time series data on the display.       The microphone on the Mini-Monkey (Knowles Acoustic SPH0645LM4H-B) was setup to output data at a rate of 32KHz.   The resolution in amplitude from this device is 18-bits.   Since my OLED screen is 240 pixels high, I divided down the amplitude of the samples so they would fit. 


Here is an animated .gif of the result:




A video with corresponding audio:



All I am doing is collecting data into "buffer" and then continually displaying the information on the screen.  It is an easy way to visualize what is going on.  Now, instead of a using microphone measuring acoustic pressure, you could sample something else.  A velocity measurement, a voltage signal, etc.   The time series data set is your starting point.    Now it is time to start doing something with the numbers and that is where PowerQuad can help.    Most signal processing algorithms boil down to simple, repetitive operations over arrays of data.   Just about everything can be boiled down to a multiplication and add.  This is why you may have heard quite a bit about multiply and accumulate units (MAC) in DSP engines.  It is a ideal use case for a coprocessor.


The PowerQuad at its core has the logic to handle the most common “building blocks”.    Sometimes when you have a time series, you process the data in a manner to preserves all of the “time information”.  Meaning, the get information out the “signal processing black box” that is still a set of datapoints correlated to some block of time.   They just might be filtered or modified in some way.    For example,  maybe you have a a signal where you want to remove 60Hz noise.    You might consider a digital FIR or IIR filter.   Other times you “transform” your data into information that is “correlated” to something else, such as a rate or “frequency”.   We will be exploring both of these application in future articles but the PowerQuad help with both of these use cases. 


LPC55S69 PowerQuad Application - Machine Condition Monitoring


The LPC55S69 can bring in time series data via several interfaces.    In this article I measured acoustic pressure with a digital MEMs microphone over a digital audio port (I2S).  You could also take measurements with the analog to digital converter.    For example, I have a little breakout board for an ADXL1001BCPZ accelerometer I built last year:


Figure 4: ADXL1001BCPZ Accelerometer Board (Left)


This ADXL1001BCPZ is high bandwidth accelerometer useful for machine monitoring and vibration analysis applications.   Many common MEMS accelerometers do not have a high enough bandwidth to capture all the dynamic information in a vibrating system.    The -3dB bandwidth of the ADXL1001 stretches to 11KHZ, making it ideal of vibration problems.  Low-cost accelerometers used for simple motion detection and orientation have a very low bandwidth and may not be able to capture the dynamics you are looking for in a vibration application.   Furthermore, many of the MEMs device that can measure in multiple axis do not have the same bandwidth and noise performance on all axes.    We can use the internal ADC in the LPC55S69 to sample the accelerometer over time and build up a time series to understand how something is vibrating.    While microphones can pick up sound traveling in air, accelerometers can be used to understand sound traveling through a physical structure.  Using signal processing techniques, we even combine information from multiple sensors (measuring the same thing in different ways) to better understand a problem.


In the neck of the woods where I grew up, there were lots of experienced auto mechanics who could quickly identify problems without even opening the hood. The first method to debug a problem was to take the car for a drive or start the motor and “listen”.   Many of these individuals were well trained could know exactly what an issue was is simply by listening.    All mechanical systems vibrate.    *How* they vibrate is dependent on their size, shape, material properties, and operating conditions.     These mechanical vibrations couple to the air and we can “hear” what is going on.     If you have some situational awareness of the mechanical system, you know how something *should* sound when the system is operating normally.      If a component starts failing, the mechanical system changes and it will vibrate differently.     Because the “boundary conditions” of the system changed, the nature of the sound produce changes.       We can instrument the machine with sensors, say an accelerometer, and capture the time series.    Using some math (DSP) and our a-priori knowledge of how the system is supposed to behave, it is possible to see predict failure before it occurs.


Our global industry is driven by large and expensive electro-mechanical machines.    All the things we consider essential for life, say Oreo cookies and toilet paper, are produced in large factories with large, high dollar value processes.    It makes absolute sense to automate the measurement and analysis of high value machines in as the money saved from unplanned downtime is incredible.       The LPC55S69 can be a good fit for many “smart sensor” applications as it can be packed in tight spaces, consume little power and be able to do a 1st level data reduction at the sensor.   Instead of transmiting large amounts of data from a system, the LPC55S69 can allow for significant signal processing to reduce a complex time series into other metrics that can be analyzed at an enterprise level to determine if a failure will occur.     The LPC55S69 with the PowerQuad is a great fit for the Industrial IOT. 


LPC55S69 PowerQuad Application - Power Line Communications and Metering


A completely different but interesting use-case for the LPC55S69 PowerQuad is Power Line Communications (PLC).   There are many sensor applications where you need to transmit and receive data, but you only have access to DC or AC power lines.     Many new smart meters attached to you home employ this technology.   PLC uses sophisticated techniques such as Orthogonal Frequency Division Multiplexing (OFDM) to transmit data on a power line.    OFDM is an interesting technique as it allows you to send data bits down a communications channel *in parallel* across several frequency bands.     It is tolerant to noise as you can achieve high bit rates by using many parallel channels/bands where each band contains slowly moving data.


A core requirement of any OFDM solution is being able to compute Fast Fourier Transforms (FFT) in real time on an incoming time series.     If you can efficiently compute an FFT, it is straightforward to encode/decode data on both the transmitting and receiving ends of the system.          Using bins of the FFT, data is encoded using the real and imaginary components (amplitude and phase) to make up bits of a data "word".    Once you encode data in the "bins", you can use an inverse FFT to get a time signal to output to a digital to analog converter.     Decoding is essentially figuring out when you signal starts and then using an FFT to get the "bins".   Once you have your frequency bins, you look at amplitude/phase information to reconstruct your data word.


Figure 5:   OFDM Time Series,  Frequency Domain Symbol Spectrum and QAM Symbols.


This is a gross simplification of the OFDM process but accelerators such as the PowerQuad are a key element to making it work     The LPC55S69 is well suited to this particular application as most of the complexities of the algorithm could be implemented using the PowerQuad leaving your computational resources (such as the Cortex M33) to implement your metering and measurement application.   All of this can be done while consuming very little active energy in a small package.  At one time, you would have needed a power-hungry IC to perform this process.


Moving Forward with the PowerQuad


I hope you are now interested in some of use cases of the LPC55S69 and the PowerQuad engine.   In the coming articles we can going to dive into some of the different aspects of the PowerQuad engine and demonstrate some processing on the Mini-Monkey platform.    Stay tuned and feel free to check out the LPC55S69.     And in case you missed it, here are some other LPC55S69 blogs/videos:

The Mini-Monkey is now officially “out the door”.   I just sent the files to Macrofab and can’t wait to see the result.   Before I talk a bit about Macrofab, we will look at what going to get built. A few weeks ago, I introduced a design based upon the LPC55S69 in the 7mm VFBGA98.   The goal was to show that this compact package can be used with low cost PCB/Assembly service without having to use the more expensive build specifications. The Mini-Monkey board will also be used to show off some of the neat capabilities of the PowerQuad DSP engine in future design blogs.    Here is what we ended with for the first version:

Figure 1.  Mini-Monkey Revision A



  • Lithium-Polymer battery power with micro-USB Charging
  • High-speed USB 2.0 Interface
  • SWD debug via standard ARM .050” and tag-connect interface
  • Digital MEMs microphone with I2S Interface
  • 240x240 1.54” IPS Display with HS-SPI interface
  • Op-amp buffer for one of the 1MSPS ADC channels
  • 3 push buttons.  One can be used to start the USB ROM bootloader
  • External Power Input
  • 16MHz Crystal
  • 11 dedicated IO pins connected to the LPC55S69.   Functions available:
    • GPIO
    • Dedicated Frequency Measurement Block
    • I2C
    • UART
    • State Configurable Timers (Both input and output)
    • Additional ADC Channels
    • CTIMERs
  • The HS-SPI used for the IPS display is also brought to IO pins


I am a firm believer in not trying to get anything perfect on the 1st try.    It is incredibly inexpensive to prototype ideas quickly so I decided to try to get 90% of what I wanted in the first version.   As we will see, it is inspesive to iterate on this design to work in improvements.    Without too much trouble,    I was able to get everything I wanted on 2 signal layers with filling in a power reference on the top and bottom sides.  If this was a production design, I would probably elect to spend a bit more to get two solid inner reference planes by using a 4-layer design.     Once a design hits QTY 100 or more, the cost of using a 4-layer stack-up can be negligible. A 4-layer stack-up makes the design much easier to execute and compliant with EMI, RFI requirements.      For most of my “industrial” designs where I know that it won’t be high quantity, I always start at 4-layer unless it is a simple connector board.    


For this 1st run, I wasn’t trying to push the envelope with how much I could get done with low cost design rules and a 2-layer stack-up. The VFBGA leaves quite a bit of space for fanning out IO.  Quite a bit can be done on the top layer without vias.      I had a few IO that ended up in more difficult locations, but routing was completely quickly.


Figure 2.  Mini-Monkey VFBGA Fanout


As you can see, I did not make use of all the IO.       If I had used a 4-layer board I would be simpler to get quite a bit more of the IO fanned out.       Moving to smaller vias, traces and a 4-layer stack-up would probably allow one to get all IO’s connected.   For this design,  I was trying to move quickly as well as use the standard “prototype” class specs from Macrofab.    This means 5 mil traces, 10 mil drills with a 4-mil annular ring.  If you can push to 3.5mil trace/space,  NXP AN12581 has some suggestions.


I did want to take a minute to talk about Macrofab.     I normally employ the services of a local contract manufacturer but this time I elected to this online service a try.     After going through the order process, I must say I was thoroughly impressed!       The 1st step is to upload your PCB design files.  I use Altium Designer PCB package and Macrofab recommends uploading in OBD++ format.   Since this format has quite a bit more meta-data baked than standard Gerbers, the online software can infer quite a bit about your design.


Figure 3.  Macrofab PCB Upload


The Macrofab software gives you a cool preview of your PCB with a paste mask out of the gate.  Note that this design is using red solder mask as that is what is included in the prototype class service.  Once you have all the PCB imported, you can now upload a Bill of Materials (BOM).

Figure 3.  Macrofab BOM Upload


Macrofab provides clear guidance on how to get your BOM formatted for maximum success.      Once the BOM is uploaded, the online tool searches distributors and you can select what parts you want to use.   The tool also allow one to  leave items as Do No Place (DNP).       I was impressed that it found almost everything I wanted out of the box.   Pricing and lead time are transparent.


Next up is part placement:


Figure 4.  Macrofab Part Placement


Using the ODB++ data, the Macrofab software was able to figure out my placements.   I was thoroughly impressed with this step as it was completely automatic.      The tool allows you to nudge components if needed.    Once placements are approved, the tool will give you a snapshot of the costs.



 Figure 5.  Cost Analysis and Ordering


What I liked here was how transparent the process was.    Using the prototype class service, a single board was $152.  This is an absolute steal when you consider that all the of the setup costs, parts and PCBs are baked in. If you consider the value of your time, this is an absolute no brainer.    I also like that it gives you a cost curve for low volume production.      In the future, I am going to have a hard time using another service that can’t give me much data with so little work.        


I ended up ordering 3 prototype units.  Total cost plus 2-day UPS shipping was $465.67.      Note, I did end up leaving one part off the board for now:  the 1.54” IPS display.     This part requires some extra “monkeying” around as it is hot bar soldered and needs some 2-sided tape.    I decided to solder the 1st three prototypes on my bench to get a better feel for the process of using this display.  However, I am more than happy to push the BGA and SMT assembly off to someone else.


It looks like board are going to ship on the 1st of May.  I’ll post a video and update when they come in.  So far, the experience with Macrofab has been quite positive and I am eager to see the results.  Once I get the design up and running, I’ll post documentation to bitbucket.

In part two in this series on designing with the LPC55S69 VFBGA98 package,  I am going to show you how to use the NXP MCUXpresso SDK tools to help with physical design process.    Combining some features in MCUXpresso with my PCB tool of choice, Altium Designer, I can significantly reduce the time in the CAD process.


The first step in designing a PCB with a new MCU is to add the part into your component libraries.      Component library management can a source of passionate disagreements between design engineers.      My own view on library management is rooted in many years of making mistakes!  These simple mistakes ultimately caused delays and made projects more difficult than they needed to be.   Often time these mistakes were also driven by a desire to "save time".   Given my experience, there are a few overarching principles I adhere to.


  1. The individual making the component should also be the one who has to stay the weekend and cut traces if a mistake is made. This obviously conflicts of the “librarian/drafter” model but I literally have seen projects where the librarian made a mistake on a 1000+ pin BGA that cost >$5k.  This model was put in a library and marked as “verified”.         The person making the parts needs some skin in the game!     In this case, the drafting teams claimed they had a processing that included a double check but *no one in that process knew they context on how the part was going to be used*.     
  2. Pulling models from the internet or external libraries is OK as a starting point but it is just that,  A starting point. You must treat every pin as if it was wrong and verify. Since many organizations have specific rules on how a part should look,  you will need to massage the model to meet your own needs.   Software engineers shake their head at this rule.  "Why not build on somebody else's libraries?   It is what we do!".     Well,    A mistake in a hardware library can take weeks if not months to really solve....  The cost, time and frustration impact can be huge.   We hardware engineers can't simply "re-compile".   
  3.  I don’t trust any footprint unless I know it has been used in a successful design.  The context of how a part is used is very important (which leads to #4).
  4. I believe the design re-used is best done at a schematic snippet level, not an individual part.   After all,   once I get this Mini-Monkey board complete,  I will never again start with just the LPC55S69.  I want all the “stuff” surrounding the chip that makes it work!


To the casual observer,  these principles seems onerous and time consuming but I have found that the *save me time over the course of the project*.  Making your own parts may seem time consuming but it *does not have to be*.     There are tools that can make your life simpler and the task less arduous.        Also making your own CAD part is  useful for a few other reasons:


  1. You have to go through a mental exercise when looking at each of the pins. It forces you brain to think about functionality in a slightly different way.      When starting with a new part/family, repeated exposure is a very good way to learn.
  2. Looking at the footprint early on gets your brain in a planning mode for when you do get started.


One could argue that this is “lost” time as compared to getting someone else to do the CAD library management it but I really feel strongly that it saves time in the long run.     I have witnessed too many projects sink time into unnecessary debugging due to the bad CAD part creation.   I feel the architect of the design needs to be intimately involved and take ownership of the process.


The LPC55S69 in the VFBGA package has only 98 pins.    With no automation or tools, it would not take all that long build a part right from the datasheet.   However, it is on the edge of being a time consuming endeavor.     Also,   when I build schematic symbols, I tend to label the pins with all possible IO capabilities allowed by the MCU pin mux.  This can make the part quite large but it also helps see what also is available on a pin if I am in in a debug pinch.       Creating pins with all this detail can be quite time consuming.     I use Altium Designer for all of my PCB design and it has some useful automation to make parts more quickly.   NXP’s MCUXpresso tool also has a unique feature that can really help board designers get work done quickly.


Creating the Pin List


Built into MCUXpresso is a pins tool that is *very* useful in large projects with setting up the pin mux’s and doing some advanced planning.    While it is primarily a tool for bootstrapping pin setup for the firmware, It can also use useful to drive the CAD part creation process.       Simply create a new project and start the pins tool:



The pins tools gives you a tabular and physical view of pin assignments.   Very useful when planning your PCB routing.    We will use the export feature to get a list of all the pins, numbers and labels.



The pins tool generates a CSV file that you can bring into your favorite editor. Not only do I get the pin/ball numbers,   I get all of the IO options available via the MCU pin mux. 




Using the Pin List To Generate Component Pins


 With just a few modifications, I can get the spreadsheet into a format useful for the Altium Smart Grid Paste Tool.



Altium Designer requires a few extra columns of meta-data to be able import the data into a grouping of pins in the schematic library editor.   At this point you could group the pins to your personal preference.  I personally like to see all pin function of the schematic but does create rather large symbols.         The good news here is that by using MCUXpresso and Altium you can make this a 10-minute job, not a 3 hour one.  Imagine going through the reference manual line by line!






Viola!  A complete symbol.     It just took a few minutes of massaging to get what I wanted.     Like I stated previously, a 98 pin package is not that bad to do manually but you can imagine a 200 or 300 pin part (such as the i.MX RT!) 


The VFBGA package is 7mmx7mm with a 0.5mm pitch.    There are balls removed from the grid for easier route escaping when use this part with lower cost fabrication processes.



Once again,   with a quick look at NXP documentation and using the Altium IPC footprint generator,   we can make quick work of getting an accurate footprint.




The IPC footprint generator steps you through the entire process.  All you need is the reference drawing.   


A quick note about the IPC footprint tool in this use case.   The NXP VFBGA has quite a few balls removed to allow of easier escaping.     The IPC footprint generator can automatically remove certain regions, I found that this particular arrangement needed a few minutes of hand work to delete the unneeded pads given the unique pattern.


By using Altium and NXP’s MCUXpresso tool together, I was about to get my CAD library work done very quickly.   And because I spent some time with the design tools,   I became more familiar with the IO’s and physical package.   This really helps get the brain primed for the real design work.




At this point in the proces I have a head start on the schematic entry and PCB layout.     Next time we are going to dive in a bit to see what connections we need to bootstrap the LPC55S69 to get it up and running.    We will take a look at some of the core components to get the MCU to boot and some peripheral functions that will help the Mini-Monkey come alive!    

Now that we have discussed the LPC5500 series at a high level and investigated some of the cool features,  it is time to roll up our sleeves work on some real hardware.    In this next series of articles, I want to step through a simple hardware design using the LPC55S69.   We are going to step a bit beyond the application notes and going through a simple design using Altium Designer to implement a simple project.  


Many new projects start with development boards (such as the LPC55S69-EVK) to evaluate a platform and to take a 1st cut at some of the software development work.      Getting to a form-factor compliant state quickly can just as important as the firmware efforts.      Getting a design into a manufacturable form is a very important step in the development process.  With new hardware, I like to address all of my “known unknowns” early in the process so I almost always make my own test PCBs right away.  The LPC5500 series devices are offered in some easy to use QFP100 and QFP64 packages.      Designers also have the option of a very small VFBGA98 package option.     Many engineers flinch when you mention BGA, let alone a “fine pitch” BGA.     I hope to show you that it is not be bad as you may think and one can even route this chip on 2 layers.


Figure 1.  The LPC55S69 VFBGA98 Package. QFP100 comparison on the bottom.


The LPC55S69 is offered at an attractive price but packs a ton of functionality and processing power into a very small form-factor that uses little energy in both the active and sleep cases.     Having all of this processing horsepower in a small form-factor can open new opportunities.  Let’s see what we can get done with this new MCU.


The “Mini-Monkey” Board


In this series of “how to” articles, I want to step through a design with the LPC55S69 in the VFBGA and *actually build something*.   The scope of this design will be limited to some basic design elements of bringing up a LPC55S69 while offering some interesting IO for visualizing signal processing with the PowerQuad hardware.      Several years ago, I posted some projects on the NXP community using the Kinetis FRDM platform.   One of the projects showcased some simple DSP processing on an incoming audio signal.


The “Monkey Listen” project used an NXP K20D50 FRDM board with a custom “shield” that included a microphone and a simple OLED display.       For this effort I wanted to do something similar except using the LPC55S69 in the VFBGA98 package with some beefed-up visualization capabilities.       There is so much more horsepower in the LPC55S69 and we now have the potential to do neat applications such as real time feature detection in an audio signal, etc.        Also given the copious amounts of RAM in the in the LPC55S69, also wanted to step up the game a bit in the display.     The small VFPGA98 package presents with an opportunity to package quite a bit in a small space.  So much has happened since the K20D50 hit the street!


I recently found some absolutely gorgeous IPS displays with a 240x240 pixel resolution from   They are only a few dollars and have a simple SPI interface.  I wired a display to the an LPC55S69-EVK for a quick demonstration:


   Figure 2:  The LPC55S69EVK driving the 240x240 Pixel 1.54” IPS display.


It was difficult for me to capture how beautiful this little 1.54” display is with my camera.  You must see it to believe it!    Given the price I figured I would get a boxful to experiment with for this design project!


Figure 3:   240x240 Pixel 1.54” IPS display from


The overarching design concept with the “mini-monkey” is to fit a circuit under the 1.54” display that uses LPC55S69 with some interesting IO:


  • USB interface
  • LIPO Battery and Charger circuitry
  • Digital MEMs microphone
  • SWD debugging
  • Buttons
  • Access to the on-chip ADC


I want to pack some neat features beneath the screen that can do everything the “Monkey Listen” project can, just better.    With access to the PowerQuad, the sky is the limit on what kinds of audio processing that can be implemented.  The plan is to see how much we can fill up underneath the display to make an interesting development platform.    I started a project in Altium designer and put together a concept view of the new “Mini-Monkey” board to communicate some of the design intent:


Figure 4:  The “Mini-Monkey” Concept PCB based upon the LPC55S69 in the VFBGA98 package


While this is not the final product, I wanted to give you an idea of where I was going.      The “Mini-Monkey” will be a compact form fact board that can be used for some future articles on how to make use of the LPC5500 series PowerQuad feature.   There will be some extra IO made available to enable some cool new projects to showcase the awesome capabilities of the LPC55S69.    Got some ideas for the "Mini-Monkey"?    Leave a comment below!


In the next article we will be looking at the schematic capture phase and how we can use NXP’s MCUXpresso SDK to help automate some of the work required in Altium Designer.     I will be showing some of the basic elements to getting an LPC55S69 design up and running from scratch.      We will then look at designing with the VFBGA98 package and get some boards built.   I hope I now have you interested so stay tuned.   In the meantime, checkout this application note on using the VFBGA package on a 2-layer board:

I recently wrote about the ample processing capabilities built into the LPC55S69 MCU  in addition to the Dual USB capabilities and large banks of RAM.  Now it is time to explore some peripherals and features that are often overlooked in the LPC family but are very beneficial to many embedded system designs.


The State Configurable Timer


An absolute gem in the LPC family is the “State Configurable Timer” (SCT).      It has been implemented in many LPC products and I feel is one of the most under-rated and often misunderstood peripherals.    When I first encountered the SCT, I wrote it off as a “fancy PWM” unit.   This was a mistake on my part as the SCT is an extremely powerful peripheral that can solve many logic and timing challenges.     I have personally been involved in several design efforts where I could remove the need for an additional programmable logic device on a PCB by taking advantage of the SCT in an LPC part.  At its core, the SCT is a up/down counter that can be sequenced with up to 16 events.   The events can be triggered by IO or by one of 16 possible counter matches.   An event can then update a state variable, generate IO activity (set, clear, toggle), or start/stop/reverse the counter.


Consider an example which is similar to a design problem I previously used the SCT for.


Given a 1 cycle wide Start input signal

i.) Assert a PowerCtrl signal on the 3rd Clk cycle after the start.
ii.) After 2 Clk cycles the assertion of PowerCtrl, output exactly 2 pulses on the Tx output pin at a programmable period.
iii.) 5 Clk cycles after ii.), de-assert PowerCtrl
iv.) After 2 Clk cycles of the de-assertion of PowerCtrl, output a 1 cycle pulse to the Complete pin.




This task could be done in pure software if the incoming CLK was slow enough.    Most timer/counter units in competing MCUs would not be able to implement this particular set of requirements       In my use case (an acoustic transmitter), I was able to implement this completely in the SCT with minimal CPU intervention and no external circuitry.     This is a scenario where I might consider an external CPLD or FPGA but the SCT would be more than capable of implementing the behavior.    I highly recommend grabbing the manual for the LPC55 family and read chapter 24.   If you have never used a peripheral like the SCT, I highly recommend learning out about it. 


Programmable Logic Unit


In addition to the SCT, there is a small amount of programmable logic in the LPC55 family.       The PLU is an array of twenty 5-input Look up tables (LUTs) and four flip-flops.    From the external pins of the LPC55xx, there are 6 inputs to the PLU fabric and 8 outputs.     While this is not a large amount of logic, it is certainly enough to replace some external glue logic you might have in your design.  There is even a free tool to draw your logic schematically or describe using the Verilog HDL.



I often find I need a just handful of gates in a design to glue a few things together and the PLU is the perfect peripheral for this need.




Another indispensable feature that has been in the LPC series since the beginning is a bootloader in ROM.   For me, it is a must have as it means I can program/recover code via one of many interfaces without a JTAG/SWD connection.     For factory/production programming and test, it saves quite a bit of hassle.    The boot rom allows device programming over SPI, UART, I2C or UART.   I typically use the UART or USB interface with FlashMagic.     This feature has certainly benefited me on *every* embedded project, especially when it comes to production programming and test.   There have even been some handy times to recover a firmware image in field.     Many designs included some sort of bootloader and having an option that is hard coded in ROM is a great benefit that you get for free in the LPC family.


It is difficult to capture all the benefits of the new LPC55 family, but we hope you are interested.    The LPC55 family is offered many convenient IC packages, is low power (both active and sleep) and is packed with useful peripherals.       The LPC55S69 development board is available at low cost.   Combining the low cost hardware tools with the MCUXpresso SDK, you can start LPC55 development today.   From here we are going to start looking at some interesting how-to’s and application examples with the LPC55 family.   Stay tuned and visit to learn more.

I recently wrote about the ample processing capabilities built into the LPC55S69 MCU. In this article I am going to highlight some very useful IO interfaces and memory.


Dual USB


One killer feature in some of the other LPC parts (for example the LPC4300 series and the LPC54000 series) is the *dual* USB interface. Dual USB enables some very interesting use cases and It is something that sets the LPC portfolio apart from its competitors. For the LPC5500 MCU series, High-Speed USB and Full-Speed USB with on-chip PHY features are fully supported, providing up to 480Mbit/s of speed. Let’s examine a scenario I comonly encounter.


In my projects, I like to have both USB device and USB host capabilities on separate connectors.   Instead of using USB On-the-Go (OTG) with a single connector, it has been my experience the many deeply embedded and industrial projects benefit from separate connectors.  Consider the arrangement in figure 1.




Figure 1:   Dual USB with FAT File System, SDIO and CDC.


On the device side, I almost always implement a mass storage class device along with a communications class device.   The mass storage interface is connected to the SDIO port through the FATFs IO Layer so a PC can access sectors on the  SD card.   FatFS  is my go library for embedded FAT file systems.  It is open source and battle tested.    While I choose to always pull the files from author’s siteMCUXpresso SDK has FatFS  built in.   With this file it can be easily copied between a PC and the LPC5500 system.   Data logging and configuration storage is now built into your application.   The CDC interface can provide a virtual COM port interface to implement a basic shell.     


I use the USB host port for mass storage as well.   Like the SDIO interface, I connect the host drivers (examples in the MCUXpresso SDK) to through FatFS  IO layer so my system can read write files on a thumb drive.       One very useful application in my projects is a secondary bootloader.  There have been several products I have worked on that required field updatability, but the users do not necessarily have access to a PC.   


To update the system, data files and new firmware can be placed on a thumb drive and inserted into the LPC5500 system.   A bootloader can then perform necessary programming to update the internal flash.         In additional firmware updates, the host port could also be used to copy device configuration information.   A technician would just carry a USB “key” to update units.     Having both USB device and host using the two LPC55S69 USB interfaces can unlock many benefits.  


With the SDIO interface and USB host, one is not limited to the more common SD cards and thumb drives.  There are other options for more robust physical interfaces.    Instead of a removable SD card,   a soldered down eMMC can be used.      For the USB host interface, there are rugged “DataKey” options available.    Also note that that the DataKeys come with an SDIO interface as well.




Figure 2:   Rugged Memory Options.   DataKey (Left) and eMMC (Right)


One last tidbit is that the SDIO interface can also be used to connect to many high speed WIFI chipsets.   It is an option that is easy to forget about.


Copious amounts of RAM


While I certainly came up in a time where RAM was sparse, I love having access to a large amount lot of it.    At 360KB of RAM, there is no shortage of RAM in the LPC55S69!      Relating to the USB and file storage application, large RAM buffers can be important for optimizing for transfer speeds.     It is common to write SD cards and thumb drives in 512-byte blocks.       This transfer size however is not always the most optimum case for overall speed.    The controller in the memory cards has to erase internal NAND flash in much larger sector sizes resulting in slow write performs   It has been my experience that queueing up data until I have at least 16KB can improve overall transfer speeds but up to an order of magnitude. In most of my use cases, I implement a software cache of at least 16KB to speed transfer of large files.     Larger caches can yield better results.     These file system caches can consume quite a bit of memory, so it is very helpful that the LPC5500 series has quite a bit of RAM available.


Given the security features of the LPC55S69, the extra RAM can make integration of SSL stacks for IOT applications much simpler.     One example is the use of WolfSSL for implementation of SSL/TLS.  While it targets the embedded space, SSL processing can be complicated and require a significant amount of stack and heap.      In one particular use case I had with an embedded IOT product, I needed 35k of Stack and about 40kB of heap to handle of the edge cases when dealing with connections to the internet over TLS.        The large reserve of RAM in the LPC55S69 easily allows for these larger security and encryption stacks.


Another use for the large memory capability is a graphics back-buffer.     It would be simple to hook a high-resolution IPS to the LPC55S59 and be able to store a complete image back buffer in memory.  For example a 240x240 IPS display with 16-bit color depth would require 112.5KiBytes of RAM!    There is plenty of RAM left in the LPC55S69 for your other tasks.  In fact, you could dedicate one of the CPUs in the LPC55S69 to handling all the graphics rendering.   The copious amount of RAM enables neat applications such as wearables, industrial displays and compact user interfaces.



Figure 3.   A 240x240 IPS Display with SPI Interface from


One other important aspect to the RAM in the LPC55S69 is its organization. It is intelligently segmented (with 272Kb continuous in the memory amp) via a bus matrix to allow the Arm Cortex-M33 cores, PowerQuad, CASPER and DMA engine access to memory with minimal contention between bus masters.



 Figure 4.   LPC55S69 Memory Architecture.


The LPC5500 Series offers a lot in a small, low power package. The large amount of internal SRAM and dual USB interface enables many applications and makes development simpler. Stayed tuned for part 3 of the LPC5500 series overview. I will be further examining some interesting peripherals in the LPC5500 series that set it apart from its competition.


For more information, visit:

Most of my life, programming and embedded microcontrollers has been a passion of mine.  Over the course of my career I have gained experienced on many different architectures including some that are very specialized for specific applications. Even with current diverse market of specialized devices,  I continue to find the general-purpose microcontroller market the most interesting. I believe this stems from how I first fell in love with computing. It can be traced back to the 7th grade when we were learning “Computer Literacy” with the Apple IIe computer. During the course, students learned how to code programs in the BASIC language. Projects spanned everything from simple graphics, printing and games. Simultaneous to that experience, I learned that my other 7th grade passion, playing the Nintendo, was connected to the activities in computer literacy. Through a popular gaming magazine, I discovered that the chip that powered the Nintendo was the device that powered the computers at school, the venerable “6502”. That was the real moment of epiphany. If a CPU could be both a gaming system and a word processor,  it could really *do anything* I wanted. It wasn’t long before I was digging into the intricate details of the 6502 to power my creations. The 6502 was my 1st general purpose CPU.


Fast forward 30 years … The exact same principal applies today. We have an incredible amount of power in small packages. There is a lot you can accomplish with seemly little. I am always on the lookout for new parts that may appear to be “vanilla” on the surface but have some hidden gems that really help me accomplish cool projects. The NXP LPC5500 series really appealed to my sensibilities as I immediately saw features that make it relevant to today’s design challenges. In the coming weeks I want to highlight some features of the LPC5500 series. This is not intended to be an all-encompassing review of the LPC5500 series, but I hope to hit on some highlights that could be beneficial to your design challenges. In this article we are going to focus a bit on the LPC55S69 device and its core platform. There is a lot under the hood!


First – It is actually 4 processors in 1!


From the block diagram in figure 1, one can see that there are two Arm Cortex-M33 cores. This by itself is an extremely useful feature given the low cost and low active power aspects of this device. I have made good use of the other LPC families with asymmetric cores (such as the LPC43xx device with a Cortex-M4 and -M0).  Having a 2nd core is very useful in offloading common tasks. In my experience with the LPC43xx, I used the Cortex-M0 as a dedicated graphics co-processor to offload UI tasks from the Cortex-M4 while was doing other time critical DSP operations.

In the case of the LPC55S69, both cores are Cortex-M33.  The Cortex-M33 is a new offering from ARM based upon the ArmV8-M Instruction set architecture.  Like the Cortex-M4, it has hardware floating point and DSP instructions but also includes TrustZone.  TrustZone enables new security states to ensure your critical code can be protected.    Another notable new feature is a co-processor interface for streamlining integration with dedicated co-processors.   This feature is germane to the LPC5500 series as there are 2 coprocessors that we are about to talk about.   You can learn more about the Cortex-M33 here.  


I can’t count the number of design scenarios where I wished I had an extra programmable CPU that could handle a task that might be extremely time critical but not actually need a lot of code space. For example, I have used OLED displays that have a non-standard I/O interface that needs bit-banged.  It became a great opportunity to have the 2nd core do the work. You could even turn that 2nd core into a small graphics co-processor.


Figure 1.  The LPC55S6x MCU Family Block Diagram


I mentioned four processors. So, where are the 3rd and 4th processors? Number three is hidden in the “DSP accelerator” block. The Cortex-M4 core of which many other LPC microcontrollers are built upon have DSP specific instructions that can accelerate certain math functions. I have given seminars at the Embedded Systems Conference on using the DSP instructions in a general-purpose CPU scenario. The LPC55S69 DSP accelerator (A.K.A . PowerQuad) is a separate core whose sole purpose is to accelerate DSP specific tasks. While PowerQuad is not a pure general purpose CPU, it can perform tasks that would significantly burden one of the Cortex-M33 cores. In many cases you can get a 10x improvement over convention software implements of certain algorithms. PowerQuad covers all the common use cases such as Fast Fourier Transforms (FFTs), IIR filters, convolution, trigonometric functions and matrix math. It has enough “brains” to do almost all the work so your main general purpose CPU(s) are free for other tasks. The PowerQuad is enabled by a very specific new feature in the Cortex-M33 (ARMv8‑M specifically) that allow for coprocessors to be connected to the CPU through a simple interface. Data transfer to the coprocessor is low latency and can sustain a bandwidth of up to twice the memory interface to the processor.


Lastly,   the 4th processor is another specialized core called “CASPER”. CASPER is high performance accelerator that is optimized for cryptographic computations. At its core, CASPER is a dual multiply-accumulate-shift engine that can operate of large blocks of data. CASPER has special access to 2 blocks of RAM so data can be accessed parallel. Applications of CASPER include accelerating cryptographic functions such as public key verification (i.e. TLS/SSL), hash computations or even blockchain. As CASPER is a general math engine, it is also possible to perform DSP operations in parallel with the PowerQuad. With a little bit of imagination, one could achieve quite a bit with minimal intervention from the general-purpose Cortex-M33 cores.


Figure 2.  PowerQuad (Left) and CASPER (right) Accelerators


While the PowerQuad and CASPER processing engines are not technically a 3rd and 4th general purposes cores, they can easily do the work that you might normally require of an entire CPU. We will be talking much more about these features in the future but the key take-away:


The PowerQuad DSP and CASPER accelerators are a powerful math engines that can allow you to number crunch a rate similar to dedicated DSPs. All this while still reserving your generally purpose processors to handle other system tasks.    


All of this functionality is delivered on a low power 40nm process technology packaged in approachable footprints at a low price point. Interested yet?  I know I am!


For more information, visit: