In some of my past articles on the PowerQuad, we examined some common signal processing operations (IIR BiQuad and the Fast Fourier Transform) and then showed how to use the PowerQuad DSP engine to accelerate the computations. The matrix engine in the PowerQuad can be used to perform common matrix and vector operations to free the M33 CPU(s) to perform other tasks in parallel. In general, the matrix engine is limited to a maximum size of 16x16 (or 256 elements).
Figure 1: PowerQuad Matrix Operation Maximum Sizes
A simple, but useful, operation that is common in a processing pipeline is the Hadamard (elementwise) product. Think of it as multiplying two signals together. Let us say we have two input vectors/signals that are 1x32 in size:
Figure 2. Hadamard Product
A quick note: because the Hadamard product only needs a single element from each of the inputs to produce each element in the output, the actual shape of the matrix/vector is inconsequential. For example, a 16x16 matrix and a 1x256 vector would yield the same result if the input data is organized the same way in memory.
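To make the shape argument concrete, here is a minimal sketch in plain Python (not PowerQuad code). The helper names `hadamard_flat` and `as_rows` are hypothetical, and plain lists stand in for the memory buffers. It treats the same 256 values once as a 1x256 vector and once as a 16x16 matrix:

```python
# Hypothetical helpers; plain Python lists stand in for PowerQuad buffers.

def hadamard_flat(a, b):
    """Elementwise product of two equal-length flat buffers."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of elements")
    return [x * y for x, y in zip(a, b)]

def as_rows(flat, rows, cols):
    """View a flat row-major buffer as a list of rows."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# 256 elements laid out contiguously, as they would be in memory.
a = [float(i) for i in range(256)]
b = [0.5] * 256

# Treated as a 1x256 vector.
as_vector = hadamard_flat(a, b)

# Treated as 16x16 matrices built from the same memory layout.
a_mat, b_mat = as_rows(a, 16, 16), as_rows(b, 16, 16)
as_matrix = [hadamard_flat(ra, rb) for ra, rb in zip(a_mat, b_mat)]

# Flattening the 16x16 result gives back the 1x256 result.
flattened = [v for row in as_matrix for v in row]
print(flattened == as_vector)
```

The only thing that matters is that both inputs are walked through memory in the same order.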
The cartoon in Figure 2 illustrates a common application of the Hadamard product: windowing of a time domain signal. In my last article, we looked at how Discrete Fourier Transforms are constructed from basic mathematical operations. There was one assumption I made about the nature of the signal we were comparing to our cosine/sine references. Consider the cartoon in Figure 3.
Figure 3: The Rectangular Window as a “Default”.
Let us say we captured 32 samples of a signal via an Analog to Digital Converter (ADC). In the “real” world, that signal existed before and after the 32 point “window” of time. Here is a philosophical question to consider:
Is there any difference between our 32 samples and an infinitely long input multiplied by a “rectangular” window of 1’s around our region of interest?
In fact there is not! The simple act of “grabbing” 32 samples yields us a mathematical entity that is the “product” of some infinitely long input signal and a 32-point rectangle of 1’s (with zeros elsewhere) around our signal. When we consider operations such as the Discrete Fourier Transform, what we are transforming is the input signal multiplied by a window function. Using mathematical properties of the Fourier Transform, it can be shown that this multiplication in the time domain is a convolution in the frequency domain. For a pure tone, that amounts to a “shift” of the window’s Fourier Transform to the frequency of the tone. There is a lot of theory available to explain this effect, but the takeaway is the rectangular window exists by the simple act of grabbing a finite number of samples. One of my pet peeves is seeing literature that refers to the “windowed vs non-windowed transforms”. The Rush song “Free Will” has a memorable lyric:
“If you choose not to decide, you still have made a choice”
By doing nothing, we have selected a rectangular window (which shows up as a sin(x)/x artifact around the frequency bins). While we cannot eliminate the effects of the window, we do have some choice in how the window artifacts are shaped. By multiplying the input signal by a known shape, we can control artifacts in the frequency domain caused by the window. Figure 2 shows a common window called a “Hanning” window.
In the context of the LPC55S69 and the PowerQuad, the matrix engine can be used to apply a different “window” to an input signal. Since applying a window before computing a Fast Fourier Transform is a common operation, consider using the Hadamard product in the PowerQuad to do the work.
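As a rough illustration of why the window choice matters, here is a sketch in plain Python (again, not PowerQuad code). It applies a Hann window with an elementwise product and compares how much energy leaks into a far-away bin. The tone frequency (4.5 bins), the bin numbers and the helper names are assumptions picked for the demonstration:

```python
import math

def hann(n_samples):
    """Symmetric Hann window: 0.5 * (1 - cos(2*pi*n / (N - 1)))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_samples - 1)))
            for n in range(n_samples)]

def dft_magnitude(x, k):
    """Magnitude of a single DFT bin, computed directly with dot products."""
    n_samples = len(x)
    re = sum(v * math.cos(2.0 * math.pi * k * n / n_samples)
             for n, v in enumerate(x))
    im = sum(v * math.sin(2.0 * math.pi * k * n / n_samples)
             for n, v in enumerate(x))
    return math.hypot(re, im)

N = 32
# A tone that lands between bins 4 and 5, so the window artifacts show up.
signal = [math.cos(2.0 * math.pi * 4.5 * n / N) for n in range(N)]

# Windowing is just a Hadamard (elementwise) product.
window = hann(N)
windowed = [s * w for s, w in zip(signal, window)]

# Compare a bin far from the tone against a bin next to the tone.
FAR_BIN, NEAR_BIN = 12, 4
leak_rect = dft_magnitude(signal, FAR_BIN) / dft_magnitude(signal, NEAR_BIN)
leak_hann = dft_magnitude(windowed, FAR_BIN) / dft_magnitude(windowed, NEAR_BIN)

print(f"relative leakage, rectangular: {leak_rect:.5f}")
print(f"relative leakage, Hann:        {leak_hann:.5f}")
```

The Hann-windowed signal leaks far less into the distant bin, at the cost of a wider main lobe around the tone.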
Vector Dot Product
In my last article, I showed that the Discrete Fourier Transform is the dot product between a signal and a Cosine/Sine reference applied to many different frequency bins. I wanted to point this out here as the PowerQuad matrix engine can compute a dot product. While the FFT is certainly the “workhorse” of frequency domain processing, it is not always the best choice for some applications. There are use cases where you may only need to perform frequency domain analysis at a single (or just a few) frequency bins. In this case, directly computing the transform via the dot product may be a better choice.
One constraint of using an FFT is that the bins of the resultant spectrum are spaced as the sample rate over the number of samples. This means the bins may not align to frequencies important to your application. The only knobs you have to adjust are the sample rate and the number of samples (which must be a power of two). There are cases where you may need to align your analysis to an exact number which may not have a convenient relationship to your sample rate. In this case, you could use the dot product operation using the exact frequencies of interest. I have worked on applications that required frequency bins that were logarithmically spaced. In these cases, directly computing the DFT was the best approach to achieve the results we needed.
The FFT certainly has computational advantages for many applications but it is NOT the only method for frequency domain analysis. Speed is not always the primary requirement for some applications so don't automatically think you need an FFT to solve a problem. I wanted to point this out in the context of matrix processing as the PowerQuad could still be used in these scenarios to do the work and keep the main CPU free for general purpose operations.
Also, I do want to mention that in these special cases there are alternate approaches besides the direct computation of the DFT with the dot product such as Goertzel’s method. Even in these cases, you can use features in the PowerQuad to compute the result. In the case of Goertzel’s method, the IIR BiQuad engine would be a great fit.
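Here is a small sketch of that idea in plain Python. The sample rate, the 480-sample length and the octave-spaced frequency list are all made up for illustration. Each analysis frequency costs two dot products, and nothing forces the length to be a power of two:

```python
import math

def single_frequency_dft(x, freq_hz, sample_rate_hz):
    """Correlate x against cosine/sine references at one exact frequency.

    Returns (real, imag); each component is a plain dot product.
    """
    w = 2.0 * math.pi * freq_hz / sample_rate_hz
    cos_ref = [math.cos(w * n) for n in range(len(x))]
    sin_ref = [math.sin(w * n) for n in range(len(x))]
    real = sum(a * b for a, b in zip(x, cos_ref))
    imag = sum(a * b for a, b in zip(x, sin_ref))
    return real, imag

SAMPLE_RATE = 48000.0
N = 480                      # not a power of two: fine for a direct dot product
TONE_HZ = 1000.0
x = [math.sin(2.0 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(N)]

# Hypothetical logarithmically spaced analysis frequencies (one per octave).
freqs = [250.0 * 2 ** i for i in range(5)]   # 250, 500, 1000, 2000, 4000 Hz
mags = [math.hypot(*single_frequency_dft(x, f, SAMPLE_RATE)) for f in freqs]
strongest = freqs[mags.index(max(mags))]

for f, m in zip(freqs, mags):
    print(f"{f:7.1f} Hz -> {m:8.3f}")
```

On the PowerQuad, the two sums inside `single_frequency_dft` are the part that would map onto the matrix engine's dot product.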
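For reference, here is a minimal sketch of the textbook Goertzel recurrence in plain Python. It is only meant to show why a second-order IIR section is a natural fit; it is not PowerQuad code, and the function name and test signal are my own choices:

```python
import math

def goertzel_power(x, k):
    """Squared magnitude of DFT bin k using the Goertzel recurrence.

    The loop body is a second-order recursive filter with one coefficient.
    """
    n_samples = len(x)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n_samples)
    s1 = 0.0   # state delayed by one sample
    s2 = 0.0   # state delayed by two samples
    for sample in x:
        s0 = sample + coeff * s1 - s2
        s2 = s1
        s1 = s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

N = 64
x = [math.cos(2.0 * math.pi * 5 * n / N) for n in range(N)]

power_on_bin = goertzel_power(x, 5)    # the tone sits exactly on bin 5
power_off_bin = goertzel_power(x, 9)   # an unrelated bin
print(power_on_bin, power_off_bin)
```

For a unit-amplitude cosine sitting exactly on a bin, the bin magnitude is N/2, so the power on bin 5 comes out as (N/2)^2.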
There are literally hundreds of applications where you need to efficiently perform matrix multiplication, scaling, inversion, etc. Just keep in mind the PowerQuad can do this work efficiently if the matrix dimensions are of size 16x16 or smaller (9x9 in the case of inversion). One possible application that came to mind was Field Oriented Control (FOC). FOC applications use special matrix transformations to simplify analysis and transform motor currents into a direct/quadrature reference frame:
Another neat application would be to accelerate an embedded graphics application. I was thinking that the PowerQuad Matrix Engine could handle 2D and 3D coordinate transformations that could form the basis for a “mini” vector graphics and polygon rendering capability. When I got started with computing, video games drove my interest in “how things work”. I remember the awe I felt when I first saw games that could rotate shapes on a screen. It "connected" when I found a computer graphics text that showed this matrix equation:
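As a sketch of what those transformations look like, here are the commonly used amplitude-invariant Clarke transform and the Park rotation written as small matrix-vector products in plain Python. The scaling convention and the helper names are my assumptions; your motor control library may use a different convention:

```python
import math

def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

# Clarke transform (amplitude-invariant form): three phases -> alpha/beta.
CLARKE = [[2.0 / 3.0, -1.0 / 3.0, -1.0 / 3.0],
          [0.0, 1.0 / math.sqrt(3.0), -1.0 / math.sqrt(3.0)]]

def park_matrix(theta):
    """Park transform: rotate alpha/beta into the direct/quadrature frame."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s],
            [-s, c]]

def abc_to_dq(i_abc, theta):
    """Chain the two matrix products: abc -> alpha/beta -> d/q."""
    return mat_vec(park_matrix(theta), mat_vec(CLARKE, i_abc))

# Balanced three-phase currents of unit amplitude at electrical angle theta.
theta = 0.7
i_abc = [math.cos(theta),
         math.cos(theta - 2.0 * math.pi / 3.0),
         math.cos(theta + 2.0 * math.pi / 3.0)]

i_d, i_q = abc_to_dq(i_abc, theta)
print(i_d, i_q)
```

With balanced currents aligned to the rotor angle, the rotating three-phase quantities collapse to a constant direct component and a zero quadrature component, which is what makes the control loops easier to reason about.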
Figure 5: 2D Vector Rotation Matrix.
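Here is a tiny sketch of that rotation in plain Python, assuming the usual counter-clockwise form of the 2D rotation matrix. The square and the helper names are just for illustration:

```python
import math

def rotation_matrix(theta):
    """2x2 counter-clockwise rotation matrix for an angle in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s],
            [s, c]]

def transform(matrix, point):
    """Apply a 2x2 matrix to an (x, y) point."""
    return tuple(sum(m * p for m, p in zip(row, point)) for row in matrix)

# Corners of a square centered on the origin.
square = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]

rot90 = rotation_matrix(math.pi / 2.0)
rotated = [transform(rot90, p) for p in square]

for before, after in zip(square, rotated):
    print(before, "->", tuple(round(v, 6) for v in after))
```

Every vertex goes through the same small matrix multiply, which is the kind of repetitive work a matrix coprocessor is good at.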
This opened my mind to many other applications as the magic was now accessible to me. Maybe I am just dreaming a bit but having a hardware co-processor such as the PowerQuad can yield some interesting work!
Getting Started with PowerQuad Matrix Math
Built into the SDK for the LPC55S69 are plenty of PowerQuad examples. “powerquad_matrix” has plenty of examples that exercise the PowerQuad matrix engine.
Figure 6: PowerQuad Matrix Examples in the SDK.
Let us take a quick peek at the vector dot product example:
Figure 7: PowerQuad Vector Dot Product.
As you can see, there is actually very little setup required for the PowerQuad to perform a matrix/vector computation. There are a handful of registers over the AHB bus that need to be configured and then the PowerQuad will do the work. I hope this article got you thinking of some neat applications with the LPC55S69 and the PowerQuad. Next time we are going to wrap up the PowerQuad articles with a neat application demonstration. After that we are going to look at some interesting graphics and IoT applications with the LPC55S69. Stay tuned!
In the meantime, here are all my previous LPC55 articles just in case you missed them.
I wanted to take a quick break from some of the PowerQuad articles to show off a neat library that works well with the LPC55S69. One of the design features of the Mini-Monkey experiment was a 240x240 pixel IPS display. I feel that the LPC55S69 is a good fit for small, low active power embedded graphics applications. It has quite a bit of internal SRAM to store a framebuffer and has lots of processing power to composite a scene on a small display. In some of my previous articles, we used this display to show static images as well as time series data from a built-in MEMS microphone. I ran across a Twitter user “The Performance Whisper” who had recently released a lightweight and efficient animated GIF decoder. I *really* wanted to give this library a try and decided to port it to the Mini-Monkey.
Here it is in action:
For this demonstration, I embedded the GIF files in internal flash. It would be straightforward to add some SPI flash to store larger animations. The LPC55S69 also has SDIO interfaces so you could also use an SD card or eMMC to read files from a file system. I will have more to say on embedded graphics on the LPC55S69 in the future. In the meantime, check out these additional LPC55S69 resources.
I had some design updates for “Rev B” of my Mini-Monkey design that I wanted to get in the "queue" for testing. For the next revision, I wanted to try PCB:NG for the board fabrication and assembly. PCB:NG is an “on-demand” PCB assembly service focused on turnkey prototypes via a simple web interface. The pricing looked attractive and it appeared that the Mini-Monkey fit within their standard design rules. The Mini-Monkey design uses an NXP LPC55S69 microcontroller that is in a 0.5mm pitch VFBGA98 package. NXP offers guidance on how to use this device with low-cost design rules and I thought this would be a great test for PCB:NG. I had success with Rev A at Macrofab and thought I would give PCB:NG a shot.
Getting your design uploaded is straightforward with PCB:NG. You can upload your Gerber files and get a preview of the PCB. As you move through the process, the web interface will give you an updated price:
Figure 1: PCB:NG Gerber Upload
The online PCB:NG interface includes a Design For Manufacture (DFM) check. The check is exhaustive and includes all the common DFM rules such as trace width, clearance, drill hits etc. In my case, I had some features that violated minimum solder mask slivers and copper to board outline clearances. The online tool allows you to “ignore” DFM violations that may not be an issue. I was able to look through all the violations and mark which ones were of no concern.
Once the Gerber files are uploaded, you can add your parts as well as the pick/place data. The PCB:NG interface will show you part pricing and availability as soon as your Bill of Materials (BOM) is uploaded. You have the option to mark parts as Do Not Place (DNP) if you do not want them populated. In my case, I had 2 components on the Mini-Monkey BOM (a battery and a display) that I did not include as they required some manual assembly steps that I was going to perform once I had the units in hand.
Figure 2 : PCB:NG BOM Upload
Along with the BOM, you must upload XYRS placement data. The XYRS data can be combined in the spreadsheet file used for the BOM. The PCB:NG viewer will also show you where it thinks all the placements are, and you can make manual adjustments if necessary.
Figure 3 : PCB:NG Part Placement Interface
I had placed my order on 2020-06-10. Throughout the process, PCB:NG sent email updates when materials were in house, when production started, etc. I did have to send in a note that one of the parts (a MEMS microphone) was sensitive to cleaning processes. I received a response the same day noting the exception (PCB:NG uses a no-clean process) and that they would add the part to their internal database of exceptions.
I had placed the order when they were in the middle of some equipment upgrades. When I checked the price a few days ago I found that it was lower ($380 vs $496) after the new process upgrades. I consider the service a huge value given that they handle some potentially difficult parts. Getting the BGA-packaged microcontroller and the LGA-packaged MEMS microphone soldered professionally was well worth the price. The boards shipped out 2020-06-29. It was a bit longer than the published lead time but communication during the process was good. I think I caught the team in the middle of some equipment upgrades which may have delayed things a bit. PCB:NG took some extra time to get me photos from the X-ray inspection of the BGA and LGA parts. Getting these photos was well worth the wait!
Figure 4: LPC55S69 VFBGA98 Post Assembly X-ray - View 1
Figure 5: LPC55S69 VFBGA98 Post Assembly X-ray – View 2
Figure 6: MEMS Microphone (LGA) Post Assembly X-ray
As you can see from the X-ray images, the solder joints were good. It was also cool seeing the via structures in the PCB and bond wires in the IC packages. You can even see tiny little via structures in the VFBGA98 package itself. How did the build turn out? Here is a video of the Mini-Monkey Rev B:
The experience with PCB:NG was excellent. The boards turned out great and I was able to test all my changes quickly. Having someone else handle part procurement and assembly is a huge value to me as it allows me to focus on other aspects of the design such as firmware development for the board bring-up. One possible improvement with the online PCB:NG interface would be to be able to submit ODB++ or IPC-2581 data. These formats bake in more information and could really streamline design upload. I will certainly be using PCB:NG in the future for my prototypes. The on-demand model is helpful, especially when you are busy and need to get some help accelerating your development efforts.
Onward to Revision C! I think I may add eMMC storage and improve the battery circuit. If you want to see the current raw design files, they are available on BitBucket in Altium Designer format.
In my last article, we examined a common time domain filter called the Biquad and how it could be computed using the LPC55S69 PowerQuad engine. We will now turn our attention to another powerful component of the PowerQuad, the “Transform Engine”. The PowerQuad transform engine can compute a Fast Fourier Transform (FFT) in both a power and time efficient manner leaving your main CPU cores to handle other tasks.
Before we look at the implementation on the LPC55S69, I want to illustrate what exactly an FFT does to a signal. The meaning of the data is often glossed over or, even worse, explained in purely mathematical terms without a description of *context*. I often hear descriptions like “transform a signal from the time domain to the frequency domain”. While these types of descriptions are accurate, I think that many do not get an intuitive feel for what the numbers *mean*. I remember my first course in ordinary differential equations. The professor was explaining the Laplace transform (which is a more generalized case of the Fourier Transform) and I asked the question “What does s actually mean in a practical use case?”.
Figure 1. Laplace Transform. What is s?
My professor was a brilliant mathematician and a specialist in complex analysis. He could explain the transform from 3 different perspectives with complete mathematical rigor. Eventually we both got frustrated and he said “s will be a large complex number in engineering applications”. It turned out the answer was simple in terms of the electrical engineering problems we were solving. After many years and using Laplace again in Acoustic Grad School it made sense but at the time it was magical. I hope to approach the FFT a bit differently and you will see it is simpler than you may think. While I cannot address all the aspects of using FFTs in this article, I hope it gives you a different perspective from a “getting started” point of view.
Rulers, Protractors, and Gauges
One of my favorite activities is wood working. I am not particularly skilled in the art, but I enjoy using the tools, building useful things, and admiring the beauty of the natural product. I often tell people that “getting good” at wood working is all about learning how to measure, gauge and build the fixtures to carry out an operation. When you have a chunk of wood, one of the most fundamental operations is to measure its length against some fixed standard. Let us begin with a beautiful chunk of Eastern Hemlock:
Figure 2. A 12” x 3” x 26” Piece of rough sawn eastern hemlock.
One of the first things we might want to do with this specimen is use some sort of standard gauge to compare it to:
Figure 3. Comparing our wood against a reference.
We can pick a standard unit and compare our specimen to a scale based upon that unit. In my case the unit was “inches” of length, but it could be anything that helps us solve our problem at hand. Often you want to pick a unit and coordinate system that scales well to the problem at hand. If we want to measure circular “things”, we might use a protractor as it makes understanding the measurement easier. The idea is to work in a system that is in the “coordinate system” of your problem. It makes sense to use a ruler to measure a “rectangular” piece of wood.
What does this have to do with DSP and Fourier transforms? I hope to show you that a Fourier Transform (and its efficient discrete implementation, the FFT) is simply a set of gauges that can be used to understand a time domain signal. We can then use the PowerQuad hardware to carry out the “gauging”. For the sake of this discussion, let us consider a time domain signal such as this:
Figure 4. An example time domain signal
This particular signal is a bit more complex than the simple sine wave used in previous articles. How exactly would we “gauge” this signal? Amplitude? Frequency? Compute statistics such as variance? Most real-world signals can have quite a bit of complexity, especially when they are tied to some physical process. For example, if we are looking at the result of some sort of vibration measurement, the signal could look very complicated as there are many physical processes contributing to the shape. In vibration analysis, the physical “things” we are examining will move and vibrate according to well understood physics. The physics show that the systems can be modeled with even order differential equations. This means we can always write the behavior of the system over time as the sum of sinusoidal oscillations. So, what would be a good gauge to use to examine our signal? Well, we could start with a cosine wave at some frequency of interest:
Figure 5. Gauging our signal against a cosine wave.
Choosing a cosine signal as a reference gauge can simplify the problem as we can easily identify the properties of our unit of measure, i.e. its frequency and amplitude. We can fix the amplitude and frequency of our reference and then compare it to our signal. If we do our math correctly, we can get a number that indicates how well correlated our input signal is to a cosine wave of a particular frequency and a unit amplitude. So, how exactly do we perform this correlation? It turns out to be a simple operation. If we think about the input signal and our reference gauge as discrete arrays of numbers (i.e. vectors), we compute the dot-product between them:
Figure 6. Computing the correlation between a test signal and our reference gauge.
The operation is straightforward. Both your signal input and the “gauge” have the same number of samples. Multiply the elements of each array together and add up the results. Using an array like notation, the input is represented by “x[n]” and the gauge is represented by “re[n]” where n is an index in the array:
Output = x[0]*re[0] + x[1]*re[1] + x[2]*re[2] + . . .
What we end up with is a single number (scalar). Its magnitude is proportional to how well correlated our signal is to the particular gauge we are using. As a test, you could write some code and use a cosine wave as your input signal. The test code could adjust the frequency of the input and, when the frequency of the input matches the frequency of the gauge, the magnitude of the output would be at its largest.
As you can see the math here is just a bunch of multiplies and adds, just like the IIR filter from our last article. There is one flaw however with this approach. There is a special case of the input where the output will be zero. If the signal input is a cosine wave of the *exact* frequency as the gauge and is 90 degrees phase shifted with respect to the reference gauge, we would get a zero output.
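Here is one way that test could look in plain Python. The signal length, the gauge frequency (8 cycles per record) and the list of input frequencies are arbitrary choices for the sketch:

```python
import math

def correlate(x, gauge):
    """Dot product: multiply element by element and add up the results."""
    return sum(a * b for a, b in zip(x, gauge))

N = 64
GAUGE_CYCLES = 8.0   # the gauge completes 8 cycles over the 64 samples
gauge = [math.cos(2.0 * math.pi * GAUGE_CYCLES * n / N) for n in range(N)]

# Sweep the input frequency through and around the gauge frequency.
input_cycles = [4.0, 6.0, 7.0, 7.75, 8.0, 8.25, 9.0, 10.0, 12.0]
outputs = []
for cycles in input_cycles:
    x = [math.cos(2.0 * math.pi * cycles * n / N) for n in range(N)]
    outputs.append(abs(correlate(x, gauge)))

best = input_cycles[outputs.index(max(outputs))]
for cycles, out in zip(input_cycles, outputs):
    print(f"{cycles:5.2f} cycles -> {out:7.3f}")
```

The largest output shows up where the input matches the gauge. Inputs that complete a different whole number of cycles come out essentially at zero, because those cosines are orthogonal to the gauge over the record.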
Figure 7. A special case of our reference gauge that would render zero output.
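The special case is easy to reproduce. In this plain Python sketch (the length and frequency are arbitrary), the same cosine gauge is used against an in-phase input and an input shifted by 90 degrees:

```python
import math

N = 64
CYCLES = 8

gauge = [math.cos(2.0 * math.pi * CYCLES * n / N) for n in range(N)]
in_phase = [math.cos(2.0 * math.pi * CYCLES * n / N) for n in range(N)]
# A cosine shifted by 90 degrees is a sine.
shifted = [math.sin(2.0 * math.pi * CYCLES * n / N) for n in range(N)]

out_in_phase = sum(a * b for a, b in zip(in_phase, gauge))
out_shifted = sum(a * b for a, b in zip(shifted, gauge))

print(out_in_phase, out_shifted)
```

Both inputs have the same frequency and amplitude, yet the shifted one reads as (numerically) zero against this single gauge.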
This is not desirable as we can see that the input is correlated to our reference gauge; it is just shifted in time. There is a simple fix and we can even use our piece of hemlock lumber to illustrate.
Figure 8. Gauging along a different side of the lumber.
In Figure 3, I showed a ruler along the longest length of the wood. We can also rotate the ruler and measure along the shorter side. It is the same gauge, just used a different way. Imagine that board was only 1” wide but 24” long. I could ask an assistant to use a ruler and measure the board. Which of those two numbers is “correct”? The assistant could report to me either of those numbers and be technically correct. We humans generally assume length to be the longer side of a rectangular object but there is nothing special about that convention. In figure 6, we were only measuring along 1 “side” of the signal. It is possible to get a measurement that is zero (or very small) while having a signal that looks very similar to the gauge (like in figure 7). We can fix this by “rotating” our ruler similar to figure 8 and measuring along both “sides” of the signal.
Figure 9. Using two reference gauges. One is “rotated” 90 degrees.
In figure 9, I added another “gauge” labeled “A” in purple. The original gauge is labeled “B”. The only difference between the two gauges is that A is phase shifted by 90 degrees relative to B. This is equivalent to rotating my ruler in figure 8 and measuring the “width” of my board. In figure 9, I am showing 3 of the necessary multiply/add operations but you would carry out the multiply/add for all points in the signal. Writing it out:
B = x[0]*Re[0] + x[1]*Re[1] + x[2]*Re[2] + . . .
A = x[0]*Im[0] + x[1]*Im[1] + x[2]*Im[2] + . . .
In this new formulation we get a pair of numbers A,B for our output. Keep in mind that we are gauging our input against a *single* frequency of reference signals at a unit amplitude. This is analogous to measuring the length and width of our block of wood. Another way of thinking about it is that we now have a measuring tool that evaluates along 2 axes which are “orthogonal”. It is almost like a triangle square.
Figure 10. A two-axis gauge.
Once we have our values A & B, it is typical to consider them as a single complex number
Output = B + iA
The complex output gives us a relative measure of how we are correlated to our reference gauges. To get a relative amplitude, simply compute the magnitude:
||Output|| = sqrt(A^2 + B^2)
You could even extract the phase:
Phase = arctan(A/B)
It is common to think about the output in “polar” form (magnitude/phase). In vibration applications you typically want to understand the magnitude of the energy at different frequency components of a signal. There are applications in communications, such as orthogonal frequency-division multiplexing (OFDM), where you work directly with the real and imaginary components.
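To see the two-axis gauge at work, here is a plain Python sketch. It assumes B comes from a cosine gauge and A from a sine gauge, treats the output as the complex number B + iA, and takes the phase as the angle of that number. The length, frequency and list of phase shifts are arbitrary:

```python
import math

def two_axis_gauge(x, cycles):
    """Return (B, A): dot products against a cosine and a sine gauge."""
    n_samples = len(x)
    b = sum(v * math.cos(2.0 * math.pi * cycles * n / n_samples)
            for n, v in enumerate(x))
    a = sum(v * math.sin(2.0 * math.pi * cycles * n / n_samples)
            for n, v in enumerate(x))
    return b, a

N = 64
CYCLES = 8
phase_shifts = [0.0, 0.5, math.pi / 2.0, 2.0]   # radians

results = []
for phi in phase_shifts:
    # Same frequency and amplitude every time; only the phase changes.
    x = [math.cos(2.0 * math.pi * CYCLES * n / N - phi) for n in range(N)]
    b, a = two_axis_gauge(x, CYCLES)
    magnitude = math.hypot(a, b)    # sqrt(A^2 + B^2)
    phase = math.atan2(a, b)        # angle of B + iA
    results.append((magnitude, phase))
    print(f"shift {phi:5.3f} -> magnitude {magnitude:7.3f}, phase {phase:6.3f}")
```

The magnitude stays the same no matter how the input is shifted in time, including the 90 degree case that produced zero with a single gauge, and the phase reports the shift itself.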
I previously stated that the correlation we were performing is essentially a vector dot product operation. The dot product shows up in many applications. One of which is dealing with vectors of length 2 where we use the following relationship:
a · b = ||a|| ||b|| cos(theta)
The interesting point here is that the dot product is a simple way of getting a relationship of the angle and magnitude between two vectors a and b. It is easy to think about a and b as vectors on a 2D plane, but the relationship extends to vectors of any length. For digital data, we work with discrete samples, so we define everything in terms of the dot product. We are effectively using this operation to compute magnitudes and find angles between “signals”. In the continuous time world, there is the concept of the inner-product space. It is the “analog” equivalent of the dot product and underpins the mathematical models for many physical systems.
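As a quick sketch in plain Python, the same function can find the angle between two 2D vectors and between two 64-sample “signals”. It uses the standard identity that the dot product equals the product of the two magnitudes and the cosine of the angle between them; the example vectors are arbitrary:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def angle_between(a, b):
    """Angle in radians from dot(a, b) = |a| * |b| * cos(angle)."""
    cos_theta = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    # Clamp to guard against tiny floating point overshoot.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

# Two vectors on a 2D plane.
angle_2d = math.degrees(angle_between([1.0, 0.0], [1.0, 1.0]))

# Two "signals": a cosine and a sine at the same frequency.
N = 64
cos_sig = [math.cos(2.0 * math.pi * 8 * n / N) for n in range(N)]
sin_sig = [math.sin(2.0 * math.pi * 8 * n / N) for n in range(N)]
angle_signals = math.degrees(angle_between(cos_sig, sin_sig))

print(angle_2d, angle_signals)
```

The cosine and sine references come out at right angles to each other, which is the sense in which the two gauges are “orthogonal”.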
At this point we could stop and have a brute force technique of comparing a signal against a single frequency reference. If we want to determine if a signal had a large component of a particular frequency, we could tailor our reference gauges to the *exact* frequency we are looking for. The next logical step is to compare our signal against a *range* of reference gauges of different frequencies:
Figure 11: Using a range of reference gauges at different frequencies.
In Figure 11, I show four different reference gauges at frequencies that have an integer multiple relationship. There is no limit to the number of frequencies you could use. With this technique, we can now generate a “spectrum” of outputs at all the frequencies of interest for a problem. This operation has a name: the Discrete Fourier Transform (DFT). One way of writing the operation is:
Figure 12. The Discrete Fourier Transform (DFT)
N is the number of samples in the input signal.
k is the frequency of the cosine/sine reference gauges. We can generate a “frequency” spectrum by computing the DFT over a range of “k” values. It is common to use a linear spacing when selecting the frequencies. For example, if your sample rate is 48 kHz and you are using N=64 samples, it is common (we will see why later) to use 64 reference gauges spaced (48000/64) Hz apart.
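Here is a direct (slow) DFT sketch in plain Python using the 48 kHz / 64 sample example. It assumes the common convention with a negative sign in the complex exponent; the equation in the figure may use a different sign or scaling. The 3 kHz test tone is made up:

```python
import cmath
import math

def dft(x):
    """Direct DFT: one complex dot product per output bin k."""
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]

SAMPLE_RATE = 48000.0
N = 64
bin_hz = SAMPLE_RATE / N            # spacing between reference gauges

# A test tone that lands exactly on a gauge frequency (bin 4).
TONE_HZ = 3000.0
x = [math.cos(2.0 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(N)]

spectrum = dft(x)
magnitudes = [abs(v) for v in spectrum[:N // 2]]
peak_bin = magnitudes.index(max(magnitudes))

print(f"bin spacing {bin_hz} Hz, peak at bin {peak_bin} "
      f"({peak_bin * bin_hz} Hz)")
```

Each output bin is the two-gauge measurement from earlier, repeated for a different reference frequency.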
The “Fast Fourier Transform”
The Fast Fourier Transform is a numerically efficient method of computing the DFT. It was developed by J. W. Cooley and John Tukey in 1965 as a method of performing the computation with fewer adds and multiplies as compared to the direct implementation shown in Figure 12. The development of the FFT was significant as we can do our number crunching much more efficiently by imposing a few restrictions on the input. There are a few practical constraints that need to be considered when using an FFT implementation:
1. The length of your input must be a power of 2, i.e. 32, 64, 128, 256.
2. The “bins” of the output are spaced in frequency by the sample rate of your signal divided by the number of samples in the input. As an example, if you have a 256-point signal sampled at 48 kHz, the array of outputs corresponds to frequencies spaced at 187.5 Hz. In this case the “bins” would correlate to 0 Hz, 187.5 Hz, 375 Hz, etc. You cannot have arbitrary input lengths or arbitrary frequency spacing in the output.
3. When the inputs to the FFT/DFT are “real numbers” (i.e. samples from an ADC), the array of results exhibits a special symmetry. Consider an input array of 256 samples. The FFT result will be 256 complex numbers. The 2nd half of the output is a “mirror” (complex conjugates) of the 1st half. This means that for a 256-sample input, you get 128 usable “bins” of information. Each bin has a real and imaginary component. Using our example in #2, the bins would be aligned to 0 Hz, 187.5 Hz, 375 Hz, all the way up toward one half of our sample rate (24 kHz).
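The bin spacing and the mirror symmetry can be checked with a direct DFT in plain Python. This is a slow reference computation, not an FFT, and the integer test pattern is arbitrary:

```python
import cmath
import math

def dft(x):
    """Direct DFT, used here only as a slow reference."""
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]

SAMPLE_RATE = 48000.0
N = 256
bin_hz = SAMPLE_RATE / N
usable_bins = [k * bin_hz for k in range(N // 2)]   # 128 usable bins

# Any real-valued input will do; this is a made-up integer pattern.
x = [float((7 * n) % 13 - 6) for n in range(N)]
spectrum = dft(x)

# Bin k and bin N-k should be complex conjugates of each other.
worst_mismatch = max(abs(spectrum[k] - spectrum[N - k].conjugate())
                     for k in range(1, N // 2))

print(usable_bins[:3], usable_bins[-1])
print("DC imaginary part:", spectrum[0].imag)
print("worst conjugate mismatch:", worst_mismatch)
```

In floating point the mirror is exact to within rounding. A fixed-point engine will show small differences between the pairs.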
You can read more details about how the FFT works as well as find plenty of instructional videos on the web. Fundamentally, the algorithm expresses the DFT of signal length N recursively in terms of two DFTs of size N/2. This process is repeated until you cannot divide the intermediate results any further. This means you must start with a power of 2 length. This particular formulation is called the Radix-2 Decimation in Time (DIT) Fast Fourier Transform. The algorithm gains its speed by re-using the results of intermediate computations to compute multiple DFT outputs. The PowerQuad uses a formulation called “Radix-8” but the same principles apply.
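For the curious, here is a textbook recursive Radix-2 DIT sketch in plain Python, checked against a direct DFT. It illustrates the divide-and-reuse idea only. It is not how the PowerQuad (Radix-8, fixed point) is implemented, and the test input is arbitrary:

```python
import cmath
import math

def fft_radix2(x):
    """Recursive Radix-2 decimation-in-time FFT (length must be 2^m)."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples (size N/2)
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples (size N/2)
    out = [0j] * n
    for k in range(n // 2):
        # Each half-size result is reused for two output bins.
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dft(x):
    """Direct DFT for comparison."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]

x = [float((5 * n) % 11 - 5) for n in range(32)]
fast = fft_radix2(x)
slow = dft(x)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
print("largest difference between FFT and direct DFT:", max_error)
```

The two `out[...]` lines are where the savings come from: one multiply feeds two outputs at every level of the recursion.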
Using the PowerQuad FFT Engine
The underlying math to a DFT/FFT boils down to multiplies and adds along with some buffer management. The implementation can be pure software, but this algorithm is a perfect use case for a dedicated coprocessor. The good news is that once you understand the inputs and outputs of a DFT/FFT, using the PowerQuad is quite simple and you can really accelerate your particular processing task. The best way to get started with using the PowerQuad FFT is to look at the examples in the SDK. There is an example project called “powerquad_transform” which has examples that test the PowerQuad hardware.
Figure 13. PowerQuad Transform examples in the MCUXpresso SDK for the LPC55S69
In the file powerquad_transform.c, there are several functions that will test the PowerQuad engine in its different modes. For now, we are going to focus on the function PQ_RFFTFixed16Example(void).
This example will set up the PowerQuad to accept data in a 16-bit fixed point format. To test the PowerQuad, a known sequence of input and output data is used to verify results. The first thing I would like to point out is that the PowerQuad transform engine uses fixed point/integer processing only. If you need floating point, you will need to convert beforehand. This is possible with the matrix engine in the PowerQuad. I personally only ever use FFTs with fixed point data as most of my source data comes right from analog to digital converter data. Because of the processing gain of the FFT, I have never seen any benefit of using a floating-point format for FFTs other than some ease of use for the programmer. Let us look at the buffers used in the example:
Notice the input data length FILTER_INPUT_LEN (which is 32 samples). The arrays used to store the outputs are twice the length. Remember that an FFT will produce the same number of *complex* samples in the output as there are samples for the input. Since our input samples are real values (scalars) and the outputs have real/imaginary components, it follows that we need 2x the length to store the result. I stated before that one of the implications of the FFT with real valued inputs is that we have a mirror spectrum with complex conjugate pairs. Focusing on the reference for testing the FFT output in the code:
The 1st pair 100,0 corresponds to the 1st bin which is a “DC” or 0Hz component. It should always have a “zero” for the imaginary component. The next bins can be paired up with bins from the opposite end of the data:
76,-50 <-> 77,49
29,-62 <-> 29, 61
-1, -34 <-> -1,33
These are the complex conjugate pairs exhibiting mirror symmetry. You can see that they are not quite equal. We will see why in a moment. After all the test data is initialized, there is a data structure used to initialize the PowerQuad:
One of the side effects of computing an FFT is that you get gain at every stage of the process. When using integers, it is possible to get clipping/saturation and the input needs to be downscaled to ensure the signal does not numerically overflow during the FFT process. The macro FILTER_INPUTA_PRESCALER is set to “5”. This comes from the length of the input being 32 samples or 2^5. The core function of the Radix-2 FFT is to keep splitting the input signal in half until you get to a 2-point DFT. It follows that we need to downscale by 2^5 as we can possibly double the intermediate results at each stage in the FFT. The PowerQuad uses a Radix-8 algorithm, but the need for downscaling is effectively the same. I believe that some of the inaccuracy we saw in the complex conjugate pairs in the test data was from the combination of input array values that are numerically small and the pre-scale setting. Note that the pre-scaling is a built-in hardware function of the PowerQuad.
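The arithmetic behind the prescale value can be sketched in a few lines of plain Python. This models the prescale as a simple arithmetic right shift applied to each 16-bit input sample, which is my assumption for illustration; the actual hardware scaling and rounding may differ:

```python
INT16_MAX = 32767

def prescale(samples, shift_bits):
    """Model the input prescale as an arithmetic right shift (assumption)."""
    return [s >> shift_bits for s in samples]

def dc_bin(samples):
    """The 0 Hz bin is just the sum of all samples: the worst case for growth."""
    return sum(samples)

N = 32
shift = N.bit_length() - 1          # log2(32) = 5

# Worst case: every sample sits at full scale.
worst_case = [INT16_MAX] * N
unscaled_sum = dc_bin(worst_case)                  # far beyond 16 bits
scaled_sum = dc_bin(prescale(worst_case, shift))   # fits in 16 bits

# The price: small input values lose most of their resolution.
small_inputs = [100, -37, 5]
small_scaled = prescale(small_inputs, shift)

print(unscaled_sum, scaled_sum, small_scaled)
```

The last line shows the trade-off: the same shift that protects against overflow leaves very little resolution when the input values are numerically small.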
The PowerQuad needs an intermediate area to work from. There is a special 16KB region starting at address 0xe0000000 dedicated to the PowerQuad. The PowerQuad has a 128-bit interface to this region so it is optimal to use this region for the FFT temporary working area. You can find more details about this private RAM in AN12292 and AN12383.
Once you configure the PowerQuad, the next step is to tell the PowerQuad where the input and result data are stored with the function PQ_transformRFFT().
Notice in the implementation of the function, all that is happening is setting some more configuration registers over the AHB bus and kicking off the PowerQuad with a write to the CONTROL register. In the example code, the CPU blocks until the PowerQuad is finished and then checks the results. It is important to point out that in your own application, you do not have to block until the PowerQuad is finished. You could set up an interrupt handler to flag completion and do other work with the general purpose M33 core. Like I stated in my article on IIR filtering with the PowerQuad, the example code is a good place to start but there are many opportunities to optimize your particular algorithm. Example code tends to include additional logic to check function arguments to make the initial experience better. Always take the time to look through the code to see where you can remove boilerplate that might not be useful.
The PowerQuad includes a special engine for computing Fast Fourier Transforms.
The FFT is an efficient implementation of the Discrete Fourier Transform. This process just compares a signal against a known set of reference gauges (Sines and Cosines).
The PowerQuad has a private region to do its intermediate work. Use it for best throughput.
Also consider the memory layout and AHB connections of where your input and output data lives. There may be additional performance gains by making sure your input DSP data is in a RAM block that is on a different port than the RAM used in your application for general purpose tasks. This can help with contention when different processes are accessing data. For example, SRAM0–3 are all on different AHB ports. You might consider locating your input/output data in SRAM3 and having your general-purpose data in SRAM0-2. Note: You still need to use 0xE0000000 for the PowerQuad TEMP configuration for its intermediate working area.
At this point you can begin looking through the example transform code. Also make sure to read through AN12292 and AN12383 for more details. While there are more nuances and details to FFT and “frequency domain” processing, I will save those for future articles. Next time I hope to show some demos of the PowerQuad FFT performance on the Mini-Monkey and illustrate some other aspects of the PowerQuad. Until then, check out some of the additional resources below on the LPC55S69.
In my last article, we started discussing the PowerQuad engine in the LPC55S69 as well as the concept of data in the “time domain”. Using the Mini-Monkey board, we showed the function of collecting a bucket of data over time. I chose to use a microphone as a data source as it is easy to visualize and understand. You can now easily imagine replacing the microphone with *anything* that changes over time. In this article we are going to look at some common algorithms for processing data in the time domain. In particular, we will look at the “Dual Biquad IIR” engine in the LPC55S69 PowerQuad. An IIR biquad is a commonly used building block as it is possible to configure the filter for many common filtering use cases. This article is not intended to review all of the DSP theory behind IIR filter implementations but I do want to highlight some key points and the PowerQuad implementation.
Digital Filtering with Embedded Microcontrollers
When sampling data “live”, one can imagine data being continuously recorded at a known rate. A time domain filter will accept this input data and output a new signal that is modified in some way.
Figure 1. Filtering In the Time Domain
The concept here is that the output of the filter is just another time domain signal. You may choose to do further processing on this new signal or output it to a Digital to Analog Converter (DAC). If we are thinking in terms of “sine waves”, a digital filter adjusts the amplitude and phase of the input signal. As we apply different frequency inputs (or a sum of different frequencies), the filter attenuates or amplifies the sinusoidal components. So, how does one compute a digital filter? It is quite simple. Let us start with a simple case:
Figure 2. Sample by Sample Filter Processing using a History of the Input
One operation we perform is to *mix* the most recent input sample with samples we have previously recorded. The result of this operation is our next *output* sample. The name of this filter configuration is an FIR or Finite Impulse Response filter. One way to write this algorithm is to use a “c array style” notation and difference equations.
x[n] The current input
x[n-1] Our previous input
x[n-2] An input from 2 samples ago
y[n] Our next output
Figure 2 could be written as
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2]
All we are doing is multiplying our input sample and its history by constant coefficients and then adding them up. We are multiplying then accumulating! The constants b0, b1 and b2 control the frequency response of the filter. By choosing these numbers correctly, we can attenuate “high” frequencies (low pass filter), attenuate low frequencies (high pass filter), or perform some combination of the two (band pass filter). We can also use more samples from the input history. For example, instead of just using the previous 3 samples, one could use 128 samples. A filter of this type (FIR) can require quite a bit of time history to get precise control over its frequency response. The code to implement this structure is simple but can be very CPU intensive as you need to do the multiply and adds for *every* sample at your signal sample rate.
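The difference equation above translates almost directly into C. This is a minimal sketch rather than production code: the function name `fir3()` is my own, and it assumes the samples before the start of the buffer are zero.

```c
#include <stddef.h>

/*
 * 3-tap FIR filter: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2].
 * Samples before the start of the buffer are assumed to be zero.
 * (Hypothetical helper for illustration.)
 */
void fir3(const float b[3], const float *x, float *y, size_t len)
{
    float x1 = 0.0f;    /* x[n-1] */
    float x2 = 0.0f;    /* x[n-2] */

    for (size_t n = 0; n < len; n++) {
        float xn = x[n];                            /* read first so x and y may alias */
        y[n] = b[0] * xn + b[1] * x1 + b[2] * x2;   /* multiply and accumulate         */
        x2 = x1;                                    /* age the history by one sample   */
        x1 = xn;
    }
}
```

Every output sample costs three multiplies and two adds. A 128-tap version of the same loop costs 128 multiplies per sample, which is where the CPU load comes from.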
There is an adjustment we can make to figure 2 that can allow for tighter control over our frequency response without having to use a long time history.
Figure 3. Sample by Sample Filter Processing using a History of the Input and Output
The key difference between figure 2 and figure 3 is that we can also mix in previous filter *outputs* to generate the output signal. Adding this “feedback” can yield some interesting properties and is the root of another class of digital filters called IIR (Infinite Impulse Response) filters.
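Here is a rough sketch of that feedback structure with two samples of input history and two samples of output history. The names `iir2_t` and `iir2_step()` are hypothetical, and the sign convention (feedback terms subtracted) is an assumption on my part. Design tools differ on this, so check the convention of whatever generates your coefficients.

```c
/* Coefficients and history for a second-order filter with feedback. */
typedef struct {
    float b0, b1, b2;   /* input (feed-forward) coefficients */
    float a1, a2;       /* output (feedback) coefficients    */
    float x1, x2;       /* x[n-1], x[n-2]                    */
    float y1, y2;       /* y[n-1], y[n-2]                    */
} iir2_t;

/*
 * Process one sample (assumed sign convention):
 *   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
 */
float iir2_step(iir2_t *f, float x)
{
    float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
            - f->a1 * f->y1 - f->a2 * f->y2;

    f->x2 = f->x1;      /* age the input history  */
    f->x1 = x;
    f->y2 = f->y1;      /* age the output history */
    f->y1 = y;
    return y;
}
```

With b0 = 1, a1 = -0.5 and everything else zero, a single input of 1 followed by zeros produces 1, 0.5, 0.25 and so on. The output keeps halving and never quite reaches zero, which is the “infinite” part of the name.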
One of the primary advantages of this approach is that you need fewer coefficients than an FIR filter structure to get a desired frequency response. There are always trade-offs when using IIR filters vs. FIR filters so be sure to read up on the differences. The example I showed in figure 3 is called a “biquad”. A biquad filter is a common filter building block that can be easily cascaded to construct larger filters. There are several reasons to use a biquad structure, one of which being that there are many design tools that can generate the coefficients for all of the common use cases. Several years ago, I built a tool around a set of design equations that were useful for audio filtering.
At the time I made the tool shown in figure 4, I was using biquad filter structures for tone controls on a guitar effects processor. The frequency and phase response plots were designed to show frequencies of interest of an electric guitar pickup. There are lots of options for coming up with coefficients and numerous libraries to help. For example, you could use Python.
In my guitar effects project, I embedded the filter design equations in my C code so I could recompute coefficients dynamically!
Using the PowerQuad IIR Biquad Engines
The PowerQuad in the LPC55S69 has dedicated hardware to compute IIR biquad filters. Like an FIR filter, the actual code to implement a biquad filter is straightforward. An IIR filter may be simple to code but can use quite a bit of CPU time to crunch through all the multiply and accumulate operations. The PowerQuad is available to free up the CPU from performing the core computational component of the biquad computation. A good starting point for using the PowerQuad IIR biquad engine is to use the MCUXpresso SDK. It is important to note that the SDK will be a starting point. The SDK code is written to cover as many use cases as possible and to demonstrate the different functions of the PowerQuad. It can be helpful to read through the source code and decide which pieces you need to extract for your own application. DSP code often requires some hand tuning and optimization for a particular use case. The PowerQuad is connected via the AHB bus and the Cortex-M33 co-processor interface. Let’s take a look at the SDK source code to see how the IIR engine works.
Using the “Import SDK Examples” wizard in MCUXpresso, you will find PowerQuad examples under driver_examples > PowerQuad
Figure 5. Selecting the PowerQuad Digital Filter Example
The powerquad_filter project has quite a few examples of the different filter configurations. We are going to focus on a floating point biquad example as a starting point. In the file powerquad_filter.c, there are several test functions that will demonstrate a basic filter setup. I am using LPC55S69 SDK 2.7.1 and there is a function around line 455 (Note the spelling mistake PQ_VectorBiqaudFloatExample).
Figure 6. Vectorized Floating Point IIR Filter Function
The 1st important point to note is that PowerQuad computes IIR filters using “Direct Form II”. In the previous figures I showed the filter using “Direct Form I”. When one is 1st introduced to IIR filters, “Direct Form I” is the natural starting point as it is the clearest and most straightforward implementation. It is possible however to re-arrange the flow of multiplies and adds and get the same arithmetic result.
When using "Direct Form II", we do not need to store history of both inputs and outputs. Instead, we store an intermediate computation which is labeled v[n]. During the computation of the filter, the intermediate history v[n] must be saved. We will refer to these intermediate values as the filter “state”. To set up the PowerQuad for IIR filter operation, there are a handful of registers on the AHB bus where the state and coefficients are stored. In the SDK examples, the state of the filter is initialized with PQ_BiquadRestoreInternalState().
Figure 8. Restoring/Initializing Filter State
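To make the idea of filter “state” concrete, here is a plain C sketch of a Direct Form II biquad. This is a software model for illustration, not the PowerQuad driver: `biquad_df2_t` and `biquad_df2_step()` are hypothetical names, and the sign convention for a1 and a2 is an assumption that may not match what the PowerQuad registers expect.

```c
/* Coefficients plus the two saved intermediate values (the "state"). */
typedef struct {
    float b0, b1, b2;
    float a1, a2;
    float v1, v2;       /* v[n-1], v[n-2] */
} biquad_df2_t;

/*
 * Direct Form II (assumed sign convention):
 *   v[n] = x[n] - a1*v[n-1] - a2*v[n-2]
 *   y[n] = b0*v[n] + b1*v[n-1] + b2*v[n-2]
 */
float biquad_df2_step(biquad_df2_t *f, float x)
{
    float v = x - f->a1 * f->v1 - f->a2 * f->v2;            /* feedback half     */
    float y = f->b0 * v + f->b1 * f->v1 + f->b2 * f->v2;    /* feed-forward half */

    f->v2 = f->v1;      /* only the v history needs to be kept */
    f->v1 = v;
    return y;
}
```

Notice that only two history values (v1 and v2) are carried between samples, instead of the four needed when input and output history are stored separately.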
Once the PowerQuad IIR engine is initialized, data samples can be processed through the filter. Let us take a look at the function PQ_VectorBiquadDf2F32() in fsl_powerquad_filter.c
Figure 9. Vectorized IIR Filter Implementation.
This function is designed to process longer blocks of input samples, ideally in multiples of 8. Note that many of the SDK examples are designed to make it simple to get started but could be easily tuned to remove operations that may not be applicable in your application code. For example, the modulo operation to determine if the input block is a multiple of 8 is something that could be easily removed to save CPU time. In your application, you have complete control over buffer sizes and can easily optimize and remove unnecessary operations. The actual computation of the filter can be observed in the code block that processes the 1st block of samples.
Figure 10. Transferring Data to the IIR Engine with the ARM MCR Coprocessor Instruction
Data is transferred to the PowerQuad with the MCR instruction. This instruction transfers data from a CPU register to an attached co-processor (the PowerQuad in this case). The PowerQuad does the work of crunching through the Direct Form II IIR structure. While it takes some CPU intervention to move data into the PowerQuad, the PowerQuad is much more efficient at the multiply and adds for the filter implementation.
To get the result, the MRC instruction is used. MRC moves data from a co-processor to a CPU register.
Figure 11. Retrieving the IIR Filter result with the MRC instruction.
Further down in PQ_VectorBiquadDf2F32(), there is assembly code tuned to inject data in blocks of 8 samples. Looking at PQ_Vector8BiquadDf2F32():
Figure 12. Vectorized Data Insertion into the PowerQuad.
Notice all the MCR/MRC instructions to transfer data in and out of the biquad engine. All the other instructions are “standard” ARM instructions to get data into the registers that feed the coprocessor. Take some time to run the examples in the SDK. They are structured to inject a known sequence to verify correct filter operation. Now that you have seen some of the internals, you can use the pieces you need from the SDK to implement your signal processing chain.
The PowerQuad can help accelerate biquad filters. There are 2 separate biquad engines built into the PowerQuad.
The PowerQuad IIR functions are configured through registers on the AHB bus and the actual input/output samples transferred through the Cortex M33 coprocessor interface.
The SDK samples are a good starting point to see how to configure and transfer data to the PowerQuad. There are optimization opportunities for your particular application so be sure to inspect all of the code.
If you need more than two biquad filters, you will need to preserve the “state” of the filter. This can be a potentially expensive operation if you are constantly saving/restoring state. In this case you will want to consider processing longer blocks of data.
You may not need to save the entire “state” of the filter. For example, if the filter coefficients are the same for all of your filters, all you need to save and restore is v[n].
While the PowerQuad can speed up (6x) the core IIR filter processing, you still need the CPU to set up the PowerQuad and feed in samples. Consider using one of the extra Cortex M33 cores in the LPC55S69 to do your data shuffling.
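The save/restore idea from the list above can be sketched in plain C. This is a software stand-in for one biquad engine, not the SDK API: `engine_t`, `channel_state_t` and `engine_run_block()` are hypothetical names, and I am assuming every channel shares the same coefficients so that only v[n-1] and v[n-2] need to be swapped.

```c
#include <stddef.h>

/* Software stand-in for one biquad engine (hypothetical model, not the SDK). */
typedef struct {
    float b0, b1, b2, a1, a2;   /* coefficients, shared by every channel */
    float v1, v2;               /* Direct Form II state v[n-1], v[n-2]   */
} engine_t;

/* Per-channel context: only the state, since the coefficients are shared. */
typedef struct {
    float v1, v2;
} channel_state_t;

static float engine_step(engine_t *e, float x)
{
    float v = x - e->a1 * e->v1 - e->a2 * e->v2;
    float y = e->b0 * v + e->b1 * e->v1 + e->b2 * e->v2;

    e->v2 = e->v1;
    e->v1 = v;
    return y;
}

/* Restore one channel's state, filter a whole block, then save the state back. */
void engine_run_block(engine_t *e, channel_state_t *ch,
                      const float *in, float *out, size_t len)
{
    e->v1 = ch->v1;                     /* restore */
    e->v2 = ch->v2;
    for (size_t n = 0; n < len; n++) {
        out[n] = engine_step(e, in[n]);
    }
    ch->v1 = e->v1;                     /* save */
    ch->v2 = e->v2;
}
```

The restore and save happen once per block, so the longer the block, the smaller the share of time spent swapping state.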
You now have a head start on performing time domain filtering with the LPC55S69 PowerQuad. We examined IIR filters, which have lots of applications in audio and sensor signal processing, but the PowerQuad can also accelerate FIR filters. Next time we are going to dive a little deeper with some frequency domain processing with the PowerQuad transform engine. The embedded transform engine can accelerate processing of Fast Fourier Transforms *significantly*. Stay tuned for more embedded signal processing goodness!
Built into the LPC55S69 is a powerful coprocessor called the “PowerQuad”. In this article we are going to introduce the PowerQuad and some interesting use cases. Over the next several weeks we will look at using some of the different processing elements in the PowerQuad using the “Mini-Monkey” board.
Figure 1: NXP PowerQuad Signal Processing Engine
The PowerQuad is a dedicated hardware unit that runs in parallel to the main Cortex M33 cores inside the LPC55S69. By using the PowerQuad to work in parallel to the main CPU, it is possible to implement sophisticated signal processing algorithms while leaving your main CPU(s) available to do other tasks such as communication and IO. This is a very important use case in distributed sensor systems and the Industrial Internet of Things (IIOT). Over the next several weeks, I am going to show some practical aspects of using the PowerQuad in various applications. I feel it is a very good fit for many tightly embedded applications that need a combination of general-purpose processing, IO, and dedicated signal processing while maintaining a very low active power profile.
Embedded Systems, Sensors and Signal Processing
Before we get started, I think it is helpful to review some concepts and explain why some of the functions of the PowerQuad are useful. Even though many engineers may have learned about Digital Signal Processing (DSP) in college or university, there is often little connection to real hardware and code. Many introductions to DSP begin with formal explanations (i.e. heavy math!). While this formalism is important for developing the underlying algorithms, it is easy to get lost when trying to make something work. As an example, one of the core algorithms in many DSP applications is the Fast Fourier Transform. It can be difficult to understand how to use the software at a black box level if all you have ever worked with is the mathematical formalism. Being able to link the formalism with a real application is where real magic can happen! In these upcoming articles, I will break down what is actually happening in the code so it is a bit easier to use the PowerQuad hardware.
For an overwhelming majority of sensor and industrial IOT applications, we encounter “time series” data. By time series, all we mean is that we take some sort of measurement at a constant interval and put the recorded data into a bucket. We might process this data one sample at a time as it comes in or wait until our bucket fills up to a level before working with the information. A key feature here is that we have some measurement (temperature, pressure, voltage level) that is captured at a fixed rate. What we end up with is a data set that spans some amount of “time”. We do not have infinite resolution in the measurement “amplitude” nor can we take measurements infinitely fast. For example, if we take voltage readings over time, our “step” size might be 1 millisecond with 1 millivolt resolution in our amplitude. The details of how fast and with how much precision are application dependent.
Figure 2: A Time Series Cartoon
In Figure 2, notice that the "dots" are not connected to indicate that we have a discrete set of data. Many times we fill in the space between the dots on a chart to get a better visualization of the signal but what we have to work with is a discrete bucket of data.
Let’s take a look at an example using the LPC55S69 on the “Mini-Monkey”. The Mini-Monkey circuit has a digital microphone connected via an I2S interface to the MCU and a 240x240 pixel display connected via SPI. Using the display, we can visualize the time series (my voice). As a demonstration, I grabbed a bucket of 256 samples from the microphone via the I2S interface and rendered the raw time series data on the display. The microphone on the Mini-Monkey (Knowles Acoustic SPH0645LM4H-B) was set up to output data at a rate of 32KHz. The resolution in amplitude from this device is 18-bits. Since my OLED screen is 240 pixels high, I divided down the amplitude of the samples so they would fit.
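One way to do that divide-down is sketched below. This is not the code running on the Mini-Monkey. It is a minimal illustration that assumes each sample has already been sign-extended from 18 bits into an `int32_t`, and `sample_to_row()` is a hypothetical helper name.

```c
#include <stdint.h>

#define DISPLAY_HEIGHT   240       /* pixels, from the Mini-Monkey display      */
#define SAMPLE_FULLSCALE 131072    /* 2^17: magnitude of a signed 18-bit sample */

/*
 * Map a signed 18-bit sample to a display row (0 = top, 239 = bottom).
 * Zero lands on the centre row and positive samples move up the screen.
 * Assumes the sample is already sign-extended into an int32_t.
 */
int sample_to_row(int32_t sample)
{
    int32_t centre = DISPLAY_HEIGHT / 2;
    int32_t row = centre - (sample * centre) / SAMPLE_FULLSCALE;

    if (row < 0) {
        row = 0;                        /* clamp to the top edge    */
    }
    if (row > DISPLAY_HEIGHT - 1) {
        row = DISPLAY_HEIGHT - 1;       /* clamp to the bottom edge */
    }
    return (int)row;
}
```

A silent input (0) maps to row 120, and the most negative 18-bit sample is clamped to the bottom row, 239.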
All I am doing is collecting data into a "buffer" and then continually displaying the information on the screen. It is an easy way to visualize what is going on. Now, instead of using a microphone measuring acoustic pressure, you could sample something else. A velocity measurement, a voltage signal, etc. The time series data set is your starting point. Now it is time to start doing something with the numbers and that is where PowerQuad can help. Most signal processing algorithms boil down to simple, repetitive operations over arrays of data. Just about everything can be boiled down to a multiplication and add. This is why you may have heard quite a bit about multiply and accumulate units (MAC) in DSP engines. It is an ideal use case for a coprocessor.
The PowerQuad at its core has the logic to handle the most common “building blocks”. Sometimes when you have a time series, you process the data in a manner that preserves all of the “time information”. Meaning, the information you get out of the “signal processing black box” is still a set of datapoints correlated to some block of time. They just might be filtered or modified in some way. For example, maybe you have a signal where you want to remove 60Hz noise. You might consider a digital FIR or IIR filter. Other times you “transform” your data into information that is “correlated” to something else, such as a rate or “frequency”. We will be exploring both of these applications in future articles but the PowerQuad can help with both of these use cases.
The LPC55S69 can bring in time series data via several interfaces. In this article I measured acoustic pressure with a digital MEMS microphone over a digital audio port (I2S). You could also take measurements with the analog to digital converter. For example, I have a little breakout board for an ADXL1001BCPZ accelerometer I built last year:
Figure 4: ADXL1001BCPZ Accelerometer Board (Left)
This ADXL1001BCPZ is a high bandwidth accelerometer useful for machine monitoring and vibration analysis applications. Many common MEMS accelerometers do not have a high enough bandwidth to capture all the dynamic information in a vibrating system. The -3dB bandwidth of the ADXL1001 stretches to 11KHz, making it ideal for vibration problems. Low-cost accelerometers used for simple motion detection and orientation have a very low bandwidth and may not be able to capture the dynamics you are looking for in a vibration application. Furthermore, many of the MEMS devices that can measure in multiple axes do not have the same bandwidth and noise performance on all axes. We can use the internal ADC in the LPC55S69 to sample the accelerometer over time and build up a time series to understand how something is vibrating. While microphones can pick up sound traveling in air, accelerometers can be used to understand sound traveling through a physical structure. Using signal processing techniques, we can even combine information from multiple sensors (measuring the same thing in different ways) to better understand a problem.
In the neck of the woods where I grew up, there were lots of experienced auto mechanics who could quickly identify problems without even opening the hood. The first method to debug a problem was to take the car for a drive or start the motor and “listen”. Many of these individuals were so well trained that they could know exactly what an issue was simply by listening. All mechanical systems vibrate. *How* they vibrate is dependent on their size, shape, material properties, and operating conditions. These mechanical vibrations couple to the air and we can “hear” what is going on. If you have some situational awareness of the mechanical system, you know how something *should* sound when the system is operating normally. If a component starts failing, the mechanical system changes and it will vibrate differently. Because the “boundary conditions” of the system changed, the nature of the sound produced changes. We can instrument the machine with sensors, say an accelerometer, and capture the time series. Using some math (DSP) and our a-priori knowledge of how the system is supposed to behave, it is possible to predict failure before it occurs.
Our global industry is driven by large and expensive electro-mechanical machines. All the things we consider essential for life, say Oreo cookies and toilet paper, are produced in large factories with large, high dollar value processes. It makes absolute sense to automate the measurement and analysis of high value machines as the money saved from avoiding unplanned downtime is incredible. The LPC55S69 can be a good fit for many “smart sensor” applications as it can be packed in tight spaces, consume little power and do a 1st level data reduction at the sensor. Instead of transmitting large amounts of data from a system, the LPC55S69 can allow for significant signal processing to reduce a complex time series into other metrics that can be analyzed at an enterprise level to determine if a failure will occur. The LPC55S69 with the PowerQuad is a great fit for the Industrial IOT.
LPC55S69 PowerQuad Application - Power Line Communications and Metering
A completely different but interesting use-case for the LPC55S69 PowerQuad is Power Line Communications (PLC). There are many sensor applications where you need to transmit and receive data, but you only have access to DC or AC power lines. Many new smart meters attached to your home employ this technology. PLC uses sophisticated techniques such as Orthogonal Frequency Division Multiplexing (OFDM) to transmit data on a power line. OFDM is an interesting technique as it allows you to send data bits down a communications channel *in parallel* across several frequency bands. It is tolerant to noise as you can achieve high bit rates by using many parallel channels/bands where each band contains slowly moving data.
A core requirement of any OFDM solution is being able to compute Fast Fourier Transforms (FFT) in real time on an incoming time series. If you can efficiently compute an FFT, it is straightforward to encode/decode data on both the transmitting and receiving ends of the system. Using bins of the FFT, data is encoded using the real and imaginary components (amplitude and phase) to make up bits of a data "word". Once you encode data in the "bins", you can use an inverse FFT to get a time signal to output to a digital to analog converter. Decoding is essentially figuring out when your signal starts and then using an FFT to get the "bins". Once you have your frequency bins, you look at amplitude/phase information to reconstruct your data word.
Figure 5: OFDM Time Series, Frequency Domain Symbol Spectrum and QAM Symbols.
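As a small illustration of packing bits into the real and imaginary parts of a bin, here is a sketch of the simplest possible mapping: two bits per bin, one in each sign (often called QPSK or 4-QAM). The bit assignment and the names `symbol_t`, `qpsk_encode()` and `qpsk_decode()` are my own assumptions for illustration. Real PLC standards define their own constellations, and the FFT/inverse FFT steps are left out here.

```c
#include <stdint.h>

/* One frequency bin: real and imaginary components. */
typedef struct {
    int8_t re;
    int8_t im;
} symbol_t;

/*
 * Encode two bits into one bin (hypothetical mapping):
 * bit 0 sets the sign of the real part, bit 1 sets the sign of the
 * imaginary part (0 -> +1, 1 -> -1).
 */
symbol_t qpsk_encode(uint8_t bits)
{
    symbol_t s;

    s.re = (bits & 0x1u) ? -1 : 1;
    s.im = (bits & 0x2u) ? -1 : 1;
    return s;
}

/* Decode by looking only at the signs, so amplitude errors are tolerated. */
uint8_t qpsk_decode(int32_t re, int32_t im)
{
    uint8_t bits = 0;

    if (re < 0) {
        bits |= 0x1u;
    }
    if (im < 0) {
        bits |= 0x2u;
    }
    return bits;
}
```

Because the decoder only checks signs, a received bin of -37,12 still decodes to the same two bits as an ideal -1,1.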
This is a gross simplification of the OFDM process but accelerators such as the PowerQuad are a key element to making it work. The LPC55S69 is well suited to this particular application as most of the complexities of the algorithm could be implemented using the PowerQuad, leaving your computational resources (such as the Cortex M33) to implement your metering and measurement application. All of this can be done while consuming very little active energy in a small package. At one time, you would have needed a power-hungry IC to perform this process.
Moving Forward with the PowerQuad
I hope you are now interested in some of the use cases of the LPC55S69 and the PowerQuad engine. In the coming articles we are going to dive into some of the different aspects of the PowerQuad engine and demonstrate some processing on the Mini-Monkey platform. Stay tuned and feel free to check out the LPC55S69. And in case you missed it, here are some other LPC55S69 blogs/videos:
FreeMASTER, from NXP, is a powerful real-time debugging and data visualization tool that can help you create engaging demo interfaces for your embedded application. Join NXP for this four-part on-demand training series as we provide an overview of the software, its features, capabilities, available examples, application use cases and how to easily get started.
The Mini-Monkey is now officially “out the door”. I just sent the files to Macrofab and can’t wait to see the result. Before I talk a bit about Macrofab, we will look at what is going to get built. A few weeks ago, I introduced a design based upon the LPC55S69 in the 7mm VFBGA98. The goal was to show that this compact package can be used with a low cost PCB/Assembly service without having to use the more expensive build specifications. The Mini-Monkey board will also be used to show off some of the neat capabilities of the PowerQuad DSP engine in future design blogs. Here is what we ended up with for the first version:
Figure 1. Mini-Monkey Revision A
Lithium-Polymer battery power with micro-USB Charging
High-speed USB 2.0 Interface
SWD debug via standard ARM .050” and tag-connect interface
3 push buttons. One can be used to start the USB ROM bootloader
External Power Input
11 dedicated IO pins connected to the LPC55S69. Functions available:
Dedicated Frequency Measurement Block
State Configurable Timers (Both input and output)
Additional ADC Channels
The HS-SPI used for the IPS display is also brought to IO pins
I am a firm believer in not trying to get anything perfect on the 1st try. It is incredibly inexpensive to prototype ideas quickly so I decided to try to get 90% of what I wanted in the first version. As we will see, it is inexpensive to iterate on this design to work in improvements. Without too much trouble, I was able to get everything I wanted on 2 signal layers with a power reference filled in on the top and bottom sides. If this was a production design, I would probably elect to spend a bit more to get two solid inner reference planes by using a 4-layer design. Once a design hits QTY 100 or more, the cost of using a 4-layer stack-up can be negligible. A 4-layer stack-up makes the design much easier to execute and to make compliant with EMI, RFI requirements. For most of my “industrial” designs where I know that it won’t be high quantity, I always start at 4-layer unless it is a simple connector board.
For this 1st run, I wasn’t trying to push the envelope with how much I could get done with low cost design rules and a 2-layer stack-up. The VFBGA leaves quite a bit of space for fanning out IO. Quite a bit can be done on the top layer without vias. I had a few IO that ended up in more difficult locations, but routing was completed quickly.
Figure 2. Mini-Monkey VFBGA Fanout
As you can see, I did not make use of all the IO. If I had used a 4-layer board it would be simpler to get quite a bit more of the IO fanned out. Moving to smaller vias, traces and a 4-layer stack-up would probably allow one to get all IO’s connected. For this design, I was trying to move quickly as well as use the standard “prototype” class specs from Macrofab. This means 5 mil traces, 10 mil drills with a 4-mil annular ring. If you can push to 3.5mil trace/space, NXP AN12581 has some suggestions.
I did want to take a minute to talk about Macrofab. I normally employ the services of a local contract manufacturer but this time I elected to give this online service a try. After going through the order process, I must say I was thoroughly impressed! The 1st step is to upload your PCB design files. I use the Altium Designer PCB package and Macrofab recommends uploading in ODB++ format. Since this format has quite a bit more meta-data baked in than standard Gerbers, the online software can infer quite a bit about your design.
Figure 3. Macrofab PCB Upload
The Macrofab software gives you a cool preview of your PCB with a paste mask out of the gate. Note that this design is using red solder mask as that is what is included in the prototype class service. Once you have all the PCB imported, you can now upload a Bill of Materials (BOM).
Figure 3. Macrofab BOM Upload
Macrofab provides clear guidance on how to get your BOM formatted for maximum success. Once the BOM is uploaded, the online tool searches distributors and you can select what parts you want to use. The tool also allows one to leave items as Do Not Place (DNP). I was impressed that it found almost everything I wanted out of the box. Pricing and lead time are transparent.
Next up is part placement:
Figure 4. Macrofab Part Placement
Using the ODB++ data, the Macrofab software was able to figure out my placements. I was thoroughly impressed with this step as it was completely automatic. The tool allows you to nudge components if needed. Once placements are approved, the tool will give you a snapshot of the costs.
Figure 5. Cost Analysis and Ordering
What I liked here was how transparent the process was. Using the prototype class service, a single board was $152. This is an absolute steal when you consider that all of the setup costs, parts and PCBs are baked in. If you consider the value of your time, this is an absolute no brainer. I also like that it gives you a cost curve for low volume production. In the future, I am going to have a hard time using another service that can’t give me this much data with so little work.
I ended up ordering 3 prototype units. Total cost plus 2-day UPS shipping was $465.67. Note, I did end up leaving one part off the board for now: the 1.54” IPS display. This part requires some extra “monkeying” around as it is hot bar soldered and needs some 2-sided tape. I decided to solder the 1st three prototypes on my bench to get a better feel for the process of using this display. However, I am more than happy to push the BGA and SMT assembly off to someone else.
It looks like the boards are going to ship on the 1st of May. I’ll post a video and update when they come in. So far, the experience with Macrofab has been quite positive and I am eager to see the results. Once I get the design up and running, I’ll post documentation to bitbucket.
The first step in designing a PCB with a new MCU is to add the part into your component libraries. Component library management can be a source of passionate disagreements between design engineers. My own view on library management is rooted in many years of making mistakes! These simple mistakes ultimately caused delays and made projects more difficult than they needed to be. Often these mistakes were also driven by a desire to "save time". Given my experience, there are a few overarching principles I adhere to.
1. The individual making the component should also be the one who has to stay the weekend and cut traces if a mistake is made. This obviously conflicts with the “librarian/drafter” model but I literally have seen projects where the librarian made a mistake on a 1000+ pin BGA that cost >$5k. This model was put in a library and marked as “verified”. The person making the parts needs some skin in the game! In this case, the drafting team claimed they had a process that included a double check but *no one in that process knew the context of how the part was going to be used*.
2. Pulling models from the internet or external libraries is OK as a starting point but it is just that: a starting point. You must treat every pin as if it was wrong and verify. Since many organizations have specific rules on how a part should look, you will need to massage the model to meet your own needs. Software engineers shake their heads at this rule. "Why not build on somebody else's libraries? It is what we do!" Well, a mistake in a hardware library can take weeks if not months to really solve. The cost, time and frustration impact can be huge. We hardware engineers can't simply "re-compile".
3. I don’t trust any footprint unless I know it has been used in a successful design. The context of how a part is used is very important (which leads to #4).
4. I believe design reuse is best done at a schematic snippet level, not at the level of an individual part. After all, once I get this Mini-Monkey board complete, I will never again start with just the LPC55S69. I want all the “stuff” surrounding the chip that makes it work!
To the casual observer, these principles seem onerous and time consuming but I have found that they *save me time over the course of the project*. Making your own parts may seem time consuming but it *does not have to be*. There are tools that can make your life simpler and the task less arduous. Also, making your own CAD part is useful for a few other reasons:
You have to go through a mental exercise when looking at each of the pins. It forces your brain to think about functionality in a slightly different way. When starting with a new part/family, repeated exposure is a very good way to learn.
Looking at the footprint early on gets your brain in a planning mode for when you do get started.
One could argue that this is “lost” time as compared to getting someone else to do the CAD library management, but I feel strongly that it saves time in the long run. I have witnessed too many projects sink time into unnecessary debugging due to bad CAD part creation. I feel the architect of the design needs to be intimately involved and take ownership of the process.
The LPC55S69 in the VFBGA package has only 98 pins. With no automation or tools, it would not take all that long to build a part right from the datasheet. However, it is on the edge of being a time consuming endeavor. Also, when I build schematic symbols, I tend to label the pins with all possible IO capabilities allowed by the MCU pin mux. This can make the part quite large but it also helps me see what else is available on a pin if I am in a debug pinch. Creating pins with all this detail can be quite time consuming. I use Altium Designer for all of my PCB design and it has some useful automation to make parts more quickly. NXP’s MCUXpresso tool also has a unique feature that can really help board designers get work done quickly.
Creating the Pin List
Built into MCUXpresso is a pins tool that is *very* useful in large projects for setting up the pin muxes and doing some advanced planning. While it is primarily a tool for bootstrapping pin setup for the firmware, it can also be useful to drive the CAD part creation process. Simply create a new project and start the pins tool:
The pins tool gives you a tabular and physical view of pin assignments. Very useful when planning your PCB routing. We will use the export feature to get a list of all the pins, numbers and labels.
The pins tool generates a CSV file that you can bring into your favorite editor. Not only do I get the pin/ball numbers, I get all of the IO options available via the MCU pin mux.
Altium Designer requires a few extra columns of meta-data to be able to import the data into a grouping of pins in the schematic library editor. At this point you could group the pins to your personal preference. I personally like to see all pin functions on the schematic, but this does create rather large symbols. The good news here is that by using MCUXpresso and Altium you can make this a 10-minute job, not a 3-hour one. Imagine going through the reference manual line by line!
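If you would rather script the massaging step than do it in a spreadsheet, here is a minimal sketch of the idea in Python. Everything about the file layout is an assumption on my part: the input column names (`Pin`, `Pin name`, `Functions`), the output column names, and the sample pins themselves are invented for illustration, so check them against your own pins tool export and against what your library editor expects before relying on it.

```python
import csv
import io

# Hypothetical excerpt of a pins tool export. The column names and the
# pin data are invented for illustration only.
SAMPLE_EXPORT = """Pin,Pin name,Functions
A1,PIO_A,GPIO;UART_TXD;SPI_MOSI
A2,PIO_B,GPIO;ADC_CH0
B1,VDD,
"""


def to_import_rows(export_text, x=0, y_step=-100):
    """Turn exported pin rows into rows carrying extra symbol meta-data."""
    rows = []
    reader = csv.DictReader(io.StringIO(export_text))
    for index, pin in enumerate(reader):
        functions = [f for f in pin["Functions"].split(";") if f]
        # Label the pin with its name plus every pin mux option.
        label = "/".join([pin["Pin name"]] + functions)
        rows.append({
            "Object Kind": "Pin",
            "Designator": pin["Pin"],
            "Name": label,
            # Crude heuristic (an assumption): no mux options means a supply pin.
            "Electrical Type": "IO" if functions else "Power",
            "X": x,
            "Y": index * y_step,  # stack the pins vertically
        })
    return rows


def to_csv_text(rows):
    """Serialize the rows back to CSV text for pasting or importing."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = to_import_rows(SAMPLE_EXPORT)
print(to_csv_text(rows))
```

The point is only that the export already contains every pin number and mux option, so the extra columns can be generated mechanically instead of typed by hand.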
Voilà! A complete symbol. It just took a few minutes of massaging to get what I wanted. Like I stated previously, a 98 pin package is not that bad to do manually but you can imagine a 200 or 300 pin part (such as the i.MX RT!)
The VFBGA package is 7mmx7mm with a 0.5mm pitch. There are balls removed from the grid for easier route escaping when using this part with lower cost fabrication processes.
Once again, with a quick look at NXP documentation and using the Altium IPC footprint generator, we can make quick work of getting an accurate footprint.
The IPC footprint generator steps you through the entire process. All you need is the reference drawing.
A quick note about the IPC footprint tool in this use case. The NXP VFBGA has quite a few balls removed to allow for easier escaping. The IPC footprint generator can automatically remove certain regions, but I found that this particular arrangement needed a few minutes of hand work to delete the unneeded pads given the unique pattern.
By using Altium and NXP’s MCUXpresso tool together, I was able to get my CAD library work done very quickly. And because I spent some time with the design tools, I became more familiar with the IO’s and physical package. This really helps get the brain primed for the real design work.
At this point in the process I have a head start on the schematic entry and PCB layout. Next time we are going to dive in a bit to see what connections we need to bootstrap the LPC55S69 to get it up and running. We will take a look at some of the core components to get the MCU to boot and some peripheral functions that will help the Mini-Monkey come alive!
Now that we have discussed the LPC5500 series at a high level and investigated some of the cool features, it is time to roll up our sleeves and work on some real hardware. In this next series of articles, I want to step through a simple hardware design using the LPC55S69. We are going to step a bit beyond the application notes and go through a simple design using Altium Designer to implement a simple project.
Many new projects start with development boards (such as the LPC55S69-EVK) to evaluate a platform and to take a 1st cut at some of the software development work. Getting to a form-factor compliant state quickly can be just as important as the firmware efforts. Getting a design into a manufacturable form is a very important step in the development process. With new hardware, I like to address all of my “known unknowns” early in the process so I almost always make my own test PCBs right away. The LPC5500 series devices are offered in some easy to use QFP100 and QFP64 packages. Designers also have the option of a very small VFBGA98 package. Many engineers flinch when you mention BGA, let alone a “fine pitch” BGA. I hope to show you that it is not as bad as you may think and one can even route this chip on 2 layers.
Figure 1. The LPC55S69 VFBGA98 Package. QFP100 comparison on the bottom.
The LPC55S69 is offered at an attractive price but packs a ton of functionality and processing power into a very small form-factor that uses little energy in both the active and sleep cases. Having all of this processing horsepower in a small form-factor can open new opportunities. Let’s see what we can get done with this new MCU.
The “Mini-Monkey” Board
In this series of “how to” articles, I want to step through a design with the LPC55S69 in the VFBGA and *actually build something*. The scope of this design will be limited to some basic design elements of bringing up an LPC55S69 while offering some interesting IO for visualizing signal processing with the PowerQuad hardware. Several years ago, I posted some projects on the NXP community using the Kinetis FRDM platform. One of the projects showcased some simple DSP processing on an incoming audio signal.
The “Monkey Listen” project used an NXP K20D50 FRDM board with a custom “shield” that included a microphone and a simple OLED display. For this effort I wanted to do something similar except using the LPC55S69 in the VFBGA98 package with some beefed-up visualization capabilities. There is so much more horsepower in the LPC55S69 and we now have the potential to do neat applications such as real time feature detection in an audio signal, etc. Also, given the copious amounts of RAM in the LPC55S69, I wanted to step up the game a bit in the display. The small VFBGA98 package presents an opportunity to pack quite a bit into a small space. So much has happened since the K20D50 hit the street!
I recently found some absolutely gorgeous IPS displays with a 240x240 pixel resolution from buydisplay.com. They are only a few dollars and have a simple SPI interface. I wired a display to an LPC55S69-EVK for a quick demonstration:
Figure 2: The LPC55S69-EVK driving the 240x240 Pixel 1.54” IPS display.
It was difficult for me to capture how beautiful this little 1.54” display is with my camera. You must see it to believe it! Given the price I figured I would get a boxful to experiment with for this design project!
Figure 3: 240x240 Pixel 1.54” IPS display from buydisplay.com
The overarching design concept with the “Mini-Monkey” is to fit a circuit under the 1.54” display that uses the LPC55S69 with some interesting IO:
LIPO Battery and Charger circuitry
Digital MEMS microphone
Access to the on-chip ADC
I want to pack some neat features beneath the screen that can do everything the “Monkey Listen” project can, just better. With access to the PowerQuad, the sky is the limit on the kinds of audio processing that can be implemented. The plan is to see how much we can fill up underneath the display to make an interesting development platform. I started a project in Altium Designer and put together a concept view of the new “Mini-Monkey” board to communicate some of the design intent:
Figure 4: The “Mini-Monkey” Concept PCB based upon the LPC55S69 in the VFBGA98 package
While this is not the final product, I wanted to give you an idea of where I was going. The “Mini-Monkey” will be a compact form factor board that can be used for some future articles on how to make use of the LPC5500 series PowerQuad feature. There will be some extra IO made available to enable some cool new projects to showcase the awesome capabilities of the LPC55S69. Got some ideas for the "Mini-Monkey"? Leave a comment below!
In the next article we will be looking at the schematic capture phase and how we can use NXP’s MCUXpresso SDK to help automate some of the work required in Altium Designer. I will be showing some of the basic elements to getting an LPC55S69 design up and running from scratch. We will then look at designing with the VFBGA98 package and get some boards built. I hope I now have you interested, so stay tuned. In the meantime, check out this application note on using the VFBGA package on a 2-layer board:
An absolute gem in the LPC family is the “State Configurable Timer” (SCT). It has been implemented in many LPC products and I feel it is one of the most under-rated and often misunderstood peripherals. When I first encountered the SCT, I wrote it off as a “fancy PWM” unit. This was a mistake on my part as the SCT is an extremely powerful peripheral that can solve many logic and timing challenges. I have personally been involved in several design efforts where I could remove the need for an additional programmable logic device on a PCB by taking advantage of the SCT in an LPC part. At its core, the SCT is an up/down counter that can be sequenced with up to 16 events. The events can be triggered by IO or by one of 16 possible counter matches. An event can then update a state variable, generate IO activity (set, clear, toggle), or start/stop/reverse the counter.
Consider an example which is similar to a design problem I previously used the SCT for.
Given a 1 cycle wide Start input signal:
i.) Assert a PowerCtrl signal on the 3rd Clk cycle after the start.
ii.) 2 Clk cycles after the assertion of PowerCtrl, output exactly 2 pulses on the Tx output pin at a programmable period.
iii.) 5 Clk cycles after ii.), de-assert PowerCtrl.
iv.) 2 Clk cycles after the de-assertion of PowerCtrl, output a 1 cycle pulse on the Complete pin.
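To make the sequence concrete, here is a small behavioral model in Python. It is not SCT register configuration; it only mimics the idea of counter-match events that set or clear outputs. I am filling in a few details the requirements leave open, so treat these as assumptions: Start arrives at count 0, each Tx pulse is 1 Clk cycle wide, and step ii.) is considered finished at the falling edge of the second Tx pulse. The names `build_events` and `simulate` are made up for this sketch.

```python
def build_events(tx_period):
    """Return (match_count, output, level) tuples, counted in Clk cycles
    from the Start pulse at count 0."""
    if tx_period < 2:
        raise ValueError("tx_period must be at least 2 for two distinct pulses")
    power_on = 3                 # i.)  3rd Clk cycle after Start
    tx1 = power_on + 2           # ii.) first Tx pulse, 2 cycles later
    tx2 = tx1 + tx_period        #      second Tx pulse, one period later
    tx_done = tx2 + 1            #      falling edge of the second pulse
    power_off = tx_done + 5      # iii.) 5 cycles after ii.) completes
    complete = power_off + 2     # iv.) 2 cycles after PowerCtrl drops
    return [
        (power_on, "PowerCtrl", 1),
        (tx1, "Tx", 1), (tx1 + 1, "Tx", 0),
        (tx2, "Tx", 1), (tx2 + 1, "Tx", 0),
        (power_off, "PowerCtrl", 0),
        (complete, "Complete", 1), (complete + 1, "Complete", 0),
    ]


def simulate(tx_period):
    """Step a counter and apply each event when its match value is reached."""
    events = build_events(tx_period)
    cycles = events[-1][0] + 1
    outputs = {"PowerCtrl": 0, "Tx": 0, "Complete": 0}
    trace = {name: [] for name in outputs}
    for count in range(cycles):
        for match, name, level in events:
            if match == count:
                outputs[name] = level
        for name, level in outputs.items():
            trace[name].append(level)
    return trace


trace = simulate(tx_period=4)
for name, levels in trace.items():
    print(f"{name:>9}: {''.join(str(level) for level in levels)}")
```

The whole behavior needs only eight match events in this model, which is comfortably inside the 16 events the SCT offers.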
This task could be done in pure software if the incoming CLK was slow enough. Most timer/counter units in competing MCUs would not be able to implement this particular set of requirements. In my use case (an acoustic transmitter), I was able to implement this completely in the SCT with minimal CPU intervention and no external circuitry. This is a scenario where I might otherwise consider an external CPLD or FPGA, but the SCT is more than capable of implementing the behavior. I highly recommend grabbing the manual for the LPC55 family and reading chapter 24. If you have never used a peripheral like the SCT, it is well worth learning about.
Programmable Logic Unit
In addition to the SCT, there is a small amount of programmable logic in the LPC55 family. The PLU is an array of twenty 5-input look-up tables (LUTs) and four flip-flops. From the external pins of the LPC55xx, there are 6 inputs to the PLU fabric and 8 outputs. While this is not a large amount of logic, it is certainly enough to replace some external glue logic you might have in your design. There is even a free tool to draw your logic schematically or describe it using the Verilog HDL.
I often find I need just a handful of gates in a design to glue a few things together and the PLU is the perfect peripheral for this need.
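If you have not worked with LUTs before, the idea is simple: a 5-input LUT is just a 32-entry truth table, so any logic function of five signals fits in one 32-bit word. Here is a minimal Python sketch that builds that word from a logic expression. The bit ordering (input 0 as the least significant bit of the table index) is my assumption for illustration, not something taken from the PLU documentation, and `lut_truth_table` is a made-up helper name.

```python
def lut_truth_table(func, inputs=5):
    """Evaluate func for every input combination and pack the results
    into an integer, one bit per combination (assumed ordering: input 0
    is the least significant bit of the table index)."""
    table = 0
    for index in range(1 << inputs):
        bits = [(index >> n) & 1 for n in range(inputs)]
        if func(*bits):
            table |= 1 << index
    return table


# Example glue logic: out = (in0 AND in1) OR in2, with in3 and in4 unused.
glue = lut_truth_table(lambda a, b, c, d, e: (a and b) or c)
print(f"0x{glue:08X}")

# A 5-input AND only fires for the last table entry.
and5 = lut_truth_table(lambda a, b, c, d, e: a and b and c and d and e)
print(f"0x{and5:08X}")
```

Unused inputs simply repeat the same pattern across the table, which is why small functions still occupy a whole LUT.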
LPC Boot ROM
Another indispensable feature that has been in the LPC series since the beginning is a bootloader in ROM. For me, it is a must have as it means I can program/recover code via one of many interfaces without a JTAG/SWD connection. For factory/production programming and test, it saves quite a bit of hassle. The boot ROM allows device programming over SPI, UART, I2C or USB. I typically use the UART or USB interface with FlashMagic. This feature has certainly benefited me on *every* embedded project, especially when it comes to production programming and test. There have even been some handy times to recover a firmware image in the field. Many designs include some sort of bootloader and having an option that is hard coded in ROM is a great benefit that you get for free in the LPC family.
It is difficult to capture all the benefits of the new LPC55 family, but we hope you are interested. The LPC55 family is offered in many convenient IC packages, is low power (both active and sleep) and is packed with useful peripherals. The LPC55S69 development board is available at low cost. Combining the low cost hardware tools with the MCUXpresso SDK, you can start LPC55 development today. From here we are going to start looking at some interesting how-to’s and application examples with the LPC55 family. Stay tuned and visit www.nxp.com/LPC55S6x to learn more.
One killer feature in some of the other LPC parts (for example the LPC4300 series and the LPC54000 series) is the *dual* USB interface. Dual USB enables some very interesting use cases and it is something that sets the LPC portfolio apart from its competitors. For the LPC5500 MCU series, High-Speed USB and Full-Speed USB with on-chip PHY features are fully supported, providing up to 480Mbit/s of speed. Let’s examine a scenario I commonly encounter.
In my projects, I like to have both USB device and USB host capabilities on separate connectors. Instead of using USB On-the-Go (OTG) with a single connector, it has been my experience that many deeply embedded and industrial projects benefit from separate connectors. Consider the arrangement in figure 1.
Figure 1: Dual USB with FAT File System, SDIO and CDC.
On the device side, I almost always implement a mass storage class device along with a communications class device. The mass storage interface is connected to the SDIO port through the FatFS IO layer so a PC can access sectors on the SD card. FatFS is my go-to library for embedded FAT file systems. It is open source and battle tested. While I choose to always pull the files from the author’s site, the MCUXpresso SDK has FatFS built in. With this arrangement, files can be easily copied between a PC and the LPC5500 system. Data logging and configuration storage are now built into your application. The CDC interface can provide a virtual COM port interface to implement a basic shell.
I use the USB host port for mass storage as well. Like the SDIO interface, I connect the host drivers (examples in the MCUXpresso SDK) through the FatFS IO layer so my system can read and write files on a thumb drive. One very useful application in my projects is a secondary bootloader. There have been several products I have worked on that required field updatability, but the users do not necessarily have access to a PC.
To update the system, data files and new firmware can be placed on a thumb drive and inserted into the LPC5500 system. A bootloader can then perform the necessary programming to update the internal flash. In addition to firmware updates, the host port could also be used to copy device configuration information. A technician would just carry a USB “key” to update units. Having both USB device and host using the two LPC55S69 USB interfaces can unlock many benefits.
With the SDIO interface and USB host, one is not limited to the more common SD cards and thumb drives. There are other options for more robust physical interfaces. Instead of a removable SD card, a soldered down eMMC can be used. For the USB host interface, there are rugged “DataKey” options available. Also note that the DataKeys come with an SDIO interface as well.
Figure 2: Rugged Memory Options. DataKey (Left) and eMMC (Right)
One last tidbit is that the SDIO interface can also be used to connect to many high speed Wi-Fi chipsets. It is an option that is easy to forget about.
Copious amounts of RAM
While I certainly came up in a time where RAM was sparse, I love having access to a large amount of it. At 320KB of RAM, there is no shortage of RAM in the LPC55S69! Relating to the USB and file storage application, large RAM buffers can be important for optimizing transfer speeds. It is common to write SD cards and thumb drives in 512-byte blocks. This transfer size, however, is not always the optimum case for overall speed. The controller in the memory card has to erase internal NAND flash in much larger sector sizes, resulting in slow write performance. It has been my experience that queueing up data until I have at least 16KB can improve overall transfer speeds by up to an order of magnitude. In most of my use cases, I implement a software cache of at least 16KB to speed transfer of large files. Larger caches can yield better results. These file system caches can consume quite a bit of memory, so it is very helpful that the LPC5500 series has quite a bit of RAM available.
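The caching idea is easy to show in a few lines. Below is a minimal Python sketch of a write cache that queues small writes and only hands data to the storage layer in 16KB chunks. It illustrates the buffering strategy only; on the real target this would be C sitting between the application and the file system, and the `WriteCache` class and its `write_block` callback are hypothetical names I made up for the example.

```python
class WriteCache:
    """Queue small writes and pass them on in large chunks."""

    def __init__(self, write_block, cache_size=16 * 1024):
        self._write_block = write_block   # callback that does the real write
        self._cache_size = cache_size
        self._buffer = bytearray()

    def write(self, data):
        """Accept data of any size; forward only full cache-sized chunks."""
        self._buffer += data
        while len(self._buffer) >= self._cache_size:
            self._write_block(bytes(self._buffer[:self._cache_size]))
            del self._buffer[:self._cache_size]

    def flush(self):
        """Push out whatever is left, e.g. before closing the file."""
        if self._buffer:
            self._write_block(bytes(self._buffer))
            self._buffer.clear()


# Stand-in for the storage layer: just record what would be written.
writes = []
cache = WriteCache(writes.append)

# One hundred 512-byte records, as a data logger might produce.
for _ in range(100):
    cache.write(bytes(512))
cache.flush()

print([len(chunk) for chunk in writes])
```

In this run, one hundred 512-byte writes from the application turn into four writes at the storage layer, three full 16KB chunks plus the remainder pushed out by the final flush.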
Given the security features of the LPC55S69, the extra RAM can make integration of SSL stacks for IOT applications much simpler. One example is the use of WolfSSL for implementation of SSL/TLS. While it targets the embedded space, SSL processing can be complicated and require a significant amount of stack and heap. In one particular use case I had with an embedded IOT product, I needed 35KB of stack and about 40KB of heap to handle all of the edge cases when dealing with connections to the internet over TLS. The large reserve of RAM in the LPC55S69 easily allows for these larger security and encryption stacks.
Another use for the large memory capability is a graphics back-buffer. It would be simple to hook a high-resolution IPS display to the LPC55S69 and store a complete image back buffer in memory. For example, a 240x240 IPS display with 16-bit color depth would require 112.5 KiB of RAM! There is plenty of RAM left in the LPC55S69 for your other tasks. In fact, you could dedicate one of the CPUs in the LPC55S69 to handling all the graphics rendering. The copious amount of RAM enables neat applications such as wearables, industrial displays and compact user interfaces.
Figure 3. A 240x240 IPS Display with SPI Interface from BuyDisplay.com
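The back-buffer number is simple arithmetic, and it is worth having a quick helper around when you are comparing displays or color depths. A minimal Python sketch (the function name is just for illustration):

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Bytes needed to hold one full frame in memory."""
    return width * height * bits_per_pixel // 8


size = framebuffer_bytes(240, 240, 16)
print(size, "bytes =", size / 1024, "KiB")
```

For the 240x240 display at 16 bits per pixel this works out to 115,200 bytes, which is the 112.5 KiB quoted above.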
One other important aspect of the RAM in the LPC55S69 is its organization. It is intelligently segmented (with 272KB contiguous in the memory map) via a bus matrix to allow the Arm® Cortex®-M33 cores, PowerQuad, CASPER and DMA engine access to memory with minimal contention between bus masters.
Figure 4. LPC55S69 Memory Architecture.
The LPC5500 Series offers a lot in a small, low power package. The large amount of internal SRAM and the dual USB interface enable many applications and make development simpler. Stay tuned for part 3 of the LPC5500 series overview. I will be further examining some interesting peripherals in the LPC5500 series that set it apart from its competition.
Most of my life, programming and embedded microcontrollers have been a passion of mine. Over the course of my career I have gained experience on many different architectures including some that are very specialized for specific applications. Even with the current diverse market of specialized devices, I continue to find the general-purpose microcontroller market the most interesting. I believe this stems from how I first fell in love with computing. It can be traced back to the 7th grade when we were learning “Computer Literacy” with the Apple IIe computer. During the course, students learned how to code programs in the BASIC language. Projects spanned everything from simple graphics to printing and games. Simultaneous to that experience, I learned that my other 7th grade passion, playing the Nintendo™, was connected to the activities in computer literacy. Through a popular gaming magazine, I discovered that the chip that powered the Nintendo was the device that powered the computers at school, the venerable “6502”. That was the real moment of epiphany. If a CPU could be both a gaming system and a word processor, it could really *do anything* I wanted. It wasn’t long before I was digging into the intricate details of the 6502 to power my creations. The 6502 was my 1st general purpose CPU.
Fast forward 30 years … The exact same principle applies today. We have an incredible amount of power in small packages. There is a lot you can accomplish with seemingly little. I am always on the lookout for new parts that may appear to be “vanilla” on the surface but have some hidden gems that really help me accomplish cool projects. The NXP LPC5500 series really appealed to my sensibilities as I immediately saw features that make it relevant to today’s design challenges. In the coming weeks I want to highlight some features of the LPC5500 series. This is not intended to be an all-encompassing review of the LPC5500 series, but I hope to hit on some highlights that could be beneficial to your design challenges. In this article we are going to focus a bit on the LPC55S69 device and its core platform. There is a lot under the hood!
First – It is actually 4 processors in 1!
From the block diagram in figure 1, one can see that there are two Arm® Cortex®-M33 cores. This by itself is an extremely useful feature given the low cost and low active power aspects of this device. I have made good use of the other LPC families with asymmetric cores (such as the LPC43xx device with a Cortex®-M4 and -M0). Having a 2nd core is very useful in offloading common tasks. In my experience with the LPC43xx, I used the Cortex®-M0 as a dedicated graphics co-processor to offload UI tasks from the Cortex®-M4 while it was doing other time critical DSP operations.
In the case of the LPC55S69, both cores are Cortex®-M33. The Cortex®-M33 is a new offering from Arm based upon the Armv8-M instruction set architecture. Like the Cortex-M4, it has hardware floating point and DSP instructions but also includes TrustZone. TrustZone enables new security states to ensure your critical code can be protected. Another notable new feature is a co-processor interface for streamlining integration with dedicated co-processors. This feature is germane to the LPC5500 series as there are 2 coprocessors that we are about to talk about. You can learn more about the Cortex®-M33 here.
I can’t count the number of design scenarios where I wished I had an extra programmable CPU that could handle a task that might be extremely time critical but not actually need a lot of code space. For example, I have used OLED displays that have a non-standard I/O interface that needs to be bit-banged. It became a great opportunity to have the 2nd core do the work. You could even turn that 2nd core into a small graphics co-processor.
Figure 1. The LPC55S6x MCU Family Block Diagram
I mentioned four processors. So, where are the 3rd and 4th processors? Number three is hidden in the “DSP accelerator” block. The Cortex®-M4 core, upon which many other LPC microcontrollers are built, has DSP specific instructions that can accelerate certain math functions. I have given seminars at the Embedded Systems Conference on using the DSP instructions in a general-purpose CPU scenario. The LPC55S69 DSP accelerator (a.k.a. PowerQuad) is a separate core whose sole purpose is to accelerate DSP specific tasks. While PowerQuad is not a pure general purpose CPU, it can perform tasks that would significantly burden one of the Cortex-M33 cores. In many cases you can get a 10x improvement over conventional software implementations of certain algorithms. PowerQuad covers all the common use cases such as Fast Fourier Transforms (FFTs), IIR filters, convolution, trigonometric functions and matrix math. It has enough “brains” to do almost all the work so your main general purpose CPU(s) are free for other tasks. The PowerQuad is enabled by a very specific new feature in the Cortex-M33 (Armv8-M specifically) that allows coprocessors to be connected to the CPU through a simple interface. Data transfer to the coprocessor is low latency and can sustain a bandwidth of up to twice the memory interface to the processor.
Lastly, the 4th processor is another specialized core called “CASPER”. CASPER is a high performance accelerator that is optimized for cryptographic computations. At its core, CASPER is a dual multiply-accumulate-shift engine that can operate on large blocks of data. CASPER has special access to 2 blocks of RAM so data can be accessed in parallel. Applications of CASPER include accelerating cryptographic functions such as public key verification (i.e. TLS/SSL), hash computations or even blockchain. As CASPER is a general math engine, it is also possible to perform DSP operations in parallel with the PowerQuad. With a little bit of imagination, one could achieve quite a bit with minimal intervention from the general-purpose Cortex®-M33 cores.
Figure 2. PowerQuad (Left) and CASPER (right) Accelerators
While the PowerQuad and CASPER processing engines are not technically 3rd and 4th general-purpose cores, they can easily do the work that you might normally require of an entire CPU. We will be talking much more about these features in the future, but here is the key take-away:
The PowerQuad DSP and CASPER accelerators are powerful math engines that can allow you to number crunch at a rate similar to dedicated DSPs. All this while still reserving your general-purpose processors to handle other system tasks.
All of this functionality is delivered on a low power 40nm process technology packaged in approachable footprints at a low price point. Interested yet? I know I am!
EmSA recently released some updates to FAIM support on LPC84x devices in their popular Flash Magic tool. If you are using this unique feature of the LPC84x device series be sure to update to version 12.65 or later to get access to command line support and the latest fixes for some previous bogus errors/warnings that were appearing.
The 5V KE MCU series has been designed to maintain high reliability and robustness in harsh electrical noise environments primarily targeting white goods and industrial applications, but now extending its success to consumer applications where touch sensing, safety and motor control capabilities have become a “must have” in the embedded design.
The KE1xZ MCU family provides a highly scalable portfolio of robust 5V MCUs based on the Arm® Cortex®-M0+ running up to 72 MHz and supporting up to 256 KB flash with a complete set of analog/digital features. The 1-Msps ADC and FlexTimer modules provide a perfect set of interfaces for BLDC motor control systems. To complement the high integration of this family, the robust Touch Sense Interface (TSI) module provides a high level of stability and accuracy to any HMI System.
NXP’s touch solution helps accelerate time to market with pre-certified and tested hardware components, an optimized software environment and easy-to-use configuration tools. This solution combines specialized touch software with the TSI available on the KE15Z/16Z MCUs, along with a complete set of tools enabling designers to easily add touch to user interface applications including home appliances, smart buildings, machines for industrial control and more.
You can start your design with the FRDM-KE16Z Freedom development board which is designed to work in standalone mode or as the main board for the FRDM-TOUCH module or the FRDM-MC-LVBLDC, the Freedom Development Platform for Low-Voltage 3-Phase BLDC Motor Control, as well as any Arduino® board. This Freedom board is compatible with DC 5V and 3.3V power supply and features the KE16Z MCU, a device extending the main capabilities of the KE MCU series, but with smaller memory footprint options for broad scalability. In addition, the onboard interfaces include an RGB LED, a 6-axis digital sensor, a 3-axis digital angular rate gyroscope, a temperature sensor and the CAN control and Touch Sensing interfaces supported by the KE MCU series.
For more information about the KE1xZ MCU family, click here.
NXP understands that, in addition to offering breakthrough innovations, its ongoing investment and commitment to longevity is critical to being your trusted supplier. This, paired with the continued demand and broad market use of NXP’s MCU portfolio, makes it a priority for us to extend the longevity1 on the following parts/families by an additional five years.
LPC5500 Series MCUs
LPC54000 Series MCUs2
LPC4300 Series MCUs
LPC4000 Series MCUs
LPC800 Series MCUs
LPC1100 Series MCUs2
LPC1800 Series MCUs
i.MX RT1010 Crossover MCUs
i.MX RT1015 Crossover MCUs
i.MX RT1020 Crossover MCUs
i.MX RT1050 Crossover MCUs
i.MX RT1060 Crossover MCUs
These additions join the many already extended from 10 to 15 years, which include:
Inspired by the passion to make the devices we interact with every day smarter, NXP drives to deliver a product to resolve your challenges, while also helping to adapt to future needs. With hundreds of devices included as part of this product longevity extension, we look forward to building a bold future together.
Based on the ultra-low-power Arm® Cortex®-M0+ core, the LPC800 MCU series offers a range of memory, small footprint and low-pin options for basic microcontroller applications. By being fully compatible with the Cortex-M architecture and instruction set, this series efficiently handles 32-bit data, requiring less code, less memory and 30% less dynamic power, outperforming 8- and 16-bit MCUs.
Unique among low-end devices of this kind, the LPC800 MCU series includes a set of differentiated product features, such as an NFC communication interface, a switch matrix for flexible configuration, a programmable logic unit to create combinational/sequential logic networks, and capacitive touch and level shifting options, along with significant mixed signal integration, bundled in power optimized and cost-effective microcontrollers.
The LPC800 MCU series is supported by NXP’s MCUXpresso software and tools, including an open source software development kit (SDK), an easy-to-use integrated development environment (IDE), and a comprehensive suite of system configuration tools speeding your development time. In addition, the LPC800 series is supported by free example code bundles for each peripheral, giving 8- and 16-bit MCU users a fast transition into the 32-bit world.
You can start your design today with the different options provided for the LPC800 MCU series. As an example, the LPC804 MCU family has made available the LPCXpresso804 board for the easy evaluation of this family, providing capacitive touch capabilities and the Programmable Logic Unit (PLU) enablement. Also available is a kit comprising three boards: the LPC804 MCU board with an onboard debug probe, a PLU shield including LEDs, switches and oscillators for quick and easy design prototyping, and a five-button cap touch shield with LED indicators for each button.
The LPC5500 MCU series, the world’s first Arm® Cortex®-M33 based MCU series for the mass market, has introduced new levels of performance efficiency, advanced security and functionality in the MCU space. With up to 150MHz of core frequency (dual-core options included), 32uA/MHz of active power consumption, and tightly coupled accelerators for signal processing and cryptography, the LPC5500 MCU series has become a key player in consumer electronics, building control and automation, secure applications and Industrial IoT markets.
This series continues its expansion with its recent launch of the LPC552x MCU family, providing a perfect balance between performance efficiency, security and system integration for general embedded and IoT applications. The LPC552x MCU family combines the high performance efficiency of the Cortex-M33 core with multiple high-speed interfaces including USB HS & FS and SDIO, an integrated power management IC, and rich analog integration leveraging the cost-effective 40nm NVM process technology. This family is pin compatible with the previously launched flagship LPC55S6x MCU family.
To get started with the LPC552x MCU family now, NXP has made available the LPCXpresso55S28, providing an ideal platform for the evaluation and the development of your projects leveraging the LPC552x/S2x technology. This board also includes a high-performance debug probe, NXP’s MMA8652FCR1 accelerometer and several options for adding off-the-shelf add-on boards for sensors, connectivity, displays and other interfaces. The LPCXpresso55S28 is fully supported by the MCUXpresso software and tools, providing middleware, drivers and examples of the different features available in the LPC552x/S2x MCU family so you can take your development rapidly to the market.
Building upon the market success and broad adoption of the Kinetis MCU portfolio, the K32 L3 MCU family leverages the combination of high-efficiency and low-power capabilities of the Arm® Cortex®-M4 while adding another Cortex-M0+ providing new enhancements such as low-leakage power-optimized peripherals, a DC-DC converter, numerous serial communication interfaces and up to 1.25MB flash and 384KB of SRAM memory.
With low-power optimizations featuring exceptional sleep currents as well as fast wake up times and advanced security features including physical tamper detection, authenticated boot and crypto acceleration engines the K32 L3 MCU family addresses the requirements needed in a wide range of low-leakage industrial and IoT applications.
The K32 L3 MCU family is complemented by a comprehensive ecosystem including the MCUXpresso software and tools as well as support from different third-party toolchains.
If you want to know more and start your design today, NXP has developed the FRDM-K32L3A6 Freedom development platform based on the K32 L3 MCU family. This board consists of the K32L3A6 device with a 32-Mbit external serial flash, the FXOS8700 accelerometer/magnetometer as well as visible light sensors, an SDHC circuit, LEDs and general-purpose push buttons in the popular Freedom form-factor.
Focusing on low-power and fast wake up times, the K32 L2 MCU family – based on the Arm® Cortex®-M0+ - targets power-conscious end nodes and can enable a wide range of general purpose industrial and IoT applications. Today, the K32 L2 MCU family is being brought to market in a scalable set of packages, core performance and memory configurations, from 64KB to 512KB flash, and backed by unmatched enablement, led by NXP’s complementary suite of MCUXpresso software and tools with example projects utilizing IAR, Keil, and GCC based toolchains.
The peripheral set is complemented by high precision mixed-signal interfaces including a configurable 16-bit ADC, a DMA-addressable 12-bit DAC, high-speed comparators and a good variety of serial peripheral interfaces that enable low-power operation modes and fast wake-up times.
So you can start your development now, NXP has made the low-cost FRDM-K32L2B3 Freedom development board available. This board provides easy-access to the main interfaces of the K32 L2B with additional features such as onboard debug probe, segment LCD, and an NXP-based accelerometer/magnetometer.
Innovation means being first. And with more than 20 years of firsts, NXP’s MCU portfolio has grown into a powerhouse of more than 200 Arm®-based MCU families scaling from low-power Cortex®-M0+ to high performance Cortex®-M7 crossover MCUs.
Being the first to license the Arm core in 2002 set the path for many innovations to follow, including:
NXP MCUs have a long legacy of ground-breaking expertise, revolutionizing the MCU landscape time and time again. But this legacy hasn’t been the result of just one person – it’s truly taken a village. From engineers, designers, to partners, distributors, customers and consumers – all of us have had a hand in writing this story. Our collective experiences and input have enabled the team to turn yesterday’s idea into today’s reality, while nurturing tomorrow’s possibilities – with a goal to help you win time and time again.
Defined by exceptional ease of use, design flexibility, advanced integration and unmatched enablement, we’re dedicated to providing you the edge in quality, selection, and price. NXP MCUs offer you the perfect platform to plan for market leadership. And together, we are reimagining our world …
What exciting project are you working on? Tell us more!
Searching for a low-power, well integrated but cost sensitive MCU for your embedded designs? Look no further – the K32 L2 MCU family is for you!
Got a Battery-Operated Application, No Problem
Based on the Arm® Cortex®-M0+ core, the K32 L2 MCU family delivers a unique balance of low-power/low-leakage design with numerous standby and stop low-power modes; memory scalability going from 64KB to 512KB of Flash memory and a high precision mixed-signal integration featuring a high-resolution 16-bit ADC with single and differential pair input mode options and a 12-bit DAC with DMA support to offload the core and leave it available to perform other important tasks.
The K32 L2 MCU family is complemented by a set of low-power serial peripherals supporting asynchronous operation in low power modes, SPI interfaces, I2C, USB full-speed 2.0 supporting crystal-less operation and NXP’s proprietary FlexIO module supporting emulation of additional UART, SPI, I2C, PWM and other serial modules.
The K32 L2B family, which will be available mid-November 2019, features an energy efficient Cortex-M0+ running up to 48MHz, and a rich suite of timers, analog and communication interfaces with options of scaling from 64KB to 256KB of flash memory. Additional features, such as a segment LCD interface supporting up to 24x8 or 28x4 segments have been integrated to address different applications in the consumer, IoT and industrial spaces.
Figure 1. K32 L2B Block Diagram
The K32 L2A MCU family, coming January 2020, will feature a high-speed Arm Cortex-M0+ running up to 72MHz, flash memory options scaling from 256KB to 512KB and 128KB of SRAM. Key differentiating items include a low-power hardware touch sensing interface, a 16-bit ADC with 24 channels and a cryptographic acceleration unit supporting different algorithms such as DES, 3DES, AES, MD5, SHA-1 and SHA-256.
Figure 2. K32 L2A Block Diagram
Efficient and General Purpose for a Range of Smart Home and Industrial IoT Applications
With memory scaling from 64KB to 512KB and CPU speeds from 48 to 72MHz, the K32 L2 MCU family can fit a wide range of applications, such as small appliances, smart/e-locks, handheld meters, electronic scales and lighting control. If your system will be dormant most of the time, waiting for an interrupt to power up the device, you can leverage the sub-200nA leakage in stop mode or the sub-2uA sleep mode, with instantaneous wake-up times, to get your system up and running quickly. These features, along with package options ranging from 5x5mm QFN32 to 14x14mm LQFP100, allow for easy project migration leveraging a common architecture and software framework.
Accelerate Your Time to Market
The entire K32 L2 MCU family is supported by NXP’s MCUXpresso software and tools bringing together the best of NXP’s software enablement by providing free and easy-to-use integrated development environment (IDE), an open source software development kit (SDK) with examples created specifically for the MCU of your choice and a comprehensive suite of system configuration tools that include pins, clocks and peripherals. On top of that, the K32 L2 MCU family will also be supported by the major toolchains including IAR Embedded Workbench® IDE and Arm’s Keil Microcontroller Development Kit.
For easy prototyping and evaluation of the K32 L2 MCU family, hardware development kits will be available for each of the sub-families. The K32 L2B MCUs will be supported by the low-cost FRDM-K32L2B3 development platform. A getting started guide, a set of demos and examples will be available at launch to help you kick-off your design and accelerate your time to production.
LPC845 Breakout Board now has an SDK package available! We are working on updating our getting started information to show how to use this rather than starting from the LPC845 chip SDK. The board is called LPC845BREAKOUT in the SDK builder.
We used the board to teach a class on how to create a custom SDK for your own board. The class got several thumbs up at our recent Seattle Tech Day and Santa Clara Connects events. The materials are here:
Amazon Web Services has released a preconfigured FreeRTOS example for Armv8-M and the NXP LPCXpresso55S69 board. With the addition of board- and device-specific examples, it is even easier to start and use the Arm® TrustZone® features combined with MPU (Memory Protection Unit) on the NXP LPC55xx MCU.
The LPCXpresso55S69 is an ideal development board for evaluating the Arm Cortex®-M33 architecture and security features. The core platform features two Arm Cortex-M33 cores running up to 100 MHz.
With the Arm TrustZone approach to divide into a 'secure/trusted' and 'unsecure/not-trusted' world, it is possible to effectively protect sensitive code and data, such as secure bootloaders, key and encryption management and trusted applications on the 'secure' side, with the ability to run other functionality (for example third-party applications or middleware) at a lesser security level.
FreeRTOS with NXP MCUXpresso IDE and SDK
FreeRTOS can be configured at compile time to run either on the secure side or on the non-secure side. When FreeRTOS is run on the non-secure side the tasks (or threads) can call secure-side trusted functions that, in turn, can call back to non-secure functions, all without breaching the kernel’s prioritized scheduling policy. That flexibility makes it possible for application writers to create non-secure FreeRTOS tasks that interact with trusted secure-side firmware.
Setting up security adds some extra complexity, and having these examples available in the FreeRTOS mainline release will help you to add security and TrustZone features to your next LPC55xx MCU design.
ARM TrustZone is an optional security feature for the Cortex-M33 that is intended to improve the security of embedded applications running on microcontrollers such as the NXP LPC55S69 (dual-core M33) on the LPC55S69-EVK.
NXP LPC55S69-EVK Board
As with anything, using and learning the TrustZone feature takes some time. ARM provides documentation on TrustZone, but it is not easy to apply it for an actual board or toolchain. The NXP MCUXpresso SDK comes with three examples for TrustZone on the LPC55S69-EVK, so I have investigated these examples to find out how it works and how I can use it in my application.
Windows 10 with MCUXpresso IDE 10.3.1 (Eclipse based with GNU toolchain for ARM Embedded)
MCUXpresso SDK V2.5.1 for LPC55S69
Most of the things presented in this article are applicable to any other Cortex-M33 environment with TrustZone.
TrustZone on ARMv8-M
As in ARMv7-M, there are two basic modes the processor can be in:
Thread Mode: this mode is entered on reset and is the usual mode in which the application runs. Code in Thread Mode can be executed privileged (full access) or unprivileged (with restrictions imposed, e.g. by an MPU (Memory Protection Unit)).
Interrupt or Handler Mode: code in this mode always executes at the privileged level, and this is where the interrupts are running.
TrustZone keeps that model and extends it. The basic concept of TrustZone on ARMv8-M is to separate the ‘untrusted’ from the ‘trusted’ parts on a microcontroller. With this division, IP inside the trusted side can be protected while still allowing ‘untrusted’ software to run on the ‘untrusted’ side of the world. The trusted and untrusted parts can have different privileges: for example, some hardware (GPIO ports, etc.) could be accessible only from the trusted side, but not from the untrusted one.
While without TrustZone it is already possible to restrict memory access with an MPU, the TrustZone concept with ‘secure world’ and ‘non-secure world’ extends the concept to ‘secure’ or ‘trusted’ hardware or peripheral access. A non-secure function only can access secure hardware through an API which verifies if it is allowed to access the hardware through the secure world. So there are ways that the secure and non-secure parts can work together.
Similar to using an MPU, it means that there are several things to consider:
Setting security permissions for memory areas and accessing peripherals
Using secure and non-secure API and transfer functions
Ability to protect the secure world from debugging or memory read-out (reverse engineering)
The other important change in the ARMv8-M architecture is that the size of an MPU region now has a granularity of 32 bytes. In ARMv7-M the size had to be 2^N, which I never understood and which makes it not usable at all in real world applications (this is probably the reason the MPU is rarely used?).
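To make the difference concrete, here is a minimal host-side sketch of the two sizing rules. The function names are made up for illustration, and the ARMv7-M check assumes the usual rule that a region is a power of two in size (at least 32 bytes) with its base aligned to that size; sub-regions are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M style (assumed rule): size is a power of two of at least
 * 32 bytes and the base address is aligned to the size. */
static bool mpu_v7m_region_fits(uint32_t base, uint32_t size)
{
    bool power_of_two = (size >= 32u) && ((size & (size - 1u)) == 0u);
    return power_of_two && ((base & (size - 1u)) == 0u);
}

/* ARMv8-M style: base and size only need to be multiples of 32 bytes. */
static bool mpu_v8m_region_fits(uint32_t base, uint32_t size)
{
    return (size != 0u) && ((base % 32u) == 0u) && ((size % 32u) == 0u);
}

/* Last byte covered by an ARMv8-M style region (hypothetical helper). */
static uint32_t mpu_v8m_region_limit(uint32_t base, uint32_t size)
{
    assert(mpu_v8m_region_fits(base, size));
    return base + size - 1u;
}
```

With these rules, a RAM block starting at 0x20008000 with a size of 0x2B000 bytes fits a single ARMv8-M style region, while the ARMv7-M style check rejects it because 0x2B000 is not a power of two.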
SAU and IDAU
Because this all cannot be only implemented in the core (provided by ARM), there are extra settings needed on the implementation side by the vendor implementing the ARM core.
Security Attribution Unit (SAU): this is inside the core/processor
Implementation Defined Attribution Unit (IDAU): this one is outside the processor
The SAU and IDAU work together and are used to grant/deny access to the system (peripherals, memory). Using the SAU+IDAU, the memory space gets separated into three kinds:
Secure: Code, stack, data, … of the secure world
Non-Secure: Code, stack, data, … of the non-secure world
Non-Secure Callable: Entry to secure code with a secure gateway vector table
The important (and somewhat confusing) thing is that both the SAU and the IDAU attribute every address, and the two results are combined with the more secure one winning:
Or in other words: by default things are secure, and an address only ends up non-secure if neither the SAU nor the IDAU marks it as more secure.
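The combination can be sketched in a few lines of C. This is only a model for reasoning on a host PC, not code for the target: the enum and function names are invented for illustration, and it assumes the Armv8-M rule that, of the SAU and IDAU results for an address, the more secure one applies (Secure above Non-Secure Callable above Non-Secure).

```c
#include <assert.h>

/* Ordered from least to most secure so that "more secure" is "greater". */
typedef enum {
    ATTR_NON_SECURE = 0,
    ATTR_NON_SECURE_CALLABLE = 1,
    ATTR_SECURE = 2
} security_attr_t;

/* Combine the SAU and IDAU results for one address: the higher
 * (more secure) of the two attributions is the one that applies. */
static security_attr_t combine_attribution(security_attr_t sau,
                                           security_attr_t idau)
{
    assert(sau <= ATTR_SECURE && idau <= ATTR_SECURE);
    return (sau > idau) ? sau : idau;
}
```

In this model, an address that no SAU region covers stays secure regardless of what the IDAU reports, which matches the "secure by default" behaviour.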
Time to have a look at an example! The NXP MCUXpresso SDK already comes with an example showing how to call the non-secure land from the secure one. From ‘Import SDK example(s)’ I can select examples demonstrating TrustZone.
The ‘hello_world’ TrustZone example executes some code on the secure side and finally passes control to the non-secure side to execute the non-secure application. The example follows the pattern of a secure bootloader then calling the non-secure application to start.
The ‘ns’ (non-secure) and ‘s’ (secure) projects work together. Using secure and non-secure application parts does not make things simpler, and there does not seem to be a lot of documentation about this topic. So I investigated that ‘hello world’ example to better understand how it works.
I have configured both to use the newlib (nano) semihost library:
For both projects, set the SDK Debug Console to ‘Semihost console’:
Setting Semihost Console
I have both the secure and non-secure projects configured for using the semihost console, but a real UART could be used too.
Both projects are configured to use the Cortex-M33 (this is a setting in the compiler and Linker):
M33 Architecture setting
The non-secure project is configured in the compiler and linker settings as ‘Non-Secure’:
TrustZone Project Settings
There is a setting to prevent debugging:
The non-secure application links in an object file which is part of the secure application:
Linking CMSE Lib Object File
This means that the ‘secure’ project has to be built first.
This is for the ‘secure gateway library’ which is built in the secure project using the --cmse-implib and --out-implib linker options:
The ‘--cmse-implib’ option requests that the import libraries specified by the ‘--out-implib’ and ‘--in-implib’ options are secure gateway import libraries, suitable for linking a non-secure executable against secure code as per ARMv8-M Security Extensions.
Secure gateway library linker command
The ‘hello_world_ns’ program is linked to address 0x10000: the vector table and code get placed at this address:
Non-Secure Memory Settings
On the secure side the compiler and linker settings for TrustZone are set to ‘secure’:
Secure Linker and Compiler Settings
The program and vector table are loaded at 0x1000’0000 with a ‘veneer’ table loaded at 0x1000’fe00. More about this later…
Secure Memory Allocation
The non-secure application can be flashed to the device like this:
Program to Flash
This basically is as if the new (non-secure) application has been programmed using a bootloader or similar way to update the application.
To be able to debug the second (non-secure) application from the secure one, I have to load its symbols in the debugger. The secure one can be debugged as usual:
Debug secure application
In order to debug the non-secure application code when debugging the secure one, I have to add the symbols to the debugger. I can do this by editing the debug/launch configuration. Double-click on the .launch file or open the debug configuration with Run > Debug Configurations, then use the ‘Edit Scripts’ in the Debugger tab:
Add the following to load the symbols of the other project using the add-symbol-file gdb command. Adapt the path as needed; I have the other project at the same directory level.
Non-Secure functions can only be called from the secure world using function pointers, as a result dividing the secure from t
Behind that function call there are several assembly instructions executed. It clears the LSB of the function address and clears the FPU Single Precision registers, or any registers which could contain ‘secret’ information. At the end it calls the library function __gnu_cmse_nonsecure_call:
non-secure call sequence
The __gnu_cmse_nonsecure_call pushes the registers, clears more registers and uses the BLXNS assembly instruction to finally enter the non-secure world:
So there are quite a few instructions to be executed to make that transition.
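The address handling before that transition can be illustrated with a small host-side sketch. It assumes the standard Cortex-M vector table layout (word 0 is the initial main stack pointer, word 1 is the reset handler) and models the non-secure image as a byte buffer; the type and function names are hypothetical, and the real example may obtain the entry point differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t initial_msp;   /* word 0: initial main stack pointer */
    uint32_t reset_handler; /* word 1: reset handler, bit 0 cleared */
} ns_entry_t;

/* Read a little-endian 32-bit word from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Extract the stack pointer and entry point from the start of a
 * non-secure image. The LSB of the entry address is cleared, which is
 * what the secure side does before branching with BLXNS. */
static ns_entry_t ns_entry_from_vector_table(const uint8_t *image)
{
    ns_entry_t entry;

    assert(image != NULL);
    entry.initial_msp = read_le32(image);
    entry.reset_handler = read_le32(image + 4) & ~UINT32_C(1);
    return entry;
}
```

On the target, the function pointer would additionally be declared with the cmse_nonsecure_call attribute so that the compiler emits the register clearing and the call to __gnu_cmse_nonsecure_call described above.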
Calling the Secure World from the non-secure World
Calling a secure function from the non-secure side uses an intermediate step (Non-secure Callable):
Calling secure Function from non-secure side (Source: ARM, Trustzone technology for ARMv8-M Architecture)
In the example the non-secure world is calling a printf function (DbgConsole_Printf_NSE) which is located in the secure world:
Calling printf from the non-secure world
The secure functions which are callable from the non-secure world have to be marked with the cmse_nonsecure_entry attribute:
CMSE stands for Cortex-M (ARMv8-M) Security Extension
Function with cmse_nonsecure_entry attribute
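As a rough sketch of what such an entry function can look like, the snippet below wraps the attribute in a macro so that the same source also compiles on a host PC, where the macro expands to nothing. The function SecureCounter_Add_NSE, the counter and MAX_STEP are invented for illustration and are not part of the SDK example; the check on __ARM_FEATURE_CMSE assumes the toolchain follows the Arm C language extensions, where the value 3 indicates a secure-side build.

```c
#include <assert.h>
#include <stdint.h>

/* Apply the attribute only when building secure code with CMSE support. */
#if defined(__ARM_FEATURE_CMSE) && (__ARM_FEATURE_CMSE == 3)
#define NONSECURE_ENTRY __attribute__((cmse_nonsecure_entry))
#else
#define NONSECURE_ENTRY /* host build: plain function */
#endif

#define MAX_STEP 16u
static_assert(MAX_STEP > 0u, "MAX_STEP must allow at least one step");

/* State that lives in the secure world only. */
static uint32_t secure_counter;

NONSECURE_ENTRY uint32_t SecureCounter_Add_NSE(uint32_t amount);

/* Entry function callable from the non-secure world. Arguments come
 * from untrusted code, so they are validated before use. */
NONSECURE_ENTRY uint32_t SecureCounter_Add_NSE(uint32_t amount)
{
    if (amount > MAX_STEP) {
        return secure_counter; /* reject out-of-range requests */
    }
    secure_counter += amount;
    return secure_counter;
}
```

The non-secure side only ever sees the veneer for SecureCounter_Add_NSE; the counter itself stays in secure memory.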
So how does the non-secure world know how to call this function? The answer is that the linker prepares everything to make it possible. For this the non-secure application has to link an object file (or ‘library’) with the ‘veneer’ functions:
Linking CMSE Lib Object File
This object file (or library) is created with the following linker setting on the secure side:
So let’s follow the code from the non-secure to the secure world: The assembly calls a ‘veneer’ function:
calling printf veneer
The veneer is simply a ‘trampoline’ function which loads the address for the ‘non-secure callable’ and does a BX to that address:
BX to non-secure callable
The ‘non-secure callable’ area is in the ‘secure world’ with an SG instruction as the first one to be executed, followed by a branch.
SG Instruction in non-secure callable region
The SG (Secure Gateway) instruction switches to the secure state followed by the B (Branch) instruction to the secure function itself:
Executing Secure Function
Compared to calling the non-secure side from the secure world, this was rather fast. The clearing of all the registers, because they can contain secret information, is done just before the BXNS returns to the non-secure state:
Clearing registers on return to non-secure state
So how is the protection configured? For this the SAU (Security Attribution Unit) is configured, which can only be done on the secure side.
The example uses the following secure and non-secure code and data areas:
#define CODE_FLASH_START_NS 0x00010000
#define CODE_FLASH_SIZE_NS 0x00062000
#define CODE_FLASH_START_NSC 0x1000FE00
#define CODE_FLASH_SIZE_NSC 0x200
#define DATA_RAM_START_NS 0x20008000
#define DATA_RAM_SIZE_NS 0x0002B000
In the example this is configured in BOARD_InitTrustZone(). The following setting configures a region for the non-secure FLASH execution:
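/* Configure SAU region 0 - Non-secure FLASH for CODE execution */
SAU->RNR = 0; /* Select SAU region 0 */
SAU->RBAR = (CODE_FLASH_START_NS & SAU_RBAR_BADDR_Msk); /* Region base address */
/* Region end address; NSC = 0 marks the region as not non-secure callable */
SAU->RLAR = ((CODE_FLASH_START_NS + CODE_FLASH_SIZE_NS - 1) & SAU_RLAR_LADDR_Msk) |
            ((0U << SAU_RLAR_NSC_Pos) & SAU_RLAR_NSC_Msk) |
            /* Enable region */
            ((1U << SAU_RLAR_ENABLE_Pos) & SAU_RLAR_ENABLE_Msk);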
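To see which register values these settings produce, the following host-side sketch computes the base and limit words for the three areas defined above. It does not touch any hardware. The mask and bit definitions are local stand-ins that assume the Armv8-M SAU layout (32-byte address granularity, ENABLE in bit 0 and NSC in bit 1 of the limit register); on the target the CMSIS definitions would be used instead. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CODE_FLASH_START_NS  0x00010000u
#define CODE_FLASH_SIZE_NS   0x00062000u
#define CODE_FLASH_START_NSC 0x1000FE00u
#define CODE_FLASH_SIZE_NSC  0x00000200u
#define DATA_RAM_START_NS    0x20008000u
#define DATA_RAM_SIZE_NS     0x0002B000u

/* Local stand-ins for the CMSIS SAU masks (assumed register layout). */
#define SAU_ADDR_MASK   0xFFFFFFE0u /* base/limit address, 32-byte steps */
#define SAU_RLAR_NSC    (1u << 1)   /* region is non-secure callable */
#define SAU_RLAR_ENABLE (1u << 0)   /* region is enabled */

typedef struct {
    uint32_t rbar; /* value for the region base address register */
    uint32_t rlar; /* value for the region limit address register */
} sau_region_t;

/* Compute the register values for one enabled SAU region. */
static sau_region_t sau_region(uint32_t start, uint32_t size, bool nsc)
{
    sau_region_t region;

    assert(size != 0u);
    region.rbar = start & SAU_ADDR_MASK;
    region.rlar = ((start + size - 1u) & SAU_ADDR_MASK) |
                  (nsc ? SAU_RLAR_NSC : 0u) | SAU_RLAR_ENABLE;
    return region;
}
```

For example, sau_region(CODE_FLASH_START_NSC, CODE_FLASH_SIZE_NSC, true) describes the veneer area: the limit word carries the last 32-byte block of the region plus the NSC and ENABLE bits.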
The IDAU (Implementation Defined Attribution Unit) is optional and is intended to provide a default attribution of the memory map (secure, non-secure and non-secure-callable). Its result is combined with the SAU settings, and for each address the more secure of the two attributions applies.
It probably will take me some more time to understand the details of the ARMv8-M security extensions. There are more details to explore such as secure peripheral access or how to protect memory areas. In a nutshell, it allows partitioning the device into ‘secure’/trusted and ‘unsecure’/not-trusted and divides the memory map into secure, non-secure and non-secure-callable with the addition of MPU and controlled access to peripherals. Plus there is the ability to control the level of debugging to prevent reverse engineering.
With the NXP MCUXpresso SDK and IDE plus the LPC55S69 board I have a working environment I can use for my experiments. I like the approach that basically the non-secure application does not need to know about the fact that it is running in a secure environment, unless it wants to call functions of the secure world.
I now have FreeRTOS working on the LPC55xx with the FreeRTOS port for M33, but I’m using it in the ‘non-secure’ world. My goal is to get the RTOS running on the secure side. Not sure yet what exactly this will look like, but that’s a good use case I want to explore in the next week if time permits.
The LPC55S69 is of special interest because it is one of the new ARM Cortex-M33 devices which implement the new ARM TrustZone security features: with this feature it is possible to run ‘trusted’ and ‘untrusted’ code on the same microcontroller.
With the SDK installed, I can quickly create a new project or import example projects:
The SDK V2.5.1 comes with a FreeRTOS V10.0.1 port which runs out of the box, using the M4 port.
Debugging FreeRTOS on LPC55S69
In the McuOnEclipse FreeRTOS port I’m already using FreeRTOS 10.2.0, so this is something I have to update soon too.
The IDE comes with the NXP MCUXpresso Configuration Tools integrated.
With the graphical configuration tools I can create pin muxing and clock configurations:
Secure and Non-Secure
The SDK comes with demos using secure + non-secure application parts. To make it easy, the projects have TrustZone settings for the compiler and linker:
TrustZone Project Settings
I have started playing with TrustZone, but this is subject of a follow-up article.
Dealing with an ARM Cortex-M33 multicore device for sure is a bit more complex than just using an old-fashioned single-core M0+. Because of the secure and non-secure features, it might be necessary to get things back into a clean state. So this is what worked best for me:
Have a non-secure and simple project present in the workspace. I’m using the ‘led_blinky’ from the SDK examples.
Power the board with the P5 USB connector (cable with the yellow dot) and debug it with the onboard LPC-Link2 connector (P6).
LPC55S69 Power and Debug
With that project selected, erase the flash using the action in the Quickstart Panel.
Erase Flash Using Linkserver
Select core 0 for the erase operation:
Select core for Flash Erase
This should work without problems. Press OK in the dialog:
At this point I recommend disconnecting and re-connecting the P6 (Debug) cable.
Now I can program the normal application again:
With this I have a working and known state for my experiments.
The Easter break is coming to an end and has been interesting, to say the least. The NXP LPC55S69-EVK is very appealing: the board is reasonably priced and with all the connectors it is a good way to evaluate the microcontroller. The most interesting thing is that it has a dual-core ARM Cortex-M33 with the ARM TrustZone implementation. To be able to run ‘trusted’ and ‘untrusted’ (e.g. user) code on the same device could be one of the standard models of microcontroller going forward, especially in the ‘internet of things’ area. So I think I have to explore this device and board and its capabilities in at least one follow-up article?