Stupid DMA tricks: Transforming data on the fly with the eDMA module

Discussion created by SCOTT MILLER on Aug 18, 2018

I finally got around to trying out an idea I've been kicking around for months and I thought I'd share my results.  I'm going to preface this by saying that it works, but I'm not going to guarantee that it doesn't have the potential for problems.  I haven't seen this done before and I'm not sure it was something the designers had in mind.


The Kinetis eDMA controller is a really powerful and flexible tool - and I've got a much greater appreciation for what it can do after having to work with Atmel's UC3 PDMA module, which is not nearly as capable.


The use case of sending raw buffered data to a peripheral is obvious and the implementation is simple.  But sometimes you don't want to just send data; you need to do a bit of transformation as well.  It turns out that thanks to the eDMA controller's ability to write directly to its own TCD registers, you can do that without CPU intervention after a bit of setup.


The application I'm working on is an LED controller.  It takes a buffer of 8-bit RGB values and generates a serial data stream for WS2812B-style asynchronous addressable LEDs.  The single-wire protocol used by the LEDs runs at 800 kbps and represents a 0 as a short pulse and a 1 as a longer pulse.  If you think of it as a 2400 kbps stream, 0 becomes 100 and 1 becomes 110.  The pulse durations given in the datasheet don't work out to exactly 1/3 and 2/3 of the bit time, but pulses of those lengths are still within the timing specs.  It's common to use SPI peripherals to drive these, but on the Kinetis K22FN, at least, the I2S/SAI module is a better fit.
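For anyone who wants the 3:1 encoding spelled out, here's a minimal sketch in C.  The function name is made up for illustration, and it assumes the most significant bit of each color byte goes out first.

```c
#include <stdint.h>

/* Hypothetical helper: expand one 8-bit color value into 24 bits of
 * line code, most significant bit first.  Each 0 bit becomes 100 and
 * each 1 bit becomes 110. */
uint32_t ws2812_expand_byte(uint8_t value)
{
    uint32_t out = 0;

    for (int bit = 7; bit >= 0; bit--) {
        out <<= 3;
        out |= ((value >> bit) & 1u) ? 0x6u : 0x4u;  /* 110 or 100 */
    }
    return out;  /* the result occupies bits 23..0 */
}
```

With this encoding, 0x00 comes out as 0x924924 and 0xFF as 0xDB6DB6.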


The SAI module has a decent-sized hardware FIFO and works with odd frame sizes and word alignments.  To use it for LED data, I normally run it with a 24-bit word size and then use a lookup table to convert 8 bits of source RGB data to 24 bits of SAI data.  For simplicity the lookup table uses 32-bit words, with the low byte ignored.  The application prepares a frame of LED data by iterating over the input buffer, running bytes through the lookup table, and storing the results in another, larger buffer.
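As a rough sketch of that CPU-driven approach (the names here are hypothetical, and it assumes the 24 data bits sit in the upper three bytes of each 32-bit word with the low byte unused):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-entry table: one 32-bit word per possible input byte. */
static uint32_t sai_lut[256];

/* Fill the table once at startup. */
void sai_lut_init(void)
{
    for (unsigned v = 0; v < 256; v++) {
        uint32_t code = 0;
        for (int bit = 7; bit >= 0; bit--)
            code = (code << 3) | (((v >> bit) & 1u) ? 0x6u : 0x4u);
        sai_lut[v] = code << 8;  /* 24 bits of line code, low byte unused */
    }
}

/* Expand a frame of raw RGB bytes into SAI words.  The output buffer
 * needs one 32-bit word per input byte. */
void sai_expand_frame(const uint8_t *rgb, size_t len, uint32_t *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = sai_lut[rgb[i]];
}
```

Laid out this way, the output buffer takes four bytes of RAM for every byte of source data.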


This works great, but it has the disadvantage of requiring a lot more RAM.  This controller needs to be able to handle thousands of LEDs (optionally split between two channels thanks to the dual-channel SAI*) and there's not enough RAM for the buffer that holds the 3:1 expanded bit pattern.


The solution I came up with goes like this: DMA channel 1 is set for 8-bit transfers and its source is the raw RGB buffer.  It's triggered by the SAI's DMA request.  DMA channel 2 has its source address initially set to the start of the 768-byte lookup table, which is aligned to a 256-byte page boundary.  Unlike the original lookup table, which packs each 24-bit entry into one 32-bit word, this table has the first 8 bits of each entry in the first page, the next 8 bits in the second page, and the last 8 bits in the third page.  That means entry 5, for example, is made up of the bytes at offsets [5], [0x105], and [0x205].
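Here's one way the paged table could be built, as a sketch rather than the actual project code.  The names are hypothetical, the alignment uses C11's `_Alignas`, and it assumes MSB-first encoding.

```c
#include <stdint.h>

/* Hypothetical paged table: three 256-byte pages, aligned so that the
 * low byte of any entry's address equals the input value itself. */
static _Alignas(256) uint8_t paged_lut[3 * 256];

void paged_lut_init(void)
{
    for (unsigned v = 0; v < 256; v++) {
        uint32_t code = 0;
        for (int bit = 7; bit >= 0; bit--)
            code = (code << 3) | (((v >> bit) & 1u) ? 0x6u : 0x4u);

        paged_lut[0x000 + v] = (uint8_t)(code >> 16);  /* first 8 bits  */
        paged_lut[0x100 + v] = (uint8_t)(code >> 8);   /* next 8 bits   */
        paged_lut[0x200 + v] = (uint8_t)code;          /* last 8 bits   */
    }
}
```

For entry 5 this puts 0x92 at offset [5], 0x49 at [0x105], and 0xA6 at [0x205].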

The first DMA channel's destination address is set to the low byte of the second channel's source address.  Channel 2 has a minor loop byte count of 3, a source offset of 256, and a source last adjustment offset of -768.


When the SAI requests data, channel 1 fetches the first byte of LED data, places it in the low byte of the source address in channel 2's TCD, and then links to channel 2 on completion.  I've got the system configured for round robin mode to try to avoid any issues with preemption, but I'm not sure it'd matter if the priorities are set right.


Channel 2 activates and fetches the first 8 bits of expanded data from the first page of the lookup table, and writes to the SAI FIFO.  As the minor loop continues it advances one full page, gets the next 8 bits, and so on.  When the major loop is complete, the source last adjustment of -768 rewinds it back to the first page.  The low byte of the source address is still whatever it was, but it'll get overwritten by the next activation of channel 1.
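The address arithmetic is easier to follow in a plain software model.  The sketch below runs on a PC and doesn't touch any hardware; the struct and function are hypothetical stand-ins for the relevant TCD fields, and the source address is modeled as an offset into the page-aligned table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the channel 2 TCD fields that matter here. */
struct model_tcd {
    uint32_t saddr;   /* source address, modeled as an offset into the table */
    int32_t  soff;    /* added to saddr after each byte read */
    uint32_t nbytes;  /* bytes moved per minor loop */
    int32_t  slast;   /* added to saddr when the major loop completes */
};

/* Model one SAI request per input byte.  Returns the number of bytes
 * written to the modeled FIFO (three per input byte). */
size_t model_dma_run(const uint8_t *rgb, size_t len,
                     const uint8_t lut[3 * 256], uint8_t *fifo)
{
    struct model_tcd ch2 = { .saddr = 0, .soff = 256, .nbytes = 3, .slast = -768 };
    size_t written = 0;

    for (size_t i = 0; i < len; i++) {
        /* Channel 1: one 8-bit transfer into the low byte of ch2.saddr. */
        ch2.saddr = (ch2.saddr & ~(uint32_t)0xFF) | rgb[i];

        /* Channel 2, linked from channel 1: read one byte from each page. */
        for (uint32_t n = 0; n < ch2.nbytes; n++) {
            fifo[written++] = lut[ch2.saddr];
            ch2.saddr += (uint32_t)ch2.soff;
        }

        /* Major loop done: rewind to the first page.  The low byte keeps
         * its old value until channel 1 overwrites it again. */
        ch2.saddr += (uint32_t)ch2.slast;
    }
    return written;
}
```

Feeding it the input byte 5 reads table offsets 0x005, 0x105, and 0x205 in that order, then leaves the modeled source address back at 0x005.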


One minor drawback of this scheme, at least for LEDs, is that it can't load a value not in the lookup table.  That means I can't buffer a trailing 0, which would silence the SAI's output and end the LED frame without CPU intervention.  That could be accomplished with another channel linked on major loop completion but I chose to just catch the completion with an interrupt and shove a 0 into the FIFO from there.


I'm keeping the lookup table in RAM for now partly for speed, but also because it gives me the ability to rewrite the table on the fly when the brightness setting is changed, and the resulting output will be scaled by the brightness factor automatically, again with no CPU time after the initial setup.
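A brightness-scaled rebuild of the table might look something like the sketch below.  The function names and the rounding formula are assumptions for illustration, not the actual code from this project; brightness 255 means full scale.

```c
#include <stdint.h>

/* Hypothetical helper: 8 bits in, 24 bits of line code out (0 -> 100, 1 -> 110). */
static uint32_t encode24(uint8_t value)
{
    uint32_t code = 0;
    for (int bit = 7; bit >= 0; bit--)
        code = (code << 3) | (((value >> bit) & 1u) ? 0x6u : 0x4u);
    return code;
}

/* Rebuild the three-page table so that entry v holds the line code for
 * v scaled by brightness/255 (rounded to nearest). */
void paged_lut_set_brightness(uint8_t lut[3 * 256], uint8_t brightness)
{
    for (unsigned v = 0; v < 256; v++) {
        uint8_t scaled = (uint8_t)((v * brightness + 127u) / 255u);
        uint32_t code = encode24(scaled);

        lut[0x000 + v] = (uint8_t)(code >> 16);
        lut[0x100 + v] = (uint8_t)(code >> 8);
        lut[0x200 + v] = (uint8_t)code;
    }
}
```

The raw RGB buffer never changes; only the 768 table bytes get rewritten when the brightness setting does.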


If it turns out that having DMA channels write directly to eDMA registers isn't kosher, I think the same thing could be accomplished by keeping working copies of the TCD in RAM and using the scatter/gather option to load the new settings on major loop completion.


This feels a little like my days of learning 6502 assembly and writing self-modifying code to make up for the lack of a 16-bit index register, but hey, it works.  Given the right lookup tables, it should be possible to make the eDMA module Turing-complete and have it execute its own instruction set - but I think that will have to be a project for another day.


Happy eDMA hacking!



* About the dual-channel SAI operation - you can load the two FIFOs with linked DMA channels, but if your data is interleaved it's more efficient to set the destination address modulo to 8 bytes and the destination offset to 4.  Because the two FIFO data registers have consecutive addresses, the destination will alternate between them.
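To see why the destination alternates, here's a small software model of modulo addressing.  The function is hypothetical and the register address in the usage note is a placeholder; as I understand the eDMA's modulo feature, an 8-byte modulo means only the low three address bits are allowed to change.

```c
#include <stdint.h>

/* Hypothetical model of destination modulo addressing: add the offset,
 * but let only the low mod_bits bits of the address change.
 * mod_bits must be less than 32. */
uint32_t model_next_daddr(uint32_t daddr, int32_t doff, unsigned mod_bits)
{
    uint32_t mask = (1u << mod_bits) - 1u;
    uint32_t next = daddr + (uint32_t)doff;

    return (daddr & ~mask) | (next & mask);
}
```

Starting from a placeholder address of 0x40000020 with an offset of 4 and three modulo bits, the destination goes to 0x40000024 and then wraps back to 0x40000020, so successive 32-bit writes land in the two FIFO registers in turn.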