I have a legacy product with limited lifespan remaining that needs to support a modern security algorithm. The algorithm (SHA512) is quite large, so my current device that is based on MCF51JM128 ran out of space when I tried adding it (got the "Overflow in segment" error).
In order to have the shortest path to a solution, I purchased an MCF5225x Demo kit to check whether I could just add the SHA512 algo to the demo firmware and see if it fits. If that worked, I would spin a new board using the V2 Coldfire, as most items in the rest of the device's codebase can transfer with little headache.
I tried it with the demo board, but even though I'm going from 16KB SRAM to 64KB and from 128KB flash to 512KB it still doesn't fit. Using CW7.2, code size shows 92192 and data shows 33901. The Link Error is "Overflow in segment...0x00007bd0 -- Overflow of: 0x00015590."
Is this a license restriction or am I out of space already?
Sorry to bring up a question on such a relic, but I really don't want to start from scratch with an entirely new processor for a dying product.
Thanks in advance!
Yes, sounds like a license limit. That should be documented in there somewhere. You should be able to buy a license and be good to go.
I would have thought that the "shortest path" would be to get the code to fit in the existing CPU. Changing CPUs, rewriting the hardware drivers and making a new PCB seems like a lot of work; you might as well move to a modern ARM-based CPU at that rate. From my experience, the MCF51 chips use peripherals from the HC08 and HC16 families, while the MCF52 uses different peripherals, so you'll need to rewrite those drivers.
The MCF51MM should get you 256k of Flash and 32k of RAM. That might be an easier chip to port to.
How much spare space do you have in your existing 128k product? Have you ever had to "shrink" the code to fit? If not, then you might be able to save a lot. In a previous product I worked on I managed to shrink 25k of code into 16k of Flash (actually shrunk 17k to 16k a lot of times as I wrote it).
How large is your SHA512 implementation? Are you sure it needs to be that large? There are a lot of sample code implementations that don't look all that complicated. If you're using an existing library (we've used tomcrypt), then it can bring in a lot of extra code if you're not careful with its configuration. SHA512 is a Hashing function. Do you need more than that from the library (like something to do encryption)?
Tom
Mike and Tom,
Thank you both for taking the time to reply.
Tom, I'm with you on staying with the MCF51 for sure. Unfortunately, since I own a license for CW6.3, I originally assumed the problem wasn't licensing. One of those "I already dealt with this and put it out of my mind" things. I should know better! I would LOVE if that were simply the problem. I really don't want to go to another processor for a dying product.
The hashing function is indeed taken from the LibTomCrypt library. I cut TONS of code out of it and only kept what was absolutely needed. The total code for just the SHA512 lib is now 79162 and data is 32320. I even considered making a firmware that ONLY does the hashing function in order to have a proof of concept on the MCF51 MCUs, but even that wouldn't compile. With that amount of code cut out in the MCF51, I got Code down to 92218 and data to 35532.
Unfortunately, when I installed CW7.2 (keeping 6.3), the PEMicro BDM hardware changed from USB BDM Multilink to Multilink Universal, and that has now broken CW6.3. I was hoping I could keep both installed.
As an aside question, is it possible to keep both and use both pieces of hardware?
Mike,
I purchased DLP-BASIC-NL license for CodeWarrior 6.3 on 1/17/2019, so that should have been attached to the CW6.3. Not sure why that error would have been code size restriction. I'll see if I can roll back to a functioning CW6.3 and start checking on the licensing route.
Again, THANK YOU both for taking the time for this. So sorry to bring up relics, but I appreciate the wisdom!
Hi,
First of all, sorry for the late reply.
Regarding the P&E debug tool compatibility issue, I am not sure whether installing the USB BDM Multilink debugger for ColdFire V1 could affect the USB-ML-CFE debugger driver. To be honest, I don't have hands-on experience installing both CodeWarrior (Classic IDE) versions to debug ColdFire V1 and ColdFire V2~4 products. Sorry about that.
Regarding the CodeWarrior software license issue, I would recommend using the NXP online service (submit a ticket). My teammate who handles software licensing will be glad to support that ticket.
Thank you for your understanding.
Mike
> The total code for just the SHA512 lib is now 79162 and data is 32320
There's something insanely wrong there. I've just compiled src/hashes/sha2/sha512.c and got this:
ltc/src/hashes/sha2% gcc -c -o test.o -I../../headers sha512.c
(compiled without errors)
ltc/src/hashes/sha2% ls -lFtr test.o
-rw-r--r-- 1 - - 10392 Nov 24 17:30 test.o
ltc/src/hashes/sha2% size test.o
text data bss dec hex filename
7006 368 0 7374 1cce test.o
ltc/src/hashes/sha2% nm test.o
U crypt_argchk
0000000000000020 r K
U memcmp
U memcpy
0000000000000000 t sha512_compress
0000000000000000 D sha512_desc
0000000000001431 T sha512_done
0000000000001181 T sha512_init
0000000000001260 T sha512_process
000000000000160a T sha512_test
U __stack_chk_fail
U strlen
00000000000000e0 d tests.0
7,374 bytes is a lot less than 111k.
"sha512.c" looks like "standalone code". You have to call the functions with the right data structures, but it looks like you could link just it into your code. It gets better: How about 4175 bytes?
ltc/src/hashes/sha2% gcc -DLTC_SMALL_CODE -c -o test.o -I../../headers sha512.c
ltc/src/hashes/sha2% size test.o
text data bss dec hex filename
3807 368 0 4175 104f test.o
Maybe I've missed something really big and obvious, but there looks to be a lot of wasted stuff in your "cut down build". Can you run "nm libtomcrypt.a" and see where all the memory is going? I assume you do have something that lets you run basic programming tools like "nm" and "size" and "objdump"? If you haven't, then fire up a Linux VM and compile it in there to find out what's going on. You don't need an M68k compiler to analyse this.
There's another thing with libtomcrypt. It has very good optimised code for X86, ARM and others, but it has nothing for M68k, so don't ask for any "fast kernels" as there aren't any. I don't think anything supports the crypto processor in the MCF51 you're using either (or maybe you could find an NXP library that does, and get speed and save space).
> USB BDM Multilink to Multilink Universal and that has now broken CW6.3
We're using obsolete P&E CF Multilink pods, but we found we could use Multilink Universals without changing any drivers. Mind you, we're using old Linux stuff and not CW. I think you need to roll back the drivers, restore a backup, or set up a separate PC (or VM) to run the different IDEs.
Another thing to watch out for with code sizes, and this comes from way back: the first way to blow a small C program up in size is to PRINT something with "printf()". Unless you have an "integer only print library", that has to pull in the floating point print stuff (even if you're not using it), and on the Coldfire that has to pull in the Floating Point Emulation code. Those can blow the size out by 30-60k. I've worked on code that has its own custom small integer "printf()" library. It only takes one enabled "debug print" in a library to drag all of that in and blow the size.
Tom
Tom,
I'm so sorry and you are absolutely right. The SHA512 portion is only 9740 Code and 640 Data. It is the rest of the algorithm that is killing the space. It is an implementation of ED25519. I just tossed all the 25519 files in one folder and the combination of it is the size I mentioned. Also, that's where I did the cutting. I wrongfully assumed the whole thing was related to SHA, but it was only a small portion of the overall picture. The sc.c file alone is 42698 of Code.
I've completed rolling back to CW6.3. It is worth noting, Mike, that the license file in the CW folder shows a Perpetual Mode, Node-locked license, so there shouldn't be a size restriction, right?
I knew you had to be doing more than just SHA512. That's just a small part of Ed25519, the majority of which is all the elliptic stuff. The majority of the TIME taken is the elliptic multiplication as well. ASN1 is huge too.
You're trying to do this on an MCF51JM128 (or maybe use an MCF51MM256). They run at 50MHz.
Ed25519 is running SHA2-512 and I think Ed25519 is a 256-bit key. It's all set up to run nicely on Gigahertz X64 chips in this decade.
We're running "old ltc/src/ecc/ecc" here with the 128-bit "SECP128R1" key. That's running on a 150MHz MCF5235 (triple your clock rate) with SHA1 running on the MDHA hardware in the chip to make it faster (we wrote that). We're also using "TFM" (tomsfastmath). Most of the time is spent in a very inefficient multi-precision multiply (fp_mul and fp_mul_comba). I tried to use the EMAC to speed those up, but couldn't get it to work.
It takes our chip about 2.5 SECONDS to encrypt a short symmetric key. That's time when it is locked solid inside the libtomcrypt library, and so the device is unable to execute any other code. Like the loop that pats the watchdog! Yours may be locked up for 20 seconds or more at your clock rate and with that wider encryption scheme. Can your system handle that?
Fortunately, our product has a relatively old ARM core in there as well (800MHz Cortex-A8), and it can do the same thing (also running libtomcrypt) in 28ms, nearly 100 times faster! So the MCF5235 asks the ARM chip to do that work for it. It might even be worth putting a small ARM chip on your board to offload the encryption work over an SPI or I2C port.
The way we cut LTC down to size was to build with a file like this, which is pulled in before ltc/src/headers/tomcrypt.h:
// LibTomCrypt configuration
#define ENDIAN_BIG
#define ENDIAN_32BITWORD
// Ciphers
#define LTC_NO_CIPHERS
#define LTC_RIJNDAEL
// Modes
#define LTC_NO_MODES
#define LTC_CTR_MODE
#define LTC_GCM_MODE
// Hashes
#define LTC_NO_HASHES
#define LTC_SHA1
#define LTC_HASH_HELPERS
// MACs
#define LTC_NO_MACS
// PRNGs
#define LTC_NO_PRNGS
#define LTC_SPRNG
// PK
#define LTC_NO_PK
#define LTC_MECC
#define LTC_ECC_SHAMIR
#define LTC_NO_ECC_TIMING_RESISTANT
// Other
#define LTC_NO_MISC
#define LTC_NO_PKCS
#define LTC_DER
#define LTC_NO_TEST
#define LTC_NO_FILE
Or you can pass all of the above in on the compiler command line as "-DLTC...", but that makes for a messy makefile.
The above tells it to include nothing (from each of the types) and to then only include the ones we want. You could start off with everything set to "LTC_NO_xxx" and see how big that is.
I've just summed all the Text sizes for the compiled object files. The biggest totals for the ones we're using are AES at 34k, followed by the ASN1 coder at 30k! ASN1 is only used to encode and decode the keys. You might be able to get rid of that. The ECC takes 18k. The "register and search for these things by name" takes 11k.
Tom
I got a version to compile on the MCF51JM128 (by ripping all the USB code out too). Got it down to 92254 Code and 34119 Data. Obviously as soon as the code ran it completely locked up as you said. That would make the device not feasible as I need to have it responding to CAN messages throughout the process. The ARM processor idea is great, but if I'm redoing the layout to add an ARM processor, I might as well port everything over to the ARM processor.
As you can tell I was hoping for a simple solution...just doesn't look like there is one here.
Tom, thank you so much for your time and effort!
> as the code ran it completely locked up as you said
I would be interested in a measurement of how long it took for one pass. Did it take 10 seconds, 30, more than a minute? For something that slow you can time it externally with the stopwatch on your phone (or a sundial :-). Or did a watchdog or something kick in so you couldn't measure it?
> That would make the device not feasible as I need to have it responding to CAN messages throughout the process
There are simple ways around that. Does your code poll for CAN messages or does it receive them under interrupts? If the latter, it might be able to respond to the CAN messages from the interrupts. You should always try to receive CAN using interrupts rather than polling. There are a lot of broken products out there that only use polling and can't handle back-to-back CAN messages without dropping them.
Otherwise, if the "polling loop" is too complicated to do that, you could change the code to call the polling loop from a periodic interrupt (say once per millisecond), and run the encryption in a separate "real background polling loop". That way all the normal code runs in "real time". The M68k and Coldfire are particularly suited to this as they have multiple interrupt levels in hardware that supports different priority levels for different code and interrupts.
That would only work if you don't need a fast response to the encryption. In our device, we "cache" two pre-computed encryptions in EEPROM so we get "instant access" to new ones, and then replace them in a "background process". Which was a lot of work, and then we called the other CPU to do it instead, so all that was unnecessary.
Tom
Tom,
Again, thank you for all your insight.
> Does your code poll for CAN messages or does it receive them under interrupts?
CAN messages are received under interrupts. During the time of crunching on the ed25519 algorithm there is a periodic outbound CAN message at two second intervals. No response is required, so no issues there.
> how long it took for one pass
I couldn't get it to run a full single pass as it would error out. It would typically generate an error on the first large function (in this case sc_reduce). The error varies, but the most common was "Access Error" case 4 "Error on instruction." If I skipped the sc_reduce function altogether, it would just do the same in the ge_scalarmult_base function.
I have since ordered a development kit for the NXP LPC546xx ARM processor as I need to move on from this problem. I know it will be A LOT of work to port things over, but I have to start somewhere. I don't see a perfect path using the V1 Coldfire in any case. Even if I get it to run and it takes 40 seconds (which I can live with), I can't fit the rest of my code, and I would need a board redesign to make another processor work.
> I couldn't get it to run a full single pass as it would error out.
It might have been overflowing the stack. You probably don't have all that much RAM in the existing CPU and the stack is probably sized for its normal functions. LTC might be written with the implicit assumption of "desktop machine, 16G of RAM" and by default assume "a lot of stack". We're running LTC on an MCF5235 with 16M of external DRAM and 64k of internal SRAM. The stack is in the internal SRAM and we allocate about 30k to it. We've monitored the use (a long time ago, before LTC) and have seen 10k used. So that might be your problem.
> I can't fit the rest of my code,
You might be able to fit it into the MM variant: it has 32k SRAM (up from 16k in your existing part, so you could allocate a 16k stack) and 256k Flash (so an extra 128k for LTC). The pinout shouldn't be that different, and the voltages and signals should be pretty much the same.
Tom
Hi,
Please refer to the picture below for the differences between each version of the CodeWarrior for ColdFire software:
The Evaluation version of the CodeWarrior software has a limited C code and data size; the total should be below 128KB.
Thanks for the attention.
Mike