Hi,
I'm using the LTC libraries with the K82 part. I have noticed that if I feed the library a 32-byte (256-bit) key and 64 bytes of data, the first 16 bytes come back correct but the remaining bytes do not. Here is an example.
Key = 63 68 69 63 6B 65 6E 20 74 65 72 69 79 61 6B 69
2C 20 69 73 20 76 65 72 79 20 79 75 6D 6D 79 21
Data IN = 49 20 77 6F 75 6C 64 20 6C 69 6B 65 20 74 68 65
20 47 65 6E 65 72 61 6C 20 47 61 75 27 73 20 43
68 69 63 6B 65 6E 2C 20 70 6C 65 61 73 65 2C 20
61 6E 64 20 77 6F 6E 74 6F 6E 20 73 6F 75 70 2E = "I would like the General Gau's Chicken, please, and wonton soup."
Data OUT = 49 30 D0 2C 0B 75 CF 5C 15 24 B4 0D A6 5C F9 7B
2E B6 1A 0F 3B 20 13 7E E2 8D 89 95 F0 21 F7 C6
D6 38 A0 13 96 90 57 B0 AC 2E A9 80 91 EE 53 47
F7 64 0D A4 9D A9 20 B4 BD 23 A1 C8 17 76 EF E6
It should be:
49 | 30 | d0 | 2c | 0b | 75 | cf | 5c | 15 | 24 | b4 | 0d | a6 | 5c | f9 | 7b |
2b | b4 | ac | ab | b1 | 8b | fa | 4c | 19 | 3f | b1 | 8b | 37 | 5a | 2b | 5d |
0f | b3 | bc | 50 | 63 | d6 | ce | 43 | 62 | f6 | fc | e2 | 9a | fa | 56 | ab |
93 | c4 | 7e | ac | 1f | bc | dd | 92 | b2 | 84 | 84 | 87 | 76 | 21 | 84 | a2 |
If I run this through LTC_AES_DecryptCbc it comes back OK, but that does not mean we are compliant. I have no idea why this is happening.
I use the WolfSSL libraries, which call into the LTC libraries.
The code looks like this:
wc_AesSetKey(&temp->session->aes, (const byte *) temp->session->aes.key, temp->session->keylen, NULL,
AES_ENCRYPTION);
wc_AesCbcEncrypt(&temp->session->aes, BufOut, &((struct CmdEncDec *) (&frameIN->data))->data,
((struct CmdEncDec *) (&frameIN->data))->len);
Chris
I have tested WolfSSL's AES256 and don't get the result that you do (mine are correct).
However, there are loads of settings that could be used, and you may be using software or the mmCAU, so presumably a configuration is not correct somewhere.
- Is your implementation offloading parts to the Crypto Acceleration Unit (CAU)? And if so, are these Freescale/NXP supplied routines hooked to the WolfSSL or from another source?
Your single block encryption is correct but when cycling you have an error (the error is symmetrical for encryption and decryption). This means that your initialisation vector is OK for the first use (it will be all zeros and is embedded in the WolfSSL instance).
When the second block is encrypted it presumably has a problem with its new initialisation vector, resulting in errors from that point on. The WolfSSL AES instance saves the initialisation vector as 4 long words (rather than as 16 bytes as is often done) and there is a risk that something is using these long words incorrectly and treating them as bytes in big-endian format. Therefore carefully check the big- and little-endian configurations and any incompatibilities between the WolfSSL AES storage and the passing conventions of the NXP mmCAU assembler code (if used).
There are unfortunately many combinations possible for the encryption libraries and so they need to be carefully configured to actually work correctly (and optimally on a particular chip).
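To illustrate the byte-order risk described above, here is a minimal sketch. It is not WolfSSL or NXP code, and all function names are hypothetical. It packs a 16-byte IV into four 32-bit words using both conventions. If one layer stores the words one way and another layer reads them the other way, every word ends up byte-swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only (not part of WolfSSL or the NXP SDK). */

/* Pack 16 IV bytes into four words, first byte in the most significant position. */
void iv_pack_be(const uint8_t iv[16], uint32_t words[4])
{
    for (size_t i = 0; i < 4; i++) {
        words[i] = ((uint32_t)iv[4 * i] << 24) | ((uint32_t)iv[4 * i + 1] << 16) |
                   ((uint32_t)iv[4 * i + 2] << 8) | (uint32_t)iv[4 * i + 3];
    }
}

/* Pack 16 IV bytes into four words, first byte in the least significant position. */
void iv_pack_le(const uint8_t iv[16], uint32_t words[4])
{
    for (size_t i = 0; i < 4; i++) {
        words[i] = (uint32_t)iv[4 * i] | ((uint32_t)iv[4 * i + 1] << 8) |
                   ((uint32_t)iv[4 * i + 2] << 16) | ((uint32_t)iv[4 * i + 3] << 24);
    }
}

/* Reverse the byte order of one 32-bit word. */
uint32_t swap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000ff00u) | ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Check that the two layouts differ by exactly one byte swap per word. */
void iv_check_layouts(const uint8_t iv[16])
{
    uint32_t be[4], le[4];
    iv_pack_be(iv, be);
    iv_pack_le(iv, le);
    for (size_t i = 0; i < 4; i++) {
        assert(swap32(be[i]) == le[i]);
    }
}
```

An all-zero IV looks identical in both layouts, so a mismatch of this kind would only show up once a non-zero chaining value is written back.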
Regards
Mark
Supporting Kinetis professionals: http://www.utasker.com/
It seems the problem is that the IV is reset to the current cached value on every call to the CBC encrypt and decrypt functions. I believe this is the wrong procedure.
/* Write IV data to the context register. */
ltc_set_context(base, &iv[0], LTC_AES_IV_SIZE, 0);
Chris
It is normal to reset the IV before encrypting or decrypting each plaintext message [a message being a multiple of 16 bytes considered to belong together] (although one could also prime to any predefined value at both tx and rx). The initial IV affects the first block output but doesn't explain errors between each block in the ciphertext as you have shown.
If you have WolfSSL try this (direct calls) to ensure that the LTC plumbing is not screwing things up. If it also gives the same results the problem is in the WolfSSL part (setup or something). [WolfSSL's set key call also resets the IV].
Aes aes_encrypt_context;
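wc_AesSetKey(&aes_encrypt_context, Key, (256/8), NULL, AES_ENCRYPTION); // Key is a 32-byte array; a NULL IV selects the all-zero IV
wc_AesCbcEncrypt(&aes_encrypt_context, DataOUT, DataIN, sizeof(DataIN)); // DataIN and DataOUT are byte arrays of equal size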
Regards
Mark
That's what I expect to happen: SetKey also resets the IV. This is consistent with libgcrypt as well. The LTC, however, resets the IV on every call to decrypt or encrypt, which does not seem to be compliant. I wish someone from NXP would step up and explain why they crafted the LTC driver (fsl_ltc.c) the way they did. Is it compliant with the standard? Probably not.
Chris
Please clarify:
- "The LTC however, resets the IV on every call to decrypt or encrypt"
Do you mean every call to encrypt or decrypt a message? Or a block?
Resetting for every "block" is equivalent to ECB (Electronic Codebook) mode (OpenSSL says "We strongly recommend that you not use it under any circumstances").
Resetting for each "message" is OK (CBC mode feeds back the output of each block as the new IV for the subsequent block).
This is the output (first 64 bytes) from AES256 in CBC using your key and plain text:
#aes256
0x49 0x30 0xd0 0x2c 0x0b 0x75 0xcf 0x5c 0x15 0x24 0xb4 0x0d 0xa6 0x5c 0xf9 0x7b
0x2b 0xb4 0xac 0xab 0xb1 0x8b 0xfa 0x4c 0x19 0x3f 0xb1 0x8b 0x37 0x5a 0x2b 0x5d
AES256 passed CBC
This is the output (first 64 bytes) from AES256 in ECB using your key and plain text:
#aes256
0x49 0x30 0xd0 0x2c 0x0b 0x75 0xcf 0x5c 0x15 0x24 0xb4 0x0d 0xa6 0x5c 0xf9 0x7b
0x2e 0xb6 0x1a 0x0f 0x3b 0x20 0x13 0x7e 0xe2 0x8d 0x89 0x95 0xf0 0x21 0xf7 0xc6
AES256 failed CBC
It does look to match with your problem. Therefore you must really mean that IV is being reset "between" blocks, which is of course false for ECB mode.
Either you are not using the mode that you think or someone has really killed the code in the LTC plumbing.
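The equivalence argued above can be reproduced with a small sketch. It uses a toy 16-byte block function as a stand-in for AES (an assumption made purely to keep the example self-contained; it is not secure and not AES), and all names are hypothetical. With an all-zero IV, CBC that resets the IV before every block gives exactly the ECB output, while correct CBC only matches ECB in the first block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Toy stand-in for the AES block function: NOT AES and NOT secure. */
void toy_encrypt_block(const uint8_t key[BLOCK], const uint8_t in[BLOCK], uint8_t out[BLOCK])
{
    for (size_t i = 0; i < BLOCK; i++) {
        uint8_t v = (uint8_t)(in[i] ^ key[i]);
        out[(i + 1) % BLOCK] = (uint8_t)((v << 3) | (v >> 5)); /* rotate bits, move byte */
    }
}

/* ECB: every block is encrypted independently (in and out must not overlap). */
void toy_ecb(const uint8_t key[BLOCK], const uint8_t *in, uint8_t *out, size_t len)
{
    assert(len % BLOCK == 0);
    for (size_t off = 0; off < len; off += BLOCK) {
        toy_encrypt_block(key, in + off, out + off);
    }
}

/* CBC: each plaintext block is XORed with the previous ciphertext block.
 * reset_iv_each_block != 0 models the suspected fault (IV reloaded per block). */
void toy_cbc(const uint8_t key[BLOCK], const uint8_t iv[BLOCK], const uint8_t *in,
             uint8_t *out, size_t len, int reset_iv_each_block)
{
    uint8_t chain[BLOCK];
    uint8_t x[BLOCK];

    assert(len % BLOCK == 0);
    memcpy(chain, iv, BLOCK);
    for (size_t off = 0; off < len; off += BLOCK) {
        if (reset_iv_each_block) {
            memcpy(chain, iv, BLOCK);          /* discard the feedback */
        }
        for (size_t i = 0; i < BLOCK; i++) {
            x[i] = (uint8_t)(in[off + i] ^ chain[i]);
        }
        toy_encrypt_block(key, x, out + off);
        memcpy(chain, out + off, BLOCK);       /* ciphertext feeds the next block */
    }
}
```

Comparing the three outputs for a two-block message shows the same pattern as in the thread: the first 16 bytes agree in every case, and only the correctly chained CBC output differs from ECB afterwards.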
Also you never said whether you are making use of the mmCAU or not. As I noted, when used correctly, you can get almost 900% improvement in performance over the SW library. This is where NXP's main interest lies so I would think that there are hooks in there somewhere.
Regards
Mark
Please clarify:
- "The LTC however, resets the IV on every call to decrypt or encrypt"
Do you mean every call to encrypt or decrypt a message? Or a block?
Every call to E/D. It calls ltc_symmetric_update(base, key, keySize, kLTC_AlgorithmAES, kLTC_ModeCBC, kLTC_ModeDecrypt); which, if you drill down in the LTC driver, clears all registers with base->CW = (uint32_t)kLTC_ClearAll. It also calls ltc_set_context(base, &iv[0], LTC_AES_IV_SIZE, 0); on each call to E/D.
This is not consistent with libgcrypt, which I have to be able to interoperate with on the other end. Likewise, calls to the CBC E/D with more than 16 bytes do not produce the correct results when checked against multiple web-based calculators, as shown in my first post.
BTW: "This is the output (first 64 bytes) from AES256 in CBC using your key and plain text:" did you mean 64 bits?
libgcrypt seems to hold those previous run bits between calls to E/D until you call SetIV. At that point they get cleared.
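The behaviour described for libgcrypt can be sketched as a context that keeps the chaining value between calls. This is an illustrative model only: the block function is a toy stand-in for AES (not secure), and the type and function names are hypothetical rather than the libgcrypt, WolfSSL or LTC API. With the state retained, encrypting 64 bytes in one call and in four 16-byte calls gives the same ciphertext.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Hypothetical context: key plus the chaining value carried between calls. */
typedef struct {
    uint8_t key[BLOCK];
    uint8_t chain[BLOCK];
} toy_cbc_ctx;

/* Toy stand-in for the AES block function: NOT AES and NOT secure. */
static void toy_encrypt_block(const uint8_t key[BLOCK], const uint8_t in[BLOCK],
                              uint8_t out[BLOCK])
{
    for (size_t i = 0; i < BLOCK; i++) {
        uint8_t v = (uint8_t)(in[i] ^ key[i]);
        out[(i + 1) % BLOCK] = (uint8_t)((v << 3) | (v >> 5));
    }
}

/* Setting the IV is the only operation that resets the chaining value. */
void toy_cbc_set_iv(toy_cbc_ctx *ctx, const uint8_t iv[BLOCK])
{
    memcpy(ctx->chain, iv, BLOCK);
}

/* Setting the key also resets the IV (here to all zeros). */
void toy_cbc_set_key(toy_cbc_ctx *ctx, const uint8_t key[BLOCK])
{
    memcpy(ctx->key, key, BLOCK);
    memset(ctx->chain, 0, BLOCK);
}

/* Encrypt len bytes; the last ciphertext block stays in the context. */
void toy_cbc_encrypt(toy_cbc_ctx *ctx, const uint8_t *in, uint8_t *out, size_t len)
{
    uint8_t x[BLOCK];

    assert(len % BLOCK == 0);
    for (size_t off = 0; off < len; off += BLOCK) {
        for (size_t i = 0; i < BLOCK; i++) {
            x[i] = (uint8_t)(in[off + i] ^ ctx->chain[i]);
        }
        toy_encrypt_block(ctx->key, x, out + off);
        memcpy(ctx->chain, out + off, BLOCK);
    }
}
```

A driver that reloads the original IV at the start of every call would break this property as soon as a message is split over several calls.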
I am using the K82 and the LTC is directly operating on the engine registers of the k82 verified.
"It does look to match with your problem. Therefore you must really mean that IV is being reset "between" blocks, which is of course false for ECB mode." Did you mean false for CBC here? I was under the impression that all AES was 16 byte block size but for CBC, the carry over of those previous sub results were kept in the computation of the next 16 bytes and so on and so on until reset.
Also, according to the K82 reference manual:
42.3.5.1 AES CBC mode use of the Mode Register
The AES CBC mode uses the Mode Register as follows:
• The Encrypt (ENC) field should be 1 for encryption and 0 for decryption.
• The ICV/TEST bit is not used in these modes.
• The Algorithm State (AS) field is used only in CBC mode to prevent IV update in the
context for the last data block when set to "Finalize" (2h). <<<< this is not being set to 2 in the LTC but is being set to 0
Chris
I meant 32 bytes and not 64 bytes, and false for CBC.
The code changes between versions are standard as far as I can see ([fixes or new bugs?] and who changed it ?) - I couldn't find the routines that you were referring to in SDK 1.3 and don't want to try anything with KDS 2.x at the moment since it seems a disastrous way to work on real developments.
Fortunately, using the underlying libraries directly, means I am not affected.
Regards
Mark
Mark,
If you get a copy of the SDK 2.1 for FRDM-K82F and look in the \SDK_2.1_FRDM-K82F\devices\MK82F25615\drivers\fsl_ltc.c file, you will see the difference between 1.3 and this mess.
Chris
If you have WolfSSL I still don't really understand why you are going through the LTC interface?
They have a define FREESCALE_MMCAU which can be enabled to hook directly to the mmCAU. It even works ;-)
Mark
Mark,
WolfSSL wants to call the LTC functions because of the K82. I didn't implement this. I'm not sure that the mmCAU works with the K82 part.
Chris
The K82 has an mmCAU integrated because it is intended for such operations.
I have a report of comparisons (benchmarks) using various solutions at http://www.utasker.com/docs/uTasker/uTasker_Cryptography.pdf
where you can see that the K82F at 150MHz performs AES256 encryptions/decryptions about 5x faster than in SW.
I haven't investigated yet why it is actually slower than a 120MHz K64F, which does the same AES256 encryption 35% faster than the 150MHz K82F.
Correction: The 150MHz K80F performs AES256 16% faster than a 120MHz K64, which makes sense. I was comparing initially with a longer plaintext by mistake. This makes it 8.5x faster than in SW.
In any case, if throughput is of importance, the mmCAU should be used otherwise the HW potential is not being utilised fully.
Regards
Mark
OK, so now I'm really confused. The LTC code is called in the WolfSSL SDKs for the K82 from NXP. I see that Chapter 40 of the RM covers the mmCAU and Chapter 42 covers the LP Trusted Crypto unit, which references AESA. I did not see benchmarks for the LTC vs mmCAU. They are both available, but I am assuming that the LTC is faster or they would not have used it for the SDK. See https://www.wolfssl.com/wolfSSL/Blog/Entries/2016/10/20_New_NXP_Kinetis_K8X_LP_Trusted_Crypto_(LTC)_...
Chris
I haven't used the LTC module in the K82 yet so don't have a benchmark to know whether it is faster than the mmCAU or not.
Finally, however, I generated a FRDM-K82F KDS2.1 build (all 350 MBytes of it) to take a look at the LTC interface. The WolfSSL code (I think this can only be used for evaluation and needs licensing to work with later) does indeed have a hook for "FREESCALE_LTC" [I wonder why they still call it Freescale since the addition is only a few months old...?].
When I followed this I didn't actually see any code with the IV being reset between blocks because the AESA handles CBC internally from what I can see. Therefore it may be that if one generates the build today the error has been fixed (?)
There is however no comment on changes and no version management in the system so it may be pot luck as to whether the code is good or not.
Once I have checked out the LTC performance I'll update the benchmark. (It is FIFO based so should be pretty fast without any further code optimisations; I found that I could get almost double throughput by optimising away some of the fluff that the libraries add at the mmCAU interface but I doubt that it will be as much with the LTC).
Regards
Mark
Chris
I now have comparisons for encryption (I didn't do decryption just yet but understand that the speed results are identical) with the LTC in the K82F.
AES256 on K82F at 150MHz mmCAU (set encryption key / 32 byte encryption) 4.4us / 7.0us (4.57MB/s)
AES256 on K82F at 150MHz LTC (set encryption key / 32 byte encryption) 2.3us / 5.5us (5.82MB/s)
Since these are short lengths there is a certain amount of overhead involved getting in and out of the data pump part. So when I use 25kBytes instead of just 32 bytes the throughput improves to 10.56MB/s (compared to 5.3MB/s with mmCAU). This compares to pure SW (mbedTLS 621kB/s and WolfSSL 522kB/s).
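The throughput figures quoted for the 32-byte case follow directly from the measured times. A minimal sketch of the arithmetic, assuming 1 MB = 10^6 bytes (the function name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per microsecond is numerically equal to MB/s when 1 MB = 10^6 bytes. */
double throughput_mb_per_s(size_t bytes, double time_us)
{
    assert(time_us > 0.0);
    return (double)bytes / time_us;
}
```

For example, throughput_mb_per_s(32, 7.0) gives about 4.57 for the mmCAU and throughput_mb_per_s(32, 5.5) gives about 5.82 for the LTC, which matches the values listed above.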
Comparing these with the WolfSSL benchmarks (in your link) there is some difference:
- WolfSSL benchmark SW 247kB/s
- LTC 12.2MB/s
I believe that they do their benchmarks with AES128 (and not AES256) which should be specified to allow real comparisons. I get 11.64MB/s with LTC with AES128, which is pretty similar. In the case of the software implementation I do use aligned buffers and have an improved copy function based on DMA that is called from the code which may explain some improvements but also the accuracy of the measurements can never be fully guaranteed.
In fact I wouldn't expect slower results than the benchmark since I have optimised the code to avoid various subroutine calls. Below is the encryption routine that I have used which should be somewhat tighter than the original - although the data pump should be kept fully fed in each case.
Regards
Mark
register unsigned long *ptrPlainTextInput = (unsigned long *)ptrTextIn;
register unsigned long *ptrCipherTextOutput = (unsigned long *)ptrTextOut;
LTC0_MD = (LTC_MD_ENC_ENCRYPT | LTC_MD_AS_UPDATE | LTC_MD_ALG_AES | LTC_MD_AAI_CBC); // set the mode
if ((iInstanceCommand & AES_COMMAND_AES_RESET_IV) != 0) { // if the initial vector is to be reset before start
    LTC0_CTX_0 = 0; // zero IV
    LTC0_CTX_1 = 0;
    LTC0_CTX_2 = 0;
    LTC0_CTX_3 = 0;
}
while (ulDataLength != 0) {
    unsigned long ulThisLengthIn;
    unsigned long ulThisLengthOut;
    if (ulDataLength > 0xff0) {
        ulThisLengthIn = 0xff0;
    }
    else {
        ulThisLengthIn = ulDataLength;
    }
    ulDataLength -= ulThisLengthIn;
    LTC0_DS = ulThisLengthIn; // write the data size
    ulThisLengthIn /= sizeof(unsigned long); // the number of long words
    ulThisLengthOut = ulThisLengthIn;
    while (ulThisLengthIn-- != 0) { // copy to the input FIFO and read from the output FIFO
        while ((LTC0_FIFOSTA & LTC_FIFOSTA_IFF) != 0) { // if the input FIFO is full we must wait before adding further data
            if ((LTC0_FIFOSTA & LTC_FIFOSTA_OFL_MASK) != 0) { // if there is at least one output result ready
                *ptrCipherTextOutput++ = LTC0_OFIFO;
                ulThisLengthOut--;
            }
        }
        LTC0_IFIFO = *ptrPlainTextInput++; // long word aligned
        if ((LTC0_FIFOSTA & LTC_FIFOSTA_OFL_MASK) != 0) { // if there is at least one output result ready
            *ptrCipherTextOutput++ = LTC0_OFIFO;
            ulThisLengthOut--;
        }
    }
    while (ulThisLengthOut != 0) { // drain the remaining output words
        if ((LTC0_FIFOSTA & LTC_FIFOSTA_OFL_MASK) != 0) { // if there is at least one output result ready
            *ptrCipherTextOutput++ = LTC0_OFIFO;
            ulThisLengthOut--;
        }
    }
    while ((LTC0_STA & LTC_STA_DI) == 0) {} // wait for completion
    WRITE_ONE_TO_CLEAR(LTC0_CW, LTC_CW_CDS); // clear the data size
    WRITE_ONE_TO_CLEAR(LTC0_STA, LTC_STA_DI); // reset the done interrupt
    LTC0_MD = (LTC_MD_ENC_ENCRYPT | LTC_MD_AS_UPDATE | LTC_MD_ALG_AES | LTC_MD_AAI_CBC); // re-write the mode
}
WRITE_ONE_TO_CLEAR(LTC0_CW, (LTC_CW_CM | LTC_CW_CDS | LTC_CW_CICV | LTC_CW_CCR | LTC_CW_CKR | LTC_CW_CPKA | LTC_CW_CPKB | LTC_CW_CPKN | LTC_CW_CPKE | LTC_CW_COF | LTC_CW_CIF)); // clear internal registers
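The outer loop of the routine above splits the input into passes of at most 0xff0 bytes. The following host-side sketch shows only that chunking arithmetic, without any register access. The limit is taken from the routine, while the macro and function names are hypothetical. Since 0xff0 (4080) is a multiple of 16, every pass except possibly the last ends on an AES block boundary.

```c
#include <assert.h>
#include <stddef.h>

#define LTC_MAX_PASS_BYTES 0xff0u /* limit used by the routine above */
#define AES_BLOCK_BYTES    16u

/* Size of the next pass for the given number of remaining bytes. */
size_t ltc_next_pass(size_t remaining)
{
    return (remaining > LTC_MAX_PASS_BYTES) ? LTC_MAX_PASS_BYTES : remaining;
}

/* Number of passes needed to process total bytes. */
size_t ltc_count_passes(size_t total)
{
    size_t passes = 0;

    assert(LTC_MAX_PASS_BYTES % AES_BLOCK_BYTES == 0);
    while (total != 0) {
        total -= ltc_next_pass(total);
        passes++;
    }
    return passes;
}
```

Taking 25 kBytes as 25600 bytes (an assumption about the exact test size), this gives six full passes of 4080 bytes and a final pass of 1120 bytes.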
It appears there are multiple versions of the LTC driver in the wild. There is one from the 1.3 SDK and another from the 2.0/2.1 SDK. They are different, and I was using the one from 2.0/2.1, which came from the new bull shit SDK generator. I for one thought the Processor Expert was a better way to keep things organized. There are too many links that you can't check in to Git because they all point to the SDK somewhere else, not in the workspace.
Chris and interested AES users
Various libraries are being encapsulated into the uTasker project's Crypto functions which both avoid mistakes (saving project time in many cases) and increase performance in some situations (mainly Flash since the libraries [also CMSIS DSP] in their original form sometimes link in data that is not needed and unnecessarily consume (sometimes large) program space).
Here I have some performance and Flash/RAM comparisons between WolfSSL and mbedTLS (I don't guarantee accuracy and will be discussing further optimisation possibilities with the manufacturers - WolfSSL has a uTasker port in their code which is also being analysed for potential improvements).
It is interesting to note that the mmCAU not only increases performance by a factor of about 4 for encryption and decryption but it can also save almost 10k of Flash space (code and tables)!!
When no mmCAU is available there is a further interesting increase in performance by locating tables in RAM (rather than Flash) and a saving in Flash by generating the content rather than taking it from a const look-up table. This is due to the faster RAM access possible in the K64 but there is of course a trade off between the improved performance and the RAM consumption.
From the tables it is clear to see what range of performance can be expected with different libraries and especially with and without mmCAU.
Where AES256 is needed in designs without mmCAU and there is adequate RAM available (about 8.5k free) it certainly makes sense in relocating tables to SRAM (or enabling the mbedTLS option) since it can double throughput!
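As an illustration of generating table content at start-up instead of linking a const look-up table, the sketch below builds the standard AES S-box in a RAM buffer. It uses the well-known generator approach (walking the multiplicative group of GF(2^8) and applying the affine transform) and is not taken from mbedTLS, WolfSSL or uTasker. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8-bit value left by shift bits. */
#define ROTL8(x, shift) ((uint8_t)(((x) << (shift)) | ((x) >> (8 - (shift)))))

/* Fill a 256-byte RAM buffer with the AES S-box (no const table in Flash). */
void aes_sbox_generate(uint8_t sbox[256])
{
    uint8_t p = 1; /* runs through all non-zero field elements */
    uint8_t q = 1; /* always the multiplicative inverse of p */

    assert(sbox != NULL);
    do {
        /* multiply p by 3 in GF(2^8) */
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0));

        /* divide q by 3 (equivalent to multiplying by 0xF6) */
        q ^= (uint8_t)(q << 1);
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (uint8_t)((q & 0x80) ? 0x09 : 0);

        /* affine transformation of the inverse */
        sbox[p] = (uint8_t)(q ^ ROTL8(q, 1) ^ ROTL8(q, 2) ^ ROTL8(q, 3) ^ ROTL8(q, 4) ^ 0x63);
    } while (p != 1);

    sbox[0] = 0x63; /* zero has no inverse and is handled separately */
}
```

The trade-off is the one described above: the buffer costs 256 bytes of RAM per table and a short start-up computation, in exchange for not storing the table as const data in Flash.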
Expect very similar results on K8x parts:
http://www.utasker.com/kinetis/TWR-K80F150M.html
http://www.utasker.com/kinetis/FRDM-K82F.html
Regards
Mark
Professional Kinetis Services: http://www.utasker.com/services.html
For completeness, these are the same comparisons when OpenSSL is used. It also has a W option to roll out some loops, which costs around 4k of Flash and has a performance advantage of about 20% - its tables are fixed in Flash so the mbedTLS option still has the lead in SW implementation:
When mmCAU is available all of the library components can be disabled (removed) and the mmCAU accessed directly.
The uTasker AES256 interface allows this (as stated, when there is an mmCAU available) and the figures below are achieved:
It may be possible to shave off a few more ns by running all code from SRAM but this is tending to the limit and is a further substantial improvement of around 35% on the fastest library option.
Regards
Mark