How is it possible to reach the announced 50 Gbps SEC speed and how to test it?
Using tcrypt and openssl, the maximum speed we get from the caam drivers is 200 MBps (1.6 Gbps).
Even with the configuration attached to the benchmark thread, 200 MBps is the maximum speed we reached, the same figure the author and the NXP tech reached in that thread.
Please refer to the following test result provided by the testing team.
root@localhost:~# zcat /proc/config.gz | grep -i CONFIG_CRYPTO_TEST
CONFIG_CRYPTO_TEST=m
root@localhost:~# zcat /proc/config.gz | grep -i caam
CONFIG_CRYPTO_DEV_FSL_CAAM_COMMON=y
CONFIG_CRYPTO_DEV_FSL_CAAM_CRYPTO_API_DESC=y
CONFIG_CRYPTO_DEV_FSL_CAAM_AHASH_API_DESC=y
CONFIG_CRYPTO_DEV_FSL_CAAM_SECVIO=m
CONFIG_CRYPTO_DEV_FSL_CAAM=y
# CONFIG_CRYPTO_DEV_FSL_CAAM_DEBUG is not set
# CONFIG_CRYPTO_DEV_FSL_CAAM_JR is not set
CONFIG_CRYPTO_DEV_FSL_DPAA2_CAAM=y
root@localhost:~# insmod /lib/modules/$(uname -r)/tcrypt.ko sec=1 mode=406
[ 908.131031]
[ 908.131031] testing speed of async sha512 (sha512-caam-qi2)
[ 908.138117] tcrypt: test 0 ( 16 byte blocks, 16 bytes per update, 1 updates):
[ 909.134225] 115812 opers/sec, 1852992 bytes/sec
[ 909.146750] tcrypt: test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 111616 opers/sec, 7143424 bytes/sec
[ 910.157269] tcrypt: test 2 ( 64 byte blocks, 64 bytes per update, 1 updates):
[ 911.154226] 115974 opers/sec, 7422336 bytes/sec
[ 911.166753] tcrypt: test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 36053 opers/sec, 9229568 bytes/sec
[ 912.177287] tcrypt: test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 36833 opers/sec, 9429248 bytes/sec
[ 913.185279] tcrypt: test 5 ( 256 byte blocks, 256 bytes per update, 1 updates):
[ 914.182227] 115755 opers/sec, 29633280 bytes/sec
[ 914.194752] tcrypt: test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 11818 opers/sec, 12101632 bytes/sec
[ 915.205296] tcrypt: test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 22633 opers/sec, 23176192 bytes/sec
[ 916.213261] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates):
[ 917.210221] 106755 opers/sec, 109317120 bytes/sec
[ 917.222756] tcrypt: test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 6264 opers/sec, 12828672 bytes/sec
[ 918.233305] tcrypt: test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 12600 opers/sec, 25804800 bytes/sec
[ 919.241324] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 35766 opers/sec, 73248768 bytes/sec
[ 920.249290] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates):
[ 921.246230] 95541 opers/sec, 195667968 bytes/sec
[ 921.258758] tcrypt: test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 3233 opers/sec, 13242368 bytes/sec
[ 922.269374] tcrypt: test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 6560 opers/sec, 26869760 bytes/sec
[ 923.277349] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 21252 opers/sec, 87048192 bytes/sec
[ 924.285300] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates):
[ 925.282226] 78298 opers/sec, 320708608 bytes/sec
[ 925.294765] tcrypt: test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 1616 opers/sec, 13238272 bytes/sec
[ 926.305893] tcrypt: test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 3376 opers/sec, 27656192 bytes/sec
[ 927.313485] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 11698 opers/sec, 95830016 bytes/sec
[ 928.321338] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 29107 opers/sec, 238444544 bytes/sec
[ 929.329297] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates):
[ 930.326236] 50629 opers/sec, 414752768 bytes/sec
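To compare the tcrypt figures with a Gbps target, it may help to convert the bytes/sec values. The following is only a minimal sketch: the helper names are hypothetical, the regular expression assumes the "N opers/sec, M bytes/sec" wording seen in the log above, and 1 Gbps is taken as 10^9 bits per second.

```python
import re

# Matches the result part of a tcrypt speed line, e.g.
# "50629 opers/sec, 414752768 bytes/sec" (format assumed from the log above).
RESULT_RE = re.compile(r"(\d+)\s+opers/sec,\s+(\d+)\s+bytes/sec")


def bytes_per_sec_to_gbps(bytes_per_sec):
    """Convert bytes/sec to gigabits/sec (decimal: 1 Gbps = 1e9 bit/s)."""
    return bytes_per_sec * 8 / 1e9


def peak_throughput_gbps(log_text):
    """Return the highest throughput, in Gbps, found in a tcrypt log."""
    rates = [int(m.group(2)) for m in RESULT_RE.finditer(log_text)]
    if not rates:
        raise ValueError("no tcrypt result lines found")
    return bytes_per_sec_to_gbps(max(rates))


# Two result lines copied from the log above.
sample = """
[ 925.282226] 78298 opers/sec, 320708608 bytes/sec
[ 930.326236] 50629 opers/sec, 414752768 bytes/sec
"""
print(f"{peak_throughput_gbps(sample):.2f} Gbps")  # prints "3.32 Gbps"
```

By this conversion, the best single-stream result in the log (8192-byte blocks, one update) is about 3.3 Gbps.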
You are correct. LX2160ASECRM, Figure 1, "SEC block diagram", shows 16 DECO/CCB tiles, which means the LX2160A has 16 AESAs (AES Accelerators), each capable of ~3 Gbps, for a total of about 50 Gbps.
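The arithmetic behind the aggregate figure can be written out as a quick check. This is just a sketch using the two numbers quoted above (16 tiles, roughly 3 Gbps each); the function names are made up for illustration and nothing here is measured.

```python
TILE_COUNT = 16        # DECO/CCB tiles, per the SEC block diagram cited above
PER_AESA_GBPS = 3.0    # approximate per-accelerator rate quoted above


def aggregate_gbps(tiles, per_tile_gbps):
    """Ideal aggregate rate if every tile runs at full speed in parallel."""
    return tiles * per_tile_gbps


def per_tile_rate_for(target_gbps, tiles):
    """Per-tile rate needed to reach a given aggregate target."""
    return target_gbps / tiles


print(aggregate_gbps(TILE_COUNT, PER_AESA_GBPS))  # 48.0
print(per_tile_rate_for(50.0, TILE_COUNT))        # 3.125
```

So "about 50 Gbps" corresponds to every tile sustaining a little over 3 Gbps at the same time, which a single serialized stream cannot do.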
Ideally you should run multiple independent streams/tcrypt files in parallel to achieve an aggregate of ~50 Gbps. If there are dependencies between the parallel operations, please review the LX2160ASECRM section "Order of job completion" to understand the impact and implications of the "out-of-order" situation.
Job Descriptors submitted through different Job Rings are not guaranteed to complete in the order they were submitted by software, even if the job descriptors reference the same shared descriptor and use SERIAL or WAIT sharing. (See Shared descriptors for details about shared descriptors and sharing concepts).
As shown in the example in Figure 4, it is possible for results to be written into the output ring in a different order than the order in which the corresponding jobs appear in the input ring (see jobs 6 and 7, where 7 is a short job submitted after 6, which requires more processing time). Because jobs are assigned to DECOs as the DECOs become available, successive Job Descriptors generally execute in parallel in different DECOs. Therefore, the order of job completion is affected by the time required to process the data, the presence of a shared descriptor, and the sharing mode.
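The reordering described above can be illustrated with a toy simulation. It is only a sketch under simplifying assumptions: jobs are taken from the input ring in order, each goes to whichever DECO frees up first, and job durations are made-up numbers. It does not model the real SEC scheduler or shared descriptors.

```python
import heapq


def completion_order(job_durations, num_decos):
    """Dispatch jobs in ring order to the earliest-free DECO.

    Returns the job indices in the order they would complete.
    """
    # Min-heap of (time the DECO becomes free, DECO id).
    free_at = [(0, deco) for deco in range(num_decos)]
    heapq.heapify(free_at)
    finished = []
    for job, duration in enumerate(job_durations):
        start, deco = heapq.heappop(free_at)
        end = start + duration
        finished.append((end, job))
        heapq.heappush(free_at, (end, deco))
    finished.sort()  # order results by completion time
    return [job for _, job in finished]


# Job 6 is long, job 7 is short (hypothetical durations).
durations = [2, 2, 2, 2, 2, 2, 9, 1]
print(completion_order(durations, num_decos=2))  # [0, 1, 2, 3, 4, 5, 7, 6]
print(completion_order(durations, num_decos=1))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With two DECOs, job 7 finishes before job 6; with a single DECO the input order is preserved.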
Since more than one DECO can process jobs from the same Job Queue, two or more jobs from the same Job Queue may be in execution simultaneously. The only way to guarantee that jobs input on a single Job Ring complete in the order they were added to the input ring is to both:
Note that the majority of the processing information can be included in the job descriptor with the shared descriptor enforcing serialized processing.
WAIT sharing can be used if all that is required is to guarantee that certain commands in one job are complete before another job is started. Types of sharing and impact on job completion ordering are further described in Specifying different types of shared descriptor sharing.
Keep in mind the different sharing options of the SEC engine. The details are described in the LX2160A SEC RM, section 7.3.2, "Specifying different types of shared descriptor sharing".
If you choose SERIAL, you’re limited to what a single DECO can do. In SERIAL, we literally keep running all jobs with the same shared descriptor on the same ONE DECO, no matter how many DECOs may be idle. If you choose WAIT, the next job using the same shared descriptor can start early on another DECO, and can potentially finish before the first job completes on its DECO.
There are scenarios where we might as well be share-serial:
- For encap, if you choose AES-CBC mode with chained IV, the share-wait logic has to wait for all the encryption to complete so that the chained IV can be shared.
- For decap prior to LX2160, anti-replay played pretty much the same role. With the “Shared Resource” that’s on the LX2160, we’ve removed that bottleneck as well.
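A rough way to see why share-wait can degenerate into share-serial is a toy timing model. This is an illustration only, not how the hardware is implemented: each job is described by a hypothetical pair (wait_portion, total), where wait_portion is the time after which the next job using the same shared descriptor may start on another DECO.

```python
import heapq


def makespan_serial(jobs):
    """SERIAL: every job runs back to back on one DECO."""
    return sum(total for _, total in jobs)


def makespan_wait(jobs, num_decos):
    """WAIT (toy model): the next job may start on a free DECO once the
    previous job has finished its wait_portion."""
    free_at = [0] * num_decos  # min-heap of DECO free times
    heapq.heapify(free_at)
    may_start = 0  # earliest time the next job is allowed to start
    finish = 0
    for wait_portion, total in jobs:
        deco_free = heapq.heappop(free_at)
        start = max(deco_free, may_start)
        may_start = start + wait_portion
        end = start + total
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


short_wait = [(1, 4)] * 8  # only a small part of each job is serialized
chained_iv = [(4, 4)] * 8  # the whole job must finish first (chained IV)

print(makespan_serial(short_wait))             # 32
print(makespan_wait(short_wait, num_decos=4))  # 11
print(makespan_wait(chained_iv, num_decos=4))  # 32
```

In this model, when the wait portion covers the whole job, as in the chained-IV case above, WAIT takes as long as SERIAL no matter how many DECOs are idle.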
If you use AES-ECB (Electronic Codebook), the message is divided into blocks, and each block is encrypted separately. In this case you don't need to worry about "waiting" and you can use the "always" option.
Let me discuss with the AE team, will update you later.