This post introduces how to port DeepSeek to the i.MX8MP and i.MX93 EVK boards on the Yocto BSP, using llama.cpp.
The main test model used in this document is the Qwen model distilled and quantized from the DeepSeek-R1 model. For other DeepSeek model variants, follow the same steps to download different models for testing.
1. Set up the demo
ON PC
a. Prepare the cross-compiling environment.
See the i.MX Yocto Project User's Guide for detailed information on how to generate the Yocto SDK environment for cross-compiling. To activate the Yocto SDK environment on your host machine, use this command:
:$ source <Yocto_SDK_install_folder>/environment-setup-cortexa53-crypto-poky-linux
b. Cross-compile llama.cpp
e.g., for i.MX93:
:$ git clone https://github.com/ggerganov/llama.cpp
:$ cd llama.cpp
:llama.cpp$ mkdir build_93
:llama.cpp$ cd build_93
:build_93$ cmake .. -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=aarch64 -DCMAKE_C_COMPILER=aarch64-poky-linux-gcc -DCMAKE_CXX_COMPILER=aarch64-poky-linux-g++
:build_93$ make -j8
:build_93$ scp bin/llama-cli root@<your i.MX93 board IP>:~/
:build_93$ scp bin/*.so root@<your i.MX93 board IP>:/usr/lib/
c. Get the DeepSeek model from Hugging Face
Download the required DeepSeek model from Hugging Face, e.g., DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf.
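The download can be scripted with wget against the Hugging Face `resolve` endpoint. The repository path below is an assumption (community GGUF repos host these quantizations); substitute the repo you actually use. The script only prints the command so it can be reviewed before fetching a ~1 GiB file:

```shell
# Hypothetical Hugging Face repo hosting the GGUF quantization -- adjust as needed.
REPO="bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF"
FILE="DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf"
URL="https://huggingface.co/${REPO}/resolve/main/${FILE}"

# Print the command first; drop the 'echo' to actually download.
echo "wget ${URL}"
```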
ON Board
a. Test DeepSeek on the i.MX93 board
:~/# ./llama-cli --model DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
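In practice it helps to pin a few llama-cli options; the flags below exist in upstream llama.cpp, though the chosen values are only a starting point for the dual-core i.MX93, and the prompt is a made-up example. The script echoes the command so it can be reviewed before running on the board:

```shell
MODEL="DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf"

# -t: CPU threads (i.MX93 has two Cortex-A55 cores)
# -n: cap on generated tokens, keeps test runs short
# -p: non-interactive prompt instead of chat mode
CMD="./llama-cli --model ${MODEL} -t 2 -n 64 -p \"What is an NPU?\""
echo "${CMD}"   # drop the echo to run it on the board
```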
b. The results are shown below:
2. Results Analysis
Different models were tested on different boards. Note that the biggest obstacle to running a model on a board is memory. The test results, including CPU and memory usage, are as follows:
a. i.MX8MP + DeepSeek-R1-Distill-Qwen-7B-IQ4_XS
b. i.MX93 + DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M
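Since memory is the limiting factor, it is worth estimating a model's resident size before copying it to the board. A rough sketch in shell arithmetic, assuming ~5 bits per weight for Q4_K_M (the real average is closer to 4.5, and the KV cache and llama.cpp runtime buffers come on top):

```shell
params=1500000000      # 1.5B parameters (DeepSeek-R1-Distill-Qwen-1.5B)
bits_per_weight=5      # rough average for Q4_K_M quantization (assumption)

# Weights only, in MiB; KV cache and runtime buffers add more on top.
weights_mib=$(( params * bits_per_weight / 8 / 1024 / 1024 ))
echo "approx weight memory: ${weights_mib} MiB"
```

A weight footprint on the order of 900 MiB is one way to see why the 1.5B model fits on the i.MX93 while the 7B model needs the larger memory of the i.MX8MP.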
In testing, the i.MX8MP generates tokens with DeepSeek-R1-Distill-Qwen-7B-IQ4_XS at about 1 token per second, and the i.MX93 generates tokens with DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M at about 1.6 tokens per second. These generation-speed figures are rough measurements and are for reference only.
The charts above show the CPU and memory usage of the i.MX boards while the DeepSeek model is running. Note that CPU performance determines the token generation speed, while the board's memory size determines whether a model can run on that development board at all. This is a trade-off between running speed and required memory: a larger, more accurate model, such as the 7B variant, results in slower generation.
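To put the measured rates in perspective, the expected wall-clock time for a reply follows directly from tokens per second. A small sketch using the rough rates above (~1.0 tok/s on i.MX8MP with the 7B model, ~1.6 tok/s on i.MX93 with the 1.5B model); the 256-token reply length is an illustrative assumption:

```shell
tokens=256              # length of a typical answer (assumption)

# Rough measured rates from the tests above, in tokens per second.
t_8mp=$(awk -v n="$tokens" 'BEGIN { printf "%.0f", n / 1.0 }')
t_93=$(awk  -v n="$tokens" 'BEGIN { printf "%.0f", n / 1.6 }')

echo "i.MX8MP (7B,   ~1.0 tok/s): ${t_8mp} s"
echo "i.MX93  (1.5B, ~1.6 tok/s): ${t_93} s"
```

At these rates, short prompts and capped reply lengths keep the experience usable, especially for the 7B model on the i.MX8MP.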