This work grew out of my daughter's idea; she completed it with my guidance.
Cradle-1 Palmsize mini-HPC
The world's first fully functional heterogeneous mini-HPC. This is what it looks like:
Overall: CPU+GPU heterogeneous, 4 nodes, connected by a 100 Mbit/s Ethernet switch
Nodes: Freescale i.MX6 Quad mini-PC, with 4 ARM Cortex-A9 cores and 1 Vivante GC2000 GPU
OS: Ubuntu 11.10 (Linaro)
OpenCL driver: Vivante GC2000 OpenCL driver
Compiler: C/C++: gcc 4.6.1; Fortran 90/95: gfortran 4.6.1
MPI Parallel Computing: MPICH2 1.4-1
NFS network file system: nfs-kernel-server 1.2.4
SSH security: openssh 1:5.8
The hardware of all nodes is the same; only the software configurations differ slightly. One node is assigned as the master, and the others are slave nodes. They were originally TV sticks with Android 4.0 installed. Each node's hardware specification is:
CPU: 4 Cortex-A9 cores at 1.2 GHz
GPU: 1 Vivante GC2000 GPU
RAM: 1 GB DDR
ROM: 8 GB SD
NIC: USB 2.0 100 Mbit/s Ethernet adapter (not part of the original TV stick; we added it)
Display Interface: HDMI
Network Switch: 5-port 100 Mbit/s Ethernet switch
Each node has one USB 2.0 NIC and one WiFi interface. The WiFi link serves as a backup for the wired NIC connection. The network configuration is:
IP Address assignment: (baby1 - baby4 are the four computing nodes)
baby1: 100M NIC 192.168.10.1 WIFI 192.168.0.111
baby2: 100M NIC 192.168.10.2 WIFI 192.168.0.112
baby3: 100M NIC 192.168.10.3 WIFI 192.168.0.113
baby4: 100M NIC 192.168.10.4 WIFI 192.168.0.114
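The address table follows a regular pattern, so it can be generated rather than typed by hand. The following is a minimal sketch, assuming each hostname should resolve to its wired address in an /etc/hosts-style file. The function names `address_plan` and `hosts_lines` are hypothetical, and the actual hosts configuration used on Cradle-1 may differ.

```python
# Address pattern taken from the table above.
# Assumption: hostnames resolve to the wired (100M NIC) addresses.
NODES = 4


def address_plan(nodes=NODES):
    """Return {hostname: (wired_ip, wifi_ip)} following the table's pattern."""
    plan = {}
    for n in range(1, nodes + 1):
        plan["baby%d" % n] = ("192.168.10.%d" % n, "192.168.0.%d" % (110 + n))
    return plan


def hosts_lines(plan):
    """Format one /etc/hosts-style line per node, using the wired address."""
    return ["%s\t%s" % (wired, name)
            for name, (wired, _wifi) in sorted(plan.items())]


if __name__ == "__main__":
    for line in hosts_lines(address_plan()):
        print(line)
```

Keeping the WiFi addresses in the same structure makes it easy to regenerate the file with the backup addresses if the wired link fails.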
Cradle-1 has 16 ARM Cortex-A9 cores at 1.2 GHz and 4 Vivante GC2000 GPUs. The combined computing power of these 20 computing devices is more than 100 GFLOPS, which is more than an ordinary desktop offers. The whole machine is only a little bigger than a palm, and its total power consumption is less than 15 watts.
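The totals above can be checked with simple arithmetic. This sketch uses only the figures quoted in this post (4 nodes, 4 cores and 1 GPU per node, more than 100 GFLOPS, less than 15 watts). The function names are hypothetical, and the results are bounds implied by those figures, not measurements.

```python
NODES = 4
CPU_CORES_PER_NODE = 4
GPUS_PER_NODE = 1
TOTAL_GFLOPS_LOWER_BOUND = 100.0  # "more than 100 GFLOPS"
TOTAL_WATTS_UPPER_BOUND = 15.0    # "less than 15 watts"


def totals():
    """Return (cpu_cores, gpus, computing_devices) for the whole machine."""
    cpu = NODES * CPU_CORES_PER_NODE
    gpu = NODES * GPUS_PER_NODE
    return cpu, gpu, cpu + gpu


def min_gflops_per_watt():
    """Lower bound on efficiency: more than 100 GFLOPS on less than 15 W."""
    return TOTAL_GFLOPS_LOWER_BOUND / TOTAL_WATTS_UPPER_BOUND


def max_watts_per_node():
    """Upper bound on the average power drawn by one node."""
    return TOTAL_WATTS_UPPER_BOUND / NODES


if __name__ == "__main__":
    print(totals())                 # 16 CPU cores + 4 GPUs = 20 devices
    print(min_gflops_per_watt())    # about 6.67 GFLOPS per watt
    print(max_watts_per_node())     # 3.75 W per node
```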
The overall architecture of Cradle-1 is almost the same as that of China's Tianhe-1A or Titan at Oak Ridge National Laboratory. They are built on the same kind of software stack: Linux, a GPU computing interface (OpenCL on Cradle-1), and MPI (MPICH2 on Cradle-1). Cradle-1 supports C/C++ and Fortran 90/95, and almost all kinds of parallel computing algorithms can run on it. The only difference is the scale.
We wrote an MPI parallel program for large matrix multiplication that runs as 4 processes. Each process has 5 threads: four threads for the four CPU cores and one thread for GPU computing.
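The two-level split described here (4 processes, 5 workers each) can be illustrated without MPI or OpenCL. The following is a minimal sketch that emulates it serially in plain Python, assuming the result matrix is divided by contiguous row blocks. That decomposition is our assumption for illustration; the real program's partitioning, and the names `split_rows`, `multiply_rows`, and `partitioned_matmul`, are not taken from it.

```python
PROCESSES = 4            # one MPI process per block of rows (assumed)
WORKERS_PER_PROCESS = 5  # four CPU threads plus one GPU thread


def split_rows(n_rows, parts):
    """Split range(n_rows) into `parts` contiguous (start, stop) blocks."""
    base, extra = divmod(n_rows, parts)
    blocks, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        blocks.append((start, stop))
        start = stop
    return blocks


def multiply_rows(a, b, start, stop):
    """Compute rows start..stop-1 of a*b; stands in for one worker's share."""
    inner, cols = len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(start, stop)]


def partitioned_matmul(a, b):
    """Serially emulate the split: rows -> processes -> workers."""
    result = []
    for p_start, p_stop in split_rows(len(a), PROCESSES):
        for w_start, w_stop in split_rows(p_stop - p_start, WORKERS_PER_PROCESS):
            result.extend(multiply_rows(a, b, p_start + w_start, p_start + w_stop))
    return result


if __name__ == "__main__":
    a = [[i + j for j in range(3)] for i in range(23)]
    b = [[(i * j) % 5 for j in range(4)] for i in range(3)]
    print(partitioned_matmul(a, b) == multiply_rows(a, b, 0, len(a)))
```

In a real run each outer block would belong to one MPI rank and each inner block to one thread. Because a GPU and a CPU core differ in speed, an even split among the five workers is unlikely to be optimal, so the block sizes would need tuning.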
We also wrote a simple OpenCL program that displays the OpenCL driver information.
On a notebook, we used remote desktop access to reach node baby1's desktop. This is the sign-in screen of the baby1 node. baby1 has the x11vnc server installed.
Sign in to baby1 and open a terminal.
We ran an MPI test program to confirm that all babies (baby1 - baby4) were working.
Any comments? Please mail firstname.lastname@example.org