The difference probably lies in the N status flag. For MULU, the result cannot be negative (in contrast to the description of the N flag's value after the operation: "Set if result is negative; cleared otherwise"), so the flag should always be 0.
For MULS the result can be negative in some calculations.
I didn't test this myself, but you could try the following value pairs:
0xFFFFFFFF * 0xFFFFFFFF = 0xFFFFFFFE00000001
    MULU: 0x00000001, N=0
    MULS: 0x00000001, N=0

0x80000000 * 0x00000001 = 0x0000000080000000
    MULU: 0x80000000, N=0
    MULS: 0x80000000, N=1
It gets interesting when 32-bit overflow comes into play, like this:
0x80000000 * 0x00000002 = 0x0000000100000000
    MULU: 0x00000000, N=0
    MULS: 0x00000000, N=1?
I'm unsure what the processor does in this case. If the N flag simply represents bit 31 of the 32-bit result, N will be 0. If the N flag is computed as the XOR of bit 31 of the two 32-bit input values, N will be 1.
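To make the two possibilities concrete, here is a small Python sketch. It only models the two guesses above, not the actual processor; the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def n_from_result_bit31(a, b):
    """Guess 1: N is bit 31 of the truncated 32-bit product."""
    low = (a * b) & MASK32
    return (low >> 31) & 1

def n_from_operand_signs(a, b):
    """Guess 2: N is the XOR of bit 31 of the two 32-bit operands."""
    return (((a & MASK32) >> 31) ^ ((b & MASK32) >> 31)) & 1

# The overflow case discussed above: the two guesses disagree.
a, b = 0x80000000, 0x00000002
print(n_from_result_bit31(a, b), n_from_operand_signs(a, b))
```

For 0x80000000 * 0x00000002 the first function gives 0 and the second gives 1, so running MULS on this pair and looking at N should tell which guess (if either) matches the hardware.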
I would assume that the multiplier unit treats all values as unsigned, so the 32-bit results will be identical in all cases. Creating the two's complement of all negative inputs (inverting all bits and adding 1), and then potentially doing the same for the result, is simply too expensive (time and transistors). I seem to remember that at university we once proved that doing so gives identical results in all applicable value ranges, but I'm not sure anymore.
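The claim about identical results can at least be spot-checked for the low 32 bits of the product. The Python sketch below is not a proof and says nothing about how the hardware is built; the helper names and sample values are my own. It multiplies the same bit patterns once as unsigned and once as two's-complement signed numbers and compares the low 32 bits.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mul_low32_unsigned(a, b):
    """Low 32 bits of the product, operands taken as unsigned."""
    return ((a & MASK32) * (b & MASK32)) & MASK32

def mul_low32_signed(a, b):
    """Low 32 bits of the product, operands taken as signed."""
    return (to_signed32(a) * to_signed32(b)) & MASK32

samples = [0, 1, 2, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF,
           0x12345678, 0xDEADBEEF]
for a in samples:
    for b in samples:
        assert mul_low32_unsigned(a, b) == mul_low32_signed(a, b)
print("low 32 bits agree for all sample pairs")
```

This works because a signed operand differs from its unsigned reading by a multiple of 2^32, which drops out when the product is reduced modulo 2^32. The same does not hold for the upper 32 bits of a 64-bit product, so the equivalence only applies to the truncated 32-bit result.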
HTH,
Johan