Performs a carry-less multiplication of two 64-bit polynomials over the finite field GF(2) in each of the 4 128-bit lanes, producing a 128-bit product per lane.
The immediate byte determines which 64-bit half of each lane is used:
bit 0 selects the low (0) or high (1) half of the first operand, and
bit 4 does the same for the second operand. All other immediate bits are ignored.
All lanes share the same immediate byte.
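
As a rough illustration, the per-lane behaviour can be modelled in plain Rust without any intrinsics. This is only a sketch: the helper names `clmul64`, `clmul_lane` and `clmul_4lanes` are hypothetical and not part of any API, and each 128-bit lane is represented as a `u128` for convenience.

```rust
/// Carry-less multiply of two 64-bit polynomials over GF(2).
/// Partial products are combined with XOR instead of addition,
/// so no carries propagate between bit positions.
fn clmul64(a: u64, b: u64) -> u128 {
    let mut acc: u128 = 0;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Hypothetical software model of a single 128-bit lane.
/// Bit 0 of `imm8` picks the half of `a`, bit 4 picks the half of `b`;
/// every other bit of `imm8` is ignored.
fn clmul_lane(a: u128, b: u128, imm8: u8) -> u128 {
    let x = if imm8 & 0x01 == 0 { a as u64 } else { (a >> 64) as u64 };
    let y = if imm8 & 0x10 == 0 { b as u64 } else { (b >> 64) as u64 };
    clmul64(x, y)
}

/// Hypothetical model of the 4-lane operation: the same `imm8`
/// is applied to every lane independently.
fn clmul_4lanes(a: [u128; 4], b: [u128; 4], imm8: u8) -> [u128; 4] {
    let mut out = [0u128; 4];
    for i in 0..4 {
        out[i] = clmul_lane(a[i], b[i], imm8);
    }
    out
}
```

For example, with a lane of `a` holding 3 in its low half and 7 in its high half, an immediate of `0x01` multiplies the high half of `a` by the low half of `b`, while `0x10` multiplies the low half of `a` by the high half of `b`.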