Implemented according to the ISA spec. Validated with silion. In
particular the sign extend is important for the signed variants and the
unsigned variants seem to overflow lanes (hence why there is no mask()
in the unsigned varints. FP16 -> FP32 continues using ARM's fplib.
Tested vs. an MI210. Clamp has not been verified.
Change-Id: Ifc09aecbc1ef2c92a5524a43ca529983018a6d59