This commit reverses the runqueue_cache bit order when the architecture
has a CLZ (count leading zeros) instruction. When the architecture
supports CLZ, it is faster to determine the most significant set bit of
a word than to determine the least significant bit set. Unfortunately
when the instruction is not available, it is more efficient to determine
the least significant bit set.
Reversing the bit order shaves off another 4 cycles on the same54-xpro.
From 147 to 143 ticks when testing with tests/bench_sched_nop.
Architectures where no CLZ instruction is available are not affected.