Because each vector load is fragmented into 64 byte cache-aligned
chunks, and one page-table walk is issued per fragment on tlb miss,
walks start to accumulate on a pending queue, which is processed in a
blocking way (no pending walks can be issued while one is being
processed). This adds noticeable latency on vector loads when VLEN is
sufficiently large.
This commit fixes the issue by allowing walks to be squashed if a TLB
lookup hits just before starting the walk on `startWalkWrapper`. This
idea was taken from the ARM walker.