
I threw together a quick RISC-V vectorized implementation:

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    // counts occurrences of 's' minus occurrences of 'p', up to the null byte
    size_t run(char *str) {
        uint8_t *p = (uint8_t*)str;
        long end = 0;
        size_t res = 0, vl;
        while (1) {
            vl = __riscv_vsetvlmax_e8m8();
            vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl); // fault-only-first load, may truncate vl
            end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
            if (end >= 0) // found the null byte
                break;
            res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
            res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
            p += vl;
        }
        // count the final partial chunk, up to but not including the null byte
        vl = __riscv_vsetvl_e8m8(end);
        vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
        res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
        res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
        return res;
    }

Here are the results from the above, the switch, and the table C implementations, run on my MangoPi MQ-Pro (C906: in-order rv64gc with rvv 0.7.1 and a 128-bit vector length):

    switch: 0.19 Bytes/Cycle
    tbl:    0.17 Bytes/Cycle
    rvv:    1.57 Bytes/Cycle (dips down to 1.35 after ~30 KiB)
Edit: you can go up to 2/1.7 Bytes/Cycle if you make sure the pointer is page-aligned (and vl isn't larger than the page size), see comments


To be fully correct, you'd need the load to be a fault-only-first load (which RVV does have); otherwise the load could fault if the null byte was just before the end of allocated memory.
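
For reference, a minimal strlen along the same lines (an untested sketch; with vle8ff.v, a fault past the first element truncates vl instead of trapping the whole load):

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    static size_t strlen_rvv(const char *str) {
        const uint8_t *p = (const uint8_t *)str;
        for (;;) {
            size_t vl = __riscv_vsetvlmax_e8m8();
            // ff load: faults past the first element only shrink vl
            vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
            long end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
            if (end >= 0)
                return (size_t)(p - (const uint8_t *)str) + (size_t)end;
            p += vl;
        }
    }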


I just found your rvv intrinsics-viewer [0]; that'll be so helpful.

I tried building one myself, but my miserable web skills didn't allow me to lazily load the instructions, which made it too slow for actual use.

Can I share your project on lemmy?

[0] https://dzaima.github.io/intrinsics-viewer


Go ahead! I'm not much of a web dev either, but decided to struggle through it, mainly just to have better searching. (It was originally for Intel & ARM intrinsics, which are also available if downloaded offline.)


I'm not sure I fully understand fault-only-first loads, but reading the description of vle8ff.v, I think I only need to replace the load inside the loop?

How does the normal load deal with faults?

I'll update the parent comment; the ff load slowed it down from 2/1.7 to 1.57/1.36 Bytes/Cycle.


You'd probably want a new __riscv_vsetvlmax_e8m8 at the start of each loop iteration, as otherwise an earlier iteration could truncate the vl (e.g. from a page unloaded by the OS), and the loop would then continue with the truncated vl.

The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.


> You'd probably want a new __riscv_vsetvlmax_e8m8 at the start of each loop iteration, as otherwise an earlier iteration could truncate the vl (e.g. from a page unloaded by the OS), and the loop would then continue with the truncated vl.

Oh, yeah, that was a big oversight. Unfortunately, this didn't undo the performance regression.

> The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.

I don't quite understand how that plays out.

The reference memcpy implementation uses `vle8.v` and the reference strlen implementation uses `vle8ff.v`.

I think I understand how it works in strlen, but why does memcpy work without the ff? Does it just skip the instruction, or repeat it? In either case, shouldn't `vle8.v` work with strlen as well? There must be another option, but I can't think of one.

Also, does this mean I can get the original performance back if I make sure to page-align my pointers and use `vle8.v`?


The memcpy doesn't use vlmax; it uses a hand-chosen vl. The load won't fault on any elements it doesn't load (here, elements past the vl), so the memcpy is fine, as it only loads bytes it'll definitely need, whereas your original code could read elements past the null byte.
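
The usual RVV memcpy pattern looks roughly like this (a sketch, not the exact reference code):

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    static void *memcpy_rvv(void *dst, const void *src, size_t n) {
        uint8_t *d = (uint8_t *)dst;
        const uint8_t *s = (const uint8_t *)src;
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e8m8(n); // clamp vl to the remaining byte count
            vuint8m8_t v = __riscv_vle8_v_u8m8(s, vl);
            __riscv_vse8_v_u8m8(d, v, vl);
            s += vl; d += vl; n -= vl;
        }
        return dst;
    }

Since vl never exceeds n, the plain vle8.v only touches bytes the caller asked to copy.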

And yeah, aligning the pointer manually would work (though then it wouldn't be portable code, as the spec allows rvv implementations with a VLEN of up to 65536 bits (8KB per register; 64KB with LMUL=8), which would be larger than the regular 4KB pages).
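
A portable variant of the same trick is to clamp vl to the end of the current page instead of relying on vlmax (untested sketch, assuming 4 KiB pages; any page containing part of the string is readable to its end, so over-reading within it can't fault):

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096 // assumption: 4 KiB pages

    size_t run_paged(char *str) {
        uint8_t *p = (uint8_t *)str;
        size_t res = 0;
        for (;;) {
            // never let a load cross a page boundary
            size_t to_page = PAGE_SIZE - ((uintptr_t)p & (PAGE_SIZE - 1));
            size_t vl = __riscv_vsetvl_e8m8(to_page);
            vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
            long end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
            if (end >= 0) // only count up to the null byte
                vl = __riscv_vsetvl_e8m8((size_t)end);
            res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
            res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
            if (end >= 0)
                return res;
            p += vl;
        }
    }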


Ah, this makes a lot more sense now. I thought the "fault" was about the kernel interrupting when a new page needs to be loaded into physical memory, which would also happen for memcpy.



