I've been looking into this lately, and there are a few reasons the 3DS would be different.
First, I tested fast instructions against sanitized multiplies (dpps, mulps, and so on), and I have yet to find a game that breaks due to the inaccuracy. So HLE is possible if the speed benefit looks like a good trade-off; Cemu actually decompiles shaders to GLSL and caches them.
Second, PICA branches are easily decompiled in 99% of cases, since they are mostly used where nesting is not possible (most likely there is a limit on how deeply control structures can nest in PICA).
Third, caching GPU shaders can be done with the ARB_get_program_binary extension (https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_get_program_binary.txt), which is supported by most drivers that support GL 3.0.
This does not mean throwing away the current JIT, since there should always be an LLE option. I'm currently working on a recompiler called "Marssel", in the style of dynarmic. Right now I'm analyzing the control flow of PICA shaders, trying to eliminate dead code and fold stale branches with constant propagation. My first goal is a more robust x86/x64 JIT based on asmjit (https://github.com/asmjit/asmjit). I suggest you take a look at it; it is way more robust than Xbyak: it has a decent register allocator for starters, an ARM backend, runtime VEX encoding when available, multi-target jitting (x86 or x64 depending on the current machine), and it is lightweight (~200 KB). After that, I'll use the same frontend to generate GLSL and/or SPIR-V as well.