[code] Candidate places for OpenCL?


#1

Are there any obvious places/parts where OpenCL would be appropriate for speeding things up?

I noticed y2r is a massive one (by log spam alone). Are there any other large stream/raster/texture/block-compression processes hiding about that are just as embarrassingly parallel and have enough volume to be worth it? I see quite a bit going on in the mixer and audio_core, but if the volume isn’t large enough, it’s not worth it.


#2

y2r probably isn’t one. AFAIR, decoding is still done on the CPU and not in the OS (otherwise video playback would probably already be a lot faster).


I think it would work for:

  • Vertex shaders*,**
  • Software rendering*
  • Texture decoding*,**
  • Texture / Surface conversions**
  • Possibly audio, but the OpenCL overhead is probably too large*

* = I think rather than working on OpenCL we should first employ MT / OpenMP
** = Would require OpenCL / OpenGL interaction


So overall: not a fan of OpenCL for most things. I believe it adds too much complexity to the design early on, and in a lot of cases the overhead is probably larger than the benefit.
As Citra currently does all of the above tasks single-threaded, we should first try to make better use of the CPU by multi-threading them.

Additionally the vertex shaders currently do a ton of memory accesses. We should fix those (by adding a register allocator) and use AVX512 if available.
Similar things go for the software renderer: It’s absolutely horrible and slow right now. It should be a simple JIT, multithreaded with a lot of SIMD (and a fallback interpreter using the same emitter code).
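
To make the "multi-thread it first" suggestion concrete, here is a minimal sketch of a row-parallel surface conversion using OpenMP. The function name and the RGBA8 to BGRA8 swizzle are hypothetical illustrations, not Citra code; the point is only that each row is independent, so the outer loop can be split across threads with a single pragma.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Citra code): converts a tightly packed RGBA8 image
// to BGRA8. Rows do not depend on each other, so the outer loop can be shared
// between threads. If OpenMP is not enabled at compile time, the pragma is
// ignored and the loop simply runs serially.
void SwizzleRgbaToBgra(const std::vector<std::uint8_t>& src,
                       std::vector<std::uint8_t>& dst, int width, int height) {
    assert(width >= 0 && height >= 0);
    const std::size_t row_bytes = static_cast<std::size_t>(width) * 4;
    assert(src.size() == row_bytes * static_cast<std::size_t>(height));
    dst.resize(src.size());

#pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        const std::uint8_t* in = src.data() + static_cast<std::size_t>(y) * row_bytes;
        std::uint8_t* out = dst.data() + static_cast<std::size_t>(y) * row_bytes;
        for (int x = 0; x < width; ++x) {
            out[4 * x + 0] = in[4 * x + 2]; // B
            out[4 * x + 1] = in[4 * x + 1]; // G
            out[4 * x + 2] = in[4 * x + 0]; // R
            out[4 * x + 3] = in[4 * x + 3]; // A
        }
    }
}
```

Whether this beats a single thread depends on image size; for small surfaces the thread start-up cost can easily outweigh the work, so it would need measuring.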


#3

Thanks for the list.

These days I pretty much exclusively write OpenCL for the CPU profile, so my take on OpenMP/MT vs. OpenCL is different. Either way, it’s trivial to slap it into a few places and measure.


#4

Worth it. Just OpenCL’ing y2r made SMT: Devil Summoner: Soulhackers playable on an Intel HD4000 in a Surface Pro 1.

That’s about what I expected. I expect better gains as I move onward eliminating double-fors.
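
One way to read "eliminating double-fors" is collapsing a nested y/x loop into a single flat index space, which is the shape an OpenCL kernel sees through `get_global_id`. The sketch below shows that shape in plain C++; the horizontal mirror operation and both function names are made up for illustration and are not taken from the actual y2r change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-pixel "kernel" body: mirrors one pixel horizontally.
// `gid` plays the role of a flat work-item index; x and y are recovered from it.
inline void MirrorPixel(const std::uint8_t* src, std::uint8_t* dst,
                        std::size_t gid, std::size_t width) {
    const std::size_t x = gid % width;
    const std::size_t y = gid / width;
    dst[y * width + (width - 1 - x)] = src[gid];
}

// One flat loop over width * height items instead of a nested y/x loop.
void MirrorImage(const std::vector<std::uint8_t>& src,
                 std::vector<std::uint8_t>& dst,
                 std::size_t width, std::size_t height) {
    assert(src.size() == width * height);
    dst.resize(src.size());
    for (std::size_t gid = 0; gid < src.size(); ++gid) {
        MirrorPixel(src.data(), dst.data(), gid, width);
    }
}
```

Once the body only depends on its own index, the flat loop can be handed to whatever runs it in parallel, be that an OpenCL runtime or a thread pool.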

Dungeons are still sluggish, but the battle, attack, and cutscene FMVs are no longer a problem.


#5

My experience with the OpenCL AMD implementation is that natively compiled code and just OpenMP gave better perf (which makes sense as OpenCL is more abstract). So the only use case I see for OpenCL is GPU acceleration. Hence my comment about it.

Can you link the code for the OpenCL y2r?


#6

Sure, I’ll clean it up a smidge and push it out onto GitHub. I only had to deal with 2 of the cases for SoulHackers, so it was pretty trivial (I’ve done YCoCg both ways in OpenCL before, so I knew what I was looking at).
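
For readers who haven’t met YCoCg: below is a minimal sketch of the transform in both directions, using the reversible lifting variant (YCoCg-R). That choice of variant, and the struct and function names, are assumptions for illustration; this is not the code referred to above.

```cpp
#include <cassert>

struct Rgb { int r, g, b; };
struct YCoCg { int y, co, cg; };

// Forward YCoCg-R in lifting form. Expects 8-bit channel values.
inline YCoCg ToYCoCgR(Rgb p) {
    assert(p.r >= 0 && p.r <= 255 && p.g >= 0 && p.g <= 255 && p.b >= 0 && p.b <= 255);
    const int co = p.r - p.b;
    const int t = p.b + (co >> 1);
    const int cg = p.g - t;
    const int y = t + (cg >> 1);
    return {y, co, cg};
}

// Inverse: undoes the lifting steps in reverse order. Because it reuses the
// exact same shifted terms as the forward pass, the round trip is exact.
inline Rgb FromYCoCgR(YCoCg c) {
    const int t = c.y - (c.cg >> 1);
    const int g = c.cg + t;
    const int b = t - (c.co >> 1);
    const int r = b + c.co;
    return {r, g, b};
}
```

Like YUV to RGB, each pixel is converted independently, which is why this kind of colour transform maps so easily onto a per-pixel kernel.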

Note: I gave zero shits about the patterns employed in general, so it’s very ad hoc and not PR-worthy. I don’t dig the code style, at all.

Also: an HD4000 is pretty much the best possible OpenCL scenario, and SoulHackers is pretty much video-everything, so even minor gains add up there. SMT IV is totally unplayable on the same hardware, not that it matters, since SMT IV is pretty much unplayable off of an actual 3DS.

“My experience with the OpenCL AMD implementation is that natively compiled code and just OpenMP gave better perf (which makes sense as OpenCL is more abstract).”

The only 100% win scenario for OpenCL is JIT-like cases, where “if I could compile each unique case” would be awesome. OpenCL CPU profiles are completely awesome for those cases, even if they’re not parallel-heavy. The rest is, as you describe, a wild myriad that may very well lose to static cases.


#7

That’s very interesting, do you have concrete numbers?

Y2R is just YUV->RGB conversion. I coded it in a way that shouldn’t be too slow in an optimized build. It would probably benefit a lot from using SIMD, but I haven’t bothered looking at it because it rarely seemed like a bottleneck compared to the CPU video decoding.
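
For reference, the per-pixel work in a YUV->RGB conversion looks roughly like the sketch below. It assumes full-range BT.601 coefficients in 8.8 fixed point, which is an illustrative choice and not necessarily what the Y2R hardware or Citra’s implementation uses; the type and function names are hypothetical too.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb8 {
    std::uint8_t r, g, b;
};

// Clamp an intermediate value into the 0..255 range of an 8-bit channel.
inline std::uint8_t ClampToU8(int v) {
    return static_cast<std::uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Full-range BT.601 YUV -> RGB (assumed coefficients: 1.402, 0.344136,
// 0.714136, 1.772, each scaled by 256 and rounded). The +128 rounds to
// nearest before the division by 256; negative results clamp to 0.
inline Rgb8 YuvToRgb(std::uint8_t y, std::uint8_t u, std::uint8_t v) {
    const int c = 256 * y;
    const int d = u - 128;
    const int e = v - 128;
    const int r = (c + 359 * e + 128) / 256;
    const int g = (c - 88 * d - 183 * e + 128) / 256;
    const int b = (c + 454 * d + 128) / 256;
    return {ClampToU8(r), ClampToU8(g), ClampToU8(b)};
}
```

Every output pixel depends only on its own Y, U and V samples, with a handful of multiply-adds and a clamp, which is why the conversion is a natural fit for SIMD or a per-pixel kernel.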