Sure, I'll clean it up a smidge and push it to GitHub. I only had to deal with 2 of the cases for Soul Hackers, so it was pretty trivial (I've done YCoCg conversion both ways in OpenCL before, so I knew what I was looking at).
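For anyone who hasn't run into it, "YCoCg both ways" is just a pair of small linear transforms. The below is a minimal sketch of the textbook (non-reversible-integer) definition in plain Python, not the code I'm pushing; the function names are made up for illustration.

```python
def rgb_to_ycocg(r, g, b):
    """Forward transform, textbook YCoCg (assumed variant, not YCoCg-R)."""
    y = 0.25 * r + 0.5 * g + 0.25 * b
    co = 0.5 * r - 0.5 * b
    cg = -0.25 * r + 0.5 * g - 0.25 * b
    return y, co, cg


def ycocg_to_rgb(y, co, cg):
    """Inverse transform: only additions and subtractions are needed."""
    tmp = y - cg  # equals 0.5*r + 0.5*b
    return tmp + co, y + cg, tmp - co


# Round trip on one pixel; 8-bit inputs scaled by 1/2 and 1/4 stay exact in floats.
pixel = (200, 100, 50)
restored = ycocg_to_rgb(*rgb_to_ycocg(*pixel))
```

The inverse being add/subtract only is a big part of why it maps so easily onto a per-pixel kernel.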
Note: I gave zero shits about the patterns employed in general, so it's very ad hoc and not PR-worthy. I don't dig the code style at all.
Also, this is on an HD 4000, which is pretty much the best possible OpenCL scenario, and Soul Hackers is pretty much video-everything, so even minor gains add up there. SMT IV is totally unplayable on the same hardware, not that that matters, since SMT IV is pretty much unplayable anywhere other than an actual 3DS.
My experience with AMD's OpenCL implementation is that natively compiled code with just OpenMP gave better perf (which makes sense, as OpenCL is more abstract).
The only 100% win scenario for OpenCL is JIT-like cases, the "if only I could compile each unique case" kind. OpenCL CPU profiles are completely awesome for those, even if they're not parallel-heavy. The rest of it is, as you describe, a wild mixed bag that may very well lose to statically compiled code.
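To make "compile each unique case" concrete, here's a rough sketch of the caching pattern I mean, in plain Python with no real OpenCL calls. `build_options`, `get_specialized` and `compile_fn` are hypothetical names; in practice `compile_fn` would wrap whatever runtime compile your OpenCL binding offers. The only real OpenCL detail assumed is the `-D NAME=value` preprocessor build option.

```python
_kernel_cache = {}  # build-options string -> compiled kernel (hypothetical cache)


def build_options(case):
    """Turn a dict describing one unique case into "-D NAME=value" defines."""
    return " ".join(f"-D {name}={value}" for name, value in sorted(case.items()))


def get_specialized(case, compile_fn):
    """Compile once per unique case, then reuse the cached result."""
    key = build_options(case)
    if key not in _kernel_cache:
        _kernel_cache[key] = compile_fn(key)
    return _kernel_cache[key]


# Stand-in compiler that just records what it was asked to build.
compiled = []


def fake_compile(options):
    compiled.append(options)
    return f"kernel<{options}>"


first = get_specialized({"WIDTH": 400, "FORMAT": 2}, fake_compile)
again = get_specialized({"FORMAT": 2, "WIDTH": 400}, fake_compile)  # cache hit
```

Each distinct case pays the compile cost once, and the constants get baked in at build time instead of being branched on per pixel.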