I’ve spent about 30% of my total productive hours over the last year working on a single programming problem: a high-performance GPU implementation of a complex new algorithm. I’ve had the general idea in my head for over two years, and I’ve evaluated perhaps a dozen approaches, none of which reached even 1% work efficiency in the general case.
A few months ago I finally saw a path that could reach 25% efficiency and beyond, but it was really complicated: dozens of subpasses, each requiring a highly tuned parallel algorithm of its own, with lots of critical inner-loop code.
Four days ago I finally saw a way to do all of that in a single pass (under ~1k lines of kernel code) while still meeting all of the constraints, and just today I got this new codepath compiling into a cubin that looks reasonably on target.
Do you work at Google or something?