Comparison of the Async Compute capabilities (graphics and compute workloads running concurrently, aka overlapping workloads) of Maxwell and Pascal. This is my own deduction of the capabilities of NV GPUs.
For Maxwell (this example is based on GTX980 with 16 SMs)
SMs are partitioned to work on graphics and compute workloads concurrently. For example, 10 SMs work on the graphics workload and 6 SMs work on the compute workload. If the compute workload completes first, its SMs have to wait until the graphics workload is completed (so 6 SMs will be idling).
If an SM needs to be switched to a different workload, it requires a context switch (stop the existing work on that particular SM, flush memory, transfer new data, process the new workload).
This does mean that Maxwell can process graphics and compute at the same time. However, it requires smart and correct partitioning so that the right number of SMs is assigned to each workload. (Say the application is 80% graphics and 20% compute: then 13 SMs can be assigned to graphics and 3 SMs to compute on a GTX980.)
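To make the ratio idea concrete, here is a rough Python sketch. It is only my illustration of proportional partitioning, not how the driver actually decides; `partition_sms` is a made-up helper, and the rule of keeping at least one SM on each side is my own assumption.

```python
def partition_sms(total_sms, graphics_share):
    """Split SMs between graphics and compute by workload share.

    Hypothetical helper for illustration only; the real driver logic is not public.
    """
    graphics_sms = round(total_sms * graphics_share)
    # Assumption: keep at least one SM on each side so neither workload starves.
    graphics_sms = max(1, min(total_sms - 1, graphics_sms))
    return graphics_sms, total_sms - graphics_sms


# GTX980 with 16 SMs, application that is 80% graphics and 20% compute.
print(partition_sms(16, 0.80))  # (13, 3)
```

16 x 0.8 = 12.8, which rounds to the 13/3 split mentioned above.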
My theory: any driver that improves "Async Compute" performance is probably only making the SM partitioning algorithm more intelligent in deciding the best ratio.
For Pascal
As above, SMs are partitioned to work on both workloads. However, an "enhancement" has been made so that idling SMs can be reassigned to the workload that is still running (graphics in the example above). It has the capability to pause the currently running workload (no need to flush memory; this is not detailed in the whitepaper) and switch to a new workload in a very short time (under 100 microseconds), so the heavier workload can be finished faster using more SMs.
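A toy model of the difference, in Python. This is my own simplification, not anything from NVIDIA: work is measured in arbitrary "SM-time" units, every SM is equally fast, and `switch_cost` is a hypothetical delay before the idle SMs can join the other workload.

```python
def static_finish(g_work, c_work, g_sms, c_sms):
    """Fixed partition: SMs that finish early idle until the other side is done."""
    return max(g_work / g_sms, c_work / c_sms)


def dynamic_finish(g_work, c_work, g_sms, c_sms, switch_cost=0.0):
    """Idle SMs join the slower workload after paying switch_cost (assumed model)."""
    t_g, t_c = g_work / g_sms, c_work / c_sms
    first = min(t_g, t_c)  # moment the lighter workload finishes
    if t_g >= t_c:
        own, idle, remaining = g_sms, c_sms, g_work - first * g_sms
    else:
        own, idle, remaining = c_sms, g_sms, c_work - first * c_sms
    # Work the original SMs get through while the idle ones are switching over.
    done_during_switch = own * switch_cost
    if remaining <= done_during_switch:
        return first + remaining / own
    return first + switch_cost + (remaining - done_during_switch) / (own + idle)


# 10 SMs on graphics, 6 on compute; graphics is the heavier workload.
print(static_finish(100, 30, 10, 6))   # 10.0
print(dynamic_finish(100, 30, 10, 6))  # 8.125
```

In this model, when the split is already balanced (for example 100 units of graphics and 60 of compute on a 10/6 split) both functions return the same time, because there are no idle SMs to reassign.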
What It Means
Maxwell/Pascal performance in DX11 and DX12 is generally the same for async compute. NVIDIA doesn't require DX12 to do async compute. In fact, they demoed the Async Compute capability in DX11 in the recent Pascal presentation. The scheduler can partition the SMs correctly based on the requirement.
Pascal's async compute capability is also better than Maxwell's. However, Maxwell's best case is similar to Pascal's best case.
Scheduling
Where AMD has the Graphics Engine and ACEs, NVIDIA has the Gigathread Engine (also a hardware scheduler) that serves a similar purpose. Most of NVIDIA's scheduling magic is done in Gigathread (there is very little information on this block).
Say a company has to distribute work (comparing R9 390X vs GTX980):
R9 390X: 9 department heads (1 GE + 8 ACEs) --> 44 group leaders (scheduler in each CU) --> each group leader in charge of 4 groups (each group has 16 people)
GTX980: boss (Gigathread) --> 16 departments (SMs) --> 4 department heads each (warp schedulers) --> each department head in charge of 1 group (each group has 32 people)
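As a quick sanity check on the analogy, multiplying out the two hierarchies gives the total number of "people" on each card. The variable names are just mine for illustration.

```python
# Figures taken from the analogy above.
r9_390x_people = 44 * 4 * 16  # group leaders (CUs) x groups per leader x people per group
gtx980_people = 16 * 4 * 32   # departments (SMs) x dept heads per SM x people per group

print(r9_390x_people)  # 2816
print(gtx980_people)   # 2048
```

These line up with the 2816 stream processors of the R9 390X and the 2048 CUDA cores of the GTX980.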
Do correct me if there's anything wrong in this.
