
 NVIDIA GeForce Community V15 (new era pascal), ALL HAIL NEW PASCAL KING GTX1080 out now

Demonic Wrath
post Jun 4 2016, 10:34 AM


QUOTE(JohnLai @ Jun 3 2016, 11:51 PM)
............Does the Pascal driver actually have async compute support enabled in the first place?

Nvidia keeps claiming async compute support is still not enabled in the driver.
I might be wrong, but it doesn't seem to need a "special driver" to enable async compute capability. You can read the GTX 1080 whitepaper for NVIDIA's approach to async compute on Maxwell and Pascal.

You can see this benchmark. It shows that NVIDIA's cards are capable of concurrent graphics + compute even in DX11. In DX11 mode, even a GTX 980 beats the Fury X.
[Benchmark chart attached as a spoiler image.]


Note that AOTS suits GCN's architecture very well. It shows what GCN is capable of in terms of pure compute performance. If a Fury X uses all of its CUs (compute units) at 100% (8.6 TFLOPS), it will beat the GTX 980 Ti (6.1 TFLOPS). In the best case, it can even perform similarly to the GTX 1080 (8.8 TFLOPS).

But in games, it is limited by scheduling efficiency, pixel fillrate (ROPs), geometry performance, etc. Async compute solves the scheduling-efficiency issue for AMD so it can better utilize its massive number of CUs. In the end, how well the hardware is utilized and how balanced it is are what matter. (Compare Intel vs AMD CPUs: more cores does not equal more performance.)

IMHO, NVIDIA's cards are more balanced and better utilized. That's why the GTX 980 Ti can perform better than or on par with the Fury X even though its theoretical peak shader performance is lower than AMD's.
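For reference, here's roughly where those theoretical peak figures come from, as a small Python sketch using peak FP32 = shader count x 2 ops per clock (FMA) x clock. The shader counts are the cards' standard specs; the clock values are the usual reference/boost clocks and are my assumption, not numbers from this post.
CODE
def peak_tflops(cores, clock_mhz):
    # peak FP32 = cores x 2 ops per clock (FMA) x clock in GHz
    return cores * 2 * clock_mhz / 1e6

for name, cores, mhz in [("Fury X (4096 SP @ 1050 MHz)",     4096, 1050),   # ~8.6 TFLOPS
                         ("GTX 980 Ti (2816 CC @ 1075 MHz)", 2816, 1075),   # ~6.1 TFLOPS
                         ("GTX 1080 (2560 CC @ 1733 MHz)",   2560, 1733)]:  # ~8.9 TFLOPS
    print(f"{name}: {peak_tflops(cores, mhz):.1f} TFLOPS")
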

Any game built for a specific vendor's architecture will run better on that vendor's hardware. You can see that when the tessellation factor is increased to 64x (overkill) or GameWorks is enabled, it runs better on NVIDIA hardware.
Demonic Wrath
post Jun 5 2016, 11:55 AM


QUOTE(adilz @ Jun 4 2016, 08:26 PM)
Bro, sorry, had to correct you here. Async compute is one of the new features available in DX12, but not in DX11. Nvidia's previous Maxwell GPUs do not support async compute in DX12 the way AMD's Fiji GPU does. In the case of Ashes of the Singularity, it can run in DX11 or DX12, and in DX12 async compute can be enabled or disabled. It was the AoTS benchmark that highlighted the Maxwell async compute issues. There are quite a number of analyses, but here are a few, generally comparing the GTX 980 Ti vs the Fury X.
There seems to be some misunderstanding about what "DX12 async compute enabled or disabled" actually means.

Most of the "async compute" discussion is really about the scheduling method, not about whether the GPU can process graphics and compute tasks at the same time. What DX12 enables is for work to be dispatched concurrently. If no work is dispatched to a compute unit, it idles.

Traditionally, DX11 has a single hardware work queue; the CPU sees only a single queue to submit tasks to. A queue is basically a list of pending tasks waiting to be sent to the GPU's compute units for processing.

Say you have 3 streams of tasks.

CODE
Stream ABC - graphic
Stream DEF - compute
Stream GHI - compute

Stream ABC is independent of DEF (and GHI), so theoretically they can be processed in parallel on different compute units.

In DX11 with a single work queue,

Step 1: The CPU submits ABC | DEF | GHI to this work queue sequentially. The GPU can only tell whether tasks can be processed concurrently once they reach the scheduler.
CODE
DX11 CPU to GPU Hardware queue: ABC | DEF | GHI


Step 2: The GPU scheduler dispatches tasks to idle compute units (from top to bottom) for processing:
CODE
A
B
C and D concurrently since the scheduler knows it is independent of each other
E
F and G concurrently since the scheduler knows it is independent of each other
H
I

This leaves some of the compute units under-occupied.
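To make the model above concrete, here's a minimal Python sketch of this single-queue case (my own simplified toy model, not real driver code): tasks within a stream depend on each other, and the scheduler only peeks at the next entry in the queue, so it can only co-issue a pair when the adjacent entry happens to come from a different (independent) stream.
CODE
from collections import deque

tasks = [("A", "gfx"), ("B", "gfx"), ("C", "gfx"),
         ("D", "cmp0"), ("E", "cmp0"), ("F", "cmp0"),
         ("G", "cmp1"), ("H", "cmp1"), ("I", "cmp1")]

queue = deque(tasks)          # single hardware queue: ABC | DEF | GHI
cycle = 0
while queue:
    cycle += 1
    task, stream = queue.popleft()
    issued = [task]
    # peek one entry ahead: only an adjacent task from another stream is known-independent
    if queue and queue[0][1] != stream:
        issued.append(queue.popleft()[0])
    print(f"cycle {cycle}: {' + '.join(issued)}")
# -> A / B / C + D / E / F + G / H / I  (7 cycles, some compute units sit idle)
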

In DX12, there are now three types of hardware queues (graphics, compute, copy).

Step 1: The CPU sends each task to its respective queue (graphics tasks to the graphics queue, etc.). Remember, the tasks need to be independent of each other so they don't rely on each other's data.

Now the queues become:
CODE
Graphic hardware queue: ABC
Compute hardware queue 0: DEF
Compute hardware queue 1: GHI


Step 2: The GPU scheduler then dispatches tasks to idle compute units (from top to bottom) for processing:
CODE
A, D, G concurrently
B, E, H concurrently
C, F, I concurrently


This improves GPU utilization since there is more work available to feed the GPU.
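The same workload with the three DX12-style queues, again as a toy Python model of the description above: each cycle the scheduler can pull the head task of every non-empty queue, so the independent streams advance in parallel.
CODE
from collections import deque

queues = {
    "graphics": deque(["A", "B", "C"]),
    "compute0": deque(["D", "E", "F"]),
    "compute1": deque(["G", "H", "I"]),
}

cycle = 0
while any(queues.values()):
    cycle += 1
    # pull the head of every non-empty queue in the same cycle
    issued = [q.popleft() for q in queues.values() if q]
    print(f"cycle {cycle}: {' + '.join(issued)}")
# -> A + D + G / B + E + H / C + F + I  (3 cycles instead of 7)
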

On NVIDIA
NVIDIA, however, has a more intelligent way of handling the work queue.

Again, we'll use the example from above. Stream ABC, DEF, GHI.

In DX11,
Step 1: The CPU still submits the work sequentially to a single queue.
CODE
DX11 CPU to GPU queue: ABC | DEF | GHI


Step 2: Once the GPU has received the task list, it checks for dependencies and distributes the tasks across its internal hardware queues.
CODE
GMU to GPU Hardware queue 0: ABC
GMU to GPU Hardware queue 1: DEF
GMU to GPU Hardware queue 2: GHI

Step 3: The GPU scheduler then dispatches tasks to idle compute units (from top to bottom) for processing:
CODE
A, D, G concurrently
B, E, H concurrently
C, F, I concurrently

Again, this improves GPU utilization since there is more work available to feed the GPU.
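Here's a sketch of that GMU-style path under the same toy model (an illustration of the description above, not actual hardware behaviour): the serialized DX11 submission is regrouped into per-stream internal queues (the "check for dependency and distribute" step), after which dispatch looks identical to the multi-queue case.
CODE
from collections import deque, defaultdict

serialized = [("A", "gfx"), ("B", "gfx"), ("C", "gfx"),
              ("D", "cmp0"), ("E", "cmp0"), ("F", "cmp0"),
              ("G", "cmp1"), ("H", "cmp1"), ("I", "cmp1")]

# Step 2: dependency check / redistribution into internal hardware queues
internal = defaultdict(deque)
for task, stream in serialized:
    internal[stream].append(task)

# Step 3: dispatch one task per independent internal queue each cycle
cycle = 0
while any(internal.values()):
    cycle += 1
    issued = [q.popleft() for q in internal.values() if q]
    print(f"cycle {cycle}: {' + '.join(issued)}")
# -> A + D + G / B + E + H / C + F + I, same result as the DX12 multi-queue case
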

However, in DX12 with "async compute" enabled,

Step 1: The CPU sends each task to its respective queue (graphics tasks to the graphics queue, etc.):
CODE
Graphic hardware queue: ABC
Compute hardware queue 0: DEF
Compute hardware queue 1: GHI

Step 2: The NVIDIA driver essentially ignores the separate hardware queues submitted from the CPU and still uses its own scheduling method. As with DX11, once the NVIDIA GPU has received the task list, it checks for dependencies and distributes the tasks across its internal hardware queues.
CODE
GMU to GPU Hardware queue 0: ABC
GMU to GPU Hardware queue 1: DEF
GMU to GPU Hardware queue 2: GHI

Step 3: The GPU scheduler will then dispatch to idle compute units (from top to bottom)
CODE
A, D, G concurrently
B, E, H concurrently
C, F, I concurrently


Summary
For NVIDIA, separate hardware queues from the CPU to the GPU don't do much to improve performance, since it can already distribute work efficiently. If you see a slight performance drop when async compute is enabled, it is probably because of the redundant overhead created in Step 2 (checking for dependencies and redistributing to internal hardware queues). NVIDIA could probably change the scheduler so the GPU skips this step and works similarly to AMD's, but the performance gain would be minimal, so NVIDIA prefers to work on other parts of the architecture to improve performance.

For AMD, separate hardware queues from the CPU to the GPU do improve performance, i.e. more work can be fed into the compute units.

In diagrams:

[Scheduling diagrams attached as spoiler images.]


Finally, in DX12 AMD's GCN architecture is more efficient than Maxwell's async compute capability. Why? Because it can assign work to the compute units dynamically, and CU context switching is independent of draw calls.

Maxwell SMs can only context switch at draw-call boundaries. If either the graphics or the compute work stalls (i.e. the SM cannot finish its allocated work within a single draw call), it has to wait until the next draw call.

Pascal, however, doesn't need to wait for a draw call to context switch. If any workload stalls, the GPU can dynamically allocate more processors to work on the task, independently of draw-call timing.

So in summary, GCN, Maxwell and Pascal can all work on graphics and compute at the same time. What differs is whether the scheduler can dispatch work concurrently: NVIDIA GPUs can fully dispatch work concurrently even in DX11, while AMD GPUs are limited in DX11 but can fully dispatch work concurrently in DX12.
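A tiny numeric illustration of the draw-call point (toy numbers of my own, not measured data): if the graphics portion of a draw call finishes early, a design that can only repartition SMs at draw-call boundaries leaves them idle until the draw call ends, while dynamic load balancing hands them to the pending compute work immediately.
CODE
draw_call_len = 10   # time units until the draw-call boundary (assumed)
gfx_done_at   = 3    # the graphics work on these SMs finishes early (assumed)

# Maxwell-style: SM partitioning can only change at the draw-call boundary
maxwell_idle = draw_call_len - gfx_done_at
# Pascal-style: dynamic load balancing reassigns the idle SMs right away
pascal_idle = 0

print(f"SM idle time before picking up compute: "
      f"Maxwell-style {maxwell_idle} units, Pascal-style {pascal_idle} units")
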

This post has been edited by Demonic Wrath: Jun 5 2016, 12:39 PM
Demonic Wrath
post Jun 6 2016, 03:05 PM


QUOTE(adilz @ Jun 6 2016, 02:35 AM)
Thanks for the explanation, bro. My basis was just the DX12 async compute feature. Though Pascal has improved a lot for async compute, Maxwell (regardless of whether it has its own load balancing, pre-emption, etc. that may be similar to async compute) still technically does not support DX12 async compute. It may do some sort of its own "proprietary async compute", but it is still not the "DX12 async compute" that uses the DX12 API. So for owners of Maxwell cards like me (GTX 970), it's pretty much a disappointment. More so when new game developers are lauding the async compute feature, not just on PC but on the PlayStation 4 and Xbox One.
Those developers are lauding the async compute feature because it is required to extract the most from AMD's architecture (used in the PS4 and Xbox One). Without it, AMD GPU utilization will be low due to scheduling issues (not enough work to feed the compute units).

Nothing to be disappointed about... it's just like you don't need to take Panadol if you're feeling well. NVIDIA doesn't need "DX12 async compute queues" to achieve high utilization of its cores. Fewer idling cores = more efficient = more performance. Remember, the end goal is to utilize the cores fully.

Even when core utilization is efficient, some games will run faster on AMD cards and some on NVIDIA cards, because of different bottlenecks in different games. For example, if a game is heavily tessellated, it will always run faster on NVIDIA cards. GameWorks titles run faster on NVIDIA cards because they play to NVIDIA's strengths in geometry and pixel fillrate. Likewise, AMD has been encouraging developers to use more compute because AMD cards are stronger in compute/shader arithmetic (TFLOPS) than NVIDIA's.
Demonic Wrath
post Jun 6 2016, 04:30 PM


QUOTE(TheHitman47 @ Jun 6 2016, 03:15 PM)
GameWorks is not comparable in this case at all. That thing is not even open source to begin with.  sweat.gif
Open source or not, my point is that NVIDIA coded GameWorks in a way that benefits their architecture (obviously), for example with extreme tessellation factors. This causes a larger performance hit on AMD cards due to their weaker geometry capability.

If AMD coded GPUOpen effects to exploit its compute performance to a higher degree (lots of compute, minimal geometry, minimal pixel work), NVIDIA GPUs would definitely run slower. I'd expect a GTX 1080 to run at about the same performance as a Fury X.

It highly depends on how the game devs code the game and which component of the GPU they want to saturate.

AOTS is largely bound by compute performance. Why this conclusion? Because the FPS scaling correlates with TFLOPS. NVIDIA GPUs getting lower performance in this game is not due to async compute capability or anything of the sort; it is mainly due to the difference in compute performance.

The reason the GTX 1080 beats the Fury X in AOTS is that it has higher ROP performance and similar compute performance.

The reason the R9 390X beats the GTX 980 in AOTS is that the 390X has higher compute performance (5.9 vs 4.9 TFLOPS).

The reason the R9 390 beats the GTX 970 in AOTS is that the 390 has higher compute performance (5.1 vs 3.9 TFLOPS).

How to test this? (The arithmetic is sketched right after this list.)
1. If anyone with a GTX 1080 has some free time, downclock it to 1000 MHz and you should get R9 390 (2560 cores) performance in AOTS (in either DX11 or DX12). Why the 390? Because it has the same number of cores and ROPs.
2. If you overclock a GTX 980 to 1440 MHz (same TFLOPS as an R9 390X), you should get R9 390X performance in AOTS.
3. If you overclock a GTX 970 to 1532 MHz (same TFLOPS as an R9 390), you should get R9 390 performance in AOTS.
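The arithmetic behind these matchups, using the same peak-TFLOPS formula as before (cores x 2 x clock). The core counts are the cards' standard shader counts; the target TFLOPS are the figures quoted above.
CODE
def tflops(cores, mhz):
    # peak FP32 = cores x 2 ops per clock (FMA) x clock in GHz
    return cores * 2 * mhz / 1e6

matchups = [("GTX 1080 @ 1000 MHz", 2560, 1000, "R9 390  (~5.1 TFLOPS)"),
            ("GTX 980  @ 1440 MHz", 2048, 1440, "R9 390X (~5.9 TFLOPS)"),
            ("GTX 970  @ 1532 MHz", 1664, 1532, "R9 390  (~5.1 TFLOPS)")]
for card, cores, mhz, target in matchups:
    print(f"{card}: {tflops(cores, mhz):.2f} TFLOPS  vs  {target}")
# -> 5.12 / 5.90 / 5.10 TFLOPS, matching the AMD cards' quoted figures
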

If NVIDIA Maxwell or Pascal got less performance than AMD GCN at the same TFLOPS, it would mean they have a problem with concurrent graphics + compute, which they do not.

Note: To test AOTS async compute scheduling, don't test at 4K since other components (for example memory bandwidth) will bottleneck the performance.

This post has been edited by Demonic Wrath: Jun 6 2016, 04:41 PM
Demonic Wrath
post Jun 7 2016, 11:12 AM


Expected GTX 1070 price (assuming roughly USD 1 = RM 5; conversion sketched below):
Normal Edition MSRP USD 379 x 5 = RM 1,895. Expect AIB cards to be around RM 2,000.
Founders Edition MSRP USD 449 x 5 = RM 2,245.
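A trivial sketch of that conversion; the USD-to-MYR rate of 5 is the rough rate implied by the figures above, not an official rate.
CODE
usd_to_myr = 5.0   # rough exchange rate implied by the RM figures above
for edition, usd in [("GTX 1070 Normal Edition (MSRP)", 379),
                     ("GTX 1070 Founders Edition (MSRP)", 449)]:
    print(f"{edition}: USD {usd} x {usd_to_myr:.0f} = RM {usd * usd_to_myr:,.0f}")
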

Demonic Wrath
post Jun 7 2016, 04:56 PM


QUOTE(SSJBen @ Jun 7 2016, 04:45 PM)
People need to understand a few things regarding 1080p vs 4k.

4k monitors/TVs aren't actually 4k in resolution, it's actually 3440x2160 instead of the actual 3840x2160. Stop being duped by Hollywood.

3440x2160 is an EXACT 4 times increment in resolution over 1920x1080. All a monitor or TV needs to do is quadruple the FHD image into UHD without any further calculations. This is different from when 480p was upscaled to 1080p, or 720p going to 1080p. Neither 720p nor 480p was a linear increase in pixel count when being upscaled to 1080p, which is why 480p often looks like horseshit in FHD (even with the best post-processing scaler).
4K monitors are 3840 x 2160, not 3440 x 2160.

3440 x 2160 is not exactly 4x 1920 x 1080; 3840 x 2160 is exactly 4x 1920 x 1080.

I think you meant it is not 4096 x 2160 (DCI/cinema 4K) smile.gif

A more accurate term would be 2160p, since the {height}p nomenclature almost always refers to 16:9 screens.
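A quick check of the pixel-count arithmetic in this exchange (UHD, the 3440 x 2160 figure quoted above, and DCI 4K):
CODE
fhd   = 1920 * 1080   # 2,073,600 px
uhd   = 3840 * 2160   # 8,294,400 px
other = 3440 * 2160   # the figure quoted in the post above
dci4k = 4096 * 2160   # DCI / cinema 4K

print(uhd / fhd)      # 4.0 exactly (2x width, 2x height)
print(other / fhd)    # ~3.58, not an integer scale factor
print(dci4k / uhd)    # ~1.07, DCI 4K is slightly wider than UHD
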
Demonic Wrath
post Jun 7 2016, 05:48 PM


QUOTE(scchan107 @ Jun 7 2016, 05:42 PM)
Any recommended 1440p@120hz(or 144hz) monitor?

Currently on a budget, looking at the Dell U2515H  cry.gif
Acer Predator XB271HU

Get a G-Sync one. Hehe.
Demonic Wrath
post Jun 7 2016, 07:35 PM


For 27" monitor and at 30" viewing distance, going higher than 2560x1440 will not have any more benefit. The eyes will hardly see the pixels.

1440p is actually already an optimal resolution. A larger screen requires the user to sit further away, cancelling out any benefit from the increased pixel density.
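A back-of-the-envelope check of this claim, using the common rule of thumb that around 60 pixels per degree is about the limit of 20/20 acuity. The panel size, resolutions and 30" distance are the ones discussed here; the 60 PPD threshold is a general assumption of mine, not something from this thread.
CODE
import math

def pixels_per_degree(h_px, v_px, diag_in, distance_in):
    ppi = math.hypot(h_px, v_px) / diag_in          # panel pixel density
    # pixels covered by one degree of visual angle at this viewing distance
    return ppi * 2 * distance_in * math.tan(math.radians(0.5))

print(pixels_per_degree(2560, 1440, 27, 30))  # ~57 PPD, right around the ~60 PPD limit
print(pixels_per_degree(3840, 2160, 27, 30))  # ~85 PPD, beyond what most eyes can resolve
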
Demonic Wrath
post Jun 8 2016, 05:06 PM


My advice is not to go for 4K UHD panels for computer monitors. I doubt anyone can really see the difference in pixels between 1440p and 4K at monitor viewing distance (around 24"). Of course, it is noticeable at a 1-6" viewing distance, but who views a monitor at that range in normal usage?

The optimum for a computer monitor is 1440p at 27" and 120 Hz. A larger screen requires you to sit further away, negating any benefit of the increased pixel density.
Demonic Wrath
post Jun 24 2016, 08:50 PM


For me, I'll only upgrade when a new graphics card is at least 4x more powerful than my current one. So far so good biggrin.gif
