Today, at the annual Hot Chips conference, Mike Butler, AMD Fellow and Chief Architect of the Bulldozer core, gave the first detailed public exposition of Bulldozer. We didn't attend his presentation, but we did talk with Dina McKinney, AMD Corporate Vice President of Design Engineering, who led the Bulldozer team, in advance of the conference. We also have a first look at some of the slides from Butler's talk, which reveal quite a bit more detail about Bulldozer than we've seen anywhere else.
The first thing to know about the information being released today is that it's a technology announcement, and only a partial one at that. AMD hasn't yet divulged specifics about Bulldozer-based products, and McKinney declined to answer certain questions about the architecture, too. Instead, the company intends to release snippets of information about Bulldozer in a directed way over time in order to maintain the buzz about the new chip—an approach it likens to "rolling thunder," although I'd say it feels more like a leaky faucet.
The products: New CPUs in 2011
Regardless, we know the broad outlines of expected Bulldozer-based products already. Bulldozer will replace AMD's current server and high-end desktop processors, including the Opteron 4100 and 6100 series and the Phenom II X6, at some point in 2011. A full calendar year is an awfully big target, especially given how close it is, but AMD isn't hinting about exactly when next year the products might ship. We do know that the chips are being produced by GlobalFoundries on its latest 32-nm fabrication process, with silicon-on-insulator tech and high-k metal gate transistors. McKinney told us the first chips are already back from the fab and up and running inside of AMD, so Bulldozer is well along in its development. Barring any major unforeseen problems, we'd wager the first products based on it could ship well before the end of 2011—though launch windows like this one have a way of stretching to their final hours.
One advantage that Bulldozer-based products will have when they do ship is the presence of an established infrastructure ready and waiting for them. AMD says Bulldozer-based chips will be compatible with today's Opteron sockets C32 and G34, and we expect compatibility with Socket AM3 on the desktop, as well, although specifics about that are still murky.
AMD has committed to three initial Bulldozer variants. "Valencia" will be an eight-core server part, destined for the C32 socket with dual memory channels. "Interlagos" will be a 16-core server processor aimed at the G34 socket, so we'd expect it to have quad memory channels. In fact, Interlagos will likely consist of two Valencia chips on a single package, in an arrangement much like the present "Magny-Cours" Opterons. The desktop variant, "Zambezi", will have eight cores, as well. All three will quite likely be based on the same silicon.
The concept: two 'tightly coupled' cores
The specifics of that silicon are what will make Bulldozer distinctive. The key concept for understanding AMD's approach to this architecture is a novel method of sharing resources within a CPU. Butler's talk names a couple of well-known options for supporting multiple threads. Simultaneous multithreading (SMT) employs targeted duplication of some hardware and sharing of other hardware in order to track and execute two threads in a single core. That's the approach Intel uses in its current, Nehalem-derived processors. CMP, or chip-level multiprocessing, is just cramming multiple cores on a single chip, as AMD's current Opterons and Phenoms do. The diagram above depicts how Bulldozer might look had AMD chosen a CMP-style approach.
AMD didn't take that approach, though. Instead, the team chose to integrate two cores together into a fundamental building block it calls a "Bulldozer module." This module, diagrammed above, shares portions of a traditional core—including the instruction fetch, decode, and floating-point units and L2 cache—between two otherwise-complete processor cores. The resources AMD chose to share are not always fully utilized in a single core, so not duplicating them could be a win on multiple fronts. The firm claims a Bulldozer module can achieve 80% of the performance of two complete cores of the same capability. Yet McKinney told us AMD has estimated that including the second integer core adds only 12% to the chip area occupied by a Bulldozer module. If these claims are anywhere close to the truth, Bulldozer should be substantially more efficient in terms of performance per chip area—which translates into efficiency per transistor and per watt, as well.
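A quick back-of-the-envelope check of those two claims, taken together, shows where the efficiency argument comes from. This is a sketch using only the figures stated above (80% of two cores' throughput, 12% extra area); the exact baselines AMD used are not public, so the units here are relative to one full core.

```python
# Relative units: one complete core = 1.0 throughput, 1.0 area.
two_full_cores_perf = 2.0
two_full_cores_area = 2.0

module_perf = 0.8 * two_full_cores_perf   # claimed: 80% of two cores = 1.6
module_area = 1.12                        # claimed: one core's area + 12%

perf_per_area_cmp = two_full_cores_perf / two_full_cores_area     # 1.00
perf_per_area_module = module_perf / module_area                  # ~1.43

advantage = perf_per_area_module / perf_per_area_cmp - 1
print(f"module advantage in throughput per area: {advantage:.0%}")
# roughly 43% more throughput per unit of die area, if the claims hold
```

That ~43% figure is what "substantially more efficient in terms of performance per chip area" cashes out to under these numbers.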
One obvious outcome of the Bulldozer module arrangement, with its shared FPU, is an inherent bias toward increasing integer math performance. We've heard several explanations for this choice. McKinney told us the main motivating factor was the presence of more integer math in important workloads, which makes sense. Another explanation we've heard is that, with AMD's emphasis on CPU-GPU fusion, floating-point-intensive problems may be delegated to GPUs or arrays of GPU-like parallel processing engines in the future.
In our talk, McKinney emphasized that a Bulldozer module would provide more predictable performance than an SMT-enabled core—a generally positive trait. That raised an intriguing question about how the OS might schedule threads on a Bulldozer-based processor. For an eight-threaded, quad-core CPU like Nehalem, operating systems generally tend to favor scheduling a single thread on each physical core before adding a second thread on any core. That way, resource sharing within the cores doesn't come into play until necessary, and performance should be optimal. We suggested such an arrangement might also be best for a Bulldozer-based CPU, but McKinney downplayed the need for any special provisions of that nature on this hardware. She also hinted that scheduling two threads on the same module and leaving the other three modules idle, so they could drop into a low-power state, might be the best path to power-efficient performance. We don't yet know what guidance AMD will give operating system developers regarding Bulldozer, but the trade-offs at least shouldn't be too painful.
Source: http://www.techreport.com/articles.x/19514
Bulldozer 20 Questions
QUOTE
“Will Bulldozer implement new versions of Hypertransport?” – Rheo
No, Bulldozer takes advantage of the same version of HyperTransport™ (HT) technology as our existing AMD Opteron™ 4000 and 6000 series processors, HyperTransport 3.1.
“Is there any “programmable-tangible” improvement in synchronization between cores in the same module? In other words, will I get tangible performance improvement if I can partition my multi-threaded algorithm to pairs of closely interacting threads, and schedule each pair to a module?” – Edward Yang
That is a very interesting question.
For the majority of software, the OS will work in concert with the processor to manage the thread to core relationships. We are collaborating with Microsoft and the open source software community to ensure that future versions of Windows and Linux operating systems will understand how to enumerate and effectively schedule the Bulldozer core pairs. The OS will understand whether your machine is set up for maximum performance or for maximum performance/watt, which takes advantage of Core Performance Boost.
However, let’s say you want to explore if you can get a performance advantage if your threads were scheduled on different modules. The benefit you can gain really depends on how much sharing the two threads are going to do.
Since the two integer cores are completely separate and have their own execution clusters (pipelines), you get no sharing of data in the L1 – and no specific optimizations are needed at the software level. However, at the L2 cache level there could be some benefits. A shared L2 cache means that both cores have access to read the same cache lines – but obviously only one can write any cache line at any time. This means that if you have a workload with a main focus of querying data and your two threads are sharing a data set that fits in our L2, then having them execute in the same module could have some advantages. The main advantage we expect to see is an increase in the power efficiency of the cores that are idle. The more idle other cores are, the better chance the busy cores will have to boost.
However, there is another consideration to this which is how available other cores are. You need to weigh the benefits of data sharing with the benefit of starting the thread on the next available core. Stacking up threads to execute in proximity means that a thread might be waiting in line while an open core is available for immediate execution. If your multi-threaded application isn’t optimized to target the L2 (or possibly the L3 cache), or you have distinctly separate applications to run, and you don’t need to conserve power, then you’ll likely get better performance by having them scheduled on separate modules. So it is important to weigh both options to determine the best execution.
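The same-module-versus-separate-modules experiment described above can be tried by hand on Linux with CPU affinity. This is a sketch, not AMD guidance: the `module_cpus` helper is hypothetical, and it assumes the OS numbers a module's two integer cores as adjacent logical CPU ids (0–1, 2–3, ...), which is only one plausible enumeration — check `/proc/cpuinfo` on real hardware.

```python
import os

def module_cpus(module_index):
    """Logical CPUs of one module, ASSUMING the OS numbers a module's
    two integer cores as adjacent ids (0-1, 2-3, ...). Hypothetical
    helper; real enumeration depends on the kernel and BIOS."""
    return {2 * module_index, 2 * module_index + 1}

def pin_to_module(pid, module_index):
    """Restrict a process to both cores of one module (Linux only),
    e.g. to keep two cooperating threads sharing that module's L2."""
    os.sched_setaffinity(pid, module_cpus(module_index))

# Example: confine the calling process (pid 0) to module 0's core pair,
# then benchmark against pinning each thread to a different module.
# pin_to_module(0, 0)
```

Pinning both threads to one module tests the shared-L2 benefit; pinning them to separate modules tests the next-available-core benefit the answer goes on to describe.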
“How much extra performance will we see when running two-threaded applications on one Bulldozer Module compared to two cores in different modules?” – Simon
Without getting too specific around actual scaling across cores on the processor, let me share with you what was in the Hot Chips presentation. Compared to CMP (chip multiprocessing – which is, in simplistic terms, building a multicore chip with each core having its own dedicated resources), two integer cores in a Bulldozer module would deliver roughly 80% of the throughput. But, because they have shared resources, they deliver that throughput at lower power and lower cost. Using CMP has some drawbacks, including more heat and more die space. The heat can limit performance in addition to consuming more power. Ask yourself: would you rather have a 4-cylinder engine that delivered 300HP, or a 6-cylinder engine that delivered 360HP but consumed more gas? The horsepower-per-cylinder ratio of the 4-cylinder is obviously higher (75HP/cylinder vs. the V6's 60HP/cylinder), meaning each cylinder gives you more performance. The V6 does deliver more total output, but it delivers that extra output at a higher cost (higher gas consumption).
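The engine analogy maps onto the core comparison arithmetically. A short check of the numbers quoted above (the mapping of cylinders to cores is the analogy's, not a measured figure):

```python
# Per-cylinder output: the smaller engine is more efficient per unit.
hp4, cyl4 = 300, 4   # 4-cylinder: 75 HP per cylinder
hp6, cyl6 = 360, 6   # V6: 60 HP per cylinder
assert hp4 / cyl4 == 75.0
assert hp6 / cyl6 == 60.0

# Same shape as the core claim: a module delivers ~80% of two full
# cores, so full CMP buys only 1/0.8 - 1 = 25% more throughput while
# paying roughly the full second core's area and power.
module_vs_cmp = 0.8
extra = 1 / module_vs_cmp - 1
print(f"CMP over a module: {extra:.0%} more throughput")
```

In other words, the V6/CMP gets you 20–25% more total output for a disproportionately larger resource bill.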
“Current and forthcoming Nehalem EX based servers from IBM and HP top out at 8 sockets and 64 cores. What kind of vertical scalability can we expect from Bulldozer-based servers?” – David Roff
Bulldozer will fit into the current “Maranello” and “San Marino/Adelaide” platforms. “Maranello” is our high performance platform that will support up to 4 CPUs. Combining a “Maranello” platform with the upcoming 16-core “Interlagos” processors, the total core density of a 4P system will reach as many as 64 cores.
The 8P x86 market today is pretty small. According to IDC, last year it accounted for roughly 7,915 total servers, down 26% from the year before (Source: IDC Quarterly Server Tracker, Q4 2009). If you want to say that 2009 was a bad year, from 2007 to 2008 the 8P x86 market was essentially flat as well, so that isn’t a growth engine. Part of what is impacting that market is the core and memory densities of today’s systems. People bought 8P servers to get to 48 cores (8 x 6-core) or to get to large memory footprints. Today’s 4P systems are meeting those needs at a lower price, with lower power consumption and lower latency. When we get to 2011 with “Bulldozer,” you’ll see an increase up to 64 cores, and we expect the total memory footprint will increase again.
The bottom line is, you’ll get the 64 cores that you want, you’ll just have to spend a lot less to get them; is that OK?
“As far as power usage goes, from what I understand BD is supposed to be taking power management features to a level of granularity that hasn’t been seen yet with consumer/business grade CPUs. Will those new features be available to current MC users or will a platform upgrade be necessary? Can you elaborate on any new power saving features that would make a business want to consider BD at this time?” – Jeremy Stewart
Current “Maranello” platforms with AMD Opteron™ 6100 Series processors already have the hooks embedded in them for any “Bulldozer”-level power efficiency features. When we specified the platforms for today’s processors, we did so with “Bulldozer” in mind.
As we have said already in this blog, we expect the shared architecture to provide us with a great deal of power savings – there are a lot of circuits that are essentially being duplicated in today’s multicore processors. Having a new “from the ground up” design allowed us to take a very close look at the circuits and determine which ones are ripe for consolidation and which ones really need their own dedicated resources.
We started with an inherently power-efficient microarchitecture and implementation that included dynamic sharing of resources, minimized data movement and took advantage of extensive clock and power gating. From there, we added active management support that allows us to digitally measure activity in order to estimate power. Support for chip-level core power gating was also added to the processor.
These new features join existing AMD Opteron processor technologies such as AMD PowerNow!™, AMD CoolCore™, low-voltage DDR3 memory support and more, all working in concert to help create a power-efficient system.
Even though you’ll see processors with 33% more cores and larger caches than the previous generation, we’ll still be fitting them into the same power and thermal ranges that you see with our existing 12-core processors.
http://blogs.amd.com/work/2010/08/10/20-qu...ulldozer-style/
http://blogs.amd.com/work/2010/08/23/%E2%8...ions-round-one/
http://blogs.amd.com/work/2010/08/30/bulld...2%80%93-part-2/
This post has been edited by jinaun: Sep 5 2010, 12:41 AM
Aug 25 2010, 05:07 PM