

[MASS] - UCLK: technical limitations or useful marketing tool?
12th August 2009, 10:32 - #1 - Massman ([M] Reviewer)

With the performance scaling article in the back of my mind, I already expressed some thoughts regarding the purpose of the Uncore (really the IMC) within Intel's marketing scheme. The following thread is a worklog to find out more about the interaction between CPU, memory and uncore frequency.

Let me start by quoting myself from an earlier forum post in which I expressed my first thoughts on the purpose of the Uncore.

Excerpt from forum post:

Quote:
"This (133x4=666MHz) is the memory running at highest possible frequency, at a stock BLCK frequency. But, that's not the most interesting part about this screenshot. Please have a look at the frequency of the uncore: 1995MHz. On a triple channel i7 platform, the minimum frequency of the uncore would've been 2660MHz, as in 2x the memory frequency (1330MHz DDR); on this dual channel platform, we see that the minimum uncore frequency is set at 1,5x the memory frequency. Why? I don't know. At least, not yet ...

After exchanging ideas with several more knowledgeable people, it seems to me that the requirement of a minimum uncore frequency of twice the memory frequency serves marketing purposes more than it reflects a real technical limitation. To explain this, let's first have a look at the AMD platform, because it's also a dual-channel DDR3 platform. AMD representatives acknowledge that to really put the memory frequency to good use, you need an NB frequency of at least 3x (DDR -> 1.5x) the memory frequency. So, with an IMC frequency of 2GHz, you would only need a memory frequency of 667MHz (DDR3-1333). Anything higher would still scale, but less and less, at which point you can ask yourself whether you want to spend the extra money on high-rated memory kits. This 'theory' has been developed, tested and confirmed by Tony of OCZ; click for more information.

For Intel-based platforms, the story isn't much different: the memory controller is now integrated into the processor and its clock frequency can become a bottleneck with high-frequency memory. The novelty of the i7 was the newly introduced third memory channel, which should increase memory bandwidth significantly. Many tests, however, confirm that the extra channel doesn't have much of an effect in most benchmarks, let alone in daily computing activities. And that is a big problem when trying to sell the product: who wants to pay more for something that doesn't work in the first place? The technique is quite simple: make it look like it works. And that's where the limitation of "uncore >= 2 x memory" kicks in: with an even lower uncore frequency, the added memory channel would have had even less effect than it has now. Less than almost insignificant, and that's bad PR. Technically, it seems possible for the uncore to run at a lower ratio than 2:1, but oddly enough none of the motherboard manufacturers seem to have added this option to their BIOS, although it would help people reach 2000CL7 on air cooling, since the memory overclock is very often limited by the uncore frequency."

(~ http://www.madshrimps.be/vbulletin/f...5-gd80-65278/)
The question is in fact quite simple: am I being too critical, or thinking too much along the lines of conspiracy theories, in believing that the Uncore frequency requirement is a marketing tool rather than a real technical limitation? The LGA1156 platform shows me that it's perfectly possible to have an Uncore multiplier running lower than 2x the memory frequency, and I'm quite reluctant to believe that's simply because of the missing third memory channel.

Also, it seems that the i5 7xx series only has memory ratios up to 2:10 (5x), whereas the i7 8xx series has ratios up to 2:12 (6x) ... to feed the 8 threads which are present on the 8xx but not on the 7xx? In any case: for more multipliers, you need to pay more. Coincidence or marketing strategy?

In any case, if it's indeed just a marketing tool and there's no technical limitation regarding the Uncore/memory frequency, why has no motherboard manufacturer tried to figure out how to 'crack' the limitation? I mean: most performance enthusiasts are ignorant when it comes to finding the right balance between frequency and timings; 90% just apply the "more equals better" rule and buy $350 2000CL7 memory kits, only to find out their CPU isn't capable of running a 4GHz uncore on air cooling. A motherboard that allows users to downclock the uncore would be a smart move marketing-wise.
12th August 2009, 10:37 - #2 - Massman ([M] Reviewer)

I came into contact with MS, owner of the famous 'lostcircuits' website, which deals with technology in much more depth than we do. Although the following remarks are posted under my nickname here on this forum, all the credit for these insights goes to MS!

Link to discussion: http://www.lostcircuits.com/forum/vi...php?f=3&t=2609

I will try to simplify the concepts for those who need more explanation; I'm drawing some graphs as we speak.

--------------------------------------------------------------------

Quote:
Excerpt from forum post:

"This (133x4=666MHz) is the memory running at highest possible frequency, at a stock BLCK frequency. But, that's not the most interesting part about this screenshot. Please have a look at the frequency of the uncore: 1995MHz. On a triple channel i7 platform, the minimum frequency of the uncore would've been 2660MHz, as in 2x the memory frequency (1330MHz DDR); on this dual channel platform, we see that the minimum uncore frequency is set at 1,5x the memory frequency. Why? I don't know. At least, not yet ...
That one is easy - at least with a few assumptions. In triple-channel mode, the uncore clock has to be twice that of the memory because in a triple-channel configuration the combined bus width meets the interface between the L3/uncore and the CPU, which we assume to be at least 96 bit wide (1/2 of the 192-bit-wide memory data path). In other words, every triple-channel memory transaction has to be split into two transactions between the uncore and the core, or else the L3 if it is used for prefetch. The only caveat is: we don't know the width of the uncore-core interface on the Core i7 (central queue) and I have not been able to find any conclusive data on this feature. However, if the uncore (or non-core in Intel's parlance) is interconnected with the core through the assumed 96-bit interface, then a 128-bit memory transaction will take at least 1.33 cycles to transfer. If you make the uncore clock 1.33x that of the memory, then you use the entire possible bandwidth of all the registers involved, but you will end up with alignment issues, in that the first transaction ends 1/3 into the register, the second one will have to start overlapping at 1/3 and end at 2/3, and so on, making things a bit complicated. Easier is to throw away a bit of frequency and use a 1.5x clock, where you always have full transactions with a boundary at 50% of the register width. You throw away a few bits, but management is much easier that way.

Of all the different features of the architecture, that particular interface is one of the least likely items to be changed, whereas the memory controllers are just modular blocks that can be thrown in or deleted ad lib.

Quote:
After exchanging ideas with several more knowledgeable people, it seems to me that the requirement of a minimum uncore frequency of twice the memory frequency serves marketing purposes more than it reflects a real technical limitation. To explain this, let's first have a look at the AMD platform, because it's also a dual-channel DDR3 platform. AMD representatives acknowledge that to really put the memory frequency to good use, you need an NB frequency of at least 3x (DDR -> 1.5x) the memory frequency. So, with an IMC frequency of 2GHz, you would only need a memory frequency of 667MHz (DDR3-1333). Anything higher would still scale, but less and less, at which point you can ask yourself whether you want to spend the extra money on high-rated memory kits. This 'theory' has been developed, tested and confirmed by Tony of OCZ; click for more information.
Different architectures and interconnect widths will result in different requirements for the interaction of the different components. The AMD interconnect, for all I know, is totally different from the Intel architecture; there you have independent memory controllers, whereas in Intel's approach all three controllers are always doing the same thing (even if the physical addresses may vary). BTW, we found the same thing with respect to memory frequency vs. NB frequency and the resulting performance scaling: 2.4 GHz is the minimum to get DDR3-1600 really going.

Quote:
For Intel-based platforms, the story isn't much different: the memory controller is now integrated into the processor and its clock frequency can become a bottleneck with high-frequency memory. The novelty of the i7 was the newly introduced third memory channel, which should increase memory bandwidth significantly. Many tests, however, confirm that the extra channel doesn't have much of an effect in most benchmarks, let alone in daily computing activities. And that is a big problem when trying to sell the product: who wants to pay more for something that doesn't work in the first place? The technique is quite simple: make it look like it works. And that's where the limitation of "uncore >= 2 x memory" kicks in: with an even lower uncore frequency, the added memory channel would have had even less effect than it has now. Less than almost insignificant, and that's bad PR. Technically, it seems possible for the uncore to run at a lower ratio than 2:1, but oddly enough none of the motherboard manufacturers seem to have added this option to their BIOS, although it would help people reach 2000CL7 on air cooling, since the memory overclock is very often limited by the uncore frequency."
(~ http://www.madshrimps.be/vbulletin/f...5-gd80-65278/)
You are mixing two different things here. One is the synthetic throughput, which really depends on the uncore running at 2x the frequency of the memory; the other is the fact that the actual core is finally starting to saturate with the amount of incoming data, and the processing units cannot digest the data as fast as it is delivered. That is one main difference to the older Intel architectures, including the Core 2, where memory bottlenecks were the biggest problem (even though it wasn't really the memory but the AGTL bus, and the fact that every request had to be snooped on a bi-directional bus, leading to only some 50-70% of possible memory utilization).

Quote:
The question is in fact quite simple: am I being too critical, or thinking too much along the lines of conspiracy theories, in believing that the Uncore frequency requirement is a marketing tool rather than a real technical limitation? The LGA1156 platform shows me that it's perfectly possible to have an Uncore multiplier running lower than 2x the memory frequency, and I'm quite reluctant to believe that's simply because of the missing third memory channel.
Actually, if you do the math, that's what it comes down to. Intel has vast experience with this type of data buffering from the days of the AGP bus (internal or external); I have done enough reverse engineering on this feature (which was originally developed by HP, IIRC). A 2:1 ratio is the easiest because you don't have to do any split transactions; instead, you transfer the entire width of the register every time. A 1.5x ratio is still easy because you split the register down the middle, so you don't run into the alignment issues you would have with a 1.33x multiplier, where you would be stuck with 3 segments and have to track where the boundary ends up after each transaction and subsequent re-fill. This is actually the tidbit that makes me believe that the interconnect is less than 128 bit wide, otherwise the dual-channel memory could be transferred in a single transaction. But again, most of what I wrote is based on that one assumption of a limited uncore-core bus width.

Quote:
Also, it seems that the i5 7xx series only has memory ratios up to 2:10 (5x), whereas the i7 8xx series has ratios up to 2:12 (6x) ... to feed the 8 threads which are present on the 8xx but not on the 7xx? In any case: for more multipliers, you need to pay more. Coincidence or marketing strategy?
Those ratios will change with the actual release of the parts into the market.

Quote:
In any case, if it's indeed just a marketing tool and there's no technical limitation regarding the Uncore/memory frequency, why has no motherboard manufacturer tried to figure out how to 'crack' the limitation? I mean: most performance enthusiasts are ignorant when it comes to finding the right balance between frequency and timings; 90% just apply the "more equals better" rule and buy $350 2000CL7 memory kits, only to find out their CPU isn't capable of running a 4GHz uncore on air cooling. A motherboard that allows users to downclock the uncore would be a smart move marketing-wise.
Because if they did, the processor would run into a buffer overflow. Imagine you have one register of 1/2 the bus width and then force it to cycle at less than 2x the speed. It's as if you have one big mug of coffee that you are trying to drink, but the mug is too heavy, so you use a small cup as an intermediate carrier. The big mug has twice the volume of the small cup. How many times do you have to empty the small cup if the big mug fills up once every minute?
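To put rough numbers on that mug-and-cup metaphor, here's a small Python sketch. The 192-bit memory path and 96-bit uncore register are the assumed widths from this discussion, and the intermediate buffer size is purely hypothetical; it just counts how quickly the buffer spills over when the uncore drains slower than the memory fills.

Code:
def cycles_until_overflow(mem_bits=192, uncore_bits=96, ratio=2.0, buffer_bits=384):
    """Memory clocks until the intermediate buffer overflows (None = never)."""
    backlog = 0
    for mem_clock in range(1, 1001):
        backlog += mem_bits              # one full triple-channel burst arrives
        backlog -= uncore_bits * ratio   # the uncore drains 'ratio' register transfers per memory clock
        backlog = max(backlog, 0)
        if backlog > buffer_bits:
            return mem_clock
    return None

print(cycles_until_overflow(ratio=2.0))   # None: at 2x the uncore keeps up
print(cycles_until_overflow(ratio=1.5))   # 9: at 1.5x the backlog grows by 48 bit per clock and spills over

Exactly the coffee analogy: either the small cup empties twice per refill of the mug, or the coffee ends up on the table.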

I'll check on the interconnect width again, I might be wrong but I think I remember having this discussion with some Intel folks.
12th August 2009, 14:33 - #3 - Massman ([M] Reviewer)

Quote:
That one is easy - at least with a few assumptions. In triple-channel mode, the uncore clock has to be twice that of the memory because in a triple-channel configuration the combined bus width meets the interface between the L3/uncore and the CPU, which we assume to be at least 96 bit wide (1/2 of the 192-bit-wide memory data path). In other words, every triple-channel memory transaction has to be split into two transactions between the uncore and the core, or else the L3 if it is used for prefetch. The only caveat is: we don't know the width of the uncore-core interface on the Core i7 (central queue) and I have not been able to find any conclusive data on this feature. However, if the uncore (or non-core in Intel's parlance) is interconnected with the core through the assumed 96-bit interface, then a 128-bit memory transaction will take at least 1.33 cycles to transfer. If you make the uncore clock 1.33x that of the memory, then you use the entire possible bandwidth of all the registers involved, but you will end up with alignment issues, in that the first transaction ends 1/3 into the register, the second one will have to start overlapping at 1/3 and end at 2/3, and so on, making things a bit complicated. Easier is to throw away a bit of frequency and use a 1.5x clock, where you always have full transactions with a boundary at 50% of the register width. You throw away a few bits, but management is much easier that way.

Of all the different features of the architecture, that particular interface is one of the least likely items to be changed, whereas the memory controllers are just modular blocks that can be thrown in or deleted ad lib.
The first step in understanding the problem here is to understand how the core, uncore and memory are connected to each other. I made a small sketch below:



Now, MS repeats a few times that the 96-bit interface between core and uncore is an assumption he makes. Although I'm trailing him by miles when it comes to technical knowledge, I can imagine why he makes this assumption, as it makes perfect sense. Look at the following graph:



If the uncore were running 1:1 with the memory frequency, the maximum theoretical data throughput would be 96 bit per cycle, which is a problem as the memory data path delivers 192 bit per cycle. So, in order to be able to address the complete memory data path in one clock cycle, we have to increase the Uncore frequency. In this case, it's quite simple:



Doubling the Uncore frequency makes it possible to handle twice the data in one cycle; so instead of 96 bits we now have 96 x 2 = 192 bit per clock cycle, which matches the memory data path. It's a perfect match.
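As a quick sanity check on that arithmetic, a minimal Python sketch, assuming the 96-bit uncore-core link from MS's post:

Code:
LINK_BITS = 96            # assumed uncore-to-core register width
MEM_PATH = 3 * 64         # 192-bit triple-channel memory data path

for ratio in (1.0, 2.0):
    per_mem_clock = LINK_BITS * ratio
    verdict = "matches" if per_mem_clock >= MEM_PATH else "falls short of"
    print(f"uncore at {ratio:.0f}x memory: {per_mem_clock:.0f} bit per memory clock, {verdict} the {MEM_PATH}-bit path")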

By the way, it's not possible to have the uncore and memory frequency run at 1:1 because ... well, MS uses a beautiful metaphor to explain:

Quote:
Originally Posted by MS
It's as if you have one big mug of coffee that you are trying to drink, but the mug is too heavy, so you use a small cup as an intermediate carrier. The big mug has twice the volume of the small cup. How many times do you have to empty the small cup if the big mug fills up once every minute?
I believe it's technically possible with a couple of workarounds, but huge parts of the data would have to be held for an extra cycle to keep them available for the second 96 bits to be transferred. Since the Uncore can work at 2x the memory frequency, it's just way easier to apply this rule. I guess everyone understands why.

For dual-channel configurations, which are 128 bit wide, it's a more complicated problem, since there's no easy fix as with triple channel (just x2). Basically, in a perfect world, increasing the uncore frequency by 1.33x would do the trick: instead of 96 bit per clock cycle, you would then be able to address 96 x 1.33 = 128 bit in one clock cycle, which is the full dual-channel bandwidth. The problem, however, is that this would make the register management quite difficult, as you can see in the graph below:



Basically, the uncore register would have to be aligned at the 1/3 and 2/3 marks. Or, put differently, the system has to keep track of where the first register output ends (1/3 into the second 96-bit series) and where the second output ends (2/3 into the third 96-bit series). It's not technically impossible, but far from an elegant (= efficient) solution. Much easier is to increase the frequency by 1.5x: the only alignment is at the 1/2 mark, which is just splitting the register into two pieces.
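A tiny sketch of that boundary bookkeeping, again assuming the 96-bit register and a 128-bit dual-channel burst: at 1.33x each burst ends at a wandering offset inside the register, while at 1.5x you always cut at the halfway mark and simply throw 16 bits of capacity away.

Code:
LINK_BITS = 96    # assumed uncore-core register width
BURST_BITS = 128  # dual-channel memory burst

# 1.33x case: where does the n-th burst end inside the register?
ends = [(n * BURST_BITS) % LINK_BITS for n in range(1, 5)]
print([f"{e}/{LINK_BITS}" for e in ends])   # ['32/96', '64/96', '0/96', '32/96'] -> boundary wanders in 1/3 steps

# 1.5x case: 1.5 register transfers per memory clock = 144 bit of capacity
capacity = int(LINK_BITS * 1.5)
print(capacity - BURST_BITS)                # 16 bit of capacity wasted, but the cut is always at the 1/2 mark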

Any questions?
Attached diagrams: 1x uncore, 2x uncore, 1.5x uncore, bit widths.
12th August 2009, 15:46 - #4 - Massman ([M] Reviewer)

I put together a couple of charts on the performance scaling of CPU/MEM/UNC using Lavalys Everest and Pifast (don't shoot me, I just wanted to keep the testing relatively short). Since Core i7 pretty much stands for 'multiplier overclocking', I used simple values to test scaling:



Here are the results of the 'doubling in frequency' runs:



6/12 stands for: 6x memory, 12x uncore ratio. Same rule for the two others.



12/24 stands for 12x cpu, 24x uncore ratio.



12/6 stands for: 12x cpu, 6x memory ratio.

I haven't really had the time to fully analyse the charts, though. The charts above show the direct performance scaling results; afterwards, I also charted the indirect results (basically comparing the effect of doubling a second frequency, e.g. doubling the UNC frequency when the CPU frequency has already been doubled).



CPU -> UNC = when the CPU frequency has been doubled, what is the effect of doubling the uncore frequency as well?

This chart is basically a representation of the previous three.

To end with, I also compared the platform scaling (= increasing the CPU/MEM/UNC frequencies at once). The first row is the percentage increase going from (cpu/mem/unc) 12/6/12 to 24/6/24, and the second row is going from 133 to 167MHz BCLK (4G/1G/4G). No chart, but it's quite easy to grasp the significance of the raw data.
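For anyone re-running these numbers, a minimal Python helper to translate the ratio notation above into clock speeds. The multipliers and BCLK values are the ones quoted in this post; the memory figure is treated as the DDR data rate, as throughout the thread.

Code:
def freqs(bclk, cpu_mult, mem_mult, unc_mult):
    """Translate BCLK x multiplier into MHz; the memory figure is the DDR data rate."""
    return {"cpu": bclk * cpu_mult, "mem": bclk * mem_mult, "unc": bclk * unc_mult}

print(freqs(133, 12, 6, 12))   # roughly 1600 / 800 / 1600 MHz, the baseline run
print(freqs(133, 24, 6, 24))   # CPU and uncore doubled, memory kept constant
print(freqs(167, 24, 6, 24))   # the 167 MHz BCLK run, roughly the 4G/1G/4G setup mentioned above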



Now, the interesting part would be to see how this changes in dual and single channel configurations. Something for later this week.
13th November 2009, 18:28 - #5 - Massman ([M] Reviewer)

Okay, I'm starting to understand where I went off track. I was talking about the uncore-to-DRAM bus, while in fact I should have been talking about the CPU-to-IMC bus.

Of course, each of the IMC-to-DRAM buses is 64 bit wide, totalling 192 bit in triple channel and 128 bit in dual channel. The CPU-to-IMC bus width is 96 bit; in a triple-channel configuration that means 32 bit per IMC (x3), in a dual-channel configuration 48 bit per IMC (x2). To acquire all the data coming from the IMCs (192 bit in total), the CPU-to-IMC clock frequency has to be equal to or higher than 2x the DRAM frequency, as 96 x 2 = 192. So: 1 DRAM clock, 2 IMC clocks to transfer the data from DRAM to CPU.

In dual-channel configurations, the situation differs a bit. The IMC-to-DRAM buses are still 64 bit wide, but the total is only 128 bit. As mentioned already, the CPU-to-IMC bus width remains 96 bit, split up into 2 times 48 bit coming from the two memory controllers. As you can see, each uncore clock now moves 3/4 of a DRAM clock's worth of data (96 of 128 bit), which allows the CPU-to-IMC clock to be decreased. The most elegant solution is to make the CPU-to-IMC clock 1.5x the DRAM frequency: for each DRAM clock, 1.5 transfers can be performed, or 48 x 1.5 = 72 bit per channel, comfortably covering the 64 bit each channel delivers. Put more simply: for every 2 DRAM clocks, 3 uncore clocks complete the memory transfer.
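That relationship is easy to check numerically; a minimal sketch using the widths discussed in this thread (the 96-bit CPU-to-IMC link remains the assumption everything rests on):

Code:
def min_uncore_ratio(channels, link_bits=96, channel_bits=64):
    """Minimum uncore:DRAM clock ratio so the link can carry everything the channels deliver."""
    per_channel_link = link_bits / channels   # share of the CPU-to-IMC link per memory controller
    return channel_bits / per_channel_link    # uncore clocks needed per DRAM clock

print(min_uncore_ratio(3))   # 2.0   -> triple channel: uncore >= 2x DRAM
print(min_uncore_ratio(2))   # 1.33… -> dual channel: >= 1.33x, rounded up to the more manageable 1.5x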

Intel has always claimed to have made memory access as efficient as possible, which can be seen in the LGA1366 i7 design: no clock cycles or bus width go to waste. With the memory and IMC multipliers unlocked it's a bit more difficult to see this, but with the locked IMC multipliers on LGA1156 it can be seen perfectly.

LGA1156 Core i7 series: maximum memory multiplier = 2:12. As the IMC multiplier needs to be equal to or higher than 1.5x the memory multiplier, the lowest possible IMC multiplier is 18x (12/2 x 3). When using the 2:12 memory multiplier, you are as efficient as you can be.

LGA1156 Core i5 series: maximum memory multiplier = 2:10. As the IMC multiplier needs to be equal to or higher than 1.5x the memory multiplier, the lowest possible IMC multiplier is 15x (10/2 x 3). At 16x, you have more CPU-to-IMC bus width than is needed for the transfer.
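A one-liner to reproduce those two multiplier checks; the 18x and 16x fixed uncore multipliers are the figures this post works from, not something verified independently.

Code:
import math

def min_imc_mult(max_mem_mult):
    """Lowest fixed IMC multiplier that still covers 1.5x the highest memory multiplier."""
    return math.ceil(max_mem_mult * 1.5)

print(min_imc_mult(12))   # 18 -> matches the 18x this post derives for the i7 8xx: no headroom wasted
print(min_imc_mult(10))   # 15 -> the i5 7xx ships with 16x, one step more than strictly needed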

So, to answer my own questions:

Quote:
Originally Posted by Massman
1) Wouldn't it make more sense to go for a fixed 15x uncore multiplier? In combination with the 2:10 multiplier, there's a 3:2 ratio again.
2) Does this make the Core i5 inherently less efficient? Less data transfer clock-per-clock?
1) Yes, a fixed 15x IMC multiplier would make more sense, as it would be more efficient. Why Intel chose 16x ... no idea; my best guess is that it was more difficult to implement a 15x option in the register.
2) Yes, the data transferred per clock (and thus per unit of energy) is lower on the Core i5.

Time to read up on Gulftown reports and see how all this changes again with Gulftown.
13th November 2009, 20:36 - #6 - Massman ([M] Reviewer)

Gulftown uncore/memory ratios:

1.6x



1.8x



1.5x

13th November 2009, 20:44 - #7 - jmke (Madshrimp)

This should be put in a live article on the site.
13th November 2009, 20:59 - #8 - Massman ([M] Reviewer)

And break NDA while we're at it?
13th November 2009, 21:00 - #9 - jmke (Madshrimp)

Up until post #6 there's no NDA to speak of ... ? And even #6 is questionable, as those screenies don't prove anything performance-wise.
13th November 2009, 21:01 - #10 - Massman ([M] Reviewer)

Correct.