How do multi-processor CPUs multi-process?



tashirosgt
2012-May-15, 11:45 PM
I can understand how parallel computing works when it is implemented by programmers. I can also understand how a computer operating system could automatically implement it by putting unrelated tasks (like writing a message to a log file and playing a YouTube video) in different "threads". But I have the impression that multi-processor CPUs have ways of multi-processing that don't depend on being told how to divide the work. What is involved in the "internal" multi-processing of the CPU? Does it read ahead in the machine language instructions it is given and detect pieces of code that can be executed independently?

swampyankee
2012-May-16, 12:09 AM
No, they always need to be told how to divide the work, either explicitly ("do this on CPU1") or implicitly ("figure out how to divide up this stuff"). Parallel programming is at least 40 years old -- it started before the ILLIAC IV -- and the techniques are reasonably well established for many types of problems.
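
For a flavor of the "explicit" style, here is a minimal C++ sketch (illustrative only, not from any particular reference) that divides a summation across two threads by hand:

#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // The *programmer* decides the split: first half to one thread,
    // second half to another. The CPU never figures this out on its own.
    std::vector<int> data(1000000, 1);
    long long lo = 0, hi = 0;
    auto mid = data.begin() + data.size() / 2;

    std::thread t1([&] { lo = std::accumulate(data.begin(), mid, 0LL); });
    std::thread t2([&] { hi = std::accumulate(mid, data.end(), 0LL); });
    t1.join();
    t2.join();

    std::cout << lo + hi << "\n";  // prints 1000000
}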

There's quite a lot of literature on the subject; try poking around acm.org and netlib.

dgavin
2012-May-16, 12:32 AM
How a multi-core processor is used is heavily dependent on the operating system or microcode back end that is controlling it.

There is one exception to this: AMD Phenom II X6 processors and greater have a built-in dynamic reconfiguring of the 6 cores into 3 faster cores (only Windows 7 or greater can take advantage of this). So if an X6 running at 3.4GHz per core (on Windows 7) detects that 3 cores are active and 3 are not, it reconfigures itself into an X3 mode where each of the three combined cores runs at 4.4GHz. This is the only instance I know of where the CPU itself does the switching. This dynamic mode switching does not work on X6 laptops that rely on embedded graphics that steal CPU cycles from the main processor to beef up their own GPU rates. That also means NVidia's PhysX enhancements will prevent the dynamic speed-up from functioning if a machine doesn't have a second graphics card that PhysX can use.

In the case of Windows running on a 4-core processor (or a 4-CPU mainboard), Processor 0 (the first core) is allocated to the Windows kernel and UI, and then it uses the rest depending on the following: Was CPU affinity set? If so, that CPU (or group of cores, if multiple core affinities are set) is used for all things related to the application, including the Windows UI interfacing; even if it's spawning threads, those threads will use that CPU affinity. If not, it will start applications considering the following: does the application request a preference for the last core? (AVG, Symantec, and other anti-virus applications do this.) If so, their main process thread starts on Processor 3 (the last core).

If an application does not have an affinity (only a preference, or nothing preferred), it will start on whichever core has the least amount of CPU activity at application start. A notable variance is if the application goes into full-screen Direct3D mode, in which case it is always pushed to Processor 0 (3D games tend to run their main thread on the Windows kernel's core for various reasons). After all this:

Started threads will disperse themselves over whatever core is least active when the thread first starts, with a small built-in avoidance factor against Processor 0. A notable exception to this is if the application uses affinity, in which case threads will start on those affinity CPUs, UNLESS all those affinity CPUs are maxed out at >90% usage.
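
For the curious, a program can also request an affinity for itself; here is a rough sketch using the documented Win32 calls (the mask value is just an example, and error handling is minimal):

#include <windows.h>
#include <stdio.h>

int main(void) {
    // Bitmask with one bit per logical processor: 0x3 = cores 0 and 1.
    DWORD_PTR mask = 0x3;
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    // From here on, Windows will schedule this process's threads only on
    // cores 0 and 1, matching the per-application affinity described above.
    return 0;
}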

djellison
2012-May-16, 04:08 AM
The i5/i7 CPUs do that now as well -- a 'turbo' mode when not all the cores are running.

http://en.wikipedia.org/wiki/Intel_Turbo_Boost

Jens
2012-May-16, 04:25 AM
Does it read ahead in the machine language instructions it is given and detect pieces of code that can be executed independently?

I don't think it would "read ahead." I think it's more like: as the processor goes through the code, it is given certain things to execute, i.e. hold the value of a certain variable. And I think what happens is that the different processors usually divide up the tasks, so say processor A is quite busy but B is not; then when the next instruction comes, say "store 1 as the value for variable X," processor B will do that. I doubt that the processors would divide tasks like "play the YouTube video" and "edit the MS Word file". It would be more like "put variable X into RAM," "remove variable Y from RAM," etc. And as swampyankee pointed out, this would have to be programmed, either explicitly or implicitly. Otherwise the processors would not understand how to do it.

dgavin
2012-May-16, 05:08 AM
The i5/i7 CPUs do that now as well -- a 'turbo' mode when not all the cores are running.

http://en.wikipedia.org/wiki/Intel_Turbo_Boost

That's good to know.

Though the way AMD handles it is different: it has just two clock rates and six- or three-core modes. I think the reason it only works with Windows 7 or later is that it uses some of the built-in virtualization of newer Windows to accomplish the switch of the number of cores, and the speed, without impacting Windows itself. (Sort of what VM hosts do with Windows VMs as well.) Looks like the new AMD FX 8-core does the same type of boost as the X6.

HenrikOlsen
2012-May-16, 10:21 AM
If you're thinking about how the work is split up between different processor cores, it's pretty much per thread.

But each core has multiple subunits that either do different things or are replicated so they can do the same things to multiple numbers in parallel. Scheduling work amongst these is indeed done by "reading ahead" and intelligently scheduling subtasks for routing through the different pipelines of the different subunits.

novaderrik
2012-May-16, 11:02 AM
i always just assumed there was some sort of black magic or alchemy involved..

glappkaeft
2012-May-16, 01:13 PM
If you're thinking about how the work is split up between different processor cores, it's pretty much per thread.

But each core has multiple subunits that either do different things or are replicated so they can do the same things to multiple numbers in parallel. Scheduling work amongst these is indeed done by "reading ahead" and intelligently scheduling subtasks for routing through the different pipelines of the different subunits.

If anyone wants to read up on this, it's called a superscalar CPU, and the idea has been around since 1965 (the CDC 6600, designed by Seymour Cray). You can find quite a bit of detailed information on how the CPU checks whether two instructions can be executed in parallel (availability of ALUs, data dependencies, etc.). That today's fast (cycle time ~0.3 nanoseconds), wide (highly superscalar) and deep (highly pipelined) processors manage to do so is a credit to the engineers.
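
You can even glimpse the effect from ordinary code. As a toy demonstration (timings vary by CPU and compiler, and an optimizer may rewrite the loops, so treat the output as illustrative only), compare one long dependency chain against two independent chains the hardware can overlap:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v(1 << 22, 1.0);

    // One accumulator: every add depends on the previous one.
    auto t0 = std::chrono::steady_clock::now();
    double s = 0.0;
    for (double x : v) s += x;

    // Two accumulators: two independent dependency chains that a
    // superscalar core can run in parallel on separate functional units.
    auto t1 = std::chrono::steady_clock::now();
    double a = 0.0, b = 0.0;
    for (std::size_t i = 0; i + 1 < v.size(); i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    auto t2 = std::chrono::steady_clock::now();

    std::cout << s << " " << a + b << "\n";
    std::cout << "one chain:  " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    std::cout << "two chains: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}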

lomiller1
2012-May-18, 07:18 PM
Multi-core and multi-threaded (multiple “virtual” cores) CPUs all require proper coding and OS support to work.

Perhaps what is confusing you is that modern CPUs can perform multiple operations simultaneously. This is different from multi-processing. In this case the CPU has internal hardware to detect instruction dependencies. You will have a queue of, say, 50 instructions lined up in the order they occurred in the program. The first obviously isn’t dependent on any instruction that isn’t already complete.

The second may depend on the instruction before it, but certainly doesn’t always. The third may depend on the first two, but again it may not. The hardware inside the CPU goes through this queue and assigns each instruction the processing unit (ALU, FPU, etc.) it requires. It then goes through the list and looks for the next non-dependent instruction that has available processor units capable of executing it. 2-3 ALUs isn’t uncommon. They also typically have a large number of “hidden” registers so that loading a register with a value won’t block a subsequent load into that same logical register.
They can even complete some types of dependent instructions. There is predictive hardware that tries to guess the result of branch-type instructions in the queue and executes that side of the branch. It’s also possible to execute both sides of the branch and discard the side that isn’t needed.
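
A toy software model of that queue walk (purely illustrative; the real thing is hardware operating on physical registers and reservation stations) might look like this:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Instr {
    std::string name;
    std::vector<int> reads;  // source registers
    int writes;              // destination register
    bool done = false;
};

// Instruction i is ready when no earlier, still-pending instruction writes
// a register it reads (RAW hazard) or writes (WAW hazard).
bool ready(const std::vector<Instr>& q, std::size_t i) {
    for (std::size_t j = 0; j < i; ++j) {
        if (q[j].done) continue;
        for (int r : q[i].reads)
            if (q[j].writes == r) return false;
        if (q[j].writes == q[i].writes) return false;
    }
    return true;
}

int main() {
    std::vector<Instr> queue = {
        {"load r1", {}, 1},
        {"load r2", {}, 2},
        {"add r3 = r1 + r2", {1, 2}, 3},
        {"load r4", {}, 4},  // independent of the add
    };

    // Each "cycle", issue every ready instruction (pretend we never run out
    // of ALUs). The three loads issue together in cycle 0; the add has to
    // wait one cycle for its inputs.
    int left = static_cast<int>(queue.size());
    for (int cycle = 0; left > 0; ++cycle) {
        std::vector<std::size_t> issue;
        for (std::size_t i = 0; i < queue.size(); ++i)
            if (!queue[i].done && ready(queue, i)) issue.push_back(i);
        for (std::size_t i : issue) {
            queue[i].done = true;
            --left;
            std::cout << "cycle " << cycle << ": " << queue[i].name << "\n";
        }
    }
}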

“RISC” instruction sets are generally designed with this in mind, while older instruction sets will break up their more complex instructions into simpler internal instructions that, once again, tend to create fewer dependencies. You can also benefit greatly from a compiler that “knows” what a processor can and can’t schedule and organizes its machine code in a way that allows the processor to make optimal use of its internal resources.
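
As a hand-made illustration of that breakup (not real micro-ops, just their general shape), a CISC-style "add a register and a memory operand in one instruction" becomes a separate load step and an ALU step with a clean dependency between them:

#include <iostream>

int main() {
    int memory[1] = {41};
    int* rbx = memory;  // stands in for an address register
    int eax = 1;        // stands in for an accumulator register

    // CISC-style, one instruction:  eax += *rbx;
    // RISC-like micro-ops, two steps the scheduler can track separately:
    int tmp = *rbx;     // load micro-op (memory access)
    eax = eax + tmp;    // ALU micro-op (register-only arithmetic)

    std::cout << eax << "\n";  // 42
}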

glappkaeft
2012-May-18, 08:22 PM
Nowadays even RISC processors transform their instructions into an internal representation. The assumptions made when they were designed, 20 or more years ago, about what the optimal instruction set was have long since been made obsolete. The even older CISC instruction sets just had to go there first.

lomiller1
2012-May-18, 09:10 PM
Nowadays even RISC processors transform their instructions into an internal representation. The assumptions made when they were designed, 20 or more years ago, about what the optimal instruction set was have long since been made obsolete. The even older CISC instruction sets just had to go there first.

True, but AFAIK in this case the internal ops are still very RISC-like in principle, just optimized for the specific processor design, whereas CISC instructions have elements that are very different. I.e. they don’t have known start/end points because they are not fixed size, and they do things like access memory and send its contents to the ALU in a single instruction.