Multithreading and Multicore Architectures

Present-day CPUs can usually keep more than 30 instructions in flight in the pipeline because some of them take parallel paths. That means more than one instruction completes per clock cycle, and you have a superscalar CPU. Combine a clock speed of 3 or 4 GHz with that pipeline capacity and, there you have it, billions of instructions running around. (This should not be confused with actual “parallel processing,” which is the combining of two or more complete CPUs to execute a program.)

Multithreading is all about improved pipeline action. How do we improve pipeline capacity? Simple: add another pipeline. Doing things in parallel speeds them up pretty radically. What’s really happening is that instructions from different threads get interleaved in the pipelines. While one pipeline is waiting on a memory fetch or something else that holds up execution, the other pipeline can take advantage of that time, so CPU latency is hidden far more effectively. There is a little problem, though: how does the CPU sort out which bits of the program can run in parallel? Modern CPUs have become so complex that they can keep track of several threads of a program and put the results back together at a merge point.
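To make that interleaving concrete, here is a toy Python model of two hardware threads sharing one pipeline: whenever the active thread stalls on a memory load, the CPU switches to the other thread instead of idling. The instruction names and the rule that only a LOAD stalls are invented purely for illustration.

```python
def interleave(stream_a, stream_b):
    """Schedule two instruction streams on one pipeline, switching on stalls."""
    streams = [list(stream_a), list(stream_b)]
    schedule, active = [], 0
    while any(streams):
        if not streams[active]:      # this thread is finished; use the other
            active = 1 - active
            continue
        op = streams[active].pop(0)
        schedule.append((active, op))
        if op == "LOAD":             # memory stall: switch to the other thread
            active = 1 - active
    return schedule

sched = interleave(["ADD", "LOAD", "MUL"], ["SUB", "LOAD", "DIV"])
print(sched)
# [(0, 'ADD'), (0, 'LOAD'), (1, 'SUB'), (1, 'LOAD'), (0, 'MUL'), (1, 'DIV')]
```

Notice how thread 1’s SUB runs in the cycles that thread 0 would otherwise have wasted waiting for its LOAD.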
Now, imagine you have two or more CPUs working on multiple threads. Is that a super boost for data processing or what? Well, yes, but without the support of specialized software you can’t really appreciate it. In fact, you don’t even have to imagine this: the technology is in AMD’s Athlon X2 CPUs and Intel’s Core 2 CPUs right now. Putting two or more interconnected CPU cores on a single silicon chip is the present trend, and the future will see more and more cores per chip.


A Pipeline Story

The pipeline is one of those fundamental CPU features that enable several other very important speed-up schemes. Keep in mind that pipelining didn’t reach Intel’s x86 line until the 486 in 1989, so earlier CPUs had more “primitive” ways to deal with processing problems.

CPUs have to get instructions and data out of memory (read cycles) and put data back (write cycles). The CPU first fetches an instruction. That instruction might then require a chunk of data, or several. So a single instruction might take two or three read cycles to get both the instruction and its data into the microprocessor.

This is where the address bus comes in. The microprocessor puts the address on the address bus and then reads the instruction. If the instruction calls for data, one or more additional read cycles take place. All the while, the microprocessor has to sit quietly and wait for the instruction and data to show up.
After the microprocessor has all the pieces of the instruction and data, it goes to work. Some instructions take quite a few steps, and now it’s the memory’s turn to stop and wait for the CPU. A pretty laggy way to process data, I would say.

The introduction of overlapped processing, fetching the next instruction while the current one executes, partially solved this awkward situation. Now the only time the CPU had to wait was for the first instruction and data, and the only time the memory bus sat idle was after sending off the last instruction of the program.

As CPUs evolved, their instruction sets grew haphazardly. Some instructions were much longer than others, and a long instruction could take multiple memory read cycles to fetch. On top of that, the size and number of the data operands varied. It was time for our tidy overlapped process to lose it and run amok.
Our generous guys from Intel came up with an ingenious idea, introduced as far back as the 8086: prefetching. From then on there has been a little buffer memory between the memory bus and the CPU, known as a prefetch queue. The memory bus delivers instructions into the prefetch queue, and if the CPU gets bogged down with a complex instruction, the next instructions simply stack up in the queue. When the CPU runs into a string of simple instructions, it draws the queue down. Either way, both the memory bus and the CPU run at full speed without being held up by the other.
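That decoupling can be sketched in a few lines of Python. The queue depth, the per-instruction cycle costs, and the one-fetch-per-cycle bus below are all made-up parameters; the point is that the slow instruction in the middle never stalls the bus, and the queued-up instructions keep the CPU fed afterwards, so the total time equals just the CPU’s own work.

```python
from collections import deque

def simulate(instruction_costs, queue_depth=4):
    """Count cycles for a CPU fed through a small prefetch queue."""
    queue = deque()
    fetched = executed = 0
    cpu_busy = 0          # cycles left on the instruction in execution
    cycles = 0
    while executed < len(instruction_costs):
        cycles += 1
        # Memory bus: deliver one instruction per cycle while there is room.
        if fetched < len(instruction_costs) and len(queue) < queue_depth:
            queue.append(instruction_costs[fetched])
            fetched += 1
        # CPU: start the next instruction if idle, then do one cycle of work.
        if cpu_busy == 0 and queue:
            cpu_busy = queue.popleft()
        if cpu_busy:
            cpu_busy -= 1
            if cpu_busy == 0:
                executed += 1
    return cycles

print(simulate([1, 1, 5, 1, 1, 1]))  # 10 cycles: the bus never stalls the CPU
```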

Intel decided this architecture still couldn’t unleash the full power of a CPU and finally introduced pipelining with the 486. Here’s how it works: take the overlapped fetch-and-execute process and break it into very small pieces, so that each step does only a very simple task. With more stages in the pipeline, each task becomes extremely simple, and even complex instructions can be broken down and executed nearly as quickly as simple ones.
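The arithmetic behind that speed-up can be sketched directly. With a k-stage pipeline, the first instruction still takes k steps, but after that one instruction completes every cycle, so N instructions take k + N − 1 cycles instead of k × N. The five-stage count below is illustrative, not the 486’s exact design:

```python
def cycles_without_pipeline(stages, instructions):
    # Each instruction passes through all stages before the next one starts.
    return stages * instructions

def cycles_with_pipeline(stages, instructions):
    # Once the pipeline is full, one instruction completes every cycle.
    return stages + instructions - 1

print(cycles_without_pipeline(5, 100))  # 500 cycles
print(cycles_with_pipeline(5, 100))     # 104 cycles
```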

The pipeline also enables an interesting trick: branch prediction. Prediction kicks in when branch instructions make their loopy way into the CPU, because the address of the next instruction depends on the outcome of instructions that are still executing.

If the predictor guesses that the program flow will continue around a loop, the pipeline can fetch the next instruction of that loop in advance. If the prediction turns out to be wrong, the pipeline has to be flushed and part of the production line held up until the right instruction works its way down. These predictions are right the majority of the time, though, so a performance gain is realized almost every time software takes a branch.
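A common textbook way to make such predictions is a 2-bit saturating counter, which only changes its mind after two consecutive surprises. The sketch below is that generic scheme, not any specific CPU’s predictor, run against a loop branch that is taken nine times out of ten:

```python
class TwoBitPredictor:
    """A 2-bit saturating counter: states 0-1 predict not-taken, 2-3 taken."""
    def __init__(self):
        self.state = 0
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken nine times, then not taken once, repeated ten times.
pattern = ([True] * 9 + [False]) * 10
predictor = TwoBitPredictor()
hits = 0
for taken in pattern:
    hits += predictor.predict() == taken
    predictor.update(taken)

print(hits, "of", len(pattern))  # 88 of 100 predicted correctly
```

The saturating behaviour is what keeps the single not-taken exit of each loop pass from throwing away the prediction for all the taken iterations that follow.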

About RISC and CISC

Reduced Instruction Set Computer (RISC) CPUs were introduced by a series of companies that kept their instruction sets very simple, in the belief that a processor built from many small, fast instructions could outperform a more complicated design.
Intel and AMD chose to stick with the CISC (Complex Instruction Set Computer) architecture. CISC CPUs include a huge number of instructions, and many of them can do some pretty complicated things. The RISC camp never imagined that Intel and AMD would fabricate microprocessors on nanometer technologies allowing complexities of more than 100 million transistors and get that intricate circuitry running at gigahertz rates. In practice, modern CISC chips internally break their complex instructions down into simple RISC-like micro-operations, so the two philosophies have largely converged. And because commodity CISC CPUs turned out to be cheaper while still powerful, much of the scientific community that once ran RISC workstations eventually migrated to them.

In The Beginning, There Was MMX

In 1997, Intel introduced MMX with the Pentium MMX microprocessor. MMX is a trade name for Intel’s first Single Instruction, Multiple Data (SIMD) instruction set. These special instructions were aimed primarily at the era’s emerging multimedia workloads, such as audio, video, and image processing. 3D graphics stormed the second half of that decade, and Intel saw the value of giving its CPUs instructions suited to that kind of number crunching.

AMD countered Intel’s MMX with its 3DNow! SIMD design, implemented in the AMD K6-2 CPUs. 3DNow! extended the SIMD concept to floating-point calculations, widening the range of numbers that could be crunched. Intel wouldn’t concede supremacy to AMD and introduced SSE (Streaming SIMD Extensions) in its Pentium III CPUs, further improving on the MMX design. This instruction set brought registers expanded to 128 bits. To take full advantage of these special instructions, programmers had to organize their code in a very strict format so the data would fit into the SIMD registers. Even with these restrictions, SIMD established the PC as a graphics-processing powerhouse that unseated specialized graphics silicon from companies like Silicon Graphics.
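The core SIMD idea (one instruction applied to several data elements at once) can be mimicked in plain Python. The four-lane width below echoes an SSE register holding four packed values, but the function itself is just an illustration, not a real intrinsic:

```python
def simd_add(lanes_a, lanes_b):
    """One 'instruction' that adds four packed values in lockstep."""
    assert len(lanes_a) == len(lanes_b) == 4   # fixed register width
    return [a + b for a, b in zip(lanes_a, lanes_b)]

# Brightening four pixels at once instead of looping over them one by one.
pixels = [10, 20, 30, 40]
offsets = [1, 2, 3, 4]
print(simd_add(pixels, offsets))  # [11, 22, 33, 44]
```

The “strict format” mentioned above is visible even here: the data has to be packed into fixed-width groups of four before the operation can be applied.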

A Little Story About Multitasking

Intel’s 386 was the first x86 CPU to feature hardware and special instructions supporting true multitasking. Nowadays it is common to see several applications running simultaneously, but on a single core this is only an illusion, as no more than one program is actually executing at any particular instant. The blazing speed of modern CPUs lets each program run for just a short slice of time before the system switches to the next.

The 386 could stop a program in its tracks and suspend it while other programs ran. The OS would later switch back to the first program as if nothing had ever stopped it. This is called preemptive multitasking, and it’s an important concept to remember for later.
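Here is a preemptive scheduler in miniature: each program below is just a counter of remaining work, the time slice is fixed, and a program that isn’t finished when its slice expires is forced to the back of the queue. The job names and work amounts are invented.

```python
from collections import deque

def round_robin(jobs, time_slice):
    """Run each job for one slice, preempt it, and move on to the next."""
    queue = deque(jobs.items())
    schedule = []
    while queue:
        name, remaining = queue.popleft()
        schedule.append(name)                # this job gets the CPU for a slice
        remaining -= time_slice
        if remaining > 0:
            queue.append((name, remaining))  # preempted: back of the queue
    return schedule

print(round_robin({"editor": 3, "player": 2, "scanner": 1}, time_slice=1))
# ['editor', 'player', 'scanner', 'editor', 'player', 'editor']
```

Run fast enough, this rotation is what makes three programs appear to run “simultaneously” on one core.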

CPU-Memory Interaction

We still haven’t figured out how the microprocessor keeps track of all that memory capacity that put your wallet in some difficulty. To answer that, we need to look at some components of the microprocessor, so it’s time to zoom in and hunt for those legendary monsters with dark powers. The first part mentioned the program counter as an important piece of the CPU; the program counter is actually a kind of register. The CPU incorporates a series of hardware registers that work like sticky notes for reminding it of things. The CPU owes these registers a lot, because they can be referenced quickly and easily by many of the instructions the computer executes.

I’m going to get a little technical now, but it really helps you understand the complexity of a CPU. Several of these registers are dedicated to keeping track of memory addresses and are named accordingly. The program counter points to the next instruction; the index register is used to step automatically through tables of data; and the stack pointer keeps track of the memory addresses used to return from program subroutines.
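Those bookkeeping registers are easy to watch in action with a toy fetch-execute loop. The tiny instruction set below (ADD, CALL, RET, HALT) is invented purely for illustration; note how CALL pushes the return address that RET later pops, exactly the job of the stack described above.

```python
def run(program):
    pc = 0        # program counter: index of the next instruction
    stack = []    # return addresses pushed by CALL, popped by RET
    acc = 0       # accumulator register
    while pc < len(program):
        op, arg = program[pc]
        pc += 1                  # PC now points past the fetched instruction
        if op == "ADD":
            acc += arg
        elif op == "CALL":
            stack.append(pc)     # remember where to resume
            pc = arg
        elif op == "RET":
            pc = stack.pop()
        elif op == "HALT":
            break
    return acc

# Main code at addresses 0-3; a small subroutine at addresses 4-5.
prog = [("ADD", 1), ("CALL", 4), ("ADD", 100), ("HALT", None),
        ("ADD", 10), ("RET", None)]
print(run(prog))  # 111
```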

I will only quickly mention the segmented and unsegmented (flat) memory architectures. Segmentation was promoted by Intel and still exists in x86 CPUs today, although modern operating systems mostly use a flat model. Segmented addressing was designed to let the CPU manage the chunks of data it has to process more conveniently.

A few words about memory protection, too. Intel’s protected-mode architecture made it impossible for a program to delve into the memory space allotted to another application. If a program tries to intrude, the CPU raises a memory fault, the operating system steps in, and the invading program is forced to quit. You can usually recover from this situation without rebooting, and you know exactly which program was responsible for the mess, since the OS clearly names it.
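The base-and-limit idea behind that protection fits in a short sketch. Each “application” below gets a window into one shared RAM array, and any write outside its own window raises a fault instead of silently clobbering the neighbour. All names and sizes here are made up.

```python
class MemoryFault(Exception):
    pass

class Segment:
    """A base-and-limit window into shared memory."""
    def __init__(self, memory, base, limit):
        self.memory, self.base, self.limit = memory, base, limit
    def write(self, offset, value):
        if not 0 <= offset < self.limit:
            raise MemoryFault(f"offset {offset} is outside this segment")
        self.memory[self.base + offset] = value

ram = [0] * 16
app_a = Segment(ram, base=0, limit=8)   # first half of RAM
app_b = Segment(ram, base=8, limit=8)   # second half of RAM

app_a.write(3, 42)        # fine: inside app A's own segment
try:
    app_a.write(9, 99)    # would land inside app B's memory
except MemoryFault as fault:
    print("fault:", fault)
```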

Inside a Motherboard

Ever wonder what’s inside your motherboard? Well, here we go:

A motherboard’s primary job is to house the computer’s CPU chip and give every other component quick access to it. Everything you see in your computer connects to the motherboard one way or another. Motherboards also follow a standard form factor, but I consider that a minor detail that shouldn’t trouble you. There are, however, important standard features worth mentioning:

– The socket where the microprocessor is placed determines what kind of CPU the motherboard can support.
– The chipset is part of the motherboard’s logic system and is usually made of two parts – the northbridge and the southbridge. These two “bridges” connect the CPU to the other parts of the computer.
– The Basic Input/Output System (BIOS) chip controls the most basic functions of the computer and performs a self-test every time you turn it on. Some newer motherboards feature a dual-BIOS system, which provides a backup in case a virus crashes the system or a BIOS update goes wrong.
– The motherboard also features a battery-operated real-time clock chip. The same battery powers the CMOS memory that stores the system settings, keeping them safe while the machine is off.

Present-day motherboards include slots and ports such as:

– Peripheral Component Interconnect Express (PCI-E) – connections for video, sound, and video-capture cards, as well as a multitude of other cards. PCI Express is a newer protocol (the first was plain PCI) that acts more like a network than a bus. It practically eliminates the need for other dedicated ports, which is why AGP was abandoned.
– Accelerated Graphics Port (AGP) – a dedicated port for video cards.
– Integrated Drive Electronics (IDE) – interfaces for the hard drives.
– Universal Serial Bus (USB) or FireWire – for external peripherals.
– Memory slots.
– Redundant Array of Independent Disks (RAID) controllers – allow the computer to treat multiple hard drives as one drive.
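RAID 0 striping, the simplest of those multi-drive arrangements, can be sketched in a few lines: consecutive blocks of the logical drive alternate between two physical drives, and reading interleaves them back. The block names and two-drive layout below are illustrative only.

```python
def stripe(blocks, drive_count=2):
    """Distribute logical blocks round-robin across physical drives (RAID 0)."""
    drives = [[] for _ in range(drive_count)]
    for i, block in enumerate(blocks):
        drives[i % drive_count].append(block)
    return drives

def read_back(drives):
    """Interleave the drives again to reconstruct the logical block order."""
    total = sum(len(d) for d in drives)
    return [drives[i % len(drives)][i // len(drives)] for i in range(total)]

data = ["b0", "b1", "b2", "b3", "b4", "b5"]
drives = stripe(data)
print(drives)                      # [['b0', 'b2', 'b4'], ['b1', 'b3', 'b5']]
print(read_back(drives) == data)   # True
```

Because neighbouring blocks sit on different drives, both drives can be read at once, which is where the speed benefit of striping comes from.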

Note: A series of budget motherboards include on-board sound, networking, video, or other peripheral support, rather than relying on plug-in cards.

CPU Sockets

To connect to a motherboard, most CPUs feature a set of pins that has to fit the socket perfectly; this layout is called a Pin Grid Array (PGA). As microprocessors advance, they need more and more pins, both to handle new features and to deliver more and more power to the chip. Current socket arrangements are often named for the number of pins in the PGA. Nowadays, the most common sockets are:

– Socket 478 – for older Pentium and Celeron processors.
– Socket 754 – for AMD Sempron and some older AMD Athlon64 processors.
– Socket 939 – for newer and faster AMD Athlon64 processors.
– Socket AM2 – for the newest AMD Athlon64 X2 processors.

A couple of years ago, Intel introduced the LGA (Land Grid Array) layout. LGA differs from PGA in that the pins are part of the socket rather than the CPU. AMD has recently come up with something similar, but only for its Opteron server-class CPUs.

North and South Chipsets

The chipset is the “binder” that connects the microprocessor to the rest of the motherboard. PC chipsets usually consist of two basic parts – the northbridge and the southbridge.
The example that follows is Intel-specific. AMD has a series of architectural differences, and I will note these in parentheses.

The Northbridge

The northbridge connects directly to the processor via the front-side bus, or FSB (AMD instead uses a point-to-point link called HyperTransport, which plays a similar role). Intel-based motherboards have a memory controller integrated into the northbridge chip (AMD builds the memory controller directly into the CPU), which gives the CPU fast access to memory. The northbridge also connects to the AGP or PCI Express bus and to the memory itself. So, the northbridge is responsible for interconnecting the CPU, RAM, and the PC’s most bandwidth-intensive I/O devices (video, networking, and the other devices that sit on the high-bandwidth expansion bus). Segregating these high-bandwidth components from the rest of the system and giving them their own private “hub” lets system designers keep down the cost of the bridge chip and the motherboard. The northbridge can thus focus on routing traffic among the processor, memory, and bandwidth-intensive I/O components, while another chip (the southbridge) handles slower I/O traffic and miscellaneous functions (timing, power management, the BIOS).

Let’s get into a few details. An Intel-based northbridge is helped by a series of buses that make these speeds possible:
– The memory bus – the RAM’s single lane of communication with the rest of the system. The northbridge’s memory controller manages traffic on the memory bus and performs all memory accesses on behalf of the other components in the system.
– The front-side bus (FSB) – connects the CPU to the northbridge; this is the CPU’s only way to communicate with the other components. Instructions, data, and results all travel to and from the CPU over the front-side bus. Because the FSB plays such a critical role in letting the processor talk to everything else on the motherboard, generous FSB bandwidth is very important.
– The peripheral component interconnect (PCI) bus – connects the northbridge (and hence the CPU and RAM) to two things: 1) the southbridge chip, and 2) a collection of important, bandwidth-hungry add-on devices like the video card, network card, sound card, and so on.
Note: An AMD motherboard features much the same buses, with the important difference of the CPU-integrated memory controller.

The northbridge also plays an important part in how far a computer can be overclocked, since its clock is used as the baseline from which the CPU derives its own operating frequency. The chip runs increasingly hot as computers get faster, and it is no longer unusual for the northbridge to need a heatsink or even active cooling.

The Southbridge

The southbridge is natively slower than the northbridge, and information from the CPU has to pass through the northbridge before reaching it. All the mission-critical jobs are taken by the northbridge, so system designers tend to load the southbridge with a ton of miscellaneous, lower-priority tasks. Whenever designers want to integrate a new component into the core logic chipset (e.g., networking, RAID, sound), the southbridge is the natural first choice.
Present-day southbridges have to manage a series of peripheral controllers:
– PCI controller – the southbridge sits on the PCI bus and needs a PCI interface so that it can “talk” to the northbridge and to the other devices on the PCI bus.
– IDE controller – the integrated drive electronics (IDE) bus is the standard bus for personal computer mass-storage devices (CD-ROM, DVD-ROM, hard disk).
– USB controller – the universal serial bus (USB) was introduced to replace a number of aging peripheral interconnects, like ISA, PS/2 (for mouse and keyboard), the serial port, and the parallel port.
– X-bus interface – the 8-bit X-bus is a legacy bus used by the PS/2 keyboard and mouse and by the Flash ROM that contains the BIOS code.
– DMA controller – direct memory access (DMA) is a catch-all term for a family of protocols that do just what the name implies: they let components on the motherboard (especially hard drives) access main memory directly, without using the CPU as a go-between to manage the transfer (as was the case in the days before DMA).
– System timer – the system timer generates a clock pulse for the ISA bus and an oscillating tone for the speaker. The latter is responsible for the beeps you hear on boot-up.
– Interrupt controllers (APIC, NMI, standard IRQs) – there are times when a device needs to interrupt the processor and tell it to drop what it’s doing and address some new input or event. For instance, there has to be some way of letting the processor know that the mouse has just moved so that the code handling mouse input can be run. The system’s interrupt mechanism allows different categories of devices to generate different types of interrupts and so take control of the CPU.
– Last but not least, the southbridge deals with the nonvolatile BIOS memory and with the APM and ACPI power-management features that keep the motherboard’s power consumption in check.
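The interrupt mechanism in that list can be modelled with a pending queue checked between instructions: a device raises an IRQ, and the CPU runs the registered handler before fetching the next instruction. The device name and the single “mouse” handler below are invented for illustration.

```python
pending = []     # interrupts raised by devices, waiting for service
handlers = {}    # handler routine registered for each IRQ name
log = []         # what actually got executed, in order

def register(irq, handler):
    handlers[irq] = handler

def raise_irq(irq):
    pending.append(irq)

def run(instructions):
    for instr in instructions:
        # Between instructions, service any pending interrupts first.
        while pending:
            handlers[pending.pop(0)]()
        log.append(instr)

register("mouse", lambda: log.append("handle mouse move"))
raise_irq("mouse")      # the mouse moved while the CPU was busy
run(["add", "sub"])
print(log)  # ['handle mouse move', 'add', 'sub']
```

The key point is that the main program never polls the mouse; the interrupt barges in, gets handled, and normal execution resumes as if nothing happened.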

Note: there are also connectors that link the motherboard to the power supply, and others that deliver power from the motherboard to the components that need it.