-
Are we reaching performance limits in uniprocessors?
-
Remember, performance enhancements are realized through improvements in both:
-
Architecture
-
Technology
-
Panel session at VTS'99: "The end of Moore's Law era?"
-
Three say yes (within 10 years), two say no.
-
Jury is still out on this one.
-
However, it is generally believed that the physics of the process, e.g. the size of an atom, will impose a hard limit.
-
With reference to Moore's law:
-
"All exponentials in nature eventually saturate."
-
What is the scaling factor of the x-axis and where are we today on the curve?
-
What about improvements in architecture?
-
We've been talking about these for 12 weeks now.
-
However, parallel machines appear to be a natural candidate as a successor to the uniprocessor:
-
A cost-effective way to improve performance beyond the uniprocessor is by connecting multiple microprocessors together.
-
It is unlikely that architectural innovations can be sustained indefinitely (analogous to the physical laws that limit technology, except in reference to complexity).
-
There has been steady progress on the major obstacle to widespread use of parallel machines, namely software.
-
We will focus on the mainstream of multiprocessor design.
-
Machines with small to medium numbers of processors (<100).
-
It is difficult to predict which architectures with more than 100 processors will prove viable in the future.
-
Flynn's classification based on the parallelism in the instruction and data streams:
-
SISD (Single instruction stream, single data stream).
-
SIMD (Single instruction stream, multiple data streams).
-
The same instruction is executed by multiple processors using different data streams.
-
Each processor has its own data memory.
-
Only a single instruction memory and control processor, which fetches and dispatches instructions.
-
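The SIMD execution model above can be mimicked in software. The sketch below is purely illustrative (the lanes and data values are my own invention): one "instruction" (an add) is dispatched once and applied in lockstep across several data streams, one per lane.

```python
# Illustrative sketch of the SIMD idea: a single "add" instruction is
# dispatched once, and every lane applies it to its own private data.
def simd_add(lanes_a, lanes_b):
    # One instruction stream (the loop body) operates on multiple
    # data streams (one element per lane), as on a SIMD machine.
    return [a + b for a, b in zip(lanes_a, lanes_b)]

# Four lanes, each with its own data memory contents.
result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
print(result)  # [11, 22, 33, 44]
```

Note that the single control processor corresponds to the single loop; the per-lane data corresponds to each processor's private data memory.
-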
MISD (Multiple instruction streams, single data stream).
-
No commercial machine has been built.
-
MIMD (Multiple instruction streams, multiple data streams).
-
Each processor fetches its own instructions and operates on its own data.
-
Often built using off-the-shelf microprocessors.
-
The SIMD model was popular through the 1980s.
-
The MIMD model has clearly emerged as the architecture of choice in recent years.
-
MIMD offers flexibility.
-
Can operate as a single-user machine providing high performance for one application.
-
Can operate as a multiprogrammed machine running many tasks simultaneously.
-
MIMDs can build on the cost/performance advantages of off-the-shelf microprocessors.
-
Existing MIMD machines fall into two classes:
-
Centralized shared-memory architectures (Uniform Memory Access, or UMA).
-
At most a few dozen processors which share a bus and a single main memory.
-
Large caches allow the bus and memory organization to satisfy the memory demands of a small number of processors.
-
Distributed memory architectures.
-
These support larger processor counts by distributing the memory and allowing multiple memories to work in parallel.
-
With increases in processor bandwidth requirements, the number of processors at which this scheme becomes preferable (over a centralized one) is decreasing.
-
Distributing the memory among the nodes has two major advantages:
-
It is a cost-effective way to scale the memory bandwidth (if most accesses are to local memory in the node).
-
It reduces the latency for accesses to local memory (due to less contention).
-
Disadvantage:
-
Communicating data between processors becomes more complex and has higher latency.
-
Key characteristics that distinguish distributed memory machines:
-
How communication is performed.
-
The architecture of the distributed memory.
-
Distributed shared-memory (DSM) architectures:
-
Physically separate memories can be addressed as one logically shared address space.
-
The address space is shared -- all processors see the same address space.
-
These machines are referred to as NUMA (non-uniform memory access) in contrast to the centralized UMA machines.
-
Multicomputer architecture:
-
Multiple private address spaces that are logically disjoint and cannot be addressed by a remote processor.
-
These address space organizations have an associated communication mechanism.
-
For DSM, shared memory can be used to communicate data via load and store operations.
-
For the multicomputer arrangement, communication is done by message passing, using either synchronous (RPC) or asynchronous mechanisms.
-
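The two mechanisms can be contrasted with a small sketch. The model below uses Python threads as a stand-in for processors (an assumption of mine, not part of the lecture): shared-memory communication is just a store and a later load of the same location, while message passing moves data only through an explicit send/receive channel (here named `mailbox`, a hypothetical identifier).

```python
import threading
import queue

# 1) Shared memory: the producer communicates by an ordinary store;
#    the consumer reads the same location with an ordinary load.
shared = {"x": 0}

def producer_shared():
    shared["x"] = 42          # store to the shared address space

t = threading.Thread(target=producer_shared)
t.start()
t.join()
print(shared["x"])            # the consumer simply loads the value

# 2) Message passing: logically disjoint address spaces, so data moves
#    only via explicit send/receive on a channel (asynchronous here).
mailbox = queue.Queue()

def producer_msg():
    mailbox.put(42)           # explicit send

t = threading.Thread(target=producer_msg)
t.start()
t.join()
msg = mailbox.get()           # explicit receive
print(msg)
```

In the first case the communication is implicit in the memory operations; in the second it is visible in the program as send/receive calls, which is exactly the property the message-passing advantages below appeal to.
-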
Three performance metrics are critical in any communication mechanism:
-
Communication bandwidth.
-
Bisection bandwidth is the bandwidth across the "narrowest" part of the interconnection network.
-
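As a worked example of bisection bandwidth (with my own illustrative numbers and the assumption of a square 2D mesh), cutting an n x n mesh down the middle severs n links, so the bisection bandwidth is n times the bandwidth of a single link:

```python
import math

# Worked example: bisection bandwidth of an n x n 2D mesh.
# Cutting the mesh down the middle severs n links, so the
# bisection bandwidth is n * (bandwidth of one link).
def mesh_bisection_bw(num_procs, link_bw_gbps):
    n = math.isqrt(num_procs)       # assume a square n x n mesh
    assert n * n == num_procs, "processor count must be a perfect square"
    return n * link_bw_gbps

# 64 processors in an 8 x 8 mesh, 2 Gb/s per link:
print(mesh_bisection_bw(64, 2.0))   # 8 links cut * 2 Gb/s = 16.0 Gb/s
```

Doubling the machine to 256 processors only doubles the bisection bandwidth (16 links), while total traffic may grow faster, which is why the "narrowest" cut matters.
-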
Bandwidth in and out of an individual processor is also important.
-
It is affected by the architecture within the node and by the communication mechanism.
-
With respect to the mechanism, when communication occurs, resources are tied up or occupied, preventing other communication.
-
Occupancy can limit the communication bandwidth.
-
Communication latency.
-
Lower is better of course.
-
Is equal to: sender overhead + time of flight + transmission time + receiver overhead.
-
Time of flight is fixed by the physical distance, while transmission time (and hence transport latency, which is time of flight plus transmission time) is determined by the interconnection network.
-
Sender and receiver overhead are determined by communication mechanism.
-
Complex mechanisms (e.g. for naming and protection) increase latency, particularly those that require the OS.
-
Latency affects performance either by:
-
Causing the processor to wait.
-
Tying up processor resources.
-
Communication latency hiding.
-
How well can the mechanism hide latency by overlapping communication with computation or with other communication?
-
For example, a system that only allows access to a word at a time may have low latency.
-
But it may be unable to hide the latency (each word transferred is treated as a cache miss).
-
Another machine may have a higher latency but allow the processor to do other things while waiting for data.
-
We'll see examples of latency hiding techniques for shared memory later.
-
Latency hiding is more difficult to measure than the previous two and is application dependent.
-
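The overlap idea can be sketched as follows. This is a simulation of mine, not a mechanism from the lecture: the "remote fetch" is a thread with an artificial delay, the access is issued early, local computation proceeds, and the processor blocks only when the data is actually needed.

```python
import threading
import time

# Sketch of latency hiding: issue a (simulated) remote access early,
# keep computing locally, and wait only when the data is required.
def remote_fetch(result):
    time.sleep(0.05)               # simulated communication latency
    result["data"] = [1, 2, 3]     # data arriving from a remote node

result = {}
t = threading.Thread(target=remote_fetch, args=(result,))
t.start()                          # issue the access early...
local = sum(range(1000))           # ...and overlap it with computation
t.join()                           # block only when the data is needed
print(local, result["data"])
```

A machine that cannot overlap would instead stall for the full 50 ms-equivalent before starting the local work, so total time would be the sum rather than roughly the maximum of the two.
-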
These performance measures are affected by:
-
The size of the data items being communicated by the application.
-
It affects the latency and bandwidth in a direct way.
-
And the effectiveness of the different latency hiding techniques.
-
The regularity in the communication patterns.
-
This affects the cost of naming and protection (communication overhead).
-
Ideally, we would like a mechanism that performs well with:
-
Large and small data requests.
-
Regular and irregular communication patterns.
-
Shared-memory advantages:
-
Compatibility with well-understood mechanisms in use in centralized multiprocessors.
-
Ease of programming, particularly for systems in which communication patterns are complex or vary dynamically during execution.
-
Low overhead for communication (hardware used to enforce protection).
-
The ability to use hardware-controlled caching (which reduces the frequency of remote communication).
-
Message-passing advantages:
-
Simpler hardware (especially with respect to building coherent caches).
-
Explicit communication forces programmers and compiler writers to pay attention to what is costly and what is not.
-
Is this really an advantage?
-
Shared-memory communication is much more popular today.
-
Centralized schemes still dominate, but long term trends favor distributing memory.
-
Insufficient parallelism: Amdahl's law applies to parallel processing as well.
-
Any program has a parallel portion and a serial portion.
-
The parallel portion is the only part that is sped up by having multiple processors.
-
As with uniprocessors, speedup is limited by the fraction of the original program that can be parallelized.
-
For example, suppose we want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?
-
With simplifying assumptions (see text), Amdahl's law gives: Speedup = 1 / (Fraction_sequential + (1 - Fraction_sequential) / 100).
-
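The exercise above can be solved directly. Under the usual simplifying assumption that the program is either fully sequential or running on all 100 processors, Amdahl's law is speedup = 1 / ((1 - f_par) + f_par / 100), and setting the speedup to 80 lets us solve for the parallel fraction f_par:

```python
# Solving the slide's exercise: speedup of 80 with 100 processors.
# Amdahl's law: 80 = 1 / ((1 - f_par) + f_par / 100).
# Rearranging: 1/80 = 1 - f_par * (1 - 1/100), so:
f_par = (1 - 1 / 80) / (1 - 1 / 100)
seq = 1 - f_par     # sequential fraction of the original computation

print(round(f_par, 4), round(100 * seq, 2))  # 0.9975 0.25
```

So only about 0.25% of the original computation may be sequential: to get a speedup of 80 on 100 processors, 99.75% of the work must be parallelizable.
-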
The second major challenge involves the large latency of remote memory access.
-
This may cost anywhere from 50 clocks to 10,000 clocks!
-
Dependent on:
-
The communication mechanism.
-
The type of interconnection network.
-
The scale of the machine.
-
Insufficient parallelism can be attacked in software with new algorithms that have better parallel performance.
-
Long communication latency can be attacked by the architecture (caching) or the programmer (restructuring the data).
-
We focus on techniques for reducing the impact of long communication latency.