### CSCI 4717/5717 Computer Architecture

Topic: Symmetric Multiprocessors & Clusters

Reading: Stallings, Sections 18.1 through 18.4

#### Classifications of Parallel Processing

M. Flynn classified types of parallel processing in 1972 ("Some Computer Organizations and Their Effectiveness", IEEE Transactions on Computers).

Types of Parallel Processor Systems (Figure 18.2):

- Single instruction, single data stream
- Single instruction, multiple data stream
- Multiple instruction, single data stream
- Multiple instruction, multiple data stream

#### Parallel Processing – Page 1 of 63 CSCI 4717 – Computer Architecture

#### Classifications of Parallel Processing (continued)

- Single Instruction, Single Data Stream (SISD) – A single processor executes a single instruction stream on data from a single memory (uniprocessor)
- Single Instruction, Multiple Data Stream (SIMD) – A single instruction stream drives multiple processing elements in lockstep, with one data memory per processing element (vector/array processing)
- Multiple Instruction, Single Data Stream (MISD) – Multiple processors execute different sequences of instructions on a single data set. Not commercially implemented
- Multiple Instruction, Multiple Data Stream (MIMD) – A set of processors simultaneously executes different instructions on different data sets
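The four categories differ only in how many instruction streams and data streams are active at once. A minimal sketch of that mapping follows; the `Flynn` enum and the `classify` helper are hypothetical names introduced for illustration, not part of the reading.

```python
from enum import Enum


class Flynn(Enum):
    SISD = "single instruction, single data stream"
    SIMD = "single instruction, multiple data stream"
    MISD = "multiple instruction, single data stream"
    MIMD = "multiple instruction, multiple data stream"


def classify(instruction_streams: int, data_streams: int) -> Flynn:
    """Map stream counts to a Flynn category (hypothetical helper)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    multi_instr = instruction_streams > 1
    multi_data = data_streams > 1
    if not multi_instr and not multi_data:
        return Flynn.SISD
    if not multi_instr and multi_data:
        return Flynn.SIMD
    if multi_instr and not multi_data:
        return Flynn.MISD
    return Flynn.MIMD
```

For example, `classify(1, 8)` describes an array processor with eight processing elements, and `classify(4, 4)` describes a four-processor SMP or cluster.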



#### Multiple Instruction, Multiple Data Stream

- Processors are general purpose
- Each processor should be able to complete a process by itself
- Communication methods
  - Through shared memory ("tightly coupled")
    - Symmetric multiprocessor (SMP) – memory access times are consistent for all processors
    - Nonuniform memory access (NUMA) – memory access times may differ
  - Cluster – either through fixed connections or a network ("loosely coupled")

#### Symmetric Multiprocessors (SMP)

A standalone computer with the following traits:

- Two or more similar processors of comparable capacity
- Processors share the same memory and I/O
- Processors are connected by a bus or other internal connection
- Memory access time is approximately the same for each processor

#### Symmetric Multiprocessors (continued)

- All processors share access to I/O through either:
  - the same channels
  - different channels providing paths to the same devices
- All processors can perform the same functions (hence symmetric)
- System is controlled by an integrated operating system that provides interaction between processors
- Interaction occurs at the job, task, file, and data element levels


#### Integrated Operating System


- O/S for an SMP is NOT like that of clusters/loosely coupled systems, where communication is usually at the file level
- Can be a high degree of interaction between processes
- O/S schedules processes or threads across all processors

#### Organization of Tightly Coupled Multiprocessor

- Individual processors are self-contained, i.e., they have their own control unit, ALU, registers, one or more levels of cache, and possibly private main memory
- Access to shared memory and I/O devices through some interconnection network
- Processors communicate through memory in common data area


#### **SMP** Advantages

Advantages are realized only if the O/S can provide parallelism

- Performance – but only if some work can be done in parallel
- Availability/reliability – Since all processors can perform the same functions, failure of a single processor does not halt the system
- Incremental growth – User can enhance performance by adding additional processors
- Scaling – Vendors can offer a range of products based on number of processors
- Transparent to user – User only sees improvement in performance
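The performance point can be made quantitative with Amdahl's law, which is not stated on these slides but is the standard way to express "only if some work can be done in parallel". The sketch below assumes a workload whose parallel portion scales perfectly across processors; `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Ideal speedup when only part of the work can run in parallel.

    parallel_fraction is the share of execution time that can be spread
    across processors; the remainder stays serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

With no parallel work the speedup stays at 1 regardless of processor count, which is why adding processors alone does not guarantee better performance.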


#### Organization of Tightly Coupled Multiprocessor (continued)

- Memory is often organized to provide simultaneous access to separate blocks of memory
- Interconnection approaches
  - Time-shared or common bus
  - Central controller (arbitrator)
  - Multiport memory










#### Time-Shared Bus – Advantages

- Simplicity – Easy to understand, and the same form is already used with DMA
- Flexibility – Adding a processor involves simply attaching it to the bus
- Reliability – As long as arbitration does not involve a single controller, there is no single point of failure



#### Time-Shared Bus – Disadvantages

- Waiting for bus creates bottleneck
  - Can be helped with individual caches
  - Usually L1 and L2
- Cache coherence policy must be used (usually hardware)





#### Multiport Memory (continued)

- Advantages
  - Removes bus access bottleneck
  - Can dedicate portions of memory to only one processor
    - Better security
    - Better recovery from faults
- Disadvantages
  - Complex memory logic
  - More PCB wiring
  - Write-through policy should be used for caches

#### **Central Control Unit**

Functions

- Funnels separate data streams between independent modules
- Can buffer requests
- Performs arbitration and timing
- Passes status and control
- Performs cache update alerting




#### SMP Operating System Design Issues

- Simultaneous concurrent processes
  - O/S routines should be reentrant
  - O/S tables and other management structures must be expanded to handle multiple processes and processors
- Scheduling
  - More than just order now; also which processor gets a process
  - Any processor should be capable of scheduling too
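The scheduling point (order plus processor assignment) can be sketched as a single shared ready queue from which whichever processor becomes free first takes the next process. This is a toy simulation under assumed burst times, not how any particular O/S implements it; `schedule` and its parameters are hypothetical.

```python
import heapq
from collections import deque


def schedule(burst_times, processor_count):
    """Assign processes from one shared ready queue to the first free processor.

    Returns a list of (process_index, processor_id, start_time) tuples.
    """
    if processor_count < 1:
        raise ValueError("need at least one processor")
    ready = deque(enumerate(burst_times))
    # Heap of (time the processor becomes free, processor id).
    free_at = [(0, cpu) for cpu in range(processor_count)]
    heapq.heapify(free_at)
    assignments = []
    while ready:
        pid, burst = ready.popleft()          # order: next process in the queue
        now, cpu = heapq.heappop(free_at)     # placement: earliest free processor
        assignments.append((pid, cpu, now))
        heapq.heappush(free_at, (now + burst, cpu))
    return assignments
```

With burst times `[4, 2, 1, 3]` on two processors, the long first process keeps processor 0 busy while processor 1 picks up the remaining three in turn.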


#### SMP Operating System Design Issues (continued)

- Synchronization – Scheduling of resources is now needed not just among processes but also among processors
- Memory management
  - Shared page replacement strategy
  - Must understand and take advantage of memory hardware
- Reliability and fault tolerance – Must be able to handle the loss of a processor without taking down other processors

#### Cache Coherence

- One or two levels of cache are typically associated with each processor; this is essential for performance
- Problem
  - Multiple copies of the same data can exist in different caches
  - This can result in an inconsistent view of memory

#### Write Policy Review

- Write-back policy
  - Write goes only to cache
  - Main memory updated only when cache block is replaced
  - Can lead to inconsistency
- Write-through policy
  - All writes made to cache and main memory
  - Inconsistencies can occur unless all caches monitor memory traffic
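Both policies can leave a second cache holding stale data, which a toy model makes concrete. The sketch below assumes one value per cache line and no coherence mechanism at all; the `Cache` class and `demo` function are hypothetical names for illustration.

```python
class Cache:
    """A private cache in front of a shared main memory (modeled as a dict)."""

    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}      # address -> cached value
        self.dirty = set()   # addresses modified but not yet written back

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value        # memory updated on every write
        else:
            self.dirty.add(addr)             # memory updated only on replacement

    def evict(self, addr):
        if addr in self.dirty:               # write back a modified block
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


def demo(write_through):
    """Two caches load x, cache a writes it; return (memory value, value b sees)."""
    memory = {"x": 1}
    a = Cache(memory, write_through)
    b = Cache(memory, write_through)
    a.read("x")
    b.read("x")
    a.write("x", 2)
    return memory["x"], b.read("x")
```

Under write-back, both main memory and the second cache still hold the old value. Under write-through, memory is current but the second cache is still stale because nothing told it to invalidate or update its copy.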


#### Software Solutions

- Compiler and operating system deal with problem
- Overhead transferred to compile time
- Design complexity transferred from hardware to software
- Software tends to make conservative decisions leading to inefficient cache utilization

#### Software Solutions (continued)

- Mark shared variables as non-cacheable – too conservative
- Add instructions to enable/disable caching for variables; the compiler can then analyze code to determine safe periods for caching shared variables

#### Hardware Solution

- Also known as cache coherence protocols
- Dynamic recognition of potential problems at run time
- Because the problem is dealt with only when it actually occurs, the cache is used more efficiently
- Transparent to programmer and compiler
- Methods
  - Directory protocols
  - Snoopy protocols

#### Directory Protocols – Central control

- Central memory controller maintains a directory of:
  - where blocks are held
  - in which caches they are held
  - what state the data is in
- Appropriate transfers are performed by the controller

#### **Directory Protocols – Write Process**

- Requests to write to a line are made to controller
- Using directory, controller tells all other processors with copy of same data to invalidate
- Write is granted to requesting processor and that processor has exclusive rights to that data
- Request to read from another processor forces controller to issue command to processor with exclusive rights to update (write back) main memory.
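The write process above can be sketched as a small simulation. This is a simplified model under stated assumptions (one value per line, a single central controller, caches modeled as dicts); `DirectoryController` and its fields are hypothetical names, not a real protocol implementation.

```python
class DirectoryController:
    """Central controller tracking which caches hold each line (toy model)."""

    def __init__(self, memory, cache_count):
        self.memory = memory
        self.caches = [dict() for _ in range(cache_count)]  # per-processor cache
        self.sharers = {}   # addr -> set of cache ids holding a copy
        self.owner = {}     # addr -> cache id with exclusive rights

    def write(self, cpu, addr, value):
        # Tell every other cache holding the line to invalidate its copy.
        for other in self.sharers.get(addr, set()) - {cpu}:
            self.caches[other].pop(addr, None)
        self.sharers[addr] = {cpu}
        self.owner[addr] = cpu            # requester now has exclusive rights
        self.caches[cpu][addr] = value    # the write stays in the owner's cache

    def read(self, cpu, addr):
        owner = self.owner.get(addr)
        if owner is not None and owner != cpu:
            # Force the exclusive holder to write back before anyone else reads.
            self.memory[addr] = self.caches[owner][addr]
            del self.owner[addr]
        if addr not in self.caches[cpu]:
            self.caches[cpu][addr] = self.memory[addr]
        self.sharers.setdefault(addr, set()).add(cpu)
        return self.caches[cpu][addr]
```

After processor 0 writes a line that processor 1 also held, processor 1's copy is gone and main memory is still out of date; the next read by processor 1 triggers the write-back and returns the new value.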




#### Write Invalidate (a.k.a. MESI)

- Multiple readers, one writer
- When a write is required, a command is issued and all other cached copies of the line are invalidated
- Writing processor then has exclusive (cheap) access until the line is required by another processor
- A state is associated with every line
  - <u>M</u>odified
  - <u>E</u>xclusive
  - <u>S</u>hared
  - <u>I</u>nvalid



#### Write Update (a.k.a. write broadcast)

- Multiple readers and writers

#### **Snoopy Protocols – Implementations**

- Performance of these two implementations depends on the number of caches and the pattern of reads/writes
- Some systems use adaptive protocols that combine both methods
- Write invalidate is most common – used in Pentium 4 and PowerPC systems

#### MESI Protocol

- Each cache line has two associated status bits, which encode four states
- Modified – The line in this cache has been modified and is valid only in this cache
- Exclusive – The line in this cache is the same as that in memory (unmodified) and is not present in any other cache
- Shared – The line in this cache is the same as that in memory (unmodified) and may also be present in another cache
- Invalid – The line in this cache contains bad data
- Write-through from an L1 cache to an L2 cache makes the write visible to the MESI protocol

|                               | M (Modified)       | E (Exclusive)      | S (Shared)                    | I (Invalid)          |
|-------------------------------|--------------------|--------------------|-------------------------------|----------------------|
| This cache line valid?        | Yes                | Yes                | Yes                           | No                   |
| The memory copy is            | out of date        | valid              | valid                         | -                    |
| Copies exist in other caches? | No                 | No                 | Maybe                         | Maybe                |
| A write to this line          | does not go to bus | does not go to bus | goes to bus and updates cache | goes directly to bus |
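The table can be read as a small state machine for one cache line. The sketch below follows the commonly described MESI transitions rather than the exact state diagram in the reading; the event names and the `others_have_copy` flag are assumptions introduced for illustration.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def write_goes_to_bus(state):
    """Last row of the table: only M and E lines are written without bus traffic."""
    return state in (State.S, State.I)


def next_state(state, event, others_have_copy=False):
    """Return the new state of one cache line.

    event is one of "local_read", "local_write", "snoop_read", "snoop_write";
    the snoop events are bus transactions observed from another cache.
    """
    if event == "local_read":
        if state is State.I:
            # Read miss: load as Shared if another cache holds the line, else Exclusive.
            return State.S if others_have_copy else State.E
        return state
    if event == "local_write":
        # After any local write this cache holds the only valid, modified copy.
        return State.M
    if event == "snoop_read":
        # Another cache reads the line: valid copies become Shared
        # (a Modified line is written back to memory first).
        return State.I if state is State.I else State.S
    if event == "snoop_write":
        # Another cache writes the line: this copy is invalidated.
        return State.I
    raise ValueError(f"unknown event: {event}")
```

A line that is read with no other holders becomes Exclusive, so a later write to it can move to Modified without any bus transaction, which is the "cheap" access the write-invalidate approach relies on.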





#### Clusters

- Defined
  - A group of interconnected, whole computers
  - Working together as a unified computing resource
  - Can create the illusion of being one machine
- Alternative to Symmetric Multiprocessing (SMP)
  - High performance
  - High availability
  - Server applications
- Each computer is called a node

#### **Cluster Benefits**

- Absolute scalability – Almost limitless in terms of adding independent multiprocessing machines
- Incremental scalability – Can start out small and build as user acquires new machines

#### **Cluster Benefits (continued)**

- High availability
  - Loss of one node causes only a small decrement in performance
  - Software (middleware) handles fault tolerance automatically
- Superior price/performance
  - By using easily affordable building blocks, a cluster gets better performance at a lower price than a single large computer
  - Expanding the design doesn't depend on PCB redesign

#### Cluster Configurations

- High-speed message link options/configurations
  - Dedicated LAN with at least one computer having a connection to remote clients
  - Shared LAN with other non-cluster machines
- Simplest way to classify clusters is based on whether computers share disk(s)
  - No shared disk – each machine has a local disk
  - Shared disk in addition to local disk – should use disk mirroring or RAID







#### Active Standby Configurations

- Separate servers
  - No shared disk
  - High performance and availability
  - Scheduling software is needed to assign client requests to servers to balance the load
  - If a computer fails in the middle of an application, another can take over
  - To do this, there must be some method of copying data between at least neighboring computers

#### Active Standby Configurations (continued)

- Shared nothing
  - All computers share a common RAID, but each has partitions all to itself
  - If one fails, the cluster is reconfigured to reallocate the failed computer's partitions
- Shared disk
  - All computers have access to all volumes of the same disk
  - Must use some type of locking facility to ensure that data can be accessed by only one computer at a time
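The locking requirement for a shared disk can be sketched with a per-volume lock. This is an illustration only: threads stand in for cluster nodes and a dict stands in for the disk, whereas a real cluster would need a distributed lock manager. `LockManager` and `run_demo` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class LockManager:
    """Grants one node at a time access to each shared-disk volume (toy model)."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the table of volume locks
        self._locks = {}

    def _lock_for(self, volume):
        with self._guard:
            return self._locks.setdefault(volume, threading.Lock())

    @contextmanager
    def exclusive(self, volume):
        lock = self._lock_for(volume)
        with lock:                       # held for the duration of the access
            yield


def run_demo(node_count=4, updates_per_node=1000):
    """Each 'node' increments a counter on the shared volume under the lock."""
    manager = LockManager()
    disk = {"vol0": 0}

    def node():
        for _ in range(updates_per_node):
            with manager.exclusive("vol0"):
                disk["vol0"] += 1        # read-modify-write on shared data

    threads = [threading.Thread(target=node) for _ in range(node_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return disk["vol0"]
```

Because every update happens while the volume lock is held, no increments are lost and the final count equals `node_count * updates_per_node`.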


| Clustering Method | Description | Benefits | Limitations |
|---|---|---|---|
| Passive Standby | A secondary server takes over in case of primary server failure. | Easy to implement. | High cost because the secondary server is unavailable for other processing tasks. |
| Active Secondary | The secondary server is also used for processing tasks. | Reduced cost because secondary servers can be used for processing. | Increased complexity. |
| Separate Servers | Separate servers have their own disks. Data is continuously copied from primary to secondary server. | High availability. | High network and server overhead due to copying operations. |
| Servers Connected to Disks | Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server. | Reduced network and server overhead due to elimination of copying operations. | Usually requires disk mirroring or RAID technology to compensate for risk of disk failure. |
| Servers Share Disks | Multiple servers simultaneously share access to disks. | Low network and server overhead. Reduced risk of downtime caused by disk failure. | Requires lock manager software. Usually used with disk mirroring or RAID technology. |

#### Cluster O/S Design Issues – Failure Management

- Two types of management: high availability and fault tolerance
- High availability
  - Independent processes
  - If one goes down, anything in progress is lost
  - Application layer must handle uncertainty of partially executed transactions
  - Process is taken over by next machine
- Fault tolerant
  - Redundancies
  - Mechanisms for handling partially executed transactions

#### Cluster O/S Design Issues – Failure Management (continued)

- Failover – Switching applications and data from a failed system to an alternative within the cluster
- Failback – Restoration of applications and data to the original system after the problem is fixed
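Failover and failback can be sketched as bookkeeping over which node currently hosts each application. The model below only tracks placement and ignores data movement; the `Cluster` class and its method names are hypothetical.

```python
class Cluster:
    """Tracks where each application runs and where it originally lived."""

    def __init__(self, nodes):
        self.up = {node: True for node in nodes}
        self.home = {}      # application -> node it was originally placed on
        self.running = {}   # application -> node currently hosting it

    def place(self, app, node):
        self.home[app] = node
        self.running[app] = node

    def fail(self, node):
        """Failover: move applications off a failed node to a surviving one."""
        self.up[node] = False
        survivors = [n for n, ok in self.up.items() if ok]
        if not survivors:
            raise RuntimeError("no surviving node to fail over to")
        for app, host in self.running.items():
            if host == node:
                self.running[app] = survivors[0]

    def repair(self, node):
        """Failback: return applications to their original node once it is fixed."""
        self.up[node] = True
        for app, home in self.home.items():
            if home == node:
                self.running[app] = node
```

For example, an application placed on node `a` moves to node `b` when `a` fails and returns to `a` after `repair("a")`.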


#### Cluster O/S Design Issues – Load balancing

- Incremental scalability of load with changes in number of nodes
- Automatically include new computers in scheduling
- Middleware needs to recognise that processes may switch between machines
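A minimal sketch of these load-balancing points follows, assuming a simple least-loaded policy with ties broken by node name. The policy and the `LoadBalancer` class are assumptions for illustration, not something the slides prescribe.

```python
class LoadBalancer:
    """Assigns each job to the node with the fewest jobs so far (toy policy)."""

    def __init__(self):
        self.load = {}   # node -> number of assigned jobs

    def add_node(self, node):
        # A new node is eligible for scheduling as soon as it is registered.
        self.load.setdefault(node, 0)

    def assign(self, job):
        if not self.load:
            raise RuntimeError("no nodes available")
        # Least-loaded node wins; ties are broken by node name.
        node = min(self.load, key=lambda n: (self.load[n], n))
        self.load[node] += 1
        return node
```

A node added part-way through immediately receives the next job, because it starts with the lowest load.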

#### Cluster O/S Design Issues – Parallelizing Computation

- Single application executing in parallel on a number of machines in the cluster
- Three general approaches to the problem:
  - Parallelizing compiler
  - Parallelizing application
  - Parametric computing

#### **Parallelizing Compiler**

- Determines at compile time which parts can be executed in parallel
- These parts are split off to run on different computers
- Performance depends on compiler

#### Parallelizing Application

- Application written to be parallel
- Message passing to move data between nodes
- Hard to program
- Performance depends on programmer
- Potential for best end result
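Message passing can be illustrated with workers that share no data and communicate only through queues. In this sketch threads stand in for cluster nodes and `queue.Queue` stands in for the network; a real cluster application would use a message-passing library instead. `worker` and `parallel_sum` are hypothetical names.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive chunks of numbers as messages and send back each partial sum."""
    while True:
        message = inbox.get()
        if message is None:          # sentinel: no more work for this node
            break
        outbox.put(sum(message))


def parallel_sum(data, node_count=3):
    """Split data across nodes, send each its chunk, and combine the replies."""
    inboxes = [queue.Queue() for _ in range(node_count)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(box, results))
               for box in inboxes]
    for t in threads:
        t.start()
    for i, box in enumerate(inboxes):
        box.put(data[i::node_count])  # one chunk per node
        box.put(None)
    for t in threads:
        t.join()
    # Exactly one partial sum was sent back per node.
    return sum(results.get() for _ in range(node_count))
```

The programmer decides how to partition the data and when to exchange messages, which is where both the difficulty and the performance potential come from.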

#### Parametric Computing

- Applies if a problem is repeated execution of an algorithm on different sets of data
- Example: simulation using different scenarios
- Depends on tools to organize/manage and execute
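Parametric computing amounts to running the same function over many independent parameter sets. The sketch below uses a thread pool as a stand-in for dispatching runs to cluster nodes; the `simulate` model (compound growth from a starting value of 100) is a made-up example, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(scenario):
    """Stand-in for one simulation run: compound growth for a (rate, years) pair."""
    rate, years = scenario
    value = 100.0
    for _ in range(years):
        value *= 1.0 + rate
    return round(value, 2)


def run_all(scenarios, workers=4):
    """Run the same algorithm on every scenario; results keep scenario order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, scenarios))
```

Because the runs never communicate with each other, no changes to the algorithm are needed; only the tooling that hands out scenarios and collects results is cluster-aware.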

#### Cluster Middleware

Software installed on each node to enable cluster operation:

- Provides high availability through load balancing and failover control
- Creates unified image to user
- Single point of entry – User logs onto cluster rather than a node
- Single file hierarchy – User sees a single file structure
- Single control point – A single node acts as the interface to the user
- Single virtual network visible to cluster nodes
- Single memory space – Programs are allowed to share variables across distributed memory
- Single job management system – Cluster assigns the jobs, not the user
- Single user interface



#### Cluster v. SMP

Positive points for both

- Both provide multiprocessor support to high-demand applications
- Both are available commercially; SMP has been around longer

#### SMP benefits

- Easier to manage and configure since it is a single machine
- Closer to single processor systems for which nearly all applications are written
  - Scheduling is main difference between SMP and single-processor system
- Less physical space
- Lower power consumption
- Well-established

#### **Cluster benefits**

- Superior incremental & absolute scalability
- Superior availability through redundancy of all components, not just processors
- Simpler to build from existing computers than an SMP, which must be designed from the PCB level
- With time, clusters are likely to dominate
