TMA
- Jan 2001
Q-1:- Answer the
following questions on the development
of the Tera Computer.
What are the design goals of
the Tera Computer?
Ans:-1.
The Tera is very much a HEP
descendant but is implemented
with modern VLSI circuits and
packaging technology. A 400-MHz
clock is proposed for use in
the Tera system, again with
a maximum of 128 threads per
processor.
Multithreaded von Neumann
architecture can be traced back
to the CDC 6600 manufactured
in the mid-1960s. Multiple functional
units in the 6600 CPU can execute
different operations simultaneously
using scoreboarding control.
The very first multithreaded
multiprocessor was the Denelcor
HEP designed by Burton Smith
in 1978. The HEP was built with
16 processors driven by a 10-MHz
clock, and each processor could
execute 128 threads simultaneously.
The Tera architecture features
include not only the high degree
of multithreading but also the
explicit-dependence lookahead
and the high degree of
superpipelining in its processor-network-memory
operations. These advanced features
are mutually supportive. The
first Tera machines are expected
to appear in late 1993.
Tera Design Goals:-
The Tera architecture was
designed with several major
goals in mind.
It needed to be suitable for
very high-speed implementations,
i.e., have a short clock period
and be scalable to many processors.
A maximum configuration of the
first implementation of the
architecture will have 256 processors,
512 memory units, 256 I/O processors,
4096 interconnection network
nodes, and a clock period of
less than 3 ns.
It was important that the architecture
be applicable to a wide spectrum
of problems. Programs that do
not vectorize well, perhaps
because of a preponderance of
scalar operations or too frequent
conditional branches, will execute
efficiently as long as there
is sufficient parallelism to
keep the processors busy. Virtually
any parallelism applicable in
the total computational workload
can be turned into speed, from
operation level parallelism
within program basic blocks
to multi-user time and space
sharing.
A third goal was ease of compiler implementation.
Although the instruction set
does have a few unusual features,
they do not seem to pose unduly
difficult problems for the code
generator. There are no register
or memory addressing constraints
and only three addressing modes.
Condition code setting is consistent
and orthogonal. Although the
richness of the instruction
set often allows several ways
to do something, the variation
in their relative costs as the
execution environment changes
tends to be small.
Because the architecture permits
the free exchange of spatial
and temporal locality for
parallelism, a highly optimizing
compiler may work hard at improving
locality and trade the parallelism
thereby saved for more speed.
On the other hand, if there
is sufficient parallelism, the
compiler has a relatively easy
job.
The Tera Multiprocessor and
Sparse Three-Dimensional
Torus:-
The interconnection network
is a three-dimensional sparsely
populated torus of pipelined
packet-switching nodes, each
of which is linked to some of
its neighbors. Each link can
transport a packet containing
source and destination addresses,
an operation, and 64 data bits
in both directions simultaneously
on every clock tick. Some of
the nodes are also linked to
resources, i.e., processors,
data memory units, I/O processors,
and I/O cache units. Instead
of locating the processors on
one side of the network and
the memories on the other, the
resources are distributed more-or-less
uniformly throughout the network.
This permits data to be placed
in memory units near the appropriate
processor when possible and
otherwise generally maximizes
the distance between possibly
interfering resources.
The interconnection network
of one 256-processor Tera system
contains 4096 nodes arranged
in a 16*16*16 toroidal mesh;
i.e., the mesh “wraps
around” in all three dimensions.
Of the 4096 nodes, 1280 are
attached to the resources comprising
256 processors, 512 data memory
units, 256 I/O cache units, and
256 I/O processors. The 2816 remaining
nodes do not have resources
attached but still provide message
bandwidth.
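
The node counts above can be checked with a few lines of Python. This is only a sketch: the resource breakdown combines the unit counts quoted in these notes (taking the 256 cache units to be the I/O cache units), and torus_neighbors assumes a fully populated torus, whereas the real network is sparse and links each node to only some of its neighbors. All names are made up for illustration.

```python
SIDE = 16  # nodes along each dimension of the torus

# Resource counts quoted in the text for the 256-processor system.
RESOURCES = {
    "processors": 256,
    "data memory units": 512,
    "I/O cache units": 256,
    "I/O processors": 256,
}

def node_count(side=SIDE):
    """Total nodes in a side x side x side mesh."""
    return side ** 3

def unattached_nodes(side=SIDE):
    """Nodes that carry traffic but have no resource attached."""
    return node_count(side) - sum(RESOURCES.values())

def torus_neighbors(x, y, z, side=SIDE):
    """Six neighbors of a node in a FULL torus (an assumption; the
    Tera network is sparse). The modulo gives the wrap-around."""
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1))
    return [((x + dx) % side, (y + dy) % side, (z + dz) % side)
            for dx, dy, dz in steps]
```

Here the node at (0, 0, 0) has (15, 0, 0) among its neighbors, which is what wrapping around in a dimension means.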
Any plane bisecting the network
crosses at least 256 links,
giving the network a data bisection
bandwidth of one 64-bit data
word per processor per tick
in each direction. This bandwidth
is needed to support shared-memory
addressing in the event that
all 256 processors are addressing
memory on the other side of
some bisecting plane simultaneously.
As the Tera architecture scales
to larger numbers of processors
p, the number of network nodes
grows as p^(3/2)
rather than as the p log p
associated with the more commonly
used multistage networks. For
example, a 1024-processor system
would have 32,768 nodes. The
reason for the overhead per
processor of p^(1/2) instead
of log p stems from the fact
that the system is limited by
the speed of light.
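
The two growth rates can be compared numerically. The sketch below assumes a constant factor of 1 and a base-2 logarithm for the multistage case (neither is stated in the text), so only the trend is meaningful, not the absolute multistage figures. The function names are illustrative.

```python
import math

def torus_nodes(p):
    """Network nodes for p processors, growing as p^(3/2)."""
    return round(p ** 1.5)

def multistage_nodes(p):
    """Order-of-growth figure p * log2(p) for a multistage network
    (constant factor and log base are assumptions)."""
    return p * math.log2(p)

for p in (256, 1024):
    print(p, torus_nodes(p), multistage_nodes(p))
```

For p = 256 and p = 1024 the torus figures reproduce the 4096 and 32,768 nodes quoted in the text.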
One can argue that memory
latency is fully masked by parallelism
only when the number of messages
being routed by the network
is at least p*l, where l is
the latency. Since messages
occupy volume, the network must
have a volume proportional to
p*l; since the
speed of light is finite, the
volume is also proportional
to l^3, and therefore l is
proportional to p^(1/2) rather
than log p.
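
The argument above can be written out step by step. This is just a restatement of the reasoning in the text, with V standing for the network volume and N_msg for the number of messages in flight (both symbols are introduced here for illustration):

```latex
\begin{align*}
N_{\mathrm{msg}} &\ge p\,l
  && \text{(enough messages in flight to mask latency)} \\
V &\propto p\,l
  && \text{(messages occupy volume)} \\
V &\propto l^{3}
  && \text{(finite speed of light)} \\
l^{3} &\propto p\,l
  \;\Rightarrow\; l^{2} \propto p
  \;\Rightarrow\; l \propto p^{1/2} \\
\text{nodes} &\propto p\,l \propto p^{3/2}
\end{align*}
```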
Superpipelined Support:-
The Tera Computer can execute multiple
instruction streams simultaneously
in each processor. In the current
implementation, as few as 1
or as many as 128 program counters
may be active at once. On every
tick of the clock, the processor
logic selects a thread that
is ready to execute and allows
it to issue its next instruction.
Since instruction interpretation
is completely pipelined by the
processor and by the network
and memories as well, a new instruction
from a different thread may
be issued during each tick without
interfering with its predecessors.
When an instruction finishes,
the thread to which it belongs
becomes ready to execute the
next instruction. As long as
there are enough threads in
the processor so that the average
instruction latency is filled
with instructions from other
threads, the processor is being
fully utilized.
If a thread were not allowed
to issue its next instruction
until the previous instruction
is completed, then approximately
70 different threads would be
required on each processor to
hide the expected latency. The
lookahead described later reduces
the number of threads needed to
achieve peak performance.
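
The latency-hiding argument above can be illustrated with a small simulation. It is a sketch, not Tera's actual issue logic: it assumes every instruction has the same fixed latency, no lookahead (a thread waits for its previous instruction to finish), and a simple first-ready selection rule. The function name utilization is made up for illustration.

```python
def utilization(num_threads, latency, ticks=7000):
    """Fraction of ticks on which some thread issues an instruction.

    Assumptions: one issue slot per tick, a fixed latency for every
    instruction, and no lookahead.
    """
    ready_at = [0] * num_threads  # tick at which each thread may issue again
    issued = 0
    for tick in range(ticks):
        # Pick the first ready thread (a hypothetical selection policy).
        for t in range(num_threads):
            if ready_at[t] <= tick:
                ready_at[t] = tick + latency  # busy until the instruction finishes
                issued += 1
                break
    return issued / ticks

for n in (1, 35, 70, 128):
    print(n, utilization(n, latency=70))
```

Under these assumptions, with a latency of 70 ticks, 70 threads keep the processor busy on every tick, 35 threads leave it idle half the time, and a single thread issues only once every 70 ticks.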
As the figure shows, three operations can
be executed simultaneously per
instruction per processor. The
M-pipeline is for memory-access
operations, the A-pipeline is
for arithmetic operations, and
the C-pipeline is for control
or arithmetic operations. The
instructions are 64 bits wide.
If more than one operation in
an instruction specifies the
same register or setting of
condition codes, the priority
is M>A>C.
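
The M, A, C priority rule can be sketched as follows. The text gives only the ordering, so the data layout (a dictionary from pipeline name to a destination register and value) and the function name resolve_writes are assumptions for illustration, not the hardware mechanism.

```python
PRIORITY = ("M", "A", "C")  # highest priority first, as stated in the text

def resolve_writes(ops):
    """Decide which operation's result lands in each register.

    ops maps a pipeline name ("M", "A" or "C") to a
    (destination_register, value) pair (a hypothetical layout).
    """
    result = {}
    # Apply the lowest-priority write first so that higher-priority
    # writes to the same register overwrite it.
    for pipe in reversed(PRIORITY):
        if pipe in ops:
            reg, value = ops[pipe]
            result[reg] = (pipe, value)
    return result

writes = resolve_writes({"A": ("R1", 5), "M": ("R1", 9), "C": ("R2", 1)})
print(writes)
```

In this example both the M and A operations name R1, so the M result is the one that is kept, while the C operation writes R2 unopposed.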
It has been estimated that
a peak speed of 1G operations
per second can be achieved per
processor if driven by a 333-MHz
clock. However, a particular
thread will not exceed about
100M operations per second because
of interleaved execution. The
processor pipeline is rather
deep, about 70 ticks as compared
with 8 ticks in the HEP pipeline.
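
The peak figure follows from the three operations per instruction and the clock rate. A minimal arithmetic sketch, with illustrative names and the per-thread limit taken directly from the text rather than derived:

```python
def peak_ops_per_second(clock_hz, ops_per_instruction=3):
    """Peak rate when one instruction issues on every tick."""
    return clock_hz * ops_per_instruction

peak = peak_ops_per_second(333_000_000)   # 333-MHz clock from the text
single_thread_limit = 100_000_000         # "about 100M", quoted from the text
print(peak, single_thread_limit / peak)
```

Three operations on each tick of a 333-MHz clock give 999 million operations per second, i.e. about 1G, and the quoted single-thread limit is roughly one tenth of that peak.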
Compare the advantages and
potential drawbacks of the Tera
Computer.
Ans:-
Advantages:-
The Tera uses multiple contexts
to hide latency.
The Tera machine performs a
context switch every clock cycle.
Here, both pipeline latency
and memory latency are hidden
in the HEP/Tera approach.
The major focus is on latency
tolerance rather than latency
reduction.
Thread creation must be
very cheap.
With 128 contexts per processor,
a large number (2K) of registers
must be shared finely between
threads.
Tagged memory and registers
with full/empty bits are used
for synchronization.
As long as there is plenty
of parallelism in user programs
to hide latency and plenty of
compiler support, the performance
is potentially very high.
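
The full/empty-bit synchronization mentioned in the list above can be imitated in software. The sketch below is an analogy only: it assumes the common convention that a write waits for the cell to be empty and leaves it full, and a read waits for it to be full and leaves it empty. The class and method names are hypothetical, and a real machine does this in hardware on every memory word rather than with locks.

```python
import threading

class FullEmptyCell:
    """One memory word with a full/empty bit (software analogy)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_when_empty(self, value):
        with self._cond:
            while self._full:          # wait until the cell is empty
                self._cond.wait()
            self._value = value
            self._full = True          # mark the cell full
            self._cond.notify_all()

    def read_when_full(self):
        with self._cond:
            while not self._full:      # wait until the cell is full
                self._cond.wait()
            value = self._value
            self._full = False         # mark the cell empty again
            self._cond.notify_all()
            return value

def demo(n=5):
    """Pass n values from a producer thread to a consumer thread."""
    cell = FullEmptyCell()
    received = []

    def producer():
        for i in range(n):
            cell.write_when_empty(i)

    def consumer():
        for _ in range(n):
            received.append(cell.read_when_full())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Because each write must wait for the previous value to be consumed, the consumer sees the values in order with none lost or repeated.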
Drawbacks:-
Performance is poor when
parallelism is limited; in
particular, single-context
performance is guaranteed to be low.
A large number of contexts
demands lots of registers and
other hardware resources which
in turn implies higher cost
and complexity.
Finally, the limited focus
on latency reduction and caching
entails lots of slack parallelism
to hide latency as well as lots
of memory bandwidth; both require
a higher cost for building the
machine.
Explain the
thread state and management
scheme used in Tera computer.
Ans:-
Thread State and Management:-
The figure shows that each thread
has the following state associated
with it:
One 64-bit stream status word
(SSW);
Thirty-two 64-bit general-purpose
registers (R0-R31);
Eight 64-bit target registers (T0-T7).
Context switching is so rapid
that the processor has no time
to swap the processor-resident
thread state. Instead, it has
128 of everything, i.e., 128
SSWs, 4096 general-purpose registers,
and 1024 target registers. It
is appropriate to compare these
registers in both quantity and
function to vector registers or
words of caches in other architectures.
In all three cases, the objective
is to improve locality and avoid
reloading data.
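
The register totals above follow directly from keeping 128 copies of the per-thread state. A small sketch of that bookkeeping, in which the class and field names are illustrative and only the counts come from the text:

```python
from dataclasses import dataclass, field

NUM_STREAMS = 128   # thread contexts held per processor
NUM_GENERAL = 32    # R0-R31
NUM_TARGET = 8      # T0-T7

@dataclass
class ThreadState:
    """Processor-resident state of one thread (64-bit values as ints)."""
    ssw: int = 0
    r: list = field(default_factory=lambda: [0] * NUM_GENERAL)
    t: list = field(default_factory=lambda: [0] * NUM_TARGET)

def processor_totals(streams=NUM_STREAMS):
    """Count the registers needed to hold every context at once."""
    contexts = [ThreadState() for _ in range(streams)]
    return {
        "ssw": len(contexts),
        "general": sum(len(c.r) for c in contexts),
        "target": sum(len(c.t) for c in contexts),
    }

print(processor_totals())
```

The totals come out to 128 SSWs, 4096 general-purpose registers, and 1024 target registers, matching the figures in the text.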
Program addresses are 32 bits
in length. Each thread’s
current program counter is located
in the lower half of its SSW.
The upper half describes various
modes (e.g., floating-point
rounding, lookahead disable),
the trap disable mask (e.g.,
data alignment, floating overflow),
and the four most recently generated
condition codes.
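
The split of the SSW into a program counter and a mode half can be shown with bit masks. Only the 32/32 split is taken from the text; the layout of the individual fields inside the upper half is not given here, so the sketch treats that half as one opaque value. The function names are hypothetical.

```python
PC_BITS = 32
PC_MASK = (1 << PC_BITS) - 1
WORD_MASK = (1 << 64) - 1

def split_ssw(ssw):
    """Return (upper_half, pc) of a 64-bit stream status word."""
    ssw &= WORD_MASK
    return ssw >> PC_BITS, ssw & PC_MASK

def with_pc(ssw, new_pc):
    """Replace the program counter, leaving the upper half untouched."""
    return (ssw & WORD_MASK & ~PC_MASK) | (new_pc & PC_MASK)

upper, pc = split_ssw(0xDEADBEEF00001000)
print(hex(upper), hex(pc))
```

A branch that uses only the low 32 bits of a target register corresponds to with_pc: the modes, trap mask, and condition codes in the upper half stay as they were.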
The target registers are used
as branch targets. The format
of the target registers is identical
to that of the SSW, though
most control transfer operations
use only the low 32 bits to
determine a new PC. Separating
the determination of the branch
target address from the decision
to branch allows the hardware
to prefetch instructions at
the branch targets, thus avoiding
delay when the branch decision
is made.
One target register (T0) points
to the trap handler which is
nominally an unprivileged program.
When a trap occurs, the effect
is as if a coroutine call to
T0 had been executed. This
makes trap handling extremely
lightweight and independent
of the operating system. Trap
handlers can be changed by the
user to achieve specific trap
capabilities and priorities
without loss of efficiency.