

TMA - Jan 2001

Q-1:- Answer the following questions on the development of the Tera Computer.

What are the design goals of the Tera Computer?

Ans:-1. The Tera is very much a HEP descendant but is implemented with modern VLSI circuits and packaging technology. A 400-MHz clock is proposed for use in the Tera system, again with a maximum of 128 threads per processor.

Multithreaded von Neumann architecture can be traced back to the CDC 6600, manufactured in the mid-1960s. Multiple functional units in the 6600 CPU could execute different operations simultaneously under scoreboarding control. The very first multithreaded multiprocessor was the Denelcor HEP, designed by Burton Smith in 1978. The HEP was built with 16 processors driven by a 10-MHz clock, and each processor could execute 128 threads simultaneously.

The Tera architecture features not only a high degree of multithreading but also explicit-dependence lookahead and a high degree of superpipelining in its processor-network-memory operations. These advanced features are mutually supportive. The first Tera machines are expected to appear in late 1993.

Tera Design Goals:-

The Tera architecture was designed with several major goals in mind.

First, it needed to be suitable for very high-speed implementations, i.e., have a short clock period and be scalable to many processors. A maximum configuration of the first implementation of the architecture will have 256 processors, 512 memory units, 256 I/O processors, 4096 interconnection network nodes, and a clock period of less than 3 ns.

Second, it was important that the architecture be applicable to a wide spectrum of problems. Programs that do not vectorize well, perhaps because of a preponderance of scalar operations or too-frequent conditional branches, will execute efficiently as long as there is sufficient parallelism to keep the processors busy. Virtually any parallelism available in the total computational workload can be turned into speed, from operation-level parallelism within program basic blocks to multi-user time and space sharing.

A third goal was ease of compiler implementation. Although the instruction set does have a few unusual features, they do not seem to pose unduly difficult problems for the code generator. There are no register or memory addressing constraints and only three addressing modes. Condition code setting is consistent and orthogonal. Although the richness of the instruction set often allows several ways to do something, the variation in their relative costs as the execution environment changes tends to be small.

Because the architecture permits the free exchange of spatial and temporal locality for parallelism, a highly optimizing compiler may work hard at improving locality and trade the parallelism thereby saved for more speed. On the other hand, if there is sufficient parallelism, the compiler has a relatively easy job.

The Tera Multiprocessor and Sparse Three-Dimensional Torus:-

The interconnection network is a three-dimensional, sparsely populated torus of pipelined packet-switching nodes, each of which is linked to some of its neighbors. Each link can transport a packet containing source and destination addresses, an operation, and 64 data bits in both directions simultaneously on every clock tick. Some of the nodes are also linked to resources, i.e., processors, data memory units, I/O processors, and I/O cache units. Instead of locating the processors on one side of the network and the memories on the other, the resources are distributed more or less uniformly throughout the network. This permits data to be placed in memory units near the appropriate processor when possible and otherwise generally maximizes the distance between possibly interfering resources.

The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 × 16 × 16 toroidal mesh; i.e., the mesh "wraps around" in all three dimensions. Of the 4096 nodes, 1280 are attached to the resources, comprising 256 processors, 512 data memory units, 256 I/O cache units, and 256 I/O processors. The 2816 remaining nodes do not have resources attached but still provide message bandwidth.
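The node counts above can be checked with a few lines of arithmetic. This is only a sketch: the function name and parameters are hypothetical, and the resource counts are the ones quoted in this answer for the maximum configuration (256 processors, 512 memory units, 256 I/O cache units, 256 I/O processors).

```python
def torus_node_counts(side=16, processors=256, memory_units=512,
                      io_caches=256, io_processors=256):
    """Return (total nodes, nodes with a resource, nodes without one).

    Assumes one network node per attached resource, as the text implies.
    """
    total = side ** 3  # 16 x 16 x 16 toroidal mesh
    attached = processors + memory_units + io_caches + io_processors
    return total, attached, total - attached


total, attached, bare = torus_node_counts()
print(total, attached, bare)  # 4096 1280 2816
```

Under these assumptions, fewer than a third of the nodes carry a resource, which is what "sparsely populated" refers to.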

Any plane bisecting the network crosses at least 256 links, giving the network a data bisection bandwidth of one 64-bit data word per processor per tick in each direction. This bandwidth is needed to support shared-memory addressing in the event that all 256 processors are addressing memory on the other side of some bisecting plane simultaneously.

As the Tera architecture scales to larger numbers of processors $p$, the number of network nodes grows as $p^{3/2}$ rather than as the $p \log p$ associated with the more commonly used multistage networks. For example, a 1024-processor system would have 32,768 nodes. The reason for the overhead per processor of $p^{1/2}$ instead of $\log p$ stems from the fact that the system is limited by the speed of light.

One can argue that memory latency is fully masked by parallelism only when the number of messages being routed by the network is at least $p \cdot l$, where $l$ is the latency. Since messages occupy volume, the network must have a volume proportional to $p \cdot l$; since the speed of light is finite, the volume is also proportional to $l^3$, and therefore $l$ is proportional to $p^{1/2}$ rather than $\log p$.
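The argument can be written as a short chain of proportionalities. Here $p$ is the number of processors and $l$ the latency, as in the text; the symbol $V$ for network volume is introduced only for this sketch.

```latex
\begin{align*}
V &\propto p\,l   && \text{(messages in flight occupy volume)} \\
V &\propto l^{3}  && \text{(finite speed of light: distance covered scales with } l\text{)} \\
p\,l &\propto l^{3} \;\Longrightarrow\; l \propto p^{1/2} \\
\text{nodes} &\propto p\,l \propto p^{3/2}
\end{align*}
```

As a check against the figures quoted above, $256^{3/2} = 4096$ and $1024^{3/2} = 32{,}768$.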

Superpipelined Support:-

The Tera computer can execute multiple instruction streams simultaneously in each processor. In the current implementation, as few as 1 or as many as 128 program counters may be active at once. On every tick of the clock, the processor logic selects a thread that is ready to execute and allows it to issue its next instruction. Since instruction interpretation is completely pipelined by the processor, and by the network and memories as well, a new instruction from a different thread may be issued during each tick without interfering with its predecessors.

When an instruction finishes, the thread to which it belongs becomes ready to execute the next instruction. As long as there are enough threads in the processor so that the average instruction latency is filled with instructions from other threads, the processor is being fully utilized.

If a thread were not allowed to issue its next instruction until the previous instruction had completed, then approximately 70 different threads would be required on each processor to hide the expected latency. The lookahead described later allows a thread to issue multiple instructions in parallel, thereby reducing the number of threads needed to achieve peak performance.
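The relationship between thread count and utilization can be illustrated with a toy simulation. It is a minimal sketch, not a model of the real Tera pipeline: it assumes a fixed latency of 70 ticks for every instruction, one issue slot per tick, round-robin selection among ready threads, and no lookahead. The function name and parameters are hypothetical.

```python
def utilization(num_threads, latency=70, ticks=7000):
    """Fraction of ticks on which some thread issues an instruction.

    Assumption: a thread that issues at tick t is not ready again
    until tick t + latency (no lookahead).
    """
    ready_at = [0] * num_threads  # earliest tick each thread may issue
    issued = 0
    start = 0                     # round-robin starting point
    for tick in range(ticks):
        for k in range(num_threads):
            t = (start + k) % num_threads
            if ready_at[t] <= tick:
                ready_at[t] = tick + latency
                issued += 1
                start = (t + 1) % num_threads
                break             # at most one issue per tick
    return issued / ticks


for n in (1, 35, 70, 128):
    print(n, round(utilization(n), 3))
```

In this simplified model the processor is fully used once the number of threads reaches the latency in ticks, and utilization falls roughly in proportion when fewer threads are available.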

As the figure shows, each instruction can specify three operations, which execute simultaneously on a processor. The M-pipeline is for memory-access operations, the A-pipeline is for arithmetic operations, and the C-pipeline is for control or arithmetic operations. The instructions are 64 bits wide. If more than one operation in an instruction specifies the same register or setting of condition codes, the priority is M > A > C.

It has been estimated that a peak speed of 1G operations per second can be achieved per processor if driven by a 333-MHz clock. However, a particular thread will not exceed about 100M operations per second because of interleaved execution. The processor pipeline is rather deep, about 70 ticks, as compared with 8 ticks in the HEP pipeline.
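The peak figure follows from the clock rate and the three operations per instruction. The sketch below reproduces that arithmetic; the function names are hypothetical, and the single-thread case assumes, purely for illustration, that a thread without lookahead issues one instruction per pipeline depth.

```python
def peak_ops_per_second(clock_hz=333e6, ops_per_instruction=3):
    """Peak rate when one three-operation instruction issues every tick."""
    return clock_hz * ops_per_instruction


def thread_ops_per_second(ticks_between_issues, clock_hz=333e6,
                          ops_per_instruction=3):
    """Rate for a single thread that issues once every given number of ticks."""
    return clock_hz * ops_per_instruction / ticks_between_issues


print(peak_ops_per_second() / 1e9)        # 0.999, i.e. about 1G
print(thread_ops_per_second(70) / 1e6)    # about 14.3M with no lookahead
```

Under that assumption a lone thread falls well short of the roughly 100M operations per second quoted above, which suggests how much the per-thread rate depends on lookahead.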

Compare the advantages and potential drawbacks of Tera Computer.

Ans:-

Advantages:-
The Tera uses multiple contexts to hide latency.

The Tera machine performs a context switch every clock cycle.

Here, both pipeline latency and memory latency are hidden in the HEP/Tera approach.

The major focus is on latency tolerance rather than latency reduction.

Thread creation must be very cheap.

With 128 contexts per processor, a large number (2K) of registers must be shared finely between threads.

Tagged memory and registers with full/empty bits are used for synchronization.

As long as there is plenty of parallelism in user programs to hide latency and plenty of compiler support, the performance is potentially very high.
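To make the full/empty-bit idea concrete, here is a minimal software sketch of one tagged memory cell. The text only states that full/empty bits are used for synchronization; the producer-consumer semantics below (write only when empty, read only when full) are an assumed, commonly described convention, and the class and method names are hypothetical.

```python
class FullEmptyCell:
    """One memory word tagged with a full/empty bit (illustrative only)."""

    def __init__(self):
        self.full = False
        self.value = None

    def try_write(self, value):
        """Store a value only if the cell is empty; then mark it full."""
        if self.full:
            return False          # a real thread would retry later
        self.value, self.full = value, True
        return True

    def try_read(self):
        """Fetch the value only if the cell is full; then mark it empty."""
        if not self.full:
            return False, None    # nothing produced yet
        value, self.value, self.full = self.value, None, False
        return True, value


cell = FullEmptyCell()
print(cell.try_read())     # (False, None): consumer arrived first
print(cell.try_write(42))  # True: producer fills the cell
print(cell.try_read())     # (True, 42): consumer drains it
```

The point of the sketch is that the tag travels with the data, so a producer and a consumer can synchronize on a single word without a separate lock.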

Drawbacks:-

Performance is poor when parallelism is limited; in particular, single-context performance is guaranteed to be low.

A large number of contexts demands lots of registers and other hardware resources which in turn implies higher cost and complexity.

Finally, the limited focus on latency reduction and caching entails lots of slack parallelism to hide latency as well as lots of memory bandwidth; both raise the cost of building the machine.

Explain the thread state and management scheme used in Tera computer.

Ans:-

Thread State and Management:-

As the figure shows, each thread has the following state associated with it:

One 64-bit stream status word (SSW).

Thirty-two 64-bit general-purpose registers (R0-R31).

Eight 64-bit target registers (T0-T7).

Context switching is so rapid that the processor has no time to swap the processor-resident thread state. Instead, it has 128 of everything, i.e., 128 SSWs, 4096 general-purpose registers, and 1024 target registers. It is appropriate to compare these registers in both quantity and function to vector registers or words of cache in other architectures. In all three cases, the objective is to improve locality and avoid reloading data.
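The per-processor totals follow directly from the per-thread state. The sketch below models that state as a plain data structure; the class and function names are hypothetical, and nothing here reflects how the hardware actually organizes its register files.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """Processor-resident state of one thread, as listed above."""
    ssw: int = 0                                          # 64-bit stream status word
    r: list = field(default_factory=lambda: [0] * 32)     # R0-R31
    t: list = field(default_factory=lambda: [0] * 8)      # T0-T7


def processor_state(num_threads=128):
    """One resident copy of the thread state per hardware thread."""
    return [ThreadState() for _ in range(num_threads)]


state = processor_state()
print(len(state))                      # 128 SSWs
print(sum(len(s.r) for s in state))    # 4096 general-purpose registers
print(sum(len(s.t) for s in state))    # 1024 target registers
```

Because every thread keeps its own copy, switching threads only means selecting a different entry, with nothing to save or restore.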

Program addresses are 32 bits in length. Each thread's current program counter is located in the lower half of its SSW. The upper half describes various modes (e.g., floating-point rounding, lookahead disable), the trap disable mask (e.g., data alignment, floating overflow), and the four most recently generated condition codes.
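A small sketch shows how a 64-bit SSW splits into the two halves just described. Only the split itself comes from the text; the bit positions of the individual mode, trap-mask, and condition-code fields are not given here, so the upper half is treated as an opaque 32-bit value. The helper names are hypothetical.

```python
LOW32 = (1 << 32) - 1
WORD64 = (1 << 64) - 1


def split_ssw(ssw):
    """Return (upper half: modes/trap mask/condition codes, lower half: PC)."""
    return (ssw >> 32) & LOW32, ssw & LOW32


def set_pc(ssw, pc):
    """Replace the program counter while leaving the upper half untouched."""
    return (ssw & ~LOW32 & WORD64) | (pc & LOW32)


ssw = 0x000000A5_00001000
print([hex(x) for x in split_ssw(ssw)])   # ['0xa5', '0x1000']
print(hex(set_pc(ssw, 0x2000)))           # 0xa500002000
```

A control transfer that uses only the low 32 bits of a target register corresponds to `set_pc` here: the new PC is installed and the status bits stay as they were.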

The target registers are used as branch targets. The format of the target registers is identical to that of the SSW, though most control transfer operations use only the low 32 bits to determine a new PC. Separating the determination of the branch target address from the decision to branch allows the hardware to prefetch instructions at the branch targets, thus avoiding delay when the branch decision is made.

One target register (T0) points to the trap handler, which is nominally an unprivileged program. When a trap occurs, the effect is as if a coroutine call to T0 had been executed. This makes trap handling extremely lightweight and independent of the operating system. Trap handlers can be changed by the user to achieve specific trap capabilities and priorities without loss of efficiency.




Copyright © 2000- 2007 Lay Networks All rights reserved. 