Extract From Byte Magazine - Oct 07 High Performance Systems



Postby Steve Bildermann » Wed Oct 08, 2003 3:25 pm

** Warning: Much Geek Talk Ahead


As Byte is now a subscription service, I thought some people might enjoy reading some extracts.

High Performance Computer Systems (HPCS) have a long history. From the earliest special systems of the 1970s, such as the 360/95, 370/195 and the Cray 1, supercomputers have extended the envelope of current technology, usually at considerable expense and technical challenge. In the 1980s, "minisupers" arrived, a name coined to describe minicomputer-like systems with very high performance for the dollar. They were slower than the fastest systems, but supercomputers nevertheless dropped below $1 million for the first time.

In the 1990s, this evolution seemed to follow a more rational course, in which advances in semiconductor technology plus some special engineering regularly created faster systems without the drama of the early models. The arrival of clusters, groups of microcomputers connected by high-speed networks and running Linux, brought the cost below $100,000 for the first time. The supercomputer proliferation was under way.

By the late 1990s, it looked like evolution rather than revolution would continue to dominate the HPCS class. Faster microprocessors would arrive and be integrated into the next, faster cluster or specially built supercomputer, and that would be it until the next microprocessor upgrade. This position was supported by a widespread belief that "Moore's Law," more of a self-fulfilling observation than a true law, would guarantee faster supercomputers without any extra effort. It was a mirage.

By 2000, the mirage had disappeared. The first clue was the remarkable similarity in size and power between IBM's last 3000 series model from the 70s, a behemoth that required 1400 square feet of space, and one of their 1990s supercomputers for LANL which took up more than 3000 square feet. Right after those systems, later and faster systems suddenly got smaller again, and changed internally.

More clues appeared when IBM announced it was going to build a "Cell processor" for Sony's PlayStation 3, and later announced a supercomputer named "Blue Gene" built around a large number of Cell systems. The final shoe dropped recently when IBM announced it was working on terabit/second chip channels and an adaptable multi-execution chip called "Trips."

What Happened To Easy Upgrades?

The most general description of what happened would describe the life cycle of any technology, from railroad engines to semiconductors: an introduction phase, a growth phase and a maturity phase. The huge expansion of the microprocessor and its supporting integrated circuits in the 1990s extended this technology's normal growth phase, but did not negate its ultimate limits.

There are multiple limits on any fast system, and the system's performance is only as fast as the slowest of them. In the early days, limits were reached on component speed and power, and pushing them meant that the Cray 1 had to be liquid cooled with Freon. Later systems were limited by component density, how many components could fit in a small space, and also by how long the interconnect wires could be before the speed of light limited the system.

One famous demonstration was often given by Commander Grace Hopper. She would hand out wires that were nine inches long, and when everyone had looked at them, she would say, "You are looking at a nanosecond." Nine inches is roughly the distance an electrical signal travels along a wire in a nanosecond. The implication was clear to Hopper, and she taught those who would listen that the speed of light, through wire lengths, would limit computer speeds. As a result, computers would have to get smaller to get faster. Many people found this hard to grasp, especially as this was before the microprocessor revolution.
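To put Hopper's number in perspective, here is a quick back-of-the-envelope check (the 0.7c velocity factor for a signal in copper wire is my own assumption, not a figure from the article):

[code]
# Rough sketch: how far does a signal travel in one nanosecond?
C = 299_792_458        # speed of light in vacuum, m/s
VELOCITY_FACTOR = 0.7  # assumed fraction of c for a signal in copper wire

ns = 1e-9                                        # one nanosecond, in seconds
vacuum_inches = C * ns / 0.0254                  # roughly 11.8 inches in vacuum
wire_inches = C * VELOCITY_FACTOR * ns / 0.0254  # roughly 8.3 inches in wire

print(f"light in vacuum:       {vacuum_inches:.1f} inches per nanosecond")
print(f"signal in wire (0.7c): {wire_inches:.1f} inches per nanosecond")
[/code]

Depending on the wire, the answer lands somewhere between eight and twelve inches, which is why Hopper's nine-inch wire made such a good prop.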

Back To The Past, Again

In 2003, all of these problems are returning for an encore with microcomputers, and they are joined by some new ones. Clock frequencies now exceed 3 GHz in top Intel chips, limiting wire lengths to well below an inch for critical circuits. Power delivery and the heat generated are two sides of the same problem: faster computers need more power to switch circuits faster at the same size. Only smaller circuits can switch faster at lower power. But smaller sizes bring their own problems.

Current top of the line processors generate well in excess of 100 watts, more than a typical lightbulb, on a chip smaller than a quarter of an inch on a side. The chip's power density, watts/cubic inch, approaches that of nuclear power reactors. Clearly heat is rapidly becoming a limiting factor. Liquid cooling and heat pipes are being used to cool the hottest chips. (Haven't we been here before?)
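Taking the article's numbers at face value, the implied density is easy to work out (the die thickness below is my own assumption, not a figure from the article):

[code]
# Rough sketch: power density implied by ~100 W on a die about 0.25 in on a side.
chip_power_w = 100     # "well in excess of 100 watts"
side_in = 0.25         # "smaller than a quarter of an inch on a side"
thickness_in = 0.03    # assumed die thickness, roughly 0.75 mm

area_in2 = side_in ** 2
volume_in3 = area_in2 * thickness_in

print(f"per unit area:   {chip_power_w / area_in2:,.0f} W per square inch")
print(f"per unit volume: {chip_power_w / volume_in3:,.0f} W per cubic inch")
[/code]

Under these assumptions the figure runs to tens of thousands of watts per cubic inch, which is why heatsinks keep getting bigger.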

Lightspeed delays are back, and harder to solve than before. Now the wire routing must be done on-chip, as part of the overall design. If parts of the chip need to be especially fast, they must be close together. This creates conflicting requirements for where functional units are placed on the chip, and the compromises that are required will reduce potential performance.

New Performance Limitations

In addition to all the old problems, we must contend with some new ones which, as Murphy could have forecast, will in turn complicate the old ones. The new hardware problems are:
  • Higher performance means longer execution pipelines, which complicates design and layout.
  • Multiple parallel execution units, which enhance performance, add complexity, routing distance, and power.
  • Multiple threads add more registers and paths to the execution resources.
  • Internal on-die caches, which speed memory access, get larger with every generation of processor. This takes more die space and adds to power and delay.
  • Smaller feature sizes, moving from 130 nanometer to 90 nanometer designs, will require more expensive equipment and create design challenges as the wires and features get smaller. But this will help with power and circuit lengths.

Despite all the problems ahead, planners and semiconductor manufacturers are optimistic that they will solve them and that a new generation of processors will, as usual, outperform the old. But we also have a set of software problems that impact performance:

Branches in an instruction stream, like stopping a car, interrupt the flow of instructions, and there are delays while the instruction stream is refilled from the correct location. The past solution of branch prediction tables worked up to 95 percent accuracy, but multiple streams and threads may require something better to maintain that level of accuracy.
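A tiny calculation shows why those last few points of accuracy matter (branch frequency, refill penalty and base CPI below are assumed round numbers, purely for illustration):

[code]
# Rough sketch: branch prediction accuracy vs. average cycles per instruction.
def effective_cpi(accuracy, base_cpi=1.0, branch_frac=0.20, penalty_cycles=20):
    # Every mispredicted branch stalls the pipeline for penalty_cycles.
    return base_cpi + branch_frac * (1.0 - accuracy) * penalty_cycles

for acc in (0.90, 0.95, 0.99):
    print(f"{acc:.0%} accuracy -> {effective_cpi(acc):.2f} cycles per instruction")
# 90% -> 1.40, 95% -> 1.20, 99% -> 1.04
[/code]

With a deep pipeline, slipping from 95 to 90 percent accuracy costs real performance.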
Multiple threads can come from widely different memory spaces, making bigger caches necessary to offset lower hit rates. They also increase the number of circuits between functional units needed for simultaneous access. More circuits, more space, more delay.
All this complexity adds up to a really difficult design process. The trade-offs are complex and not easy to predict except through simulation, and accurate simulation of a new design takes supercomputer-level performance just to circuit-simulate a few instructions per second.

New processor designs are close to the limits of complexity we can currently manage. Worst of all, increasing complexity is bringing decreasing returns. All of the obvious and many of the not-so-obvious shortcuts and optimizations are already in place. Each new step in complexity leaves less freedom to change, and brings less improvement in performance. Something else has to be done.

New Designs Aim For Simplicity

IBM and Sony's "Cell processor" was the first break from the current paradigm of complex designs for higher performance. Instead of one very complex chip that squeezes the very best from one or two instruction streams, Cell chips are designed around multiple subprocessors on a single chip, operating in parallel on different tasks, or on different parts of the same task.

Another option in the Cell chip is for subprocessors to change their function in response to demand. If there is a stream of integer calculations to perform, additional subprocessors can act as integer processors, increasing the total throughput. The complex part of this chip is the coordination of the subprocessors that are working on a single instruction stream. IBM's Blue Gene supercomputer will be built from Cell processors.
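The programming model this points to, many simple workers splitting one job and combining their results, is easy to sketch (an illustration of the idea in ordinary Python only, not Cell's actual programming interface; the worker count and workload are made up):

[code]
# Rough sketch of the "many subprocessors, one task" idea: split an integer
# workload across parallel worker processes and combine the partial results.
from multiprocessing import Pool

def partial_sum(bounds):
    """One worker ("subprocessor") handling its slice of the integer work."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, workers = 10_000_000, 8          # assumed workload size and worker count
    step = n // workers
    chunks = [(w * step, (w + 1) * step) for w in range(workers)]
    with Pool(workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
[/code]

The hard part, as the article says, is the coordination: deciding how to split the work and how to stitch the partial results back together.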

In the last month, IBM has announced two R&D projects that take the Cell chip to the next level and address another looming performance limit: interchip communications.

Interchip communications are nearing a limit caused by the power and size of the driver circuits that send and receive signals outside the processor chip. Speed requires power, which requires more or bigger circuits. This causes conflicts with heat, power and circuit paths.

IBM and Agilent Technologies are joining in a DARPA-funded research project to create optical links between chips that reach 1 trillion bits per second at first, and 40 terabits per second by 2010. One of the challenges they note is keeping the power requirements down.
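For a sense of scale, here is a quick conversion of those link speeds (the 1 GB example payload is my own, chosen for round numbers):

[code]
# Rough sketch: what the announced optical link speeds mean in bytes.
def describe(bits_per_second, label):
    gib_per_second = bits_per_second / 8 / 2**30
    one_gb_ms = 8e9 / bits_per_second * 1000   # time to move 1 GB (8e9 bits)
    print(f"{label}: about {gib_per_second:,.0f} GiB/s; 1 GB in about {one_gb_ms:.2f} ms")

describe(1e12, "1 Tbit/s (initial target)")
describe(40e12, "40 Tbit/s (2010 target)")
[/code]

Even the initial target moves a gigabyte in well under ten milliseconds.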

The next version of the Cell is being researched by IBM in another DARPA-funded program. The Trips (short for "Tera-op Reliable Intelligently adaptive Processing System") prototypes will include four Trips processors, each containing 16 execution units laid out in a 4x4 grid. Later processors will contain more execution units.

Clearly the future HPCS will contain large numbers of multi-execution-unit chips, interconnected by fast optical links. A group of these components with memory and peripheral interfaces on a single circuit board will be the base component of these next-generation high performance computer systems. They will be a supercomputer in a box.

Real Software Engineering

I want to lead this section off with an amazing bargain/opportunity, one of the best I have seen in my 40 years in the IT industry. A major corporation, IO Software, has released their ArcStyler tool for Model Driven Architecture as a freely downloadable Java package, in a fully functional Community Architect Edition.

I have called this powerful modeling and design tool the solution to "Real Software Engineering." Making this edition freely available as a Java package means that any programming team that wants to build reliable and maintainable software now has every reason to use this version of ArcStyler. Broad use of this tool could change the quality and costs of software across the industry.

More than that, ArcStyler also supports re-engineering of older software in a variety of languages, and provides assistance for rebuilding software in a different language from the one in which it was originally written. Some of these features will not be free, but the ability to solve long-standing software challenges is worth that and more. Strongly recommended.

Open Source Gem Update

On the open source front, I want to add another software gem to my earlier list. It is Quanta Plus, available at http://quanta.sourceforge.net/index.php. This powerful editing tool is not just an HTML editor for Web sites, although it does that very well.

It is a general-purpose editing tool that handles a long list of document types, nine in the current list, from HTML 4.01 Transitional to DocBook 4.2, XML included. On top of that, it offers syntax highlighting for 25 source languages, from C++ to Scilab, plus nine markup languages, from HTML to ColdFusion, nine "other" markups, 13 more under Scripts, and four under Games.

In addition, Quanta Plus is the easiest editor to customize that I have ever used. I wanted to add a few heading icons to the Standard toolbar. It went like this:
  • Click on Settings, then Configure Actions.
  • Select the new tag action, or one you want to change.
  • Click on Place This Action.
  • Select the toolbar (one of six) to place the action on.
  • Modify the action if needed.
  • Click on OK.
In less than five minutes, with no instructions, I had H3, H4 and H5 icons on the Standard toolbar where I do most of my work. My compliments to the original authors, Alexander Yakovlev and Dmitry Poplavsky, and doubly to Eric Laffoon, the current author. An excellent tool that should be in everyone's toolchest.
Great Janet Jackson Breast crash 04 - Survived - check
Great Bandwidth crash 05 - Survived - check
Electric shock treatment 2005-2009 - Survived - check

'High Performance' over-clocked life.

Postby Taro Toporific » Wed Oct 08, 2003 4:07 pm

Steve Bildermann wrote (from Byte):
....Current top of the line processors generate well in excess of 100 watts, more than a typical lightbulb, on a chip smaller than a quarter of an inch on a side. The chip's power density, watts/cubic inch, approaches that of nuclear power reactors. Clearly heat is rapidly becoming a limiting factor. Liquid cooling and heat pipes are being used to cool the hottest chips.....


Just in time for the coming kotatsu season, I've discovered that a shot glass of homemade ume-shu warms to just below the boiling point of alcohol when set on top of the fan of my PowerBook.

Ahhhhhhh, drinking the Nectar-of-the-gods while surfing the net---the joys of an over-clocked life.

Postby Steve Bildermann » Wed Oct 08, 2003 4:53 pm

Image
Great Janet Jackson Breast crash 04 - Survived - check
Great Bandwidth crash 05 - Survived - check
Electric shock treatment 2005-2009 - Survived - check

