Tuesday, May 10, 2011

Computing and supercomputing, 1974 and 2011

This video is Seymour Cray talking to LANL in 1974 (not 1976 like the caption says, and also not his last surviving public speech, but still amazing):


And here's another video, from 15 years later, where he's talking about the Cray-3:


The Cray-3 was technically successful, but it ran into so many production difficulties that the company nearly went bankrupt in 1989 and decided to shelve the entire Cray-3 project, releasing instead an incremental upgrade to the Cray X-MP (itself an incremental upgrade to the Cray-1) with the faster memory Cray had designed for the Cray-3 (using COTS parts).

Cray hated incremental designs and was very confident in the Cray-3, so he split and formed a new company... again... which also went bankrupt, in 1995, after only one machine had been delivered.

Now there's a man I wish I could have been friends with. I've known a few of the guys who worked with him, still in the field today, and man they have some stories.

So, just for giggles, here's a quick and dirty history and comparison between the supercomputers of 1974 and the desktop computers of today.

In 1976, once they had fixed the parity memory and worked the bugs out of the Cray-1, it achieved a sustained performance of about 250 megaflops (million floating point operations per second) at a clock speed of 80 MHz, making it the fastest computer in the world at the time (its direct predecessors, the CDC 7600 and CDC STAR-100, were both capable of about 35 MFLOPS on standard workloads; the STAR-100 could hit 100 MFLOPS, but only on specially optimized workloads).

They sold about 80 of them, at $8 million or so apiece (no two Crays were ever exactly alike, nor did any two ever cost the same), and it remained the fastest computer in the world for about five years.

The first machine capable of a sustained performance of 1 gigaflop was also a Cray: a specially modified X-MP/48, in late 1984 or early 1985 (the standard model was capable of either 800 MFLOPS or 940 MFLOPS, depending on when it was manufactured). The X-MP/48 stayed the fastest for about 12 months, until the Cray-2 had its bugs ironed out and ran 4 GFLOPS (the Russians had a machine that could run 2 GFLOPS, but it wasn't a general purpose computer, having been built specially to run certain aerodynamic calculations). It ran at 105 MHz and cost about $15 million, depending on the configuration.

In 1985, the fastest PCs ran at about 8 MHz (an 80286 with an 80287 floating point coprocessor) and could, in theory, manage about 0.1 megaflops (100 kiloflops).
A note: supercomputer numbers from before 1993 are not necessarily consistent or directly comparable.
It wasn't until 1993 that a variant of the LINPACK benchmark became the international standard for supercomputer comparison, although it was in common use from the mid '80s forward.
The PC numbers here aren't LINPACK either; there are too many variables in PC construction and performance (particularly I/O performance), so they are not directly comparable with supercomputer LINPACK numbers. It also wasn't really until the late '90s that LINPACK benchmarks were commonly run on PCs.
A huge component of a supercomputer's speed is the truly massive I/O and node interconnect capacity it has, and even today's supercomputers cannot feed their CPUs fast enough to use their entire theoretical capacity. PCs, even today, have only a tiny fraction of that I/O capacity, and because of it they can typically use only a small percentage of their CPUs' theoretical maximum processing power on general workloads.
Geekbench is the current standard for PC benchmarks, and its numbers run anywhere from roughly comparable to 3-5 times what a machine will get on LINPACK; but both are much lower than the fastest a CPU can go on workloads that don't have I/O bottlenecks (like running the same piece of data that fits in main memory through the same instructions that fit in cache, over and over again... a common pattern in scientific computing, graphics, etc.).
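
To make that concrete, here's a rough sketch of how those different numbers relate to each other. The chip specs and efficiency fractions are made-up but representative figures, chosen to roughly match the ratios described above (and the 2-5 GFLOPS figure quoted further down), not measurements of any real machine:

```python
# "Theoretical peak" FLOPS is just arithmetic: cores x clock x floating point
# results per clock. Everything you actually measure is some fraction of that.
# All the specific figures below are illustrative assumptions, not measurements.

def peak_gflops(cores, clock_ghz, flops_per_core_per_cycle):
    """Theoretical peak in GFLOPS."""
    return cores * clock_ghz * flops_per_core_per_cycle

# A hypothetical 2011-era quad core with 256-bit SIMD units, i.e. roughly
# eight double precision FLOPs per core per clock:
peak = peak_gflops(cores=4, clock_ghz=3.4, flops_per_core_per_cycle=8)  # ~109 GFLOPS

linpack_on_a_normal_pc = 4.0                          # assumed: in the 2-5 GFLOPS range
geekbench_style        = 4 * linpack_on_a_normal_pc   # assumed: ~3-5x the LINPACK number
cache_resident_loops   = 0.6 * peak                   # assumed: I/O-free, fits in cache

for name, gflops in [("theoretical peak", peak),
                     ("cache-resident optimized loops", cache_resident_loops),
                     ("Geekbench-style score", geekbench_style),
                     ("LINPACK on a normal PC", linpack_on_a_normal_pc)]:
    print(f"{name:30s} ~{gflops:6.1f} GFLOPS")
```
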
Early PC CPUs were focused strongly on integer processing and had very little floating point horsepower. It wasn't until 1989's 80486DX that mainstream CPUs even had dedicated floating point units (the lower end SX models didn't); and, not coincidentally, the 486DX/33 was the first mainstream CPU that could push 1 MFLOPS.

Apple didn't get a machine with a dedicated FPU until the Quadra 630 of 1994 (a 68040 CPU at about 3 MFLOPS); and mainstream Macs didn't get an FPU until the switch to the PowerPC 601 and 603 Power Macs of '94 and '95 (which could push about 5 MFLOPS).

The first supercomputer to hit a sustained 100 GFLOPS was the Quadrics APE100 in 1991 (yes, a roughly 100-fold increase in speed in five years), but it was eclipsed just a few months later; through the '90s supercomputers leapfrogged each other every year (until the market collapsed and most of the supercomputer companies folded in the middle to late '90s).

It wasn't until 1994's Pentium 100 that a desktop CPU could push 10 MFLOPS. At the same clock speed as a standard Pentium, though, the Pentium MMX could do about double that on MMX-optimized workloads (a 133 MHz MMX part could push 25 MFLOPS).

In 1996, the Pentium Pro 200 could push 50 MFLOPS on optimized workloads.

The first teraflop machine was Intel's ASCI Red, built for Sandia National Laboratories in 1996 and capable of 1.4 teraflops in its original configuration (using 4,510 Pentium Pro processors at 200 MHz). It was rebuilt in 1999 with 9,280 specially modified Pentium IIs at 300 MHz and achieved 2.4 TFLOPS. It cost something like $25 million, and it was the fastest supercomputer in the world for over three years.


Apple's G3 PowerPC (actually an IBM PowerPC 750) was the first mainstream desktop CPU to break 100 MFLOPS on optimized workloads, in 1997.

Apple kept the floating point crown over Intel with the G4 as well: in 2000 the 500 MHz G4 broke 200 MFLOPS on general workloads and came in just under 1 GFLOPS on optimized workloads, something Intel couldn't do with the Pentium III, or even the Pentium 4, until higher clocked parts arrived in late 2002.

Also in 2002, Apple offered mainstream configurations of the G4 with dual processors, allowing a standard production desktop machine (admittedly a heavily optimized one) to break a gigaflop for the first time.

The G4 was also the first PC that could credibly claim to have somewhere near the calculating capability a Cray-1 had in 1976 (hence the "desktop supercomputer" ad campaign Apple ran at the time); though of course it still had only a small fraction of the I/O.


Fast forward to today.
Note: these numbers are for highly optimized workloads, using special GPU drivers meant for high performance computing, special high speed interconnects also designed for high performance computing, and so on.
On normal PCs configured with disk drives, normal networking, and a normal operating system, you will only see something like 10% of this performance even under the best conditions, and more like 5% on general workloads.
Realistically, for anything but relatively small datasets that fit in main memory (say 4-24 GB), running instruction sequences that fit in cache, PC I/O bottlenecks prevent sustained high performance.
For example, the fastest Core i7 CPUs can pump out about 120 GFLOPS, but only for a couple of seconds at most... maybe only a few hundred milliseconds... before they have to go out to disk for more data to process. If you ran LINPACK on a PC with that CPU, 8 GB of RAM, a fast SSD, and a normal OS, you would only see sustained performance in the 2-5 GFLOPS range (excluding the GPU; conventional OSes don't let LINPACK use the GPU).
To see the real maximum performance of the CPU and GPU, you need to run the systems diskless, with high speed interconnects, as part of a high performance computing cluster; and even then you'll probably only see 20% or so of the maximum number on LINPACK, because of the way the benchmark is structured, and maybe 60% under highly optimized workloads.
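
To put some rough numbers on that starvation problem, here's a quick back-of-the-envelope sketch. Every figure in it (dataset size, SSD speed, how much math gets done per byte of data) is an assumption picked for illustration, not a measurement of any particular machine:

```python
# Back-of-the-envelope version of the starvation argument above. All numbers
# are illustrative assumptions chosen to roughly match the figures in the text.

peak_gflops    = 120.0   # assumed CPU peak, as quoted above
dataset_gb     = 8.0     # assumed working set: a full 8 GB of main memory
ssd_gb_per_sec = 0.5     # assumed sequential SSD read rate (~500 MB/s)
flops_per_byte = 10.0    # assumed arithmetic intensity of the workload

work_gflop   = dataset_gb * flops_per_byte   # ~80 GFLOP of useful math per pass
compute_time = work_gflop / peak_gflops      # ~0.7 s of actual number crunching
refill_time  = dataset_gb / ssd_gb_per_sec   # ~16 s spent waiting on the SSD

sustained = work_gflop / (compute_time + refill_time)
print(f"compute {compute_time:.1f} s, refill {refill_time:.1f} s")
print(f"sustained ~{sustained:.1f} GFLOPS vs. a {peak_gflops:.0f} GFLOPS peak")
# -> a few GFLOPS sustained, i.e. a few percent of peak, which is the point.
```
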
In 2011, a typical single processor, quad core, Intel based desktop machine selling for around $800 can theoretically push about 100 GFLOPS from the processor and another 250 GFLOPS from its relatively weak GPU (GPUs are essentially highly optimized floating point processors, which is why even a modest one outruns the CPU here), for about 350 GFLOPS aggregate (presuming I/O is taken out of the picture and the system is running as a node in a high performance computing cluster).

Although the benchmarks are not directly comparable, that's still more than 1,000 times the performance of the original Cray-1, at 1/10,000th the price.

I think we beat Moore's law on that one, though not by much (a factor of roughly 10 million, vs. the 8.4 million or so you'd expect from Moore's law); at least in pure compute capacity (vs. I/O, which hasn't come close to keeping up with Moore's law).

A $2,500 high end gaming rig can theoretically push something like 120 GFLOPS from its processor and 2.4 TERAflops (2.4 trillion FLOPS) from its three dual GPU cards, for an aggregate of over 2.5 TFLOPS (as part of an HPC cluster).

That's theoretically about the same performance as the 1999 ASCI Red rebuild, for roughly 1/10,000th the price (beating Moore's law again, this time by about 10,000 to 256; mostly accounted for by the huge jump in floating point performance on PCs in the early to mid 2000s, combined with the late 2000s multicore revolution).
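
For anyone who wants to check the arithmetic on those two Moore's law comparisons, here's the back-of-the-envelope version. It uses the rough prices and FLOPS figures quoted above (all of them estimates) and assumes a doubling every 18 months; the exact predicted factor depends on how you count the doublings, but it lands in the same ballpark either way:

```python
# Sanity check on the two Moore's law comparisons, using this post's rough
# figures and assuming one doubling every 18 months.

def moores_law_factor(years, months_per_doubling=18):
    return 2 ** (years * 12 / months_per_doubling)

# Cray-1 (1976: ~$8M, ~250 MFLOPS) vs. a 2011 $800 desktop (~350 GFLOPS aggregate):
actual_1976   = (350e9 / 250e6) * (8_000_000 / 800)   # ~14 million x, per dollar
expected_1976 = moores_law_factor(2011 - 1976)        # ~10 million x

# ASCI Red rebuild (1999: ~$25M, ~2.4 TFLOPS) vs. a 2011 $2,500 gaming rig (~2.5 TFLOPS):
actual_1999   = (2.5e12 / 2.4e12) * (25_000_000 / 2_500)   # ~10,000 x, per dollar
expected_1999 = moores_law_factor(2011 - 1999)             # 2^8 = 256 x

print(f"1976 -> 2011: ~{actual_1976:,.0f}x actual vs ~{expected_1976:,.0f}x predicted")
print(f"1999 -> 2011: ~{actual_1999:,.0f}x actual vs ~{expected_1999:,.0f}x predicted")
```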

A high end dual processor, quad core system (8 cores total) with three high end video cards (a high end graphics workstation, running something like $10,000) can push about 240 GFLOPS from the CPUs and over 3 TFLOPS from the GPUs, for an aggregate of over 3.25 teraflops (again, if it were part of an HPC cluster).

That's theoretically faster than any supercomputer built before 2000, when IBM's ASCI White arrived (7.2 TFLOPS on LINPACK - 12.3 on optimized workloads - for $110 million).

Oh, and today's fastest supercomputers?

Well, that depends on who you believe: the Chinese are claiming a machine (the Tianhe-1A) that runs at 2.57 petaflops, but a lot of folks who have reviewed the architecture don't believe that number. Otherwise it's Cray again (though Cray isn't really Cray anymore), with their Jaguar system at 1.76 petaflops (about $20 million).

Petaflops.... in 1976 it was 250 million FLOPS; now it's 2.5 quadrillion.

Funny enough though... almost all of the top 500 supercomputers are now built from commodity CPUs and memory and run a variant of Linux.... and it's not particularly hard to get onto the Top500 list (the spread between the top and the bottom of the list is pretty huge... about three orders of magnitude).

In fact, a fair number of the clusters and high performance systems my company runs today would come close to hitting the list if we bothered to run LINPACK on them (we don't; we put 24,000 servers on the floor in 2010, and the smallest of them was a dual quad core box, while the biggest was a cluster with 256 eight-core CPUs); and there have been colleges that built Top500 clusters for well under a million dollars.

Supercomputing entered this world about the same time I did, and in that time we've gone from one guy in the world being able to do this, with hand-built boards and custom chips, taking five years to do it, and having to run 100% custom-developed code from the assembler up; to being literally ten million times faster, with commodity hardware, an open source OS that anyone can run, and a standard set of open source tools.

Amazing how the world changes.