I've been having many discussions with people from the Research & Education community - TeraGrid Science Gateway providers, individual users, computer center directors, etc. - regarding the notion of taking advantage of some new and interesting storage and computing web services such as Amazon's S3 and EC2. Google, Microsoft, eBay, and others are surely going to provide new web services in this space. Further, anyone paying moderate attention will also see that technology provider companies (IBM, EMC, Platform, Univa, etc.) are introducing powerful building blocks aimed at building service-oriented systems (e.g. "Grids"). Some (especially end users!) respond with enthusiasm - while others respond along the lines of "we can do it ourselves cheaper" or "the performance isn't good enough."
I think these responses are true to some extent, but they ignore some important factors. The first is Moore's Law: today's price is irrelevant, because prices based on technology (like disk or CPU or bandwidth) drop rapidly over time. (Imagine if the $100,000 price tag on a visualization workstation twenty years ago had stopped us from developing imaging tools...) What we have typically done in this community is to ask what the computational environment of the future will look like, and then design and plan around that future - not the present. That's how you invent the future rather than just reacting to change as it hits you.
The second is mistaking oranges for apples, and thus making an apples-to-oranges comparison. Take Amazon S3. It's way, way more expensive than buying a disk drive, especially if you already operate a large computing facility. But is it the same thing? Not if your storage server doesn't provide a web services interface! Does that matter? Only if your users want a web services interface, or if you want to develop a workflow or other sophisticated capability built on web services. Many users I've spoken with say they do!
Let's look at an example. If you don't already run a storage service, what's the best way to share something like a 5 TeraByte data collection with colleagues spread around the Internet? To set up a server with 5 TB of disk and a sensible backup system (if you care about that - otherwise the calculations change), you'll pay about the same as the storage cost of putting the data in S3 for three years. The open question is data transfer: if you're sharing the 5 TB with thousands of users, you may be better off hosting it yourself due to the S3 I/O charges. But if you're sharing with a small community, with modest needs in terms of moving data out and in, then S3 is likely much cheaper than rolling your own - unless your system administration staff work for free.
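The trade-off above can be sketched as a back-of-envelope cost model. The rates below are illustrative assumptions, not quoted Amazon prices, but they show how the answer flips from "host it yourself" to "use S3" as the outbound transfer volume shrinks:

```python
# Back-of-envelope comparison: keeping a 5 TB collection in an
# S3-style service versus self-hosting. All dollar rates here are
# illustrative assumptions, not actual published prices.

TB_TO_GB = 1000  # decimal terabytes, as storage vendors count them


def s3_cost(tb_stored, months, tb_out_per_month,
            storage_rate=0.15,    # $/GB-month stored (assumed)
            transfer_rate=0.15):  # $/GB transferred out (assumed)
    """Total service cost: storage over time plus outbound transfer."""
    storage = tb_stored * TB_TO_GB * storage_rate * months
    transfer = tb_out_per_month * TB_TO_GB * transfer_rate * months
    return storage + transfer


# Small community: 5 TB stored for 3 years, ~1 TB/month going out.
small_community = s3_cost(tb_stored=5, months=36, tb_out_per_month=1)

# Heavy sharing: same collection, 50 TB/month going out to many users.
heavy_sharing = s3_cost(tb_stored=5, months=36, tb_out_per_month=50)
```

With these assumed rates, the storage component is identical in both cases; it is the I/O charge that dominates for the heavily shared collection, which is exactly why self-hosting starts to win once thousands of users are pulling data. A realistic comparison would also put the sysadmin salary and backup hardware on the self-hosting side of the ledger.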
I believe that TeraGrid and similar initiatives must seriously investigate what a partnership with (web/grid) "service providers" might look like. While these services do not address the requirements of users who need multiple Teraflops of computing or tens of Terabytes of storage, they just may offer something for the many people who want to share smaller amounts of data, or who have intermittent needs for rapidly accessible, modest computing power.
TeraGrid is focused, rightly, on providing for Petascale computational, storage, and data analysis services. For the Gigascale stuff, perhaps we should think about a new type of "resource provider" - Amazon.edu?