Cyberinfrastructure User Advisory Committee (CUAC)

A few months ago the TeraGrid leadership team worked with NSF to create the Cyberinfrastructure User Advisory Committee, or CUAC. The CUAC comprises twelve end-users - consumers - of cyberinfrastructure. We were delighted to find advisors from every one of NSF's science directorates, and we had our first meeting in June. We are beginning to pull together the first report from the CUAC, documenting a series of small-group discussions (the CUAC is loosely organized into three subgroups) held over the past few months. Several themes strike me about this draft (which will eventually be posted at the CUAC website referenced above).

One topic that comes up repeatedly is the need for more training and education on how to use the individual TeraGrid resources, as well as how to use them together, for example in a workflow. Having harnessed TeraGrid, many users are also looking for training and help in analyzing and visualizing their data. Simply put, we need to look hard at how we can expand our training and education offerings.

CUAC members also recommended that we look carefully at the barriers faced by interested potential users, before they even become users. From the point of view of a scientist considering writing his or her first proposal for a TeraGrid allocation, it would be useful to understand the chances that their research can be accelerated by TeraGrid, and the chances that their proposal will result in an allocation. We do have development allocations, or DAC awards, that are very straightforward to propose, so much of this is also a matter of better communication with potential users. (Actually, the DAC process has been wildly successful, with nearly 250 awards granted already this calendar year!)

In a nutshell, communication, training and education are clearly high-priority items for TeraGrid to address in the coming months.

Non sequitur of the week: Louisville Slugger Baseball Bats. After my daughter's cross country meet tomorrow we plan to head for the Louisville Slugger Museum and Factory, which I'm told is a very fun tour, particularly if you are a baseball fan. Speaking of baseball - and it is that time of year - I'm pulling for a twenty-year anniversary world championship this year, which may give you an idea of my baseball leanings. :-)


Exploring VMs

After posting some thoughts on virtual machine technology a few days ago, and spending lots of time thinking about how to leverage commercial services such as Amazon's EC2, I spoke with Kate Keahey from Argonne, who has been working in this area for a while. She gave me a very nice summary of work in the Globus Alliance that I thought would be worth sharing here:

One of the advantages of using virtual machines is the ability to easily and efficiently deploy desired software environments encapsulated in a VM image. This allows resource users to configure the virtual machine images themselves and deploy them on a VM-enabled platform made available by a resource provider. Another feature of interest is that VM tools offer capabilities allowing a resource provider to guarantee the delivery of specific resource quotas (in terms of memory, CPU%, disk, bandwidth, etc.) to a VM -- this facilitates implementing sharing and accounting between different clients. The Globus Virtual Workspaces project leverages these capabilities to provide such controlled sharing and configuration independence (see a recent paper).
The configuration and performance isolation implemented by virtual machines enables a division of labor between resource provider and consumer which has the potential to significantly contribute to the growth and scalability of Grids.

The advantages of using virtual machines in Grid and distributed computing generally are still taking shape as new hypervisor capabilities and new requirements emerge. The VTDC06 workshop, co-hosted with SC06 this year, brings together the virtualization and distributed computing communities to discuss the potential of virtualization in resource management, scheduling, security, and service hosting.


Grid Interoperation (Now?)

About a year ago many of us involved in major grid initiatives and facilities realized that there were many pair-wise discussions about interoperation, and a set of "common themes" emerging from these discussions. This quest for interoperation is driven by two strong needs. First, there are many research teams with collaborators located in different countries, and/or on different continents, with access to multiple grid facilities. How do we help them work together, when doing so often involves using resources in more than one grid facility? A second driver is the practical and technical desire to adopt working solutions from others rather than reinventing them.

Leaders from nine major Grid initiatives met in November 2005 to band together to drive interoperation (pardon the acronyms): TeraGrid (US), OSG (US), DEISA (Europe), NGS (UK), NAREGI (Japan), K*Grid (Korea), PRAGMA (Pacific Rim), APAC-Grid (Australia), and EGEE (Europe).

During a half-day discussion this group identified four areas where the current state of technology, with some coordination on our part, could begin to support interoperation. We formed several task-forces to develop interoperation plans in the areas of:

- Information services
- Job submission
- Data movement
- Authorization

The PRAGMA folks also took the lead in identifying several early-adopter applications to drive these four areas, and we set up an operations task force to capture that experience. Plans in these areas were presented at the Athens GGF meeting in February, and eleven more grid projects joined us (I won't try to list them here in this already acronym-rich post). A tremendous amount of work was done early this year, and we held updates on progress at the Tokyo (May) and Washington, DC (September) GGF meetings.

You can find details on this progress, constantly being updated and expanded as we move forward, at the Grid Interoperation Now (GIN) wiki hosted at the GGF site.

The next steps for this group involve expanding the applications effort to bring in at least another dozen science teams interested in testing what we have put in place and driving it forward. The GIN effort is completely open, and we are always looking for more people to help out - head over to the site and jump right in!


Predicting the (near) Future

This past week we had a quarterly TeraGrid management meeting in Austin at the University of Texas, home of TACC. One of the discussions we had was regarding the growing number of computational resources available to users, and the need to help them sort through the options. A key question for a user is "if I submit my job to this particular TeraGrid machine, and I'd like it to run in the next n minutes, what is the likelihood that it will run in that time?"

It turns out that there are several tools that can give the user a prediction, albeit not with 100% certainty, based on the state and history of the queue in question. Rich Wolski (UC Santa Barbara) and his Network Weather Service project have been doing nice work in this area for quite a while. Rich's work with the VGrADS project has brought us a very nice tool that you can see demonstrated at his demo website.
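To give a flavor of the idea behind such predictions - this is not the actual NWS/BQP algorithm, just a toy illustration with made-up numbers - one could estimate the chance that a job starts within n minutes from the empirical distribution of historical wait times on that queue:

```python
# Toy illustration of queue-wait prediction from history.
# NOT the NWS/BQP method -- just an empirical estimate based on
# hypothetical historical wait times (in minutes) for one queue.

def chance_of_starting_within(history, n_minutes):
    """Fraction of past jobs on this queue that started within n minutes."""
    started_in_time = sum(1 for wait in history if wait <= n_minutes)
    return started_in_time / len(history)

# Hypothetical wait-time history for one machine's queue:
waits = [5, 12, 45, 3, 90, 20, 7, 60, 15, 30]

print(chance_of_starting_within(waits, 30))  # 0.7
```

The real tools are considerably more sophisticated - they account for queue state, job size, and confidence bounds - but the user-facing question is the same: "how likely is it that my job runs in the next n minutes?"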

We are working to get this capability embedded in the TeraGrid User Portal, and after I sent email to our Science Gateways mailing list I found that many gateway teams were already in the process of making this tool available. For a scientist trying to get work done, it will be a wonderful thing to be able to look across the now more than 20 major computational systems in TeraGrid and get a sense for where his or her job will run soonest!

Non sequitur of the week: Apple Widgets. Well, this is really a non sequitur if you're not a Mac person, but I am, and I had been curious about what was involved in writing Dashboard widgets. I downloaded the Apple Developer kits, which include some nice examples. Since Rich Wolski sent me a tarball of the bqp (OK, not a total non sequitur; see above) command line utilities, I decided to try to make a very simple widget, building on one of Apple's examples. It took about an hour to figure out the basics, and it was kinda fun (I called it "AskRich"). It assumes you put the NWS command line tools in /usr/local/bin on your Mac and executes a hard-coded query, but it's a start. Perhaps next weekend I'll learn how to let the widget user select the options... (If you are interested in seeing the widget, send me email with the word widget in the subject line.)


{Amazon, Google, eBay, Microsoft...}.EDU

I've been having many discussions with people from the Research & Education community - TeraGrid Science Gateway providers, individual users, computer center directors, etc. - regarding the notion of taking advantage of some new and interesting storage and computing web services such as Amazon's S3 and EC2. Google, Microsoft, eBay, and others are surely going to provide new web services in this space. Further, anyone paying moderate attention will also see that technology provider companies (IBM, EMC, Platform, Univa, etc.) are introducing powerful building blocks aimed at building service-oriented systems (e.g. "Grids"). Some (especially end users!) respond with enthusiasm - and some folks have responded along the lines of "we can do it ourselves cheaper" or "performance isn't good enough."

I think these responses are true to some extent, but they also ignore some important factors. The first is Moore's Law. Today's price is irrelevant - prices based on technology (like disk or CPU or bandwidth) get cheaper, rapidly, over time. (Imagine if the $100,000 price tag on a visualization workstation twenty years ago had stopped us from developing imaging tools...) What we have typically done in this community is to ask what the computational environment of the future will look like, and we design and plan around the future - not the present. That's how you invent the future rather than just reacting to change as it hits you.

The second is mistaking oranges for apples, and thus doing an apples to oranges comparison. Take Amazon S3. It's way, way more expensive than buying a disk drive, especially if you already operate a large computing facility. But is it the same? Not if your computing server does not provide a web services interface! Does it matter? Only if your users want a web services interface, or if you want to develop a workflow, or other sophisticated capability with web services. Many users I've spoken with say they do!

Let's look at an example. If you don't already run a storage service, what's the best way to share something like a 5 TB data collection with colleagues spread around the Internet? To set up a server with 5 TB of disk and a sensible backup system (if you care about that; otherwise the calculations change), you'll pay about the same as the storage cost for putting the data in S3 for three years. The open question is data transfer: if you're sharing the 5 TB with thousands of users, you may be better off hosting it yourself due to the S3 I/O charges. But if you're sharing with a small community, with modest needs for moving data in and out, then S3 is likely much cheaper than rolling your own - unless your system administration staff work for free.
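A back-of-the-envelope version of that calculation looks like this. All the prices below are illustrative assumptions (roughly in the neighborhood of published 2006 rates), not quotes, and the transfer volume is hypothetical:

```python
# Back-of-the-envelope comparison point: what does 5 TB in S3 cost?
# All prices here are illustrative assumptions, not quoted rates.

GB_PER_TB = 1000  # decimal terabytes

s3_storage_per_gb_month = 0.15   # assumed $/GB-month
s3_transfer_per_gb = 0.20        # assumed $/GB transferred

def s3_cost(data_tb, months, tb_transferred):
    """Total S3 bill: storage over time plus data moved in/out."""
    storage = data_tb * GB_PER_TB * s3_storage_per_gb_month * months
    transfer = tb_transferred * GB_PER_TB * s3_transfer_per_gb
    return storage + transfer

# 5 TB stored for three years, 10 TB served to a small community:
print(s3_cost(5, 36, 10))  # 29000.0
```

The point of the exercise is the sensitivity to the transfer term: with a small community the storage term dominates, while serving thousands of users would make the I/O charges swamp everything else.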

I believe that TeraGrid and similar initiatives must seriously investigate what a partnership might look like with (web/grid) "service providers." While these services do not address the requirements of users who need multiple Teraflops of computing or tens of Terabytes of storage, they just may offer something for the many people who want to share smaller amounts of data, or have intermittent needs for rapidly accessible, modest computing power.

TeraGrid is focused, rightly, on providing for Petascale computational, storage, and data analysis services. For the Gigascale stuff, perhaps we should think about a new type of "resource provider" - Amazon.edu?

(at Austin)


Virtual Machines and Types of Service for TeraGrid Computing

Foundational capabilities we provide in TeraGrid, such as "roaming" access and a "coordinated" software environment, open new possibilities for more specialized services, and allow TeraGrid, as a system, to respond to supply and demand. For example, a resource provider might elect to increase the "price" of a queue in order to improve turnaround time by reducing demand, or decrease the price to increase demand (and thus utilization).
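As a sketch of what such a pricing response might look like - this is a hypothetical rule, not anything TeraGrid has implemented - a provider could periodically nudge a queue's charge rate toward a target utilization:

```python
# Hypothetical supply-and-demand pricing rule for a queue: raise the
# charge rate when the queue is congested, lower it when the machine
# is underutilized. Purely illustrative, not a TeraGrid policy.

def adjust_price(price, utilization, target=0.8, step=0.1):
    """Nudge the charge rate toward a target utilization."""
    if utilization > target:
        return round(price * (1 + step), 4)   # demand too high: raise price
    if utilization < target:
        return round(price * (1 - step), 4)   # demand too low: lower price
    return price

print(adjust_price(1.0, 0.95))  # 1.1  (busy queue gets pricier)
print(adjust_price(1.0, 0.50))  # 0.9  (idle queue gets cheaper)
```

Run periodically, a rule like this would let turnaround time and utilization find a balance through user choice rather than administrative fiat.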

We also are looking at ways to support on-demand services for urgent computing, through projects like Pete Beckman's Spruce work. The tricky part is servicing an on-demand job when keeping supercomputers on hot standby is not a viable option! We have considered things like offering a 'preemptible' service on a particular resource, where the user is charged at a lower rate in exchange for knowing that his or her job may be killed to make room for an on-demand job.

It is worth considering the use of virtual machine technology for an even better 'preemptible' service, or even to support migration of jobs in the event of an on-demand service request. One might even consider migrating the jobs to a commercial service such as Amazon's EC2!

Many people have demonstrated moving virtual machine images around with virtually no disruption to the application. The TeraGyroid collaboration between TeraGrid and the UK RealityGrid project is an example, and at iGrid2005 Franco Travostino and others demonstrated job migration.

Of course the ideal applications to take advantage of a virtual machine service are those that involve ensembles of single-processor jobs without large data requirements. But we do have a large number of users whose applications fit this very profile, so it is worth investigating such a service. Further down the road we will want to be able to support message passing (multiple-processor parallel) jobs as well as data staging needs of applications that are data-intensive. Not being able to solve those issues just yet shouldn't prevent us from looking at services that solve simpler cases!

(at OGF/GlobusWorld)


A New Tool for Science Gateways

I got an update from TeraGrid Science Gateways director Nancy Wilkins-Diehr and Stuart Martin on an important set of activities related to TeraGrid science gateways. During the past month or so the group has focused on testing a new GRAM audit service, which Stuart has been spearheading. From the relevant Globus Alliance bugzilla post describing the new feature:

An auditing mechanism for WS-GRAM and a proof-of-concept interface to compound audit/TeraGrid accounting database queries has been created using OGSA-DAI at the request of the TeraGrid infrastructure team. The next step is to actually deploy these components on TeraGrid to get a working example. This will provide a fully integrated proof of concept for the entire setup as well as allow TeraGrid people to use it and report back on how they would like to use it (i.e. what specific queries will they need). Additional campaigns may need to be created to add additional OGSA-DAI activities to support the desired query set.

What does this mean in practical terms, and why is it important to gateway providers?

TeraGrid is funded by the National Science Foundation as a service to researchers, who are allocated access based on peer review. That peer review process takes into account the scientific progress enabled by work associated with an allocation, or project. TeraGrid accounting systems keep track of usage for each individual job that has been executed, associating the usage with a specific allocation (project). Traditionally, a project will have a handful of users associated with it, and a principal investigator who keeps track of what science has been accomplished with that allocation. TeraGrid provides tools for the principal investigator to track usage, and the principal investigator works with his or her team to manage the allocation.

But science gateway principal investigators will use a community allocation to support a very large team of users - potentially hundreds! How does that gateway provider keep track of what is being accomplished with the allocation? Some science gateways may track usage on a per-user basis; others may track on a per-application basis (i.e. tracking usage by function, or application service rather than user, where an application, or function of the gateway, may be one of a small number of tools made available to the community through the gateway).

To track usage at this level, a gateway provider must be able to associate a grid job identifier (associated with the user or application service at the gateway) with the job entry in the TeraGrid accounting system. The current TeraGrid accounting systems report only the local job id on the TeraGrid resource, but have no information about the grid job id on the gateway end. Thus there is no way to correlate individual jobs in the TeraGrid accounting system with individual actions taken at the gateway (e.g. with users or applications).

The GRAM audit capability maintains this correlation in a database, allowing the gateway to retrieve usage information from the TeraGrid accounting system as well as mappings to individual users/applications from the audit database. This capability is in beta test today with some of the TeraGrid science gateways, and will be more tightly integrated with the accounting system, for example through the existing usage query mechanisms available to users.
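In essence, the audit database supplies the missing join between a gateway's grid job id and the accounting system's local job id. A minimal sketch of that correlation - the record layouts and field names below are hypothetical, not the actual GRAM audit schema - might look like:

```python
# Sketch of correlating gateway jobs with TeraGrid accounting records
# via an audit mapping. Field names and layouts are hypothetical.

audit_db = [  # what the audit service records: grid job id -> local job id
    {"grid_job_id": "https://gw.example.org/job/42", "local_job_id": "pbs.1001"},
    {"grid_job_id": "https://gw.example.org/job/43", "local_job_id": "pbs.1002"},
]

accounting_db = [  # what accounting records: local job id -> usage
    {"local_job_id": "pbs.1001", "cpu_hours": 128.0},
    {"local_job_id": "pbs.1002", "cpu_hours": 64.0},
]

def usage_by_grid_job(audit, accounting):
    """Join audit and accounting records on the local job id."""
    hours = {rec["local_job_id"]: rec["cpu_hours"] for rec in accounting}
    return {rec["grid_job_id"]: hours.get(rec["local_job_id"]) for rec in audit}

print(usage_by_grid_job(audit_db, accounting_db))
# {'https://gw.example.org/job/42': 128.0, 'https://gw.example.org/job/43': 64.0}
```

With this join in hand, a gateway can roll usage up per user or per application service, which is exactly the accounting view a community allocation needs.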

This is a nice example of the symbiotic relationship TeraGrid has with the middleware community, in which a capability important not just for TeraGrid but for other Grid projects is created. Working with the open-source Globus Alliance, we were able to implement a new and necessary service in relatively short order, leveraging the OGSA-DAI work led by the UK e-Science programme and standardized in the Open Grid Forum (formerly the Global Grid Forum).



I had a conference call with Scott Lathrop (director of TeraGrid's external communications, education, outreach, and training) and a subgroup of the Cyberinfrastructure User Advisory Committee (CUAC) the other day. This subgroup is focusing on issues related to training, communications, education, and outreach.

We spent some time discussing strategies for expanding and improving on-line training for TeraGrid as well as the on-line documentation in general. Over the past year our external communications team has made tremendous improvements to the website, and they continue to do so. Is there a way to improve it even more, and make the information more fresh?

One approach we talked about was the use of technology such as is used for Wikipedia, allowing our team of experts and editors to be effectively expanded to include any member of the community. But can such an approach work for TeraGrid? Will the information be accurate?

Stanford's Roy Pea, one of our CUAC advisors, did an interesting experiment using the Wikipedia technology to engage a community of students to build a site for one of his graduate courses. He notes that common concerns to this approach include quality and accuracy, but these are challenges to address rather than fatal flaws to the approach.

Nature did a study comparing Wikipedia with the Encyclopaedia Britannica in late 2005. Forty-two pairs of science articles - the Wikipedia version and the Britannica version - were sent to reviewers, who were not told which was which. Reviewers found on average about four errors per article in the Wikipedia version and about three per article in Britannica. In this Nature article, author Jim Giles writes:

Only eight serious errors, such as misinterpretations of important concepts, were detected in the pairs of articles reviewed, four from each encyclopaedia. But reviewers also found many factual errors, omissions or misleading statements: 162 and 123 in Wikipedia and Britannica, respectively.
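The per-article averages I quoted follow directly from the totals in the excerpt (162 and 123 errors across 42 articles each):

```python
# The averages quoted above, computed from the totals in the Nature quote.
wikipedia_errors, britannica_errors, articles = 162, 123, 42

print(round(wikipedia_errors / articles, 2))   # 3.86 -> "about 4 per article"
print(round(britannica_errors / articles, 2))  # 2.93 -> "about 3 per article"
```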

So Wikipedia isn't quite up to the Britannica standard, but it's pretty close. My sense is that the Wikipedia approach to on-line training and documentation for cyberinfrastructure would give us much more up-to-date information, and would make it richer in many cases as domain experts contribute. At the same time, professional editors have raised concerns about quality, and Wikipedia's founder, Jimmy Wales, expressed the need to focus on quality at this year's Wikimania conference. I think for cyberinfrastructure such as TeraGrid, the best approach will be to combine the strengths of our editors and writers with the input of the community. A "TeraWikiPedia" is likely to deepen and improve our documentation and online training much more rapidly - and allow it to adapt in near-real time. It will in fact mean we rely even more heavily on our editors and writers to curate and polish the content.

I'd like to see us try the Wikipedia approach with a particular set of materials, such as our education, outreach, and training materials, to see how it goes. Based on our experience there we'll have a better idea of how best to harness the creativity of the community for all of our online content.

Non sequitur of the week: Global Positioning System (GPS). One of my favorite toys is my Garmin Legend Cx handheld GPS unit. Besides using it for navigation (thus never having to ask directions!), I "collect" two kinds of waypoints. One kind is what I'd call a souvenir waypoint - things that you don't need a GPS to find. For example, Tokyo Station. The second kind is much more useful - places I'd like to go back to, or point others to, that are not necessarily easy to find. My favorite coffee shop (41.89529N, 12.48019E), an excellent local artisan's pottery shop (48.68931N, 122.95795W), or a friend's office (35.27568S, 149.12085E). A fun thing to do with your waypoints is to mash them up into a Google map, which is an easy way to share them with friends. I use Mapbuilder to do this. In fact, a couple of friends and I are using a shared map there to assemble our favorite waypoints (so we can harvest the benefits of one another's exploration!). Geeky? You betcha.
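If you want to share a single waypoint without a mashup tool, a coordinate pair can be turned into a plain map link; the ?q=lat,lon URL form below is an assumption about Google's query interface, and the only trick is remembering that south latitudes and west longitudes are negative:

```python
# Turn (latitude, longitude) waypoints, with N/S/E/W hemispheres, into
# map links. The ?q=lat,lon URL form is an assumption about Google's
# query interface.

def to_signed(value, hemisphere):
    """South latitudes and west longitudes are negative."""
    return -value if hemisphere in ("S", "W") else value

def maps_link(lat, lat_hemi, lon, lon_hemi):
    return "https://maps.google.com/maps?q=%s,%s" % (
        to_signed(lat, lat_hemi), to_signed(lon, lon_hemi))

# The coffee-shop waypoint from above:
print(maps_link(41.89529, "N", 12.48019, "E"))
# https://maps.google.com/maps?q=41.89529,12.48019
```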


Improving Security and Usability....

It's generally assumed that security and usability (i.e., convenience) are at odds: improving one means sacrificing the other. Anyone who has flown on a commercial flight in the past few weeks (or years) has seen this in action. One place where usability and security tend to collide in a facility like TeraGrid is the process by which authorized users are authenticated and gain access to services and resources. We actually have an opportunity to move to an architecture that will improve both usability and security. Yes, it sounds too good to be true....

A few weeks ago I posted a note about attribute-based authorization and a pointer to a paper that Von Welch (from the GridShib project) has been putting together (with myself, Ian Foster, Tom Scavo, and Frank Siebenlist), and a TeraGrid Authorization, Authentication, and Account Management workshop scheduled to take place at Argonne this week. Ian also recently wrote about attribute-based authorization in his blog with some good pointers.

The workshop concluded yesterday, and I spoke with Dane Skow (TeraGrid deputy director) this morning about how it went. Dane was one of the co-organizers of the workshop (along with Von and also PSC's Jim Marsteller, the head of the TeraGrid security working group). In addition to checking out the website for the workshop, where all of the notes and background information can be found, you might be interested in Dane's take on what was accomplished:

1) We figured out how to cut a week off the process of getting new user accounts in a pretty easy first step, and identified a path to cutting the time to issue new accounts even further.
2) We identified a very small set of information (persistent unique identifier and (maybe) citizenship) as the required set for gatewayed users. [editor's note, the verb "to gateway" here refers to obtaining TeraGrid access via a Science Gateway... it is usually a good sign when the proper name for a project gets verbed]
3) We designed a testbed that would enable users to use their Shibboleth credentials from home institutions to generate credentials that would work on TeraGrid. They would not have to retain a persistent X.509 environment on their workstations, though for some usage modes they would have to use short-lived proxies placed into a local Globus environment.

From my point of view it was tremendous to see about 35 participants working together from TeraGrid sites as well as partner organizations such as the Globus Alliance and the Internet2 Shibboleth project. We had experts in security, accounting, grid software development, and identity management constructively grappling with this important set of issues together. The event was a nice example of why you get on an airplane and travel to a workshop - to make progress about 50 times faster than exchanging email and position papers! Having made this investment, we are ready to take the next concrete steps to make this vision a reality.

Improving security and usability at the same time. How often do you get a chance to do that?