UC Grid Summit 2009 Discussions

From UGP-Wiki



Agenda

Summit Agenda

Campus Grid Project status and HPC activities, Shared Cluster, Cloud Computing Task Force

UCSB: Report from Paul

  1. Campus Grid Portal up and running with 12 active users in Winter 09 quarter.
  2. Mainly used for: running Gaussian on HPC by non-Unix users; running VASP through the Condor cluster (which provides extra resources); interactive apps (Windows users can run X11 apps)
  3. Sporadic use for classes
  4. In general, campus cluster owners don't have much interest or incentive to join. The general attitude is: if you do everything for me, then it is OK to add us to the grid.
  5. Computing is decentralized, with a general distrust of centralized management
  6. The web interface is slow
  7. If more compute resources were freely available, there would be more users

UCI: Report from Harry

  1. Grid Portal has been up and running, but no one is actually using it.
  2. That may change when Broadcom cluster resources are added to the grid, probably in 30 days
  3. One problem with a GUI versus a command line is that GUIs change over time, while CLIs don't

UCR: Report from Bill


Bill's slides are up at the summit website. I will just add a few comments.

  1. It appears Bill needs a few appliance nodes, because there are a few clusters that could be added to the grid.

UCSD/SDSC: Shava


  1. The UCSD test cluster and onDemand cluster are currently part of the campus portal. No users were reported. Shava's slides are on the web.

LBL: Krishna and Jackie


Their slides are on the web as well. A few notes:

  1. The UC3 cluster will use the Moab resource manager
  2. The South and North clusters are proposed to have two separate HOME file systems (one each) with periodic synchronization between them. Jobs can be run out of the HOME file system, but preferably out of the scratch space built on parallel file systems located at each site. The parallel file system is NOT PVFS, and it is still not finalized. Jackie has figured out how a single qmaster manages geographically distributed nodes, especially file stage-in and stage-out.
  3. Use of one-time passwords is a must on LBL-operated computer resources
  4. Berkeley Lab is also working on an experimental project to provision Amazon EC2 compute resources on demand and evaluate their use for the Laboratory's computational needs. (Question: This is from my recollection: there was some mention of some kind of VLAN that would keep the new resources at Amazon under LBL security. I am not sure whether this was mentioned here or during some other discussion.)

UCD: David Walker

> I will let David comment on it.

[David's comments are below]

The ITLC is creating a Cloud Computing Task Force (via Russ Hobby and me) to assess potential uses of cloud computing within UC over the next few months. Membership will be drawn from the Communications Planning Group, the old UC Research Computing Group, the Joint Data Center Management Group, cloud computing researchers, and UC Grid. Bill Labate has agreed to represent UC Grid, and some of the other members have been identified; the current list will be on the wiki ( https://spaces.ais.ucla.edu/display/uccctf/Home ) as members are added.

Invited Presentations

Eucalyptus Cloud Computing: Daniel Nurmi


Dan's slides are available. Discussion is in the afternoon section.

Dan? (I think all the discussions were in the afternoon, right?)

Kepler workflow: Jianwu


Jianwu? (Please fill in anything you would like to add.)

Harry? (I remember you asked a nice question comparing it to Mac OS. Basically, when it works, it works; debugging is harder. We didn't give Jianwu much time to explain how to debug. Jianwu, could you please give your extended answer now? Maybe Ilkay can contribute too.)

Inca Monitoring: Shava


This basically exposed a weakness in UC Grid: the Grid Portal admin cannot log in every day to look at the status of all services in the portal. We need something like Inca to aid the Grid Portal admin. Shava already has some Inca tools; maybe we will make them available later inside the UC Grid software distribution.

Technical Discussions

UC3 Shared Cluster Discussion: Gary

Looking at the discussions we had, I thought everyone was looking forward to Gary's talk.

Some of the discussion topics and highlights of Gary's talk:

  1. Gary is on a tight schedule to procure the machine
  2. Warren Mori was interested in the criteria for benchmarks and acceptance. Gary's answer was that no individual code will be benchmarked; acceptance will be based on theoretical FLOPS, as it is a general-purpose cluster. A cluster with Intel Nehalem CPUs and an InfiniBand interconnect will satisfy that requirement. Any change in this policy has to go to another committee (ITLC?)

[David's comments are below]

I think the discussion of one-time password devices actually occurred during the UCTrust integration talk. Nevertheless, I think we agreed that this is a policy issue that should be raised with the ITLC (or the future UC3 governance group), particularly if we expect UC3 to expand beyond the initially-estimated 240 users.

3. One-time passwords seem to be the most controversial topic. Jackie justified the inconvenience of having a one-time password in terms of the security benefits it provides: it will basically stop keyboard sniffing, which happens to be the most common way clusters are hacked.

Others' point of view is that it is an expensive and unscalable model beyond LBNL. So, if I understood correctly, the discussion was about how secure we can make the system without causing inconvenience to users. The decision to use OTP on the UC3 cluster has already been made.

Current OTP users may want to answer some questions, such as: do we have to generate an OTP each time we 'scp' a file from the desktop to the cluster and vice versa? If I heard correctly, the current policy will allow users to communicate without any password for 24 hours from the machine where the initial transaction was done. (Update from Krishna: a 24-hour credential lifetime was NOT mentioned for OTP, and I think it is not possible.) I suppose there must be a way to destroy that credential when we leave the machine where we initiated it. Also, if that credential is live for 24 hours, users should be able to lock the screen when they are going away for a while.

Currently, UC Grid does not know how to obtain authentication from an OTP, so UC Grid probably will not be able to use this procedure for now unless we work on it and find a solution.

There was a question whether pool users will be allowed to submit jobs to UC3 cluster.

Answer anyone?

Unfortunately, we assumed all the participants were aware of the UC Grid architecture, so we didn't include a slot for a UC Grid architecture overview.

For those of you who are not familiar with UC Grid architecture, here is a brief explanation:

UC Grid has two kinds of users:

Cluster users are normal Unix account holders who use X509-based certificates to log in to their cluster accounts through the web. They can open an interactive shell through VNC in the web interface itself. At this stage there is pretty much no difference between a direct login through ssh and a web login using X509 credentials.

Pool users are those who don't have a Unix account on the compute clusters. They are not entitled to a Unix-like shell through VNC on any of the participating clusters; they can only submit jobs for pre-installed applications like R, Gaussian, Q-Chem, Mathematica, Matlab, etc. Pool users upload their input from their laptop, or from one of the clusters where they have a Unix account, and submit the job. The grid copies the input files to the target cluster and submits the job as a guest user. Please note that pool users are never allowed to submit an executable, because the grid cannot guarantee the operating system at the target cluster; it can only guarantee the availability of pre-installed software. The full path of each pre-installed application is already loaded in the Grid Portal's database.
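As a rough illustration of the pool-user flow described above, here is a minimal sketch of the portal-side lookup and command construction. The application table, paths, and guest account name are hypothetical, not the actual UC Grid portal code:

```python
# Hypothetical sketch of the pool-user submission flow described above.
# The real UC Grid portal stores preinstalled application paths in its
# database; this dict and these paths are made-up examples.

# Portal-side table: application name -> full path on the target cluster
PREINSTALLED_APPS = {
    "gaussian": "/opt/apps/gaussian/g03/g03",
    "matlab": "/opt/apps/matlab/bin/matlab",
    "r": "/opt/apps/R/bin/Rscript",
}

def build_pool_job(app_name, input_file):
    """Build the command for a pool job, refusing user-supplied executables."""
    try:
        app_path = PREINSTALLED_APPS[app_name.lower()]
    except KeyError:
        # Pool users may only run preinstalled applications; arbitrary
        # executables are rejected because the grid cannot guarantee the
        # operating system or architecture at the target cluster.
        raise ValueError(f"{app_name} is not preinstalled on the target cluster")
    # The grid would stage input_file to the target cluster and run this
    # command as a guest user.
    return [app_path, input_file]

# Example: a pool user submits a Gaussian input deck.
cmd = build_pool_job("gaussian", "water.com")
```

The key design point is that the pool user never names a path or binary directly; the portal database resolves the application name to a path it already trusts.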

It is not clear whether LBL will allow any pool users on the UC3 cluster, even if we figure out how to translate an OTP into an X509 certificate, which itself is doubtful.

UC Grid Security and Certificate: David and Kejian


Slides for both David's and Kejian's talks are available on the web.

Everybody agreed that this is a scalable model when we add 10,000 users to the grid. All users will be authenticated through UCTrust Shibboleth. Campus identity providers are responsible for verifying that users are who they claim to be. The UC Grid certificate signing service will create an X509 certificate based on the ePPN assertion returned from the identity provider; this ePPN will uniquely identify people from different campuses. Everyone will be able to use grid services immediately, but to submit a job to a cluster using a user's Unix account, one has to wait until the cluster administrator verifies the request and maps the user certificate to the Unix account in the grid-mapfile on the appliance. It is the user's responsibility to contact the cluster admins in advance and make them aware that they are applying for a grid account as well. If not, the cluster admin will reject the request and the grid user will remain a pool user for that particular cluster. (By default, all users are pool users in a default pool.)
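For reference, the grid-mapfile mentioned above is a plain-text file in the standard Globus format: one entry per line, mapping a quoted certificate subject DN to a local Unix account name. The DNs below are made-up examples, not actual UC Grid subject names:

```
"/DC=org/DC=ucgrid/OU=People/CN=jdoe@ucla.edu" jdoe
"/DC=org/DC=ucgrid/OU=People/CN=asmith@ucsd.edu" asmith
```

Until the cluster admin adds such a line for a user's certificate, that user can only act as a pool user on that cluster.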

Kejian did a demo of grid certificate signing using authentication from UCLA Shibboleth; he called it the UCCA Register service. All 10 campuses and LBL have (or will have) this capability if they simply deploy this web service. The web services are similar to those available in Globus Toolkit 4.0.

Kejian's Resource Broker service can take an application request, submit pool jobs to a pool of clusters including Hoffman2, the universal cluster, and the SDSC cluster, and bring back the results. All commands are executed on the command line.

Kejian also gave a demo of how the grid will dynamically create a Unix account on our Perceus-based cluster, do a livesync to populate the compute nodes with the passwd file and other details, and submit jobs through the SGE resource manager.
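A minimal sketch of the two provisioning steps described above: building the passwd entry that livesync would push to the compute nodes, and constructing the SGE submission command. The UID range, GID, home directory layout, and queue name are assumptions for illustration, not UC Grid's actual values:

```python
# Hypothetical sketch of dynamic account creation for a grid pool user.
# Field values (GID, home path, shell, queue name) are made-up examples.

def make_passwd_entry(username, uid, gid=500):
    """Return a /etc/passwd-style line for a dynamically created grid user.

    Perceus livesync would propagate this entry to all compute nodes so the
    job can run under the new account cluster-wide.
    """
    return f"{username}:x:{uid}:{gid}:UC Grid pool user:/home/{username}:/bin/bash"

def make_qsub_command(script, queue="grid.q"):
    """Return an SGE submission command for the staged job script."""
    return ["qsub", "-q", queue, script]

entry = make_passwd_entry("gridpool1", 5001)
cmd = make_qsub_command("pool_job.sh")
```

In the real system these strings would be written to the master passwd file and handed to the resource manager; the sketch only shows the shape of the data involved.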

Shava and Jackie wanted the software so they could test it. Kejian promised to make it available as early as he can. Kejian?

Cloud Discussion by Dan

Our current pool users are limited to submitting application jobs only. We want to extend this capability so users can upload their own image to any of the hardware resources and run their own executable. This will eliminate the current need to know the target architecture in advance.

We will be restarting the UC Grid pilot cloud project that started a couple of months ago but got delayed a little bit.

For scientists this can be useful: you can fix the environment (e.g., libraries) without upgrading or downgrading the whole cluster, which impacts everybody.

Potential uses of cloud in UC Grid:

  • Long Serial jobs
  • Absorbing demand peaks
  • Scaling existing serial resources

Use cloud simply as another queue?

There is a cloud computing task force (under the ITLC).

Workflow Discussion by Jianwu

Jianwu explained how, in the Rosetta protein design project, he brought jobs that would have taken 2,000 minutes down to 80 minutes (a 25x speedup).

Inca Discussion by Shava

The Inca grid monitoring currently being tested by Shava will be added to UC Grid.

Participants

Participants and Affiliation
       Allen, Arlene                   ucsb
       Nurmi, Daniel                   ucsb
       Weakliem, Paul                  ucsb
       Davis, Jim                      ucla
       Friedman, Scott                 ucla
       Huang, Shao-Ching               ucla
       Jin, Kejian                     ucla
       Kim, Seonah                     ucla
       Korambath, Prakashan            ucla
       Labate, Bill                    ucla
       Mazurkova, Svitlana             ucla
       Mori, Warren                    ucla
       Singh, Tajendra Vir (TV)        ucla
       Mangalam, Harry                 uci
       Schiano, Allen                  uci
       Kennedy, Michael                ucr
       Strossman, Bill                 ucr
       Jung, Gary                      lbl
       Muriki, Krishna                 lbl
       Scoggins, Jacqueline            lbl
       Smallen, Shava                  sdsc
       Wang, Jianwu                    sdsc
       Walker, David                   ucdavis