FAQ

From UC Grid Wiki

The Data Manager does not work

   The Grid Portal machine must be able to access itself and the Grid
   Appliances on port 2811. To test whether it can, from the Grid Portal try
   to telnet to the Grid Appliance on that port:
       telnet appliance.ucla.edu 2811
   If your appliance runs as a NIS client of the cluster's head node, make
   sure that NIS is still working, e.g. that ypcat passwd works. If NIS has
   stopped for some reason, end users will get a 'NULL' or 'Socked disconnect'
   error from the Data Manager (DM), and you will see 'Bad Password' errors in
   the catalina.out file on the Grid Portal.
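
   The telnet test above can also be scripted, for example to check several
   appliances in a row. The following is a minimal Python sketch, not part of
   the UGP software; the function name port_open and the 5-second default
   timeout are assumptions made for illustration.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only proves that the port accepts connections,
    which is the same thing a successful telnet connect demonstrates.
    """
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

   Calling port_open("appliance.ucla.edu", 2811) from the Grid Portal machine
   should return True when the Data Manager port is reachable. The same helper
   works for any other host and port that this FAQ asks you to test.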

GRAM job submission fails

   Make sure that /etc/sudoers contains the entries below. In this example
   GLOBUS_LOCATION is /home/globus/GT4; substitute your own path.
  • /etc/sudoers
    • sudo privilege for gridmap execute (each entry is one logical line; the
      trailing backslashes continue it onto the next line)
   globus  ALL=(username1,username2) \
   NOPASSWD: /home/globus/GT4/libexec/globus-gridmap-and-execute \
   -g /etc/grid-security/grid-mapfile \
   /home/globus/GT4/libexec/globus-job-manager-script.pl *
   globus  ALL=(username1,username2) \
   NOPASSWD: /home/globus/GT4/libexec/globus-gridmap-and-execute \
   -g /etc/grid-security/grid-mapfile \
   /home/globus/GT4/libexec/globus-gram-local-proxy-tool *
    • Make sure the "Defaults requiretty" line is commented out in /etc/sudoers
  • /etc/grid-security/grid-mapfile
    Make sure the user's certificate DN is entered in the grid-mapfile.
  • Always run 'date' on your machine and make sure the time on your machine is correct.
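
   The grid-mapfile check above can be done by eye, or with a short script.
   The sketch below assumes the usual grid-mapfile layout of one entry per
   line: a double-quoted DN followed by one or more comma-separated local
   account names. The function name and the sample DNs in the usage note are
   hypothetical.

```python
import shlex


def local_users_for_dn(mapfile_text, dn):
    """Return the local accounts mapped to dn, or [] if dn is not listed.

    mapfile_text is the content of a grid-mapfile; the assumed layout is
        "<certificate DN>" user1[,user2...]
    """
    for raw in mapfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        try:
            fields = shlex.split(line)  # honours the quotes around the DN
        except ValueError:
            continue  # unbalanced quotes: ignore the malformed line
        if len(fields) >= 2 and fields[0] == dn:
            return fields[1].split(",")
    return []
```

   For example, read /etc/grid-security/grid-mapfile into a string and call
   local_users_for_dn(text, "/O=Grid/OU=UCLA/CN=Jane Doe"). An empty list
   means the DN still has to be added before GRAM can map the user.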

The Resources table does not show the Status (up or down) of a particular cluster

     - Resource information is retrieved in one of two ways: by periodic
       aggregation or by polling on request.
     - The Resources are aggregated in the background periodically. You will
       normally have to wait up to 10 minutes after starting GT4 on the Grid
       Appliance connected to the cluster before the information about the
       cluster shows up under Resources.
      
     - Did you execute srun.sh? See the section titled "After Installation" in
       this INSTALL.txt document.
     - Check whether you can connect to the Grid Portal machine at port 8443
       from the Appliance. From the Grid Appliance enter:
           telnet portal.machine.address 8443

You do not see anything after you log in to the Grid Portal

   As the Grid Administrator, you must have added at least one Grid Appliance
   to the Portal for the Portal to work.

Users cannot submit jobs to a particular cluster

   You must make sure certificates have been created and placed in
   "credentialDir" on the Grid Portal machine.
   You cannot submit jobs when logged in to the Grid Portal as the GridSphere
   super user. You must log in as either the Grid Administrator or any other
   regular user for whom a certificate has been created. When the user logs
   in, a proxy certificate is created for that session. The proxy certificate
   is used as the credential to access the remote clusters.
   The Grid Appliance and the Grid Portal must also have a trust relationship
   with the CA. For an Appliance, make sure you followed the instructions
   about copying the file:
       .globus/simpleCA/globus_simple_ca....tar.gz

There are numerous different connection problems between the Grid Portal and an Appliance

     - The Appliance CA must be signed or accepted by the Grid Portal CA, or
       whatever CA you are using.
     - Both machines must trust each other.
   See the section titled "Configuring your Certificate Authority (CA)" for
   the details.

The Grid Portal cannot retrieve the status of jobs; it always reports a status of "Unsubmitted"

     - For SGE, you must be running version 6.0 or higher for the status to
       appear. You must have $SGE_ROOT and $SGE_CELL defined in the
       environment, and the following file must exist:
           $SGE_ROOT/$SGE_CELL/common/reporting
     - For PBS, you must cross-mount the file system containing PBS's server
       log file on the Grid Appliance. You must have $PBS_HOME defined in the
       environment and also have the directory named:
           $PBS_HOME/server_logs
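
   The prerequisites listed above can be checked with a few lines of Python on
   the Grid Appliance. This is only a sketch: the function name and the wording
   of the messages are assumptions, and it does not check the SGE version or
   whether the PBS file system is actually cross-mounted.

```python
import os


def scheduler_status_problems(scheduler, env=None):
    """Return a list of problems that would stop job status from appearing.

    scheduler is "SGE" or "PBS"; env defaults to the current environment.
    Hypothetical helper that only tests the variables and paths named above.
    """
    env = os.environ if env is None else env
    if scheduler == "SGE":
        required = ["SGE_ROOT", "SGE_CELL"]
    elif scheduler == "PBS":
        required = ["PBS_HOME"]
    else:
        raise ValueError("scheduler must be 'SGE' or 'PBS'")

    missing = [name for name in required if not env.get(name)]
    if missing:
        return ["$%s is not defined" % name for name in missing]

    problems = []
    if scheduler == "SGE":
        path = os.path.join(env["SGE_ROOT"], env["SGE_CELL"], "common", "reporting")
        if not os.path.isfile(path):
            problems.append("missing file: " + path)
    else:
        path = os.path.join(env["PBS_HOME"], "server_logs")
        if not os.path.isdir(path):
            problems.append("missing directory: " + path)
    return problems
```

   An empty list from scheduler_status_problems("SGE") or
   scheduler_status_problems("PBS") means the variables and paths are in
   place; any entries name what still has to be fixed.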

There are two network cards in the Grid Appliance machine

   The Globus Toolkit must run with the hostname rather than the public or
   private IP address.

   The solution is to modify these two files:
       $GLOBUS_LOCATION/etc/globus_wsrf_core/client-server-config.wsdd
       $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd
   Add the following two lines to each file so that Globus uses the hostname
   of the appliance node instead of the private or public IP address:
       <parameter name="logicalHost" value="appliance.ucla.edu"/>
       <parameter name="disableDNS" value="true"/>

GridFtp is not working

   Check /etc/hosts.allow and make sure it contains a line like this:
       globus-gridftp-server: [Portal machine IP]
   If you do not see that line, add it.

When you try to submit a job to a cluster from the Grid Portal you get the following error:

       Job submission failed: Caused By Job Submission Exception:
       org.globus.wsrf.container.ContainerException: Container failed to
       initialize [Caused by: Configuration file directory './etc' does not
       exist or is not a directory or is not readable.]
   GLOBUS_LOCATION must be defined for jakarta-tomcat. Look for lines in
   $CATALINA_HOME/bin/catalina.sh that look like the following (each
   JAVA_OPTS assignment is a single line in the file):
       # Set juli LogManager if it is present
       if [ -r "$CATALINA_HOME"/bin/tomcat-juli.jar ]; then
       JAVA_OPTS="$JAVA_OPTS "-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager" "-Djava.util.logging.config.file="$CATALINA_BASE/conf/logging.properties"
       JAVA_OPTS="$JAVA_OPTS "-DGLOBUS_LOCATION=/home/globus/UCPortal/GT4
   In the last of these lines replace /home/globus/UCPortal with the full path
   of your install directory.

Can the Grid Admin manually add a user?

   Yes. Go to the Grid Admin:User Admin:Create New User form.
   After the form is submitted, the user will receive an email for activating his/her account. The user creates
   the new password during that activation process.

How to join the UC Grid if your department has a cluster?

     If you belong to any department or organization on a UC campus, you can join the UC Grid.
        1. Contact the Campus Grid Admin and ask for permission.
        2. Install the Grid Appliance node and attach it to your cluster.
        3. Configure the Grid Appliance node so that it can submit jobs.
        4. Install the UGP software.
        5. Generate a host certificate request and send it to the UC Grid Admin.
        6. The UC Grid Admin signs the certificate and sends back the signed host certificate.
        7. Put the host certificate in /etc/grid-security.
        8. Ask the Campus Grid Admin to add your cluster to the Campus Grid Portal.
        9. Once the cluster is added to the Campus Grid Portal, it will show up in the UC Grid Portal.

How does the user reset the password?

   On the main page, click "Forgot your Grid Password?". You must authenticate yourself through Shibboleth.
   An email will be sent to you for confirmation. Once you confirm your email, the Grid Admin is notified
   of the reset request. The Grid Admin logs in to the portal, clicks "Grid Admin:User Requests:ResetPasswordApproval",
   and clicks "Approve". The user then receives an email for resetting his/her password.

How does the user add an additional resource?

   First, contact the resource admin to obtain an account. Once you have an account,
   log in to the portal, click "Add Resources", and select the resource to add.
   When the resource admin approves your request, the resource will be added to your resource access list.


$GLOBUS_LOCATION/libexec/globus-scheduler-provider-sge: not found

This error occurs because the SGE-SEG distributed by London e-Science does not provide this file. Create the file with the following contents:

echo "<Scheduler xmlns=\"http://mds.globus.org/batchproviders/2004/09\">";

echo "</Scheduler>";

Then make the file executable with the command 'chmod +x globus-scheduler-provider-sge'.

Start Globus service using Logical Hostname instead of IP address when there are two NIC addresses

cd $GLOBUS_LOCATION/etc/globus_wsrf_core

Edit the two files below and, in each of them, add the following two lines immediately after the line

<globalConfiguration>

File 1: client-server-config.wsdd File 2: server-config.wsdd

       <parameter name="logicalHost" value="grid-appliance-hostname"/>
       <parameter name="disableDNS" value="true"/>

Save the files and restart Globus
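
The edit described above can be sketched in Python as a plain text insertion after the <globalConfiguration> line. This is an illustration rather than a supported tool: the function name is hypothetical, it treats the file as text instead of parsing the XML, and it leaves the text unchanged if a logicalHost parameter is already present.

```python
def add_logical_host(wsdd_text, hostname):
    """Insert the logicalHost and disableDNS parameters after <globalConfiguration>.

    wsdd_text is the content of a .wsdd file; the modified text is returned.
    Hypothetical helper: reading and writing the file is left to the caller.
    """
    marker = "<globalConfiguration>"
    if 'name="logicalHost"' in wsdd_text:
        return wsdd_text  # already configured, do not add a second copy
    pos = wsdd_text.find(marker)
    if pos == -1:
        raise ValueError("no <globalConfiguration> line found")
    insert_at = pos + len(marker)
    new_lines = (
        '\n        <parameter name="logicalHost" value="{}"/>'
        '\n        <parameter name="disableDNS" value="true"/>'
    ).format(hostname)
    return wsdd_text[:insert_at] + new_lines + wsdd_text[insert_at:]
```

Apply it to the text of both files, write the results back, and restart Globus. Keeping a copy of the original files first is a sensible precaution.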

How to take a MySQL dump to transfer the database tables

First dump the contents of the current database to a file:

mysqldump -a gridsphere -u dbadmin -p -r ugp.file


Now populate the new database with the contents from the dumped file.

mysql gridsphere -u dbadmin -p

mysql>\. ugp.file

Host certificate Renewal and Expiration

Host certificates are normally signed for one year. They can be renewed by sending an e-mail to UCLA's atshpc group. Normally, the new certificate is generated from the old request file. If you need to create a new certificate request file for some reason, run the command below after sourcing the Globus environment.

grid-cert-request -service host `hostname`

The expiry date of a certificate can be viewed with the command below:

openssl x509 -in /etc/grid-security/hostcert.pem -noout -enddate
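
To get an early warning before a host certificate expires, the output of the command above can be turned into a number of days. The sketch below assumes the usual "notAfter=Mon DD HH:MM:SS YYYY GMT" line that openssl prints for -enddate; the function name is hypothetical.

```python
from datetime import datetime, timezone


def days_until_expiry(enddate_line, now=None):
    """Return the whole days left before the certificate expires.

    enddate_line is the 'notAfter=...' line printed by
    'openssl x509 -noout -enddate'. A negative result means the
    certificate has already expired.
    """
    prefix = "notAfter="
    text = enddate_line.strip()
    if not text.startswith(prefix):
        raise ValueError("expected a line starting with 'notAfter='")
    # openssl pads single-digit days with an extra space; strptime accepts that.
    expiry = datetime.strptime(text[len(prefix):], "%b %d %H:%M:%S %Y GMT")
    expiry = expiry.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expiry - now).days
```

Feed it the line printed by the openssl command and request a renewal when the result drops below a threshold of your choosing, for example 30 days.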

Testing GRAM job submission for SGE

 First log in to the host where Globus is installed as a regular user who has a UC Grid certificate, and issue the following commands:

 > source $GLOBUS_LOCATION/etc/globus-user-env.sh
 > myproxy-get-delegation -s myproxy.ucgrid.org
 > globusrun-ws -debug -batch -submit -o job_epr -factory `hostname` -Ft SGE -f  submit.xml
 Then check the status using the command below 
 > globusrun-ws -status -job-epr-file job_epr
 > cat submit.xml
 <job>
   <executable>/bin/echo</executable>
   <argument>this is an example_string </argument>
   <argument>Globus was here</argument>
   <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
   <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
 </job>
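
 A job description like the submit.xml above can also be generated from a script when many test jobs are needed. The following Python sketch uses only the standard library and emits the same four kinds of elements in the same order; the function name is hypothetical and no other job description elements are covered.

```python
import xml.etree.ElementTree as ET


def build_job_description(executable, arguments, stdout, stderr):
    """Return a <job> document as a string, mirroring the submit.xml example.

    Hypothetical helper: it writes executable, argument, stdout and stderr
    elements in that order and nothing else.
    """
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = executable
    for arg in arguments:
        ET.SubElement(job, "argument").text = arg
    ET.SubElement(job, "stdout").text = stdout
    ET.SubElement(job, "stderr").text = stderr
    return ET.tostring(job, encoding="unicode")
```

 Write the returned string to submit.xml and pass that file to globusrun-ws with -f as shown above.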

How to test whether Index Service is working

To check whether the index service for Statistics is working (it returns the load on the cluster through $GLOBUS_LOCATION/libexec/aggrexec/globus-info-provider), issue this command:

wsrf-query -a -z none -s https://gridappliancehostname:8443/wsrf/services/DefaultIndexService "//*[local-name()='Statistics'][mds:hostname='gridappliancehostname']"


How to redeploy Gridsphere

To reset the GridSphere portal, follow these steps:

1. Remove the GridSphere database (back it up first if needed)

  Warning: if you need the data, back up your database first:
     > mysqldump -a gridsphere -u dbadmin -p -r ugp.db
  To remove all tables from gridsphere:
  >mysql -u root -p 
     >  drop database gridsphere;
     >  create database gridsphere;
     >  grant all privileges on gridsphere.* to dbadmin@localhost identified by 'userpass';
     >  grant all privileges on gridsphere.* to dbadmin@localhost.localdomain identified by 'userpass';


2. Reinstall GridSphere

   > cd 3rdParty
   > ant installGridsphere

Now point your browser to http://localhost:9443/gridsphere, where you can recreate your super user account.


Installing UC Grid Simple CA tar file in GT 5.0 Globus installations

The UC Grid simpleCA was created with simpleCA version 19. The current version requires a small change at line 250 of the file below:

globus_simple_ca_21201907_setup-0.19/pkgdata/Makefile.in

< done

> done || true