Install Troubleshooting

From UGP-Wiki

VI. Troubleshooting

1. The Data Manager does not work
   The Grid Portal machine must be able to access itself and the Grid
   Appliances on port 2811. To test whether it can, from the Grid Portal try
   to telnet to the Grid Appliance on that port:
       telnet appliance.ucla.edu 2811
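   If telnet is not installed, the same reachability test can be scripted.
   The following is a minimal sketch, not part of the portal distribution. It
   assumes bash (for its /dev/tcp redirection) and the coreutils timeout
   command are available. The function names port_open and report_port are
   hypothetical.

```shell
# Sketch: test whether a TCP port accepts connections, without telnet.
# Assumes bash (for /dev/tcp) and the coreutils `timeout` command.

# port_open HOST PORT [SECONDS]  (hypothetical helper name)
# Returns 0 if a TCP connection succeeds within SECONDS (default 5).
port_open() {
    local host=$1 port=$2 wait=${3:-5}
    # $0 and $1 inside the inner script are HOST and PORT.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# report_port HOST PORT: print a one-line verdict for the endpoint.
report_port() {
    if port_open "$1" "$2"; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 NOT reachable"
    fi
}
```

   After sourcing these functions on the Grid Portal, run:
       report_port appliance.ucla.edu 2811
   The same helper can be used for any of the other port checks in this
   section by changing the host and port.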
2. The Resources table does not show the Status (up or down) of a particular
   cluster
     □ The Resources are aggregated periodically in the background. You will
       normally have to wait up to 10 minutes after starting GT4 on the Grid
       Appliance connected to the cluster before the information about the
       cluster shows up under Resources.
     □ Did you execute srun.sh? See the section titled "After Installation" in
       this INSTALL.txt document.
     □ Check whether you can connect to the Grid Portal machine at port 8443
       from the Appliance. From the Grid Appliance enter:
           telnet portal.machine.address 8443
3. You do not see anything after you log in to the Grid Portal
   As the Grid Administrator, you must have added at least one Grid Appliance
   to the Portal for the Portal to work.
4. As Grid Administrator, you added a user manually, without the user filling
   in the form on the web. The user can log in but cannot work with any of the
   clusters. What is wrong?
   If you add a user manually as Grid Administrator, a Cluster Administrator
   of a cluster on which that user has a login ID must also Grid-enable that
   user for that cluster. See the steps in the section above on "Adding
   Users" for the details.
5. Users cannot submit jobs to a particular cluster
   You must make sure certificates have been created and placed in
   "credentialDir" on the Grid Portal machine.
   You cannot submit jobs when logged in to the Grid Portal as the GridSphere
   super user. You must log in as either the Grid Administrator or any other
   regular user for whom a certificate has been created. When the user logs
   in, a proxy certificate is created for that session. The proxy certificate
   is used as the credential to access the remote clusters.
   Also, the Grid Appliance and Grid Portal must have a trust relationship with
   the CA. For an Appliance, make sure you followed the instructions about
   copying the file:
       .globus/simpleCA/globus_simple_ca....tar.gz
   If the users get a "Socket Timed Out" message each time that they submit
   a job, check the /etc/sudoers file on the Appliance and make sure that you
   have commented out the line that reads:
       Defaults   requiretty
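   The sudoers check can be scripted. This is a minimal sketch under stated
   assumptions: the helper name requiretty_active is hypothetical, and it
   only detects the plain "Defaults requiretty" form shown above, not
   per-user or per-host variants. Reading /etc/sudoers normally requires
   root.

```shell
# requiretty_active FILE  (hypothetical helper name)
# Returns 0 if FILE contains an uncommented "Defaults requiretty" line.
# Only the plain form is matched; commented-out lines are ignored.
requiretty_active() {
    grep -Eq '^[[:space:]]*Defaults[[:space:]]+requiretty([[:space:]]|$)' "$1"
}
```

   On the Appliance, as root:
       requiretty_active /etc/sudoers && echo "requiretty is still active"
   If the message appears, comment out the line (preferably with visudo).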
6. Various connection problems occur between the Grid Portal and an
   Appliance
     □ The Appliance CA must be signed or accepted by the Grid Portal CA, or
       whatever CA you are using.
     □ Both machines must trust each other.
   See the section titled "Configuring your Certificate Authority (CA)" for
   the details.
7. The Grid Portal cannot retrieve the status of jobs; it always reports a
   status of "Unsubmitted"
     □ For SGE, you must be running version 6.0 or higher for the status to
       appear. You must have $SGE_ROOT and $SGE_CELL defined in the
       environment, and also have the file named:
           $SGE_ROOT/$SGE_CELL/common/reporting
     □ For PBS, you must cross-mount the file system containing PBS's server
       log file on the Grid Appliance. You must have $PBS_HOME defined in the
       environment and also have the directory named:
           $PBS_HOME/server_logs
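   The environment and file requirements above can be verified with a short
   script. This is a sketch only: the function names check_sge and check_pbs
   are hypothetical, it assumes the SGE file is named "reporting", and it does
   not check the SGE version or whether the PBS log directory is actually
   cross-mounted.

```shell
# check_sge: verify the SGE variables and reporting file named above.
check_sge() {
    if [ -z "${SGE_ROOT:-}" ] || [ -z "${SGE_CELL:-}" ]; then
        echo "SGE_ROOT or SGE_CELL is not set"; return 1
    fi
    if [ ! -f "$SGE_ROOT/$SGE_CELL/common/reporting" ]; then
        echo "missing file: $SGE_ROOT/$SGE_CELL/common/reporting"; return 1
    fi
    echo "SGE prerequisites look OK"
}

# check_pbs: verify PBS_HOME and the server_logs directory named above.
check_pbs() {
    if [ -z "${PBS_HOME:-}" ]; then
        echo "PBS_HOME is not set"; return 1
    fi
    if [ ! -d "$PBS_HOME/server_logs" ]; then
        echo "missing directory: $PBS_HOME/server_logs"; return 1
    fi
    echo "PBS prerequisites look OK"
}
```

   Run check_sge or check_pbs on the Grid Appliance in the same environment
   that GT4 is started from, since that is where the variables must be set.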
8. You have two network cards in the Grid Appliance machine and the Globus
   Toolkit is running only on the local network IP. The Globus Toolkit needs
   to be running on the public IP.
   The solution is to modify the two files:
       $GLOBUS_LOCATION/etc/globus_wsrf_core/client-server-config.wsdd
       $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd
   and set the following parameters in each, so that the Globus Toolkit uses
   the hostname instead of the IP address:
       <parameter name="logicalHost" value="appliance.ucla.edu"/>
       <parameter name="disableDNS" value="true"/>
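   After editing, you can confirm that both files carry the two parameters.
   The following is a rough sketch: the helper name check_wsdd is
   hypothetical, and it matches the attribute text literally, so it will
   report a problem if the attributes are written in a different order or
   with different spacing.

```shell
# check_wsdd FILE HOSTNAME  (hypothetical helper name)
# Reports whether FILE sets logicalHost to HOSTNAME and disableDNS to true.
# Uses fixed-string matching; it does not parse the XML.
check_wsdd() {
    local file=$1 host=$2 status=0
    grep -Fq "name=\"logicalHost\" value=\"$host\"" "$file" \
        || { echo "$file: logicalHost is not set to $host"; status=1; }
    grep -Fq 'name="disableDNS" value="true"' "$file" \
        || { echo "$file: disableDNS is not set to true"; status=1; }
    return $status
}
```

   For example, on the Appliance:
       check_wsdd $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd appliance.ucla.edu
   Repeat for client-server-config.wsdd. No output means both parameters were
   found.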
9. GridFTP is not working
   Check /etc/hosts.allow and make sure it contains a line like this:
       globus-gridftp-server: [Portal machine IP]
   If you do not see that line, add it.
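   A quick way to test for that line is sketched below. The helper name
   gridftp_allowed is hypothetical. It looks only for the literal Portal IP on
   a globus-gridftp-server line and does not interpret wildcards, netmasks,
   or hostnames that hosts.allow also accepts.

```shell
# gridftp_allowed FILE IP  (hypothetical helper name)
# Returns 0 if FILE has a "globus-gridftp-server:" line that lists IP.
# -w keeps 192.0.2.1 from matching inside 192.0.2.10.
gridftp_allowed() {
    grep -E '^[[:space:]]*globus-gridftp-server[[:space:]]*:' "$1" \
        | grep -Fqw "$2"
}
```

   For example, replacing PORTAL_IP with the address of your Portal machine:
       gridftp_allowed /etc/hosts.allow PORTAL_IP || echo "entry missing"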
10. When you try to submit a job to a cluster from the Grid Portal you get the
   following error:
       Job submission failed: Caused By Job Submission Exception:
       org.globus.wsrf.container.ContainerException: Container failed to
       initialize [Caused by: Configuration file directory './etc' does not
       exist or is not a directory or is not readable.]
   The GLOBUS_LOCATION variable must be defined for jakarta-tomcat. Look for
   lines in $CATALINA_HOME/bin/catalina.sh that look like:
       # Set juli LogManager if it is present
       if [ -r "$CATALINA_HOME"/bin/tomcat-juli.jar ]; then
         JAVA_OPTS="$JAVA_OPTS "-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager" "-Djava.util.logging.config.file="$CATALINA_BASE/conf/logging.properties"
         JAVA_OPTS="$JAVA_OPTS "-DGLOBUS_LOCATION=/home/globus/UCPortal/GT4
   In the last of these lines replace /home/globus/UCPortal with the full path
   of your install directory.
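   To confirm the edit, you can extract the configured path and check that it
   contains an etc directory, which is what the error message complains
   about. This is a sketch under assumptions: the function names are
   hypothetical, and it reads only the first -DGLOBUS_LOCATION option it
   finds in the file.

```shell
# globus_location_from FILE: print the path given as -DGLOBUS_LOCATION=...
# (first occurrence only; stops at a quote or whitespace).
globus_location_from() {
    sed -n 's/.*-DGLOBUS_LOCATION=\([^"[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# check_globus_location FILE: verify the path exists and has an etc directory.
check_globus_location() {
    local loc
    loc=$(globus_location_from "$1")
    if [ -z "$loc" ]; then
        echo "no -DGLOBUS_LOCATION option found in $1"; return 1
    fi
    if [ ! -d "$loc/etc" ]; then
        echo "$loc/etc is missing or not a directory"; return 1
    fi
    echo "GLOBUS_LOCATION=$loc"
}
```

   For example:
       check_globus_location $CATALINA_HOME/bin/catalina.sh
   Restart Tomcat after changing catalina.sh so that the new option takes
   effect.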