Hadoop On Demand (HOD)

SABER provides the Hadoop On Demand (HOD) service to its users to allow MapReduce jobs to be run on clusters of arbitrary size without the need for advance knowledge of Hadoop or allocation of dedicated resources. Only minimal configuration is required, similar to loading any software package for HPC-style workloads. HOD utilizes the underlying scheduler and accounting services provided by Torque and Moab. SABER’s HOD service runs hanythingondemand v3—Ghent University’s generalized, third-generation derivative of Apache’s Hadoop On Demand software—providing an execution environment for MapReduce applications within Hadoop, HBase, and Hive frameworks.

Requirements

Users require a valid UIC NetID and an active SABER subscription; to gain access to SABER, please visit ACER’s Request Access page.

Data staging

Before using HOD for MapReduce jobs, data must be transferred to SABER.

Using the /projects filesystem

Users with SABER access should have storage space allocated in the Lustre filesystem mounted at /projects. The ACER project ID is provided to the project’s PI or other designated curator. Data for exclusive user access should be placed in /projects/project_ID/NetID, while data common to all users within a project should be placed in /projects/project_ID/common.
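For example, a user with NetID jdoe on a hypothetical project with ID proj123 (both names are illustrative) would keep private and shared data in the following locations:

$ mv ~/private_data /projects/proj123/jdoe/
$ mv ~/shared_data /projects/proj123/common/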

While users also receive space in /home and /groups, these NFS filesystems are meant to store scripts, configurations, and other small files. NFS space is provided with a limited quota to discourage the storage of large datasets. Please store data in /projects unless otherwise instructed by ACER support personnel.

Data transfer

The first step is to transfer datasets onto SABER using scp, sftp, or a graphical SSH-based file transfer application.

Note that ACER uses Duo for two-factor authentication (2FA) for secure shell and file transfer operations. Please refer to the ACER Duo page for more information.
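For example, a dataset directory can be copied into project space from the command line as follows (the directory name my_dataset is illustrative; replace NetID and project_ID with your own values):

$ scp -r my_dataset/ NetID@login-1.saber.acer.uic.edu:/projects/project_ID/NetID/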

Operating HOD

Once data staging is complete, the user can log in and begin running jobs. There are two modes of operation: interactive and batch.

Logging in

HOD is instantiated through one of SABER’s two login nodes, login-1.saber.acer.uic.edu and login-2.saber.acer.uic.edu. To access, connect to either node using a command-line or graphical SSH client (such as PuTTY for Windows). For example, to log in from the command line, run the following, replacing n with 1 or 2:

$ ssh NetID@login-n.saber.acer.uic.edu

Configuring the services/hod module

The HOD software stack is made up of dozens of tools and applications. To streamline operation, the entire HOD environment can be loaded through the module command available on all SABER nodes. To load the HOD software environment, run:

$ module load services/hod

As the services/hod module is necessary for both initiating and running HOD, it is essential that the environment be loaded on any login or compute node used on SABER. Therefore, users should add the line

module load services/hod

to their .bashrc or .cshrc file so it is automatically loaded on any node.
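For bash users, this can be done in one step from the command line (csh/tcsh users would edit .cshrc analogously):

$ echo "module load services/hod" >> ~/.bashrc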

Interactive operation

ACER support staff recommends users start with interactive operation of HOD on SABER. This allows the user to (a) view real-time feedback from MapReduce jobs as they run, and (b) debug failed jobs accordingly.

Creating an instance of HOD

With the services/hod module loaded, a user should now be able to instantiate an HOD job using hod create; e.g.:

$ hod create
Submitting HOD cluster with no label (job id will be used as a default label) ...
Job submitted: Jobid 675.admin state Q ehosts

Common parameters used to create an HOD cluster include the following (a combined example follows the list):

  • --dist hadoop_distribution: HOD requires a Hadoop distribution to be identified for execution of MapReduce tasks. By default, SABER HOD uses Hadoop-2.5.0-cdh5.3.1-native when --dist is not specified.
  • --label cluster_label: if a label is not specified, HOD will use the cluster’s associated Torque job ID by default.
  • --job-nodes node_count: while initial testing is best suited to a single node, larger jobs may require several nodes running in parallel, which can be specified with the --job-nodes parameter (analogous to qsub -l nodes=node_count).
  • --job-walltime HH:MM:SS: the default HOD job walltime is 48 hours; if you know you need a longer (or shorter) job duration, use the --job-walltime parameter (analogous to qsub -l walltime=HH:MM:SS).
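For example, to create a four-node HOD cluster labeled mycluster with a 24-hour walltime (the label and values here are illustrative):

$ hod create --label mycluster --job-nodes 4 --job-walltime 24:00:00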

Listing and connecting to HOD clusters

Once you have a cluster created, you can check its status with the hod list command; in the cluster creation example above, the resulting list appears as:

$ hod list
Cluster label   Job ID          State   Hosts             
675.admin       675.admin       R       compute-2-15.ibedr

Assuming your job is in the R (running) state, you can now connect with hod connect using its cluster label; e.g.:

$ hod connect 675.admin
Connecting to HOD cluster with label '675.admin'...
Job ID found: 675.admin
HOD cluster '675.admin' @ job ID 675.admin appears to be running...
Setting up SSH connection to compute-2-15.ibedr...
Welcome to your hanythingondemand cluster (label: 675.admin)
 :
 :

You will also see output providing environment variable and module information.
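Once connected, MapReduce jobs can be run as on any Hadoop installation. As a rough sketch (the examples jar path and the input/ and output/ HDFS paths are illustrative assumptions; the jar’s actual location depends on the Hadoop distribution in use), a word-count job might be launched with:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input/ output/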

Terminating HOD cluster

Once you are done with your HOD session, issue hod destroy with the specific cluster label to conclude your session and remove the job from the Torque scheduler. Failing to terminate an HOD session means you will be charged for the entirety of the cluster’s specified duration.

If the cluster runs for the entirety of its duration, the job will exit on its own; however, the HOD cluster and its label will remain until explicitly destroyed. In either case, you must manually destroy the cluster, especially if you plan to re-use the same cluster label.

For the HOD session example above, the cluster can be terminated by running:

$ hod destroy 675.admin
Destroying HOD cluster with label '675.admin'...
  :

If the HOD cluster is still running, you will be prompted to confirm destruction of the cluster:

  :
Job ID: 675.admin
Job status: R
Confirm destroying the *running* HOD cluster with label '675.admin'? [y/n]: y

Starting actual destruction of HOD cluster with label '675.admin'...
  :

When the cluster is terminated, you will be notified accordingly:

Job with ID 675.admin deleted.
Removed cluster localworkdir directory ... for cluster labeled 675.admin
Removed cluster info directory ... for cluster labeled 675.admin

HOD cluster with label '675.admin' (job ID: 675.admin) destroyed.

Batch submission (non-interactive)

Once MapReduce code has been tested and run successfully through an interactive HOD session, a user may wish to submit jobs in batch mode. This “set it and forget it” mode of operation allows jobs to start automatically as resources are available and without any assistance from the user after submission.

Batch mode clusters are instantiated with the hod batch command. Batch instantiation requires the name of a shell script, specified using the --script=script_filename argument, containing the MapReduce commands that would otherwise be executed interactively; e.g., for batch execution of the shell script my-script.sh:

$ hod batch --script=my-script.sh
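As a rough sketch, my-script.sh simply contains the commands you would otherwise run interactively; the dataset name, jar path, and HDFS paths below are illustrative assumptions, not SABER defaults:

#!/bin/bash
# Stage input data into HDFS, run a word-count MapReduce job, and retrieve the results.
hdfs dfs -mkdir -p input
hdfs dfs -put my_dataset/* input/
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output
hdfs dfs -get output ./results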

Troubleshooting

If you run into problems with HOD, please verify the following before contacting support:

  • Command-line syntax. Review your hod command line to ensure you are specifying arguments and parameters correctly. A full list of hod options is available in the hanythingondemand command-line interface reference on GitHub.
  • ACER account balance. If your job exits immediately, you may not have sufficient funds remaining in your ACER account. Please contact support if you believe this is the case.

If you still encounter difficulties, please contact ACER Support at acer@uic.edu. Please be sure to include as much relevant information as possible, including any associated job IDs, submit script paths, and hod commands.
