Frequently Asked Questions

From ACCRE Wiki

Cluster Accounts

How do I change my ACCRE account password?

After you have logged in to the cluster (login.accre.vanderbilt.edu) using your existing password, type in accre_password change and follow the instructions. Your password must meet the following requirements:

  1. Your ACCRE password must be at least 14 characters long and no more than 4096 characters.
  2. Your password must receive a perfect score on the ZXCVBN test. If you need a password that can pass the ZXCVBN test, type in accre_password generate.
  3. In addition to uppercase letters, lowercase letters, numbers and symbols, you may use non-ASCII characters (including emoji and characters in Chinese, Japanese, or Korean). However, you may have difficulty logging in later with different keyboards or terminal utilities. All characters must be printable and in UTF-8.

Your password change will take effect immediately.

I’ve forgotten my password; what is the procedure to reset it?

Notify us by submitting a helpdesk ticket. The script we run to reset your password propagates the change out to the cluster and sends you an e-mail with your new password. This normally takes a few minutes, but we ask that you wait at least 15 minutes before attempting to log in. As soon as you receive the e-mail with your new password, please follow the procedure above to change it to something of your choosing.

Does ACCRE have a Jupyter cluster that includes tools like HDFS, MapReduce, Spark, and so on?

Yes! Our Jupyter cluster is available to all students, faculty and staff at Vanderbilt University, as well as VUMC employees and members of the Compact Muon Solenoid (CMS) experiment. You do not need a traditional cluster account to use the Jupyter cluster. You can find more information about the environment here.

Can I use my Vanderbilt password to log in to ACCRE?

Your ACCRE login is separate from the Vanderbilt username and password you use for email and other services. It should not be used anywhere else.


Connectivity

I cannot connect to the cluster, am experiencing intermittent connectivity to the cluster, or the system hangs upon log on. What should I do?

If you are normally able to connect and suddenly cannot, let us know by submitting a helpdesk ticket. Please provide as much information about the issue as possible including any useful output to your screen. Besides occasional network problems, there are a number of possible causes for sluggish to zero connectivity. Please read the following to help self-diagnose before submitting a help desk ticket so we are better able to assist you:

  • You may connect to the cluster only via a Secure Shell (SSH) client. For more information go to Submitting your first job.
  • If you see the following error when trying to connect: ssh: connect to host login.accre.vanderbilt.edu port 22: Operation timed out it means the gateway you’re trying to connect to is unreachable. The cluster has roughly 6 x86 gateway machines (See node configuration). There are several reasons why we have multiple gateways of each architecture. For example, this distributes the user load across many login gateways. Another main reason is to protect against an unreachable gateway preventing a connection to the cluster. However, even with this backup system in place it is still possible when you ssh to login.accre.vanderbilt.edu to get an error similar to the above. The login.accre.vanderbilt.edu hostname is only an alias which uses DNS round-robining to select one of the actual gateway machines to connect to. If you get the above error, what has likely happened is that either the local DNS cache on your system or the DNS server you use has cached an alias to a gateway which is now unreachable for some reason. If this occurs, you should simply select one of the actual gateways at random and attempt to ssh directly to it. For example, ssh vmps13.accre.vanderbilt.edu.
  • If you can connect but the connection “hangs” before you receive a command-line prompt, your problem may be related to an error in one of your login files (e.g., .bashrc in your home directory). We can help diagnose this. Please submit a helpdesk ticket.
  • If you can connect but your login “hangs”, it is possible we are experiencing a problem with GPFS (the General Parallel File System designed by IBM). Other symptoms of this include logging on but not being able to see, for example, your home directory. Sometimes the file system problem is temporary, lasting only a moment. Larger file system problems normally occur when the system is overloaded, which can happen for various reasons. If the problem is found to be caused by a particular user account or set of jobs, we immediately work with the user to resolve it. Sometimes the issue is not related to user software or the way a user is submitting jobs and we have to work with IBM to determine the root cause of the file system problem. When we expect the issue cannot be resolved quickly we notify all users to expect intermittent cluster access. In most cases when the system is in this state you should still be able to accomplish your work, although you may find the system is occasionally sluggish and intermittently nonresponsive. Please be patient; with repeated attempts you should be able to log on and submit jobs. If these jobs are not heavily dependent on disk I/O they should continue running.

In any case, so we can immediately begin resolving the issue, please notify us ASAP of any connectivity problems by submitting a helpdesk ticket. Include details such as a “cut and paste” of the information in your login window if you are able.


I cannot connect to login.accre.vanderbilt.edu, or after logging in I get an error message saying my home directory is not found.

login.accre.vanderbilt.edu is a DNS round robin alias for one of our ~6 cluster gateways. It is possible that the gateway you were randomly assigned is experiencing a hardware issue. When this happens we take the gateway out of the rotation but DNS may cache the old information for a period of time. Please submit a helpdesk ticket for assistance.

I'm receiving a long error message about locales whenever I log in.

The error message in question looks like this:

To list useful cluster commands type:      accre_help
To view your current storage type:         accre_storage
To list basic Linux commands type:         commands101
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
[[ repeats itself a number of times ]]

When you run ssh on a Unix-based console (Mac, Linux, or Windows Subsystem for Linux), certain environment variables (such as LANG) are forwarded to the server you are connecting to. In particular, if you have LANG set to C.UTF-8 on your local machine and you run ssh with the default configuration, then you will connect with LANG set to C.UTF-8, which triggers the Perl locale warning above even if your .bashrc file is empty. We believe this is the default for Ubuntu 18.04 running under Windows Subsystem for Linux. To fix this, clear the LANG environment variable using unset LANG, or use env LANG= ssh login.accre.vanderbilt.edu to log in.
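If you would rather not type this each time, you can wrap it in a shell alias on your local machine; a minimal sketch (the alias name and username are illustrative):

alias accre='env LANG= ssh vunetid@login.accre.vanderbilt.edu'   # add to your local ~/.bashrc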

How can I make a scheduled downtime work for me?

As a scheduled downtime for the cluster approaches, more time becomes available for shorter jobs. Thus, if you have applications that take a few days or less to run, you will be able to execute more of these jobs as a scheduled downtime approaches, because applications requiring longer periods of time will not be running. It’s an excellent time to take advantage of extra computing cycles that would ordinarily not be available!

How do I mount a Samba (smb) share?

If you have been assigned a Samba or smb share to mount locally the following instructions should help. Note you should have been given a share name and you should have a Samba password that is different from your cluster or VUnetID password:

On a Mac, open Finder. On the menu bar at the top, select Go > Connect to Server… (⌘K). In the Server Address field, enter the following:

smb://samba.accre.vanderbilt.edu/*sharename*/

where sharename is the name of the Samba share you were assigned. Next click Connect and you will be prompted for a Name and Password. This will be your cluster username and your Samba password. At this point hit Connect again and your share will be mounted as a drive in Finder.

On a Windows PC, open File Explorer by hitting the Windows key, typing explorer, and pressing Enter. In the File Explorer window that opens, right-click your computer in the left pane (typically This PC or My Computer) and click Map Network Drive in the menu that appears. In that window, select the drive letter you want assigned to the mount and enter the following where it says Folder:

\\samba.accre.vanderbilt.edu\*sharename*

where sharename is the name of the Samba share you were assigned. Ensure that the “Reconnect at sign-in” box is checked. Click Finish and you will be prompted for a username and password. Enter the following:

samba.accre.vanderbilt.edu\*username*

where *username* is your cluster username. For the password, use your Samba password and press Enter or click OK. At this point your share will be mounted as a drive in File Explorer.

I can't SSH into the cluster. I'm getting an error message: "No matching cipher found"

That error means that we have updated our SSH servers to disable weak ciphers, and you are using an older SSH client that cannot support strong ciphers.

The fix is to upgrade the SSH client on your computer. Below are update options for common scenarios. There are people connecting to the cluster with SSH in other ways (one group uses a Javascript SFTP client inside the Atom text editor), so if you're one of these edge cases, contact us for more info.

General:
Many operating systems include an SSH client out of the box, so try updating your OS to get a newer client.

  • Windows: Use Windows Update to pull down all updates, rebooting as necessary. The current version of SSH on Windows 10 (as of 2021-04-27) is OpenSSH 7.7.
  • Linux: Most versions of Linux should support our current config out-of-the-box. In particular, Centos 7/8 and Ubuntu 18.04/20.04 shouldn't have any problems. If you're using an older OS release like RHEL 6, you need to either upgrade the OS to a newer release, or download the OpenSSH source and compile new binaries.
  • Mac OS: If your hardware supports it, upgrading to a newer version of macOS is recommended. If it does not, we recommend installing Homebrew (https://brew.sh) and using it to install a newer version of OpenSSH.
  • Android/iOS: We have tested Termius and found it to work well on both.

Application links for updates:
PuTTY: https://www.chiark.greenend.org.uk/~sgtatham/putty/
FileZilla: https://filezilla-project.org
Bitvise: https://www.bitvise.com/ssh-client
Termius: https://termius.com
MobaXterm: https://mobaxterm.mobatek.net
Cyberduck: https://cyberduck.io
sFTP: https://www.sftpapp.com

For advanced users who would like to harden their own servers: we have hardened our SSH server using ssh-audit.
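If you are unsure whether your client is the culprit, OpenSSH clients can report their own version and supported ciphers from your local terminal:

ssh -V          # print your OpenSSH client version
ssh -Q cipher   # list the ciphers your client supports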


Environment

How do I change my default shell?

Once you log onto the cluster (login.accre.vanderbilt.edu), type: accre_chsh. You will need to enter your ACCRE password and then follow the instructions to select a new shell from the list of supported options. You may need to wait up to 20 minutes for all ACCRE servers to update this setting.

How do I display graphics from the cluster to my local machine?

The easiest way to run graphical applications is by setting up a virtual desktop within the ACCRE Visualization Portal (after you log in, go to Interactive Apps > ACCRE Desktop).

The X Window Server lets you run applications on ACCRE that have a graphical user interface. While it may come in useful at times, applications running on an X connection are slower than if they are running locally. For this reason we do not support this or recommend using it (especially from outside campus) unless there are no alternatives.

If you must use the X Window Server, you should first check with your PI before installing the following software, since one or both of these may already be on your system, especially if you’re using a computer in your lab which is already configured to run on the cluster.

  1. Get X server support on your local machine: The graphics environment on the cluster is X11; therefore, you must install and run an X server on your local machine.
  2. Configure SSH tunneling: You must tell SSH on your local machine to allow the display of graphics from software running on the cluster.

Windows users: Xming X Server is one of the best X Window servers available for Windows. You can follow the instructions provided there to install and set up the server.

Mac OS X users: You can get a free X11 server from Apple. Mac OS X should already have SSH installed.

  1. Follow their directions to install and run the X11 server.
  2. Launch the X11 server.
  3. Run an xterm.
  4. When you log on to the cluster from the command line in the xterm, to activate SSH tunneling you can use the -X option, i.e., ssh -X user@login.accre.vanderbilt.edu.
  5. Finally, to test that X11 forwarding is installed and working correctly, try typing xeyes on the command prompt and hitting Enter/Return. You should see a window appear with two eyes that follow your cursor.

Linux users: We assume you are already running an X server and have SSH installed.

  1. When you log on to the cluster, ssh -X will activate SSH tunneling, i.e., ssh -X user@login.accre.vanderbilt.edu.
  2. To test that X11 forwarding is installed and working correctly, try typing xeyes on the command prompt and hitting Enter/Return. You should see a window appear with two eyes that follow your cursor.
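If you use X forwarding regularly, you can enable it for ACCRE in your local OpenSSH configuration instead of passing -X every time; a minimal sketch (the Host alias and username are illustrative):

# ~/.ssh/config on your local machine
Host accre
    HostName login.accre.vanderbilt.edu
    User your_vunetid
    ForwardX11 yes

With this in place, ssh accre opens a connection with X11 forwarding enabled.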

I am running an X server, how do I fix X connection or .Xauthority file errors?

If you are getting error messages similar to these:

/usr/X11R6/bin/xauth: error in locking authority file /home/user/.Xauthority

X11 connection rejected because of wrong authentication. X connection to
local host:11.0 broken (explicit kill or server shutdown)

try removing the .Xauthority file in your home directory, then log out and back in. This file occasionally becomes corrupted. When you log back in and start X, it will recreate your .Xauthority file. Sometimes you have to do this a few times. If you continue to have problems, please submit a helpdesk ticket.
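For example, the steps described above are simply:

rm ~/.Xauthority   # remove the possibly corrupted file
exit               # log out; a fresh .Xauthority is created at your next login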


Linux

What command do I need to type in order to run an executable in Linux?

To execute a program in the current working directory, type:

./<file_name>

For files that are not in the current working directory, use the full path: /path/to/your/executable/file

How do I change the group associated with a file?

You can change the group of a file if you are the file’s owner and you are in the group to which you are trying to change the file. The command is:

chgrp [options] group_name file_name

  • -R: recurse through subdirectories
  • -f: suppress most error messages

If you want to submit a job from a group other than your primary group, please see Submitting Jobs.
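For example, to change the group of an entire directory tree recursively (the group and directory names are illustrative):

chgrp -R my_lab_group ~/project_data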


Jobs

What types of nodes are available?

We currently have nodes with 8 to 16 CPU cores and 24 to 256 GB of RAM. We also have a group of nodes equipped with modern NVIDIA GPUs. Please refer to our Intro to the Cluster slides for more details.

How do I run test jobs?

We allow users to run very short (< 30 minutes) tests that have low memory usage (< 1 GB). Anything more should be submitted to the scheduler. We have a debug SLURM partition/queue available for running quick tests and prototyping.
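A minimal sketch of submitting a quick test through the debug partition mentioned above (the script name and resource values are illustrative):

sbatch --partition=debug --ntasks=1 --mem=500M --time=00:15:00 my_test.slurm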

What are the ACCRE cluster defined attributes I can use in my SBATCH scripts corresponding to the available node properties?

The properties of our compute nodes can be specified with combinations of available attributes (defined by us), e.g. haswell and sandy_bridge. Note that the haswell attribute requests the latest Intel processors, while sandy_bridge requests the previous generation. In your batch script you could specify: #SBATCH --constraint=haswell. This would instruct the scheduler to run the job only on a node with an Intel Xeon Haswell processor. Note that your job may take longer to start when these attributes are included, as you are limiting the pool of resources the scheduler can choose from. For a full list of available features, try running the sinfofeatures command while logged into the cluster.

Can I run on the gateway machines?

When you log on via login.accre.vanderbilt.edu, you are logged onto a gateway machine. From here you submit your jobs which are sent to the compute nodes by the scheduler. However, we do allow you to run very short (< 30 minutes) test jobs that have low memory usage (< 1 GB) on the gateway machines, as long as such jobs do not slow the gateway for other users. Anything longer than this should be submitted to the compute nodes using sbatch (see the tutorial).

What happens if my job uses more resources than requested?

The job scheduler will automatically kill most jobs which exceed the resources requested in the SBATCH script. For example, if you specify a walltime of 4 hours and your job runs over that, the scheduler will kill the job. The reason for this is that running jobs which use more resources than requested may affect the scheduling and running of other jobs. This is because the scheduler relies on SLURM specifications (among other parameters) to determine on which nodes to run jobs. Also read our job scheduler policies for more information on killing jobs which are interfering with other jobs or the system itself. When testing code or running code you are unfamiliar with, you should more diligently monitor the resource consumption to fine tune your SBATCH request. Specifying much more, e.g., walltime or mem, than your job requires may delay its start time if the requested resources are not immediately available. Therefore, you should start somewhat conservatively, then reduce your resource specifications once you’ve determined what you are really using, still always leaving a buffer to ensure the job is covered. Learn more about how to request resources and the SBATCH defaults when you submit a job. Learn how to monitor and check the status of a submitted job.

Why is my eligible job waiting so long in the PENDING state?

There are several things you should check to understand your wait time in the queue. See tips on checking the status of a submitted job.

  • Make sure you have requested an allowed set of resources. Check your SBATCH script against both the available nodes in the cluster and our job scheduler policies. You can also check the resources requested with the command: scontrol show job <job_number>
  • Check your group’s current usage by typing qSummary -g group_name. Compare that to your group’s bursting limits by running showLimits -g group_name. If your group’s current usage is close to or equal to its bursting limits, this could be causing delays. Details about both of these commands can be found on our SLURM group commands documentation page.
  • Check overall cluster utilization with the SlurmActive command.
  • Check the queue and current usage on the cluster. It could be the particular resources your jobs need may be heavily utilized, even if the entire cluster is not. You can check the total usage of the cluster with the command squeue. You can also see current and past utilization levels on this website.
  • Your account or group account may be running over its fairshare. This means when the cluster is very busy, other jobs from accounts which are under fairshare may be assigned higher priority and may jump ahead of your job in the eligible queue. Use the showLimits -g group_name command to check your fairshare.

If you still do not understand why your jobs are not starting more quickly, please submit a helpdesk ticket.

What does job status Deferred mean?

In SLURM there is no “deferred” state. However, jobs may ask for resources that cannot be provided, e.g., too much memory. In such cases, run squeue and look for your job ID; SLURM will provide a short explanation of why the job either cannot run or is not running. Run squeue -u <username> to see the explanation.

What is the maximum number of jobs I can submit or have running at any one time?

“Active” Limits: Each user/group/account has a limit on the number of processors in use at any time. This number is summed from any combination of single and multi-processor jobs. Additional limits are placed as necessary on groups regularly running either medium (defined as 4 to 7 days) or long (over 7 days) jobs if their usage is impacting the ability of other groups to use their full fairshare. Individual groups may also request upper limits on their users. New guest users have upper job limits until they have attended the Introduction to the Cluster and Job Scheduler classes. Use the showLimits command to check your group’s limits. Please refer to the job scheduler policies for additional important details of these limits.

What is the maximum allowed “wall clock time” I may specify?

The maximum allowed walltime is 14 days, or in hh:mm:ss, 336:00:00. Your job will not start if you have specified a walltime greater than this. You may reduce the walltime of an already submitted job using scontrol (the SLURM job control command). In addition, we ask that, except for a small number of test jobs, jobs run at least 30 minutes, and over an hour in length is preferable. Our job scheduler policies explain more on this subject. Also see How to Submit Basic Jobs for other SBATCH specifications and how to deal with very short jobs.
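For example, to reduce the time limit of a job that is already queued (the job ID and new limit are illustrative):

scontrol update JobId=1234567 TimeLimit=2-00:00:00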

How do I hold/release/delete a job?

A user may place a USER hold on any job they own. To do that, type: scontrol hold <jobId>. To release the held job, type: scontrol release <jobId>. Note that you can only hold and release jobs that are pending (i.e. this will not work for running jobs). A user can also delete a queued/running job using the command: scancel <jobId>. To delete all the jobs owned by the user, type: scancel -u <userid>. To cancel a job by name, type: scancel --name <jobName>.

Where can I find detailed documentation on all SLURM commands?

Please visit SLURM for a complete and detailed list of all SLURM commands.

How can I delete many jobs at once?

If you are using bash, the following script shows how to delete all jobs between 10000 and 10010:

    for jobid in $(seq 10000 10010); do
        scancel ${jobid}
        echo "cancelling ${jobid}"
    done

How much memory is available on each node?

Because the OS and other system processes (e.g. GPFS management) already use a certain amount of memory, not all physical memory is available for running jobs. In general, ACCRE nodes contain anywhere from 22 GB to 248 GB of available memory for jobs to use.

How do I request a node for exclusive usage?

To request a node for your private use, use something like the following:

#SBATCH --ntasks=12
#SBATCH --exclusive

in your job submission script. If the job does not require exclusive access to the node (it just needs 8 cores), you can still use:

#SBATCH --ntasks=8

Note that in this case, the job can be assigned to an 8-core, 12-core, or 16-core node that is shared with other jobs. Note that requesting exclusive access to a compute node may result in longer queue times.

Do ACCRE compute nodes support hyperthreading?

All ACCRE CPUs support 2-way simultaneous multithreading (SMT), such as Intel hyperthreading. If you request 2 tasks/cores in your SLURM job, SLURM will allocate 2 physical cores (or 4 logical cores) to your job. However, you must decide whether to make use of hyperthreading or not and instruct your program accordingly. We leave hyperthreading enabled on all but our GPU nodes. Many multi-processor applications can take advantage of hyperthreading to run in significantly less time. Please see this link for more information on hyperthreading.

If I belong to multiple groups, how can I define the group name under which my job is to run on the cluster?

You can add the following line in your SBATCH script: #SBATCH --account=mygroup. Here, mygroup is the group name that you want the job to run under.

How do I checkpoint my job?

If your job runs more than a few hours, it is a good idea to periodically save output to disk in case of failure. We currently do not provide any checkpointing integration through SLURM, so any checkpointing must be performed directly from a user's application.

How do I use local storage on a node?

In some scenarios it may be advantageous to read or write data to a compute node’s local hard disk, rather than to/from our parallel filesystem (/home, /nobackup, and /data are all stored on the parallel filesystem). One common example is if you will be reading or writing to/from a file frequently. Each compute node has a world-readable/writeable directory at /tmp. If you want to move files to this local storage, we recommend creating a subdirectory at /tmp and then copying data to it before launching a program that will read these data. Note: a program must know where to find these data, so you generally must provide an absolute path to the file from within your program. Please be sure to clean up your data at the end of your job (using the mv or rm commands). Below is an example of how this might be done within a SLURM job:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=4G
#SBATCH --time=4:00:00
#SBATCH --output=myjob.txt

localdir=/tmp/myjob_${SLURM_JOBID}
tmp_cleaner()
{
rm -rf ${localdir}
exit -1
}
trap 'tmp_cleaner' TERM

mkdir ${localdir} # create unique directory on compute node
cp mydata.txt ${localdir} # copy data to node
./run_my_prog # run program that reads/writes to/from local disk
rm ${localdir}/mydata.txt # remove data from local disk
mv ${localdir}/output.txt ./ # move results to working directory on GPFS

We've added the tmp_cleaner() function above because otherwise the data would remain in /tmp if the job were cancelled. The function intercepts the SIGTERM signal and deletes the data before the job ends. The cleanup must complete within the 30 seconds of grace time that SLURM gives jobs before forcibly killing all of the job's processes.

A SLURM command fails with a socket timeout message. What’s the problem?

Occasionally when you attempt to run a SLURM command (e.g. sbatch, salloc, squeue) the command may hang for an extended period of time before failing with the following message:

error: slurm_receive_msg: Socket timed out on send/recv operation
error: Batch job submission failed: Socket timed out on send/recv operation

This error results when the SLURM controller is under a high amount of stress. Avoiding this error requires all cluster users to play nice and follow cluster etiquette policies. Specifically, all cluster users are encouraged to

  1. submit a large number (>100) of similar jobs as job arrays (see our SLURM documentation page for examples),
  2. avoid submitting a large number (>100) of short jobs (< 30 minutes), and
  3. avoid frequently calling SLURM commands like squeue and scontrol for automated job submission and monitoring.

Job arrays reduce the load on the scheduler because SLURM only attempts to schedule the entire array once, rather than every element within the array independently. Short jobs produce more “churn” within the job scheduler as it works to allocate, de-allocate, and re-allocate resources at a rapid pace. If you are running a lot of short jobs, please try to bundle multiple jobs together into a single job to put less stress on the scheduler. Similarly, automated monitoring tools can abuse the scheduler by requesting data from SLURM too frequently. Please reach out to us if you need assistance developing alternate methods of monitoring and submitting jobs in an automated manner. The socket timeout error message is generally intermittent, so if you wait a few minutes and try your SLURM command again it may complete immediately.
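As a rough sketch of the job-array approach (the script name, array range, and resource values are illustrative), one hundred similar tasks can be submitted as a single array rather than as one hundred separate jobs:

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
#SBATCH --mem=2G
#SBATCH --output=array_%A_%a.out

# each array element processes a different input file, selected by its array index
./my_analysis input_${SLURM_ARRAY_TASK_ID}.dat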


Disk Space

Determining Disk Space Usage and Quotas

As noted in the cluster disk policies, you have both soft and hard limits on both your home and nobackup directories. To help keep the system running smoothly, you should be in the habit of checking your usage level, especially since hard quota limits are definitive and, due to potential filesystem problems, we may have to either kill jobs or place temporary limits on accounts which exceed their soft limit. Please read our cluster disk policies to understand disk space quotas and the FAQ on how to increase your available diskspace by using nobackup space, requesting a possible temporary quota increase, or purchasing more diskspace. To view your current usage and quota levels, type the command:

accre_storage

For example:

[Screenshot: example accre_storage output]

The left section shows information about your current disk usage on /home and /nobackup, while the right section shows your current file count usage. If the group purchases additional space on /data or /nobackup it'll be listed. Note that the Usage column is your current disk usage, the Quota column is a soft limit, while the Limit column is a hard limit. Definitions for soft and hard limits can be found on our cluster disk policies page. If you are exceeding either your disk space or file count soft quota, the relevant line will be colored yellow, as shown in the example below. Make sure you delete (or compress) files as soon as possible in order to avoid disk I/O errors once the grace period has expired. If a line is colored red, it means you have either hit a hard quota limit, or your soft quota grace period has expired; any attempts to create additional files in the corresponding storage will result in I/O errors.

[Screenshot: accre_storage output with a soft quota warning highlighted]

Note that /home and /nobackup are generally controlled with user or group quotas, while /data is controlled with fileset quotas. A user or group quota is based on the user or group owning a set of files, while fileset quotas are applied directly on files within a parent directory. One instance in which this distinction becomes important is when you are sharing files with a collaborator or labmate. With user-based quotas, if you copy a file into your colleague’s home directory, the file will still count against your quota if the file owner is not changed. You can check the owner of a file by using the ls -l command. One other important detail about quota is data replication. ACCRE currently has data replication set to two for /home. This means that the disk usage of a file stored in /home on the cluster will be approximately twice that of a file outside the cluster. The accre_storage command shows you disk usage without data replication, so the output of du -sh (which shows you disk space of a directory or set of directories, and includes data replication) will differ from the accre_storage command. ls -lh, on the other hand, will show you file sizes without considering data replication, so it will be consistent with the output from accre_storage.

Using /nobackup disk space

You have disk and file allocations available for your use on both your home directory (which is backed up) and on nobackup disk space (which is not backed up). To take advantage of your nobackup disk space, simply cd to that filesystem and create your personal directory. For example:

cd /nobackup
mkdir vunetid
chmod 700 vunetid
cd vunetid

where vunetid is your unique VUNetID, which is also your ACCRE user id. If you are unsure what your VUNetID is, simply type whoami while logged into the cluster to find out. Note that the chmod 700 command is needed to set the appropriate permissions on your nobackup directory so that only you can access it. Note that some ACCRE groups also have their own private shared group /nobackup directories.

If I need more disk space than this, will you temporarily grant a quota increase?

We do not necessarily relax quota restrictions; it depends entirely on the details of your request, and we can discuss your options. Please submit a support ticket explaining why you need a quota increase, how much space you believe you require, and how long you expect to need it. If you need more diskspace for an extended period of time you may purchase it. Please see the details of our cluster disk policies then send us your request.

Will ACCRE restore deleted or lost data?

Yes. Please refer to our policy regarding restoring from backup. Note that files in /nobackup are never backed up.

My network connection to ACCRE is really poor and I have a lot of data that I need to upload to ACCRE (or download from ACCRE). What are my options?

To transfer files between your local machine and ACCRE, we recommend installing and using FileZilla. FileZilla is a simple-to-use client which allows you to use the SFTP protocol to upload and download files between systems. To install FileZilla, simply go to their website and download the client. The following is a beginner’s guide to FileZilla: https://www.ostraining.com/blog/coding/filezilla-beginner/ If you do not want to overwrite unchanged files each time you upload a directory to the cluster, go to Edit > Settings > Transfers > File exists action and change the Uploads setting to Overwrite file if source is newer. With this setting, only files that are newer than the copy on the remote system will be uploaded. Linux and Mac clients could also just use the built-in rsync command.
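For example, a minimal rsync sketch (the local path and username are illustrative); rsync only transfers files that have changed, which helps over a poor connection:

rsync -avh --progress ./my_data/ vunetid@login.accre.vanderbilt.edu:my_data/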

How can I mount NFS, Samba, FTP, SSH, HTTP, and other remote mounts locally? (beta)

To configure this, edit your .bashrc file and add:

source /accre/usr/bin/gvfs_startup.sh

Edit your .bash_logout file and add:

source /accre/usr/bin/gvfs_cleanup.sh

Log out and log back in. Now you should be able to mount remote exports like:
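For example, a Samba share can be mounted and unmounted with gio (the share name is illustrative):

gio mount smb://samba.accre.vanderbilt.edu/sharename      # mount
gio mount -u smb://samba.accre.vanderbilt.edu/sharename   # unmount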

The remote folders will be mounted under ~/gvfs in your home directory. To unmount, use the same syntax for mounting, except change gio mount to gio mount -u. All mounted folders will be automatically unmounted if you log out.

Using local compute node temporary storage

ACCRE allows using the /tmp directory on a local compute or gateway node for the purpose of storing small to medium sized temporary files that will not be needed after a job completes. Local disk storage should be limited to no more than 10GB of space per CPU-core allocated to the job, in order to ensure sufficient space is available to all users. All data stored in /tmp will be deleted after 30 days and will also generally be considered unavailable once a job completes.

Users are responsible for removing all data from /tmp before job completion. You are welcome to write your jobs to remove their temporary files in the manner of your choice, but we do have a tool in place to make temporary file cleanup easier for users. If you run the command source setup_accre_runtime_dir at the beginning of your Slurm script, a secure temp directory will be created for you, and hooks will be set up to ensure that this space is removed when your job completes, even in the case of job failure. The path to your temporary space will then be set in the environment variable $ACCRE_RUNTIME_DIR which you can pass to your executables to store temporary data locally.
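A minimal sketch of how this might look in a job script (the program name and its --scratch-dir option are illustrative):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=2G
#SBATCH --time=01:00:00

# create a secure per-job temp directory on the local node; it is removed
# automatically when the job ends, even if the job fails
source setup_accre_runtime_dir

# pass the local scratch path to your program
./my_program --scratch-dir "${ACCRE_RUNTIME_DIR}"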

Note that storing temporary files on the local node may drastically improve performance over using the network filesystem in certain cases, especially when a job creates a large number of very small files or repeatedly opens and closes small files.


GPU

How do I request a GPU node?

Currently, you must belong to a “GPU” group, such as nbody_gpu, to gain access to one or more GPU nodes. Use the appropriate SBATCH command to submit your job and tell SLURM you want a GPU node, for example: #SBATCH --partition=maxwell. This will place your job on a node with NVIDIA Maxwell Titan X GPU cards. More details about submitting GPU jobs can be found by clicking here.


Software

What research software packages are available on the cluster?

Run module available to see a comprehensive list of available software packages that can be accessed from your environment by using the module load command.

How do I make sure that my perl/python script is using the latest version available on the cluster?

First, add the appropriate package to your environment (e.g. your .bashrc/.cshrc file) with command:

module load PKG1 PKG2 PKG3

Then, use the following line:

#!/usr/bin/env python (or perl)

as the first line of your script. This automatically detects the path to the added perl/python package and uses that version as the interpreter of your script.
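For example (a sketch assuming the Anaconda3 module provides the Python you want):

module load Anaconda3    # puts the module's python first in your PATH
which python             # shows the interpreter that #!/usr/bin/env python will pick up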

I’d like to have some software installed on the cluster. How do I go about doing that?

As much as possible, ACCRE staff are glad to accommodate your needs for software. Of course, the software must be amenable to execution in the cluster environment and (if not open source) you are responsible for taking care of licensing arrangements prior to installation, as well as continued maintenance of the software license. If you’d like to explore the possibility of adding some software to our cluster environment, please submit a helpdesk ticket. Note that we in general recommend that users install software into their cluster home directories. In this way you have complete control over the version of the software, applying updates, and so on. ACCRE staff are more than happy to assist you during this process.

How do I install an R package from source code?

R users should take a look at our Software Page for details and best practices for using R on the ACCRE cluster. Here is an example that uses the nlme package. Login to the cluster, and, if you have not already done so, in your home directory create a directory for your R packages. Here is an example:

mkdir -p R/rlib 

You will also need a tmp directory in your home directory, so do this in your home directory:

mkdir tmp/ 

You will need to first put the appropriate version of R in your PATH environment variable using LMod:

module load GCC OpenMPI R 

Now change to your tmp directory, and download the source code:

cd tmp/
wget http://cran.r-project.org/src/contrib/nlme_3.1-104.tar.gz 

Generally, it will only take a few seconds to download the “tarball”, but sometimes it can take longer. Now start R:

R 

At the R-prompt (denoted by >) tell R where you will keep your packages:

> .libPaths("~/R/rlib") 

Next tell R to install the package:

> install.packages("nlme_3.1-104.tar.gz", repos = NULL, type="source") 

R will now compile and install nlme into your personal R library (~/R/rlib). To test your installation, quit R:

> quit()

Restart R and, at the prompt, enter:

> .libPaths("~/R/rlib")
> library("nlme") 

You should see nlme loaded. You need to remember to add these two lines to any script you feed to R if you intend to use nlme. If you wind up installing many packages you can put the .libPaths("~/R/rlib") command in your .Rprofile. You may now delete the source package:

rm nlme_3.1-104.tar.gz 

What happens if R says that there are needed dependencies? This sometimes happens, and you will need to download and install those packages before installing the one you wanted. Just follow the steps outlined above until you have downloaded and installed all the packages.

How do I download and install an R package from the internet?

R users should take a look at our R Software Page for details and best practices for using R on the ACCRE cluster. Here is an example that uses the Zelig package. Login to the cluster, and, if you have not already done so, in your home directory create a directory for your R packages. Here is an example:

mkdir -p R/rlib 

You will need to add R to your PATH environment variable using LMod:

module load GCC OpenMPI R 

Now start R:

R 

At the R-prompt (denoted by >) tell R where you will keep your packages:

> .libPaths("~/R/rlib") 

Next tell R to install the package:

> install.packages("Zelig") 

R will now give you a list of repositories to download from. Choosing the Tennessee repository seems good. That is choice 80. R will now download, compile and install Zelig into your personal R library, ~/R/rlib. Note that occasionally you may need to pass additional arguments to install.packages() if it needs a library in a nonstandard location. For example, the hdf5r package may need a recent version of the HDF5 library that is available through LMod. In this case you might instead run a command like the following:
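The exact arguments depend on the package and the library it needs; as a purely illustrative sketch (the module name, path, and configure flag are assumptions, not a recipe), it might look something like:

# after loading an HDF5 module in the shell (name illustrative), e.g. module load HDF5
> install.packages("hdf5r", configure.args = "--with-hdf5=/path/to/hdf5/bin/h5cc")

Adjust the path so it points at the h5cc (or h5pcc) wrapper provided by the library you actually loaded.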

How do I install and load an R package from Bioconductor?

R users should take a look at our R Software Page for details and best practices for using R on the ACCRE cluster. Here is an example that uses the goseq package. Login to the cluster, and in your home directory create a directory for your R packages. Here is an example:

mkdir -p R/rlib 

You will need to add R to your PATH environment variable using LMod:

module load GCC OpenMPI R 

Now start R:

R

At the R-prompt “>” tell R where you will keep your packages:

> .libPaths("~/R/rlib") 

Next, point R to the Bioconductor site:

> source("http://bioconductor.org/biocLite.R") 

Next, ask R to get the package, compile and install it in your personal R library (~/R/rlib)

> biocLite("goseq") 

goseq and its dependencies will be downloaded, compiled, and installed. If everything succeeds you will see

* DONE (goseq)

After that, you may get a series of warnings about packages needing to be upgraded. You may ignore the warnings. To test your installation, quit R:

> quit() 

Restart R and, at the prompt, enter:

> .libPaths("~/R/rlib")
> library("goseq") 

You should see goseq and the two dependencies loaded. You need to remember to add these two lines to any script you feed to R if you intend to use goseq. If you wind up installing many packages from Bioconductor you can put the .libPaths("~/R/rlib") command in your .Rprofile.

How do I install a Perl module without root privilege?

You do not need root permission to install a module; you can install a Perl module locally in your home directory. Make a directory called, say, lib/ in your home directory like this:

# first navigate to your home directory
$ cd ~

# now make a directory called lib
$ mkdir lib 

Now you have a directory called ~/lib, where ~ represents the path to your home directory (~ literally means your home directory, but you probably know that already). All you need to do is add a modifier to your perl Makefile.PL command:

$ perl Makefile.PL PREFIX=~/lib LIB=~/lib 

This tells Make to install the files in the lib directory in your home directory. You then just make/nmake as before. To use the module you just need to add ~/lib to @INC. Next, you modify the top of your own scripts to look like this:

#!/usr/bin/perl -w
use strict;

# add your ~/lib dir to @INC
use lib "/usr/home/your_home_dir/lib/";

# proceed as usual
use Some::Module;

How do I check which Python packages are installed?

Python users should check out our Python Software Page for tips and best practices for using Python on the ACCRE cluster. First, make sure you have loaded the correct version of python into your environment (by typing something like module load Anaconda3). You can check the versions of python installed on the cluster by typing module spider Python and/or module spider Anaconda. Once you have done this, next type:

python_pkginfo.py

This will run a script that lists the python packages in your current environment, including any you have installed locally (see next section). python_pkginfo.py also accepts two optional arguments, --ncol (for adjusting the number of columns in the output) and --type (this controls whether installed packages, modules, or both are printed). For example:

python_pkginfo.py --ncol 3 --type both 

would list all installed packages and modules in three columns of output. By default, installed packages are output in two columns.

How do I install a Python module without root privilege?

We encourage users to create virtual environments, either through virtualenv or anaconda, to avoid dependency conflicts between packages. It’s normally not necessary to use sudo to install python modules and packages. When trying to install a python module with pip, if you see an error similar to:

error: could not create
'/usr/local/python2/2.7.4/x86_64/gcc46/nonet/lib/python2.7/site-packages/doc':
Permission denied

you should provide the --user option to pip, e.g.:

pip install word-count --user

For more information, see our python software pages.
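More generally, a minimal sketch of the virtual-environment approach recommended above (the environment name is illustrative):

module load Anaconda3                  # or another Python module
python -m venv ~/envs/myproject        # create a personal environment
source ~/envs/myproject/bin/activate
pip install word-count                 # installs into the environment; no --user or sudo needed
deactivate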

How do I install a Python package from source code?

Python users should check out our Python Software Page for tips and best practices for using Python on the ACCRE cluster. In general, we highly encourage users to install Python packages into a virtual (or conda) environment rather than into a directory in your home directory. Nonetheless, if you have a good reason, here is an example that uses the SQLAlchemy package. First, create a temporary build directory in your home directory:

mkdir -p temp/SQLAlchemy
cd temp/SQLAlchemy 

Download the source code, and untar it:

wget http://pypi.python.org/packages/source/S/SQLAlchemy/SQLAlchemy-0.7.9.tar.gz
tar xzf SQLAlchemy-0.7.9.tar.gz 

You will need to add the appropriate version of python to your environment:

module load Anaconda3

Install the module:

cd SQLAlchemy-0.7.9
python setup.py install --user 

This installs the package to /home/YOUR.VUNETID/.local. All packages installed to that directory are automatically available in your Python environment.

How do I run Matlab/SAS job on the cluster?

Matlab is free to use for all ACCRE users. More information is available here.

In order to run SAS jobs, you must first purchase a license from the ITS software store. Once ITS notifies us of your purchase, you will be added to the relevant group so that you have permission to run the software. Licenses may not be shared among different users; however, with one license you can run multiple jobs at the same time on the cluster.

When trying to install a package in R, I get ‘Warning: unable to access index for repository …’.

From an R interactive session, try:

 > install.packages('package_name', dependencies=TRUE, repos='http://cran.rstudio.com/') 

When trying to install a package in R, I get ‘Warning message: package ‘somepackage’ is not available (for R version 3.0.0)'.

Try using the latest version of R.