Python on the ACCRE cluster
More information Python Official Website
Python is an interpreted programming language that has become increasingly popular in high-performance computing environments because it’s available with an assortment of numerical and scientific computing libraries (numpy, scipy, pandas, etc.), relatively easy to learn, open source, and free.
Python versus Anaconda
On its own, the reference implementation of the Python language is poorly suited to scientific computing as it is not compiled to machine instructions and will perform large calculations slowly. However, with the help of several libraries one can perform very fast computations.
To manage these additional packages, Python includes standard support for easily installing additional packages from the internet with a tool called pip. Users can further maintain multiple “virtual environments” on a single machine with different packages installed using a module called venv.
On the ACCRE cluster, highly optimized versions of the Python interpreter are available through Lmod, along with many commonly used scientific libraries that have been similarly optimized for the specific Intel CPU on each machine. We recommend using the optimized Python builds, along with the venv and pip tools to install additional user-specific packages if required.
An alternative system for managing scientific Python libraries is the Anaconda distribution. This distribution system provides a pre-compiled Python interpreter already packaged with many popular scientific libraries. Anaconda users typically use a different tool, conda, for managing installation of additional packages and creating “virtual environments” on a single machine.
The downside to using Anaconda on the ACCRE cluster is that the interpreter and packages provided are not specifically optimized for ACCRE hardware and will not execute code as quickly as the optimized Python build provided through Lmod. In addition, we have found that conda-based virtual environments will often aggressively cache package archives and quickly exhaust a user’s filesystem quota in their home directory. For these reasons, ACCRE users are encouraged to use the optimized Python builds over the Anaconda distribution. However, we do understand that certain workflows may require packages only available with conda, or that researchers may not have the time to convert existing projects. We therefore provide limited support for Anaconda by making several versions of the distribution available through Lmod.
Using ACCRE Optimized Python
ACCRE administrators have hand-compiled multiple versions of Python that are linked against highly optimized linear algebra libraries such as OpenBLAS and Intel’s MKL, and therefore will in general yield better performance (faster execution time) than the default system version of Python. To see a list of installed versions of Python on the cluster, use Lmod’s spider command:
[bob@gw343 ~]$ ml spider Python
----------------------------------------------------------------------------
  Python:
----------------------------------------------------------------------------
    Description:
      Python is a programming language that lets you work more quickly and integrate your systems more effectively.
     Versions:
        Python/2.7.12
        Python/2.7.14
        Python/3.5.2
        Python/3.6.3
     Other possible modules matches:
        Biopython  IPython  wxPython
----------------------------------------------------------------------------
  To find other possible module matches execute:
      $ module -r spider '.*Python.*'
----------------------------------------------------------------------------
  For detailed information about a specific "Python" module (including how to load the modules) use the module's full name.
  For example:
     $ module spider Python/3.6.3
----------------------------------------------------------------------------
Per Lmod’s instructions, we can get more information about the installed Python version:
[bob@gw343 ~]$ module spider Python/3.6.3
----------------------------------------------------------------------------
  Python: Python/3.6.3
----------------------------------------------------------------------------
    Description:
      Python is a programming language that lets you work more quickly and integrate your systems more effectively.
    You will need to load all module(s) on any one of the lines below before the "Python/3.6.3" module is available to load.
      GCC/6.4.0-2.28
      Intel/2017.4.196
    Help:
      Description
      ===========
      Python is a programming language that lets you work more quickly and integrate your systems more effectively.
      More information
      ================
       - Homepage: http://python.org/
      Included extensions
      ===================
      arff-2.1.1, blist-1.3.6, cryptography-2.1.1, Cython-0.27.1, dateutil-2.6.1, decorator-4.1.2, docopt-0.6.2, ecdsa-0.13, joblib-0.11, lockfile-0.12.2, netaddr-0.7.19, netifaces-0.10.6, nose-1.3.7, paramiko-2.3.1, paycheck-1.0.2, pbr-3.1.1, pip-9.0.1, pycrypto-2.6.1, pyparsing-2.2.0, setuptools-36.6.0, six-1.11.0, virtualenv-15.1.0, xlrd-1.1.0
We have compiled Python 3.6.3 with both the GCC compiler and with Intel compilers, linking against Intel’s MKL library, which should yield better performance on our Intel processors. Lmod tells us that we need to load either GCC or Intel as a dependency of Python, so let’s do just that:
[bob@gw343 ~]$ ml Intel/2017.4.196
[bob@gw343 ~]$ ml Python/3.6.3
[bob@gw343 ~]$ python --version
Python 3.6.3
[bob@gw343 ~]$ which python
/accre/arch/easybuild/software/Compiler/intel/2017.4.196/Python/3.6.3/bin/python
In addition to Python, we have compiled a number of commonly used scientific libraries against GCC and MKL. One important library available is numpy:
[bob@gw343 ~]$ module spider numpy/1.13.1-Python-3.6.3
----------------------------------------------------------------------------
  numpy: numpy/1.13.1-Python-3.6.3
----------------------------------------------------------------------------
    Description:
      NumPy is the fundamental package for scientific computing with Python. It contains among other things: a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, Fourier transform, and random number capabilities. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
    You will need to load all module(s) on any one of the lines below before the "numpy/1.13.1-Python-3.6.3" module is available to load.
      GCC/6.4.0-2.28  OpenMPI/2.1.1
      Intel/2017.4.196  IntelMPI/2017.3.196
    Help:
      Description
      ===========
      NumPy is the fundamental package for scientific computing with Python. It contains among other things: a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, Fourier transform, and random number capabilities. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
      More information
      ================
       - Homepage: http://www.numpy.org
Additional libraries such as mpi4py, pandas, scikit-learn, and many others are available. Examples using these libraries can be found in ACCRE’s GitHub repository.
Example Scripts
Running a Python script within a SLURM job is generally straightforward. Unless you are attempting to run one of Python’s multiprocessing packages, you will want to request a single task, load the appropriate version of Python from your SLURM script, and then pass your Python file to the Python interpreter. The following example runs Python 3.5.2 on a simple Python script demonstrating the utility of writing vectorized Python code:
[bob@gw343 run1]$ ls
python.slurm  README.md  vectorization.py
[bob@gw343 run1]$ cat python.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --constraint=skylake
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=python_job_slurm.out
module load Intel/2016.3.210 IntelMPI/5.1.3.181 numpy/1.11.1-Python-3.5.2
python vectorization.py
As you can see, we loaded the numpy module without loading the corresponding Python module first. This is possible because Lmod will automatically load the correct Python module for the selected version of numpy.
[bob@gw343 run1]$ cat vectorization.py
#!/usr/bin/env python
#
# Python script demonstrating vectorized execution
#
from textwrap import dedent
from timeit import timeit
import numpy as np

SETUP = """
import numpy as np
N = int(1e6)
t = np.linspace(-10, 10, N)
x1 = np.zeros(len(t))
x2 = np.zeros(len(t))
"""
NR = 10

def run_native():
    """native, naive, non-vectorized implementation"""
    native = dedent("""
        for i in range(N):
            x1[i] = np.sin(t[i])
    """)
    result = timeit(native, setup=SETUP, number=NR)
    print("native : {:6.3f}s".format(result))

def run_vectorized():
    """vectorized implementation"""
    vectorized = dedent("""
        x2 = np.sin(t)
    """)
    result = timeit(vectorized, setup=SETUP, number=NR)
    print("vectorized: {:6.3f}s".format(result))

def test_equality():
    """Test equality of the methods, independently of the speed test"""
    N = 10000
    t = np.linspace(-10, 10, N)
    x1 = np.zeros(len(t))
    x2 = np.zeros(len(t))
    for i in range(N):
        x1[i] = np.sin(t[i])
    x2 = np.sin(t)
    if (np.array_equal(x1,x2)):
        print("arrays equal!")

if __name__ == '__main__':
    run_native()
    run_vectorized()
    test_equality()
[bob@gw343 run1]$ sbatch python.slurm
Submitted batch job 9826773
After waiting a few minutes:
[bob@gw343 run1]$ ls
python_job_slurm.out  python.slurm  README.md  vectorization.py
[bob@gw343 run1]$ cat python_job_slurm.out
native : 19.004s
vectorized: 0.112s
arrays equal!
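To put the two timings in perspective, the short sketch below computes the speedup from the numbers reported in the job output. The helper name `speedup` is purely illustrative, and the figures will differ from run to run and from node to node.

```python
# Timings copied from the job output above (seconds for NR = 10 repetitions).
native_seconds = 19.004
vectorized_seconds = 0.112


def speedup(slow, fast):
    """Return how many times faster the `fast` timing is than the `slow` one."""
    return slow / fast


ratio = speedup(native_seconds, vectorized_seconds)
print("vectorized is roughly {:.0f}x faster".format(ratio))
```

For this particular run the vectorized version is about 170 times faster than the explicit loop.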
Jupyter Notebooks
Jupyter notebooks (formerly IPython notebooks) enable a user to interactively code in Python from a web browser with support for inline plotting and equation editing, among many other things. Historically, cluster environments have been used for batch processing rather than interactive processing; however, advances in web-based cluster interfaces have made these environments also suitable for interactive coding with Jupyter.
On the ACCRE cluster, the preferred method of using a Jupyter notebook is through the ACCRE Visualization Portal. Please refer to the Portal documentation for instructions on starting a notebook server. Jupyter notebook servers run on the ACCRE compute nodes as scheduled SLURM jobs and so users can request whatever resources are needed for their interactive work.
For computationally intensive or otherwise long running tasks, we recommend that the notebook be used only for code development and testing on smaller samples, and that the bulk of the computation be performed in Python scripts submitted as non-interactive batch jobs if possible.
Using Python on GPU Nodes
Python may be used on ACCRE GPU nodes just as it is on normal compute nodes, but additional Lmod packages compiled with CUDA support are available on these nodes. To explore available packages, set up your environment, and test code, it is recommended to use the salloc command to run a short interactive job on a GPU node and develop from the command line interface on that node, for example:
[bob@gw343 ~]$ salloc --account=accre_gpu_acc --partition=pascal --gres=gpu:1 --time=1:00:00
salloc: Pending job allocation 9849882
salloc: job 9849882 queued and waiting for resources
salloc: job 9849882 has been allocated resources
salloc: Granted job allocation 9849882
salloc: Waiting for resource configuration
salloc: Nodes gpu0020 are ready for job
[appelte1@gpu0020 ~]$ ml GCC/6.4.0-2.28 CUDA OpenMPI Python/3.6.3 TensorFlow
[appelte1@gpu0020 ~]$ python
Python 3.6.3 (default, Aug 6 2018, 17:13:24)
[GCC 6.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> quit()
[appelte1@gpu0020 ~]$ exit
exit
salloc: Relinquishing job allocation 9849882
[bob@gw343 ~]$
Python kernels on Jupyter Notebooks with CUDA support are also available through the ACCRE Visualization Portal. In addition, you can create a GPU desktop on the Visualization Portal to use for code development and testing.
Installing Additional Packages with Virtual Environments
When users need to use Python packages not compiled in the Lmod software stack, our recommended option is to create a Python virtual environment and use pip to install additional packages into that environment.
A virtual environment is a self-contained and independent set of Python packages which can be easily created, modified, and cleanly removed by individual users as needed.
Managing Python Virtual Environments
Before creating or using a Python virtual environment, you should load the Lmod modules that provide the compiled Python interpreter that will be used in your environment:
[bob@gw343 run1]$ ml Intel/2017.4.196 IntelMPI Python/3.6.3
You may wish to create a named module collection for your set of loaded modules to make it easy to restore your Lmod environment in future sessions or for batch jobs.
A virtual environment can have an arbitrary name and be placed in any directory that you have access to. To create a virtual environment named myvenv, use the command:
[bob@gw343 run1]$ python -m venv myvenv
Note that if you are still using Python 2 you must replace venv with virtualenv in the example above.
This will create a directory myvenv within your current working directory which contains all the files needed for your environment.
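The same environment can also be created from inside Python with the standard library `venv` module. The following is a minimal sketch, not an ACCRE-specific recipe: the helper name `make_venv` is hypothetical, and `with_pip=False` is chosen here only so the example runs quickly and offline, whereas `python -m venv` bootstraps pip by default.

```python
import os
import tempfile
import venv


def make_venv(parent_dir, name="myvenv"):
    """Create a virtual environment called `name` inside `parent_dir`."""
    env_dir = os.path.join(parent_dir, name)
    # with_pip=False skips bootstrapping pip into the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    # Build the environment in a scratch directory and list its contents.
    with tempfile.TemporaryDirectory() as scratch:
        env_dir = make_venv(scratch)
        print(sorted(os.listdir(env_dir)))
```

The listing should include a `pyvenv.cfg` file, which records the interpreter the environment was created from.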
To utilize the virtual environment in your session, you must “activate” it with the following command:
[bob@gw343 run1]$ source myvenv/bin/activate
(myvenv) [bob@gw343 run1]$
Notice that the prompt changes to show the active Python virtual environment in parentheses. To exit your virtual environment, use the “deactivate” command:
(myvenv) [bob@gw343 run1]$ deactivate
[bob@gw343 run1]$
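Besides the prompt, a script can check for itself whether it is running inside a virtual environment. The sketch below relies on the fact that, for environments created with `venv`, `sys.prefix` points at the environment while `sys.base_prefix` points at the underlying interpreter. The function name `in_virtualenv` is hypothetical, and environments made with older releases of the Python 2 `virtualenv` tool may need a different check.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter is running inside a venv."""
    prefix = sys.prefix if prefix is None else prefix
    if base_prefix is None:
        # sys.base_prefix exists on Python 3.3+; fall back to prefix otherwise.
        base_prefix = getattr(sys, "base_prefix", prefix)
    return prefix != base_prefix


print("inside a virtual environment:", in_virtualenv())
```

The optional arguments exist only to make the comparison easy to exercise without actually activating an environment.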
ACCRE users may have as many virtual environments as they desire, limited only by their filesystem quotas. When you want to permanently remove a virtual environment, you can simply delete the directory:
[bob@gw343 run1]$ rm -r myvenv
Managing Packages in a Virtual Environment
After activating a virtual environment, no Python packages will be initially installed beyond the Python standard library and any Lmod Python libraries you have loaded. To install additional packages into your virtual environment use the pip install PACKAGE command, where PACKAGE is the name of your Python package in the Python Package Index. This will install the package into your virtual environment along with any required dependencies.
To install a specific version of a package into your virtual environment, you can specify the requirement with ==, for example pip install MDAnalysis==0.17.0.
To uninstall a package, use the command pip uninstall PACKAGE. Note that this will not uninstall any dependencies that you installed along with that package.
You can get a list of all installed packages in your environment and their versions with the pip freeze command. This can be exported to a requirements file with pip freeze > requirements.txt.
For reproducibility, you can install a specific set of packages from a previous environment into a new one from an existing requirements.txt file with the command pip install -r requirements.txt.
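A requirements file produced by pip freeze is plain text with one `name==version` pin per line, so it is easy to inspect from Python, for instance to compare two environments. The following is a simplified sketch: the function name `parse_pinned` is hypothetical, and it deliberately ignores everything except exact `==` pins (no URLs, extras, or environment markers).

```python
def parse_pinned(requirements_text):
    """Map package name -> pinned version for `name==version` lines."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


# Sample content in the format written by `pip freeze > requirements.txt`.
sample = """\
MDAnalysis==0.17.0
Pillow==6.1.0
# comment lines are ignored
"""
print(parse_pinned(sample))
```

Comparing the dictionaries from two such files shows quickly which packages differ between environments.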
Virtual Environment Example
In this example, we will install the Pillow image manipulation library into a virtual environment, inspect a GIF image, and then convert a PNG image and create a blurred JPEG version of it.
Once finished, the environment will be deactivated and deleted.
[bob@gw343 run1]$ ml Intel/2017.4.196 IntelMPI Python/3.6.3
[bob@gw343 run1]$ python -m venv imagestudy
[bob@gw343 run1]$ source imagestudy/bin/activate
(imagestudy) [bob@gw343 run1]$ pip install Pillow
Collecting Pillow
  Using cached https://files.pythonhosted.org/packages/14/41/db6dec65ddbc176a59b89485e8cc136a433ed9c6397b6bfe2cd38412051e/Pillow-6.1.0-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: Pillow
Successfully installed Pillow-6.1.0
You are using pip version 9.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(imagestudy) [bob@gw343 run1]$ python
Python 3.6.3 (default, Aug 6 2018, 15:58:31)
[GCC Intel(R) C++ gcc 6.4 mode] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image, ImageFilter
>>> img = Image.open('600-cell.gif')
>>> img.format
'GIF'
>>> img.size
(255, 255)
>>> original = Image.open('peewee.png')
>>> converted = original.convert('RGB')
>>> blurred = converted.filter(ImageFilter.BLUR)
>>> blurred.save('peewee-blurred.jpg')
>>> quit()
(imagestudy) [bob@gw343 run1]$ deactivate
[bob@gw343 run1]$ rm -r imagestudy
Figure: peewee.png (left) and peewee-blurred.jpg (right).
Combining Virtual Environments and Lmod Python Libraries
Virtual environments can be used in conjunction with compiled Lmod Python libraries such as numpy or scipy to allow using the optimized libraries in conjunction with additional packages that depend on them but are not currently available in ACCRE Lmod.
As an example, one might wish to use the MDAnalysis package, which depends on numpy, scipy, matplotlib, and Biopython as well as other packages. For the packages available in Lmod, one would like to use the optimized versions, and then download any additional dependencies as well as MDAnalysis itself into the virtual environment.
To do this, first load all Lmod modules that you intend to use:
[bob@gw343 run1]$ ml Intel/2017.4.196 IntelMPI Python/3.6.3 numpy scipy matplotlib Biopython
Now create a virtual environment and pip install the package that you want. One difficulty that you may encounter here is that the latest version of the desired package may depend on newer versions of Python libraries such as numpy or scipy than are available in Lmod. In the case of MDAnalysis, the latest version requires a newer version of scipy than is provided with the Intel/2017.4.196 toolchain. However, MDAnalysis==0.17.0 is compatible:
[bob@gw343 run1]$ python -m venv mda
[bob@gw343 run1]$ . mda/bin/activate
(mda) [bob@gw343 run1]$ pip install MDAnalysis==0.17.0
Collecting MDAnalysis==0.17.0
  Using cached https://files.pythonhosted.org/packages/91/e0/b12ee57016dffcf8da1f5651745ab981a09bc9e51519b6fa752a1a9a6d0f/MDAnalysis-0.17.0.tar.gz
Collecting gsd>=1.4.0 (from MDAnalysis==0.17.0)
  Using cached https://files.pythonhosted.org/packages/54/e4/d34048ca21c8ac5824c8ff1baa63ef50614b0376a3217714b9576c63414d/gsd-1.7.0-cp36-cp36m-manylinux1_x86_64.whl
Requirement already satisfied: numpy>=1.10.4 in /gpfs22/accre/optimized/sandy_bridge/easybuild/software/MPI/intel/2017.4.196/impi/2017.3.196/numpy/1.13.1-Python-3.6.3/lib/python3.6/site-packages/numpy-1.13.1-py3.6-linux-x86_64.egg (from MDAnalysis==0.17.0)
...
Installing collected packages: gsd, decorator, networkx, GridDataFormats, msgpack, mmtf-python, joblib, MDAnalysis
  Running setup.py install for networkx ... done
  Running setup.py install for MDAnalysis ... done
Successfully installed GridDataFormats-0.5.0 MDAnalysis-0.17.0 decorator-4.4.0 gsd-1.7.0 joblib-0.13.2 mmtf-python-1.1.2 msgpack-0.6.1 networkx-2.3
Notice that for several packages pip will report that the requirement is already satisfied.
If you need a package that requires a newer version of numpy or another library than the one provided by Lmod, then the simplest solution is to load only the required Python modules in Lmod, i.e. ml Intel/2017.4.196 Python/3.6.3, and then to pip install your library and all of its dependencies into your virtual environment. This will not run as fast as the optimized library but will ensure that there are no compatibility issues.
Python packages loaded by Lmod are visible to Python via the PYTHONPATH environment variable, which means that these will take precedence over versions installed in your virtual environment. For example, if you load numpy version 1.13.1 with the command ml numpy/1.13.1-Python-3.6.3 and then create a virtual environment and install numpy 1.16.4 with pip install numpy==1.16.4, then when you import numpy in the Python interpreter, numpy 1.13.1 will still be visible. This order of precedence can be reversed by adjusting the PYTHONPATH as follows after loading all Lmod modules and activating your virtual environment:
(venv) [bob@gw343 run1]$ export PYTHONPATH=$(python -c 'import sys; print(sys.path[-1])'):${PYTHONPATH}
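When both an Lmod module and a virtual environment provide the same package, it can be useful to confirm which copy Python will actually import. The sketch below uses the standard library `importlib.util.find_spec` to report the file a module would be loaded from without importing it. The helper name `module_origin` is hypothetical, and `json` is used only as a module that is always present.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# numpy is reported only if it is visible in the current environment.
for name in ("json", "numpy"):
    print(name, "->", module_origin(name))
```

A path under your virtual environment directory means the pip-installed copy wins; a path under the Lmod software tree means the module version is still taking precedence.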
The recommended order of commands to set up a complex environment where Lmod Python packages are combined with those in your virtual environment is to first load all Lmod modules, then activate your virtual environment, and finally to export the modified PYTHONPATH variable. An example SLURM script with such an environment might look like the following:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=4
#SBATCH --output=python_venv_job_slurm.out
ml Intel/2017.4.196 IntelMPI Python/3.6.3 numpy scipy
source ${HOME}/venv/bin/activate
export PYTHONPATH=$(python -c 'import sys; print(sys.path[-1])'):${PYTHONPATH}
python analysis.py
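Inside a script such as analysis.py, the four CPUs per task requested above are not used automatically; the code has to ask for them. SLURM exports the per-task CPU count in the `SLURM_CPUS_PER_TASK` environment variable when that option is set, so one possible approach is to size a worker pool from it. This is a sketch under that assumption, and the helper name `allocated_cpus` is hypothetical.

```python
import os


def allocated_cpus(env=None, default=1):
    """Number of CPUs SLURM granted per task, or `default` outside SLURM."""
    env = os.environ if env is None else env
    try:
        return max(1, int(env.get("SLURM_CPUS_PER_TASK", default)))
    except ValueError:
        # The variable was set to something that is not an integer.
        return default


workers = allocated_cpus()
print("worker processes to start:", workers)
```

The resulting value can be passed to, for example, `multiprocessing.Pool(processes=workers)` so that the job does not start more processes than it was allocated.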
When using Jupyter Notebooks in the ACCRE Visualization Portal in conjunction with a virtual environment, all available Lmod Python libraries will be loaded, and then the PYTHONPATH will be modified to ensure that any conflicting libraries you have installed in your virtual environment are used.
Using Anaconda
Anaconda provides an easy to use, extended distribution of Python including more packages than may be available in the ACCRE Lmod system. However, as Anaconda is distributed as precompiled binaries, it’s expected that using Python via Anaconda might not perform as well as using the Python modules compiled and optimized for ACCRE hardware using GCC or Intel MKL.
Anaconda distributions can be accessed via Lmod; use the command module spider Anaconda to see a list of currently available versions.
Loading an Anaconda distribution into your environment will make all Anaconda distributed packages available to your python interpreter. For example:
[bob@gw343 ~]$ ml Anaconda3/5.0.1
[bob@gw343 ~]$ python
Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy, pandas, bokeh
>>> quit()
Anaconda virtual environments may be created and are compatible with the Jupyter notebook app on the ACCRE Visualization Portal. An introduction to conda virtual environments can be found in the Anaconda documentation. Note that a conda virtual environment is a different system than the standard Python virtual environment system provided with the Python standard library, so commands and concepts will vary. If you do choose to use conda virtual environments, it is recommended to watch your data usage in the hidden ${HOME}/.conda directory, as Anaconda may aggressively cache packages and can easily use up most of your ACCRE quota.
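One way to keep an eye on that directory is to total the file sizes beneath it. The sketch below walks the tree with `os.walk`; the function name `directory_size_bytes` is hypothetical. Because conda frequently hard-links files between its package cache and environments, a simple sum like this can overestimate the space actually consumed, so treat the result as an upper bound.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of regular files under `root` without following links."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.islink(path):
                continue  # do not count symlink targets
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


if __name__ == "__main__":
    conda_dir = os.path.join(os.path.expanduser("~"), ".conda")
    size_gib = directory_size_bytes(conda_dir) / 2**30
    print("{}: {:.2f} GiB".format(conda_dir, size_gib))
```

A directory that does not exist simply reports a size of zero.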
Also note that Python packages we provide on the Lmod software stack are built against our own optimized Python interpreters and as such are incompatible with Anaconda. In order to avoid any incompatibilities, please make sure your environment is cleared with module purge before loading Anaconda.
Contributing New Examples
In order to foster collaboration and develop local Python expertise at Vanderbilt, we encourage users to submit examples of their own to ACCRE’s Python GitHub repository. Instructions for doing this can be found on this page.