Python on the ACCRE cluster

From ACCRE Wiki

More information: Python Official Website

Python is an interpreted programming language that has become increasingly popular in high-performance computing environments because it’s available with an assortment of numerical and scientific computing libraries (numpy, scipy, pandas, etc.), relatively easy to learn, open source, and free.

Python versus Anaconda

On its own, the reference implementation of the Python language is poorly suited to scientific computing as it is not compiled to machine instructions and will perform large calculations slowly. However, with the help of several libraries one can perform very fast computations.

To manage these libraries, Python includes standard support for installing additional packages from the internet with a tool called pip. Users can further maintain multiple “virtual environments” on a single machine, each with different packages installed, using a module called venv.

On the ACCRE cluster, highly optimized versions of the Python interpreter are available through Lmod, along with many commonly used scientific libraries that have been similarly optimized for the specific Intel CPU on each machine. We recommend using the optimized Python builds along with the venv and pip tools to install additional user-specific packages if required.

An alternative system for managing scientific Python libraries is the Anaconda distribution. This distribution system provides a pre-compiled Python interpreter already packaged with many popular scientific libraries. Anaconda users typically use a different tool, conda, for managing installation of additional packages and creating “virtual environments” on a single machine.

The downside to using Anaconda on the ACCRE cluster is that the interpreter and packages provided are not specifically optimized for ACCRE hardware and will not execute code as quickly as the optimized Python build provided through Lmod. In addition, we have found that conda-based virtual environments will often aggressively cache package archives and quickly exhaust a user’s filesystem quota in their home directory. For these reasons, ACCRE users are encouraged to use the optimized Python builds over the Anaconda distribution. However, we do understand that certain workflows may require packages only available with conda, or that researchers may not have the time to convert existing projects. We therefore provide limited support for Anaconda by making several versions of the distribution available through Lmod.

Using ACCRE Optimized Python

ACCRE administrators have hand-compiled multiple versions of Python that are linked against highly optimized linear algebra libraries such as OpenBLAS and Intel’s MKL, and that will therefore generally yield better performance (faster execution time) than the default system version of Python. To see a list of installed versions of Python on the cluster, use Lmod’s spider command:

[bob@gw343 ~]$ ml spider Python

----------------------------------------------------------------------------
  Python:
----------------------------------------------------------------------------
    Description:
      Python is a programming language that lets you work more quickly and
      integrate your systems more effectively. 

     Versions:
        Python/2.7.12
        Python/2.7.14
        Python/3.5.2
        Python/3.6.3
     Other possible modules matches:
        Biopython  IPython  wxPython

----------------------------------------------------------------------------
  To find other possible module matches execute:

      $ module -r spider '.*Python.*'

----------------------------------------------------------------------------
  For detailed information about a specific "Python" module (including how to load the modules) use the module's full name.
  For example:

     $ module spider Python/3.6.3
----------------------------------------------------------------------------

Per Lmod’s instructions, we can get more information about the installed Python version:

[bob@gw343 ~]$ module spider Python/3.6.3

----------------------------------------------------------------------------
  Python: Python/3.6.3
----------------------------------------------------------------------------
    Description:
      Python is a programming language that lets you work more quickly and
      integrate your systems more effectively. 


    You will need to load all module(s) on any one of the lines below before the "Python/3.6.3" module is available to load.

      GCC/6.4.0-2.28
      Intel/2017.4.196
 
    Help:
      
      Description
      ===========
      Python is a programming language that lets you work more quickly and integrate your systems
       more effectively.
      
      
      More information
      ================
       - Homepage: http://python.org/
      
      
      Included extensions
      ===================
      arff-2.1.1, blist-1.3.6, cryptography-2.1.1, Cython-0.27.1, dateutil-2.6.1,
      decorator-4.1.2, docopt-0.6.2, ecdsa-0.13, joblib-0.11, lockfile-0.12.2,
      netaddr-0.7.19, netifaces-0.10.6, nose-1.3.7, paramiko-2.3.1, paycheck-1.0.2,
      pbr-3.1.1, pip-9.0.1, pycrypto-2.6.1, pyparsing-2.2.0, setuptools-36.6.0,
      six-1.11.0, virtualenv-15.1.0, xlrd-1.1.0

We have compiled Python 3.6.3 with both the GCC compiler and with Intel compilers, linking against Intel’s MKL library, which should yield better performance on our Intel processors. Lmod tells us that we need to load either GCC or Intel as a dependency of Python, so let’s do just that:

[bob@gw343 ~]$ ml Intel/2017.4.196
[bob@gw343 ~]$ ml Python/3.6.3
[bob@gw343 ~]$ python --version
Python 3.6.3
[bob@gw343 ~]$ which python
/accre/arch/easybuild/software/Compiler/intel/2017.4.196/Python/3.6.3/bin/python
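
The same check can be made from inside Python. The following is a minimal sketch that reports the interpreter path and the modules Lmod has loaded, read from the LOADEDMODULES environment variable that Lmod maintains as a colon-separated list. The helper name describe_environment is hypothetical and is not part of any ACCRE tooling.

```python
import os
import sys


def describe_environment(environ=None):
    """Return the interpreter path, version, and list of loaded Lmod modules."""
    environ = os.environ if environ is None else environ
    # Lmod records loaded modules as a colon-separated list in LOADEDMODULES.
    loaded = [m for m in environ.get("LOADEDMODULES", "").split(":") if m]
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "modules": loaded,
    }


if __name__ == "__main__":
    info = describe_environment()
    print("interpreter:", info["executable"])
    print("version    :", info["version"])
    print("modules    :", ", ".join(info["modules"]) or "(none)")
```

If the interpreter path does not point into the Lmod software tree, the Python module was probably not loaded in the current shell.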

In addition to Python, we have compiled a number of commonly used scientific libraries against GCC and MKL. One important library available is numpy:

[bob@gw343 ~]$ module spider numpy/1.13.1-Python-3.6.3

----------------------------------------------------------------------------
  numpy: numpy/1.13.1-Python-3.6.3
----------------------------------------------------------------------------
    Description:
      NumPy is the fundamental package for scientific computing with
      Python. It contains among other things: a powerful N-dimensional
      array object, sophisticated (broadcasting) functions, tools for
      integrating C/C++ and Fortran code, useful linear algebra, Fourier
      transform, and random number capabilities. Besides its obvious
      scientific uses, NumPy can also be used as an efficient
      multi-dimensional container of generic data. Arbitrary data-types can
      be defined. This allows NumPy to seamlessly and speedily integrate
      with a wide variety of databases. 


    You will need to load all module(s) on any one of the lines below before the "numpy/1.13.1-Python-3.6.3" module is available to load.

      GCC/6.4.0-2.28  OpenMPI/2.1.1
      Intel/2017.4.196  IntelMPI/2017.3.196
 
    Help:
      
      Description
      ===========
      NumPy is the fundamental package for scientific computing with Python. It contains among other things:
       a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran
       code, useful linear algebra, Fourier transform, and random number capabilities. Besides its obvious scientific uses,
       NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be 
       defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
      
      
      More information
      ================
       - Homepage: http://www.numpy.org
      

Additional libraries such as mpi4py, pandas, scikit-learn and many others are available. Examples using these libraries can be found in ACCRE’s GitHub repository.

Example Scripts

Running a Python script within a SLURM job is generally straightforward. Unless you are using one of Python’s multi-processing packages, you will want to request a single task, load the appropriate version of Python from your SLURM script, and then pass your Python file to the Python interpreter. The following example runs a simple script under Python 3.5.2 to demonstrate the benefit of writing vectorized Python code:

[bob@gw343 run1]$ ls
python.slurm  README.md  vectorization.py

[bob@gw343 run1]$ cat python.slurm 
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --constraint=skylake
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=python_job_slurm.out

module load Intel/2016.3.210 IntelMPI/5.1.3.181 numpy/1.11.1-Python-3.5.2

python vectorization.py

As you can see, we loaded the numpy module without loading the corresponding Python module first. This is possible because Lmod will automatically load the correct Python module for the selected version of numpy.

[bob@gw343 run1]$ cat vectorization.py
#!/usr/bin/env python
#
# Python script demonstrating vectorized execution
#
from textwrap import dedent
from timeit import timeit
import numpy as np

SETUP = """
import numpy as np
N = int(1e6)
t = np.linspace(-10, 10, N)
x1 = np.zeros(len(t))
x2 = np.zeros(len(t))
"""
NR = 10


def run_native():
    """native, naive, non-vectorized implementation"""
    native = dedent("""
        for i in range(N):
            x1[i] = np.sin(t[i])
    """)
    result = timeit(native, setup=SETUP, number=NR)
    print("native    : {:6.3f}s".format(result))


def run_vectorized():
    """vectorized implementation"""
    vectorized = dedent("""
        x2 = np.sin(t)
    """)
    result = timeit(vectorized, setup=SETUP, number=NR)
    print("vectorized: {:6.3f}s".format(result))


def test_equality():
    """Test equality of the methods, independently of the speed test"""
    N = 10000
    t = np.linspace(-10, 10, N)
    x1 = np.zeros(len(t))
    x2 = np.zeros(len(t))

    for i in range(N):
        x1[i] = np.sin(t[i])

    x2 = np.sin(t)

    if (np.array_equal(x1,x2)):
        print("arrays equal!")


if __name__ == '__main__':
    run_native()
    run_vectorized()
    test_equality()

[bob@gw343 run1]$ sbatch python.slurm 
Submitted batch job 9826773

After waiting a few minutes:

[bob@gw343 run1]$ ls
python_job_slurm.out  python.slurm  README.md  vectorization.py

[bob@gw343 run1]$ cat python_job_slurm.out 
native    : 19.004s
vectorized:  0.112s
arrays equal!

Jupyter Notebooks

Jupyter notebooks (formerly IPython notebooks) enable a user to code interactively in Python from a web browser, with support for inline plotting and equation editing, among many other features. Historically, cluster environments have been used for batch processing rather than interactive processing; however, advances in web-based cluster interfaces have made these environments suitable for interactive coding with Jupyter as well.

On the ACCRE cluster, the preferred method of using a Jupyter notebook is through the ACCRE Visualization Portal. Please refer to the Portal documentation for instructions on starting a notebook server. Jupyter notebook servers run on the ACCRE compute nodes as scheduled SLURM jobs and so users can request whatever resources are needed for their interactive work.

For computationally intensive or otherwise long running tasks, we recommend that the notebook be used only for code development and testing on smaller samples, and that the bulk of the computation be performed in Python scripts submitted as non-interactive batch jobs if possible.
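
As an illustration of moving notebook code into a batch-friendly script, the sketch below wraps a computation in a function and exposes its parameters on the command line. The function mean_sine is only a stand-in for real work, and the --n option is a hypothetical parameter chosen for this example.

```python
import argparse
import math


def mean_sine(n):
    """Stand-in for the real computation: mean of sin(x) over n points in [-10, 10]."""
    total = 0.0
    for i in range(n):
        x = -10.0 + 20.0 * i / max(n - 1, 1)
        total += math.sin(x)
    return total / n


def main(argv=None):
    """Parse command-line options, run the computation, and print the result."""
    parser = argparse.ArgumentParser(description="Batch version of a notebook cell")
    parser.add_argument("--n", type=int, default=1000, help="number of sample points")
    args = parser.parse_args(argv)
    result = mean_sine(args.n)
    print("mean sine over {} points: {:.6f}".format(args.n, result))
    return result


if __name__ == "__main__":
    main()
```

A script structured this way can be tested with a small --n in a notebook or interactive session and then submitted with a larger value from a SLURM script.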

Using Python on GPU Nodes

Python may be used on ACCRE GPU nodes just as it is on normal compute nodes, but additional Lmod packages compiled with CUDA support are available on these nodes. To explore available packages, set up your environment, and test code, it is recommended to use the salloc command to run a short interactive job on a GPU node and develop from the command line interface on that node, for example:

[bob@gw343 ~]$ salloc --account=accre_gpu_acc --partition=pascal --gres=gpu:1 --time=1:00:00
salloc: Pending job allocation 9849882
salloc: job 9849882 queued and waiting for resources
salloc: job 9849882 has been allocated resources
salloc: Granted job allocation 9849882
salloc: Waiting for resource configuration
salloc: Nodes gpu0020 are ready for job
[bob@gpu0020 ~]$ ml GCC/6.4.0-2.28 CUDA OpenMPI Python/3.6.3 TensorFlow
[bob@gpu0020 ~]$ python
Python 3.6.3 (default, Aug  6 2018, 17:13:24) 
[GCC 6.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> quit()
[bob@gpu0020 ~]$ exit
exit
salloc: Relinquishing job allocation 9849882
[bob@gw343 ~]$
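
Before importing a large framework, it can be useful to confirm that the job was actually given a GPU. SLURM typically exports CUDA_VISIBLE_DEVICES for jobs that request GPUs with --gres; that ACCRE's configuration does so is an assumption here. The helper visible_gpus below is a hypothetical name used for this minimal sketch.

```python
import os


def visible_gpus(environ=None):
    """Return the GPU indices exposed to this job, as a list of strings."""
    environ = os.environ if environ is None else environ
    # Assumption: the scheduler sets CUDA_VISIBLE_DEVICES to a comma-separated list.
    raw = environ.get("CUDA_VISIBLE_DEVICES", "")
    return [d.strip() for d in raw.split(",") if d.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("GPUs visible to this job:", ", ".join(gpus))
    else:
        print("No GPUs are visible; was --gres=gpu:N requested?")
```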

Python kernels on Jupyter Notebooks with CUDA support are also available through the ACCRE Visualization Portal. In addition, you can create a GPU desktop on the Visualization Portal to use for code development and testing.

Installing Additional Packages with Virtual Environments

When users need to use Python packages not compiled in the Lmod software stack, our recommended option is to create a Python virtual environment and use pip to install additional packages into that environment.

A virtual environment is a self-contained and independent set of Python packages which can be easily created, modified, and cleanly removed by individual users as needed.

Managing Python Virtual Environments

Before creating or using a Python virtual environment, you should set up your Lmod modules that contain the compiled Python interpreter that will be used in your environment.

[bob@gw343 run1]$ ml Intel/2017.4.196 IntelMPI Python/3.6.3

You may wish to create a named module collection for your set of loaded modules to make it easy to restore your Lmod environment in future sessions or in batch jobs.

A virtual environment can have an arbitrary name and be placed in any directory that you have access to. To create a virtual environment named myvenv, use the command:

[bob@gw343 run1]$ python -m venv myvenv

Note that if you are still using Python 2 you must replace venv with virtualenv in the example above.

This will create a directory myvenv within your current working directory which contains all the files needed for your environment.

To utilize the virtual environment in your session, you must “activate” it with the following command:

[bob@gw343 run1]$ source myvenv/bin/activate
(myvenv) [bob@gw343 run1]$

Notice that the prompt changes to show the active Python virtual environment in parentheses. To exit your virtual environment, use the “deactivate” command:

(myvenv) [bob@gw343 run1]$ deactivate
[bob@gw343 run1]$
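
In a batch job there is no prompt to look at, so a script may want to check for itself whether a virtual environment is active. A minimal sketch, using the standard sys.prefix and sys.base_prefix attributes (the function name in_virtualenv is hypothetical):

```python
import sys


def in_virtualenv():
    """True when the interpreter is running inside a venv or virtualenv."""
    # venv makes sys.prefix differ from sys.base_prefix;
    # older virtualenv releases set sys.real_prefix instead.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base != sys.prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("sys.prefix:", sys.prefix)
```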

ACCRE users may have as many virtual environments as they desire, limited only by their filesystem quotas. When you want to permanently remove a virtual environment, you can simply delete the directory:

[bob@gw343 run1]$ rm -r myvenv

Managing Packages in a Virtual Environment

After activating a virtual environment, no Python packages will be initially installed beyond the Python standard library and any Lmod Python libraries you have loaded. To install additional packages into your virtual environment use the pip install PACKAGE command where PACKAGE is the name of your Python package in the Python Package Index. This will install the package into your virtual environment along with any required dependencies.

To install a specific version of a package into your virtual environment, you can specify the requirement with ==, for example pip install MDAnalysis==0.17.0.

To uninstall a package, use the command pip uninstall PACKAGE. Note that this will not uninstall any dependencies that you installed along with that package.

You can get a list of all installed packages in your environment and their versions with the pip freeze command. This can be exported to a requirements file with pip freeze > requirements.txt.

For reproducibility, you can install a specific set of packages from a previous environment into a new one from an existing requirements.txt file with the command pip install -r requirements.txt.
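
When comparing two environments, it can help to diff their requirements files. The sketch below parses the simple name==version lines that pip freeze usually emits and reports pins that differ. It is not a full requirements parser: editable installs, URLs, and environment markers are ignored, and the function names are hypothetical.

```python
def parse_pins(text):
    """Map package name -> pinned version for each 'name==version' line."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def changed_pins(old, new):
    """Return {name: (old_version, new_version)} for pins that differ."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


if __name__ == "__main__":
    before = parse_pins("MDAnalysis==0.17.0\ngsd==1.7.0\n")
    after = parse_pins("MDAnalysis==0.17.0\ngsd==1.7.0\nPillow==6.1.0\n")
    for name, (old, new) in changed_pins(before, after).items():
        print("{}: {} -> {}".format(name, old, new))
```

In practice the two strings would be read from requirements.txt files exported from each environment.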

Virtual Environment Example

In this example, we will install the Pillow image manipulation library into a virtual environment in order to convert a GIF image and create a blurred JPEG version.

Once finished, the environment will be deactivated and deleted.

[bob@gw343 run1]$ ml Intel/2017.4.196 IntelMPI Python/3.6.3
[bob@gw343 run1]$ python -m venv imagestudy
[bob@gw343 run1]$ source imagestudy/bin/activate
(imagestudy) [bob@gw343 run1]$ pip install Pillow
Collecting Pillow
  Using cached https://files.pythonhosted.org/packages/14/41/db6dec65ddbc176a59b89485e8cc136a433ed9c6397b6bfe2cd38412051e/Pillow-6.1.0-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: Pillow
Successfully installed Pillow-6.1.0
You are using pip version 9.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(imagestudy) [bob@gw343 run1]$ python
Python 3.6.3 (default, Aug  6 2018, 15:58:31) 
[GCC Intel(R) C++ gcc 6.4 mode] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image, ImageFilter
>>> img = Image.open('600-cell.gif')
>>> img.format
'GIF'
>>> img.size
(255, 255)
>>> original = Image.open('peewee.png')
>>> converted = original.convert('RGB')
>>> blurred = converted.filter(ImageFilter.BLUR)
>>> blurred.save('peewee-blurred.jpg')
>>> quit()
(imagestudy) [bob@gw343 run1]$ deactivate
[bob@gw343 run1]$ rm -r imagestudy

Combining Virtual Environments and Lmod Python Libraries

Virtual environments can be combined with compiled Lmod Python libraries such as numpy or scipy. This lets you use the optimized libraries together with additional packages that depend on them but are not currently available in ACCRE Lmod.

As an example, one might wish to use the MDAnalysis package, which depends on numpy, scipy, matplotlib, and Biopython as well as other packages. For the packages available in Lmod, one would like to use the optimized versions, and then download any additional dependencies as well as MDAnalysis itself into the virtual environment.

To do this, first load all Lmod modules that you intend to use:

[bob@gw343 run1]$ ml Intel/2017.4.196 IntelMPI Python/3.6.3 numpy scipy matplotlib Biopython

Now create a virtual environment and pip install the package that you want. One difficulty that you may encounter here is that the latest version of the desired package may depend on newer versions of the Python libraries such as numpy or scipy than are available in Lmod. In the case of MDAnalysis, the latest version requires a newer version of scipy than is provided with the Intel/2017.4.196 toolchain. However, MDAnalysis==0.17.0 is compatible:

[bob@gw343 run1]$ python -m venv mda
[bob@gw343 run1]$ . mda/bin/activate
(mda) [bob@gw343 run1]$ pip install MDAnalysis==0.17.0
Collecting MDAnalysis==0.17.0
  Using cached https://files.pythonhosted.org/packages/91/e0/b12ee57016dffcf8da1f5651745ab981a09bc9e51519b6fa752a1a9a6d0f/MDAnalysis-0.17.0.tar.gz
Collecting gsd>=1.4.0 (from MDAnalysis==0.17.0)
  Using cached https://files.pythonhosted.org/packages/54/e4/d34048ca21c8ac5824c8ff1baa63ef50614b0376a3217714b9576c63414d/gsd-1.7.0-cp36-cp36m-manylinux1_x86_64.whl
Requirement already satisfied: numpy>=1.10.4 in /gpfs22/accre/optimized/sandy_bridge/easybuild/software/MPI/intel/2017.4.196/impi/2017.3.196/numpy/1.13.1-Python-3.6.3/lib/python3.6/site-packages/numpy-1.13.1-py3.6-linux-x86_64.egg (from MDAnalysis==0.17.0)
...
Installing collected packages: gsd, decorator, networkx, GridDataFormats, msgpack, mmtf-python, joblib, MDAnalysis
  Running setup.py install for networkx ... done
  Running setup.py install for MDAnalysis ... done
Successfully installed GridDataFormats-0.5.0 MDAnalysis-0.17.0 decorator-4.4.0 gsd-1.7.0 joblib-0.13.2 mmtf-python-1.1.2 msgpack-0.6.1 networkx-2.3

Notice that for several packages pip will report that the requirement is already satisfied.

If you need a package that requires a newer version of numpy or other library than the package provided by Lmod, then the simplest solution is to only load the required Python modules in Lmod, i.e. ml Intel/2017.4.196 Python/3.6.3, and then to pip install your library and all dependencies into your virtual environment. This will not run as fast as the optimized library but will ensure that there are no compatibility issues.

Python packages loaded by Lmod are visible to Python via the PYTHONPATH environment variable, which means that these will take precedence over versions installed in your virtual environment. For example, if you load numpy version 1.13.1 with the command ml numpy/1.13.1-Python-3.6.3 and then create a virtual environment and install numpy 1.16.4 with pip install numpy==1.16.4, then when you import numpy in the Python interpreter, numpy 1.13.1 will still be the version that is imported. This order of precedence can be reversed by adjusting the PYTHONPATH as follows after loading all Lmod modules and activating your virtual environment. The command prepends the last entry of sys.path, which inside an activated virtual environment is normally that environment's site-packages directory:

(venv) [bob@gw343 run1]$ export PYTHONPATH=$(python -c 'import sys; print(sys.path[-1])'):${PYTHONPATH}
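
To verify which copy of a library Python will actually import, you can ask the import system where it would load the module from. The following is a minimal sketch using the standard importlib.util.find_spec function; the helper name module_origin is hypothetical, and it is intended for top-level module names only.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # "json" is used only because it is always present; substitute e.g. "numpy".
    print("json would be imported from:", module_origin("json"))
```

After adjusting PYTHONPATH, calling module_origin("numpy") should return a path under your virtual environment's site-packages directory if the virtual environment copy now takes precedence, and a path under the Lmod software tree otherwise.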

The recommended order of commands to set up a complex environment where Lmod Python packages are combined with those in your virtual environment is to first load all Lmod modules, then activate your virtual environment, and finally to export the modified PYTHONPATH variable. An example SLURM script with such an environment might look like the following:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=4
#SBATCH --output=python_venv_job_slurm.out

ml Intel/2017.4.196 IntelMPI Python/3.6.3 numpy scipy
source ${HOME}/venv/bin/activate
export PYTHONPATH=$(python -c 'import sys; print(sys.path[-1])'):${PYTHONPATH}

python analysis.py
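
When a job requests several CPUs per task (the SLURM option is spelled --cpus-per-task), the Python script has to start a matching number of workers itself. SLURM normally exports the requested value as SLURM_CPUS_PER_TASK. The sketch below reads that variable to size a multiprocessing pool. The script name analysis.py comes from the example above, but its contents are not shown in this document, so worker_count and parallel_map are hypothetical helpers.

```python
import os
from multiprocessing import Pool


def worker_count(environ=None, default=1):
    """Number of worker processes to start, taken from SLURM_CPUS_PER_TASK."""
    environ = os.environ if environ is None else environ
    try:
        return max(1, int(environ.get("SLURM_CPUS_PER_TASK", default)))
    except ValueError:
        return default  # variable set to something that is not an integer


def parallel_map(func, items, environ=None):
    """Apply func to items using one process per allocated CPU.

    func must be defined at module level so that it can be pickled.
    """
    with Pool(processes=worker_count(environ)) as pool:
        return pool.map(func, items)
```

Sizing the pool from the environment rather than from os.cpu_count() avoids starting more processes than the job was allocated on a shared node.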

When using Jupyter Notebooks in the ACCRE Visualization Portal in conjunction with a virtual environment, all available Lmod Python libraries will be loaded, and then the PYTHONPATH will be modified to ensure that any conflicting libraries you have installed in your virtual environment are used.

Using Anaconda

Anaconda provides an easy-to-use, extended distribution of Python that includes more packages than may be available in the ACCRE Lmod system. However, because Anaconda is distributed as precompiled binaries, Python run via Anaconda is not expected to perform as well as the Python modules compiled and optimized for ACCRE hardware using GCC or Intel MKL.

Anaconda distributions can be accessed via Lmod. Use the command module spider Anaconda to see a list of currently available versions.

Loading an Anaconda distribution into your environment will make all Anaconda distributed packages available to your python interpreter. For example:

[bob@gw343 ~]$ ml Anaconda3/5.0.1
[bob@gw343 ~]$ python
Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy, pandas, bokeh
>>> quit()

Anaconda virtual environments may be created and are compatible with the Jupyter notebook app on the ACCRE Visualization Portal. An introduction to conda virtual environments can be found in the Anaconda documentation. Note that a conda virtual environment is a different system than the standard Python virtual environment system provided with the Python standard library, so commands and concepts will vary. If you do choose to use conda virtual environments, we recommend monitoring your data usage in the hidden ${HOME}/.conda directory, as Anaconda may aggressively cache packages and can easily use up most of your ACCRE quota.
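
One way to keep an eye on that directory is a short script that totals the size of the files beneath it. This is a minimal sketch: the function name directory_size is hypothetical, and the number it reports is the sum of file sizes, which may differ from what the filesystem quota tools count.

```python
import os


def directory_size(path):
    """Total size in bytes of regular files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


if __name__ == "__main__":
    conda_dir = os.path.expanduser("~/.conda")
    size_mib = directory_size(conda_dir) / 2**20
    print("{}: {:.1f} MiB".format(conda_dir, size_mib))
```

A directory that does not exist is reported as 0 bytes, since os.walk yields nothing for a missing path.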

Also note that Python packages we provide on the Lmod software stack are built against our own optimized Python interpreters and as such are incompatible with Anaconda. In order to avoid any incompatibilities, please make sure your environment is cleared with module purge before loading Anaconda.

Contributing New Examples

In order to foster collaboration and develop local Python expertise at Vanderbilt, we encourage users to submit examples of their own to ACCRE’s Python GitHub repository. Instructions for doing this can be found on this page.