SciLo: Long Term Data Archiving

From ACCRE Wiki

SciLo is a long-term archiving service at ACCRE based on the Spectra Logic BlackPearl converged storage solution. With SciLo you can archive data at very low cost and with minimal system administrator intervention. Movement of data from ACCRE (or anywhere) to SciLo and back is accomplished with a command line client (ds3_java_cli) or, if you are using the portal, an available GUI (dsb-gui). Other options are available as well, including Cyberduck and an API with a number of SDKs, depending on your expertise. There is also the Eon Browser GUI from Spectra Logic for Windows, Mac, and Linux. All of this is based on Spectra S3, which uses the standard HTTP S3 command set plus expanded commands designed to optimize moving data objects to and from tape.

ACCRE provides two scripts to help users with upload and download jobs. The scripts wrap the ds3_java_cli command, so users do not need to work out the details of feeding the correct file path information to ds3_java_cli.

Getting Started

Initial sign-up for SciLo requires creation of an account on our BlackPearl and issuance of an ID and key. [[../support/helpdesk.html|Open a helpdesk ticket]] with ACCRE requesting access. You will receive confirmation of the account setup along with the ID and secret key.

Recommended Server

Archiving can take some time, so running on a standard cluster gateway will not work. If your group has a custom gateway connected to the cluster you can use that (we recommend using screen or tmux so that you can log out and come back later).
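For example, a minimal tmux workflow might look like the following (the session name scilo-archive is only an illustration):

~$ tmux new -s scilo-archive     # start a named session on the gateway
# run the archive commands inside the session, then detach with Ctrl-b followed by d
~$ tmux attach -t scilo-archive  # reattach after logging back in later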

We also have a gateway dedicated to archiving in case you do not have a custom gateway to use. When your access is created, your login credentials will work on that gateway, and your ticket will be updated with that server's login information.

You can also back up other sources (not just the cluster), so this can be run from personal servers and desktops.

Environment

Once you have access you will want to add this information to your environment. The environment variables $DS3_ACCESS_KEY, $DS3_SECRET_KEY, and $DS3_ENDPOINT are special variables that ds3_java_cli uses by default. These can be overridden with options (see -a, -k, and -e below) if that fits your workflow better.

~ $ export DS3_ACCESS_KEY=<Assigned s3 id>
~ $ export DS3_SECRET_KEY=<Assigned secret s3 key>
~ $ export DS3_ENDPOINT=archive1.accre.vanderbilt.edu
~ $ export s3bucket=<Assigned bucket>
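If you use SciLo regularly, you may want to persist these settings, for example by appending them to your ~/.bashrc (an illustrative sketch only; since the file will then contain your secret key, keep it readable by you alone):

~ $ cat >> ~/.bashrc <<'EOF'
export DS3_ACCESS_KEY=<Assigned s3 id>
export DS3_SECRET_KEY=<Assigned secret s3 key>
export DS3_ENDPOINT=archive1.accre.vanderbilt.edu
export s3bucket=<Assigned bucket>
EOF
~ $ chmod 600 ~/.bashrc   # restrict the file so other users cannot read the key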

User Scripts

The scripts provide the following features to help the user run the upload/download job:

  1. Checksum verification. During an upload job, a checksum is computed from each original file and stored in the same bucket for later verification. During a download job, the script uses this checksum to verify that the downloaded file is not corrupted.
  2. Multi-core parallelism. Both download and upload jobs can be parallelized across multiple cores. Each core corresponds to one job, and the files to be transferred are distributed among those jobs. The jobs are launched independently, so one job's failure does not affect the others.
  3. A log folder supports reruns. Download/upload jobs can be interrupted for a variety of reasons. The scripts use a log folder that records which jobs finished successfully, so a rerun can pick up the unfinished jobs and avoid repeating completed work.
  4. Local disk is used to avoid GPFS fluctuation. For both uploading and downloading, the user specifies a local directory as a scratch directory (/tmp, for example). The job copies files to the scratch directory and performs the transfer from there, making the process independent of GPFS fluctuation and therefore more robust.
  5. Multiple local disks can be used as scratch directories. Both scripts can spread scratch space across multiple local disks, which improves the efficiency and stability of multi-core parallel jobs.
  6. Upload/download jobs can work with files on any ACCRE-supported storage, including local storage (for example, local folders on the custom gateways), /home, /data, /scratch, /dors, NFS, LStore, etc. For files on LStore, the upload job recursively scans all files within the given directory. In other cases, such as files on GPFS, the upload job uploads only the files directly in the given directory (sub-directories are currently omitted).

For both uploading and downloading jobs, make sure the job has enough scratch space. For example, if the largest file to upload is about 1 TB and the job will use 6 cores, the best option is 6 local disks, each larger than 1 TB, so the I/O load is spread evenly across the disks. Otherwise, make sure the single scratch directory is larger than 6 TB, so that it can hold all of the data while the 6 upload processes run together.
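A quick way to sanity-check scratch capacity before starting (the paths below are only examples) is:

~$ du -sh /home/xyz/test         # total size of the data you plan to archive
~$ df -h /tmp /mnt/d1 /mnt/d2    # compare the 'Avail' column with your largest file size times the core count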

At ACCRE, the best option for SciLo jobs is to use the custom gateways: the jobs can take advantage of the large local disks, and there is no time limit. Make sure the scratch directory is not on shared storage, such as GPFS or NFS folders on the node. If you prefer to run SciLo tasks as a Slurm job, allocate enough time for the job and use the local /tmp folder as the scratch directory.
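If you do use Slurm, a minimal batch script might look like the sketch below (the time, memory, paths, and bucket name are placeholders to adjust; the scilo_put.sh options are explained in the Uploading Files section):

#!/bin/bash
#SBATCH --job-name=scilo_put
#SBATCH --cpus-per-task=3        # match the -n value used below
#SBATCH --mem=8G
#SBATCH --time=2-00:00:00        # allocate generously; tape transfers can be slow

# /tmp is node-local scratch; /home/xyz/test and the bucket name "test" are examples
/accre/common/bin/scilo_put.sh -p /home/xyz/test -b test -s /tmp -n 3 -l $HOME/scilo_logs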

For both upload and download jobs the script needs a folder for storing log-related data. Since the script uses the parallel command to execute jobs concurrently, after a job is done you can check the *.par files in the log folder; these are the log files for the individual upload or download jobs.
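After a run you might inspect the log folder like this (the path is an example):

~$ ls $HOME/scilo_logs/*.par               # one log file per parallel upload/download job
~$ grep -il error $HOME/scilo_logs/*.par   # quick check for jobs that reported errors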

Uploading Files to SciLo

The script scilo_put.sh performs the upload to a SciLo bucket and is available in /accre/common/bin. Before uploading, make sure SciLo is set up correctly in your home directory. ACCRE places a .s3keys file in the user's home directory; this file contains the access key and other information needed to access SciLo. Before uploading, the script tests for the existence of the .s3keys file in your home directory and checks whether SciLo can access the given bucket and retrieve information.
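You can also run a quick sanity check yourself before the first upload (assuming the environment variables from the Environment section are set; the bucket listing should include your assigned bucket):

~$ ls -l ~/.s3keys              # the key file ACCRE placed in your home directory
~$ module load GCC scilo-cli    # see 'Load the Software' below
~$ ds3_java_cli -c get_service  # lists the buckets you can access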

The uploading process has the following parameters:

  • -p (--path): the file path to archive
  • -b (--bucket): the SciLo bucket name to archive into
  • -i (--islstore): whether the input path is an LStore path
  • -s (--scratch): scratch directory (or comma-separated list of directories) for the work
  • -n (--ncores): how many cores to use. You must specify a value; each core runs one upload job.
  • -l (--log): log directory for storing progress information and checksum files. Reuse the same log directory so that a rerun can pick up unfinished jobs and avoid repeating uploads that already completed.

The general form of an upload command is:

scilo_put.sh -p (path for upload) -b (bucket name) -s (scratch path) -n (ncores) -l (log dir)

If the upload job is for LStore files, make sure to include the -i option and give the LStore path without the /lio/lfs prefix (for example, if the LStore path is /lio/lfs/testing/test, the path for the -p option should be /testing/test).

Here is an example of a job that uses multiple local disks as scratch directories:

scilo_put.sh -p /testing/test -i -b test -s /mnt/d1,/mnt/d2,/mnt/d3 -n 3 -l $HOME/test

This example uploads the data files in the LStore folder /lio/lfs/testing/test to the "test" bucket using 3 cores and 3 folders (/mnt/d1, /mnt/d2, and /mnt/d3) on local disks. The log folder is $HOME/test.

This example shows an upload job for data files on GPFS:

scilo_put.sh -p /home/xyz/test -b test -s /tmp -n 3 -l $HOME/scilo_logs

This example uploads files from user xyz's /home/xyz/test folder to the test bucket using 3 cores and a single local folder, /tmp. The log folder is $HOME/scilo_logs.

Downloading Files from SciLo

Unlike the upload job, the download job is divided into two steps. In Step 1 the script generates a file list from the bucket. Because files from different folders are stored together in the same bucket and you may only want to download some of them, the script pulls a file list from the bucket; you can edit this list and then use it for the actual download.

The download script is scilo_get.sh and is located in the same folder, /accre/common/bin. Its available options are as follows:

  • -f (--file): the file (given with its full path) that stores the list of files to download from SciLo
  • -g (--generate): generate the download file list without downloading
  • -b (--bucket): SciLo bucket name for the archive
  • -s (--scratch): scratch directory (or comma-separated list of directories) for downloading files from SciLo
  • -n (--ncores): how many cores to use; you must specify a value
  • -o (--output): output directory to store the downloaded files
  • -l (--log): log directory for storing progress information. Reuse the same log directory so that a rerun can pick up unfinished jobs and avoid repeating downloads that already completed.

To generate the download file list, run the script as follows:

scilo_get.sh -f (file list name) -b (bucket name) -o (the downloaded file path) -g

In this step, the script lists all available data files in the bucket and writes the information into the given file list. Note that the script also needs the final download path; although this information is not used in this step, it keeps all of the download job information together.

For example,

scilo_get.sh -f files.txt -b test -o /home/xyz/test -g

will generate a file files.txt in the current directory containing the data files from bucket test. An example of this file list is as follows:

# we use NONE to show that the data file does not have corresponding checksum file, please do not delete it!!

# file_name_from_tape  corresponding_check_sum_information  output_dir_for_downloaded_file

home/xyz/qos.txt   NONE  /home/xyz/test

ssss/testing_archive/testing/test/pindel.TWAM-App282Req70_00194.chr1.tar.gz   ssss/testing_archive/testing/test/pindel.TWAM-App282Req70_00194.chr1.tar.gz.chksum=md5sum=620a27e00ef50374bdf38de8625d5013  /home/xyz/test

yyyy/testing_archive/testing/test/pindel.TWAM-App282Req70_00194.chr11.tar.gz   yyyy/testing_archive/testing/test/pindel.TWAM-App282Req70_00194.chr11.tar.gz.chksum=md5sum=ce13593823e4800e4e5c26a1fc007aba  /home/xyz/test

In the example above, the data is divided into three columns. The first and second columns are the original data file path and the corresponding checksum information generated during the upload job. If a data file has no checksum information, the script labels it NONE so that the download process skips checksum verification. Make sure the first and second columns are NOT changed; otherwise the download process will report an error.

The third column is the download directory. In this example all of the files are downloaded to /home/xyz/test, but different files can go to different folders; you can modify the output directory for each file in the third column.

If the bucket holds a large amount of archived data, the generated file list can be very large as well, since it contains every available file in the bucket. Delete the lines for files you do not want to download; scilo_get.sh will only download the files in the given list.
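For example, you could keep only the entries you need with standard text tools (the pattern below is just an illustration; the comment header is preserved to be safe, and the first two columns of the kept lines are left untouched as required above):

~$ cp files.txt files_all.txt                                   # keep a backup of the full list
~$ grep -E '^#|testing/test/pindel' files_all.txt > files.txt   # keep the header comments plus matching entries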

After the file list is finished, the second downloading step is as follows:

scilo_get.sh -f (file list path) -b (bucket name) -s (scratch path) -n (ncores) -l (log dir)

This step is executed in the same way as the upload process; you can provide multiple scratch directories and multiple cores to parallelize the download job.
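For example, reusing the file list generated above (the bucket name, scratch paths, and core count are illustrative):

scilo_get.sh -f files.txt -b test -s /mnt/d1,/mnt/d2,/mnt/d3 -n 3 -l $HOME/scilo_logs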

Load the Software

The command line client (and the GUI, if you are on the portal) are in our Lmod setup. Execute the following to get them into your environment:

~$ module load GCC
~$ module load scilo-cli  #for the command line ds3_java_cli
~$ module load scilo-gui  # loads the GUI for the portal or X11 forwarding (dsp-gui)

Command Line Parameters

The command line interface is ds3_java_cli:

~$ ds3_java_cli --help #displays a general help listing
~$ ds3_java_cli -c get_service #get a list of available buckets
+-------------------------------------------------------+--------------------------+
|                      Bucket Name                      |       Creation Date      |
+-------------------------------------------------------+--------------------------+
| my_bucket                                             | 2019-03-07T00:08:24.000Z |
+-------------------------------------------------------+--------------------------+

~$ ds3_java_cli --http -c put_bulk -b mybucket -p /home/myusername/ -d /home/myusername/archivedirectory/ --sync -nt 6 --checksum
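A matching restore from that bucket could look like the following (the restore directory and thread count are illustrative; see the get_bulk entry under Available Commands below):

~$ ds3_java_cli --http -c get_bulk -b mybucket -d /home/myusername/restoredirectory/ -nt 6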

usage: ds3_java_cli

Options
Option Option Help
-a Access Key ID or have "DS3_ACCESS_KEY" set as an environment variable
-bs Set the buffer size in bytes. The default is 1MB
-c The Command to execute. For Possible values, use '--help list_commands.'
--debug Debug (more verbose) output to console.
-e The ds3 endpoint to connect to or have "DS3_ENDPOINT" set as an environment variable.
-h Help Menu
--help Command Help (provide command name from -c)
--http Send all requests over standard HTTP
--insecure Ignore SSL certificate verification
-k Secret access key or have "DS3_SECRET_KEY" set as an environment variable
--log-debug Debug (more verbose) output to log file.
--log-trace Trace (most verbose) output to log file.
--log-verbose Log output to log file.
--output-format Configure how the output should be displayed. Possible values: [cli, json]
-r Specifies how many times puts and gets will be attempted before failing the request. The default is 5
--trace Trace (most verbose) output to console.
--verbose Log output to console.
--version Print version information
-x The URL of the PROXY server to use or have "http_proxy" set as an environment variable

Generally, a ds3_java_cli invocation follows this pattern:

ds3_java_cli -e <endpoint> -a <access key> -k <secret key> --http -c <command> -o <object, if used by command> -b <bucket, if used by command>
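For instance, listing the contents of a bucket with the credentials passed explicitly (the values shown are placeholders drawn from the Environment section; get_bucket is described under Available Commands below):

ds3_java_cli -e archive1.accre.vanderbilt.edu -a $DS3_ACCESS_KEY -k $DS3_SECRET_KEY --http -c get_bucket -b my_bucket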

Available Commands

Command Command Help
delete_bucket Deletes an empty bucket.
Requires the '-b' parameter to specify bucket (by name or UUID).
Use the '--force' flag to delete a bucket and all its contents.
Use the get_service command to retrieve a list of buckets
delete_folder Deletes a folder and all its contents.
Requires the '-b' parameter to specify bucket (name or UUID).
Requires the '-d' parameter to specify folder name
delete_job Terminates and removes a current job.
Requires the '-i' parameter with the UUID of the job.
Use the '--force' flag to remove objects already loaded into cache.
Use the get_jobs command to retrieve a list of jobs
delete_object Permanently deletes an object.
Requires the '-b' parameter to specify bucket name.
Requires the '-i' parameter to specify object name (UUID or name).
Use the get_service command to retrieve a list of buckets.
Use the get_bucket command to retrieve a list of objects
delete_tape Deletes the specified tape which has been permanently lost from the BlackPearl database.
Any data lost as a result is marked degraded to trigger a rebuild.
Requires the '-i' parameter to specify tape ID (UUID or barcode).
Use the get_tapes command to retrieve a list of tapes
delete_tape_drive Deletes the specified offline tape drive.
This request is useful when a tape drive is permanently removed from a partition.
Requires the '-i' parameter to specify tape drive ID.
Use the get_tape_drives command to retrieve a list of tapes
delete_tape_failure Deletes a tape failure from the failure list.
Requires the '-i' parameter to specify tape failure ID (UUID).
Use the get_tape_failure command to retrieve a list of IDs
delete_tape_partition Deletes the specified offline tape partition from the BlackPearl gateway configuration.
Any tapes in the partition that have data on them are disassociated from the partition.
Any tapes without data on them and all tape drives associated with the partition are deleted from the BlackPearl gateway configuration.
This request is useful if the partition should never have been associated with the BlackPearl gateway or if the partition was deleted from the library.
Requires the '-i' parameter to specify tape partition
get_bucket Returns bucket details plus a list of objects contained.
Requires the '-b' parameter to specify bucket name or UUID.
Use the get_service command to retrieve a list of buckets
get_bulk Retrieve multiple objects from a bucket.
Requires the '-b' parameter to specify bucket (name or UUID).
Optional '-d' parameter to specify restore directory (default '.').
Optional '-p' parameter to specify prefix or directory name.
Separate multiple values with spaces, e.g., -p prefix1 prefix2.
Optional '--sync' flag to retrieve only newer or non-extant files.
Optional '--file-metadata' flag restores file metadata to the values extant when archived.
Optional '-nt' parameter to specify number of threads
system_information Retrieves basic system information: software version, build, and system serial number.
Useful to test communication
get_config_summary Runs multiple commands to capture configuration information
get_data_policy Returns information about the specified data policy.
Requires the '-i' parameter to specify data policy (UUID or name).
Use the get_data_policies command to retrieve a list of policies
get_data_policies Returns information about the specified data policy.
Requires the '-i' parameter to specify data policy (UUID or name).
Use the get_data_policies command to retrieve a list of policies
get_job Retrieves information about a current job.
Requires the '-i' parameter with the UUID of the job.
Use the get_jobs command to retrieve a list of jobs
get_jobs Retrieves a list of all current jobs
get_object Retrieves a single object from a bucket.
Requires the '-b' parameter to specify bucket (name or UUID).
Requires the '-o' parameter to specify object (name or UUID).
Optional '-d' parameter to specify restore directory (default '.').
Optional '--sync' flag to retrieve only newer or non-extant files.
Optional '--file-metadata' flag restores file metadata to the values extant when archived.
Optional '-nt' parameter to specify number of threads.
Use the get_service command to retrieve a list of buckets.
Use the get_bucket command to retrieve a list of objects
get_objects_on_tape Returns a list of the contents of a single tape.
Requires the '-i' parameter to specify tape (barcode or UUID).
Use the get_tapes command to retrieve a list of tapes
get_physical_placement Returns the location of a single object on tape.
Requires the '-b' parameter to specify bucket (name or UUID).
Requires the '-o' parameter to specify object (name or UUID).
Use the get_service command to retrieve a list of buckets.
Use the get_bucket command to retrieve a list of objects
get_service Returns a list of buckets on the device
get_tape_failure Returns a list of tape failures
get_tapes Returns a list of all tapes
get_user Returns information about an individual user.
Requires the '-i' parameter to specify user (name or UUID).
Use the get_users command to retrieve a list of users
get_users Returns a list of all users
head_object Returns metadata but does not retrieve an object from a bucket.
Requires the '-b' parameter to specify bucket (name or UUID).
Requires the '-o' parameter to specify object (name or UUID).
Useful to determine if an object exists and you have permission to access it
modify_data_policy Alter parameters for the specified data policy.
Requires the '-i' parameter to specify data policy (UUID or name).
Requires the '--modify-params' parameter to be set.
Use key:value pairs: key:value,key2:value2, ...
Legal values: name, checksum_type, default_blob_size, default_get_job_priority, default_put_job_priority, default_verify_job_priority, rebuild_priority, end_to_end_crc_required, versioning.
See API documentation for possible values.
Use the get_data_policies command to retrieve a list of policies and current values
modify_user Alters information about an individual user.
Requires the '-i' parameter to specify user (name or UUID).
Requires the '--modify-params' parameter to be set.
Use key:value pairs: key:value,key2:value2, ...
Legal values: default_data_policy_id.
Use the get_users command to retrieve a list of users
performance For internal testing.
Generates mock file streams for put, and a discard (/dev/null) stream for get.
Useful for testing network and system performance.
Requires the '-b' parameter with a unique bucket name to be used for the test.
Requires the '-n' parameter with the number of files to be used for the test.
Requires the '-s' parameter with the size of each file in MB for the test.
Optional '-bs' parameter with the buffer size in bytes (default 1MB).
Optional '-nt' parameter with the number of threads
put_bucket Create a new empty bucket.
Requires the '-b' parameter to specify bucket name
put_bulk Put multiple objects from a directory or pipe into a bucket.
Requires the '-b' parameter to specify bucket (name or UUID).
Requires the '-d' parameter (unless piping input) to specify source directory.
Optional '-p' parameter to specify prefix or directory name.
Optional '--sync' flag to put only newer or non-extant files.
Optional '--file-metadata' flag archives file metadata with files.
Optional '-nt' parameter to specify number of threads.
Optional '--ignore-errors' flag to continue on errors.
Optional '--follow-symlinks' flag to follow symlink (default is disregard)
reclaim_cache Forces a full reclaim of all caches, and waits until the reclaim completes.
Cache contents that need to be retained because they are part of an active job are retained.
Any cache contents that can be reclaimed will be.
This operation may take a very long time to complete, depending on how much of the cache can be reclaimed and how many blobs the cache is managing
verify_bulk_job A verify job reads data from the permanent data store and verifies that the CRC of the data read matches the expected CRC.
Verify jobs ALWAYS read from the data store - even if the data currently resides in cache.
Requires the '-b' parameter to specify bucket (name or UUID).
Requires the '-o' parameter to specify object (name or UUID).
Optional '-p' parameter to specify prefix or directory name
get_data_path_backend Gets configuration information about the data path backend
get_cache_state Gets the utilization and state information for all cache filesystems
get_system_failure
get_capacity_summary Get a summary of the BlackPearl Deep Storage Gateway system-wide capacity
verify_system_health Verifies that the system appears to be online and functioning normally and that there is adequate free space for the database file system
verify_all_tapes Verify the integrity of all the tapes in the BlackPearl
verify_tape
get_suspect_objects
get_suspect_blob_tapes
modify_data_path
verify_pool
verify_all_pools
get_detailed_objects Filter an object list by size or creation date.
Returns one line for each object.
Optional '-b' bucket_name.
Optional '--filter-params' to filter results.
Use key:value pairs: key:value,key2:value2, ...
Legal values: newerthan, olderthan (relative date from now in the format d1.h2.m3.s4; zero values can be omitted, separate with '.'), before, after (absolute UTC date in the format Y2016.M11.D9.h12.ZPDT; zero values or the UTC time zone can be omitted, separate with '.'), owner (owner name), contains (string to match in object name), largerthan, smallerthan (object size in bytes).
Note: bucket will restrict the values returned; filter-params will transfer the (potentially large) object list and filter client-side
get_detailed_objects_physical Get a list of objects on tape, filtered by size or creation date.
Returns one line for each instance on tape.
Optional '-b' bucket_name.
Optional '--filter-params' to filter results.
Use key:value pairs: key:value,key2:value2, ...
Legal values: newerthan, olderthan (relative date from now in the format d1.h2.m3.s4; zero values can be omitted, separate with '.'), before, after (absolute UTC date in the format Y2016.M11.D9.h12.ZPDT; zero values or the UTC time zone can be omitted, separate with '.'), owner (owner name), contains (string to match in object name), largerthan, smallerthan (object size in bytes).
Note: bucket will restrict the values returned; filter-params will transfer the (potentially large) object list and filter client-side
eject_storage_domain Ejects all eligible tapes within the specified storage domain.
Tapes are not eligible for ejection if mediaEjectionAllowed=FALSE for the storage domain.
If a tape is being used for a job, it is ejected once it is no longer in use.
Use the get_storage_domains command to retrieve a list of storage domains
get_storage_domains Get information about all storage domains.
Optional '-i' (UUID or name) restricts output to one storage domain.
Optional '--writeOptimization' (capacity or performance) filters results to those matching write optimization.
get_tape Returns information on a single tape.
If the tape has been ejected, then the ejection information will also be displayed.
Required '-i' tape barcode or ID
get_bucket_details Returns bucket details by either UUID or bucket name.
Requires the '-b' parameter to specify bucket name or UUID.
Useful to get name by ID or ID by name.
Use the get_service command to retrieve a list of buckets
eject_tape Ejects the tape uniquely identified by ID.
Tapes are not eligible for ejection if mediaEjectionAllowed=FALSE for the storage domain.
If a tape is being used for a job, it is ejected once it is no longer in use.
Use the get_tapes command or get_detailed_objects_physical to find tape id
modify_job
recover_put_bulk Recovers a put_bulk job.
Requires the '-d' parameter to specify source directory.
Requires the '-i' parameter with the UUID for the interrupted or failed job.
Optional '--file-metadata' flag archives file metadata with files.
Other parameters should match the original put_bulk
recover_get_bulk Recovers a get_bulk job.
Requires the '-b' parameter to specify bucket (name or UUID).
Requires the '-i' parameter with the UUID for the interrupted or failed job.
Optional '--file-metadata' flag restores file metadata to the values extant when archived.
Other parameters should match the original get_bulk
cancel_verify_all_tapes Cancel a previous request to verify all the tapes in the DS3 appliance
cancel_verify_tape Cancel a previous request to verify a tape in the DS3 appliance.
Required '-i' tape id (barcode, name, or UUID)
get_pools Returns all pools matching option filter criteria
get_pool Returns information on a single pool.
Required '-i' pool name or ID
cancel_verify_pool Cancel previous request to verify a pool in the DS3 appliance.
Required '-i' pool id (name or UUID)
cancel_verify_all_pools Cancel previous request to verify all the pools in the DS3 appliance
recover Recover a failed or interrupted put_bulk or get_bulk job using recover files.
Recover files are written to temp space on put_bulk and get_bulk