Stanford provides unlimited* Google Drive space for all labs. There are some restrictions: a daily upload limit of 750 GB/person/day, a 5 TB size limit for a single file (which makes "tar"ing a folder a good practice), a 2 files/second limit (very slow transfers for many small files), and a limit on the number of files each teamdrive can store. Multiple teamdrives can be set up on Google Drive to work around the file-count limit. Several teamdrives are already set up for Yiorgo's lab, named "Skiniotis Lab", "SkiniotisLab2", etc. You can use the rclone module on Sherlock to copy your files to these teamdrives (a minimal example is shown after the notes below).
- To transfer data with rclone, you first need to configure rclone for your account on Sherlock and link it to these teamdrives on Google Drive.
- I prepared an rclone/tar/split script to automate archiving for our lab. Run this script to archive a folder.
- It is better to transfer folders that are smaller than 5 TB. The script tars your folder into your $SCRATCH space (100 TB capacity) and splits the tarred file into pieces smaller than 5 TB. Because the tar file and its split pieces exist at the same time, a 20 TB folder needs 40 TB of space in $SCRATCH; files of 5 TB or less do not require splitting, so a 5 TB folder only needs 5 TB of $SCRATCH space.
- Transferring and tarring are long processes, so submitting these jobs to the queue is better than interactive jobs, which require your terminal to stay open with a constant internet connection.
- Please follow the following instructions to configure rclone and transfer data.
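As a minimal sketch of what a single manual transfer looks like (assuming a remote named "teamremote" has already been configured as described below; the file and destination folder names are placeholders):
$ ml system rclone
$ rclone copy my_project.tar teamremote:my_project/ --progress   #copies the tar file into a "my_project" folder on the Shared Drive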
updated 20210218
How to configure Rclone for your Sherlock account
Please follow the steps below to create a "Shared Drive" on Google Drive and configure the "rclone" software on Sherlock to connect "rclone" to the "Shared Drive" you have created. More info at this link: https://uit.stanford.edu/service/gsuite/drive
PS: Commands are denoted with the "$" sign; do not include the "$" sign when you type these commands.
The "#" sign denotes my comments; do not type anything after this sign when you enter these commands.
Log in to your Stanford Google Drive account and create a "Shared Drive".
Right-click on the "Shared Drive" you created and use "Manage members" to give Manager permission to your PI and to the team member responsible for archiving (turn off "Notify people").
Log in to Sherlock by opening a terminal on your local computer and typing the following:
$ ssh -Y your_sunet_id@sherlock.stanford.edu #do not forget to change your_sunet_id
#Follow the instructions to log in
$ ml system rclone
$ rclone config
type --> n (to create a new remote config file)
type --> teamremote (a name of your choice that links this rclone configuration on Sherlock to your "Shared Drive" on Google Drive; you need a different name for each "Shared Drive")
type --> drive (to select for Google Drive)
client_id> (leave empty and press enter)
client_secret> (leave empty and press enter)
scope> (leave empty and press enter)
root_folder_id> (leave empty and press enter)
service_account_file> (leave empty and press enter)
Edit advanced config? --> n
Use auto config? --> n
#Copy the given link and open on a browser, select your Stanford account (not personal gmail account) to give permission to google drive.
#Copy the verification code and enter it on the terminal
Configure this as a team drive --> y
Enter a Shared Drive ID --> 1 (Select the "Shared Drive" you would like to configure)
y/e/d --> y (yes to confirm config details)
e/n/d/r/c/s/q --> q (quit to finalize configuration)
You have now configured rclone for your Sherlock account: one of our/your shared teamdrives on your Google Drive account is connected to your Sherlock account under the name "teamremote".
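To check that the new remote works, a quick test (the remote name is whatever you chose above):
$ rclone lsd teamremote:   #lists the top-level folders of the Shared Drive; an empty output is normal for a new drive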
When a given teamdrive on Google Drive is full, create another teamdrive and run the same configuration to link the new teamdrive to your Sherlock account under a new rclone name (for example, "teamremote2"). Alternatively, you can transfer all the files from one teamdrive to another.
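For example, assuming a second remote named "teamremote2" has been configured the same way, a transfer between the two drives could look like this (a sketch, not part of the archiving script):
$ rclone copy teamremote: teamremote2: --progress   #add --dry-run first to preview what would be copied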
You may need to refresh your token once in a while (if you see a "bad token" or expired token error).
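One way to refresh it, assuming the rclone version on Sherlock supports it, is to re-authorize the remote; re-running "rclone config" and editing the remote also works:
$ ml system rclone
$ rclone config reconnect teamremote:   #repeats the browser authorization step for this remote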
How to transfer an assigned folder from Sherlock to Google Drive
Prerequisites
- Set up your teamremote (see above)
- Make sure you have enough space on your $SCRATCH
- Make sure you have at least read permission on all the files that you wish to archive (a quick way to check both is sketched below).
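Two quick checks before submitting (a sketch; the folder path is the example used below, and "sh_quota" is assumed to be Sherlock's quota tool):
$ du -sh /oak/stanford/groups/yiorgo/Alpay/my_fav_protein/relion   #size of the folder to be archived
$ find /oak/stanford/groups/yiorgo/Alpay/my_fav_protein/relion ! -readable   #lists files you cannot read; should print nothing
$ sh_quota   #shows your remaining space on $SCRATCH and $OAK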
How to run?
#Navigate to a folder where the log files for archiving will be kept.
$ cd $HOME/archiving
#Load a7scripts
$ ml a7scripts
#Run the archiving command with three inputs: the project name, the folder path to be archived and the teamremote name.
$ a7archieve.sh my_fav_protein /oak/stanford/groups/yiorgo/Alpay/my_fav_protein/relion teamremoteAlpay
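After submitting, you can follow the jobs from the same folder (a sketch; Sherlock uses Slurm, and the exact log file names depend on the script):
$ squeue -u $USER   #lists your pending/running jobs
$ ls -lrt $HOME/archiving   #the newest log files appear at the bottom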
--------
How does it work?
- The user inputs the Project_Name, the Folder/File to be archived and the TeamRemoteName that you set up for your account.
- The program creates a gateway folder under $SCRATCH/A7_Archive/Project_Name.
- It makes a list of the files and saves it as Folder.list under the gateway folder.
- It tars the folder into a single file under the gateway folder.
- If the tar file is larger than 5 TB, it splits it into smaller (less than 5 TB) files.
- It archives the tar file(s) together with the list file under the Project_Name folder on your TeamDrive.
- It deletes the tar files and the list file from the gateway folder.
- The script has two parts, tarring/splitting and rclone, which are submitted to the "yiorgo" queue separately. Due to the Google Drive limitations, only one instance of the rclone part runs on the queue per person; any other rclone jobs submitted through this script wait for the running instance to finish and then start automatically (a rough sketch of the underlying commands is shown after the note below).
- You should check that everything is archived correctly and then manually delete the folder that was archived.
Note: All the processes are submitted to the queue, so there is no need to keep your terminal open.
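Conceptually, the script runs something like the following (a simplified sketch with placeholder names taken from the example above, not the actual script; the real script also takes care of job submission):
$ mkdir -p $SCRATCH/A7_Archive/my_fav_protein && cd $SCRATCH/A7_Archive/my_fav_protein
$ find /oak/stanford/groups/yiorgo/Alpay/my_fav_protein/relion > my_fav_protein.list
$ tar -cf my_fav_protein.tar /oak/stanford/groups/yiorgo/Alpay/my_fav_protein/relion
$ split -b 4T my_fav_protein.tar my_fav_protein.tar.part_ && rm my_fav_protein.tar   #only if the tar file is larger than 5TB
$ rclone copy . teamremoteAlpay:my_fav_protein/ --progress
$ rm -f my_fav_protein.tar* my_fav_protein.list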
Google Drive limits
- Upload limit - 750 GB of data per day & 2 files per second.
- Single file size limit - 5 TB (a single file transfer can exceed the 750 GB/day limit, so a single tar file can be beneficial).
- If a single file exceeds the 750 GB daily limit, that file will still upload; subsequent files will not upload until the daily upload limit resets the next day.
- Number of files limit - 400,000 files and folders per teamdrive (you will get an error if this number is exceeded, and you should then create a new teamdrive).
- Subfolder limit - you can nest up to 20 levels of subfolders.
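To see how close a teamdrive is to these limits, rclone can report the total size and number of objects (using the remote name configured above):
$ rclone size teamremote:   #prints the total number of objects and the total size stored on the Shared Drive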