AWS S3 Notes

Some helpful instructions for downloading public datasets hosted in AWS S3.

As a running example, we will download some data from the esgf-world S3 bucket: https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/

High-level overview

There are two ways to download datasets from the AWS S3 bucket.

Quick and dirty: If you have only a few files to download, you can simply navigate to the bucket listing in your browser and copy the URL of the NetCDF file. For example, we may want to download the following file: https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc

In a terminal, type wget <url_you_copied_from_browser>. Here is a more concrete example: wget https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc

However, this approach becomes impractical if you have a lot of files to download, because every URL has to be copied by hand.
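
For a handful of files, the wget step can be wrapped in a small loop. The following is only a sketch: the function name fetch_urls is hypothetical, and it assumes wget is available on your system.

```shell
# Hypothetical helper: download every URL passed as an argument.
# --continue resumes a partially downloaded file instead of starting over.
fetch_urls() {
    local url
    for url in "$@"; do
        wget --continue "$url" || echo "failed: $url" >&2
    done
}
```

You would call it as fetch_urls <url1> <url2> ..., pasting the URLs copied from the browser.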

The efficient approach: The recommended way to download lots of files from an S3 bucket is to use the AWS CLI.

Setting up AWS CLI

You can either install AWS CLI in your home directory or use the pre-installed binaries in /depot/itap/amaji/softwares/aws-cli/awscli-install.
To use the pre-installed binaries, run the following command:

   export PATH=/depot/itap/amaji/softwares/aws-cli/awscli-install/v2/2.4.10/bin:$PATH

(Optional)
To install it in your home directory, use the following steps.

First, follow the instructions at https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html to download the AWS CLI zip file. On Bell, you need the Linux 64-bit binaries.
Unpack the zip file: unzip awscliv2.zip. This will create a directory named aws in the current directory.
Now run ./aws/install -i /path/to/install -b /path/to/install/bin. This copies the binaries to the directory that you specified with -i. The -b option sets where the aws symlinks are created; the default, /usr/local/bin, normally requires root access.
Add the installation directory to your PATH by running the following command.

   export PATH=/path/to/install/v2/current/bin:$PATH

Now you should be able to run aws s3 help.
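
To confirm that the shell actually finds the aws executable, a quick check along these lines may help. This is a sketch; the function name check_aws_cli is hypothetical.

```shell
# Hypothetical check: report whether the aws executable is reachable through PATH.
check_aws_cli() {
    if command -v aws >/dev/null 2>&1; then
        echo "aws found at: $(command -v aws)"
        aws --version
    else
        echo "aws not found on PATH; re-run the export PATH command above" >&2
        return 1
    fi
}
```

Note that export PATH only affects the current shell session, so the export has to be repeated in each new terminal unless you add it to your shell startup file.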

Browse and download files using AWS CLI

There are two primary commands for viewing the files in an S3 bucket and downloading them: aws s3 ls and aws s3 sync.

To see the instructions and options for any AWS command, type aws <command> help. For example, aws s3 help will show you all the subcommands of the aws s3 command. Similarly, aws s3 sync help will show you the options for the s3 sync subcommand.

Suppose you want to download all the files under the following S3 URL: https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/

Identify the S3 bucket name: This is the name appearing before s3.amazonaws.com, in this case, esgf-world.
Now list all the files/folders in that directory.

   aws s3 ls --no-sign-request s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/

Notice that the URL here is slightly different from the one in your browser. The full domain name s3.amazonaws.com is implied, so we only write s3://<bucket-name>, followed by the path to the directory inside the bucket. In this example, the path is /CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/. Together, they tell AWS CLI which directory you are interested in.
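
The mapping from the browser URL to the s3:// form can be sketched as a small shell function. The function name browser_url_to_s3_uri is hypothetical, and the sketch assumes the URL has the form https://<bucket>.s3.amazonaws.com/ with an optional index.html# prefix, as in the esgf-world listing page.

```shell
# Hypothetical helper: convert a browser URL for a public bucket into an s3:// URI.
# Assumes the form https://<bucket>.s3.amazonaws.com/[index.html#]<path>.
browser_url_to_s3_uri() {
    local url="$1"
    local rest="${url#https://}"                 # strip the scheme
    local bucket="${rest%%.s3.amazonaws.com*}"   # text before .s3.amazonaws.com
    local path="${rest#*.s3.amazonaws.com/}"     # text after the domain
    path="${path#"index.html#"}"                 # drop the listing-page prefix, if present
    printf 's3://%s/%s\n' "$bucket" "$path"
}
```

The output can then be passed directly to aws s3 ls or aws s3 sync.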

To download the entire directory into your current working directory, run:

   aws s3 sync --no-sign-request s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/ .

Make sure you are in the scratch directory when running the sync command, because the trailing . tells sync to download into the current directory.
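
Before starting a large transfer, it may be worth checking how much data is involved and previewing what sync would do. The wrappers below are a sketch, and the function names prefix_size and preview_nc_sync are hypothetical. They rely on the --recursive, --human-readable and --summarize options of aws s3 ls, and on the --dryrun, --exclude and --include options of aws s3 sync.

```shell
# Hypothetical helper: list everything under a prefix and print the total
# object count and size at the end of the listing.
prefix_size() {
    aws s3 ls --no-sign-request --recursive --human-readable --summarize "$1"
}

# Hypothetical helper: show what sync would download, restricted to .nc files.
# --dryrun prints the planned operations without transferring anything.
# --exclude "*" followed by --include "*.nc" limits the transfer to NetCDF files.
preview_nc_sync() {
    local src="$1" dest="$2"
    aws s3 sync --no-sign-request --dryrun \
        --exclude "*" --include "*.nc" \
        "$src" "$dest"
}
```

For example, preview_nc_sync s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/ . prints the files that would be copied. Running the same aws s3 sync command without --dryrun performs the download.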
