AWS S3 Notes
Some helpful instructions for downloading public datasets hosted in AWS S3.
Here we will download some data from the esgf-world S3 bucket:
https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/
High-level overview
There are two ways to download datasets from the AWS S3 bucket.
- Quick and dirty: If you have only a few files to download, you can simply navigate to the file in your browser and copy the URL of the NetCDF file. For example, we may want to download the following file: https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc
In a terminal, type wget <url_you_copied_from_browser>. Here is a more concrete example:
wget https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc
However, this approach becomes impractical if you have a lot of files to download, because every URL has to be copied by hand.
- The efficient approach: The recommended way to download lots of files from an S3 bucket is to use the AWS CLI.
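For a handful of files, the quick-and-dirty approach can be stretched a little with a loop. The sketch below assumes a hypothetical text file (here called urls.txt) with one file URL per line. The function name fetch_list and the DRY_RUN variable are illustrative helpers, not part of wget.

```shell
# Download every URL listed in a text file (one URL per line).
# Blank lines and lines starting with '#' are skipped.
# Set DRY_RUN=1 to print the wget commands instead of running them.
fetch_list() {
    list_file=$1
    while IFS= read -r url || [ -n "$url" ]; do
        case $url in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf 'wget -c %s\n' "$url"
        else
            # -c resumes a partially downloaded file instead of restarting
            wget -c "$url"
        fi
    done < "$list_file"
}

# Example (hypothetical file name):
#   DRY_RUN=1 fetch_list urls.txt
```

This still requires collecting the URLs yourself, so it only postpones the point at which the AWS CLI becomes the better tool.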
Setting up AWS CLI
You can either install AWS CLI in your home directory or use the pre-installed binaries in /depot/itap/amaji/softwares/aws-cli/awscli-install.
Run the following command to use AWS CLI:
export PATH=/depot/itap/amaji/softwares/aws-cli/awscli-install/v2/2.4.10/bin:$PATH
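To confirm that the export took effect, a quick check along the following lines may help. This is a minimal sketch. The helper name require_cmd is hypothetical.

```shell
# Return success if the named command is found on PATH; otherwise print
# a hint to stderr and return failure.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    printf '%s not found on PATH; check your export line\n' "$1" >&2
    return 1
}

# Example:
#   require_cmd aws && aws --version
```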
(Optional) To install it in your home directory, use the following steps.
- First, download the AWS CLI zip file by following the instructions at https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html. On Bell, you need to download the Linux 64-bit binaries.
- Unpack the zip file:
unzip awscliv2.zip
This will create a directory named aws in the current directory.
- Now run the installer:
./aws/install -i /path/to/install -b /path/to/bin
This copies the binaries to the directory given by -i and creates symlinks to them in the directory given by -b. Without -b, the installer tries to write the symlinks to /usr/local/bin, which normally requires root access.
- Add the installation directory to your PATH:
export PATH=/path/to/install/v2/current/bin:$PATH
- Now you should be able to run aws s3 help.
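An export typed at the prompt lasts only for the current shell session, so the PATH line is often placed in a startup file such as ~/.bashrc. A minimal sketch of a guard that avoids adding the same directory twice when the file is sourced repeatedly is shown below. The function name path_prepend is hypothetical, and /path/to/install remains the placeholder used in the steps above.

```shell
# Prepend a directory to PATH only if it is not already there, so that
# sourcing this repeatedly does not keep growing PATH.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Example (placeholder path):
#   path_prepend /path/to/install/v2/current/bin
```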
Browse and download files using AWS CLI
There are two primary commands for viewing the files in an S3 bucket and downloading them: aws s3 ls and aws s3 sync.
To get more instructions and options for any AWS command, type aws <command> help. For example, aws s3 help will show you all the subcommands of the aws s3 command. Similarly, aws s3 sync help will show you the options for the s3 sync subcommand.
Suppose you want to download all the files under the following S3 URL: https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/
- Identify the S3 bucket name: this is the name appearing before s3.amazonaws.com, in this case esgf-world.
- Now list all the files and folders in that directory:
aws s3 ls --no-sign-request s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/
Notice that the URL here is slightly different from the one in your browser. We no longer use the full domain name s3.amazonaws.com, since it is implied; we only use s3://<bucket-name>. You also need to append the path to the directory inside the bucket. In this example, the path is /CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/. Together, they tell AWS CLI which directory you are interested in. The --no-sign-request option tells AWS CLI not to sign the request, so no AWS credentials are needed to read a public bucket.
- To download the entire directory locally, run:
aws s3 sync --no-sign-request s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/ .
Make sure you are in your scratch directory when running the sync command, because the trailing . tells it to download into the current directory.
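The steps above can be strung together in a small script. The following is a minimal sketch under stated assumptions, not a definitive tool: the function names to_s3_uri and sync_matching and the AWS_BIN and DRY_RUN variables are hypothetical helpers introduced here, while --dryrun, --exclude, and --include are standard aws s3 sync options. It converts a browser URL into the s3:// form and restricts the download to files matching a pattern.

```shell
# Convert a browser-style URL such as
#   https://<bucket>.s3.amazonaws.com/index.html#<path>
# into the s3://<bucket>/<path> form that the AWS CLI expects.
to_s3_uri() {
    url=${1#https://}                   # drop the scheme
    bucket=${url%%.s3.amazonaws.com*}   # text before .s3.amazonaws.com
    key=${url#*.s3.amazonaws.com/}      # text after the domain
    key=${key#"index.html#"}            # drop the listing-page prefix, if any
    printf 's3://%s/%s\n' "$bucket" "$key"
}

# Sync only the files matching a glob pattern from a public bucket.
# Set DRY_RUN=1 to add --dryrun, which reports what would be
# transferred without downloading anything.
# AWS_BIN (hypothetical) lets you point at a specific aws executable.
sync_matching() {
    src=$1 dest=$2 pattern=$3
    set -- s3 sync --no-sign-request
    if [ "${DRY_RUN:-0}" = 1 ]; then
        set -- "$@" --dryrun
    fi
    # Filters apply in order: exclude everything, then re-include matches.
    set -- "$@" --exclude '*' --include "$pattern" "$src" "$dest"
    "${AWS_BIN:-aws}" "$@"
}

# Example:
#   uri=$(to_s3_uri 'https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/')
#   DRY_RUN=1 sync_matching "$uri" . '*.nc'
```

Running with DRY_RUN=1 first is a cheap way to check the pattern and the destination before committing scratch space to the download.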