
# AWS S3 Notes

Some helpful instructions for downloading public datasets hosted in AWS S3.

The examples here download data from the `esgf-world` S3 bucket:
https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/

## High-level overview

There are two ways to download datasets from the AWS S3 bucket.

1. **Quick and dirty**: If you only have a few files to download, navigate to the bucket in your browser and copy the URL of the NetCDF file you want. For example, we may want to download the following file: https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc

   In a terminal, type `wget <url_you_copied_from_browser>`. Here is a more concrete example:
   `wget https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc`

   However, this approach becomes impractical when you have many files to download, because each URL has to be copied by hand.

2. **The efficient approach**: The recommended way to download lots of files from an S3 bucket is by using AWS CLI.
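
For the quick approach with a handful of files, the copying can be reduced by writing the URLs to a list first. The following is a minimal sketch, not a tested workflow: the list file name `urls.txt` is an assumption made for illustration, and only the single file name shown above comes from this README.

```bash
# Directory URL taken from the example above.
base_url="https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430"

# Hypothetical name for the file that will hold one URL per line.
url_list="urls.txt"

# Add more file names to this loop as needed; only the first is from this README.
for name in \
  "pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc"
do
  printf '%s/%s\n' "$base_url" "$name"
done > "$url_list"

# Show the list so it can be checked before downloading.
cat "$url_list"
```

Running `wget -c -i urls.txt` afterwards downloads every URL in the list; `-c` resumes a partially downloaded file instead of starting over.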

## Setting up AWS CLI

You can either install AWS CLI in your home directory or use the pre-installed binaries in `/depot/itap/amaji/softwares/aws-cli/awscli-install`.   
To use the pre-installed version, add it to your `PATH`:
```bash
   export PATH=/depot/itap/amaji/softwares/aws-cli/awscli-install/v2/2.4.10/bin:$PATH
```

**(Optional)**   
To install it in your home directory, use the following steps.

1. Download the AWS CLI zip file by following the instructions at https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html.
On Bell, you need the Linux 64-bit binaries.

2. Unpack the zip file with `unzip awscliv2.zip`. This creates a directory named `aws` in the current directory.

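3. Run `./aws/install -i /path/to/install` (the installer script in the bundle is named `install`). This copies the binaries to the directory you specify as `/path/to/install`. By default the installer also creates symlinks in `/usr/local/bin`, so you may need to add `-b /path/to/bin` pointing to a directory you can write to.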

4. Add the installation directory to your `PATH`:  
```bash
   export PATH=/path/to/install/v2/current/bin:$PATH
```

5. You should now be able to run `aws s3 help`.
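
To confirm that the shell actually picks up the `aws` executable after either setup route, a quick check along these lines may help. It is only a sketch; the variable name `aws_status` and the wording of the messages are made up for illustration.

```bash
# Look for the aws executable on PATH and report its version if it is there.
if command -v aws >/dev/null 2>&1; then
  aws_status="found: $(aws --version 2>&1)"
else
  aws_status="not found: check that the export PATH line was run in this shell"
fi

echo "$aws_status"
```

Note that `export PATH=...` only affects the current shell session, so the check has to be run in the same terminal (or after adding the export line to your shell startup file).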

## Browse and download files using AWS CLI

There are two primary commands for viewing the files in an S3 bucket and downloading them: `aws s3 ls` and `aws s3 sync`.

For more details and options for any AWS CLI command, type `aws <command> help`. For example, `aws s3 help` lists all the subcommands of `aws s3`, and `aws s3 sync help` lists the options of the `sync` subcommand.

Suppose you want to download all the files under the following S3 URL: https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/

1. Identify the S3 bucket name: This is the name that appears before `.s3.amazonaws.com` in the URL, in this case `esgf-world`.

2. Now **list all the files/folders** in that directory.
```bash
   aws s3 ls --no-sign-request s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/
```  
**_Notice that the URL here is slightly different from the one in your browser._** The full domain name `s3.amazonaws.com` is no longer written out because it is implied; we only use `s3://<bucket-name>`, followed by the path to the directory _inside_ the bucket (without the `index.html#` part of the browser URL). In this example, the path is `/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/`. Together, they tell AWS CLI which directory you are interested in. The `--no-sign-request` option tells AWS CLI not to sign the request with credentials, which is what allows anonymous access to a public bucket.

3. To **download the entire directory** locally, run:   
```bash
   aws s3 sync --no-sign-request s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/ . 
```  
Make sure you are in the **scratch directory** when running the `sync` command, because the trailing `.` tells `sync` to download everything into the current directory.
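
The URL translation described in the steps above can be sketched as a small shell function. `browser_url_to_s3_uri` is a hypothetical helper written for this README, not part of AWS CLI, and it assumes the browser URL has the form `https://<bucket>.s3.amazonaws.com/[index.html#]<path>`.

```bash
# Convert a browser URL of a public bucket into the s3:// form used by AWS CLI.
# Hypothetical helper; assumes https://<bucket>.s3.amazonaws.com/[index.html#]<path>.
browser_url_to_s3_uri() {
  local url="$1"
  local rest="${url#https://}"                  # drop the scheme
  local bucket="${rest%%.s3.amazonaws.com*}"    # text before .s3.amazonaws.com
  local path="${rest#*.s3.amazonaws.com/}"      # text after the domain name
  path="${path#"index.html#"}"                  # drop the browser index prefix, if present
  printf 's3://%s/%s\n' "$bucket" "$path"
}

s3_uri=$(browser_url_to_s3_uri "https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/")
echo "$s3_uri"
```

The resulting value can be passed to the commands above. For example, `aws s3 sync --no-sign-request --dryrun "$s3_uri" .` previews what would be transferred without downloading anything, which is a cheap way to check the path before starting a large download.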