# AWS S3 Notes

Some helpful instructions for downloading public datasets hosted in AWS S3.

Here we will download some data from the `esgf-world` S3 dataset.
https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/

# High-level overview

There are two ways to download datasets from the AWS S3 bucket.
1. **Quick and dirty**: If you have only a few files to download, you can simply navigate to the file in your browser and copy the URL of the NetCDF file. For example, we may want to download the following file: https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc
   In a terminal, type `wget <url_you_copied_from_browser>`. Here is a more concrete example:
   `wget https://esgf-world.s3.amazonaws.com/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/6hrPlev/pr/gn/v20190430/pr_6hrPlev_CESM2_pdSST-pdSIC_r100i1p1f1_gn_200006010000-200106010000.nc`
   However, this approach does not scale: every file has to be located and fetched by hand, which becomes impractical when you have a lot of files to download.
2. **The efficient approach**: The recommended way to download many files from an S3 bucket is to use AWS CLI.
## Setting up AWS CLI

You can either install AWS CLI in your home directory or use the pre-installed binaries in `/depot/itap/amaji/softwares/aws-cli/awscli-install`.

To use the pre-installed AWS CLI, run the following command:
```bash
export PATH=/depot/itap/amaji/softwares/aws-cli/awscli-install/v2/2.4.10/bin:$PATH
```
**(Optional)**
To install it in your home directory, use the following steps.
1. First download the AWS CLI zip file by following the instructions at https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
   On Bell, you need to download the Linux 64-bit binaries.
2. Unpack the zip file: `unzip awscliv2.zip`. This will create a directory named `aws` in the current directory.
3. Now run `./aws/install -i /path/to/install`. This copies the binaries to the directory given by `-i`. By default the installer also creates symlinks in `/usr/local/bin`; without root access, pass `-b /path/to/bin` to point it at a directory you can write to.
4. Add the installation directory to your `PATH` by running the following command:
   ```bash
   export PATH=/path/to/install/v2/current/bin:$PATH
   ```
5. You should now be able to run `aws s3 help`.
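
Before moving on, it can help to confirm that the shell actually finds the `aws` executable. The following is a minimal sketch; `check_aws_cli` is a hypothetical helper name used only for illustration and is not part of AWS CLI.

```bash
# check_aws_cli: report whether an `aws` executable is reachable on PATH.
# (Hypothetical helper name; not part of AWS CLI.)
check_aws_cli() {
    if command -v aws >/dev/null 2>&1; then
        # Show which binary will be used, then its version string.
        echo "aws found at: $(command -v aws)"
        aws --version
    else
        echo "aws not found on PATH; re-check the 'export PATH=...' line" >&2
        return 1
    fi
}
```

Calling `check_aws_cli` after either `export PATH=...` command above should print the location of the binary. A non-zero exit status means the directory added to `PATH` does not contain `aws`.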
## Browse and download files using AWS CLI

There are two primary commands for viewing the files in an S3 bucket and downloading them: `aws s3 ls` and `aws s3 sync`.

To get more instructions and options for any AWS command, type `aws <command> help`. For example, `aws s3 help` shows all the subcommands of the `aws s3` command. Similarly, `aws s3 sync help` shows the options for the `s3 sync` subcommand.

Assume that I want to download all the files under the S3 URL https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/

1. Identify the S3 bucket name: this is the name appearing before `.s3.amazonaws.com`, in this case `esgf-world`.
2. Now **list all the files/folders** in that directory.
   ```bash
   aws s3 ls --no-sign-request s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/
   ```
   **_Notice that the URL here is slightly different from the one in your browser._** We no longer use the full domain name `s3.amazonaws.com`, which is implied, and the `index.html#` part of the browser URL is dropped. We only use `s3://<bucket-name>`, followed by the path to the directory _inside_ the bucket. In this example, the path is `/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/`. Together, they tell AWS CLI which directory you are interested in. The `--no-sign-request` option tells AWS CLI not to sign the request, so no AWS credentials are needed to read this public bucket.
3. To **download the entire directory** locally, run:
   ```bash
   aws s3 sync --no-sign-request s3://esgf-world/CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/ .
   ```
   The trailing `.` is the destination, that is, the current directory. Make sure you are in the **scratch directory** when running the `sync` command.
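
The translation from a browser URL to the `s3://` form described in step 2 can be sketched as a small shell function. This is only an illustration: `to_s3_uri` is a hypothetical helper name, not part of AWS CLI, and it assumes the URL has the `https://<bucket>.s3.amazonaws.com/...` form used in this note.

```bash
# to_s3_uri: turn a browser URL such as
#   https://<bucket>.s3.amazonaws.com/index.html#<path>
# into the s3://<bucket>/<path> form that AWS CLI expects.
# (Hypothetical helper; assumes the URL layout shown above.)
to_s3_uri() {
    local url="$1"
    local prefix='index.html#'
    local rest="${url#https://}"                 # strip the scheme
    local bucket="${rest%%.s3.amazonaws.com*}"   # text before .s3.amazonaws.com
    local path="${rest#*.s3.amazonaws.com/}"     # text after the domain
    path="${path#"$prefix"}"                     # drop the index.html# part, if present
    printf 's3://%s/%s\n' "$bucket" "$path"
}

# Example usage (commented out so that sourcing this file has no side effects):
#   uri=$(to_s3_uri 'https://esgf-world.s3.amazonaws.com/index.html#CMIP6/PAMIP/NCAR/CESM2/pdSST-pdSIC/r100i1p1f1/')
#   aws s3 ls   --no-sign-request "$uri"
#   aws s3 sync --no-sign-request --dryrun "$uri" .   # preview the transfer without downloading
```

The `--dryrun` option of `aws s3 sync` only prints the operations that would be performed, which is a cheap way to check the size of a download before starting it in the scratch directory.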