# Code and data for evaluating oil spill amount from text-form incident information
This repository includes the code and data for evaluating oil spill amounts from textual incident information. The code is written in Python and is run in Jupyter Lab under Anaconda.
## Table of Contents
- [Dataset](#dataset)
- [Algorithms](#algorithms)
- [Supporting Files](#supporting-files)
- [Usage](#usage)
- [Contact](#contact)
## Dataset
The raw data are the incident records published by NOAA. For each oil spill incident covered by NOAA (available at NOAA IncidentNews), the IncidentNews platform records detailed information in two primary categories:
- **Incident-level data**: Stored in the folder `description`.
- **Post-level data**: Stored in the folder `posts`.
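As a minimal sketch of working with these two folders (the actual file layout and format of the NOAA data may differ; `load_incident_texts` is a hypothetical helper), the per-incident texts can be read along these lines:

```python
from pathlib import Path

def load_incident_texts(folder):
    """Read every text file in `folder` and return {filename: contents}.

    Hypothetical helper: assumes one plain-text file per incident, which
    may not match the actual layout of the `description`/`posts` folders.
    """
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        texts[path.stem] = path.read_text(encoding="utf-8")
    return texts

# incident_level = load_incident_texts("description")
# post_level = load_incident_texts("posts")
```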
## Algorithms
The algorithms for evaluating release amounts (RA) are split into three notebooks by function:
1. **1_RA_extraction.ipynb**:
   - Identifies oil spill-related incidents in the raw incident data.
   - Extracts candidate RAs separately for each text segment in the incident-level and post-level data.
2. **2_RA_identification.ipynb**:
   - Identifies the final RA for each incident from its candidate RAs.
3. **3_add_ra_source_and_update.ipynb**:
   - Adds three columns to indicate:
     - The actual RA (actual RA).
     - The source of the actual RA (RA source).
     - How the actual RA updates the given potential maximum release amount (update label).

The calculated results are stored in the folders `description`, `posts`, and `incident_posts` (created automatically after running `2_RA_identification.ipynb`).
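To illustrate the kind of candidate extraction performed in step 1, here is a simplified rule-based sketch. It is a hypothetical illustration only: the actual notebooks use stanza-based processing and cover far more phrasings and units than this single regular expression.

```python
import re

# Hypothetical pattern: a number (optionally with commas and a decimal part)
# followed by a volume unit. The real extraction in 1_RA_extraction.ipynb
# is stanza-based and far more thorough.
RA_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(gallons?|barrels?|liters?|gal|bbl)",
    re.IGNORECASE,
)

def extract_candidate_ras(text):
    """Return (amount, unit) candidate release amounts found in `text`."""
    return [
        (float(amount.replace(",", "")), unit.lower())
        for amount, unit in RA_PATTERN.findall(text)
    ]

print(extract_candidate_ras(
    "An estimated 1,500 gallons of diesel, up to 10 barrels of crude."
))
# [(1500.0, 'gallons'), (10.0, 'barrels')]
```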
## Supporting Files
Additional files that support the three notebooks above are called automatically when they run. These include:
- **func_set_general.ipynb** and **func_set_ra_eval.ipynb**:
  - Custom functions used by the three primary notebooks.
- Folders **manual** and **gpted**:
  - Contain the manually verified candidate RAs and the candidate RAs evaluated by the Large Language Model (LLM), included to ensure repeatability.
## Usage
The code is designed to run in an Anaconda environment; the following additional packages must be installed before running.
```bash
# Install stanza for rule-based RA candidate extraction.
# The stanza version must be 1.3.0 to ensure repeatability, as the
# corpora of different versions differ.
!pip install stanza==1.3.0
# Install openai for LLM-based evaluation of candidate RAs
!pip install openai
```
Additionally, a GPT API key is required; it should be specified in the `query_gpt` function in the `func_set_ra_eval.ipynb` file.
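As a hypothetical sketch of where the key fits (the actual `query_gpt` implementation may differ; `build_ra_messages` and the prompt wording are illustrative only), the prompt can be assembled separately from the API call, with the key read from an environment variable rather than hard-coded:

```python
import os

def build_ra_messages(segment):
    """Assemble chat messages for evaluating candidate RAs in one text
    segment (hypothetical prompt wording)."""
    return [
        {"role": "system",
         "content": "You evaluate oil spill release amounts mentioned in text."},
        {"role": "user", "content": segment},
    ]

# Inside query_gpt, the key and call might look like (openai >= 1.0 style):
# from openai import OpenAI
# client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo", messages=build_ra_messages(segment)
# )
```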
To run the notebooks, follow these steps:
1. Open Jupyter Lab:
   ```bash
   jupyter lab
   ```
2. Navigate to the project directory in Jupyter Lab.
3. Run the notebooks in the following order:
   - `1_RA_extraction.ipynb`
   - `2_RA_identification.ipynb`
   - `3_add_ra_source_and_update.ipynb`
## Contact
Yiming Liu - [liu3285@purdue.edu](mailto:liu3285@purdue.edu)

Hua Cai - [huacai@purdue.edu](mailto:huacai@purdue.edu)