This repository includes the code and data for evaluating oil spill release amounts from the textual information of incident reports. The code is written in Python and runs in Jupyter Lab under Anaconda.
The raw data is the incident data published by NOAA. For each oil spill incident supported by NOAA (available at NOAA Incident News), the IncidentNews platform records detailed information in two primary categories:
- Incident-level data: stored in the folder `description`.
- Post-level data: stored in the folder `posts`.
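To make the two data levels concrete, here is a minimal toy sketch of how post-level records relate back to incident-level records. The column names (`incident_id`, `name`, `post_text`) are assumptions for illustration; the real CSV schemas in `description/` and `posts/` may differ.

```python
import pandas as pd

# Toy incident-level table (one row per incident).
incidents = pd.DataFrame({
    "incident_id": [101, 102],
    "name": ["Tanker grounding", "Pipeline leak"],
})

# Toy post-level table (several posts per incident).
posts = pd.DataFrame({
    "incident_id": [101, 101, 102],
    "post_text": [
        "Initial report filed.",
        "Update: 5,000 gallons released.",
        "Leak contained.",
    ],
})

# Each post belongs to one incident, so post-level records can be joined
# back to their incident-level record on the shared id column.
merged = posts.merge(incidents, on="incident_id", how="left")
print(merged.shape)  # → (3, 3)
```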
The algorithms for evaluating release amounts (RA) are split into three notebooks based on their functions:
- `1_RA_extraction.ipynb`:
  - Identifies oil spill-related incidents from the raw incident data.
  - Separately extracts candidate RAs for each text segment in the incident-level and post-level data.
- `2_RA_identification.ipynb`:
  - Identifies the final RA for each incident from its candidate RAs.
- `3_add_ra_source_and_update.ipynb`:
  - Adds three columns to indicate:
    - The actual RA (`actual RA`)
    - The source of the actual RA (`RA source`)
    - How the actual RA updates the given potential maximum release amount (`update label`)
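To illustrate the candidate-extraction step, here is a simplified regex-based sketch. This is not the repository's actual method (which uses stanza's rule-based processing plus LLM evaluation); it only shows the general idea of pulling amount-plus-unit candidates out of a text segment.

```python
import re

# Hypothetical pattern: a number (with optional commas/decimals) followed by
# a common volume or mass unit. The real rule-based extraction is richer.
RA_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(gallons?|barrels?|liters?|tons?)",
    re.IGNORECASE,
)

def extract_candidate_ras(text):
    """Return (amount, unit) candidate release amounts found in a text segment."""
    return [
        (float(m.group(1).replace(",", "")), m.group(2).lower())
        for m in RA_PATTERN.finditer(text)
    ]

print(extract_candidate_ras(
    "An estimated 5,000 gallons of diesel spilled; up to 120 barrels may be released."
))
# → [(5000.0, 'gallons'), (120.0, 'barrels')]
```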
The calculated results are stored in the folders `description`, `posts`, and `incident_posts` (created automatically after running `2_RA_identification.ipynb`).
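The `update label` mentioned above compares the identified actual RA against the previously reported potential maximum release amount. The exact labeling rules live in the notebooks; the sketch below only illustrates one plausible semantics (label names are assumptions).

```python
def update_label(actual_ra, potential_max):
    """Illustrative labeling of how an actual RA updates a prior maximum estimate.

    Hypothetical labels, not the repository's exact vocabulary.
    """
    if potential_max is None:
        return "new"        # no prior estimate existed to update
    if actual_ra > potential_max:
        return "increase"   # actual release exceeded the prior maximum
    if actual_ra < potential_max:
        return "decrease"   # actual release came in below the prior maximum
    return "unchanged"

print(update_label(5000, 3000))  # → increase
```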
Additional files support the operation of the algorithm and are called automatically by the three notebooks above:

- `func_set_general.ipynb` and `func_set_ra_eval.ipynb`: custom functions used by the three primary notebooks.
- Folders `manual` and `gpted`: contain the manually verified candidate RAs and the candidate RAs evaluated by the Large Language Model (LLM), to ensure repeatability.
The code is designed to run in an Anaconda environment; the following additional packages must be installed before running the notebooks.
```python
# Install stanza for rule-based RA candidate extraction.
# Note: stanza must be version 1.3.0 to ensure repeatability,
# because the corpora differ between versions.
!pip install stanza==1.3.0

# Install openai for LLM-based evaluation of RA candidates.
!pip install openai
```
Additionally, a GPT API key is required; it should be specified in the `query_gpt` function in `func_set_ra_eval.ipynb`.
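Rather than hard-coding the key inside `query_gpt`, a common pattern is to read it from an environment variable. The sketch below is a hypothetical helper (not part of the repository); adapt `func_set_ra_eval.ipynb` accordingly if you use it.

```python
import os

def get_api_key():
    """Read the GPT API key from the OPENAI_API_KEY environment variable."""
    key = os.environ.get("OPENAI_API_KEY")
    if key is None:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return key
```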
To run the notebooks, follow these steps:

1. Open Jupyter Lab:

   ```
   jupyter lab
   ```

2. Navigate to the project directory in Jupyter Lab.
3. Run the notebooks in the following order:
   1. `1_RA_extraction.ipynb`
   2. `2_RA_identification.ipynb`
   3. `3_add_ra_source_and_update.ipynb`
- Yiming Liu - liu3285@purdue.edu
- Hua Cai - huacai@purdue.edu