# Code and data for evaluating oil spill amount from text-form incident information
This repository includes the code and data for evaluating oil spill amounts from incident textual information. The code is written in Python and runs in Jupyter Lab within an Anaconda environment.
## Table of Contents
- [Dataset](#dataset)
- [Algorithms](#algorithms)
- [Supporting Files](#supporting-files)
- [Usage](#usage)
- [Contact](#contact)
## Dataset
The raw data are the incident records published by NOAA. For each oil spill incident supported by NOAA (available on NOAA IncidentNews), the IncidentNews platform records detailed information in two primary categories:
- **Incident-level data**: Stored in the folder `description`.
- **Post-level data**: Stored in the folder `posts`.
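As a minimal sketch of working with these two folders, the snippet below loads every file in a folder into a list of row dicts. It assumes the records are stored as CSV files; the file names and columns shown are illustrative, not taken from the repository.

```python
import csv
from pathlib import Path

def load_records(folder):
    """Load every CSV file in `folder` into a single list of row dicts.

    Assumes CSV storage; adapt the glob pattern and reader if the
    repository's files use a different format.
    """
    records = []
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            records.extend(csv.DictReader(f))
    return records
```

Under this assumption, incident-level data would be read with `load_records("description")` and post-level data with `load_records("posts")`.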
## Algorithms
The algorithms for evaluating release amounts (RA) are split into three notebooks based on their functions:
1. **1_RA_extraction.ipynb**:
   - Identifies oil spill-related incidents from the raw incident data.
   - Separately extracts candidate RAs for each text segment in the incident-level and post-level data.
2. **2_RA_identification.ipynb**:
   - Identifies the final RA for each incident from its candidate RAs.
3. **3_add_ra_source_and_update.ipynb**:
   - Adds three columns indicating:
     - The actual RA (actual RA).
     - The source of the actual RA (RA source).
     - How the actual RA updates the given potential maximum release amount (update label).
The calculated results are stored in the folders `description`, `posts`, and `incident_posts` (created automatically after running `2_RA_identification.ipynb`).
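The README does not spell out how the update label in step 3 is derived, so the sketch below shows one plausible scheme for illustration only: compare the identified actual RA against the reported potential maximum. The label names and comparison rule are assumptions, not the repository's actual logic.

```python
def update_label(actual_ra, potential_max):
    """Illustrative (assumed) rule for classifying how the actual RA
    updates a previously reported potential maximum release amount.
    NOT the repository's actual labeling logic."""
    if potential_max is None:
        return "new"        # no prior estimate existed to update
    if actual_ra > potential_max:
        return "increase"   # actual release exceeds the prior estimate
    if actual_ra < potential_max:
        return "decrease"   # actual release falls below the estimate
    return "confirm"        # actual release matches the estimate
```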
## Supporting Files
Additional files supporting the algorithms are automatically called by the three notebooks above. These include:
- **func_set_general.ipynb** and **func_set_ra_eval.ipynb**:
  - Custom function sets used by the three primary notebooks.
- Folders **manual** and **gpted**:
  - Contain the manually verified candidate RAs and the candidate RAs evaluated by the Large Language Model (LLM), included to ensure repeatability.
## Usage
The code is designed to run in an Anaconda environment; the following additional packages must be installed before running the notebooks.
```bash
# Install stanza for rule-based RA candidate extraction.
# Note: stanza must be version 1.3.0 to ensure repeatability, since the
# corpora of different versions differ.
!pip install stanza==1.3.0
# Install openai for LLM-based RA candidate evaluation.
!pip install openai
```
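To give a feel for what a candidate RA looks like, here is a simplified regex-based stand-in for the extraction step. The repository's actual rule-based extraction uses stanza's NLP pipeline; this pattern is only an illustration of the kind of amount-plus-unit mention being targeted.

```python
import re

# Illustrative pattern for spotting candidate release amounts in free
# text. A simplified stand-in, NOT the repository's stanza pipeline.
RA_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(gallons?|barrels?|liters?|tons?)",
    re.IGNORECASE,
)

def candidate_ras(text):
    """Return (amount, unit) pairs found in an incident description."""
    return [(float(m.group(1).replace(",", "")), m.group(2).lower())
            for m in RA_PATTERN.finditer(text)]
```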
Additionally, a GPT API key is required; it should be specified in the `query_gpt` function in the `func_set_ra_eval.ipynb` file.
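One way to supply the key without hard-coding it in the notebook is to read it from an environment variable. The helper below is a hypothetical sketch (the function name and environment variable are assumptions); the repository itself expects the key inside `query_gpt` in `func_set_ra_eval.ipynb`.

```python
import os

def get_api_key(env_var="OPENAI_API_KEY"):
    """Hypothetical helper: fetch the GPT API key from an environment
    variable instead of hard-coding it in the notebook."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running "
            "the notebooks that call the GPT API."
        )
    return key
```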
To run the notebooks, follow these steps:
1. Open Jupyter Lab:
```bash
jupyter lab
```
2. Navigate to the project directory in Jupyter Lab.
3. Run the notebooks in the following order:
- `1_RA_extraction.ipynb`
- `2_RA_identification.ipynb`
- `3_add_ra_source_and_update.ipynb`
## Contact
Yiming Liu - [liu3285@purdue.edu](mailto:liu3285@purdue.edu)
Hua Cai - [huacai@purdue.edu](mailto:huacai@purdue.edu)