# Code and data for evaluating oil spill amount from text-form incident information
This repository includes the code and data for evaluating oil spill amounts from textual incident information. The code is written in Python and is run in Jupyter Lab under Anaconda.
## Table of Contents
- [Dataset](#dataset)
- [Algorithms](#algorithms)
- [Supporting Files](#supporting-files)
- [Usage](#usage)
- [Contact](#contact)
## Dataset
The raw data are the incident records published by NOAA. For each oil spill incident covered by NOAA (available at NOAA IncidentNews), the IncidentNews platform records detailed information in two primary categories:
- **Incident-level data**: Stored in the folder `description`.
- **Post-level data**: Stored in the folder `posts`.
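As a minimal sketch of working with these two folders (the actual file layout and format of the NOAA data may differ; `load_incident_texts` is a hypothetical helper), the per-incident texts can be read along these lines:

```python
from pathlib import Path

def load_incident_texts(folder):
    """Read every text file in `folder` and return {filename: contents}.

    Hypothetical helper: assumes one plain-text file per incident, which
    may not match the actual layout of the `description`/`posts` folders.
    """
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        texts[path.stem] = path.read_text(encoding="utf-8")
    return texts

# incident_level = load_incident_texts("description")
# post_level = load_incident_texts("posts")
```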
## Algorithms
The algorithms for evaluating release amounts (RA) are split into three notebooks by function:
1. **1_RA_extraction.ipynb**:
   - Identifies oil spill-related incidents in the raw incident data.
   - Extracts candidate RAs separately for each text segment in the incident-level and post-level data.
2. **2_RA_identification.ipynb**:
   - Identifies the final RA for each incident from its candidate RAs.
3. **3_add_ra_source_and_update.ipynb**:
   - Adds three columns to indicate:
     - The actual RA (actual RA).
     - The source of the actual RA (RA source).
     - How the actual RA updates the given potential maximum release amount (update label).

The calculated results are stored in the folders `description`, `posts`, and `incident_posts` (created automatically after running `2_RA_identification.ipynb`).
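To illustrate the kind of candidate extraction performed in step 1, here is a simplified rule-based sketch. It is a hypothetical illustration only: the actual notebooks use stanza-based processing and cover far more phrasings and units than this single regular expression.

```python
import re

# Hypothetical pattern: a number (optionally with commas and a decimal part)
# followed by a volume unit. The real extraction in 1_RA_extraction.ipynb
# is stanza-based and far more thorough.
RA_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(gallons?|barrels?|liters?|gal|bbl)",
    re.IGNORECASE,
)

def extract_candidate_ras(text):
    """Return (amount, unit) candidate release amounts found in `text`."""
    return [
        (float(amount.replace(",", "")), unit.lower())
        for amount, unit in RA_PATTERN.findall(text)
    ]

print(extract_candidate_ras(
    "An estimated 1,500 gallons of diesel, up to 10 barrels of crude."
))
# [(1500.0, 'gallons'), (10.0, 'barrels')]
```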
## Supporting Files
Additional files that support the three notebooks above are called automatically when they run. These include:
- **func_set_general.ipynb** and **func_set_ra_eval.ipynb**:
  - Custom functions used by the three primary notebooks.
- Folders **manual** and **gpted**:
  - Contain the manually verified candidate RAs and the candidate RAs evaluated by the Large Language Model (LLM), included to ensure repeatability.
## Usage
The code is designed to run in an Anaconda environment; the following additional packages must be installed before running.
```bash
# Install stanza for rule-based RA candidate extraction.
# The stanza version must be 1.3.0 to ensure repeatability, as the
# corpora of different versions differ.
!pip install stanza==1.3.0
# Install openai for LLM-based evaluation of candidate RAs
!pip install openai
```
Additionally, a GPT API key is required; it should be specified in the `query_gpt` function in the `func_set_ra_eval.ipynb` file.
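As a hypothetical sketch of where the key fits (the actual `query_gpt` implementation may differ; `build_ra_messages` and the prompt wording are illustrative only), the prompt can be assembled separately from the API call, with the key read from an environment variable rather than hard-coded:

```python
import os

def build_ra_messages(segment):
    """Assemble chat messages for evaluating candidate RAs in one text
    segment (hypothetical prompt wording)."""
    return [
        {"role": "system",
         "content": "You evaluate oil spill release amounts mentioned in text."},
        {"role": "user", "content": segment},
    ]

# Inside query_gpt, the key and call might look like (openai >= 1.0 style):
# from openai import OpenAI
# client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo", messages=build_ra_messages(segment)
# )
```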
To run the notebooks, follow these steps:
1. Open Jupyter Lab:
   ```bash
   jupyter lab
   ```
2. Navigate to the project directory in Jupyter Lab.
3. Run the notebooks in the following order:
   - `1_RA_extraction.ipynb`
   - `2_RA_identification.ipynb`
   - `3_add_ra_source_and_update.ipynb`
## Contact
Yiming Liu - [liu3285@purdue.edu](mailto:liu3285@purdue.edu)

Hua Cai - [huacai@purdue.edu](mailto:huacai@purdue.edu)