{
"cells": [
{
"cell_type": "markdown",
"id": "0a2ad01d",
"metadata": {},
"source": [
"# MGMT388 Group Project Notebook\n",
"Caleb Hammoudeh | Colby McDowell | Tyler Perryman\n",
"\n",
"# Abstract / Source\n",
"\n",
"For this project we plan to perform 3 modeling techniques in order to analyze this auctioned car dataset. We will first perform a cluster analysis to look to identify groups of auctioned cars based on their selling price and model characteristics. Next we will use logistic regression in order to predict whether a car is automatic or manual based on its characteristics. Lastly, we are going to perform a linear regression test on odometer to see whether or not the given data of auctioned cars can accurately predict the odometer reading at the time of the sale. \n",
"\n",
"Goal: After using all of the tests/models discussed above, we want to see how important the characteristics of an auctioned car are. Such as make, model, interior, or condition.\n",
"\n",
"## Car Sales Dataset\n",
"* Tunguz, Bojan. “Used Car Auction Prices.” Kaggle, 18 May 2021, https://www.kaggle.com/datasets/tunguz/used-car-auction-prices.\n",
"* Shape - (472336, 23) - Data Records = 472,336\n",
"* Usage: Cluster - sample of 100,000 records | Logistic & Linear Regression - entire dataset\n",
"* NOTE: MMR stands for Manheim Market Report - this is a leading indicator of wholesale prices across the country to determine an estimated selling price based on millions of sales transactions for the specific model and year of the car for sale.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f4def57b",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import Series, DataFrame, Index, Categorical\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split,cross_val_predict,cross_val_score\n",
"from sklearn.metrics import r2_score,mean_squared_error\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.linear_model import SGDClassifier\n",
"import numpy as np\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.metrics import silhouette_score\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"id": "d308ecd4",
"metadata": {},
"source": [
"## Data Gathering"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "b955bbac",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>make</th>\n",
" <th>model</th>\n",
" <th>trim</th>\n",
" <th>body</th>\n",
" <th>transmission</th>\n",
" <th>vin</th>\n",
" <th>state</th>\n",
" <th>condition</th>\n",
" <th>odometer</th>\n",
" <th>color</th>\n",
" <th>interior</th>\n",
" <th>seller</th>\n",
" <th>mmr</th>\n",
" <th>sellingprice</th>\n",
" <th>saledate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>13886</th>\n",
" <td>2009</td>\n",
" <td>Honda</td>\n",
" <td>Accord</td>\n",
" <td>EX-L</td>\n",
" <td>Coupe</td>\n",
" <td>automatic</td>\n",
" <td>1hgcs12859a003890</td>\n",
" <td>wa</td>\n",
" <td>2.9</td>\n",
" <td>77345.0</td>\n",
" <td>gray</td>\n",
" <td>black</td>\n",
" <td>northtown auto liquidators</td>\n",
" <td>10000</td>\n",
" <td>9500</td>\n",
" <td>Tue Dec 23 2014 14:30:00 GMT-0800 (PST)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>416120</th>\n",
" <td>2006</td>\n",
" <td>Ford</td>\n",
" <td>Expedition</td>\n",
" <td>XLT</td>\n",
" <td>SUV</td>\n",
" <td>automatic</td>\n",
" <td>1fmpu15576lb00788</td>\n",
" <td>ga</td>\n",
" <td>3.6</td>\n",
" <td>151388.0</td>\n",
" <td>silver</td>\n",
" <td>gray</td>\n",
" <td>fifth third bank</td>\n",
" <td>3225</td>\n",
" <td>4100</td>\n",
" <td>Thu May 21 2015 02:30:00 GMT-0700 (PDT)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>395589</th>\n",
" <td>2011</td>\n",
" <td>Chevrolet</td>\n",
" <td>Equinox</td>\n",
" <td>LS</td>\n",
" <td>SUV</td>\n",
" <td>automatic</td>\n",
" <td>2cnflcec5b6395773</td>\n",
" <td>mn</td>\n",
" <td>4.3</td>\n",
" <td>105800.0</td>\n",
" <td>white</td>\n",
" <td>gray</td>\n",
" <td>select lane</td>\n",
" <td>10100</td>\n",
" <td>10800</td>\n",
" <td>Thu Mar 05 2015 02:30:00 GMT-0800 (PST)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" year make model trim body transmission \\\n",
"13886 2009 Honda Accord EX-L Coupe automatic \n",
"416120 2006 Ford Expedition XLT SUV automatic \n",
"395589 2011 Chevrolet Equinox LS SUV automatic \n",
"\n",
" vin state condition odometer color interior \\\n",
"13886 1hgcs12859a003890 wa 2.9 77345.0 gray black \n",
"416120 1fmpu15576lb00788 ga 3.6 151388.0 silver gray \n",
"395589 2cnflcec5b6395773 mn 4.3 105800.0 white gray \n",
"\n",
" seller mmr sellingprice \\\n",
"13886 northtown auto liquidators 10000 9500 \n",
"416120 fifth third bank 3225 4100 \n",
"395589 select lane 10100 10800 \n",
"\n",
" saledate \n",
"13886 Tue Dec 23 2014 14:30:00 GMT-0800 (PST) \n",
"416120 Thu May 21 2015 02:30:00 GMT-0700 (PDT) \n",
"395589 Thu Mar 05 2015 02:30:00 GMT-0800 (PST) "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars = pd.read_csv('car_prices.csv',on_bad_lines='skip').dropna()\n",
"cars.sample(n=3)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "7e2a65c5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(472336, 16)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.shape"
]
},
{
"cell_type": "markdown",
"id": "e37bb004",
"metadata": {},
"source": [
"## Data Cleaning"
]
},
{
"cell_type": "markdown",
"id": "c85a3ccc",
"metadata": {},
"source": [
"In order to clean the data, variables like make, model, trim, body, transmission, state, color, and interior must be changed to numerical values in order to be used as features in regression and cluster analysis. Make, model, trim, body, state, color, and interior will be changed to category variables where each specific value will be given a number to represent it. For the transmission variable there are only 2 outcomes so the new values for transimission will be binary where 1 = automatic and 0 = manual. "
]
},
{
"cell_type": "markdown",
"id": "fe8819d7",
"metadata": {},
"source": [
"First we am going to change transmission to a binary variable:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "733ff26a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 455974\n",
"0 16362\n",
"Name: transmission, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.transmission = np.where(cars.transmission == 'automatic',1,0)\n",
"cars.transmission.value_counts()"
]
},
{
"cell_type": "markdown",
"id": "dfbf76de",
"metadata": {},
"source": [
"Next we are going to give all of the categorical variables we plan to use for our model a numerical representation:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "a7773e34",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>make</th>\n",
" <th>model</th>\n",
" <th>trim</th>\n",
" <th>body</th>\n",
" <th>transmission</th>\n",
" <th>vin</th>\n",
" <th>state</th>\n",
" <th>condition</th>\n",
" <th>odometer</th>\n",
" <th>...</th>\n",
" <th>mmr</th>\n",
" <th>sellingprice</th>\n",
" <th>saledate</th>\n",
" <th>makeCode</th>\n",
" <th>modelCode</th>\n",
" <th>trimCode</th>\n",
" <th>bodyCode</th>\n",
" <th>stateCode</th>\n",
" <th>colorCode</th>\n",
" <th>interiorCode</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>489058</th>\n",
" <td>2012</td>\n",
" <td>Ford</td>\n",
" <td>Focus</td>\n",
" <td>SE</td>\n",
" <td>hatchback</td>\n",
" <td>1</td>\n",
" <td>1fahp3k20cl404844</td>\n",
" <td>mi</td>\n",
" <td>3.3</td>\n",
" <td>59353.0</td>\n",
" <td>...</td>\n",
" <td>8250</td>\n",
" <td>9400</td>\n",
" <td>Thu Jun 11 2015 02:30:00 GMT-0700 (PDT)</td>\n",
" <td>14</td>\n",
" <td>275</td>\n",
" <td>1044</td>\n",
" <td>65</td>\n",
" <td>12</td>\n",
" <td>14</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>528471</th>\n",
" <td>2007</td>\n",
" <td>Honda</td>\n",
" <td>Accord</td>\n",
" <td>Value Package</td>\n",
" <td>sedan</td>\n",
" <td>1</td>\n",
" <td>jhmcm56107c008377</td>\n",
" <td>fl</td>\n",
" <td>2.3</td>\n",
" <td>113249.0</td>\n",
" <td>...</td>\n",
" <td>5250</td>\n",
" <td>4200</td>\n",
" <td>Fri Jun 12 2015 02:30:00 GMT-0700 (PDT)</td>\n",
" <td>18</td>\n",
" <td>54</td>\n",
" <td>1337</td>\n",
" <td>76</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>216428</th>\n",
" <td>2013</td>\n",
" <td>Chrysler</td>\n",
" <td>300</td>\n",
" <td>Base</td>\n",
" <td>Sedan</td>\n",
" <td>1</td>\n",
" <td>2c3ccaag5dh508315</td>\n",
" <td>tx</td>\n",
" <td>3.0</td>\n",
" <td>70542.0</td>\n",
" <td>...</td>\n",
" <td>15250</td>\n",
" <td>14625</td>\n",
" <td>Wed Jan 28 2015 02:00:00 GMT-0800 (PST)</td>\n",
" <td>8</td>\n",
" <td>10</td>\n",
" <td>395</td>\n",
" <td>36</td>\n",
" <td>29</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 23 columns</p>\n",
"</div>"
],
"text/plain": [
" year make model trim body transmission \\\n",
"489058 2012 Ford Focus SE hatchback 1 \n",
"528471 2007 Honda Accord Value Package sedan 1 \n",
"216428 2013 Chrysler 300 Base Sedan 1 \n",
"\n",
" vin state condition odometer ... mmr sellingprice \\\n",
"489058 1fahp3k20cl404844 mi 3.3 59353.0 ... 8250 9400 \n",
"528471 jhmcm56107c008377 fl 2.3 113249.0 ... 5250 4200 \n",
"216428 2c3ccaag5dh508315 tx 3.0 70542.0 ... 15250 14625 \n",
"\n",
" saledate makeCode modelCode trimCode \\\n",
"489058 Thu Jun 11 2015 02:30:00 GMT-0700 (PDT) 14 275 1044 \n",
"528471 Fri Jun 12 2015 02:30:00 GMT-0700 (PDT) 18 54 1337 \n",
"216428 Wed Jan 28 2015 02:00:00 GMT-0800 (PST) 8 10 395 \n",
"\n",
" bodyCode stateCode colorCode interiorCode \n",
"489058 65 12 14 1 \n",
"528471 76 4 2 6 \n",
"216428 36 29 1 1 \n",
"\n",
"[3 rows x 23 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.make = pd.Categorical(cars.make)\n",
"cars['makeCode'] = cars.make.cat.codes\n",
"cars.model = pd.Categorical(cars.model)\n",
"cars['modelCode'] = cars.model.cat.codes\n",
"cars.trim = pd.Categorical(cars.trim)\n",
"cars['trimCode'] = cars.trim.cat.codes\n",
"cars.body = pd.Categorical(cars.body)\n",
"cars['bodyCode'] = cars.body.cat.codes\n",
"cars.state = pd.Categorical(cars.state)\n",
"cars['stateCode'] = cars.state.cat.codes\n",
"cars.color = pd.Categorical(cars.color)\n",
"cars['colorCode'] = cars.color.cat.codes\n",
"cars.interior = pd.Categorical(cars.interior)\n",
"cars['interiorCode'] = cars.interior.cat.codes\n",
"cars.sample(n=3)"
]
},
{
"cell_type": "markdown",
"id": "88b37751",
"metadata": {},
"source": [
"Now that all categorical variables have been converted, they can be used in a linear regression model in order to predicted the selling price of the car based on its characteristics. We simply updated the transmission column but created new columns for the numerical representation of the categorical variables that will be used in our models."
]
},
{
"cell_type": "markdown",
"id": "1ba37d7e",
"metadata": {},
"source": [
"## Understanding the Data"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "75e92f75",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['year', 'make', 'model', 'trim', 'body', 'transmission', 'vin', 'state',\n",
" 'condition', 'odometer', 'color', 'interior', 'seller', 'mmr',\n",
" 'sellingprice', 'saledate', 'makeCode', 'modelCode', 'trimCode',\n",
" 'bodyCode', 'stateCode', 'colorCode', 'interiorCode'],\n",
" dtype='object')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.columns"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "1da554a0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>transmission</th>\n",
" <th>condition</th>\n",
" <th>odometer</th>\n",
" <th>mmr</th>\n",
" <th>sellingprice</th>\n",
" <th>makeCode</th>\n",
" <th>modelCode</th>\n",
" <th>trimCode</th>\n",
" <th>bodyCode</th>\n",
" <th>stateCode</th>\n",
" <th>colorCode</th>\n",
" <th>interiorCode</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>year</th>\n",
" <td>1.000000</td>\n",
" <td>0.089430</td>\n",
" <td>0.548307</td>\n",
" <td>-0.773104</td>\n",
" <td>0.588605</td>\n",
" <td>0.578918</td>\n",
" <td>-0.007196</td>\n",
" <td>-0.047812</td>\n",
" <td>0.084704</td>\n",
" <td>0.053416</td>\n",
" <td>-0.000645</td>\n",
" <td>0.068540</td>\n",
" <td>-0.189505</td>\n",
" </tr>\n",
" <tr>\n",
" <th>transmission</th>\n",
" <td>0.089430</td>\n",
" <td>1.000000</td>\n",
" <td>0.029010</td>\n",
" <td>-0.032237</td>\n",
" <td>0.043743</td>\n",
" <td>0.046125</td>\n",
" <td>-0.042781</td>\n",
" <td>-0.014385</td>\n",
" <td>0.004080</td>\n",
" <td>0.062315</td>\n",
" <td>-0.016593</td>\n",
" <td>0.020776</td>\n",
" <td>0.020557</td>\n",
" </tr>\n",
" <tr>\n",
" <th>condition</th>\n",
" <td>0.548307</td>\n",
" <td>0.029010</td>\n",
" <td>1.000000</td>\n",
" <td>-0.537544</td>\n",
" <td>0.481460</td>\n",
" <td>0.535990</td>\n",
" <td>-0.019163</td>\n",
" <td>-0.011467</td>\n",
" <td>0.057427</td>\n",
" <td>0.010612</td>\n",
" <td>0.007658</td>\n",
" <td>0.070712</td>\n",
" <td>-0.110408</td>\n",
" </tr>\n",
" <tr>\n",
" <th>odometer</th>\n",
" <td>-0.773104</td>\n",
" <td>-0.032237</td>\n",
" <td>-0.537544</td>\n",
" <td>1.000000</td>\n",
" <td>-0.582648</td>\n",
" <td>-0.577385</td>\n",
" <td>-0.027418</td>\n",
" <td>0.062646</td>\n",
" <td>-0.031531</td>\n",
" <td>-0.014305</td>\n",
" <td>0.016755</td>\n",
" <td>-0.036899</td>\n",
" <td>0.164859</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mmr</th>\n",
" <td>0.588605</td>\n",
" <td>0.043743</td>\n",
" <td>0.481460</td>\n",
" <td>-0.582648</td>\n",
" <td>1.000000</td>\n",
" <td>0.983492</td>\n",
" <td>-0.061550</td>\n",
" <td>-0.003597</td>\n",
" <td>0.038344</td>\n",
" <td>-0.040206</td>\n",
" <td>-0.018688</td>\n",
" <td>0.011533</td>\n",
" <td>-0.123567</td>\n",
" </tr>\n",
" <tr>\n",
" <th>sellingprice</th>\n",
" <td>0.578918</td>\n",
" <td>0.046125</td>\n",
" <td>0.535990</td>\n",
" <td>-0.577385</td>\n",
" <td>0.983492</td>\n",
" <td>1.000000</td>\n",
" <td>-0.059856</td>\n",
" <td>-0.003213</td>\n",
" <td>0.036328</td>\n",
" <td>-0.037849</td>\n",
" <td>-0.022587</td>\n",
" <td>0.017575</td>\n",
" <td>-0.123299</td>\n",
" </tr>\n",
" <tr>\n",
" <th>makeCode</th>\n",
" <td>-0.007196</td>\n",
" <td>-0.042781</td>\n",
" <td>-0.019163</td>\n",
" <td>-0.027418</td>\n",
" <td>-0.061550</td>\n",
" <td>-0.059856</td>\n",
" <td>1.000000</td>\n",
" <td>0.050214</td>\n",
" <td>-0.083977</td>\n",
" <td>0.040430</td>\n",
" <td>-0.030963</td>\n",
" <td>-0.002738</td>\n",
" <td>-0.004095</td>\n",
" </tr>\n",
" <tr>\n",
" <th>modelCode</th>\n",
" <td>-0.047812</td>\n",
" <td>-0.014385</td>\n",
" <td>-0.011467</td>\n",
" <td>0.062646</td>\n",
" <td>-0.003597</td>\n",
" <td>-0.003213</td>\n",
" <td>0.050214</td>\n",
" <td>1.000000</td>\n",
" <td>0.121473</td>\n",
" <td>-0.011997</td>\n",
" <td>0.019903</td>\n",
" <td>0.004428</td>\n",
" <td>0.011523</td>\n",
" </tr>\n",
" <tr>\n",
" <th>trimCode</th>\n",
" <td>0.084704</td>\n",
" <td>0.004080</td>\n",
" <td>0.057427</td>\n",
" <td>-0.031531</td>\n",
" <td>0.038344</td>\n",
" <td>0.036328</td>\n",
" <td>-0.083977</td>\n",
" <td>0.121473</td>\n",
" <td>1.000000</td>\n",
" <td>-0.012417</td>\n",
" <td>0.057665</td>\n",
" <td>0.031270</td>\n",
" <td>0.004997</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bodyCode</th>\n",
" <td>0.053416</td>\n",
" <td>0.062315</td>\n",
" <td>0.010612</td>\n",
" <td>-0.014305</td>\n",
" <td>-0.040206</td>\n",
" <td>-0.037849</td>\n",
" <td>0.040430</td>\n",
" <td>-0.011997</td>\n",
" <td>-0.012417</td>\n",
" <td>1.000000</td>\n",
" <td>0.005593</td>\n",
" <td>-0.001303</td>\n",
" <td>-0.001420</td>\n",
" </tr>\n",
" <tr>\n",
" <th>stateCode</th>\n",
" <td>-0.000645</td>\n",
" <td>-0.016593</td>\n",
" <td>0.007658</td>\n",
" <td>0.016755</td>\n",
" <td>-0.018688</td>\n",
" <td>-0.022587</td>\n",
" <td>-0.030963</td>\n",
" <td>0.019903</td>\n",
" <td>0.057665</td>\n",
" <td>0.005593</td>\n",
" <td>1.000000</td>\n",
" <td>0.008656</td>\n",
" <td>0.063051</td>\n",
" </tr>\n",
" <tr>\n",
" <th>colorCode</th>\n",
" <td>0.068540</td>\n",
" <td>0.020776</td>\n",
" <td>0.070712</td>\n",
" <td>-0.036899</td>\n",
" <td>0.011533</td>\n",
" <td>0.017575</td>\n",
" <td>-0.002738</td>\n",
" <td>0.004428</td>\n",
" <td>0.031270</td>\n",
" <td>-0.001303</td>\n",
" <td>0.008656</td>\n",
" <td>1.000000</td>\n",
" <td>0.023422</td>\n",
" </tr>\n",
" <tr>\n",
" <th>interiorCode</th>\n",
" <td>-0.189505</td>\n",
" <td>0.020557</td>\n",
" <td>-0.110408</td>\n",
" <td>0.164859</td>\n",
" <td>-0.123567</td>\n",
" <td>-0.123299</td>\n",
" <td>-0.004095</td>\n",
" <td>0.011523</td>\n",
" <td>0.004997</td>\n",
" <td>-0.001420</td>\n",
" <td>0.063051</td>\n",
" <td>0.023422</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" year transmission condition odometer mmr \\\n",
"year 1.000000 0.089430 0.548307 -0.773104 0.588605 \n",
"transmission 0.089430 1.000000 0.029010 -0.032237 0.043743 \n",
"condition 0.548307 0.029010 1.000000 -0.537544 0.481460 \n",
"odometer -0.773104 -0.032237 -0.537544 1.000000 -0.582648 \n",
"mmr 0.588605 0.043743 0.481460 -0.582648 1.000000 \n",
"sellingprice 0.578918 0.046125 0.535990 -0.577385 0.983492 \n",
"makeCode -0.007196 -0.042781 -0.019163 -0.027418 -0.061550 \n",
"modelCode -0.047812 -0.014385 -0.011467 0.062646 -0.003597 \n",
"trimCode 0.084704 0.004080 0.057427 -0.031531 0.038344 \n",
"bodyCode 0.053416 0.062315 0.010612 -0.014305 -0.040206 \n",
"stateCode -0.000645 -0.016593 0.007658 0.016755 -0.018688 \n",
"colorCode 0.068540 0.020776 0.070712 -0.036899 0.011533 \n",
"interiorCode -0.189505 0.020557 -0.110408 0.164859 -0.123567 \n",
"\n",
" sellingprice makeCode modelCode trimCode bodyCode \\\n",
"year 0.578918 -0.007196 -0.047812 0.084704 0.053416 \n",
"transmission 0.046125 -0.042781 -0.014385 0.004080 0.062315 \n",
"condition 0.535990 -0.019163 -0.011467 0.057427 0.010612 \n",
"odometer -0.577385 -0.027418 0.062646 -0.031531 -0.014305 \n",
"mmr 0.983492 -0.061550 -0.003597 0.038344 -0.040206 \n",
"sellingprice 1.000000 -0.059856 -0.003213 0.036328 -0.037849 \n",
"makeCode -0.059856 1.000000 0.050214 -0.083977 0.040430 \n",
"modelCode -0.003213 0.050214 1.000000 0.121473 -0.011997 \n",
"trimCode 0.036328 -0.083977 0.121473 1.000000 -0.012417 \n",
"bodyCode -0.037849 0.040430 -0.011997 -0.012417 1.000000 \n",
"stateCode -0.022587 -0.030963 0.019903 0.057665 0.005593 \n",
"colorCode 0.017575 -0.002738 0.004428 0.031270 -0.001303 \n",
"interiorCode -0.123299 -0.004095 0.011523 0.004997 -0.001420 \n",
"\n",
" stateCode colorCode interiorCode \n",
"year -0.000645 0.068540 -0.189505 \n",
"transmission -0.016593 0.020776 0.020557 \n",
"condition 0.007658 0.070712 -0.110408 \n",
"odometer 0.016755 -0.036899 0.164859 \n",
"mmr -0.018688 0.011533 -0.123567 \n",
"sellingprice -0.022587 0.017575 -0.123299 \n",
"makeCode -0.030963 -0.002738 -0.004095 \n",
"modelCode 0.019903 0.004428 0.011523 \n",
"trimCode 0.057665 0.031270 0.004997 \n",
"bodyCode 0.005593 -0.001303 -0.001420 \n",
"stateCode 1.000000 0.008656 0.063051 \n",
"colorCode 0.008656 1.000000 0.023422 \n",
"interiorCode 0.063051 0.023422 1.000000 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.corr(numeric_only='TRUE')"
]
},
{
"cell_type": "markdown",
"id": "e1d10f72",
"metadata": {},
"source": [
"There aren't many strong correlations other than selling price and mmr. This shows that the estimated price based on millions of other transactions is quite accurate and should definitely be used in the model to make prediction as good as possible."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "05953e64",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3.0, 4.0] 157960\n",
"(4.0, 5.0] 147264\n",
"(2.0, 3.0] 106286\n",
"(0.995, 2.0] 60826\n",
"Name: condition, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.condition.value_counts(ascending= False, bins=4)"
]
},
{
"cell_type": "markdown",
"id": "60169969",
"metadata": {},
"source": [
"# Data Modeling"
]
},
{
"cell_type": "markdown",
"id": "00ad95cb",
"metadata": {},
"source": [
"## Model 1: K Means Cluster Analysis on Auctioned Cars"
]
},
{
"cell_type": "markdown",
"id": "0584e9fc",
"metadata": {},
"source": [
"Since we are dealing with such a large dataset of auctioned cars, we are definitely experiencing a large range in values among the data records. So, by using cluster analysis we hope to group the auctioned cars into accurate groups so that there can be prediction models used exclusively for different categories of auctioned cars. For the features we will use sellingprice, mmr, condition, interiorCode, and odometer as we feel these are the most classifying regarding auctioned cars.\n",
"\n",
"NOTE: This cluster analysis will be done on a sample of 100,000 data records. This is done so that the analysis won't take a long period in order to perform. In addition, please note that if running the silhouette score function, it may take a couple minutes to officially finish running. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "564c5dc1",
"metadata": {},
"outputs": [],
"source": [
"features=['sellingprice','mmr','condition','interiorCode','odometer']"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "274d9a82",
"metadata": {},
"outputs": [],
"source": [
"carsSubset=cars[features].copy().dropna().sample(n=100000)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "96b675ad",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover 
label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>KMeans(max_iter=1000, n_clusters=3, n_init=&#x27;auto&#x27;)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">KMeans</label><div class=\"sk-toggleable__content\"><pre>KMeans(max_iter=1000, n_clusters=3, n_init=&#x27;auto&#x27;)</pre></div></div></div></div></div>"
],
"text/plain": [
"KMeans(max_iter=1000, n_clusters=3, n_init='auto')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numClusters= 3\n",
"carsCluster=KMeans(n_clusters=numClusters,n_init=\"auto\",max_iter=1000)\n",
"carsCluster.fit(carsSubset)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "bc669de9",
"metadata": {},
"outputs": [],
"source": [
"carsSubset['cluster']=carsCluster.labels_"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ed3f21ea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"58745911800330.86"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"carsCluster.inertia_"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "1262d1cc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.5512702976030737"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"silhouette_score(carsSubset, carsCluster.labels_)"
]
},
{
"cell_type": "markdown",
"id": "ce4e2d38",
"metadata": {},
"source": [
"## K-Means Evaluation\n",
"This inertia score is very high signaling a great distance between clusters in this kmeans cluster analysis. Additionally, we have a silhouette sore of 0.55 which signals that most data points belong in the correct cluster they were placed in. \n",
"\n",
"To attempt to update this cluster analysis, we are going to change the number of clusters to 5 to see if that increases our silhouette score and increases the accuracy of our clusters."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "922e285f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre>KMeans(max_iter=1000, n_clusters=5, n_init=&#x27;auto&#x27;)</pre>"
],
"text/plain": [
"KMeans(max_iter=1000, n_clusters=5, n_init='auto')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numClusters= 5\n",
"carsCluster=KMeans(n_clusters=numClusters,n_init=\"auto\",max_iter=1000)\n",
"carsCluster.fit(carsSubset)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "7d2203a4",
"metadata": {},
"outputs": [],
"source": [
"carsSubset['cluster']=carsCluster.labels_"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "63f04b1f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"33936955765386.645"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"carsCluster.inertia_"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "76b15575",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4310201521817097"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"silhouette_score(carsSubset, carsCluster.labels_)"
]
},
{
"cell_type": "markdown",
"id": "7fdd36e4",
"metadata": {},
"source": [
"## K-Means Summary\n",
"Increasing the number of clusters to 5 reduced our silhouette score to about 0.43, which means the records were less cleanly separated than with fewer clusters. Of the cluster counts we tried, 3 was most likely our best cluster analysis. \n",
"\n",
"Our inertia was still very large (about 3.4e13). Inertia is the sum of squared distances from each record to its cluster center, so its size depends on the units of the features and the number of records; by itself it does not show that the clusters are well separated. The lower silhouette score is the clearer sign that, with 5 clusters, more records sit close to a neighboring cluster. \n",
"\n",
"Conclusion: The best number of clusters is a lower number. This suggests that there isn't a huge variety in the types of cars being auctioned in this dataset. One possible takeaway is that the auctioned cars fall into 3 groups tiered by quality, where quality reflects the condition, sales price, and car characteristics (interior/odometer). The 3 groups could be described as Low Quality, Medium Quality, and High Quality. Isolating these groups into their own subsets could lead to more accurate models, since prices and other characteristics would vary less within each group. "
]
},
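{
"cell_type": "markdown",
"id": "c1a7e001",
"metadata": {},
"source": [
"Because inertia is a sum of squared distances, its size depends on the units of the features. The sketch below is only an illustration: the prices are made up (they are not values from the dataset) and `inertia_1d` is a hypothetical helper, not a scikit-learn function. It measures the same grouping once in dollars and once in thousands of dollars.\n",
"\n",
"```python\n",
"def inertia_1d(clusters):\n",
"    # Sum of squared distances from each point to the mean of its own cluster.\n",
"    total = 0.0\n",
"    for points in clusters:\n",
"        center = sum(points) / len(points)\n",
"        total += sum((p - center) ** 2 for p in points)\n",
"    return total\n",
"\n",
"# Made-up selling prices in dollars, already grouped into two clusters.\n",
"dollars = [[4000, 5000, 6000], [20000, 22000, 24000]]\n",
"# The same cars with prices expressed in thousands of dollars.\n",
"thousands = [[p / 1000 for p in group] for group in dollars]\n",
"\n",
"print(inertia_1d(dollars))\n",
"print(inertia_1d(thousands))\n",
"```\n",
"\n",
"The grouping is identical in both cases, yet the two numbers differ by a factor of a million, which is why a large inertia on unscaled price and odometer columns is hard to interpret on its own."
]
},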
{
"cell_type": "markdown",
"id": "67fbd873",
"metadata": {},
"source": [
"## Model 2: Logistic Regression Analysis on Transmission\n",
"\n",
"We will use logistic regression on this dataset next, with the \"transmission\" variable as our response, to see how accurately the rest of the variables predict whether a car's transmission is automatic or manual. Note that `SGDClassifier` with its default settings uses hinge loss (a linear SVM); passing `loss=\"log_loss\"` would fit a logistic regression instead."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "2ea644de",
"metadata": {},
"outputs": [],
"source": [
"response = ['transmission']\n",
"features = ['year','makeCode','modelCode','trimCode','bodyCode'\n",
" ,'colorCode','interiorCode', 'mmr', 'condition','stateCode','sellingprice']"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "6772bc6e",
"metadata": {},
"outputs": [],
"source": [
"modelCars = cars[['year','makeCode','modelCode','trimCode','bodyCode'\n",
" ,'colorCode','interiorCode', 'mmr','transmission','condition','stateCode','odometer','sellingprice']]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "a892a0a6",
"metadata": {},
"outputs": [],
"source": [
"trainingData, testData = train_test_split(modelCars, test_size=0.1, train_size=0.8, shuffle=True)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "6e5a2006",
"metadata": {},
"outputs": [],
"source": [
"trainFeatures = trainingData[features]\n",
"trainResponse = trainingData['transmission']\n",
"testFeatures = testData[features]\n",
"testResponse = testData['transmission']\n",
"trainTransmission = trainResponse == 1\n",
"testTransmission = testResponse == 1"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "23e797da",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre>SGDClassifier(random_state=42)</pre>"
],
"text/plain": [
"SGDClassifier(random_state=42)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transmissionClassifier = SGDClassifier(random_state=42)\n",
"transmissionClassifier.fit(trainFeatures,trainResponse)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "62a97a87",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 1, ..., 1, 1, 1])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"testPredictions = transmissionClassifier.predict(testFeatures)\n",
"testPredictions"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "575cf0e1",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_predict\n",
"\n",
"testPredictTransmission = cross_val_predict(transmissionClassifier, testFeatures, testTransmission, cv=3)"
]
},
{
"cell_type": "markdown",
"id": "05324578",
"metadata": {},
"source": [
"## Logistic Regression Evaluation"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "144a0db8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>predicted</th>\n",
" <th>neg</th>\n",
" <th>pos</th>\n",
" </tr>\n",
" <tr>\n",
" <th>expected</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>neg</th>\n",
" <td>54</td>\n",
" <td>1593</td>\n",
" </tr>\n",
" <tr>\n",
" <th>pos</th>\n",
" <td>432</td>\n",
" <td>45155</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"predicted neg pos\n",
"expected \n",
"neg 54 1593\n",
"pos 432 45155"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"transmissionConfusion = confusion_matrix(testTransmission, testPredictTransmission)\n",
"confusion=DataFrame(transmissionConfusion)\n",
"confusion.index=['neg','pos']\n",
"confusion.index.name='expected'\n",
"confusion.columns=['neg','pos']\n",
"confusion.columns.name='predicted'\n",
"confusion"
]
},
{
"cell_type": "markdown",
"id": "2da3b2a6",
"metadata": {},
"source": [
"## Changing It Up\n",
"\n",
"Accuracy = 45,209/47,234 = 0.9571 ~ 96%\n",
"\n",
"Sensitivity (share of actual 1s predicted as 1) = 45,155/45,587 = 0.9905 ~ 99%\n",
"\n",
"Specificity (share of actual 0s predicted as 0) = 54/1,647 = 0.0328 ~ 3%\n",
"\n",
"Overall accuracy for this initial model is high, but it is driven almost entirely by the positive class 1 (automatic): automatics are identified about 99% of the time, while manuals (negative class 0) are identified only about 3% of the time. We concluded that this is most likely associated with the low count of manual transmission cars in the dataset. Only 1,647 of the 47,234 test records are manual, so predicting automatic for every car would already score 45,587/47,234 = 0.9651 ~ 97% accuracy. Note also that these counts come from `cross_val_predict` run on the test set, so they reflect 3-fold models refit on the test data rather than the classifier trained above. \n",
"\n",
"For a different analysis, we will remove the variable most highly correlated with transmission in the table shown above, bodyCode, and see whether this changes the outcome of our model. Since accuracy and sensitivity are already high, it will be interesting to see whether removing it changes things. We expect a drop-off, since the most highly correlated variable should have some impact on the prediction."
]
},
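{
"cell_type": "markdown",
"id": "c1a7e002",
"metadata": {},
"source": [
"The rates above can be recomputed directly from the confusion matrix. The following is a minimal sketch in plain Python; the helper name `rates` is hypothetical, and the counts are copied from the matrix output above (rows are expected, columns are predicted).\n",
"\n",
"```python\n",
"def rates(tn, fp, fn, tp):\n",
"    # tn/fp come from the expected-neg row, fn/tp from the expected-pos row.\n",
"    total = tn + fp + fn + tp\n",
"    return {\n",
"        'accuracy': (tn + tp) / total,\n",
"        'sensitivity': tp / (tp + fn),  # actual 1s predicted as 1\n",
"        'specificity': tn / (tn + fp),  # actual 0s predicted as 0\n",
"        # Accuracy of always predicting the more common class.\n",
"        'majority_baseline': max(tp + fn, tn + fp) / total,\n",
"    }\n",
"\n",
"full_model = rates(tn=54, fp=1593, fn=432, tp=45155)\n",
"for name, value in full_model.items():\n",
"    print(name, round(value, 4))\n",
"```\n",
"\n",
"Comparing `accuracy` with `majority_baseline` is a quick check on whether a high accuracy reflects the model or just the class imbalance."
]
},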
{
"cell_type": "code",
"execution_count": 29,
"id": "43c1d30c",
"metadata": {},
"outputs": [],
"source": [
"features = ['makeCode','modelCode','trimCode',\n",
" 'year','colorCode','interiorCode', 'mmr', 'stateCode','condition','sellingprice']\n",
"modelCars = cars[['year','makeCode','modelCode','trimCode','bodyCode'\n",
" ,'colorCode','interiorCode','mmr','transmission','condition','stateCode','odometer','sellingprice']]\n",
"trainingData, testData = train_test_split(modelCars, test_size=0.1, train_size=0.8, shuffle=True)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "65b7b034",
"metadata": {},
"outputs": [],
"source": [
"trainFeatures = trainingData[features]\n",
"trainResponse = trainingData['transmission']\n",
"testFeatures = testData[features]\n",
"testResponse = testData['transmission']\n",
"trainTransmission = trainResponse == 1\n",
"testTransmission = testResponse == 1"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "aa3cf2b9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre>SGDClassifier(random_state=42)</pre>"
],
"text/plain": [
"SGDClassifier(random_state=42)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transmissionClassifier = SGDClassifier(random_state=42)\n",
"transmissionClassifier.fit(trainFeatures,trainResponse)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "abca1fe3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 1, ..., 1, 1, 1])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"testPredictions = transmissionClassifier.predict(testFeatures)\n",
"testPredictions"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "74f76f02",
"metadata": {},
"outputs": [],
"source": [
"testPredictTransmission = cross_val_predict(transmissionClassifier, testFeatures, testTransmission, cv=3)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "4420980e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>predicted</th>\n",
" <th>neg</th>\n",
" <th>pos</th>\n",
" </tr>\n",
" <tr>\n",
" <th>expected</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>neg</th>\n",
" <td>111</td>\n",
" <td>1544</td>\n",
" </tr>\n",
" <tr>\n",
" <th>pos</th>\n",
" <td>1393</td>\n",
" <td>44186</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"predicted neg pos\n",
"expected \n",
"neg 111 1544\n",
"pos 1393 44186"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transmissionConfusion = confusion_matrix(testTransmission, testPredictTransmission)\n",
"confusion=DataFrame(transmissionConfusion)\n",
"confusion.index=['neg','pos']\n",
"confusion.index.name='expected'\n",
"confusion.columns=['neg','pos']\n",
"confusion.columns.name='predicted'\n",
"confusion"
]
},
{
"cell_type": "markdown",
"id": "3da8c92d",
"metadata": {},
"source": [
"## Logistic Regression Summary\n",
"\n",
"Accuracy = 44,297/47,234 = 0.9378 ~ 94%\n",
"\n",
"Sensitivity (share of actual 1s predicted as 1) = 44,186/45,579 = 0.9694 ~ 97% \n",
"\n",
"Specificity (share of actual 0s predicted as 0) = 111/1,655 = 0.0671 ~ 7%\n",
"\n",
"Compared with the full-feature logistic analysis, where we used all of the features to predict transmission, removing the feature most highly correlated with transmission (bodyCode) made little difference. Accuracy fell from about 96% to 94% and sensitivity from about 99% to 97%, while specificity rose by about 4 percentage points (from 3% to 7%). So removing bodyCode could be seen as a small improvement in identifying manual cars, but all in all it didn't lead to any significant change in our model. Because each model was evaluated on a different random train/test split, differences this small may also partly reflect the split. "
]
},
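{
"cell_type": "markdown",
"id": "c1a7e003",
"metadata": {},
"source": [
"With classes this unbalanced, one way to compare the two runs on equal footing is balanced accuracy, the average of sensitivity and specificity. This is a sketch rather than part of the original analysis; `balanced_accuracy` is a hypothetical helper and the counts are copied from the two confusion matrices in this notebook.\n",
"\n",
"```python\n",
"def balanced_accuracy(tn, fp, fn, tp):\n",
"    sensitivity = tp / (tp + fn)  # actual 1s predicted as 1\n",
"    specificity = tn / (tn + fp)  # actual 0s predicted as 0\n",
"    return (sensitivity + specificity) / 2\n",
"\n",
"# Run with all features, then the run without bodyCode.\n",
"with_body = balanced_accuracy(tn=54, fp=1593, fn=432, tp=45155)\n",
"without_body = balanced_accuracy(tn=111, fp=1544, fn=1393, tp=44186)\n",
"print(round(with_body, 3), round(without_body, 3))\n",
"```\n",
"\n",
"A classifier that always predicts the same class scores 0.5 on this metric, so values close to 0.5 indicate that the manual class is barely being identified."
]
},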
{
"cell_type": "markdown",
"id": "daa582be",
"metadata": {},
"source": [
"## Data Model 3: Linear Regression"
]
},
{
"cell_type": "markdown",
"id": "b68fac42",
"metadata": {},
"source": [
"### Our Linear Regression model will focus on predicting the odometer (number of miles driven)"
]
},
{
"cell_type": "markdown",
"id": "1779804c",
"metadata": {},
"source": [
"To attempt to predict the number of miles driven for each car (odometer), we will first use the sellingprice column as the only feature."
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "ec1cea53",
"metadata": {},
"outputs": [],
"source": [
"features=['sellingprice']\n",
"response=['odometer']"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "2af578cb",
"metadata": {},
"outputs": [],
"source": [
"selectedData=cars[['odometer','sellingprice']]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "07d242f3",
"metadata": {},
"outputs": [],
"source": [
"training, testing = train_test_split(selectedData, test_size=0.3, train_size=0.7, shuffle=True)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "baeec1f8",
"metadata": {},
"outputs": [],
"source": [
"trainingFeatures=training.loc[:,features]\n",
"trainingResponse=training.loc[:,response]\n",
"testingFeatures=testing.loc[:,features]\n",
"testingResponse=testing.loc[:,response]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "73b1ea0d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre>LinearRegression()</pre>"
],
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"linreg = LinearRegression()\n",
"linreg.fit(trainingFeatures, trainingResponse)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "b6822cc9",
"metadata": {},
"outputs": [],
"source": [
"predictedResponse=linreg.predict(testingFeatures)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "938640aa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.3355164521922719"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r2_score(testingResponse,predictedResponse)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "40bd96ae",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1805298849.2598186"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_squared_error(testingResponse, predictedResponse)"
]
},
{
"cell_type": "markdown",
"id": "824545e5",
"metadata": {},
"source": [
"## Linear Regression Evaluation\n",
"Due to the low R-squared score (about 0.34) and the high MSE (about 1.8e9, which corresponds to an RMSE of roughly 42,500 miles), we will add additional features (mmr, year, and condition) to try to predict the odometer reading more accurately."
]
},
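{
"cell_type": "markdown",
"id": "c1a7e004",
"metadata": {},
"source": [
"MSE is measured in squared miles, so its square root (RMSE) gives a typical error in miles that is easier to read. A minimal sketch, using the MSE value reported by `mean_squared_error` above:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"mse = 1805298849.2598186  # value printed by mean_squared_error above\n",
"rmse = math.sqrt(mse)     # root mean squared error, in miles\n",
"print(round(rmse))\n",
"```\n",
"\n",
"Reading the error in miles makes it easier to judge whether a later model with more features is a practical improvement."
]
},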
{
"cell_type": "code",
"execution_count": 43,
"id": "f685100b",
"metadata": {},
"outputs": [],
"source": [
"features=['sellingprice','mmr','condition','year']\n",
"response=['odometer']"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "8e184ccd",
"metadata": {},
"outputs": [],
"source": [
"selectedData=cars[['odometer','sellingprice','mmr','condition','year']]"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "bfbe1c64",
"metadata": {},
"outputs": [],
"source": [
"training, testing = train_test_split(selectedData, test_size=0.3, train_size=0.7, shuffle=True)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "331c6b0c",
"metadata": {},
"outputs": [],
"source": [
"trainingFeatures=training.loc[:,features]\n",
"trainingResponse=training.loc[:,response]\n",
"testingFeatures=testing.loc[:,features]\n",
"testingResponse=testing.loc[:,response]"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "9303b64d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre>LinearRegression()</pre>"
],
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit an ordinary least squares model on the training split\n",
"linreg = LinearRegression()\n",
"linreg.fit(trainingFeatures, trainingResponse)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "561e0217",
"metadata": {},
"outputs": [],
"source": [
"# Predict the response for the held-out test features\n",
"predictedResponse = linreg.predict(testingFeatures)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "63262225",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.631472016955943"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# R2 on the test split (1.0 would be a perfect fit)\n",
"r2_score(testingResponse, predictedResponse)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "6ce55d10",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"982324037.5717765"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Mean squared error on the test split, in squared units of the response\n",
"mean_squared_error(testingResponse, predictedResponse)"
]
},
{
"cell_type": "markdown",
"id": "5827371f",
"metadata": {},
"source": [
"## Linear Regression Summary\n",
"\n",
"Since the R2 value almost doubled, to about 0.63, while the MSE decreased, we can say that adding more variables gives a much stronger model for predicting odometer. Selling price alone did not explain enough of the variation in odometer, so the model needed more variables to make better predictions. The model still leaves about 37% of the variance unexplained, and an MSE of roughly 9.8e8 corresponds to a root-mean-squared error of about 31,000 on the odometer reading, so adding further relevant variables would likely improve it more."
]
},
{
"cell_type": "markdown",
"id": "d4ad7f24",
"metadata": {},
"source": [
"# Final Summary"
]
},
{
"cell_type": "markdown",
"id": "5f80c5af",
"metadata": {},
"source": [
"Initially, we had no idea how much the characteristics of the auctioned cars would matter for building accurate prediction models, especially variables like interior, body, and transmission. After performing cluster analysis, logistic regression, and linear regression, we can say that the characteristics of the auctioned cars in our dataset are useful for prediction. In our cluster analysis we found a small number of clusters that were well separated, as shown by the large distance between groups and a reasonably good silhouette score. Additionally, our logistic and linear regression models were effective at prediction, with the final linear model reaching an R2 of about 0.63 on the test data. To wrap up, our models made their best predictions when they used the full set of dataset variables, including both the car characteristics and the other numeric details."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}