MGMT388 Group Project Notebook.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0a2ad01d",
   "metadata": {},
   "source": [
    "# MGMT388 Group Project Notebook\n",
    "Caleb Hammoudeh | Colby McDowell | Tyler Perryman\n",
    "\n",
    "# Abstract / Source\n",
    "\n",
    "For this project we plan to perform 3 modeling techniques in order to analyze this auctioned car dataset. We will first perform a cluster analysis to look to identify groups of auctioned cars based on their selling price and model characteristics. Next we will use logistic regression in order to predict whether a car is automatic or manual based on its characteristics. Lastly, we are going to perform a linear regression test on odometer to see whether or not the given data of auctioned cars can accurately predict the odometer reading at the time of the sale. \n",
    "\n",
    "Goal: After using all of the tests/models discussed above, we want to see how important the characteristics of an auctioned car are. Such as make, model, interior, or condition.\n",
    "\n",
    "## Car Sales Dataset\n",
    "* Tunguz, Bojan. “Used Car Auction Prices.” Kaggle, 18 May 2021, https://www.kaggle.com/datasets/tunguz/used-car-auction-prices.\n",
    "* Shape - (472336, 23) - Data Records = 472,336\n",
    "* Usage: Cluster - sample of 100,000 records | Logistic & Linear Regression - entire dataset\n",
    "* NOTE: MMR stands for Manheim Market Report - this is a leading indicator of wholesale prices across the country to determine an estimated selling price based on millions of sales transactions for the specific model and year of the car for sale.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f4def57b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import Series, DataFrame, Index, Categorical\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.model_selection import train_test_split,cross_val_predict,cross_val_score\n",
    "from sklearn.metrics import r2_score,mean_squared_error\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.linear_model import SGDClassifier\n",
    "import numpy as np\n",
    "from sklearn.datasets import make_blobs\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.metrics import silhouette_score\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d308ecd4",
   "metadata": {},
   "source": [
    "## Data Gathering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "b955bbac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>make</th>\n",
       "      <th>model</th>\n",
       "      <th>trim</th>\n",
       "      <th>body</th>\n",
       "      <th>transmission</th>\n",
       "      <th>vin</th>\n",
       "      <th>state</th>\n",
       "      <th>condition</th>\n",
       "      <th>odometer</th>\n",
       "      <th>color</th>\n",
       "      <th>interior</th>\n",
       "      <th>seller</th>\n",
       "      <th>mmr</th>\n",
       "      <th>sellingprice</th>\n",
       "      <th>saledate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>13886</th>\n",
       "      <td>2009</td>\n",
       "      <td>Honda</td>\n",
       "      <td>Accord</td>\n",
       "      <td>EX-L</td>\n",
       "      <td>Coupe</td>\n",
       "      <td>automatic</td>\n",
       "      <td>1hgcs12859a003890</td>\n",
       "      <td>wa</td>\n",
       "      <td>2.9</td>\n",
       "      <td>77345.0</td>\n",
       "      <td>gray</td>\n",
       "      <td>black</td>\n",
       "      <td>northtown auto liquidators</td>\n",
       "      <td>10000</td>\n",
       "      <td>9500</td>\n",
       "      <td>Tue Dec 23 2014 14:30:00 GMT-0800 (PST)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416120</th>\n",
       "      <td>2006</td>\n",
       "      <td>Ford</td>\n",
       "      <td>Expedition</td>\n",
       "      <td>XLT</td>\n",
       "      <td>SUV</td>\n",
       "      <td>automatic</td>\n",
       "      <td>1fmpu15576lb00788</td>\n",
       "      <td>ga</td>\n",
       "      <td>3.6</td>\n",
       "      <td>151388.0</td>\n",
       "      <td>silver</td>\n",
       "      <td>gray</td>\n",
       "      <td>fifth third bank</td>\n",
       "      <td>3225</td>\n",
       "      <td>4100</td>\n",
       "      <td>Thu May 21 2015 02:30:00 GMT-0700 (PDT)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>395589</th>\n",
       "      <td>2011</td>\n",
       "      <td>Chevrolet</td>\n",
       "      <td>Equinox</td>\n",
       "      <td>LS</td>\n",
       "      <td>SUV</td>\n",
       "      <td>automatic</td>\n",
       "      <td>2cnflcec5b6395773</td>\n",
       "      <td>mn</td>\n",
       "      <td>4.3</td>\n",
       "      <td>105800.0</td>\n",
       "      <td>white</td>\n",
       "      <td>gray</td>\n",
       "      <td>select lane</td>\n",
       "      <td>10100</td>\n",
       "      <td>10800</td>\n",
       "      <td>Thu Mar 05 2015 02:30:00 GMT-0800 (PST)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        year       make       model  trim   body transmission  \\\n",
       "13886   2009      Honda      Accord  EX-L  Coupe    automatic   \n",
       "416120  2006       Ford  Expedition   XLT    SUV    automatic   \n",
       "395589  2011  Chevrolet     Equinox    LS    SUV    automatic   \n",
       "\n",
       "                      vin state  condition  odometer   color interior  \\\n",
       "13886   1hgcs12859a003890    wa        2.9   77345.0    gray    black   \n",
       "416120  1fmpu15576lb00788    ga        3.6  151388.0  silver     gray   \n",
       "395589  2cnflcec5b6395773    mn        4.3  105800.0   white     gray   \n",
       "\n",
       "                            seller    mmr  sellingprice  \\\n",
       "13886   northtown auto liquidators  10000          9500   \n",
       "416120            fifth third bank   3225          4100   \n",
       "395589                 select lane  10100         10800   \n",
       "\n",
       "                                       saledate  \n",
       "13886   Tue Dec 23 2014 14:30:00 GMT-0800 (PST)  \n",
       "416120  Thu May 21 2015 02:30:00 GMT-0700 (PDT)  \n",
       "395589  Thu Mar 05 2015 02:30:00 GMT-0800 (PST)  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cars = pd.read_csv('car_prices.csv',on_bad_lines='skip').dropna()\n",
    "cars.sample(n=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "7e2a65c5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(472336, 16)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cars.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e37bb004",
   "metadata": {},
   "source": [
    "## Data Cleaning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c85a3ccc",
   "metadata": {},
   "source": [
    "In order to clean the data, variables like make, model, trim, body, transmission, state, color, and interior must be changed to numerical values in order to be used as features in regression and cluster analysis. Make, model, trim, body, state, color, and interior will be changed to category variables where each specific value will be given a number to represent it. For the transmission variable there are only 2 outcomes so the new values for transimission will be binary where 1 = automatic and 0 = manual.  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe8819d7",
   "metadata": {},
   "source": [
    "First we am going to change transmission to a binary variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "733ff26a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    455974\n",
       "0     16362\n",
       "Name: transmission, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cars.transmission = np.where(cars.transmission == 'automatic',1,0)\n",
    "cars.transmission.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfbf76de",
   "metadata": {},
   "source": [
    "Next we are going to give all of the categorical variables we plan to use for our model a numerical representation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "a7773e34",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>make</th>\n",
       "      <th>model</th>\n",
       "      <th>trim</th>\n",
       "      <th>body</th>\n",
       "      <th>transmission</th>\n",
       "      <th>vin</th>\n",
       "      <th>state</th>\n",
       "      <th>condition</th>\n",
       "      <th>odometer</th>\n",
       "      <th>...</th>\n",
       "      <th>mmr</th>\n",
       "      <th>sellingprice</th>\n",
       "      <th>saledate</th>\n",
       "      <th>makeCode</th>\n",
       "      <th>modelCode</th>\n",
       "      <th>trimCode</th>\n",
       "      <th>bodyCode</th>\n",
       "      <th>stateCode</th>\n",
       "      <th>colorCode</th>\n",
       "      <th>interiorCode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>489058</th>\n",
       "      <td>2012</td>\n",
       "      <td>Ford</td>\n",
       "      <td>Focus</td>\n",
       "      <td>SE</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>1</td>\n",
       "      <td>1fahp3k20cl404844</td>\n",
       "      <td>mi</td>\n",
       "      <td>3.3</td>\n",
       "      <td>59353.0</td>\n",
       "      <td>...</td>\n",
       "      <td>8250</td>\n",
       "      <td>9400</td>\n",
       "      <td>Thu Jun 11 2015 02:30:00 GMT-0700 (PDT)</td>\n",
       "      <td>14</td>\n",
       "      <td>275</td>\n",
       "      <td>1044</td>\n",
       "      <td>65</td>\n",
       "      <td>12</td>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>528471</th>\n",
       "      <td>2007</td>\n",
       "      <td>Honda</td>\n",
       "      <td>Accord</td>\n",
       "      <td>Value Package</td>\n",
       "      <td>sedan</td>\n",
       "      <td>1</td>\n",
       "      <td>jhmcm56107c008377</td>\n",
       "      <td>fl</td>\n",
       "      <td>2.3</td>\n",
       "      <td>113249.0</td>\n",
       "      <td>...</td>\n",
       "      <td>5250</td>\n",
       "      <td>4200</td>\n",
       "      <td>Fri Jun 12 2015 02:30:00 GMT-0700 (PDT)</td>\n",
       "      <td>18</td>\n",
       "      <td>54</td>\n",
       "      <td>1337</td>\n",
       "      <td>76</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>216428</th>\n",
       "      <td>2013</td>\n",
       "      <td>Chrysler</td>\n",
       "      <td>300</td>\n",
       "      <td>Base</td>\n",
       "      <td>Sedan</td>\n",
       "      <td>1</td>\n",
       "      <td>2c3ccaag5dh508315</td>\n",
       "      <td>tx</td>\n",
       "      <td>3.0</td>\n",
       "      <td>70542.0</td>\n",
       "      <td>...</td>\n",
       "      <td>15250</td>\n",
       "      <td>14625</td>\n",
       "      <td>Wed Jan 28 2015 02:00:00 GMT-0800 (PST)</td>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "      <td>395</td>\n",
       "      <td>36</td>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        year      make   model           trim       body  transmission  \\\n",
       "489058  2012      Ford   Focus             SE  hatchback             1   \n",
       "528471  2007     Honda  Accord  Value Package      sedan             1   \n",
       "216428  2013  Chrysler     300           Base      Sedan             1   \n",
       "\n",
       "                      vin state  condition  odometer  ...    mmr sellingprice  \\\n",
       "489058  1fahp3k20cl404844    mi        3.3   59353.0  ...   8250         9400   \n",
       "528471  jhmcm56107c008377    fl        2.3  113249.0  ...   5250         4200   \n",
       "216428  2c3ccaag5dh508315    tx        3.0   70542.0  ...  15250        14625   \n",
       "\n",
       "                                       saledate  makeCode  modelCode trimCode  \\\n",
       "489058  Thu Jun 11 2015 02:30:00 GMT-0700 (PDT)        14        275     1044   \n",
       "528471  Fri Jun 12 2015 02:30:00 GMT-0700 (PDT)        18         54     1337   \n",
       "216428  Wed Jan 28 2015 02:00:00 GMT-0800 (PST)         8         10      395   \n",
       "\n",
       "        bodyCode  stateCode  colorCode  interiorCode  \n",
       "489058        65         12         14             1  \n",
       "528471        76          4          2             6  \n",
       "216428        36         29          1             1  \n",
       "\n",
       "[3 rows x 23 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cars.make = pd.Categorical(cars.make)\n",
    "cars['makeCode'] = cars.make.cat.codes\n",
    "cars.model = pd.Categorical(cars.model)\n",
    "cars['modelCode'] = cars.model.cat.codes\n",
    "cars.trim = pd.Categorical(cars.trim)\n",
    "cars['trimCode'] = cars.trim.cat.codes\n",
    "cars.body = pd.Categorical(cars.body)\n",
    "cars['bodyCode'] = cars.body.cat.codes\n",
    "cars.state = pd.Categorical(cars.state)\n",
    "cars['stateCode'] = cars.state.cat.codes\n",
    "cars.color = pd.Categorical(cars.color)\n",
    "cars['colorCode'] = cars.color.cat.codes\n",
    "cars.interior = pd.Categorical(cars.interior)\n",
    "cars['interiorCode'] = cars.interior.cat.codes\n",
    "cars.sample(n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88b37751",
   "metadata": {},
   "source": [
    "Now that all categorical variables have been converted, they can be used in a linear regression model in order to predicted the selling price of the car based on its characteristics. We simply updated the transmission column but created new columns for the numerical representation of the categorical variables that will be used in our models."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ba37d7e",
   "metadata": {},
   "source": [
    "## Understanding the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "75e92f75",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['year', 'make', 'model', 'trim', 'body', 'transmission', 'vin', 'state',\n",
       "       'condition', 'odometer', 'color', 'interior', 'seller', 'mmr',\n",
       "       'sellingprice', 'saledate', 'makeCode', 'modelCode', 'trimCode',\n",
       "       'bodyCode', 'stateCode', 'colorCode', 'interiorCode'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cars.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "1da554a0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>transmission</th>\n",
       "      <th>condition</th>\n",
       "      <th>odometer</th>\n",
       "      <th>mmr</th>\n",
       "      <th>sellingprice</th>\n",
       "      <th>makeCode</th>\n",
       "      <th>modelCode</th>\n",
       "      <th>trimCode</th>\n",
       "      <th>bodyCode</th>\n",
       "      <th>stateCode</th>\n",
       "      <th>colorCode</th>\n",
       "      <th>interiorCode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.089430</td>\n",
       "      <td>0.548307</td>\n",
       "      <td>-0.773104</td>\n",
       "      <td>0.588605</td>\n",
       "      <td>0.578918</td>\n",
       "      <td>-0.007196</td>\n",
       "      <td>-0.047812</td>\n",
       "      <td>0.084704</td>\n",
       "      <td>0.053416</td>\n",
       "      <td>-0.000645</td>\n",
       "      <td>0.068540</td>\n",
       "      <td>-0.189505</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>transmission</th>\n",
       "      <td>0.089430</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.029010</td>\n",
       "      <td>-0.032237</td>\n",
       "      <td>0.043743</td>\n",
       "      <td>0.046125</td>\n",
       "      <td>-0.042781</td>\n",
       "      <td>-0.014385</td>\n",
       "      <td>0.004080</td>\n",
       "      <td>0.062315</td>\n",
       "      <td>-0.016593</td>\n",
       "      <td>0.020776</td>\n",
       "      <td>0.020557</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>condition</th>\n",
       "      <td>0.548307</td>\n",
       "      <td>0.029010</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.537544</td>\n",
       "      <td>0.481460</td>\n",
       "      <td>0.535990</td>\n",
       "      <td>-0.019163</td>\n",
       "      <td>-0.011467</td>\n",
       "      <td>0.057427</td>\n",
       "      <td>0.010612</td>\n",
       "      <td>0.007658</td>\n",
       "      <td>0.070712</td>\n",
       "      <td>-0.110408</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>odometer</th>\n",
       "      <td>-0.773104</td>\n",
       "      <td>-0.032237</td>\n",
       "      <td>-0.537544</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.582648</td>\n",
       "      <td>-0.577385</td>\n",
       "      <td>-0.027418</td>\n",
       "      <td>0.062646</td>\n",
       "      <td>-0.031531</td>\n",
       "      <td>-0.014305</td>\n",
       "      <td>0.016755</td>\n",
       "      <td>-0.036899</td>\n",
       "      <td>0.164859</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mmr</th>\n",
       "      <td>0.588605</td>\n",
       "      <td>0.043743</td>\n",
       "      <td>0.481460</td>\n",
       "      <td>-0.582648</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.983492</td>\n",
       "      <td>-0.061550</td>\n",
       "      <td>-0.003597</td>\n",
       "      <td>0.038344</td>\n",
       "      <td>-0.040206</td>\n",
       "      <td>-0.018688</td>\n",
       "      <td>0.011533</td>\n",
       "      <td>-0.123567</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sellingprice</th>\n",
       "      <td>0.578918</td>\n",
       "      <td>0.046125</td>\n",
       "      <td>0.535990</td>\n",
       "      <td>-0.577385</td>\n",
       "      <td>0.983492</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.059856</td>\n",
       "      <td>-0.003213</td>\n",
       "      <td>0.036328</td>\n",
       "      <td>-0.037849</td>\n",
       "      <td>-0.022587</td>\n",
       "      <td>0.017575</td>\n",
       "      <td>-0.123299</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>makeCode</th>\n",
       "      <td>-0.007196</td>\n",
       "      <td>-0.042781</td>\n",
       "      <td>-0.019163</td>\n",
       "      <td>-0.027418</td>\n",
       "      <td>-0.061550</td>\n",
       "      <td>-0.059856</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.050214</td>\n",
       "      <td>-0.083977</td>\n",
       "      <td>0.040430</td>\n",
       "      <td>-0.030963</td>\n",
       "      <td>-0.002738</td>\n",
       "      <td>-0.004095</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>modelCode</th>\n",
       "      <td>-0.047812</td>\n",
       "      <td>-0.014385</td>\n",
       "      <td>-0.011467</td>\n",
       "      <td>0.062646</td>\n",
       "      <td>-0.003597</td>\n",
       "      <td>-0.003213</td>\n",
       "      <td>0.050214</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.121473</td>\n",
       "      <td>-0.011997</td>\n",
       "      <td>0.019903</td>\n",
       "      <td>0.004428</td>\n",
       "      <td>0.011523</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>trimCode</th>\n",
       "      <td>0.084704</td>\n",
       "      <td>0.004080</td>\n",
       "      <td>0.057427</td>\n",
       "      <td>-0.031531</td>\n",
       "      <td>0.038344</td>\n",
       "      <td>0.036328</td>\n",
       "      <td>-0.083977</td>\n",
       "      <td>0.121473</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.012417</td>\n",
       "      <td>0.057665</td>\n",
       "      <td>0.031270</td>\n",
       "      <td>0.004997</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bodyCode</th>\n",
       "      <td>0.053416</td>\n",
       "      <td>0.062315</td>\n",
       "      <td>0.010612</td>\n",
       "      <td>-0.014305</td>\n",
       "      <td>-0.040206</td>\n",
       "      <td>-0.037849</td>\n",
       "      <td>0.040430</td>\n",
       "      <td>-0.011997</td>\n",
       "      <td>-0.012417</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.005593</td>\n",
       "      <td>-0.001303</td>\n",
       "      <td>-0.001420</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>stateCode</th>\n",
       "      <td>-0.000645</td>\n",
       "      <td>-0.016593</td>\n",
       "      <td>0.007658</td>\n",
       "      <td>0.016755</td>\n",
       "      <td>-0.018688</td>\n",
       "      <td>-0.022587</td>\n",
       "      <td>-0.030963</td>\n",
       "      <td>0.019903</td>\n",
       "      <td>0.057665</td>\n",
       "      <td>0.005593</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.008656</td>\n",
       "      <td>0.063051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>colorCode</th>\n",
       "      <td>0.068540</td>\n",
       "      <td>0.020776</td>\n",
       "      <td>0.070712</td>\n",
       "      <td>-0.036899</td>\n",
       "      <td>0.011533</td>\n",
       "      <td>0.017575</td>\n",
       "      <td>-0.002738</td>\n",
       "      <td>0.004428</td>\n",
       "      <td>0.031270</td>\n",
       "      <td>-0.001303</td>\n",
       "      <td>0.008656</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.023422</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>interiorCode</th>\n",
       "      <td>-0.189505</td>\n",
       "      <td>0.020557</td>\n",
       "      <td>-0.110408</td>\n",
       "      <td>0.164859</td>\n",
       "      <td>-0.123567</td>\n",
       "      <td>-0.123299</td>\n",
       "      <td>-0.004095</td>\n",
       "      <td>0.011523</td>\n",
       "      <td>0.004997</td>\n",
       "      <td>-0.001420</td>\n",
       "      <td>0.063051</td>\n",
       "      <td>0.023422</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  year  transmission  condition  odometer       mmr  \\\n",
       "year          1.000000      0.089430   0.548307 -0.773104  0.588605   \n",
       "transmission  0.089430      1.000000   0.029010 -0.032237  0.043743   \n",
       "condition     0.548307      0.029010   1.000000 -0.537544  0.481460   \n",
       "odometer     -0.773104     -0.032237  -0.537544  1.000000 -0.582648   \n",
       "mmr           0.588605      0.043743   0.481460 -0.582648  1.000000   \n",
       "sellingprice  0.578918      0.046125   0.535990 -0.577385  0.983492   \n",
       "makeCode     -0.007196     -0.042781  -0.019163 -0.027418 -0.061550   \n",
       "modelCode    -0.047812     -0.014385  -0.011467  0.062646 -0.003597   \n",
       "trimCode      0.084704      0.004080   0.057427 -0.031531  0.038344   \n",
       "bodyCode      0.053416      0.062315   0.010612 -0.014305 -0.040206   \n",
       "stateCode    -0.000645     -0.016593   0.007658  0.016755 -0.018688   \n",
       "colorCode     0.068540      0.020776   0.070712 -0.036899  0.011533   \n",
       "interiorCode -0.189505      0.020557  -0.110408  0.164859 -0.123567   \n",
       "\n",
       "              sellingprice  makeCode  modelCode  trimCode  bodyCode  \\\n",
       "year              0.578918 -0.007196  -0.047812  0.084704  0.053416   \n",
       "transmission      0.046125 -0.042781  -0.014385  0.004080  0.062315   \n",
       "condition         0.535990 -0.019163  -0.011467  0.057427  0.010612   \n",
       "odometer         -0.577385 -0.027418   0.062646 -0.031531 -0.014305   \n",
       "mmr               0.983492 -0.061550  -0.003597  0.038344 -0.040206   \n",
       "sellingprice      1.000000 -0.059856  -0.003213  0.036328 -0.037849   \n",
       "makeCode         -0.059856  1.000000   0.050214 -0.083977  0.040430   \n",
       "modelCode        -0.003213  0.050214   1.000000  0.121473 -0.011997   \n",
       "trimCode          0.036328 -0.083977   0.121473  1.000000 -0.012417   \n",
       "bodyCode         -0.037849  0.040430  -0.011997 -0.012417  1.000000   \n",
       "stateCode        -0.022587 -0.030963   0.019903  0.057665  0.005593   \n",
       "colorCode         0.017575 -0.002738   0.004428  0.031270 -0.001303   \n",
       "interiorCode     -0.123299 -0.004095   0.011523  0.004997 -0.001420   \n",
       "\n",
       "              stateCode  colorCode  interiorCode  \n",
       "year          -0.000645   0.068540     -0.189505  \n",
       "transmission  -0.016593   0.020776      0.020557  \n",
       "condition      0.007658   0.070712     -0.110408  \n",
       "odometer       0.016755  -0.036899      0.164859  \n",
       "mmr           -0.018688   0.011533     -0.123567  \n",
       "sellingprice  -0.022587   0.017575     -0.123299  \n",
       "makeCode      -0.030963  -0.002738     -0.004095  \n",
       "modelCode      0.019903   0.004428      0.011523  \n",
       "trimCode       0.057665   0.031270      0.004997  \n",
       "bodyCode       0.005593  -0.001303     -0.001420  \n",
       "stateCode      1.000000   0.008656      0.063051  \n",
       "colorCode      0.008656   1.000000      0.023422  \n",
       "interiorCode   0.063051   0.023422      1.000000  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cars.corr(numeric_only='TRUE')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1d10f72",
   "metadata": {},
   "source": [
    "There aren't many strong correlations other than selling price and mmr. This shows that the estimated price based on millions of other transactions is quite accurate and should definitely be used in the model to make prediction as good as possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "05953e64",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3.0, 4.0]      157960\n",
       "(4.0, 5.0]      147264\n",
       "(2.0, 3.0]      106286\n",
       "(0.995, 2.0]     60826\n",
       "Name: condition, dtype: int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cars.condition.value_counts(ascending= False, bins=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60169969",
   "metadata": {},
   "source": [
    "# Data Modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00ad95cb",
   "metadata": {},
   "source": [
    "## Model 1: K Means Cluster Analysis on Auctioned Cars"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0584e9fc",
   "metadata": {},
   "source": [
    "Since we are dealing with such a large dataset of auctioned cars, we are definitely experiencing a large range in values among the data records. So, by using cluster analysis we hope to group the auctioned cars into accurate groups so that there can be prediction models used exclusively for different categories of auctioned cars. For the features we will use sellingprice, mmr, condition, interiorCode, and odometer as we feel these are the most classifying regarding auctioned cars.\n",
    "\n",
    "NOTE: This cluster analysis will be done on a sample of 100,000 data records. This is done so that the analysis won't take a long period in order to perform. In addition, please note that if running the silhouette score function, it may take a couple minutes to officially finish running. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "564c5dc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "features=['sellingprice','mmr','condition','interiorCode','odometer']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "274d9a82",
   "metadata": {},
   "outputs": [],
   "source": [
    "carsSubset=cars[features].copy().dropna().sample(n=100000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "96b675ad",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>KMeans(max_iter=1000, n_clusters=3, n_init=&#x27;auto&#x27;)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">KMeans</label><div class=\"sk-toggleable__content\"><pre>KMeans(max_iter=1000, n_clusters=3, n_init=&#x27;auto&#x27;)</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "KMeans(max_iter=1000, n_clusters=3, n_init='auto')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "numClusters= 3\n",
    "carsCluster=KMeans(n_clusters=numClusters,n_init=\"auto\",max_iter=1000)\n",
    "carsCluster.fit(carsSubset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "bc669de9",
   "metadata": {},
   "outputs": [],
   "source": [
    "carsSubset['cluster']=carsCluster.labels_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ed3f21ea",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "58745911800330.86"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "carsCluster.inertia_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "1262d1cc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5512702976030737"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "silhouette_score(carsSubset, carsCluster.labels_)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce4e2d38",
   "metadata": {},
   "source": [
    "## K-Means Evaluation\n",
    "The inertia score (about 5.87e13) is the within-cluster sum of squared distances, so it measures how tightly records sit around their cluster centers rather than the distance between clusters. Its size depends on the scale of the features, so it is mainly useful for comparing runs on the same data. Additionally, we have a silhouette score of 0.55, which signals that most data points sit closer to their own cluster than to the nearest neighboring cluster.\n",
    "\n",
    "To try to improve this cluster analysis, we are going to change the number of clusters to 5 to see if that increases our silhouette score."
   ]
  },
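  {
   "cell_type": "markdown",
   "id": "kmeans-score-notes",
   "metadata": {},
   "source": [
    "For reference, the two scores used in the evaluation above can be written out as follows. This is a sketch based on the scikit-learn definitions of `inertia_` and `silhouette_score`; the symbols $x_i$ (a record), $\\mu_k$ (a cluster center), $a(i)$ and $b(i)$ are notation introduced here and do not appear in the notebook's code.\n",
    "\n",
    "$$\\text{inertia} = \\sum_{i=1}^{n} \\min_{k} \\lVert x_i - \\mu_k \\rVert^2$$\n",
    "\n",
    "$$s(i) = \\frac{b(i) - a(i)}{\\max\\{a(i),\\, b(i)\\}}, \\qquad \\text{silhouette score} = \\frac{1}{n} \\sum_{i=1}^{n} s(i)$$\n",
    "\n",
    "Here $a(i)$ is the mean distance from record $i$ to the other records in its own cluster, and $b(i)$ is the mean distance to the records in the nearest other cluster. Inertia tends to shrink whenever clusters are added, while the silhouette score always lies between -1 and 1, so the silhouette score is the more useful of the two for comparing different cluster counts."
   ]
  },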
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "922e285f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>KMeans(max_iter=1000, n_clusters=5, n_init=&#x27;auto&#x27;)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" checked><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">KMeans</label><div class=\"sk-toggleable__content\"><pre>KMeans(max_iter=1000, n_clusters=5, n_init=&#x27;auto&#x27;)</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "KMeans(max_iter=1000, n_clusters=5, n_init='auto')"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "numClusters= 5\n",
    "carsCluster=KMeans(n_clusters=numClusters,n_init=\"auto\",max_iter=1000)\n",
    "carsCluster.fit(carsSubset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "7d2203a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "carsSubset['cluster']=carsCluster.labels_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "63f04b1f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "33936955765386.645"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "carsCluster.inertia_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "76b15575",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4310201521817097"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "silhouette_score(carsSubset, carsCluster.labels_)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fdd36e4",
   "metadata": {},
   "source": [
    "## K-Means Summary\n",
    "Increasing the number of clusters from 3 to 5 reduced our silhouette score from 0.55 to 0.43, which means the clusters became less well separated. Of the two values we tried, 3 clusters gave the better cluster analysis.\n",
    "\n",
    "Our inertia score dropped from about 5.87e13 to 3.39e13, but inertia generally decreases as clusters are added, so this drop alone does not indicate a better fit. The lower silhouette score shows that with 5 clusters more data records sit near the boundary between clusters.\n",
    "\n",
    "Conclusion: Of the cluster counts we tested, the lower number (3) fits best. This suggests that there isn't a huge variety in the types of cars being auctioned in this dataset. One possible takeaway is that the auctioned cars fall into 3 groups tiered by quality. This quality indicator would be based on the condition, sales price, and car characteristics (interior/odometer). The 3 groups could be described as Low Quality, Medium Quality, and High Quality. Isolating these groups into their own subsets could lead to more accurate models, since prices and other characteristics of the auctioned cars would vary less within each group."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67fbd873",
   "metadata": {},
   "source": [
    "## Model 2: Logistic Regression Analysis on Transmission\n",
    "\n",
    "Next, we will run a logistic regression analysis on this dataset. We use the \"transmission\" variable as our response variable and see how accurately the remaining variables predict whether a car's transmission is automatic or manual. Note that the `SGDClassifier` used below defaults to `loss='hinge'`, which fits a linear SVM; passing `loss='log_loss'` would make it a logistic regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "2ea644de",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = ['transmission']\n",
    "features = ['year','makeCode','modelCode','trimCode','bodyCode'\n",
    "             ,'colorCode','interiorCode', 'mmr', 'condition','stateCode','sellingprice']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "6772bc6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "modelCars = cars[['year','makeCode','modelCode','trimCode','bodyCode'\n",
    "             ,'colorCode','interiorCode', 'mmr','transmission','condition','stateCode','odometer','sellingprice']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "a892a0a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainingData, testData = train_test_split(modelCars, test_size=0.1, train_size=0.8, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "6e5a2006",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainFeatures = trainingData[features]\n",
    "trainResponse = trainingData['transmission']\n",
    "testFeatures = testData[features]\n",
    "testResponse = testData['transmission']\n",
    "trainTransmission = trainResponse == 1\n",
    "testTransmission = testResponse == 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "23e797da",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>SGDClassifier(random_state=42)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SGDClassifier</label><div class=\"sk-toggleable__content\"><pre>SGDClassifier(random_state=42)</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "SGDClassifier(random_state=42)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transmissionClassifier = SGDClassifier(random_state=42)\n",
    "transmissionClassifier.fit(trainFeatures,trainResponse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "62a97a87",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, ..., 1, 1, 1])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "testPredictions = transmissionClassifier.predict(testFeatures)\n",
    "testPredictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "575cf0e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_predict\n",
    "\n",
    "testPredictTransmission = cross_val_predict(transmissionClassifier, testFeatures, testTransmission, cv=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05324578",
   "metadata": {},
   "source": [
    "## Logistic Regression Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "144a0db8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>predicted</th>\n",
       "      <th>neg</th>\n",
       "      <th>pos</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>expected</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>54</td>\n",
       "      <td>1593</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>432</td>\n",
       "      <td>45155</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "predicted  neg    pos\n",
       "expected             \n",
       "neg         54   1593\n",
       "pos        432  45155"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "transmissionConfusion = confusion_matrix(testTransmission, testPredictTransmission)\n",
    "confusion=DataFrame(transmissionConfusion)\n",
    "confusion.index=['neg','pos']\n",
    "confusion.index.name='expected'\n",
    "confusion.columns=['neg','pos']\n",
    "confusion.columns.name='predicted'\n",
    "confusion"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2da3b2a6",
   "metadata": {},
   "source": [
    "## Changing It Up\n",
    "\n",
    "Accuracy = 45,209/47,234 = 0.9571 ~ 96%\n",
    "\n",
    "Success of Predicted 1 = Sensitivity = 45,155/45,587 = 0.9905 ~ 99%\n",
    "\n",
    "Success of Predicted 0 = Specificity = 54/1,647 = 0.0328 ~ 3%\n",
    "\n",
    "Overall, this initial model has high accuracy (about 96%), but that figure is driven almost entirely by the majority class. The positive value 1 (automatic) was predicted correctly at a very high rate, while the negative value 0 (manual) was predicted correctly only about 3% of the time. We concluded that this is most likely associated with the low count of manual transmission cars in the dataset (1,647 of the 47,234 test records).\n",
    "\n",
    "For a different analysis, we will remove the variable most highly correlated with transmission in the table shown above, bodyCode, and see whether this changes the outcome of our model. Since accuracy and sensitivity are already high, it will be interesting to see whether removing this variable changes things. We expect a drop-off, as the most highly correlated variable should have some impact on the prediction."
   ]
  },
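  {
   "cell_type": "markdown",
   "id": "confusion-metric-notes",
   "metadata": {},
   "source": [
    "The three rates above follow from the four cells of the confusion matrix. As a sketch, write TN = 54, FP = 1,593, FN = 432 and TP = 45,155; these labels are introduced here for the cells of the table, with automatic (1) as the positive class.\n",
    "\n",
    "$$\\text{Accuracy} = \\frac{TP + TN}{TP + TN + FP + FN} = \\frac{45{,}209}{47{,}234} \\approx 0.957$$\n",
    "\n",
    "$$\\text{Sensitivity} = \\frac{TP}{TP + FN} = \\frac{45{,}155}{45{,}587} \\approx 0.991$$\n",
    "\n",
    "$$\\text{Specificity} = \\frac{TN}{TN + FP} = \\frac{54}{1{,}647} \\approx 0.033$$\n",
    "\n",
    "For comparison, a rule that labels every car as automatic would score $45{,}587 / 47{,}234 \\approx 0.965$ on accuracy, so accuracy on its own says little about how well manual cars are identified."
   ]
  },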
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "43c1d30c",
   "metadata": {},
   "outputs": [],
   "source": [
    "features = ['makeCode','modelCode','trimCode',\n",
    "             'year','colorCode','interiorCode', 'mmr', 'stateCode','condition','sellingprice']\n",
    "modelCars = cars[['year','makeCode','modelCode','trimCode','bodyCode'\n",
    "             ,'colorCode','interiorCode','mmr','transmission','condition','stateCode','odometer','sellingprice']]\n",
    "trainingData, testData = train_test_split(modelCars, test_size=0.1, train_size=0.8, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "65b7b034",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainFeatures = trainingData[features]\n",
    "trainResponse = trainingData['transmission']\n",
    "testFeatures = testData[features]\n",
    "testResponse = testData['transmission']\n",
    "trainTransmission = trainResponse == 1\n",
    "testTransmission = testResponse == 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "aa3cf2b9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>SGDClassifier(random_state=42)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" checked><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SGDClassifier</label><div class=\"sk-toggleable__content\"><pre>SGDClassifier(random_state=42)</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "SGDClassifier(random_state=42)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transmissionClassifier = SGDClassifier(random_state=42)\n",
    "transmissionClassifier.fit(trainFeatures,trainResponse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "abca1fe3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, ..., 1, 1, 1])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "testPredictions = transmissionClassifier.predict(testFeatures)\n",
    "testPredictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "74f76f02",
   "metadata": {},
   "outputs": [],
   "source": [
    "testPredictTransmission = cross_val_predict(transmissionClassifier, testFeatures, testTransmission, cv=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "4420980e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>predicted</th>\n",
       "      <th>neg</th>\n",
       "      <th>pos</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>expected</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>111</td>\n",
       "      <td>1544</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>1393</td>\n",
       "      <td>44186</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "predicted   neg    pos\n",
       "expected              \n",
       "neg         111   1544\n",
       "pos        1393  44186"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transmissionConfusion = confusion_matrix(testTransmission, testPredictTransmission)\n",
    "confusion=DataFrame(transmissionConfusion)\n",
    "confusion.index=['neg','pos']\n",
    "confusion.index.name='expected'\n",
    "confusion.columns=['neg','pos']\n",
    "confusion.columns.name='predicted'\n",
    "confusion"
   ]
  },
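  {
   "cell_type": "markdown",
   "id": "e3b7c1a9",
   "metadata": {},
   "source": [
    "The table above uses the layout returned by `confusion_matrix`: rows are the expected (true) labels and columns are the predicted labels. As a minimal sketch of how those four counts arise, the snippet below tallies a small set of made-up labels in plain Python. The `expected` and `predicted` lists are hypothetical and are not drawn from the car data.\n",
    "\n",
    "```python\n",
    "# Toy labels (hypothetical, not taken from the car data): 1 = positive, 0 = negative\n",
    "expected = [0, 0, 1, 1, 1, 1]\n",
    "predicted = [0, 1, 1, 1, 0, 1]\n",
    "\n",
    "# counts[i][j] = number of rows whose expected label is i and predicted label is j\n",
    "counts = [[0, 0], [0, 0]]\n",
    "for e, p in zip(expected, predicted):\n",
    "    counts[e][p] += 1\n",
    "\n",
    "# Unpack in the same order as the table: first row is expected neg, second row is expected pos\n",
    "true_neg, false_pos = counts[0]\n",
    "false_neg, true_pos = counts[1]\n",
    "print(counts)\n",
    "```\n",
    "\n",
    "Reading the result the same way as the table, the top-left count is the true negatives and the bottom-right count is the true positives."
   ]
  },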
  {
   "cell_type": "markdown",
   "id": "3da8c92d",
   "metadata": {},
   "source": [
    "## Logistic Regression Summary\n",
    "\n",
    "Accuracy = 44,297/47,234 = 0.9378 ~ 94%\n",
    "\n",
    "Success of Predicted 1 = Sensitivity = 44,186/45,579 = 0.9694 ~ 97% \n",
    "\n",
    "Success of Predicted 0 = Specificity = 111/1,655 = 0.0671 ~ 7%\n",
    "\n",
    "From the full features logistic analysis, where we took all of the features to predict the transmission, there really is no difference whenever you take out the highest correlation associated to transmission (bodycode). Accuracy and specificty stayed the same, but specificity increased by about 4%. So, you could say taking out the bodycode feature could be an improvement to prediciting whether or not the car's transmission is automatic or manual, but all in all, taking out the bodycode variable didn't lead to any significant in our model. "
   ]
  },
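  {
   "cell_type": "markdown",
   "id": "b91f4d27",
   "metadata": {},
   "source": [
    "The three ratios in the summary above can be reproduced directly from the confusion matrix counts. The sketch below is plain arithmetic on the reported numbers. The `baseline_accuracy` value is an added reference point (the accuracy of always predicting the positive class) and is not a figure from the original analysis.\n",
    "\n",
    "```python\n",
    "# Counts copied from the confusion matrix above (rows = expected, columns = predicted)\n",
    "true_neg, false_pos = 111, 1544\n",
    "false_neg, true_pos = 1393, 44186\n",
    "\n",
    "total = true_neg + false_pos + false_neg + true_pos\n",
    "\n",
    "accuracy = (true_pos + true_neg) / total\n",
    "sensitivity = true_pos / (true_pos + false_neg)  # share of expected positives predicted correctly\n",
    "specificity = true_neg / (true_neg + false_pos)  # share of expected negatives predicted correctly\n",
    "\n",
    "# Added reference point (an assumption of this sketch): always predict the positive class\n",
    "baseline_accuracy = (true_pos + false_neg) / total\n",
    "\n",
    "print(f'accuracy={accuracy:.4f}')\n",
    "print(f'sensitivity={sensitivity:.4f}')\n",
    "print(f'specificity={specificity:.4f}')\n",
    "print(f'baseline_accuracy={baseline_accuracy:.4f}')\n",
    "```\n",
    "\n",
    "Because the positive class makes up most of the test rows, accuracy on its own can look strong even when specificity is low, so it helps to read all three ratios together."
   ]
  },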
  {
   "cell_type": "markdown",
   "id": "daa582be",
   "metadata": {},
   "source": [
    "## Data Model 3: Linear Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b68fac42",
   "metadata": {},
   "source": [
    "### Our Linear Regression model will focus on predicting the odometer(Number of miles driven)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1779804c",
   "metadata": {},
   "source": [
    "To attempt to predict the number of miles driven for each car(odometer), we will use the column sellingprice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "ec1cea53",
   "metadata": {},
   "outputs": [],
   "source": [
    "features=['sellingprice']\n",
    "response=['odometer']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "2af578cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "selectedData=cars[['odometer','sellingprice']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "07d242f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "training, testing = train_test_split(selectedData, test_size=0.3, train_size=0.7, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "baeec1f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainingFeatures=training.loc[:,features]\n",
    "trainingResponse=training.loc[:,response]\n",
    "testingFeatures=testing.loc[:,features]\n",
    "testingResponse=testing.loc[:,response]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "73b1ea0d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-3 {color: black;background-color: white;}#sk-container-id-3 pre{padding: 0;}#sk-container-id-3 div.sk-toggleable {background-color: white;}#sk-container-id-3 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-3 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-3 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-3 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-3 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-3 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-3 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-3 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-3 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-3 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-3 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-3 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-3 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-3 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-3 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-3 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-3 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-3 div.sk-item {position: relative;z-index: 1;}#sk-container-id-3 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-3 div.sk-item::before, #sk-container-id-3 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-3 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-3 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-3 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-3 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-3 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-3 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-3 div.sk-label-container {text-align: center;}#sk-container-id-3 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-3 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-3\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>LinearRegression()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" checked><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LinearRegression</label><div class=\"sk-toggleable__content\"><pre>LinearRegression()</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "LinearRegression()"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "linreg = LinearRegression()\n",
    "linreg.fit(trainingFeatures, trainingResponse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "b6822cc9",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictedResponse=linreg.predict(testingFeatures)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "938640aa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3355164521922719"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r2_score(testingResponse,predictedResponse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "40bd96ae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1805298849.2598186"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean_squared_error(testingResponse, predictedResponse)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "824545e5",
   "metadata": {},
   "source": [
    "## Linear Regression Evaluation\n",
    "Due to the low R-squared score and high MSE value, we will attempt to add additional features(mmr, year, and condition) in order to more accurately predict the odometer reading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "f685100b",
   "metadata": {},
   "outputs": [],
   "source": [
    "features=['sellingprice','mmr','condition','year']\n",
    "response=['odometer']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "8e184ccd",
   "metadata": {},
   "outputs": [],
   "source": [
    "selectedData=cars[['odometer','sellingprice','mmr','condition','year']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "bfbe1c64",
   "metadata": {},
   "outputs": [],
   "source": [
    "training, testing = train_test_split(selectedData, test_size=0.3, train_size=0.7, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "331c6b0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainingFeatures=training.loc[:,features]\n",
    "trainingResponse=training.loc[:,response]\n",
    "testingFeatures=testing.loc[:,features]\n",
    "testingResponse=testing.loc[:,response]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "9303b64d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-4 {color: black;background-color: white;}#sk-container-id-4 pre{padding: 0;}#sk-container-id-4 div.sk-toggleable {background-color: white;}#sk-container-id-4 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-4 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-4 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-4 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-4 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-4 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-4 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-4 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-4 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-4 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-4 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-4 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-4 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-4 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-4 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-4 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-4 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-4 div.sk-item {position: relative;z-index: 1;}#sk-container-id-4 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-4 div.sk-item::before, #sk-container-id-4 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-4 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-4 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-4 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-4 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-4 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-4 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-4 div.sk-label-container {text-align: center;}#sk-container-id-4 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-4 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-4\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>LinearRegression()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" checked><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LinearRegression</label><div class=\"sk-toggleable__content\"><pre>LinearRegression()</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "LinearRegression()"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "linreg = LinearRegression()\n",
    "linreg.fit(trainingFeatures, trainingResponse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "561e0217",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictedResponse=linreg.predict(testingFeatures)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "63262225",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.631472016955943"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r2_score(testingResponse,predictedResponse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "6ce55d10",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "982324037.5717765"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean_squared_error(testingResponse, predictedResponse)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5827371f",
   "metadata": {},
   "source": [
    "## Linear Regression Summary\n",
    "\n",
    "Since the R2 value almost doubled with MSE decreasing, we can confidently say that by adding more variables to predict odometer you can get a much stronger prediction model. Clearly, selling price did have a strong relationship with odometer so the model needed more variables in order to make better predictions. If we were to take this model a step further with even more variables the model would more than likely improve more than it already has. "
   ]
  },
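  {
   "cell_type": "markdown",
   "id": "7ac2e6f0",
   "metadata": {},
   "source": [
    "MSE is measured in squared units, so the two values above are hard to read directly. Taking the square root gives the RMSE, which is in the same units as odometer. The sketch below assumes odometer is recorded in miles, as described earlier in this section. The two models were evaluated on separately shuffled test splits, so the comparison is approximate.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "# MSE values copied from the two mean_squared_error outputs above\n",
    "mse_one_feature = 1805298849.2598186   # sellingprice only\n",
    "mse_four_features = 982324037.5717765  # sellingprice, mmr, condition, year\n",
    "\n",
    "# RMSE = square root of MSE, expressed in the units of the response (assumed miles)\n",
    "rmse_one_feature = math.sqrt(mse_one_feature)\n",
    "rmse_four_features = math.sqrt(mse_four_features)\n",
    "\n",
    "print(f'RMSE, one feature:   {rmse_one_feature:,.0f} miles')\n",
    "print(f'RMSE, four features: {rmse_four_features:,.0f} miles')\n",
    "```\n",
    "\n",
    "On this reading, the typical prediction error drops from roughly 42,000 miles to roughly 31,000 miles when the extra features are added."
   ]
  },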
  {
   "cell_type": "markdown",
   "id": "d4ad7f24",
   "metadata": {},
   "source": [
    "# Final Summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f80c5af",
   "metadata": {},
   "source": [
    "Initially, we had no idea how impactful the characteristics of the auctioned cars would be for creating accurate prediction models. Especially variables like interior, body, and trasmission. After performing our analysis using cluster analysis, logistic regression, and linear regression we can confidently say that the details or characteristics of these auctioned cars in our data set are great for prediction. In our cluster analysis we found the amount of clusters to be low but accurate because of the high distance between groups as well as having a pretty good silhouette score. Additionally, our logistic and linear regression models were very effective in prediction with both having high R2 score values. To wrap up, our models were very successful using all of the dataset variables including characteristics and other mathematical details in order to make accurate predictions. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}