{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e8dbfa07",
   "metadata": {},
   "source": [
    "# Capstone 2 — Session 2: Data Processing and Statistical Analysis\n",
    "\n",
    "**Run timestamp:** `2026-02-19 01:49:22`\n",
    "\n",
    "## Goal\n",
    "- Process the cleaned NSMES dataset from Capstone 1, transform encoded variables (`age`, `income`) into real-world units, and produce a statistically summarized dataset for downstream modeling.\n",
    "- Deliver documented evidence for memory comparison, transformation correctness, descriptive statistics, and export readiness for Capstone 3.\n",
    "\n",
    "## Inputs\n",
    "- `NSMES1988new.csv` (copied from Capstone 1 outputs if not already present locally)\n",
    "\n",
    "## Outputs\n",
    "- All exports go to `./outputs/` (and plots to `./outputs/plots/` when applicable)\n",
    "\n",
    "## Libraries (documented)\n",
    "- `pandas`: needed for DataFrame operations and descriptive statistics; enabled loading, transforming, validating, and exporting tabular data.\n",
    "\n",
    "## Key dataset note\n",
    "- `age` is encoded as **Age in years (divided by 10)** (e.g., `6.9` = 69 years)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d192d9c1",
   "metadata": {},
   "source": [
    "## C?-T0 - Runtime setup (local + Colab paths)\n",
    "This setup cell prepares reproducible working folders for local runs and Google Colab.\n",
    "When the notebook runs in Colab, it stages the PDF, notebook copy, and input CSV files from the public Francis Burnet GitHub repository before the requirement steps begin."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4bc71eaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Capstone: 2 | Session: Session 2: Data Processing and Statistical Analysis\n",
      "Run timestamp: 2026-02-19 00:31:20\n",
      "Output directory: c:\\DEV_Projects\\SIMPLILEARN\\CAPSTONE_Applied_Data_Science_with _Python\\Incremental_Capstone\\Capstone 2\\outputs\n",
      "Plots directory: c:\\DEV_Projects\\SIMPLILEARN\\CAPSTONE_Applied_Data_Science_with _Python\\Incremental_Capstone\\Capstone 2\\outputs\\plots\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "from datetime import datetime\n",
    "from urllib.parse import quote\n",
    "from urllib.request import urlretrieve\n",
    "import os\n",
    "import sys\n",
    "\n",
    "try:\n",
    "    from IPython.display import display\n",
    "except Exception:\n",
    "    def display(value):\n",
    "        print(value)\n",
    "\n",
    "# --- Project metadata ---\n",
    "CAPSTONE = 2\n",
    "SESSION_TITLE = 'Session 2: Data Processing and Statistical Analysis'\n",
    "IS_COLAB = 'google.colab' in sys.modules\n",
    "RAW_BASE = os.environ.get('FRANCISBURNET_RAW_BASE', 'https://raw.githubusercontent.com/FrancisBurnet/francisburnet/main')\n",
    "\n",
    "print(f\"Capstone: {CAPSTONE} | Session: {SESSION_TITLE}\")\n",
    "print(\"Run timestamp:\", datetime.now().strftime(\"%Y-%m-%d %H:%M:%S\"))\n",
    "print(\"Runtime:\", \"Google Colab\" if IS_COLAB else \"Local / notebook runtime\")\n",
    "\n",
    "CWD = Path.cwd()\n",
    "if IS_COLAB:\n",
    "    BASE_DIR = Path('/content/francisburnet_capstone_2')\n",
    "elif (CWD / f\"Capstone {CAPSTONE}\").exists():\n",
    "    BASE_DIR = CWD / f\"Capstone {CAPSTONE}\"\n",
    "elif CWD.name == f\"Capstone {CAPSTONE}\":\n",
    "    BASE_DIR = CWD\n",
    "elif (CWD / 'Incremental_Capstone' / f\"Capstone {CAPSTONE}\").exists():\n",
    "    BASE_DIR = CWD / 'Incremental_Capstone' / f\"Capstone {CAPSTONE}\"\n",
    "else:\n",
    "    BASE_DIR = CWD\n",
    "\n",
    "INPUT_DIR = BASE_DIR / 'inputs'\n",
    "OUTPUT_DIR = BASE_DIR / 'outputs'\n",
    "PLOTS_DIR = OUTPUT_DIR / 'plots'\n",
    "INPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "PLOTS_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# --- Paths ---\n",
    "def first_existing_path(candidates):\n",
    "    for candidate in candidates:\n",
    "        candidate = Path(candidate).expanduser()\n",
    "        if candidate.exists():\n",
    "            return candidate\n",
    "    return None\n",
    "\n",
    "def github_raw_url(relative_path: str) -> str:\n",
    "    normalized = relative_path.replace('\\\\', '/')\n",
    "    return RAW_BASE.rstrip('/') + '/' + quote(normalized, safe='/')\n",
    "\n",
    "def stage_colab_asset(relative_path: str, destination_name=None) -> Path:\n",
    "    destination = INPUT_DIR / (destination_name or Path(relative_path).name)\n",
    "    if not destination.exists():\n",
    "        asset_url = github_raw_url(relative_path)\n",
    "        urlretrieve(asset_url, destination)\n",
    "        print('Downloaded:', destination.name, 'from', asset_url)\n",
    "    return destination\n",
    "\n",
    "def resolve_dataset_path(default_filename: str) -> Path:\n",
    "    \"\"\"Resolve dataset path from local folders or the Colab staging folder.\"\"\"\n",
    "    path = first_existing_path([\n",
    "        INPUT_DIR / default_filename,\n",
    "        BASE_DIR / default_filename,\n",
    "        CWD / default_filename,\n",
    "        CWD / 'Incremental_Capstone' / f\"Capstone {CAPSTONE}\" / default_filename,\n",
    "    ])\n",
    "    if path is None:\n",
    "        searched_paths = [\n",
    "            str(INPUT_DIR / default_filename),\n",
    "            str(BASE_DIR / default_filename),\n",
    "            str(CWD / default_filename),\n",
    "            str(CWD / 'Incremental_Capstone' / f\"Capstone {CAPSTONE}\" / default_filename),\n",
    "        ]\n",
    "        raise FileNotFoundError(\n",
    "            f\"Dataset not found: {default_filename}. Searched: {searched_paths}\"\n",
    "        )\n",
    "    return path\n",
    "\n",
    "if IS_COLAB:\n",
    "    staged_assets = [\n",
    "        'Incremental Capstones/Applied Data Science with Python/Capstone 2/Capstone_Session_2.pdf',\n",
    "        'Incremental Capstones/Applied Data Science with Python/Capstone 2/NSMES1988.csv',\n",
    "        'Incremental Capstones/Applied Data Science with Python/Capstone 2/NSMES1988new.csv',\n",
    "        'Incremental Capstones/Applied Data Science with Python/Capstone 2/capstone_2.ipynb',\n",
    "    ]\n",
    "    for relative_path in staged_assets:\n",
    "        stage_colab_asset(relative_path)\n",
    "\n",
    "print('Base directory:', BASE_DIR)\n",
    "print('Input directory:', INPUT_DIR)\n",
    "print('Output directory:', OUTPUT_DIR)\n",
    "print('Plots directory:', PLOTS_DIR)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "435ed670",
   "metadata": {},
   "source": [
    "## C?-T1 — Imports (ONLY what you use)\n",
    "I documented each import with why I used it and what it enabled in this capstone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "485587d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd  # DataFrames + CSV/JSON IO + analysis tables"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3790eaa3",
   "metadata": {},
   "source": [
    "## C?-T2 - Load dataset\n",
    "This step loads the required Capstone 2 input file from the local project folder or the Colab staging folder created in setup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7349219d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded: c:\\DEV_Projects\\SIMPLILEARN\\CAPSTONE_Applied_Data_Science_with _Python\\Incremental_Capstone\\Capstone 2\\NSMES1988new.csv\n",
      "Shape: (4406, 18)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>visits</th>\n",
       "      <th>nvisits</th>\n",
       "      <th>ovisits</th>\n",
       "      <th>novisits</th>\n",
       "      <th>emergency</th>\n",
       "      <th>hospital</th>\n",
       "      <th>health</th>\n",
       "      <th>chronic</th>\n",
       "      <th>adl</th>\n",
       "      <th>region</th>\n",
       "      <th>age</th>\n",
       "      <th>gender</th>\n",
       "      <th>married</th>\n",
       "      <th>school</th>\n",
       "      <th>income</th>\n",
       "      <th>employed</th>\n",
       "      <th>insurance</th>\n",
       "      <th>medicaid</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>average</td>\n",
       "      <td>2</td>\n",
       "      <td>normal</td>\n",
       "      <td>other</td>\n",
       "      <td>6.9</td>\n",
       "      <td>male</td>\n",
       "      <td>yes</td>\n",
       "      <td>6</td>\n",
       "      <td>2.8810</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>average</td>\n",
       "      <td>2</td>\n",
       "      <td>normal</td>\n",
       "      <td>other</td>\n",
       "      <td>7.4</td>\n",
       "      <td>female</td>\n",
       "      <td>yes</td>\n",
       "      <td>10</td>\n",
       "      <td>2.7478</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>poor</td>\n",
       "      <td>4</td>\n",
       "      <td>limited</td>\n",
       "      <td>other</td>\n",
       "      <td>6.6</td>\n",
       "      <td>female</td>\n",
       "      <td>no</td>\n",
       "      <td>10</td>\n",
       "      <td>0.6532</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>poor</td>\n",
       "      <td>2</td>\n",
       "      <td>limited</td>\n",
       "      <td>other</td>\n",
       "      <td>7.6</td>\n",
       "      <td>male</td>\n",
       "      <td>yes</td>\n",
       "      <td>3</td>\n",
       "      <td>0.6588</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>average</td>\n",
       "      <td>2</td>\n",
       "      <td>limited</td>\n",
       "      <td>other</td>\n",
       "      <td>7.9</td>\n",
       "      <td>female</td>\n",
       "      <td>yes</td>\n",
       "      <td>6</td>\n",
       "      <td>0.6588</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   visits  nvisits  ovisits  novisits  emergency  hospital   health  chronic  \\\n",
       "0       5        0        0         0          0         1  average        2   \n",
       "1       1        0        2         0          2         0  average        2   \n",
       "2      13        0        0         0          3         3     poor        4   \n",
       "3      16        0        5         0          1         1     poor        2   \n",
       "4       3        0        0         0          0         0  average        2   \n",
       "\n",
       "       adl region  age  gender married  school  income employed insurance  \\\n",
       "0   normal  other  6.9    male     yes       6  2.8810      yes       yes   \n",
       "1   normal  other  7.4  female     yes      10  2.7478       no       yes   \n",
       "2  limited  other  6.6  female      no      10  0.6532       no        no   \n",
       "3  limited  other  7.6    male     yes       3  0.6588       no       yes   \n",
       "4  limited  other  7.9  female     yes       6  0.6588       no       yes   \n",
       "\n",
       "  medicaid  \n",
       "0       no  \n",
       "1       no  \n",
       "2      yes  \n",
       "3       no  \n",
       "4       no  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "DEFAULT_DATASET = 'NSMES1988new.csv'\n",
    "\n",
    "try:\n",
    "    dataset_path = resolve_dataset_path(DEFAULT_DATASET)\n",
    "except FileNotFoundError:\n",
    "    fallback = first_existing_path([\n",
    "        BASE_DIR.parent / 'Capstone 1' / 'outputs' / 'NSMES1988new.csv',\n",
    "        CWD / 'Capstone 1' / 'outputs' / 'NSMES1988new.csv',\n",
    "        CWD / 'Incremental_Capstone' / 'Capstone 1' / 'outputs' / 'NSMES1988new.csv',\n",
    "    ])\n",
    "    if fallback is None and IS_COLAB:\n",
    "        fallback = stage_colab_asset(\n",
    "            'Incremental Capstones/Applied Data Science with Python/Capstone 2/NSMES1988new.csv'\n",
    "        )\n",
    "    if fallback is None:\n",
    "        raise\n",
    "    dataset_path = fallback\n",
    "    print('Default dataset not found; using fallback:', dataset_path)\n",
    "\n",
    "df = pd.read_csv(dataset_path)\n",
    "\n",
    "print('Loaded:', dataset_path)\n",
    "print('Shape:', df.shape)\n",
    "display(df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "634c0dea",
   "metadata": {},
   "source": [
    "## C?-T3 — Validation checks\n",
    "- Confirm expected columns exist\n",
    "- Confirm key dtypes\n",
    "- Check missing values\n",
    "\n",
    "**Results Capture:**\n",
    "- What I did: I validated expected schema, reviewed dtypes, and computed missing-value counts across all columns.\n",
    "- What I found: all expected 18 columns are present, no missing values were detected, and `age`/`income` are numeric (`float64`) as required for scaling.\n",
    "- Caveats: categorical concepts (e.g., `region`, `gender`) are integer-encoded and should be interpreted as labels, not continuous measures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "35ccde58",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Missing expected columns: []\n",
      "\n",
      "Dtypes:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "visits         int64\n",
       "nvisits        int64\n",
       "ovisits        int64\n",
       "novisits       int64\n",
       "emergency      int64\n",
       "hospital       int64\n",
       "health           str\n",
       "chronic        int64\n",
       "adl              str\n",
       "region           str\n",
       "age          float64\n",
       "gender           str\n",
       "married          str\n",
       "school         int64\n",
       "income       float64\n",
       "employed         str\n",
       "insurance        str\n",
       "medicaid         str\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Missing values (count):\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "visits       0\n",
       "nvisits      0\n",
       "ovisits      0\n",
       "novisits     0\n",
       "emergency    0\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "expected_cols = [\n",
    "    \"visits\", \"nvisits\", \"ovisits\", \"novisits\", \"emergency\", \"hospital\",\n",
    "    \"health\", \"chronic\", \"adl\", \"region\", \"age\", \"gender\",\n",
    "    \"married\", \"school\", \"income\", \"employed\", \"insurance\", \"medicaid\"\n",
    "]\n",
    "\n",
    "missing_cols = [c for c in expected_cols if c not in df.columns]\n",
    "print(\"Missing expected columns:\", missing_cols)\n",
    "\n",
    "print(\"\\nDtypes:\")\n",
    "display(df.dtypes)\n",
    "\n",
    "print(\"\\nMissing values (count):\")\n",
    "na_counts = df.isna().sum().sort_values(ascending=False)\n",
    "display(na_counts[na_counts > 0] if (na_counts > 0).any() else na_counts.head())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dda7d9f0",
   "metadata": {},
   "source": [
    "## C2-T4 — Load NSMES1988new.csv and compare memory with Week 1\n",
    "**PDF requirement:** Import NSMES1988new.csv and provide memory analysis compared to Week 1.\n",
    "\n",
    "### What I completed\n",
    "- I loaded `NSMES1988new.csv`, measured memory usage, and compared it against Capstone 1 memory evidence.\n",
    "\n",
    "### Results Capture\n",
    "- Current dataframe memory: **2,228,671 bytes (2.125 MB)**.\n",
    "- Capstone 1 memory reference: **2,263,919 bytes (2.159 MB)**.\n",
    "- Difference: **-35,248 bytes (-0.034 MB)**, indicating a modest reduction from Week 1 after the cleaned-schema handoff.\n",
    "\n",
    "### Code evidence\n",
    "- The next cell shows the exact memory comparison code I executed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c9cd2e92",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total memory (bytes): 2228671\n",
      "Total memory (MB): 2.125\n",
      "Capstone 1 memory (bytes): 2263919\n",
      "Difference vs Capstone 1 (bytes): -35248\n",
      "Difference vs Capstone 1 (MB): -0.034\n"
     ]
    }
   ],
   "source": [
    "# Memory comparison against Capstone 1 reference\n",
    "\n",
    "mem2 = df.memory_usage(deep=True).sum()\n",
    "mem1 = 2263919  # from Capstone 1 WORK_SUMMARY\n",
    "diff = mem2 - mem1\n",
    "print(\"Total memory (bytes):\", mem2)\n",
    "print(\"Total memory (MB):\", round(mem2 / (1024**2), 3))\n",
    "print(\"Capstone 1 memory (bytes):\", mem1)\n",
    "print(\"Difference vs Capstone 1 (bytes):\", diff)\n",
    "print(\"Difference vs Capstone 1 (MB):\", round(diff / (1024**2), 3))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc4ea581",
   "metadata": {},
   "source": [
    "## C2-T5 — Transform age and income (scale to real units)\n",
    "\n",
    "**PDF requirement:** Multiply age by 10 and income by 10000.\n",
    "\n",
    "\n",
    "\n",
    "### What I completed\n",
    "\n",
    "- I created `age_years` and `income_dollars` to preserve raw values while adding real-unit scaled fields.\n",
    "\n",
    "\n",
    "\n",
    "### Results Capture\n",
    "\n",
    "- `age` before scaling: min=6.6, max=10.9 → `age_years` after scaling: min=66, max=109.\n",
    "\n",
    "- `income` before scaling: min=-1.0125, max=54.8351 → `income_dollars` after scaling: min=-10,125, max=548,351.\n",
    "\n",
    "- New columns preserve raw source fields while exposing interpretable units for analysis.\n",
    "\n",
    "\n",
    "\n",
    "### Code evidence\n",
    "\n",
    "- The next cell contains the exact transformation logic and preview output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1bf97545",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>age_years</th>\n",
       "      <th>income</th>\n",
       "      <th>income_dollars</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6.9</td>\n",
       "      <td>69</td>\n",
       "      <td>2.8810</td>\n",
       "      <td>28810</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>7.4</td>\n",
       "      <td>74</td>\n",
       "      <td>2.7478</td>\n",
       "      <td>27478</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.6</td>\n",
       "      <td>66</td>\n",
       "      <td>0.6532</td>\n",
       "      <td>6532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7.6</td>\n",
       "      <td>76</td>\n",
       "      <td>0.6588</td>\n",
       "      <td>6588</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>7.9</td>\n",
       "      <td>79</td>\n",
       "      <td>0.6588</td>\n",
       "      <td>6588</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age  age_years  income  income_dollars\n",
       "0  6.9         69  2.8810           28810\n",
       "1  7.4         74  2.7478           27478\n",
       "2  6.6         66  0.6532            6532\n",
       "3  7.6         76  0.6588            6588\n",
       "4  7.9         79  0.6588            6588"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Transformations (recommended: keep raw + create scaled)\n",
    "df2 = df.copy()\n",
    "\n",
    "if \"age\" in df2.columns:\n",
    "    df2[\"age_years\"] = (df2[\"age\"] * 10).round(0).astype(\"Int64\")\n",
    "\n",
    "if \"income\" in df2.columns:\n",
    "    df2[\"income_dollars\"] = (df2[\"income\"] * 10000).round(0).astype(\"Int64\")\n",
    "\n",
    "cols = [c for c in [\"age\",\"age_years\",\"income\",\"income_dollars\"] if c in df2.columns]\n",
    "display(df2[cols].head())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cb51a61",
   "metadata": {},
   "source": [
    "## C2-T6 — Basic statistical analysis + brief report\n",
    "\n",
    "**PDF requirement:** Provide basic statistical analysis and a brief report on the dataset.\n",
    "\n",
    "\n",
    "\n",
    "### What I completed\n",
    "\n",
    "- I computed descriptive statistics and interpreted the key metrics for visits, age, and income.\n",
    "\n",
    "\n",
    "\n",
    "### Results Capture\n",
    "\n",
    "- Key stats (`mean | median | min | max`):\n",
    "\n",
    "  - `visits`: 5.774 | 4.0 | 0 | 89\n",
    "\n",
    "  - `age_years`: 74.024 | 73.0 | 66 | 109\n",
    "\n",
    "  - `income_dollars`: 25,271.321 | 16,981.5 | -10,125 | 548,351\n",
    "\n",
    "- Brief report:\n",
    "\n",
    "  - Visit counts are right-skewed (mean > median), with a small high-utilization tail.\n",
    "\n",
    "  - The sample is concentrated in older age bands (median 73 years).\n",
    "\n",
    "  - Income is highly right-skewed with a large upper tail, so median is more robust than mean.\n",
    "\n",
    "  - Negative income values appear and should be preserved/documented rather than dropped blindly.\n",
    "\n",
    "  - Scaled features (`age_years`, `income_dollars`) are now directly interpretable in business terms.\n",
    "\n",
    "\n",
    "\n",
    "### Code evidence\n",
    "\n",
    "- The next cell contains the descriptive summary tables used in this report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "94b0cc77",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>visits</th>\n",
       "      <th>nvisits</th>\n",
       "      <th>ovisits</th>\n",
       "      <th>novisits</th>\n",
       "      <th>emergency</th>\n",
       "      <th>hospital</th>\n",
       "      <th>chronic</th>\n",
       "      <th>age</th>\n",
       "      <th>school</th>\n",
       "      <th>income</th>\n",
       "      <th>age_years</th>\n",
       "      <th>income_dollars</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.0</td>\n",
       "      <td>4406.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>5.774399</td>\n",
       "      <td>1.618021</td>\n",
       "      <td>0.750794</td>\n",
       "      <td>0.536087</td>\n",
       "      <td>0.263504</td>\n",
       "      <td>0.295960</td>\n",
       "      <td>1.541988</td>\n",
       "      <td>7.402406</td>\n",
       "      <td>10.290286</td>\n",
       "      <td>2.527132</td>\n",
       "      <td>74.024058</td>\n",
       "      <td>25271.320699</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>6.759225</td>\n",
       "      <td>5.317056</td>\n",
       "      <td>3.652759</td>\n",
       "      <td>3.879506</td>\n",
       "      <td>0.703659</td>\n",
       "      <td>0.746398</td>\n",
       "      <td>1.349632</td>\n",
       "      <td>0.633405</td>\n",
       "      <td>3.738736</td>\n",
       "      <td>2.924648</td>\n",
       "      <td>6.33405</td>\n",
       "      <td>29246.475762</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>6.600000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-1.012500</td>\n",
       "      <td>66.0</td>\n",
       "      <td>-10125.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>6.900000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>0.912150</td>\n",
       "      <td>69.0</td>\n",
       "      <td>9121.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>7.300000</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>1.698150</td>\n",
       "      <td>73.0</td>\n",
       "      <td>16981.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>8.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>7.800000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>3.172850</td>\n",
       "      <td>78.0</td>\n",
       "      <td>31728.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>89.000000</td>\n",
       "      <td>104.000000</td>\n",
       "      <td>141.000000</td>\n",
       "      <td>155.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>10.900000</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>54.835100</td>\n",
       "      <td>109.0</td>\n",
       "      <td>548351.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            visits      nvisits      ovisits     novisits    emergency  \\\n",
       "count  4406.000000  4406.000000  4406.000000  4406.000000  4406.000000   \n",
       "mean      5.774399     1.618021     0.750794     0.536087     0.263504   \n",
       "std       6.759225     5.317056     3.652759     3.879506     0.703659   \n",
       "min       0.000000     0.000000     0.000000     0.000000     0.000000   \n",
       "25%       1.000000     0.000000     0.000000     0.000000     0.000000   \n",
       "50%       4.000000     0.000000     0.000000     0.000000     0.000000   \n",
       "75%       8.000000     1.000000     0.000000     0.000000     0.000000   \n",
       "max      89.000000   104.000000   141.000000   155.000000    12.000000   \n",
       "\n",
       "          hospital      chronic          age       school       income  \\\n",
       "count  4406.000000  4406.000000  4406.000000  4406.000000  4406.000000   \n",
       "mean      0.295960     1.541988     7.402406    10.290286     2.527132   \n",
       "std       0.746398     1.349632     0.633405     3.738736     2.924648   \n",
       "min       0.000000     0.000000     6.600000     0.000000    -1.012500   \n",
       "25%       0.000000     1.000000     6.900000     8.000000     0.912150   \n",
       "50%       0.000000     1.000000     7.300000    11.000000     1.698150   \n",
       "75%       0.000000     2.000000     7.800000    12.000000     3.172850   \n",
       "max       8.000000     8.000000    10.900000    18.000000    54.835100   \n",
       "\n",
       "       age_years  income_dollars  \n",
       "count     4406.0          4406.0  \n",
       "mean   74.024058    25271.320699  \n",
       "std      6.33405    29246.475762  \n",
       "min         66.0        -10125.0  \n",
       "25%         69.0          9121.5  \n",
       "50%         73.0         16981.5  \n",
       "75%         78.0         31728.5  \n",
       "max        109.0        548351.0  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean</th>\n",
       "      <th>median</th>\n",
       "      <th>min</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>visits</th>\n",
       "      <td>5.774399</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>89.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>age_years</th>\n",
       "      <td>74.024058</td>\n",
       "      <td>73.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>109.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>income_dollars</th>\n",
       "      <td>25271.320699</td>\n",
       "      <td>16981.5</td>\n",
       "      <td>-10125.0</td>\n",
       "      <td>548351.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        mean   median      min       max\n",
       "visits              5.774399      4.0      0.0      89.0\n",
       "age_years          74.024058     73.0     66.0     109.0\n",
       "income_dollars  25271.320699  16981.5 -10125.0  548351.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Basic stats\n",
    "numeric_cols = df2.select_dtypes(include=[\"number\"]).columns\n",
    "display(df2[numeric_cols].describe())\n",
    "\n",
    "summary = df2[[\"visits\", \"age_years\", \"income_dollars\"]].agg([\"mean\", \"median\", \"min\", \"max\"]).T\n",
    "display(summary)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc8a8098",
   "metadata": {},
   "source": [
    "## C2-T7 — Export updated dataset for next capstone\n",
    "**PDF requirement:** Export as NSMES1988updated.csv.\n",
    "\n",
    "### What I completed\n",
    "- I exported the transformed dataframe to the required handoff file for Capstone 3.\n",
    "\n",
    "### Results Capture\n",
    "- Saved file: `outputs/NSMES1988updated.csv`.\n",
    "- Exported shape: `(4406, 20)` with two added columns (`age_years`, `income_dollars`).\n",
    "\n",
    "### Artifacts\n",
    "- `outputs/NSMES1988updated.csv`\n",
    "\n",
    "### Code evidence\n",
    "- The next cell contains the export command and shape confirmation output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "bf07994c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved: c:\\DEV_Projects\\SIMPLILEARN\\CAPSTONE_Applied_Data_Science_with _Python\\Incremental_Capstone\\Capstone 2\\outputs\\NSMES1988updated.csv\n",
      "Shape: (4406, 20)\n"
     ]
    }
   ],
   "source": [
    "out_csv = OUTPUT_DIR / \"NSMES1988updated.csv\"\n",
    "df2.to_csv(out_csv, index=False)\n",
    "print(\"Saved:\", out_csv)\n",
    "print(\"Shape:\", df2.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89cd2f80",
   "metadata": {},
   "source": [
    "## C2-T8 — Describe() comparison + identify non-eligible columns\n",
    "**PDF requirement:** Use describe() and compare; identify columns not eligible for statistical analysis and recommend dtype changes.\n",
    "\n",
    "### What I completed\n",
    "- I ran numeric/all-column describe comparisons and documented non-eligible columns plus dtype recommendations.\n",
    "\n",
    "### Results Capture\n",
    "- Non-eligible for continuous numeric interpretation: `health`, `adl`, `region`, `gender`, `married`, `employed`, `insurance`, `medicaid` (encoded labels/flags where mean/std are not substantively meaningful).\n",
    "- Dtype recommendations:\n",
    "  - `category`: `health`, `adl`, `region`, `gender`, `married`, `employed`, `insurance`, `medicaid`\n",
    "  - Small integer optimization candidates: `visits`, `nvisits`, `emergency`, `hospital`, `chronic`, `school`, `age_years` → `int8`; `ovisits`, `novisits` → `int16` (after validation in production pipeline).\n",
    "\n",
    "### Code evidence\n",
    "- The next cell shows the full comparison output and recommendation table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "621e646d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>visits</th>\n",
       "      <th>nvisits</th>\n",
       "      <th>ovisits</th>\n",
       "      <th>novisits</th>\n",
       "      <th>emergency</th>\n",
       "      <th>hospital</th>\n",
       "      <th>health</th>\n",
       "      <th>chronic</th>\n",
       "      <th>adl</th>\n",
       "      <th>region</th>\n",
       "      <th>age</th>\n",
       "      <th>gender</th>\n",
       "      <th>married</th>\n",
       "      <th>school</th>\n",
       "      <th>income</th>\n",
       "      <th>employed</th>\n",
       "      <th>insurance</th>\n",
       "      <th>medicaid</th>\n",
       "      <th>age_years</th>\n",
       "      <th>income_dollars</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406</td>\n",
       "      <td>4406</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406</td>\n",
       "      <td>4406</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406.000000</td>\n",
       "      <td>4406</td>\n",
       "      <td>4406</td>\n",
       "      <td>4406</td>\n",
       "      <td>4406.0</td>\n",
       "      <td>4406.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>average</td>\n",
       "      <td>NaN</td>\n",
       "      <td>normal</td>\n",
       "      <td>other</td>\n",
       "      <td>NaN</td>\n",
       "      <td>female</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3509</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3507</td>\n",
       "      <td>1614</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2628</td>\n",
       "      <td>2406</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3951</td>\n",
       "      <td>3421</td>\n",
       "      <td>4004</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>5.774399</td>\n",
       "      <td>1.618021</td>\n",
       "      <td>0.750794</td>\n",
       "      <td>0.536087</td>\n",
       "      <td>0.263504</td>\n",
       "      <td>0.295960</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.541988</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.402406</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.290286</td>\n",
       "      <td>2.527132</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74.024058</td>\n",
       "      <td>25271.320699</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>6.759225</td>\n",
       "      <td>5.317056</td>\n",
       "      <td>3.652759</td>\n",
       "      <td>3.879506</td>\n",
       "      <td>0.703659</td>\n",
       "      <td>0.746398</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.349632</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.633405</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.738736</td>\n",
       "      <td>2.924648</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.33405</td>\n",
       "      <td>29246.475762</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.600000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-1.012500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>66.0</td>\n",
       "      <td>-10125.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.900000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>0.912150</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>69.0</td>\n",
       "      <td>9121.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.300000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>1.698150</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>73.0</td>\n",
       "      <td>16981.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>8.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.800000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>3.172850</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>78.0</td>\n",
       "      <td>31728.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>89.000000</td>\n",
       "      <td>104.000000</td>\n",
       "      <td>141.000000</td>\n",
       "      <td>155.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.900000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>54.835100</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109.0</td>\n",
       "      <td>548351.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             visits      nvisits      ovisits     novisits    emergency  \\\n",
       "count   4406.000000  4406.000000  4406.000000  4406.000000  4406.000000   \n",
       "unique          NaN          NaN          NaN          NaN          NaN   \n",
       "top             NaN          NaN          NaN          NaN          NaN   \n",
       "freq            NaN          NaN          NaN          NaN          NaN   \n",
       "mean       5.774399     1.618021     0.750794     0.536087     0.263504   \n",
       "std        6.759225     5.317056     3.652759     3.879506     0.703659   \n",
       "min        0.000000     0.000000     0.000000     0.000000     0.000000   \n",
       "25%        1.000000     0.000000     0.000000     0.000000     0.000000   \n",
       "50%        4.000000     0.000000     0.000000     0.000000     0.000000   \n",
       "75%        8.000000     1.000000     0.000000     0.000000     0.000000   \n",
       "max       89.000000   104.000000   141.000000   155.000000    12.000000   \n",
       "\n",
       "           hospital   health      chronic     adl region          age  gender  \\\n",
       "count   4406.000000     4406  4406.000000    4406   4406  4406.000000    4406   \n",
       "unique          NaN        3          NaN       2      4          NaN       2   \n",
       "top             NaN  average          NaN  normal  other          NaN  female   \n",
       "freq            NaN     3509          NaN    3507   1614          NaN    2628   \n",
       "mean       0.295960      NaN     1.541988     NaN    NaN     7.402406     NaN   \n",
       "std        0.746398      NaN     1.349632     NaN    NaN     0.633405     NaN   \n",
       "min        0.000000      NaN     0.000000     NaN    NaN     6.600000     NaN   \n",
       "25%        0.000000      NaN     1.000000     NaN    NaN     6.900000     NaN   \n",
       "50%        0.000000      NaN     1.000000     NaN    NaN     7.300000     NaN   \n",
       "75%        0.000000      NaN     2.000000     NaN    NaN     7.800000     NaN   \n",
       "max        8.000000      NaN     8.000000     NaN    NaN    10.900000     NaN   \n",
       "\n",
       "       married       school       income employed insurance medicaid  \\\n",
       "count     4406  4406.000000  4406.000000     4406      4406     4406   \n",
       "unique       2          NaN          NaN        2         2        2   \n",
       "top        yes          NaN          NaN       no       yes       no   \n",
       "freq      2406          NaN          NaN     3951      3421     4004   \n",
       "mean       NaN    10.290286     2.527132      NaN       NaN      NaN   \n",
       "std        NaN     3.738736     2.924648      NaN       NaN      NaN   \n",
       "min        NaN     0.000000    -1.012500      NaN       NaN      NaN   \n",
       "25%        NaN     8.000000     0.912150      NaN       NaN      NaN   \n",
       "50%        NaN    11.000000     1.698150      NaN       NaN      NaN   \n",
       "75%        NaN    12.000000     3.172850      NaN       NaN      NaN   \n",
       "max        NaN    18.000000    54.835100      NaN       NaN      NaN   \n",
       "\n",
       "        age_years  income_dollars  \n",
       "count      4406.0          4406.0  \n",
       "unique       <NA>            <NA>  \n",
       "top          <NA>            <NA>  \n",
       "freq         <NA>            <NA>  \n",
       "mean    74.024058    25271.320699  \n",
       "std       6.33405    29246.475762  \n",
       "min          66.0        -10125.0  \n",
       "25%          69.0          9121.5  \n",
       "50%          73.0         16981.5  \n",
       "75%          78.0         31728.5  \n",
       "max         109.0        548351.0  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Categorical/label-like columns: ['health', 'adl', 'region', 'gender', 'married', 'employed', 'insurance', 'medicaid']\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>column</th>\n",
       "      <th>eligible_for_continuous_stats</th>\n",
       "      <th>suggested_dtype</th>\n",
       "      <th>reason</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>health</td>\n",
       "      <td>No</td>\n",
       "      <td>category</td>\n",
       "      <td>Encoded category/flag; arithmetic moments are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>adl</td>\n",
       "      <td>No</td>\n",
       "      <td>category</td>\n",
       "      <td>Encoded category/flag; arithmetic moments are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>region</td>\n",
       "      <td>No</td>\n",
       "      <td>category</td>\n",
       "      <td>Encoded category/flag; arithmetic moments are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>gender</td>\n",
       "      <td>No</td>\n",
       "      <td>category</td>\n",
       "      <td>Encoded category/flag; arithmetic moments are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>married</td>\n",
       "      <td>No</td>\n",
       "      <td>category</td>\n",
       "      <td>Encoded category/flag; arithmetic moments are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>employed</td>\n",
       "      <td>No</td>\n",
       "      <td>category</td>\n",
       "      <td>Encoded category/flag; arithmetic moments are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>insurance</td>\n",
       "      <td>No</td>\n",
       "      <td>category</td>\n",
       "      <td>Encoded category/flag; arithmetic moments are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>medicaid</td>\n",
       "      <td>No</td>\n",
       "      <td>category</td>\n",
       "      <td>Encoded category/flag; arithmetic moments are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>visits</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int8</td>\n",
       "      <td>Observed range fits int8; optimize memory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>nvisits</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int8</td>\n",
       "      <td>Observed range fits int8; optimize memory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>emergency</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int8</td>\n",
       "      <td>Observed range fits int8; optimize memory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>hospital</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int8</td>\n",
       "      <td>Observed range fits int8; optimize memory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>chronic</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int8</td>\n",
       "      <td>Observed range fits int8; optimize memory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>school</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int8</td>\n",
       "      <td>Observed range fits int8; optimize memory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>age_years</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int8</td>\n",
       "      <td>Observed range fits int8; optimize memory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>ovisits</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int16</td>\n",
       "      <td>Observed range fits int16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>novisits</td>\n",
       "      <td>Yes</td>\n",
       "      <td>int16</td>\n",
       "      <td>Observed range fits int16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       column eligible_for_continuous_stats suggested_dtype  \\\n",
       "0      health                            No        category   \n",
       "1         adl                            No        category   \n",
       "2      region                            No        category   \n",
       "3      gender                            No        category   \n",
       "4     married                            No        category   \n",
       "5    employed                            No        category   \n",
       "6   insurance                            No        category   \n",
       "7    medicaid                            No        category   \n",
       "8      visits                           Yes            int8   \n",
       "9     nvisits                           Yes            int8   \n",
       "10  emergency                           Yes            int8   \n",
       "11   hospital                           Yes            int8   \n",
       "12    chronic                           Yes            int8   \n",
       "13     school                           Yes            int8   \n",
       "14  age_years                           Yes            int8   \n",
       "15    ovisits                           Yes           int16   \n",
       "16   novisits                           Yes           int16   \n",
       "\n",
       "                                               reason  \n",
       "0   Encoded category/flag; arithmetic moments are ...  \n",
       "1   Encoded category/flag; arithmetic moments are ...  \n",
       "2   Encoded category/flag; arithmetic moments are ...  \n",
       "3   Encoded category/flag; arithmetic moments are ...  \n",
       "4   Encoded category/flag; arithmetic moments are ...  \n",
       "5   Encoded category/flag; arithmetic moments are ...  \n",
       "6   Encoded category/flag; arithmetic moments are ...  \n",
       "7   Encoded category/flag; arithmetic moments are ...  \n",
       "8           Observed range fits int8; optimize memory  \n",
       "9           Observed range fits int8; optimize memory  \n",
       "10          Observed range fits int8; optimize memory  \n",
       "11          Observed range fits int8; optimize memory  \n",
       "12          Observed range fits int8; optimize memory  \n",
       "13          Observed range fits int8; optimize memory  \n",
       "14          Observed range fits int8; optimize memory  \n",
       "15                          Observed range fits int16  \n",
       "16                          Observed range fits int16  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Describe comparison\n",
    "display(df2.describe(include=\"all\"))\n",
    "\n",
    "cat_like = [\"health\", \"adl\", \"region\", \"gender\", \"married\", \"employed\", \"insurance\", \"medicaid\"]\n",
    "print(\"Categorical/label-like columns:\", cat_like)\n",
    "\n",
    "recommend_rows = []\n",
    "for c in cat_like:\n",
    "    recommend_rows.append({\n",
    "        \"column\": c,\n",
    "        \"eligible_for_continuous_stats\": \"No\",\n",
    "        \"suggested_dtype\": \"category\",\n",
    "        \"reason\": \"Encoded category/flag; arithmetic moments are weakly interpretable\"\n",
    "    })\n",
    "\n",
    "for c in [\"visits\", \"nvisits\", \"emergency\", \"hospital\", \"chronic\", \"school\", \"age_years\"]:\n",
    "    recommend_rows.append({\n",
    "        \"column\": c,\n",
    "        \"eligible_for_continuous_stats\": \"Yes\",\n",
    "        \"suggested_dtype\": \"int8\",\n",
    "        \"reason\": \"Observed range fits int8; optimize memory\"\n",
    "    })\n",
    "\n",
    "for c in [\"ovisits\", \"novisits\"]:\n",
    "    recommend_rows.append({\n",
    "        \"column\": c,\n",
    "        \"eligible_for_continuous_stats\": \"Yes\",\n",
    "        \"suggested_dtype\": \"int16\",\n",
    "        \"reason\": \"Observed range fits int16\"\n",
    "    })\n",
    "\n",
    "display(pd.DataFrame(recommend_rows))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f39ae31",
   "metadata": {},
   "source": [
    "## Final section — Conclusions (required)\n",
    "- I successfully processed the Capstone 1 cleaned dataset and prepared it for downstream use.\n",
    "- I observed a slightly lower memory footprint than Week 1 (2.125 MB vs 2.159 MB).\n",
    "- I made age and income directly interpretable by adding `age_years` and `income_dollars`.\n",
    "- I documented right-skew patterns in utilization and income distributions.\n",
    "- I identified categorical/flag columns suitable for `category` typing and non-continuous interpretation.\n",
    "- I produced the required artifact: `outputs/NSMES1988updated.csv`.\n",
    "- I updated `WORK_SUMMARY.md` with evidence and marked all Capstone 2 tasks complete.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "03b0d6a7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Capstone 2 completed: C2-T4 to C2-T8\n",
      "Primary artifact: outputs/NSMES1988updated.csv\n"
     ]
    }
   ],
   "source": [
    "print(\"Capstone 2 completed: C2-T4 to C2-T8\")\n",
    "print(\"Primary artifact: outputs/NSMES1988updated.csv\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}