Francis Burnet – AI Engineering Portfolio

Capstone portfolio spanning AI engineering, applied data science, machine learning, and deep learning.

Francis Burnet headshot

Capstone 3 Infographic

Capstone 3 infographic
Capstone Summary

This page documents Francis Burnet's Capstone 3, a data science project centered on exploratory analysis with the Pandas library in Python. The portfolio details a structured workflow that processes a healthcare dataset to examine relationships between demographics, medical utilization, and socio-economic factors. Key technical tasks include a dtype audit, complex pivot tables, and distribution summaries by health status and geography. The analysis highlights specific findings, such as how gender, insurance status, and regional location correlate with healthcare visits and income levels. Documentation is maintained through notebook evidence, code screenshots, and a requirement checklist to ensure all educational requirements are met. Ultimately, the project serves as a comprehensive evidence map for advanced data manipulation and reporting.

Capstone 3 Scope

Capstone 3 turns the Capstone 2 handoff dataset into a structured Pandas analysis package built around categorical profiling, pivot tables, grouped criteria analysis, and distribution tables.
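The four table families named above can be sketched on a toy frame. This is an illustrative sketch only: the column names match the real dataset, but the rows and values here are invented for demonstration, not drawn from NSMES1988.

```python
import pandas as pd

# Toy stand-in for the NSMES1988 handoff frame; rows and values are illustrative.
toy = pd.DataFrame({
    "region": ["west", "west", "midwest", "midwest"],
    "health": ["poor", "average", "average", "poor"],
    "gender": ["female", "male", "female", "male"],
    "visits": [8, 3, 5, 9],
})

# 1. Categorical profiling: the label set of a categorical column.
health_levels = sorted(toy["health"].unique())

# 2. Pivot table: mean visits by region x health.
pv = pd.pivot_table(toy, index="region", columns="health",
                    values="visits", aggfunc="mean")

# 3. Grouped criteria analysis: mean visits per gender.
by_gender = toy.groupby("gender")["visits"].mean()

# 4. Distribution table: health-by-gender counts via crosstab.
dist = pd.crosstab(toy["health"], toy["gender"])

print(health_levels)   # ['average', 'poor']
print(pv)
print(by_gender)
print(dist)
```

The real notebook applies exactly these four operations, at scale, to the 4,406-row handoff dataset.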

Primary notebook input: Capstone 2/outputs/NSMES1988updated.csv.

Notebook evidence, code screenshots, and downloadable artifacts.

Original Project PDF

The original Capstone 3 directions are embedded here.

Requirement Checklist

3a

Import the Python libraries needed for the Pandas analysis.

Source mapping: Notebook setup and imports

Evidence note: The notebook imports `pandas` after the runtime path setup and uses it across pivot, crosstab, and grouped summary sections.

3b

Import the CSV file `NSMES1988updated.csv` and create a new dataframe for working with Pandas.

Source mapping: Notebook load step

Evidence note: The notebook loads the Capstone 2 updated handoff dataset and confirms the working dataframe shape as `(4406, 20)`.

3c

Identify the different types of data and report them.

Source mapping: Notebook C3-T4

Evidence note: The dtype audit table classifies all 20 fields into numeric versus categorical/flag groups.

3d

Identify categorical types in the data.

Source mapping: Notebook C3-T4

Evidence note: The notebook explicitly lists label-like fields and confirms category values for `health` and `region`.

3e

Perform detailed data pivoting on the dataframe and report it.

Source mapping: Notebook C3-T5

Evidence note: Capstone 3 builds detailed `region × health` pivots for count, mean visits, and mean income.

3f

Include Health and Region categorical data in your analysis.

Source mapping: Notebook C3-T5

Evidence note: The pivot section uses `health` and `region` as the categorical keys for the core segmented analysis.

3g

Prepare a detailed report on your analysis and observations.

Source mapping: Notebook markdown sections

Evidence note: The notebook markdown cells record results capture notes and conclusions after each analysis block.

3h

Perform analysis based on different types of visits, gender, marital status, school, income, employment status, insurance, and medical aid.

Source mapping: Notebook C3-T6

Evidence note: The notebook groups utilization and income metrics by each required demographic and coverage criterion.

3i

Generate age and gender distribution.

Source mapping: Notebook C3-T7

Evidence note: A cross-tab distribution table is created for age groups by gender.

3j

Create health status by gender distribution.

Source mapping: Notebook C3-T7

Evidence note: A health-by-gender crosstab is rendered as part of the required distribution set.

3k

Compile income distribution by gender, regional income distribution, and age-wise income analysis.

Source mapping: Notebook C3-T7

Evidence note: The notebook generates grouped income tables for gender, region, and age-group analysis.

3l

Report your findings.

Source mapping: Final conclusions

Evidence note: The final conclusion cell summarizes the segmented insights and confirms completion of the distribution-table work.

Requirement Walkthrough

Each walkthrough block stays tied to one requirement family at a time, with the code sample and the notebook observations kept together for evidence review.

3a–3b

Runtime Setup And Load The Updated Handoff Dataset

Notebook section: Setup and load cells

Requirement: Import Pandas, load `NSMES1988updated.csv`, and create the working dataframe.

Capstone 3 begins from the Capstone 2 handoff artifact and uses a path fallback so the notebook can resolve the updated dataset from the prior capstone output.

Results Capture
  • Main input: `Capstone 2/outputs/NSMES1988updated.csv`.
  • Working dataframe shape: `(4406, 20)` with the `age_years` and `income_dollars` columns already present from Capstone 2.
  • Analysis starts from the updated handoff file.
import pandas as pd

# Helper names below (resolve_dataset_path, first_existing_path, BASE_DIR, CWD)
# are defined in the notebook's runtime-setup cell, shown later on this page.

DEFAULT_DATASET = "NSMES1988updated.csv"

try:
    dataset_path = resolve_dataset_path(DEFAULT_DATASET)
except FileNotFoundError:
    fallback = first_existing_path([
        BASE_DIR.parent / "Capstone 2" / "outputs" / "NSMES1988updated.csv",
        CWD / "Capstone 2" / "outputs" / "NSMES1988updated.csv",
        CWD / "Incremental_Capstone" / "Capstone 2" / "outputs" / "NSMES1988updated.csv",
    ])
    if fallback is None:
        raise
    dataset_path = fallback

df = pd.read_csv(dataset_path)
print("Loaded:", dataset_path)
print("Shape:", df.shape)
display(df.head())
3c–3d

Audit Data Types And Identify Categorical Fields

Notebook section: C3-T4

Requirement: Identify different data types and identify categorical types in the data.

The notebook produces a full dtype audit and then confirms which fields must be treated as categories rather than continuous numeric measures.

Results Capture
  • A 20-column dtype audit table is created with a `data_type_group` classification.
  • `health` categories: `average`, `excellent`, `poor`.
  • `region` categories: `midwest`, `northeast`, `other`, `west`.
categorical_cols = {"health", "adl", "region", "gender", "married", "employed", "insurance", "medicaid"}
dtype_audit = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(df[c].dtype) for c in df.columns],
    "data_type_group": ["categorical/flag" if c in categorical_cols else "numeric" for c in df.columns]
})
display(dtype_audit)

for col in ["health", "region"]:
    print(sorted(df[col].dropna().unique().tolist()))
3e–3f

Perform Detailed Pivoting For Health And Region

Notebook section: C3-T5

Requirement: Perform detailed pivoting and include Health and Region categorical analysis.

The core segmented analysis uses `region × health` pivot tables to measure record counts, average visits, and average income across the required categorical dimensions.

Results Capture
  • Pivot count size: `(4, 3)` covering all 4,406 records.
  • Three pivot outputs are generated: counts, mean visits, and mean `income_dollars`.
  • The pivots make the region-health utilization and income differences explicit in one place.
pivot_count = pd.pivot_table(df, index="region", columns="health", values="visits", aggfunc="count", fill_value=0)
pivot_mean_visits = pd.pivot_table(df, index="region", columns="health", values="visits", aggfunc="mean", fill_value=0)
pivot_mean_income = pd.pivot_table(df, index="region", columns="health", values="income_dollars", aggfunc="mean", fill_value=0)

display(pivot_count)
display(pivot_mean_visits.round(3))
display(pivot_mean_income.round(2))
3h

Analyze Required Criteria Tables

Notebook section: C3-T6

Requirement: Analyze visits, gender, marital status, school, income, employment status, insurance, and medical aid.

Capstone 3 groups utilization and income metrics by each requested demographic and coverage criterion so the requested comparison tables stay directly aligned to the PDF.

Results Capture
  • Grouped criteria tables are produced for `gender`, `married`, `school`, `employed`, `insurance`, and `medicaid`.
  • Female mean visits exceed male mean visits in the notebook summary, while male mean income is higher.
  • Medicaid enrollment aligns with higher mean visits and lower mean income in the staged notebook output.
criteria_cols = ["gender", "married", "school", "employed", "insurance", "medicaid"]
metric_cols = [c for c in ["visits", "nvisits", "ovisits", "novisits", "emergency", "hospital"] if c in df.columns]

for c in criteria_cols:
    table = df.groupby(c)[metric_cols + ["income_dollars"]].mean(numeric_only=True).round(3)
    display(table)
3i–3l

Build The Required Distribution Tables And Findings

Notebook section: C3-T7 and conclusions

Requirement: Create age and gender, health by gender, income by gender, regional income, and age-wise income distributions and report findings.

The notebook closes the capstone by building all required distribution tables and then documenting the cross-group patterns that matter for the report.

Results Capture
  • Required distribution table sizes reported in the notebook: Age+Gender `(5x2)`, Health by Gender `(3x2)`, Income by Gender `(2x3)`, Regional Income `(4x3)`, Age-wise Income `(5x3)`.
  • The notebook notes that `west` shows the highest regional mean income and `other` the lowest.
  • The conclusion section confirms the PDF-required segmented analysis was completed and written up in the notebook markdown.
df3 = df.copy()
df3["age_group"] = pd.cut(df3["age_years"].astype(float), bins=[65, 70, 75, 80, 85, 110], include_lowest=True)

display(pd.crosstab(df3["age_group"], df3["gender"]))
display(pd.crosstab(df3["health"], df3["gender"]))
display(pd.pivot_table(df3, index="gender", values="income_dollars", aggfunc=["count", "mean", "median"]).round(2))
display(pd.pivot_table(df3, index="region", values="income_dollars", aggfunc=["count", "mean", "median"]).round(2))
display(pd.pivot_table(df3, index="age_group", values="income_dollars", aggfunc=["count", "mean", "median"]).round(2))
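Passing `aggfunc` as a list, as in the three income pivots above, returns MultiIndex columns such as `('mean', 'income_dollars')`. A small sketch of flattening those headers for reporting (the frame here is illustrative, not the real dataset):

```python
import pandas as pd

# Small frame mirroring the income-by-gender pivot's shape; values are illustrative.
demo = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "income_dollars": [20000, 30000, 25000, 28000],
})

summary = pd.pivot_table(demo, index="gender", values="income_dollars",
                         aggfunc=["count", "mean", "median"])

# A list-valued aggfunc yields two-level columns like ('mean', 'income_dollars');
# joining the levels gives flat, report-friendly headers.
summary.columns = ["_".join(col) for col in summary.columns]
print(summary)
```

This is optional polish for export tables; the notebook displays the MultiIndex form directly.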

Colab Notebook

This section provides the notebook preview, launch link, and project file links.

The notebook preview is the primary execution record for this capstone.

Capstone 3 Notebook Workspace
Launch Colab
Embedded Notebook Preview
Cell 1 Markdown

Capstone 3 — Session 3: Data Analysis with Pandas

Run timestamp: 2026-02-19 01:49:22

Goal

  • Perform structured exploratory analysis on the Capstone 2 processed dataset using categorical profiling, pivot analysis, and distribution tables.
  • Produce interpretable tables for health, region, demographic, and income patterns to support downstream business insights.

Inputs

  • NSMES1988updated.csv (primary source from Capstone 2/outputs/)

Outputs

  • All exports go to ./outputs/ (and plots to ./outputs/plots/ when applicable)

Libraries (documented)

  • pandas: needed for pivots, crosstabs, grouping, and distribution tables; enabled all required session analyses.

Key dataset note

  • `age` is encoded as age in years divided by 10 (e.g., `6.9` = 69 years).
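A minimal sketch of that decoding. The exact Capstone 2 derivation is assumed; the multipliers match the sample rows in the load-step output, where `age` 6.9 maps to `age_years` 69 and `income` 2.8810 maps to `income_dollars` 28810.

```python
import pandas as pd

# Raw NSMES1988 encodings: age in decades (6.9 = 69 years),
# income in tens of thousands of dollars (2.8810 = $28,810).
raw = pd.DataFrame({"age": [6.9, 7.4], "income": [2.8810, 2.7478]})

# Assumed derivation of the Capstone 2 convenience columns.
raw["age_years"] = (raw["age"] * 10).round().astype(int)
raw["income_dollars"] = (raw["income"] * 10000).round().astype(int)

print(raw)
```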
Cell 2 Markdown

C3-T0 — Runtime setup (paths + output folders)

I used this setup block first to configure reproducible paths and output folders.

Cell 3 Code · python
from pathlib import Path
from datetime import datetime
from urllib.parse import quote
from urllib.request import urlretrieve
import os
import sys

try:
    from IPython.display import display
except Exception:
    def display(value):
        print(value)

# --- Project metadata ---
CAPSTONE = 3
SESSION_TITLE = 'Session 3: Data Analysis with Pandas'
IS_COLAB = 'google.colab' in sys.modules
RAW_BASE = os.environ.get('FRANCISBURNET_RAW_BASE', 'https://raw.githubusercontent.com/FrancisBurnet/francisburnet/main')

print(f"Capstone: {CAPSTONE} | Session: {SESSION_TITLE}")
print("Run timestamp:", datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print("Runtime:", "Google Colab" if IS_COLAB else "Local / notebook runtime")

CWD = Path.cwd()
if IS_COLAB:
    BASE_DIR = Path('/content/francisburnet_capstone_3')
elif (CWD / f"Capstone {CAPSTONE}").exists():
    BASE_DIR = CWD / f"Capstone {CAPSTONE}"
elif CWD.name == f"Capstone {CAPSTONE}":
    BASE_DIR = CWD
elif (CWD / 'Incremental_Capstone' / f"Capstone {CAPSTONE}").exists():
    BASE_DIR = CWD / 'Incremental_Capstone' / f"Capstone {CAPSTONE}"
elif (CWD / 'Incremental Capstones' / 'Applied Data Science with Python' / f"Capstone {CAPSTONE}").exists():
    BASE_DIR = CWD / 'Incremental Capstones' / 'Applied Data Science with Python' / f"Capstone {CAPSTONE}"
else:
    BASE_DIR = CWD

INPUT_DIR = BASE_DIR / 'inputs'
OUTPUT_DIR = BASE_DIR / 'outputs'
PLOTS_DIR = OUTPUT_DIR / 'plots'
INPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
PLOTS_DIR.mkdir(parents=True, exist_ok=True)

def first_existing_path(candidates):
    for candidate in candidates:
        candidate = Path(candidate).expanduser()
        if candidate.exists():
            return candidate
    return None

def github_raw_url(relative_path: str) -> str:
    normalized = relative_path.replace('\\', '/')
    return RAW_BASE.rstrip('/') + '/' + quote(normalized, safe='/')

def stage_colab_asset(relative_path: str, destination_name=None) -> Path:
    destination = INPUT_DIR / (destination_name or Path(relative_path).name)
    if not destination.exists():
        asset_url = github_raw_url(relative_path)
        urlretrieve(asset_url, destination)
        print('Downloaded:', destination.name, 'from', asset_url)
    return destination

LOCAL_SEARCH_ROOTS = [
    INPUT_DIR,
    BASE_DIR,
    CWD,
    CWD / f"Capstone {CAPSTONE}",
    CWD / 'Incremental_Capstone' / f"Capstone {CAPSTONE}",
    CWD / 'Incremental Capstones' / 'Applied Data Science with Python' / f"Capstone {CAPSTONE}",
    CWD.parent / 'Incremental_Capstone' / f"Capstone {CAPSTONE}",
    CWD.parent / 'Incremental Capstones' / 'Applied Data Science with Python' / f"Capstone {CAPSTONE}",
]

def resolve_dataset_path(default_filename: str) -> Path:
    candidates = [Path(root) / default_filename for root in LOCAL_SEARCH_ROOTS]
    path = first_existing_path(candidates)
    if path is None:
        searched_paths = [str(candidate) for candidate in candidates]
        raise FileNotFoundError(f"Dataset not found: {default_filename}. Searched: {searched_paths}")
    return path

if IS_COLAB:
    staged_assets = [
        'Incremental Capstones/Applied Data Science with Python/Capstone 3/Capstone_Session_3.pdf',
        'Incremental Capstones/Applied Data Science with Python/Capstone 3/NSMES1988.csv',
        'Incremental Capstones/Applied Data Science with Python/Capstone 3/capstone_3.ipynb',
    ]
    for relative_path in staged_assets:
        stage_colab_asset(relative_path)

print('Base directory:', BASE_DIR)
print('Input directory:', INPUT_DIR)
print('Output directory:', OUTPUT_DIR)
print('Plots directory:', PLOTS_DIR)
Output
Capstone: 3 | Session: Session 3: Data Analysis with Pandas
Run timestamp: 2026-05-12 03:57:44
Runtime: Local / notebook runtime
Base directory: x:\SIMPLILEARN\FrancisBurnetCom\Incremental Capstones\Applied Data Science with Python\Capstone 3
Input directory: x:\SIMPLILEARN\FrancisBurnetCom\Incremental Capstones\Applied Data Science with Python\Capstone 3\inputs
Output directory: x:\SIMPLILEARN\FrancisBurnetCom\Incremental Capstones\Applied Data Science with Python\Capstone 3\outputs
Plots directory: x:\SIMPLILEARN\FrancisBurnetCom\Incremental Capstones\Applied Data Science with Python\Capstone 3\outputs\plots
Cell 4 Markdown

C3-T1 — Imports I used

I documented each import with why I used it and what it enabled in this analysis.

Cell 5 Code · python
import pandas as pd  # DataFrames + pivot/crosstab/groupby analysis for categorical and distribution tables
Cell 6 Markdown

C3-T2 — Load dataset

I loaded the required Capstone 3 dataset with the configured default and fallback path.

Cell 7 Code · python
DEFAULT_DATASET = 'NSMES1988updated.csv'

fallback_candidates = [
    BASE_DIR.parent / 'Capstone 2' / 'outputs' / DEFAULT_DATASET,
    CWD / 'Capstone 2' / 'outputs' / DEFAULT_DATASET,
    CWD / 'Incremental_Capstone' / 'Capstone 2' / 'outputs' / DEFAULT_DATASET,
    CWD / 'Incremental Capstones' / 'Applied Data Science with Python' / 'Capstone 2' / 'outputs' / DEFAULT_DATASET,
    CWD.parent / 'Incremental_Capstone' / 'Capstone 2' / 'outputs' / DEFAULT_DATASET,
    CWD.parent / 'Incremental Capstones' / 'Applied Data Science with Python' / 'Capstone 2' / 'outputs' / DEFAULT_DATASET,
]

try:
    dataset_path = resolve_dataset_path(DEFAULT_DATASET)
except FileNotFoundError:
    fallback = first_existing_path(fallback_candidates)
    if fallback is None and IS_COLAB:
        fallback = stage_colab_asset(
            'Incremental Capstones/Applied Data Science with Python/Capstone 2/outputs/NSMES1988updated.csv'
        )
    if fallback is None:
        searched_paths = [str(candidate) for candidate in fallback_candidates]
        raise FileNotFoundError(f"Dataset not found: {DEFAULT_DATASET}. Searched: {searched_paths}")
    dataset_path = fallback
    print('Default dataset not found; using fallback:', dataset_path)

df = pd.read_csv(dataset_path)

print('Loaded:', dataset_path)
print('Shape:', df.shape)
display(df.head())
Output
Local default not found; using fallback: c:\DEV_Projects\SIMPLILEARN\CAPSTONE_Applied_Data_Science_with _Python\Incremental_Capstone\Capstone 2\outputs\NSMES1988updated.csv
Loaded: c:\DEV_Projects\SIMPLILEARN\CAPSTONE_Applied_Data_Science_with _Python\Incremental_Capstone\Capstone 2\outputs\NSMES1988updated.csv
Shape: (4406, 20)
   visits  nvisits  ovisits  novisits  emergency  hospital   health  chronic  \
0       5        0        0         0          0         1  average        2   
1       1        0        2         0          2         0  average        2   
2      13        0        0         0          3         3     poor        4   
3      16        0        5         0          1         1     poor        2   
4       3        0        0         0          0         0  average        2   

       adl region  age  gender married  school  income employed insurance  \
0   normal  other  6.9    male     yes       6  2.8810      yes       yes   
1   normal  other  7.4  female     yes      10  2.7478       no       yes   
2  limited  other  6.6  female      no      10  0.6532       no        no   
3  limited  other  7.6    male     yes       3  0.6588       no       yes   
4  limited  other  7.9  female     yes       6  0.6588       no       yes   

  medicaid  age_years  income_dollars  
0       no         69           28810  
1       no         74           27478  
2      yes         66            6532  
3       no         76            6588  
4       no         79            6588  
Cell 8 Markdown

C3-T3 — Validation checks

  • I confirmed expected columns exist
  • I confirmed key dtypes
  • I checked missing values

Results Capture:

  • What I did: I validated expected Capstone 3 schema from Capstone 2 output, checked dtypes, and computed missing values.
  • What I found: dataframe loaded as (4406, 20) with all expected columns including age_years and income_dollars; total missing values = 0.
  • Caveats: label columns (health, region, gender, married, employed, insurance, medicaid) are categorical and should not be interpreted as continuous numeric values.
Cell 9 Code · python
expected_cols = [
    "visits", "nvisits", "ovisits", "novisits", "emergency", "hospital",
    "health", "chronic", "adl", "region", "age", "gender",
    "married", "school", "income", "employed", "insurance", "medicaid",
    "age_years", "income_dollars"
]

missing_cols = [c for c in expected_cols if c not in df.columns]
print("Missing expected columns:", missing_cols)

print("\nDtypes:")
display(df.dtypes)

print("\nMissing values (count):")
na_counts = df.isna().sum().sort_values(ascending=False)
display(na_counts[na_counts > 0] if (na_counts > 0).any() else na_counts.head())
Output
Missing expected columns: []

Dtypes:
visits              int64
nvisits             int64
ovisits             int64
novisits            int64
emergency           int64
hospital            int64
health                str
chronic             int64
adl                   str
region                str
age               float64
gender                str
married               str
school              int64
income            float64
employed              str
insurance             str
medicaid              str
age_years           int64
income_dollars      int64
dtype: object
Missing values (count):
visits       0
nvisits      0
ovisits      0
novisits     0
emergency    0
dtype: int64
Cell 10 Markdown

C3-T4 — Identify categorical types + key categories

PDF requirement: Identify different types of data and identify categorical types.

What I completed

  • I audited dtypes, grouped variables by type, and listed unique category values for health and region.

Results Capture

  • Built a type-audit table (column, dtype, data_type_group) for all 20 columns.
  • health categories: ['average', 'excellent', 'poor'].
  • region categories: ['midwest', 'northeast', 'other', 'west'].

Code evidence

  • The next cell shows the dtype audit and category-value extraction I executed.
Cell 11 Code · python
categorical_cols = {"health", "adl", "region", "gender", "married", "employed", "insurance", "medicaid"}
dtype_audit = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(df[c].dtype) for c in df.columns],
    "data_type_group": ["categorical/flag" if c in categorical_cols else "numeric" for c in df.columns]
})
display(dtype_audit)

for col in ["health", "region"]:
    if col in df.columns:
        print(f"\nUnique values for {col}:")
        print(sorted(df[col].dropna().unique().tolist()))
Output
            column    dtype   data_type_group
0           visits    int64           numeric
1          nvisits    int64           numeric
2          ovisits    int64           numeric
3         novisits    int64           numeric
4        emergency    int64           numeric
5         hospital    int64           numeric
6           health      str  categorical/flag
7          chronic    int64           numeric
8              adl      str  categorical/flag
9           region      str  categorical/flag
10             age  float64           numeric
11          gender      str  categorical/flag
12         married      str  categorical/flag
13          school    int64           numeric
14          income  float64           numeric
15        employed      str  categorical/flag
16       insurance      str  categorical/flag
17        medicaid      str  categorical/flag
18       age_years    int64           numeric
19  income_dollars    int64           numeric
Unique values for health:
['average', 'excellent', 'poor']

Unique values for region:
['midwest', 'northeast', 'other', 'west']
Cell 12 Markdown

C3-T5 — Pivoting including Health and Region

PDF requirement: Perform detailed pivoting and include categorical data analysis for Health and Region.

What I completed

  • I generated pivot tables for counts, mean visits, and mean income by region and health.

Results Capture

  • Generated 3 detailed pivots by region × health: count matrix, mean visits, and mean income (dollars).
  • Pivot count size = (4, 3) and total records represented = 4406.
  • Interpretation highlights:
      • All four regions contain observations across all three health categories.
      • Mean utilization varies by region-health combination, supporting segmented analysis.
      • Mean income differences by region-health indicate socio-economic heterogeneity across segments.

Code evidence

  • The next cell contains the exact pivot computations and displayed outputs.
Cell 13 Code · python
# Region-health pivots (counts + average visits + average income)
if set(["region", "health"]).issubset(df.columns):
    pivot_count = pd.pivot_table(df, index="region", columns="health", values="visits", aggfunc="count", fill_value=0)
    pivot_mean_visits = pd.pivot_table(df, index="region", columns="health", values="visits", aggfunc="mean", fill_value=0)
    pivot_mean_income = pd.pivot_table(df, index="region", columns="health", values="income_dollars", aggfunc="mean", fill_value=0)
    print("Pivot 1: Count by region x health")
    display(pivot_count)
    print("Pivot 2: Mean visits by region x health")
    display(pivot_mean_visits.round(3))
    print("Pivot 3: Mean income_dollars by region x health")
    display(pivot_mean_income.round(2))
Output
Pivot 1: Count by region x health
health     average  excellent  poor
region                             
midwest        957         90   110
northeast      694         57    86
other         1237        105   272
west           621         91    86
Pivot 2: Mean visits by region x health
health     average  excellent    poor
region                               
midwest      5.282      3.444   8.118
northeast    5.695      4.018  10.674
other        5.088      3.143   8.746
west         6.499      3.374   8.593
Pivot 3: Mean income_dollars by region x health
health      average  excellent      poor
region                                  
midwest    24984.23   31053.89  21618.12
northeast  27015.35   33400.12  20659.30
other      22010.96   30024.58  16851.79
west       31663.77   37255.87  21118.85
Cell 14 Markdown

C3-T6 — Criteria-based analysis (visits, gender, marital, school, income, employed, insurance, medicaid)

PDF requirement: Analyze the data based on multiple criteria including visits type, gender, marital status, school, income, employed, insurance, medical aid.

What I completed

  • I produced grouped mean tables for each requested criterion across utilization and income metrics.

Results Capture

  • Produced one compact mean table per criterion (gender, married, school, employed, insurance, medicaid) for visit metrics and income_dollars.
  • Key findings:
      • gender: mean visits female=6.014 vs male=5.420; mean income female=22,493.476 vs male=29,377.156.
      • insurance: the insured group has higher mean visits (6.023) than the uninsured (4.913).
      • medicaid: enrollment corresponds to higher mean visits (6.714) but lower mean income (11,365.388) vs non-enrolled.

Code evidence

  • The next cell shows all criterion-based grouped tables I generated.
Cell 15 Code · python
criteria_cols = ["gender", "married", "school", "employed", "insurance", "medicaid"]
metric_cols = [c for c in ["visits", "nvisits", "ovisits", "novisits", "emergency", "hospital"] if c in df.columns]

print("Metric columns guess:", metric_cols)

for c in criteria_cols:
    if c in df.columns and metric_cols:
        print(f"\nCriteria table: {c}")
        table = df.groupby(c)[metric_cols + ["income_dollars"]].mean(numeric_only=True).round(3)
        display(table)
Output
Metric columns guess: ['visits', 'nvisits', 'ovisits', 'novisits', 'emergency', 'hospital']

Criteria table: gender
        visits  nvisits  ovisits  novisits  emergency  hospital  \
gender                                                            
female   6.014    1.732    0.683     0.513      0.269     0.286   
male     5.420    1.449    0.851     0.570      0.255     0.311   

        income_dollars  
gender                  
female       22493.476  
male         29377.156  
Criteria table: married
         visits  nvisits  ovisits  novisits  emergency  hospital  \
married                                                            
no        5.932    1.549    0.650     0.507      0.302     0.311   
yes       5.644    1.675    0.835     0.560      0.232     0.283   

         income_dollars  
married                  
no            17194.294  
yes           31985.391  
Criteria table: school
        visits  nvisits  ovisits  novisits  emergency  hospital  \
school                                                            
0        4.214    0.806    1.748     1.709      0.282     0.340   
1        5.538    0.154    0.385     0.077      0.231     0.462   
2        5.079    0.184    0.421     0.263      0.184     0.342   
3        6.239    0.535    1.014     0.296      0.662     0.423   
4        4.750    0.910    0.520     0.180      0.320     0.260   
5        5.854    0.417    0.718     0.058      0.350     0.359   
6        5.358    1.116    0.428     1.006      0.353     0.341   
7        5.448    0.982    0.529     0.457      0.285     0.285   
8        5.582    1.431    0.785     0.507      0.281     0.307   
9        4.795    1.043    0.765     0.436      0.333     0.380   
10       5.696    1.913    0.817     0.554      0.315     0.284   
11       5.789    1.599    0.938     0.339      0.264     0.233   
12       5.887    1.932    0.776     0.574      0.222     0.304   
13       6.378    1.866    0.488     0.372      0.331     0.291   
14       6.126    1.828    0.504     0.483      0.206     0.227   
15       7.081    1.663    0.535     0.605      0.105     0.186   
16       6.092    2.616    0.900     0.592      0.184     0.248   
17       6.478    3.065    0.761     0.391      0.152     0.196   
18       7.887    1.915    0.679     0.500      0.151     0.274   

        income_dollars  
school                  
0            21664.155  
1            12929.154  
2            13085.395  
3            14183.507  
4            13851.000  
5            14735.214  
6            17893.104  
7            16157.104  
8            20111.042  
9            18457.111  
10           19863.163  
11           21885.476  
12           27290.700  
13           29762.744  
14           32537.803  
15           34549.686  
16           43962.132  
17           51312.761  
18           54112.377  
Criteria table: employed
          visits  nvisits  ovisits  novisits  emergency  hospital  \
employed                                                            
no         5.807    1.640    0.786     0.549      0.269     0.304   
yes        5.495    1.426    0.444     0.424      0.220     0.226   

          income_dollars  
employed                  
no             23600.560  
yes            39779.402  
Criteria table: insurance
           visits  nvisits  ovisits  novisits  emergency  hospital  \
insurance                                                            
no          4.913    0.934    1.056     0.436      0.333     0.311   
yes         6.023    1.815    0.663     0.565      0.243     0.292   

           income_dollars  
insurance                  
no              16629.839  
yes             27759.441  
Criteria table: medicaid
          visits  nvisits  ovisits  novisits  emergency  hospital  \
medicaid                                                            
no         5.680    1.654    0.725     0.534      0.246     0.284   
yes        6.714    1.259    1.007     0.555      0.443     0.415   

          income_dollars  
medicaid                  
no             26667.471  
yes            11365.388  
Cell 16 Markdown

C3-T7 — Create required distribution tables

PDF requirement: Create distribution tables: Age+Gender; Health by Gender; Income by Gender; Regional Income; Age-wise Income.

What I completed

  • I built all required distribution tables using age groups, crosstabs, and grouped income summaries.

Results Capture

  • Built all required distribution tables:
      • Age + Gender (5x2), Health by Gender (3x2), Income by Gender (2x3), Regional Income (4x3), Age-wise Income (5x3).
  • Notable patterns:
      • Regional mean income is highest in west and lowest in other.
      • Male mean income exceeds female mean income, while female mean visits are higher.
      • Older age groups remain strongly represented across both genders.

Code evidence

  • The next cell contains the exact table-generation logic for all required distributions.
Cell 17 Code · python
df3 = df.copy()

# Resolve column names so the cell works whether the handoff uses the
# renamed columns (age_years, income_dollars) or the raw ones (age, income).
age_col = "age_years" if "age_years" in df3.columns else "age"
income_col = "income_dollars" if "income_dollars" in df3.columns else "income"

if age_col in df3.columns:
    # Five age bands; include_lowest keeps age 65 inside the first bin.
    df3["age_group"] = pd.cut(df3[age_col].astype(float), bins=[65, 70, 75, 80, 85, 110], include_lowest=True)

if income_col in df3.columns:
    # Quintile edges, nudged upward when duplicated so pd.cut accepts them.
    quantiles = df3[income_col].quantile([0, .2, .4, .6, .8, 1.0]).values
    edges = [quantiles[0]]
    for value in quantiles[1:]:
        edges.append(value if value > edges[-1] else edges[-1] + 1)
    # income_band supports optional banded summaries; the required tables use the raw column.
    df3["income_band"] = pd.cut(df3[income_col].astype(float), bins=edges, include_lowest=True)

if set(["age_group", "gender"]).issubset(df3.columns):
    print("Distribution 1: Age + Gender")
    display(pd.crosstab(df3["age_group"], df3["gender"]))

if set(["health", "gender"]).issubset(df3.columns):
    print("Distribution 2: Health by Gender")
    display(pd.crosstab(df3["health"], df3["gender"]))

# Use the resolved income_col so Distributions 3-5 also run when the column is named "income".
if set([income_col, "gender"]).issubset(df3.columns):
    print("Distribution 3: Income by Gender")
    display(pd.pivot_table(df3, index="gender", values=income_col, aggfunc=["count", "mean", "median"]).round(2))

if set([income_col, "region"]).issubset(df3.columns):
    print("Distribution 4: Regional Income")
    display(pd.pivot_table(df3, index="region", values=income_col, aggfunc=["count", "mean", "median"]).round(2))

if set(["age_group", income_col]).issubset(df3.columns):
    print("Distribution 5: Age-wise Income")
    display(pd.pivot_table(df3, index="age_group", values=income_col, aggfunc=["count", "mean", "median"]).round(2))
Output
Distribution 1: Age + Gender
gender          female  male
age_group                   
(64.999, 70.0]     897   671
(70.0, 75.0]       736   530
(75.0, 80.0]       525   321
(80.0, 85.0]       303   176
(85.0, 110.0]      167    80
Distribution 2: Health by Gender
gender     female  male
health                 
average      2093  1416
excellent     193   150
poor          342   212
Distribution 3: Income by Gender
                count           mean         median
       income_dollars income_dollars income_dollars
gender                                             
female           2628       22493.48        14160.0
male             1778       29377.16        20574.0
Distribution 4: Regional Income
                   count           mean         median
          income_dollars income_dollars income_dollars
region                                                
midwest             1157       25136.34        17875.0
northeast            837       26797.09        17413.0
other               1614       21662.84        14220.0
west                 798       31165.05        20656.0
Distribution 5: Age-wise Income
                        count           mean         median
               income_dollars income_dollars income_dollars
age_group                                                  
(64.999, 70.0]           1568       27488.76        19872.0
(70.0, 75.0]             1266       26194.35        16986.0
(75.0, 80.0]              846       22997.60        15135.0
(80.0, 85.0]              479       20269.47        12886.0
(85.0, 110.0]             247       23951.36        13940.0
Cell 18 Markdown

Final section — Conclusions (required)

  • I used Capstone 2/outputs/NSMES1988updated.csv as the input for this analysis.
  • I completed categorical profiling and pivoting for health and region with clear segmentation outputs.
  • I documented meaningful criteria-based differences across gender, insurance, medicaid, marital status, employment, and schooling groups.
  • I generated and interpreted all required distribution tables.
  • I updated WORK_SUMMARY.md with evidence and marked all Capstone 3 tasks complete.
Cell 19 Code · python
print("Capstone 3 completed: C3-T4 to C3-T7")
print("Input used: Capstone 2/outputs/NSMES1988updated.csv")
Output
Capstone 3 completed: C3-T4 to C3-T7
Input used: Capstone 2/outputs/NSMES1988updated.csv
Project Notes
  • Input dataset from Capstone 2.
  • Dtype audit and categorical review.
  • Pivot, grouped analysis, and distribution tables.
  • Code screenshots and notebook evidence.
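The dtype audit and categorical review noted above can be sketched as follows. This is a hedged illustration, not the notebook's exact code: the 10-unique-values threshold for flagging categorical fields is an assumption, and the inline frame stands in for the real dataset.

```python
import pandas as pd

# Stand-in rows with column names matching the notebook's dataset.
df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "visits": [5, 3, 7],
    "income_dollars": [14160.0, 20574.0, 17875.0],
})

# Audit each column's dtype and cardinality.
audit = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
})
# Assumed heuristic: low-cardinality object columns are likely categorical.
audit["likely_categorical"] = (audit["dtype"] == "object") & (audit["n_unique"] <= 10)
print(audit)
```

An audit like this identifies which fields to treat as categorical segments in the pivot and criteria tables that follow.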
Project File Links
  • Capstone 2 Updated Input CSV: Open Capstone 2 Updated Input CSV
    Primary handoff dataset loaded by the Capstone 3 notebook.
  • Notebook File: Open Notebook File
    Staged Capstone 3 notebook used as the main evidence source for the page walkthrough.
  • Reference Source CSV: Open Reference Source CSV
    Original NSMES source copy retained in the Capstone 3 folder for lineage reference.
  • Notebook Source: Open Notebook Source
    Public notebook source path used when a Colab launch URL is configured.

Screenshot Evidence

Code Screenshot

Dtype Audit Code Screenshot

Captured code evidence for the dtype audit and categorical-field identification section.

Code Screenshot

Pivot Analysis Code Screenshot

Captured code evidence for the region-health pivot tables.

Code Screenshot

Distribution Tables Code Screenshot

Captured code evidence for the required distribution-table section.


Outputs And Results

Key Outputs
  • The notebook produces pivot tables and grouped distribution tables directly in the notebook outputs rather than exporting new CSV handoff files.
  • The code screenshots preserve the pivot, grouped-criteria, and distribution-table logic.
  • The requirements file documents the assignment task list.
Key Findings
  • Region-health pivoting shows measurable variation in both visits and income across segments.
  • The grouped criteria tables surface clear visit and income differences across insurance and medicaid status.
  • The required distribution tables keep the Week 3 and Week 4 reporting aligned to age, gender, region, and income patterns, covering exactly the tables the PDF specifies and nothing beyond them.
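The region-health pivoting referenced above can be sketched with `pd.pivot_table`. The inline rows here are illustrative stand-ins, not values from the dataset:

```python
import pandas as pd

# Made-up stand-in rows with the notebook's column names.
df = pd.DataFrame({
    "region": ["west", "west", "other", "other"],
    "health": ["average", "poor", "average", "excellent"],
    "visits": [6, 9, 4, 2],
    "income_dollars": [31000, 25000, 20000, 24000],
})

# Region-by-health pivot over mean visits and income, as in the
# screenshots' pivot-analysis section.
pivot = pd.pivot_table(
    df,
    index="region",
    columns="health",
    values=["visits", "income_dollars"],
    aggfunc="mean",
)
print(pivot)
```

The resulting frame has region rows and a (metric, health) column MultiIndex, which is the segmentation shape the findings above summarize.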