Francis Burnet – AI Engineering Portfolio

Capstone portfolio spanning AI engineering, applied data science, machine learning, and deep learning.

Francis Burnet headshot

Capstone 8 Evidence Map

Capstone 8 evidence image
Capstone Summary

This documentation details Capstone 8 of the Microsoft AI Engineering Program 2026, which focuses on building a comprehensive movie recommendation system. Using Python and datasets containing film titles and viewer ratings, the project demonstrates three collaborative-filtering methodologies: user-based, item-based, and model-based. The technical workflow includes merging dataframes, creating a user-item pivot table, and calculating Pearson correlations to predict ratings and identify similar content. Performance is measured through 5-fold cross-validation, comparing the accuracy of SVD, NMF, and KNN models by Root Mean Square Error (RMSE). Ultimately, the project provides a structured portfolio of notebook evidence, statistical charts, and JSON summaries that validate the execution of these machine learning techniques.

Capstone 8 Scope

Capstone 8 turns the copied recommendation assignment into an executed notebook with user-based, item-based, and model-based recommendation outputs staged for the site workflow.

Primary staged datasets: movies.csv and ratings.csv.

The notebook exports recommendation outputs, model-cross-validation results, and a structured summary JSON.

Original Project PDF

The copied project directions are embedded here for direct comparison against the notebook and output artifacts.

Requirement Checklist

1a

Study the recommendation techniques for recommending movies using `movies.csv` and `ratings.csv`.

Source mapping: Requirements file

1b

Load `movies.csv` and `ratings.csv`.

Source mapping: Requirements file

1c

Merge both dataframes on `movieId`.

Source mapping: Requirements file

1d

Create the user-item matrix using `pivot_table` with index `userId`, columns `title`, and values `rating`.

Source mapping: Requirements file

1e

Perform user-based collaborative filtering.

Source mapping: Requirements file

1f

Fill row-wise NaN values in the user-item matrix with the corresponding user's mean ratings.

Source mapping: Requirements file
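As a hedged illustration of this step (toy data, not the staged MovieLens files), each user's missing ratings can be filled with that user's own mean via a row-wise apply:

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: rows are users, columns are movie titles.
user_item = pd.DataFrame(
    {'Movie A': [4.0, np.nan, 2.0], 'Movie B': [np.nan, 3.0, 4.0]},
    index=[1, 2, 3],
)

# Fill each user's NaN entries with that same user's mean rating.
user_filled = user_item.apply(lambda row: row.fillna(row.mean()), axis=1)
print(user_filled)
```

User 1's only rating is 4.0, so their missing entry is filled with 4.0; User 2's is filled with 3.0.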

1g

Find the Pearson correlation between users.

Source mapping: Requirements file

1h

Choose the correlation of all users with only User 1.

Source mapping: Requirements file

1i

Sort the User 1 correlation in descending order.

Source mapping: Requirements file

1j

Drop the NaN values generated in the correlation matrix.

Source mapping: Requirements file

1k

Choose the top 50 users that are highly correlated to User 1.

Source mapping: Requirements file

1l

Predict the rating that User 1 might give for the movie with `movieId 32` based on the top 50 user correlation matrix.

Source mapping: Requirements file
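One common way to turn the top-50 correlations into a prediction is a correlation-weighted average of those users' ratings for the target movie. This is a minimal sketch with made-up correlations and ratings (three hypothetical neighbours, not the notebook's actual values):

```python
import numpy as np
import pandas as pd

# Hypothetical correlations of three neighbours with User 1 (index = userId).
top_corr = pd.Series({5: 0.9, 12: 0.8, 33: 0.6})
# Those neighbours' ratings for the target movie (movieId 32 in the capstone).
neighbour_ratings = pd.Series({5: 5.0, 12: 4.0, 33: 3.0})

# Correlation-weighted average: sum(corr * rating) / sum(|corr|).
predicted = float(np.dot(top_corr.values, neighbour_ratings.values) / np.abs(top_corr).sum())
print(round(predicted, 4))
```

Higher-correlation neighbours pull the prediction toward their own ratings; neighbours who never rated the movie would simply be excluded before weighting.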

1m

Perform item-based collaborative filtering.

Source mapping: Requirements file

1n

Fill column-wise NaN values in the user-item matrix with the corresponding movie mean ratings.

Source mapping: Requirements file

1o

Find the Pearson correlation between movies.

Source mapping: Requirements file

1p

Choose the correlation of all movies with `Jurassic Park (1993)` only.

Source mapping: Requirements file

1q

Sort the `Jurassic Park (1993)` movie correlation in descending order.

Source mapping: Requirements file

1r

Drop the NaN values generated in the movie correlation matrix.

Source mapping: Requirements file

1s

Find 10 movies similar to `Jurassic Park (1993)`.

Source mapping: Requirements file
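The item-based search works by correlating movie columns of the user-item matrix and ranking the results. A hedged toy version (synthetic ratings with one deliberately correlated column, not the real data) looks like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy user-item matrix with 20 users and 3 movie columns.
base = rng.normal(3.5, 1.0, 20)
user_item = pd.DataFrame({
    'Jurassic Park (1993)': base,
    'Similar Movie': base + rng.normal(0, 0.1, 20),   # strongly correlated
    'Unrelated Movie': rng.normal(3.5, 1.0, 20),      # roughly independent
})

# Pearson correlation between movie columns, then rank by similarity.
movie_corr = user_item.corr()
target = 'Jurassic Park (1993)'
similar = movie_corr[target].drop(index=target).dropna().sort_values(ascending=False)
print(similar.head(10))
```

On the staged data this same pattern returns the 10 most correlated titles; here the deliberately correlated column ranks first.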

1t

Perform KNNBasic model-based collaborative filtering.

Source mapping: Requirements file

1u

Initialize KNNBasic with Mean Squared Difference similarity (`msd`), 20 neighbors, and 5-fold cross-validation against RMSE.

Source mapping: Requirements file

1v

Initialize Singular Value Decomposition (SVD) and cross-validate 5 folds against RMSE.

Source mapping: Requirements file

1w

Initialize Non-Negative Matrix Factorization (NMF) and cross-validate 5 folds against RMSE.

Source mapping: Requirements file

1x

Print the best score and best parameters from cross-validation for all built models.

Source mapping: Requirements file

Requirement Walkthrough

Each walkthrough block maps the copied PDF requirements to the executed notebook cells, exported outputs, and reviewable evidence staged with this capstone.

8a

Merge The Ratings Data And Build The User-Item Matrix

Notebook section: Load, merge, and pivot-table cells

Requirement: Load both CSV files, merge on movieId, and create the user-item matrix required for collaborative filtering.

The notebook merges ratings with movie titles and creates the full user-item matrix that anchors the user-based, item-based, and model-based recommendation steps.

Results Capture
  • The staged movie with movieId 32 is Twelve Monkeys (a.k.a. 12 Monkeys) (1995).
  • User-based and item-based recommendation steps both start from the same merged pivot-table structure.
merged = ratings.merge(movies[['movieId', 'title']], on='movieId', how='left')
user_item = merged.pivot_table(index='userId', columns='title', values='rating')
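The two staged lines above follow the standard pandas merge-then-pivot pattern; a self-contained toy version (synthetic frames standing in for movies.csv and ratings.csv) behaves the same way:

```python
import pandas as pd

# Synthetic stand-ins for movies.csv and ratings.csv.
movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Heat (1995)'],
    'genres': ['Animation', 'Crime'],
})
ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# Merge titles onto ratings, then pivot into the user-item matrix.
merged = ratings.merge(movies[['movieId', 'title']], on='movieId', how='left')
user_item = merged.pivot_table(index='userId', columns='title', values='rating')
print(user_item)
```

Users become rows, titles become columns, and any movie a user never rated stays NaN, which is exactly what the later fill steps operate on.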
8b

Run User-Based And Item-Based Collaborative Filtering

Notebook section: Correlation and recommendation cells

Requirement: Compute user correlations for User 1, predict the rating for movieId 32, and find 10 movies similar to Jurassic Park (1993).

The notebook fills row-wise and column-wise NaN values, computes the required correlation views, predicts User 1's rating for movieId 32, and exports the top similar movies for Jurassic Park (1993).

Results Capture
  • Predicted User 1 rating for movieId 32: 4.1369.
  • Similar-movie results for Jurassic Park are exported as CSV.
user_corr = user_filled.T.corr()
jurassic_similar = movie_corr['Jurassic Park (1993)'].drop(index='Jurassic Park (1993)').dropna().sort_values(ascending=False).head(10)
Associated Artifact

Model-Based RMSE Comparison

Saved comparison chart for the model-based recommendation workflows.

8c

Evaluate The Model-Based Recommendation Workflows

Notebook section: KNN-style MSD, SVD, and NMF evaluation cells

Requirement: Evaluate the model-based recommendation approaches and compare the best RMSE result.

The notebook records an environment-note fallback for runtimes where the scikit-surprise wheel cannot be built, then evaluates KNN-style MSD, SVD, and NMF over 5-fold RMSE using the staged ratings matrix.

Results Capture
  • Current best model by average RMSE: SVD.
  • The environment note explains why scikit-surprise could not be built in the current Windows Python 3.12 environment.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_results = pd.DataFrame(fold_records)
summary_results = fold_results.groupby(['model', 'parameters'], as_index=False)['rmse'].mean()
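As a hedged sketch of the 5-fold RMSE bookkeeping shown above (synthetic ratings and a trivial global-mean predictor standing in for KNN/SVD/NMF, with a hand-rolled shuffle-and-split instead of the notebook's KFold object):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic ratings clipped to the 0.5-5.0 MovieLens scale.
ratings = np.clip(rng.normal(3.5, 1.0, 200), 0.5, 5.0)

# Manual 5-fold split: shuffle indices, cut into 5 folds.
indices = rng.permutation(len(ratings))
folds = np.array_split(indices, 5)

fold_records = []
for fold_index, test_idx in enumerate(folds, start=1):
    train_idx = np.setdiff1d(indices, test_idx)
    # Dummy "model": predict the training-set mean for every held-out rating.
    prediction = ratings[train_idx].mean()
    rmse = float(np.sqrt(np.mean((ratings[test_idx] - prediction) ** 2)))
    fold_records.append({'fold': fold_index, 'model': 'global-mean', 'rmse': rmse})

fold_results = pd.DataFrame(fold_records)
summary_results = fold_results.groupby('model', as_index=False)['rmse'].mean()
print(summary_results)
```

The real notebook swaps the dummy predictor for each fitted model and averages per-fold RMSE the same way before ranking the models.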

Colab Notebook

This section provides the notebook preview, launch link, and project file links.

The notebook opens in Google Colab when a launch URL is configured, and the project files and outputs remain available here on the site.

Capstone 8 Notebook Workspace
Launch Colab
Embedded Notebook Preview
Cell 1 Markdown

Capstone Session 8

This notebook is generated from the copied Capstone_Session_8.pdf directions and the staged movies.csv and ratings.csv datasets.

Cell 2 Markdown

Objective

Demonstrate user-based, item-based, and model-based recommendation techniques using the staged movie ratings data.

Cell 3 Markdown

Environment Note

This notebook uses scikit-surprise directly for the model-based recommendation tasks required by the PDF. In Google Colab, the setup cell installs any missing build dependency and then installs scikit-surprise before running KNNBasic, SVD, and NMF.

Cell 4 Code · python
from pathlib import Path
import importlib
from importlib import metadata as importlib_metadata
import json
import subprocess
import sys
from urllib.parse import quote

IS_COLAB = 'google.colab' in sys.modules
GITHUB_REPO_OWNER = 'FrancisBurnet'
GITHUB_REPO_NAME = 'francisburnet'
GITHUB_REPO_BRANCH = 'main'
CAPSTONE_ROOT = Path('Incremental Capstones/Machine Learning Using Python/Capstone Session 8')
MOVIES_FILENAME = 'movies.csv'
RATINGS_FILENAME = 'ratings.csv'


def build_raw_github_url(relative_path: Path) -> str:
    encoded_path = quote(relative_path.as_posix(), safe='/')
    return (
        f"https://raw.githubusercontent.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}/"
        f"{GITHUB_REPO_BRANCH}/{encoded_path}"
    )


def resolve_capstone_dir() -> Path | None:
    current = Path.cwd().resolve()
    capstone_parts = CAPSTONE_ROOT.parts
    for candidate in [current, *current.parents]:
        if len(candidate.parts) >= len(capstone_parts) and candidate.parts[-len(capstone_parts):] == capstone_parts:
            return candidate
        nested_candidate = candidate / CAPSTONE_ROOT
        if nested_candidate.exists():
            return nested_candidate
    return None


CAPSTONE_DIR = resolve_capstone_dir()
MOVIES_URL = build_raw_github_url(CAPSTONE_ROOT / MOVIES_FILENAME)
RATINGS_URL = build_raw_github_url(CAPSTONE_ROOT / RATINGS_FILENAME)

if CAPSTONE_DIR is not None:
    OUTPUT_ROOT = CAPSTONE_DIR
    OUTPUT_MODE = 'permanent capstone outputs'
    OUTPUT_DISPLAY = (CAPSTONE_ROOT / 'outputs').as_posix()
else:
    runtime_root = Path('/content/capstone-session-8-runtime') if IS_COLAB else Path.cwd().resolve() / 'capstone-session-8-runtime'
    OUTPUT_ROOT = runtime_root
    OUTPUT_MODE = 'runtime scratch outputs; export final artifacts back into the capstone outputs folder'
    OUTPUT_DISPLAY = 'capstone-session-8-runtime/outputs'

OUTPUTS_DIR = (OUTPUT_ROOT / 'outputs').resolve()
PLOTS_DIR = OUTPUTS_DIR / 'plots'
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)
PLOTS_DIR.mkdir(parents=True, exist_ok=True)


def installed_version(package_name: str) -> str | None:
    try:
        return importlib_metadata.version(package_name)
    except importlib_metadata.PackageNotFoundError:
        return None


def surprise_import_ready() -> bool:
    try:
        importlib.import_module('surprise')
        return True
    except Exception:
        return False


numpy_version = installed_version('numpy')
needs_numpy_pin = numpy_version is None or int(numpy_version.split('.')[0]) >= 2
needs_surprise_setup = needs_numpy_pin or not surprise_import_ready()

if needs_surprise_setup:
    try:
        if IS_COLAB:
            subprocess.run(['apt-get', 'update', '-qq'], check=True)
            subprocess.run(['apt-get', 'install', '-y', 'build-essential'], check=True)
        subprocess.run([sys.executable, '-m', 'pip', 'install', '--force-reinstall', 'numpy<2'], check=True)
        subprocess.run([sys.executable, '-m', 'pip', 'install', '--force-reinstall', '--no-deps', 'scikit-surprise'], check=True)
        importlib.invalidate_caches()
    except subprocess.CalledProcessError as exc:
        if not IS_COLAB:
            raise RuntimeError(
                'Session 8 requires Microsoft Visual C++ Build Tools and a NumPy 1.x runtime for scikit-surprise. '
                'Install the Visual Studio C++ workload, then rerun this cell.'
            ) from exc
        raise

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from surprise import Dataset, KNNBasic, NMF as SurpriseNMF, Reader, SVD
from surprise.model_selection import KFold as SurpriseKFold, cross_validate

sns.set_theme(style='whitegrid')
pd.set_option('display.max_columns', 100)

print('Runtime:', 'Google Colab' if IS_COLAB else 'Notebook runtime')
print('Capstone artifact path:', CAPSTONE_ROOT.as_posix())
print('Movies source:', MOVIES_URL)
print('Ratings source:', RATINGS_URL)
print('Output mode:', OUTPUT_MODE)
print('Output target:', OUTPUT_DISPLAY)
print('NumPy version:', np.__version__)
print('scikit-surprise import ready')
Output
Runtime: Local / notebook runtime
Capstone directory: X:\SIMPLILEARN\FrancisBurnetCom\Incremental Capstones\Machine Learning Using Python\Capstone Session 8
Movies source: https://raw.githubusercontent.com/FrancisBurnet/francisburnet/main/Incremental%20Capstones/Machine%20Learning%20Using%20Python/Capstone%20Session%208/movies.csv
Ratings source: https://raw.githubusercontent.com/FrancisBurnet/francisburnet/main/Incremental%20Capstones/Machine%20Learning%20Using%20Python/Capstone%20Session%208/ratings.csv
Output mode: permanent capstone outputs
Outputs directory: X:\SIMPLILEARN\FrancisBurnetCom\Incremental Capstones\Machine Learning Using Python\Capstone Session 8\outputs
NumPy version: 1.26.4
scikit-surprise import ready
Cell 5 Code · python
movies = pd.read_csv(MOVIES_URL)
ratings = pd.read_csv(RATINGS_URL)
merged = ratings.merge(movies[['movieId', 'title']], on='movieId', how='left')
user_item = merged.pivot_table(index='userId', columns='title', values='rating')
display(merged.head())
print('Movies source used:', MOVIES_URL)
print('Ratings source used:', RATINGS_URL)
print('Movies shape:', movies.shape)
print('Ratings shape:', ratings.shape)
print('Merged shape:', merged.shape)
print('User-item shape:', user_item.shape)
Output
   userId  movieId  rating  timestamp                        title
0       1        1     4.0  964982703             Toy Story (1995)
1       1        3     4.0  964981247      Grumpier Old Men (1995)
2       1        6     4.0  964982224                  Heat (1995)
3       1       47     5.0  964983815  Seven (a.k.a. Se7en) (1995)
4       1       50     5.0  964982931   Usual Suspects, The (1995)
Movies source used: https://raw.githubusercontent.com/FrancisBurnet/francisburnet/main/Incremental%20Capstones/Machine%20Learning%20Using%20Python/Capstone%20Session%208/movies.csv
Ratings source used: https://raw.githubusercontent.com/FrancisBurnet/francisburnet/main/Incremental%20Capstones/Machine%20Learning%20Using%20Python/Capstone%20Session%208/ratings.csv
Movies shape: (9742, 3)
Ratings shape: (100836, 4)
Merged shape: (100836, 5)
User-item shape: (610, 9719)
Cell 6 Code · python
user_filled = user_item.apply(lambda row: row.fillna(row.mean()), axis=1)
user_corr = user_filled.T.corr()
user_1_corr = user_corr.loc[1].drop(index=1).dropna().sort_values(ascending=False)
top_50_users = user_1_corr.head(50)
movie_32_title = movies.loc[movies['movieId'] == 32, 'title'].iloc[0]
movie_32_ratings = merged.loc[merged['movieId'] == 32, ['userId', 'rating']].set_index('userId')
eligible = top_50_users[top_50_users.index.isin(movie_32_ratings.index)]
if eligible.empty:
    predicted_user_1_rating = float(merged.loc[merged['movieId'] == 32, 'rating'].mean())
else:
    weighted_ratings = movie_32_ratings.loc[eligible.index, 'rating']
    denominator = float(np.abs(eligible).sum())
    predicted_user_1_rating = float(np.dot(eligible.values, weighted_ratings.values) / denominator) if denominator else float(weighted_ratings.mean())

top_50_df = top_50_users.reset_index()
top_50_df.columns = ['userId', 'correlation']
top_50_df.to_csv(OUTPUTS_DIR / 'session_8_top_50_user_correlations.csv', index=False)
display(top_50_df.head(10))
{'movieId_32_title': movie_32_title, 'predicted_user_1_rating_for_movie_32': round(predicted_user_1_rating, 4)}
Cell 7 Code · python
item_filled = user_item.apply(lambda column: column.fillna(column.mean()), axis=0)
movie_corr = item_filled.corr()
jurassic_title = 'Jurassic Park (1993)'
jurassic_similar = movie_corr[jurassic_title].drop(index=jurassic_title).dropna().sort_values(ascending=False).head(10)
similar_movies_df = jurassic_similar.reset_index()
similar_movies_df.columns = ['title', 'correlation']
similar_movies_df.to_csv(OUTPUTS_DIR / 'session_8_similar_movies.csv', index=False)
display(similar_movies_df)
Cell 8 Code · python
reader = Reader(rating_scale=(float(ratings['rating'].min()), float(ratings['rating'].max())))
surprise_data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
surprise_cv = SurpriseKFold(n_splits=5, random_state=42, shuffle=True)

model_specs = [
    (
        'KNNBasic',
        KNNBasic(k=20, sim_options={'name': 'msd', 'user_based': True}),
        {'k': 20, 'sim_options': {'name': 'msd', 'user_based': True}},
    ),
    (
        'SVD',
        SVD(random_state=42),
        {'random_state': 42},
    ),
    (
        'NMF',
        SurpriseNMF(random_state=42),
        {'random_state': 42},
    ),
]

fold_records = []
model_summaries = []
for model_name, algorithm, parameters in model_specs:
    cv_result = cross_validate(
        algorithm,
        surprise_data,
        measures=['RMSE'],
        cv=surprise_cv,
        verbose=False,
        n_jobs=1,
    )
    rmse_scores = [float(score) for score in cv_result['test_rmse']]
    for fold_index, rmse_score in enumerate(rmse_scores, start=1):
        fold_records.append(
            {
                'fold': fold_index,
                'model': model_name,
                'rmse': rmse_score,
                'parameters': json.dumps(parameters, sort_keys=True),
            }
        )
    model_summaries.append(
        {
            'model': model_name,
            'parameters': parameters,
            'rmse': float(np.mean(rmse_scores)),
            'best_score': float(np.min(rmse_scores)),
        }
    )
Cell 9 Code · python
fold_results = pd.DataFrame(fold_records)
display(fold_results.head(9))
summary_results = pd.DataFrame(model_summaries).sort_values('rmse').reset_index(drop=True)
display(summary_results)
best_model = summary_results.iloc[0].to_dict()
best_model
Cell 10 Code · python
fig, ax = plt.subplots(figsize=(10, 5))
bar_colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
bars = ax.bar(summary_results['model'], summary_results['rmse'], color=bar_colors)
ax.set_title('Session 8 Model-Based RMSE Comparison')
ax.set_ylabel('Average 5-Fold RMSE')
ax.set_xlabel('Model')
ax.bar_label(bars, fmt='%.3f', padding=3)
ax.set_ylim(0, summary_results['rmse'].max() + 0.08)
fig.tight_layout()
fig.savefig(PLOTS_DIR / 'model_based_rmse.png', dpi=150)
plt.show()
plt.close(fig)

fold_results.to_csv(OUTPUTS_DIR / 'session_8_model_cv_results.csv', index=False)
summary = {
    'movie_id_32_title': movie_32_title,
    'predicted_user_1_rating_for_movie_32': round(predicted_user_1_rating, 4),
    'top_50_user_correlations_saved': 'session_8_top_50_user_correlations.csv',
    'similar_movies_for_jurassic_park': similar_movies_df.to_dict(orient='records'),
    'model_cv_results': summary_results.to_dict(orient='records'),
    'best_model': best_model,
    'environment_note': 'Model-based recommendation is executed with scikit-surprise using KNNBasic, SVD, and NMF.',
}
with open(OUTPUTS_DIR / 'session_8_summary.json', 'w', encoding='utf-8') as handle:
    json.dump(summary, handle, indent=2)
summary
Project Notes
  • Merged ratings matrix and pivot-table setup.
  • User-based rating prediction for movieId 32.
  • Item-based similar-movie search for Jurassic Park (1993).
  • Model-based RMSE comparison with environment-note fallback.
Launch Controls

Notebook Launch

Open the matching notebook in Google Colab or review the tracked notebook source in GitHub.

Project File Links

Outputs And Results

Key Outputs
  • Executed notebook artifact saved as capstone_session_8.ipynb.
  • CSV exports capture the top-50 user correlations, the Jurassic Park similar-movie list, and the model-CV RMSE table.
  • The site now has a saved comparison chart for the model-based recommendation workflows.
Key Findings
  • MovieId 32 maps to Twelve Monkeys (a.k.a. 12 Monkeys) (1995).
  • The current best model-based recommendation result is SVD.
  • The environment note is preserved in the outputs so the site explains the scikit-surprise build constraint directly instead of hiding it.