Projects | Benjamin Freeman

Featured R/RStudio project

Taxi fare pricing analysis.

A group final project for OPIM 5603 - Statistics in Business Analytics, built in RStudio and R Markdown. Responsibility was shared across the project; my main contribution was the random forest model and evaluation section.

What the project did

The project analyzed a 1,000-row taxi pricing dataset with 11 original fields. As a group, we moved through cleaning, feature engineering, visualization, regression modeling, random forest comparison, diagnostics, and interpretation. My main section focused on fitting the random forest model and evaluating RMSE, MAE, and R-squared. The image shown here is an actual excerpt from that random forest section of the knitted R Markdown report.

This is group coursework evidence, not a deployed pricing system. The random forest result is presented as an in-project model comparison, not a production validation claim.

Actual final project excerpt showing random forest performance metrics: RMSE, MAE, and R-squared.

R Markdown excerpt

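library(tidyverse)     # read_csv(), mutate()
library(janitor)       # clean_names()
library(randomForest)  # randomForest()
taxi <- read_csv("taxi_trip_pricing.csv") |> clean_names()
# Engineered feature: average trip speed in km/h (km per minute * 60).
taxi <- taxi |> mutate(speed_kph = trip_distance_km / trip_duration_minutes * 60)
model_full <- lm(trip_price ~ trip_distance_km + trip_duration_minutes + speed_kph, data = taxi)
rf_model <- randomForest(trip_price ~ ., data = taxi)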

R
RStudio
R Markdown
tidyverse
skimr
corrplot
lm()
randomForest

Model comparison

Model	RMSE	MAE	R-squared	What it showed
Distance-only regression	17.65	14.22	0.4349	Distance mattered, but one predictor left too much unexplained.
Multiple regression	15.22	11.91	0.5796	Distance and trip duration were the strongest interpretable drivers.
Random forest	4.34	3.12	0.9785	Nonlinear modeling fit the project dataset much more closely.
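
The three metric columns follow the standard definitions: RMSE is the square root of the mean squared error, MAE is the mean absolute error, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of those calculations follows, written in Python rather than the project's R. The fares are made up for illustration and are not taken from the taxi dataset; `regression_metrics` is a hypothetical helper name.

```python
import math

def regression_metrics(actual, predicted):
    """Return (rmse, mae, r_squared) for two equal-length sequences."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(r * r for r in residuals)                 # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    r_squared = 1 - ss_res / ss_tot
    return rmse, mae, r_squared

# Hypothetical fares, for illustration only (not project data).
actual_fares = [10.0, 20.0, 30.0]
predicted_fares = [12.0, 18.0, 33.0]
rmse, mae, r_squared = regression_metrics(actual_fares, predicted_fares)
print(f"RMSE={rmse:.3f} MAE={mae:.3f} R2={r_squared:.3f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which is consistent with every row of the table.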

Skills shown

Data cleaning, feature engineering, exploratory visualization, regression interpretation, model comparison, diagnostics, and communicating results in a reproducible report.

Personal takeaway

Funnily enough, working with black-and-white numbers has taught me a lot about creative thinking. It pushed me to look at business issues less linearly: what is connected, what is missing, and what question should come next.

Business interpretation

The analysis identified trip distance and duration as the clearest fare drivers, with traffic, weather, and engineered speed adding useful context around trip efficiency.

AI assistance

AI was useful for syntax support, debugging, and checking explanations. I still treat the judgment work as human: choosing the question, validating outputs, and explaining what the results mean.

Additional Python project

Credit default risk modeling.

A group final project for OPIM 5603 - Predictive Modeling, built in Python. Work was shared across the team; my section focused on the Random Forest classifier and the model-comparison assessment.

What the project did

The project used a 30,000-row credit-card default dataset to predict whether a client would default the next month. As a group, we moved through sampling, exploration, feature preparation, model building, and assessment. My main section fit the Random Forest model and helped compare test-set ROC AUC across candidate classifiers.

This is group coursework evidence, not a deployed credit-decisioning system. The metrics are presented as in-project validation results, not a production lending or underwriting claim.

Credit default model comparison chart showing test ROC AUC for logistic regression, XGBoost, Random Forest, neural network, and Naive Bayes models.

Python excerpt

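import seaborn as sns
from sklearn.ensemble import RandomForestClassifier

# Fit the Random Forest on the scaled training features.
model_6 = RandomForestClassifier(random_state=42)
model_6.fit(X_train_scaled, Y_train)

# Bar chart of ROC AUC per model (metrics_df is indexed by model name).
sns.barplot(
    x=metrics_df.index,
    y="ROC AUC",
    data=metrics_df
)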

Python
pandas
scikit-learn
seaborn
Logistic regression
Random Forest
XGBoost
Neural network
ROC AUC

Model comparison

Model	Accuracy	Precision	Recall	F1	ROC AUC	What it showed
Random Forest	0.8148	0.6429	0.3662	0.4666	0.7512	Strong tree-based comparison model with competitive discrimination.
XGBoost	0.8075	0.6114	0.3557	0.4497	0.7518	Similar ROC AUC to Random Forest with a different precision-recall balance.
Neural Network	0.8180	0.7102	0.2992	0.4210	0.7700	Highest ROC AUC in the project, with lower recall at the default threshold.
Logistic regression with interaction	0.8152	0.7034	0.2841	0.4047	0.7165	Useful interpretable baseline for comparing more flexible classifiers.
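
F1 is the harmonic mean of precision and recall, so the F1 column can be rechecked from the two columns beside it. The sketch below does that for the four rows of the table. It is a consistency check written for this page, not code from the project notebook, and `f1_score` here is a small local helper rather than the scikit-learn function of the same name.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the comparison table.
reported = {
    "Random Forest": (0.6429, 0.3662, 0.4666),
    "XGBoost": (0.6114, 0.3557, 0.4497),
    "Neural Network": (0.7102, 0.2992, 0.4210),
    "Logistic regression with interaction": (0.7034, 0.2841, 0.4047),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(f"{name}: reported {f1:.4f}, recomputed {recomputed[name]:.4f}")
```

The recomputed values agree with the reported ones to within rounding of the four-decimal inputs.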

Skills shown

Train/validation/test splitting, feature scaling, PCA for correlated billing variables, classifier fitting, model comparison, ROC AUC interpretation, and clear communication of validation results.
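
As a rough illustration of the splitting and scaling steps, the sketch below shuffles row indices once, cuts them into three partitions, and learns the scaling statistics from the training partition only so that nothing from the held-out rows leaks into preprocessing. The 60/20/20 proportions, the seed, the `credit_limit` column, and all function names are assumptions for the example; the project's actual split is not stated on this page.

```python
import random
import statistics

def split_indices(n_rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle row indices once, then cut them into train/validation/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_frac)
    n_valid = int(n_rows * valid_frac)
    return (indices[:n_train],
            indices[n_train:n_train + n_valid],
            indices[n_train + n_valid:])

def fit_scaler(values):
    """Learn mean and population standard deviation from training values only."""
    return statistics.mean(values), statistics.pstdev(values)

def apply_scaler(values, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

# Hypothetical credit-limit column, for illustration only.
credit_limit = [float(1000 * (i % 17 + 1)) for i in range(100)]
train_idx, valid_idx, test_idx = split_indices(len(credit_limit))

mean, std = fit_scaler([credit_limit[i] for i in train_idx])
train_scaled = apply_scaler([credit_limit[i] for i in train_idx], mean, std)
test_scaled = apply_scaler([credit_limit[i] for i in test_idx], mean, std)
```

Only the training partition ends up with mean 0 and standard deviation 1 exactly; the test partition is transformed with the same parameters and will be close but not identical.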

Business interpretation

The model comparison framed default prediction as a risk-ranking problem, where accuracy alone is not enough and recall, precision, F1, ROC AUC, and lift all matter.
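
One way to read ROC AUC in a risk-ranking setting: it is the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score. Lift compares the default rate among the highest-scored clients with the overall default rate. The sketch below computes both on a handful of made-up scores. The labels, scores, and function names are hypothetical, and the pairwise loop is meant for explanation, not for a 30,000-row dataset.

```python
def roc_auc(labels, scores):
    """Share of (defaulter, non-defaulter) pairs where the defaulter scores higher.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def lift_at(labels, scores, top_fraction):
    """Default rate among the highest-scored clients divided by the overall rate."""
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    k = max(1, int(len(ranked) * top_fraction))
    return (sum(ranked[:k]) / k) / (sum(ranked) / len(ranked))

# Hypothetical clients: 1 = defaulted, 0 = did not default.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]

auc = roc_auc(labels, scores)
top_lift = lift_at(labels, scores, top_fraction=0.4)
print(f"AUC={auc:.3f} lift@40%={top_lift:.2f}")
```

Neither number depends on a classification threshold, which is why they complement accuracy, precision, and recall.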

Contribution boundary

The project was completed by a group. My public claim is limited to the Random Forest model and the model-comparison assessment shown here.

SQL / ACO operations project

ACO operations reporting database.

A recreated version of the OPIM 5272 final project concept, rebuilt around the provider, clinical, member, attribution, and quality data an ACO operations analyst would need to document, query, and summarize for stakeholders. The original final SQL was not recovered, so this version uses the recovered Sprint 1 writeup and ERD as context, then adds cleaner normalization, synthetic source files, loadable SQL, and reporting views.

What the project models

The original project centered on a New England hospital network with patients, hospitals, departments, doctors, diagnoses, procedures, medications, emergency contacts, and administrative reports. I rebuilt that concept as a MariaDB schema and created a small-ACO source package around it: attributed members, provider roster, payer contracts, encounters, diagnoses, procedures, prescription history, and quality gaps. The public display shows the source-to-report workflow: Excel/CSV intake, normalized SQL tables, reporting views, and stakeholder-ready outputs for high-risk panels, quality gaps, controlled prescriptions, and department resources.

This is a recreated coursework artifact, not the original submitted final SQL and not a live hospital or ACO system. All records are synthetic and the claim is limited to the schema rebuild, data-source design, SQL logic, and portfolio presentation shown here.
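
The source-to-report flow (CSV intake, a constrained table, then a reporting view) can be sketched end to end with Python's standard library. This uses in-memory SQLite as a stand-in for MariaDB, a three-row inline CSV instead of the real source files, a hypothetical view name `vw_high_risk_panel`, and an arbitrary 0.8 risk cutoff, so it illustrates the shape of the workflow and not the project's actual schema.

```python
import csv
import io
import sqlite3

# Hypothetical two-column extract standing in for one of the source CSV files.
members_csv = io.StringIO(
    "patient_id,risk_score\n"
    "1,0.82\n"
    "2,0.35\n"
    "3,0.91\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_attribution ("
    " patient_id INTEGER PRIMARY KEY,"
    " risk_score REAL NOT NULL CHECK (risk_score >= 0))"
)

# Intake: parse the CSV text and cast each field to its column type.
rows = [(int(r["patient_id"]), float(r["risk_score"]))
        for r in csv.DictReader(members_csv)]
conn.executemany("INSERT INTO patient_attribution VALUES (?, ?)", rows)

# Reporting view: the 0.8 cutoff is an arbitrary threshold chosen for this sketch.
conn.execute(
    "CREATE VIEW vw_high_risk_panel AS"
    " SELECT patient_id, risk_score FROM patient_attribution"
    " WHERE risk_score >= 0.8"
)
high_risk = conn.execute(
    "SELECT patient_id FROM vw_high_risk_panel ORDER BY risk_score DESC"
).fetchall()
print(high_risk)
```

Keeping the threshold in the view and the ordering in the query leaves the view reusable by reports that sort differently.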

Live-rendered database schema browser for the recreated ACO operations database.

Live-rendered SQL editor and query result grids from the recreated ACO operations database.

MariaDB reporting view

CREATE VIEW vw_aco_patient_panel AS
SELECT
  p.patient_id,
  pc.contract_name,
  pa.risk_score,
  CONCAT_WS(' ', d.first_name, d.last_name) AS primary_care_doctor,
  COUNT(DISTINCT hv.visit_id) AS encounters,
  -- Gaps are counted in a correlated subquery so that visit rows
  -- cannot multiply the gap count.
  (SELECT COUNT(*)
     FROM patient_quality_gap qg
    WHERE qg.patient_id = p.patient_id
      AND qg.gap_status IN ('open', 'scheduled')) AS open_quality_gaps
FROM patient p
JOIN patient_attribution pa ON pa.patient_id = p.patient_id AND pa.active = TRUE
JOIN payer_contract pc ON pc.contract_id = pa.contract_id
JOIN doctor d ON d.doctor_id = pa.primary_care_doctor_id
LEFT JOIN hospital_visit hv ON hv.patient_id = p.patient_id
GROUP BY p.patient_id, pc.contract_name, pa.risk_score,
  d.first_name, d.last_name;
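
When a query left-joins both visits and quality gaps to the same patient, the result holds one row per visit-gap pair, so it is worth checking how aggregate counts behave under that row multiplication. The sketch below is a small check using Python's built-in sqlite3 module as a stand-in for MariaDB, with three tables reduced to the columns needed and hypothetical data: one patient with three visits and two open or scheduled gaps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY);
CREATE TABLE hospital_visit (visit_id INTEGER PRIMARY KEY, patient_id INTEGER);
CREATE TABLE patient_quality_gap (patient_id INTEGER, gap_status TEXT);
-- Hypothetical data: one patient, 3 visits, 2 open/scheduled gaps, 1 closed gap.
INSERT INTO patient VALUES (1);
INSERT INTO hospital_visit VALUES (10, 1), (11, 1), (12, 1);
INSERT INTO patient_quality_gap VALUES (1, 'open'), (1, 'scheduled'), (1, 'closed');
""")

# Summing over the joined rows: each gap row appears once per visit row.
joined_sum = conn.execute("""
    SELECT SUM(CASE WHEN qg.gap_status IN ('open', 'scheduled') THEN 1 ELSE 0 END)
    FROM patient p
    LEFT JOIN hospital_visit hv ON hv.patient_id = p.patient_id
    LEFT JOIN patient_quality_gap qg ON qg.patient_id = p.patient_id
    GROUP BY p.patient_id
""").fetchone()[0]

# Counting gaps in a correlated subquery keeps them independent of visits.
subquery_count = conn.execute("""
    SELECT (SELECT COUNT(*) FROM patient_quality_gap qg
            WHERE qg.patient_id = p.patient_id
              AND qg.gap_status IN ('open', 'scheduled'))
    FROM patient p
""").fetchone()[0]

print(joined_sum, subquery_count)
```

With this data the joined sum is three times the true gap count, one multiple per visit, while the subquery returns the gap count itself.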


SQL
MariaDB
MySQL
DBeaver
Excel
CSV
ERD
Normalization
Foreign keys
Indexes
Views
Synthetic data
Population health
Patient attribution
ACO reporting

Data package and schema rebuild

Area	Public artifact	What it demonstrates
Synthetic source package	12 CSV files and a workbook with facilities, providers, attributed members, payer contracts, encounters, prescriptions, and quality gaps.	Shows how provider, clinical, and member data can feed an analyst-ready database without exposing real patients or operations.
Core schema	18 MariaDB tables, 31 foreign-key references, 18 check constraints, and 5 views.	Normalizes visits, diagnoses, procedures, prescriptions, provider privileges, attribution, payer contracts, and quality gaps.
Synthetic records	120 attributed patients, 320 encounters, 393 diagnosis rows, 224 prescriptions, and 170 quality-gap records.	Gives enough density to show report behavior while staying small enough for a portfolio reviewer to inspect.
Reporting outputs	High-risk patient panel, quality-gap summary, controlled-prescription history, and department resource profile.	Connects the original hospital-admin report ideas to ACO operations, population-health follow-up, and stakeholder reporting.

Role signal

Mirrors ACO analyst work with provider, clinical, and member data; attribution logic; quality-gap follow-up; report documentation; and clear summaries for operational stakeholders.

Rebuild discipline

Normalized relationship tables, consistent naming, explicit primary and foreign keys, indexes for report paths, check constraints, a deterministic data generator, and a source-file-to-SQL workflow.
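
A deterministic generator, in this context, means that rerunning the script reproduces the same synthetic files byte for byte. One common way to get that property is to draw every random value from a generator seeded with a fixed number. The sketch below shows the idea; the seed, column names, value ranges, and function names are assumptions for illustration, and only the figure of 120 patients comes from the package description above.

```python
import csv
import io
import random

def generate_patients(n_patients, seed=2024):
    """Build synthetic patient rows; the same seed always yields the same rows."""
    rng = random.Random(seed)  # local generator, independent of global random state
    rows = []
    for patient_id in range(1, n_patients + 1):
        rows.append({
            "patient_id": patient_id,
            "risk_score": round(rng.uniform(0.1, 3.0), 2),
            "gap_status": rng.choice(["open", "scheduled", "closed"]),
        })
    return rows

def to_csv(rows):
    """Serialize rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["patient_id", "risk_score", "gap_status"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

first_run = to_csv(generate_patients(120))
second_run = to_csv(generate_patients(120))
print(first_run == second_run)
```

Using a local `random.Random` instance instead of the module-level functions keeps the output stable even if other code in the same process also draws random numbers.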

Reports supported

Attributed patient panel, quality-gap summary, controlled-prescription history, doctor activity summary, and department resource profile.

Public boundary

This is a recreated portfolio version from surviving evidence and synthetic data. It is not represented as a production medical record system, a real ACO extract, or the exact group submission.

Public dashboard view

A private owner dashboard, shown safely.

The original setup was Square -> Zapier -> Google Sheet -> Python dashboard. Square activity flowed through Zapier into a private Google Sheet, then a Python Dash app used Plotly charts, Pandas data prep, and linear regression to turn that spreadsheet activity into interactive membership and revenue views. I also built an R Shiny version that read from a private Excel workbook in RStudio, added a year filter, and was designed to run locally on the owner's computer. A shinyapps.io path was explored, but the private app is no longer deployed publicly.

Data pipeline: Square transaction activity moved through Zapier into a private Google Sheet, then into the Python dashboard
Local prototype: R Shiny read a private Excel workbook, filtered by year, and plotted monthly membership trend with ggplot2
Business purpose: Help a small-business owner see membership trends, estimated revenue, short-range forecasts, and break-even context
Public visual: The public image uses synthetic data with the same Pandas, regression, Plotly, and Dash-style charting flow, then exports as a static PNG
Privacy boundary: Live sheet access, real member counts, company names, and private operating details are removed
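
The short-range forecast idea can be illustrated with a straight-line trend fitted to monthly member counts and extended one month ahead. The sketch below uses plain Python with no Pandas or Plotly, fully synthetic counts, and an assumed flat monthly fee, so it shows the regression step only and does not reproduce the private dashboard's logic.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic monthly member counts (month index 0..5), not real business data.
months = [0, 1, 2, 3, 4, 5]
members = [100, 104, 108, 112, 116, 120]

intercept, slope = fit_line(months, members)
forecast_next_month = intercept + slope * 6

# Assumed flat monthly fee, used only to turn a count into estimated revenue.
monthly_fee = 50.0
estimated_revenue = forecast_next_month * monthly_fee
print(f"slope={slope:.1f} forecast={forecast_next_month:.0f} revenue={estimated_revenue:.0f}")
```

A linear trend is only reasonable over a short horizon; seasonality or a price change would call for a different model.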
