University of San Diego - M.S. Applied Data Science Portfolio
1
About
1.1
Acknowledgments
2
Probability and Statistics
2.1
Real Estate Prices in Los Angeles County
2.2
Abstract
2.3
Real Estate Prices in Los Angeles – A Case Study from Redfin
2.4
Exploratory Data Analysis
2.5
Generalized Linear Model (GLM)
2.6
Limitations
2.7
Regression Analysis
2.8
Conclusion
2.9
References
3
Data Science Programming
4
Foundations of Data Science and Data Ethics
4.1
Walmart Sales Forecasting
4.2
Business Understanding
4.2.1
Background
4.2.2
Business objectives and success criteria
4.2.3
Inventory of resources
4.2.4
Requirements, assumptions, and constraints
4.2.5
Risks and contingencies
4.2.6
Terminology
4.2.7
Data mining goals and success criteria
4.2.8
Project plan and order of tasks
4.3
Data Understanding
4.3.1
Initial data collection report
4.3.2
Data description report
4.3.3
Data exploration report
4.3.4
Data quality report
4.3.5
References
5
Applied Data Mining
5.1
Predicting Student Performance in a Portuguese Secondary Institution
5.2
Abstract
5.3
Predicting Student Performance in a Portuguese Secondary Institution
5.4
Methodology
5.5
C5.0
5.6
CART
5.7
Logistic Regression
5.8
Random Forest
5.9
Naïve Bayes
5.10
Neural Network
5.11
Results
5.12
Conclusion
5.13
References
6
Applied Predictive Modeling
6.1
Predicting Cervical Cancer From Biopsy Results
6.2
Abstract
6.3
Background - Predicting Cervical Cancer From Biopsy Results
6.4
Exploratory Data Analysis (EDA)
6.5
Preprocessing
6.6
Principal Component Analysis (PCA)
6.7
Train-Test Split and Class Imbalance
6.8
Methodology: Metrics and Train Control Parameters
6.9
Models and Their Methods
6.9.1
Generalized Linear Model (GLM)
6.9.2
Linear Discriminant Analysis (LDA)
6.9.3
Mixture Discriminant Analysis (MDA)
6.9.4
Partial Least Squares Discriminant Analysis
6.9.5
Nearest Shrunken Centroids
6.9.6
Neural Network
6.9.7
GLMNET – A Penalized Model
6.9.8
Random Forest
6.9.9
K-Nearest Neighbors
6.9.10
Naïve Bayes
6.9.11
Support Vector Machines
6.10
Results – Model Summary Statistics and Performance Metrics
6.11
Conclusion
6.12
References
7
Machine Learning and Deep Learning for Data Science
7.1
In-Vehicle Marketing Engagement Optimization
7.2
Abstract
7.3
Background: In-Vehicle Marketing Engagement Optimization
7.4
Exploratory Data Analysis (EDA)
7.5
Pre-Processing
7.6
Models
7.7
Results – Model Summary Statistics and Performance Metrics
7.8
Conclusion
7.9
References
8
Applied Data Science for Business
8.1
Los Angeles County Building and Safety Permits (New Buildings Only)
8.2
Business Problem
8.3
Dataset
8.4
Preprocessing Step
8.5
Tableau Dashboard
8.6
References
9
Applied Time Series Analysis
9.1
Litecoin Cryptocurrency Forecast – Variations on the Autoregressive Moving Average Model: A Time Series Analysis
9.2
Abstract
9.3
Background: LTC Forecast - Variations on the Autoregressive Moving Average Model
9.4
Literature Review
9.4.1
Existing and Alternative Methods
9.5
Forecasting Prices with R
9.6
Forecasting Comparison by Bayesian Time-Varying Volatility Models
9.7
Half-Life Volatility Measure
9.8
Exploratory Data Analysis (EDA) and Initial Preprocessing Steps
9.9
Figure 1
9.10
Figure 2
9.11
Spectral Analysis Cyclical Behavior Periodogram Filters
9.12
Methodology
9.13
Differencing and Stationarity
9.14
Figure 3
9.15
Figure 4
9.16
ARIMA Models
9.17
GARCH Model
9.18
Figure 5
9.19
Figure 6
9.20
Summarized Results
9.21
Figure 7
9.22
Figure 8
9.23
Figure 9
9.24
Limitations
9.25
Conclusion
9.26
References
10
Data Science with Cloud Computing
10.1
Impacting the Business with a Distributed Data Science Pipeline
10.2
San Diego Street Conditions Classification
10.3
Abstract
10.4
Problem Statement
10.5
Goals
10.6
Non-Goals
10.7
Data Sources
10.8
Data Exploration
10.9
Exploratory Data Analysis (EDA)
10.10
Summary Statistics and Outlier Detection
10.11
Data Ingestion
10.12
GitHub Repository Information
10.13
Bias Exploration
10.14
Class Imbalance
10.15
Measuring Impact
10.16
Security Checklist, Privacy and Other Risks
10.17
Data Preparation and Data Scrubbing vis-à-vis Pre-Processing
10.18
Balancing the Dataset
10.19
Train, Test, Validation Splits
10.20
Data Training and Modeling (Classical Approach)
10.21
Hyperparameters
10.22
Data Training on Refined Algorithm Conducive to a Cloud-Centric Environment
10.23
Parameters
10.24
Instance Size and Count
10.25
Model Evaluation
10.26
Future Enhancements
10.27
Enhancement #1: Standardizing/Normalizing The Data
10.28
Enhancement #2: Different Algorithms and Tuning Mechanisms
10.29
Enhancement #3: Additional Features
10.30
Data Inspection Report
10.31
References
11
Applied Text Mining
11.1
Classifying Emotions in Tweets
11.2
Data Source
11.3
Logistic Regression Results
11.4
Topic Modeling
11.5
Next Steps
12
Capstone Project
12.1
Identifying Safer Pedestrian Routes in Los Angeles
12.2
Abstract
12.3
Keywords
12.4
Introduction
12.5
Background
12.5.1
Problem Identification and Motivation
12.5.2
Definition of Objectives
12.6
Literature Review
12.6.1
SafeRoute: Learning to Navigate Streets Safely in an Urban Environment
12.6.2
Predicting Secure and Safe Route for Women using Google Maps
12.6.3
Applying Google Maps and Google Street View in Criminological Research
12.6.4
Algorithm to Determine the Safest Route
12.6.5
Envision of Route Safety Direction Using Machine Learning
12.7
Methodology
12.8
Data Acquisition and Aggregation
12.8.1
Data Quality
12.8.2
Feature Engineering
12.8.3
Modeling
12.9
Results and Findings
12.9.1
Evaluation of Results
12.10
Discussion
12.10.1
Conclusion
12.10.2
Recommended Next Steps/Future Studies
12.11
References
3
Data Science Programming
Jupyter Notebook
Video Presentation
Leonid Shpaner and Jose Luis Estrada