Data Projects
This page shows a collection of my recent data analytics work. All of these projects were built to advance the social good, in areas including public health, female rights, safer streets, equity, and sustainability.
Colorado Springs, CO - A Mobility Decision Support System at Fort Carson Army Base
Context: Because access to a military base is restricted, Fort Carson is surrounded by a ring of seven gates that control traffic in and out of the base. During hours of heavy congestion, base operators selectively close gates to redirect traffic to less congested highways. In addition, when inclement weather is forecast, the base goes through an overnight protocol to either delay opening or limit access to critical operating personnel only.
Overview: The goal of the project team and US Ignite is to develop a predictive model of traffic congestion for Fort Carson and Colorado Springs.
The predictions from this project have two use cases. First, we seek to send alerts and live map updates about predicted traffic disruptions, around two hours in advance, to help commuters driving to and from Fort Carson. Second, base operators will have predictive information at their fingertips so they can direct drivers to specific gates or close gates entirely.
Our primary data source, and the source of our response variable, is a spatiotemporal dataset of traffic jams and irregularities around Colorado Springs, provided by the Waze for Cities data-sharing program through US Ignite. We also gathered other candidate predictors, such as historical accidents, weather patterns, and roadway infrastructure features.
Several candidate models were considered. Each was trained on the panel data and compared by its test R-squared. These include:
- Standard linear regression
- Ridge and lasso regressions, regularized variants of linear regression that penalize large coefficients
- Random forest regression, an ensemble learning method that averages the predictions of many decision trees
- XGBoost (Extreme Gradient Boosting), an implementation of gradient boosted decision trees designed for speed and performance
The winning model is the random forest with jam count as the dependent variable, which achieved a test R-squared of 0.32.
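To make the evaluation metric concrete, here is a minimal sketch of how a test R-squared can be computed from held-out observations and model predictions. The function name `r_squared` and the sample numbers are illustrative assumptions and are not taken from the project's R markdown.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the observations.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Made-up jam counts for four held-out hours (illustrative only).
observed = [2, 4, 6, 8]
predicted = [3, 3, 7, 7]
print(round(r_squared(observed, predicted), 2))  # 0.8
```

Read this way, a test R-squared of 0.32 means the model's squared error on the test set is 32% lower than that of always predicting the mean of the test observations.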
Conclusion: With just a few publicly available data sources, the model explains roughly a third of the variation in traffic jam intensity, measured in Waze jam counts (R-squared of 0.32), and captures each jam's pervasiveness through roadway network effects.
However, it is also clear that the model, like any spatiotemporal analysis, cannot always predict a particularly severe case of congestion on a given day. For that, it is necessary to bring in other data sources that may not be public or are harder to obtain, such as citywide events in Colorado Springs or at Fort Carson.
Check out our interactive website and R markdown to learn more! You can also click on the screenshots below.
Female Rights In Today's Media World - A Data Perspective
How do people talk about female rights on social media and in print and broadcast media? Inspired by the Texas abortion ban and its impacts, two friends and I retrieved 2021 data from the Reddit, Twitter, News, and GDELT TV APIs and ran sentiment and coverage analyses in Python to better understand the state of female rights in the US today.
Findings: There are several main takeaways from this project. First, both mainstream media and social media reflect widespread engagement with national events. Spikes in tweets, Reddit posts, print news stories, and TV air time followed major news events, including the Texas abortion ban coming into effect and the Supreme Court taking up the case Dobbs v. Jackson Women's Health Organization. Because of the limited features and date ranges of free API searches, we were unable to perform a full sentiment analysis across sources. However, we were able to compare the mean polarity and subjectivity of headlines between Reddit and mainstream print media. As expected, the subjectivity of the Reddit post headlines (0.176) was nearly twice that of the mainstream print headlines (0.091). The Reddit headlines were also slightly more positive. A final finding of note is that Fox News gave a notably higher percentage of its air time to abortion coverage than any other network. While we were limited by the constraints of the available data, our findings still show that abortion is very much in the national conversation, and recent headlines have pushed the topic even further to the forefront of debate.
Check out our website and webapp to learn more! You can also click on the screenshots below.
Transit Equity In The City Of Los Angeles, CA
What factors affect the level of equity around transit stops? Using Los Angeles as a case study, this project examined equity-relevant factors and used cluster analysis in R to design an index that measures equity.
Bike-share Demand In Philadelphia, PA
This project used 2018 Indego bike-share data to visualize demand at more than 140 stations in Philadelphia across space and time (time of day and day of the week), broken down by user pass type. It was built with JavaScript, HTML, and CSS.
Check out the website to learn more!
Philadelphia Bike-share Demand Prediction
A bike-share system lets people borrow and ride bikes without owning one. It consists of a network of stations with docks where bikes are checked in, stored, and checked out.
Stations in different locations face different demand that varies over time. One challenge is that popular stations, which have higher demand for bikes, run out sooner if they are not replenished properly.
For bike-share companies, this shortage can increase dissatisfaction and churn among users, lowering profits. For users, it reduces the usefulness of the service and causes inefficiency and inconvenience in their work and daily life. For the city or region, bike-share cannot raise active travel if it fails to supply bikes where and when they are needed. As some users turn back to car travel, congestion and environmental problems will only worsen.
Given all these perspectives, bike-share re-balancing, which means redistributing bikes so that supply shortages at stations are minimized, is critical to making the bike-share system a success.
The re-balancing strategy here relies on trucks to collect bikes and move them to stations that need them. Because bike stations in Philadelphia are quite dispersed, relying purely on users to re-balance bikes would require relatively high incentives, adding a non-negligible cost with possibly very limited effect. By comparison, small trucks are likely to deliver more reliable outcomes.
At any given time, our goal is to predict demand over the next hour, which should give the trucks enough time to relocate bikes to hot-spot areas.
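One common way to frame a next-hour forecast like this is to pair each hour's trip count with the counts from preceding hours (time lags) as predictors. The sketch below assumes hourly trip counts for a single station held in a plain list. The function name `make_lag_rows` and the choice of lags are hypothetical and may differ from what the project script does.

```python
def make_lag_rows(hourly_counts, lags=(1, 2, 24)):
    """Build (features, target) rows for next-hour demand prediction.

    For each hour t with enough history, the features are the counts at
    t - lag for every lag, and the target is the count at hour t.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(hourly_counts)):
        features = [hourly_counts[t - lag] for lag in lags]
        rows.append((features, hourly_counts[t]))
    return rows


# Made-up hourly trip starts at one station (illustrative only).
counts = [5, 3, 8, 10, 7]
for features, target in make_lag_rows(counts, lags=(1, 2)):
    print(features, "->", target)
# [3, 5] -> 8
# [8, 3] -> 10
# [10, 8] -> 7
```

Rows built this way can be fed to any regression model, so the same table supports comparing several candidate models on held-out hours.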
You can find the script of this project on my GitHub portfolio.
SafeGraph vs. OpenStreetMap
My friend and I created this tool to compare POI data from SafeGraph and OpenStreetMap and show how complete the OpenStreetMap data is. It can help users decide which dataset to use in their application and guide OpenStreetMap editors toward the areas that need work next. We used Python, JavaScript, HTML, and CSS in this project. The data was hosted in PostgreSQL on AWS RDS, and the website was hosted on AWS Elastic Beanstalk.
We took the website down due to budget constraints, but here is a presentation that you can refer to for more information.
Arrest Risk Prediction in Chicago, IL
Concerns about racial bias in predictive algorithms have long been raised in news articles from national organizations and in research articles. A common criticism is that minority communities are disproportionately targeted for arrest, which leads to discriminatory consequences.
This bias against minority groups deepens distrust between these groups and the police, which harms the overall stability of society and makes it harder for the police to ensure a safe local environment. The problem is especially acute when selection bias produces a higher arrest rate in minority communities, and more patrols are then sent to these communities because more arrests were recorded there (even though some, or even many, of them may be false arrests).
This vicious cycle eventually results in more innocent individuals in those neighborhoods being arrested, and it challenges the principles of fairness and equity in a democratic country.
To end this cycle, selection bias in arrest-prediction algorithms needs to be fully identified and corrected.
You can find the script of this project on my GitHub portfolio.
Miami Home Sale Price Prediction
Algorithms for accurate housing price prediction are gaining increasing attention.
In this report, we construct a new hedonic model for Zillow’s housing market predictions. This task is challenging due to the number of factors that affect the real estate market and the non-linear relationship between many factors and prices.
In this model, built for Miami and Miami Beach, my teammate and I incorporate local intelligence from open data and adapt it to local housing and development patterns. To estimate home sale prices, we use determinants including internal characteristics, nearby amenities and disamenities, and spatial processes (i.e., clustering).
Applied to a set of 3,503 houses, the model predicted that home sale prices are highest along the shoreline and in Miami Beach.
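One simple way to represent a spatial process such as price clustering in a hedonic model is a "spatial lag" feature: the average sale price of the nearest neighboring sales. The sketch below is an illustration under assumptions, not the project's actual feature engineering. The function name `knn_mean_price`, the value of `k`, and the coordinates and prices are all hypothetical.

```python
import math


def knn_mean_price(target, sales, k=3):
    """Average sale price of the k sales nearest to `target`.

    target: (x, y) location in projected coordinates (e.g. meters).
    sales:  list of (x, y, price) tuples for other recorded sales.
    """
    if not sales:
        raise ValueError("at least one neighboring sale is required")
    # Sort neighbors by straight-line distance and keep the k closest.
    nearest = sorted(sales, key=lambda s: math.dist(target, s[:2]))[:k]
    return sum(price for _, _, price in nearest) / len(nearest)


# Made-up neighboring sales (illustrative only).
sales = [(0, 0, 100), (1, 0, 200), (5, 5, 900), (0, 2, 300)]
print(knn_mean_price((0, 0.5), sales, k=3))  # 200.0
```

A feature like this lets a regression pick up the tendency of high-priced homes to sit near other high-priced homes, which is one reading of the clustering mentioned above.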
You can find the script of this project on my GitHub portfolio.
Home Repair Tax Credit Program
The Home Repair Tax Credit Program, run by the Department of Housing and Community Development (DHC), serves the social good by offering tax credits to households in need so that they can repair their homes.
As a result, the value of these homes, and of surrounding homes through the spillover effect, is expected to rise. These improvements lead to gains in neighborhood aesthetics, safety, land development, and economic activity, and thus in overall tax revenue in the area. The increased tax revenue can then fund more public goods and further raise social welfare.
Given the benefits this program can bring to society, DHC's goal is to provide the tax credit to every household that needs it. To promote the program, DHC distributes marketing materials to potential households, regardless of the costs it has to pay.
We chose a logistic regression model to decide which households should be sent marketing materials. Compared with the kitchen-sink model, our final model classifies better and is more accurate in predicting both homeowners who accept and homeowners who refuse the tax credit. The new threshold (0.96) allows DHC's costs to be offset by the gains in housing value premium.
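The idea of picking a classification threshold by weighing costs against gains can be sketched as follows. This is a minimal illustration under assumptions: the function names, the candidate thresholds, the cost of contacting a household, and the gain when a household takes the credit are all hypothetical and do not reproduce the project's cost-benefit figures.

```python
def net_benefit(probs, accepted, threshold, contact_cost, gain_if_accept):
    """Net benefit of contacting every household with prob >= threshold."""
    total = 0.0
    for p, took_credit in zip(probs, accepted):
        if p >= threshold:
            total -= contact_cost        # marketing cost is always paid
            if took_credit:
                total += gain_if_accept  # gain only if the credit is taken
    return total


def best_threshold(probs, accepted, thresholds, contact_cost, gain_if_accept):
    """Return the candidate threshold with the highest net benefit."""
    return max(
        thresholds,
        key=lambda t: net_benefit(probs, accepted, t, contact_cost, gain_if_accept),
    )


# Made-up predicted probabilities and outcomes (illustrative only).
probs = [0.9, 0.8, 0.6, 0.3, 0.1]
accepted = [True, False, True, False, False]
print(best_threshold(probs, accepted, [0.25, 0.5, 0.75], contact_cost=1, gain_if_accept=3))
# 0.5
```

In this toy data a threshold of 0.5 nets 3 units, against 2 units at 0.25 and 1 unit at 0.75, which shows why the threshold that maximizes accuracy is not necessarily the one that maximizes net benefit.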
You can find the script of this project on my GitHub portfolio.
Santa Monica Spatiotemporal EMS Call Prediction
In the field of Emergency Medical Services (EMS), response time is everything. When responding to EMS calls for conditions such as cardiac arrest or stroke, minutes can be the difference between life and death.
In this environment, increasing the efficiency of ambulance response is critical to minimize the number of lives lost. To do this, my teammate and I used EMS call data to create a spatiotemporal model for forecasting future EMS calls. We performed this analysis in Santa Monica, California because it has a wealth of EMS call data that reaches back to 2009 and is updated daily.
As it is, Santa Monica dispatches ambulances from its network of fire stations and a private ambulance company as EMS calls are received. We propose to design an algorithm that forecasts the spatiotemporal patterns of EMS demand so that paramedics can arrive at predicted hotspots before emergencies occur. We believe that a more data-driven approach to ambulance dispatch can help reduce strain on the medical system.
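A typical first step toward forecasting spatiotemporal demand is to aggregate past calls into space-time bins, for example counts per grid cell and hour of day. The sketch below shows that aggregation only. The function name `bin_calls`, the 0.01-degree grid size, and the sample calls are assumptions for illustration and are not the project's actual pipeline.

```python
import math
from collections import Counter
from datetime import datetime


def bin_calls(calls, cell_size_deg=0.01):
    """Count calls per (grid cell, hour of day).

    `calls` is an iterable of (timestamp, latitude, longitude) tuples.
    A cell is identified by the integer row and column of a square grid
    whose side is `cell_size_deg` degrees.
    """
    counts = Counter()
    for ts, lat, lon in calls:
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        counts[(cell, ts.hour)] += 1
    return counts


# Made-up calls (illustrative only): two fall in the same cell and hour.
calls = [
    (datetime(2021, 3, 1, 14, 5), 34.0195, -118.4912),
    (datetime(2021, 3, 2, 14, 40), 34.0199, -118.4915),
    (datetime(2021, 3, 2, 9, 15), 34.0305, -118.4712),
]
for (cell, hour), n in sorted(bin_calls(calls).items()):
    print(cell, hour, n)
```

Counts like these can serve as the response variable for a forecasting model, with cells and hours that have consistently high counts treated as candidate hotspots.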
You can find the script of this project on my GitHub portfolio. You can also find a video for our proposed web application Siren here.
Boston Transit Oriented Development Policy Brief
Transit-Oriented-Development (TOD) is a popular planning method that uses the relationship between transportation and land use to achieve city redevelopment.
It’s commonly believed that TOD can promote business, increase land value around transit, and foster a more environmentally friendly lifestyle for people living in the TOD area. However, it is also commonly associated with gentrification, because people with higher education levels and incomes who prefer city life and commute by transit are drawn back to the TOD area and displace lower-income residents.
Boston, Massachusetts has one of the country’s most developed public transportation systems, which was built out in the late 20th century. In the past decade, the Massachusetts Bay Transportation Authority (MBTA) has worked with private developers and the city to support more than 50 TOD projects near its stations. These facts make Boston an interesting case for studying the effects of TOD.
Overall, our study shows that TOD does not have a significant influence on population growth or household income growth. Its strongest effect is in decreasing car ownership, but this influence likely depends on the density of subway lines and stations in the TOD tracts. The TOD area also seems to appeal more strongly to a highly educated population than the non-TOD area does. As for downtown, its development is the most likely factor driving household income growth and rising land value within the TOD area. As for robbery, the TOD areas in Boston don't appear to be safer than non-TOD areas, given the complicated factors that influence crime. However, high-rent neighborhoods inside the TOD area experience less robbery than high-rent neighborhoods outside it.
Based on these conclusions, we suggest that Boston increase the density of subway lines and add more stations on the fringe of the current TOD area. This is likely to encourage more people to choose more environmentally friendly ways to travel than driving. In addition, it would be preferable to locate more high-skilled jobs in the TOD area, since a more highly educated population is moving there. This can create a virtuous cycle that draws more educated people to Boston while further boosting the local economy.
It is worth noting that because this brief is based on ACS tract-level data, it inherits biases in data collection, and the reliability of the analysis is subject to the Modifiable Areal Unit Problem (MAUP) and to a possible ecological fallacy when we use mean values to represent tracts.
You can find the script of this project on my GitHub portfolio.