Chapter 7 Conclusion

According to the results of our analysis, we derive three actionable insights that can help people better stay away from collisions in NYC:

  • During summer time, don’t follow too closely while driving.

  • During rush hours, especially on your way home, stay more vigilant while driving.

  • No evidence has been found to prove the correlation between precipitation and collision caused by distraction. So don’t blame bad weather.

7.1 Limitation of this analysis

There are two types of limitations of our analysis:

  • Technical limitation: due to high volume of missing location variable, especially borough, we are restrained from doing cross-borough analysis, such as locating the most collision-prone borough. Instead, we use map plots to show more detailed spread of collision over the entire NYC. However, while map plots can do a better job identifying spatial clusters, it carries less values in cross variables analysis.

  • Analytical limitation: since traffic is a very complicated system, thus a complicated topic indeed, without further support from more reliable and granular data sources, which are usually not open to public, some of the conclusions are valid only under the context of the datasets we use. For example, when we say precipitation is not a contributing factor to distraction, what we really mean is that precipitation level is not correlated to the number of collisions in our datasets. Until further data comes in, we don’t have enough resources to further decode what constitutes of collision.

7.2 Future directions

  • First, we could utilize paid services such as Google Map API to recover the missing location variables, making the entire analysis more exhaustive.

  • Second, we could combine our analysis with more diverse data sources, trying to disclose more potential influencers of traffic collisions in NYC.

  • Third, we could make this type of analysis a regular task onwards. Because more data usually means more space for more advanced conclusions.

  • Last, using hexagon heat map, we find pattern of accidents with time, locations, severity and contributing factors. In the future, we could further explore the potential reason for the clusters. For example, unthoughtful design of a certain intersection might cause a high number of collisions.

7.3 Lesson learned

  • Pre-project brainstorming is crucial for the success of the project because starting off on a right direction under a right analytical structure definitely saves a lot of time from correcting ourselves and editing on meaningless insights.

  • Keeping all the data preparation in one space makes the entire work flow more efficient. Since, the datasets we use contain lots of un-labeled missing values, duplicated factor and distracting variables, it is exceptionally crucial for us to make our datasets analysis-ready before the actual work begins. Keeping all the data preparation not only allows us to save time from doing duplicated data cleaning, but also makes sure the way we prepare our data is consistent.

  • By comparing different packages with different features helps us to understand the pros and cons of each one of them and find the most suitable one for our goal and data. For example, when exploring the ways to represent spatial data of collision, we try several different approaches. First, we try to use the “choroplethrZip” package to create a zip code choropleth, but we realize the data has many missing zip codes and contain couple zip codes that were not include in the package. Then we turn to ggmap along with ggplot2, which allows us to produce hexagon heatmaps over Google maps. We also explore several options to make an interactive map, such as leaflet, mapdeck, and plot_ly. Yet, they have limitations and often required additional data requirements to achieve the result we want. Finally, we found that using shiny along with ggmap plus ggplot2 can easily provide us interactive map with value-colored hexagon heatmaps.