A project exploring NYC Bike Share data to demonstrate processing, cleaning, and visualizing large open datasets.
Huge credit to Citi Bike System Data New York City for publishing their bike ride data, which is used as the data source for this project.
The main objective was to visualize geospatial data in Python.
The monthly bike ride data is huge: each trip data file contains one record per ride, which, depending on the season, amounts to roughly two million records every month. It's a standard bike share system with fixed stations, where a user takes a bike from one dock and returns it to another using a key fob or a code. For each ride, the station and time the ride began and ended are recorded.
For this data visualization project a few assumptions were made so as to map only a sample of the data.
1. Only data from the month of December was used for years 2018, 2019 & 2020.
2. The data was cleaned and grouped by Station Name and Station ID.
3. The resulting cleaned data lists every bike station from which at least one trip originated, together with the number of trips recorded at that station.
For this project, Python and its packages were used to read, explore, clean, process, and visualize the data.
The packages used were: Pandas, Seaborn, GeoPandas, Matplotlib, Contextily, and Folium.
The logical steps taken for this task are:
Read in the downloaded data and display it using Pandas.
# Read and display the data
import pandas as pd
data = pd.read_csv("201812-citibike-tripdata.csv")
We display the first 5 rows using the head() method.
data.head()
We can get some information about the columns with the info() method.
data.info()
The output shows the number of rows (just over a million) and the number of columns.
We can now prepare the data further using Pandas and explore some descriptive information from it, using Seaborn and Matplotlib to plot graphs of the busiest days and the top 10 busiest stations.
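The aggregation behind those plots can be sketched as follows; a tiny synthetic DataFrame stands in for the real trip file, and the `starttime` and `start station name` column names are assumed from the Citi Bike schema (the post itself also uses Seaborn for styling):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in rows; the real file has 'starttime' and
# 'start station name' columns in the Citi Bike schema.
data = pd.DataFrame({
    "starttime": pd.to_datetime(["2018-12-01 08:00", "2018-12-01 09:15",
                                 "2018-12-02 17:30", "2018-12-03 07:45"]),
    "start station name": ["A", "A", "B", "C"],
})

# Trips per day of the week -> "busiest days"
by_day = data["starttime"].dt.day_name().value_counts()

# Top stations by number of departures
top_stations = data["start station name"].value_counts().head(10)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
by_day.plot.bar(ax=axes[0], title="Trips per weekday")
top_stations.plot.bar(ax=axes[1], title="Top start stations")
fig.tight_layout()
fig.savefig("exploration.png")
```

On the full December file the same two `value_counts()` calls produce the busiest-day and top-10-station rankings directly.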
After cleaning, we group the data by Station ID and Station Name, aggregate it, and write the result to an output CSV file.
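A minimal sketch of that group-and-aggregate step, again with a hypothetical in-memory frame in place of the real file (the column names follow the Citi Bike schema but are assumptions here):

```python
import pandas as pd

# Tiny synthetic stand-in for the real trip file (hypothetical rows).
data = pd.DataFrame({
    "start station id": [72, 72, 79, 82],
    "start station name": ["W 52 St & 11 Ave", "W 52 St & 11 Ave",
                           "Franklin St & W Broadway", "St James Pl & Pearl St"],
    "start station latitude": [40.767, 40.767, 40.719, 40.711],
    "start station longitude": [-74.002, -74.002, -74.006, -73.999],
})

# Group rides by station and count the trips that originated there,
# keeping one latitude/longitude pair per station for mapping later.
stations = (
    data.groupby(["start station id", "start station name"], as_index=False)
        .agg(trips=("start station id", "size"),
             latitude=("start station latitude", "first"),
             longitude=("start station longitude", "first"))
)

stations.to_csv("stations_2018_cleaned.csv", index=False)
print(stations)
```

The `trips` count per station is exactly the "number of trips recorded from the station" described above.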
We use GeoPandas to read in the cleaned data, convert it to a GeoDataFrame, and assign a coordinate reference system (CRS) for our mapping.
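That conversion can be sketched like this; the station records are hypothetical stand-ins for the cleaned CSV, and the usual pattern is to tag the lon/lat points as WGS84 (EPSG:4326) and then project to Web Mercator (EPSG:3857) for basemap tiles:

```python
import pandas as pd
import geopandas as gpd

# Hypothetical cleaned station records (column names are assumptions).
df = pd.DataFrame({
    "station": ["W 52 St & 11 Ave", "Franklin St & W Broadway"],
    "trips": [120, 95],
    "longitude": [-74.002, -74.006],
    "latitude": [40.767, 40.719],
})

# Build point geometries from lon/lat, tag them as WGS84 (EPSG:4326),
# then project to Web Mercator (EPSG:3857) for web basemaps.
geodata = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",
).to_crs(epsg=3857)
print(geodata.crs)
```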
We display the first 5 rows using the head() method; the GeoDataFrame has a geometry column that allows us to plot the data on a map.
geodata.head()
We check the CRS of our GeoDataFrame using the .crs attribute, which shows the EPSG code of the data.
print(geodata.crs)
The CRS of our GeoDataFrame, Web Mercator (EPSG:3857), is a projected coordinate system used for rendering maps in services such as Google Maps and OpenStreetMap; it is widely used in web mapping and visualization applications.
We will use the GeoDataFrame to plot a static map with a basemap as well as an interactive map.
A static map plotted with Matplotlib and Contextily using the cleaned 2018 bike station data.
An interactive map plotted with Folium showing the busiest bike stations in 2018, 2019, and 2020. You can interact with the map, change the basemap, and toggle the layer for the year of interest.
Explore the interactive map further here.
The cleaned datasets, python scripts and notebooks used for this project can be accessed here.
Explore Bike share data: this blog was a helpful guide when processing the data using Pandas.