Thursday, January 12, 2023

Case Study: How Does a Bike-Share Navigate Speedy Success?

This is a case study performed for the Google Data Analytics Professional Certificate. For the purposes of this case study, I am a junior data analyst working for Cylistic, a bike share company located in Chicago. I am a member of the Cyclistic marketing analytics team. I joined the team 6 months ago and have spent that time learning about the company, its mission, and its long-term goals. My job is to help my company achieve its goals by helping to focus its marketing strategies.

Cyclistic's director of marketing strongly believes the future success of the company depends on maximizing annual memberships:
Therefore, [my marketing analytics] team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, [my] team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve [my] recommendations, so they must be backed up with compelling data insights and professional data visualizations.
The stakeholders for this project are Lily Moreno, the director of marketing and my direct manager, and the Cyclistic executive team, described as "notoriously detail-oriented."
 
The materials for the case study provide a brief history of Cyclstic and the goals of this project:
About the Company
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

 Our analytics team has been given 3 questions whose answers will guide the future marketing program:

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

The business task 

Lily Moreno has assigned the first question to me. The business task that I am looking to solve is to determine how members and casual riders use the company’s bikes differently, with the ultimate goal of developing strategies to convince casual riders to convert to annual members. I will present my findings to my boss Lily Moreno, the company's director of marketing, and the Cyclistic executive team.

Accessing the data

The data available to me to answer this question is in CSV files, each containing a month’s worth of data. The URL to access these files was

I have been asked to download the 12 most recent monthly files. As I am performing this analysis in November 2022, the files I downloaded were from 11/2021 – 10/2022. I made a copy of each file as an Excel spreadsheet, the format and tool I will use for my analysis. I kept the original CSV files in a separate folder as a backup.

Since, for the purposes of this study, the data was downloaded from Cyclistic's server and was their own internal data, I considered it reliable. As it contained raw trip data, I wasn't worried about bias affecting the data. Also, the files contained no personal data on individual users, so there were no privacy concerns. Finally, as there were thousands of records of individual bike rides, I felt that I had more than enough data available to me to perform meaningful analysis.

Organizing and cleaning the data

To begin my work, I cleaned the data in the spreadsheets individually. I first checked that the column headings were uniform across all 12 sheets. They were. The headings were: ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, and member_casual. I decided that for my analysis, I would concentrate on when and where a ride started and ended, what type of bike was used, and whether the ride was by a member or casual rider.

The ride_id column contained unique alpha-numeric ID numbers for each ride. I checked for duplicates or blanks in this column in each sheet. There were none. I checked the rest of the data to make sure that the numbers were properly formatted throughout the sheet.

To aid my analysis, I added columns for ride_length (by time as described below) and week_day day of the week), arrived at for each ride with the formula =TEXT(C*,"ddd") where C was the started_at column, and the result was the 3 letter day of the week the ride started.

One area where I found the data was lacking was with the start_station and end_station columns. They were blank for a significant amount of rides. I was interested in this data in that identifying the most active stations for casual users might be useful for targeting those stations for promotions to convert casual users to annual members. For casual_riders, start_station_name information was missing from 15.5% of the results. That number jumped to 18.2% for the end_station_name for casual riders. This was clearly due to the limitations of the GPS tracking, as the lines that lacked this information also had truncated GPS results.

The started_at and ended_at columns contained a date and time for the start and end of each ride. I created a new column for ride_length, which was arrived at by subtracting started_at from ended_at and formatting the result as a time. This was not only useful for my future analysis, but also for further cleaning the data. All of the spreadsheets had some rows with 0:00 results. I removed those rows from the sheets. Some of the spreadsheets had negative ride lengths, where the start time was later than the end time. I removed those rows as well. The total amount of rows removed per month was as follows:
  • 11/2021  53 records with negative ride lengths, 33 with 0:00. 86 total
  • 12/2021  21 records with 0:00 trip duration
  • 01/2022   5 records with 0:00 trip duration
  • 02/2022  5 records with 0:00 trip duration
  • 03/2022  2 records with negative ride lengths, 3,107 with 0:00. 3,109 total
  • 04/2022  31 records with 0:00 trip duration
  • 05/2022  47 records with 0:00 trip duration
  • 06/2022  12 records with negative ride lengths, 54 with 0:00. 66 total
  • 07/2022 16 records with negative ride lengths, 56 with 0:00, 72 total
  • 08/2022  15 records with negative ride lengths, 62 with 0:00, 77 total
  • 09/2022  9 records with negative ride lengths, 63 with 0:00, 72 total
  • 10/2022   4 records with negative ride lengths, 61 with 0:00, 65 total
March 2022 was the only month that required significant deletion of records with 3,109 total. However, there were still 280,933 remaining, so the deletions represented just over 1% of the month's records.

Once I was done with the cleaning and my additions, the spreadsheets I was working with looked like the example from 11/2021 shown below, with the columns I used in my analysis highlighted in green:

Click on any image in this case study
to see a larger version of that image


Working with the data

Now that my files were cleaned, and prepared I created pivot tables in each monthly spreadsheet. Below is my pivot table for the month of 11/2021.


Since the specific business task assigned to me was to determine how members and casual riders use the bikes differently, I wanted to see at a glance the total number of rides for members and casual riders for the month. This was further broken down by the day of the week and the type of bike used.

As the ultimate goal was to convert casual riders into yearly members, I thought it might be useful to have the information for start_station and end_station  ranked in reverse order by the number of rides. Each month would have a list for both start_station and end_station. The example below is my start_station ranking for 11/2021:

 
Including the number of blank entries with no station information allowed me to calculate the percentage of blank records for each month.

Once I had these pivot tables for each of the 12 months, I wanted to amalgamate all the data into one central spreadsheet which I could use for my analysis. I created some tables where I broke down the combined information in different ways that I thought I would find useful:
 


My Analysis

Now I had the information I needed to answer the question my supervisor had assigned to me: How do annual members and casual riders use Cyclistic bikes differently?

In looking at my tables, one glaringly obvious difference was in seasonal usage. The busiest months for Cyclistic were June, July, and August. The split in usage between annual members and casual riders was fairly close, with casual riders taking 48% of the rides in June, 49% in July, and 46% in August. However, in the 3 slowest months, December, January, and February, casual riders made up only 28%, 18%, and 19% of total riders. This disparity is clearly demonstrated in the chart below:

Another difference in usage between annual members and casual riders was by day of the week. The busiest days for annual members were weekdays, while casual rider usage peaked on weekend days. Combined with the seasonal differences outlined earlier, My theory would be that broadly speaking, annual members depended on Cyclistic more for practical tasks, such as getting back and forth to work, while casual members were more focused on recreational uses. A survey of both annual members and casual riders would be recommended to verify this theory.

One other difference between annual members and casual riders that came up in the data was the choice of bike type. There were 3 types of bikes: classic bikes, electric bikes, and docked bikes. Note that docked bikes are a self-service option only used by casual riders.


As demonstrated in the chart above, members slightly favor classic bikes over electric bikes. 51.2% of their rides were on classic bikes, while 48.8 % were on electric bikes. Casual riders, on the other hand, demonstrated a clear preference for electric bikes. They chose them 54.1% of the time. Classic bikes were used for 38.1 % of the rides, while docked bikes made up the remaining 7.7 % of rides. 

Key Findings and Recommendations

My specific task was to determine how casual riders and annual members use bikes differently, with the ultimate goal for my marketing analytics team to convert more casual riders to more profitable annual members.
 
I believe that the key here is to differentiate between different types of casual riders. The spike in usage on weekends and during the warm weather months, shown in the chart below, would indicate that a significant number of casual riders are likely to be tourists and weekend visitors to the city. These are obviously not good candidates to be converted into annual members.


I would therefore recommend looking first at the casual riders who use the bikes in the most similar way to annual members. Therefore, despite the fact that casual ridership shrinks in cold weather months, these seem like the best candidates to be persuaded to go for membership. Going a step further, the ones who also tend to ride on weekdays are most likely to be Chicago residents and therefore, the better candidates to convert.


Although finding casual riders who are most like members would be a good place to start, I think it would be wise to ask why city residents who use the bikes throughout the year have not signed up for yearly membership. Again, a survey of users would be invaluable for this.
 
One reason that comes to mind for their reluctance to commit would be an unwillingness to pay a large sum of money for the entire year. Even if the cost isn't an issue, there might be a reluctance to commit for a whole year in advance. To that end, I would recommend we consider offering shorter 6-month memberships.  Besides offering riders a shorter commitment while still affording Cyclistic more certainty of income, the 6-month membership might also be a good option for luring riders who avoid using the bikes during the coldest months. Presumably, they would be more comfortable with a commitment that didn't include months they wouldn't be riding.

I would also recommend different payment options for the annual for those who find a one-time payment prohibitively expensive. Perhaps the opportunity to sign up as an annual member but break down the payment into 2 or 3 smaller installments might be attractive to these potential candidates.