About the company:

Bellabeat is a wellness company headquartered in San Francisco that develops wearable computers for women. It is a successful small company with the potential to become a larger player in the global smart device market.

Ask

Key Stakeholders:

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: Mathematician and cofounder; key member of the Bellabeat executive team
  • Bellabeat marketing analytics team

Business Task:

Analyze data on how non-Bellabeat consumers use their health tracking devices in order to identify potential growth opportunities and recommend next steps for the marketing strategy.

Prepare

The data for this analysis comes from the FitBit Fitness Tracker Data set on Kaggle (CC0: Public Domain, made available through Mobius). It contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Exploring the credibility of the data:

  • Demographic information is missing: gender and age are not included
  • Contextual metadata such as location, weather, and lifestyle is not present
  • Small sample size: 30 users is not representative of all health tracker users
  • Short data collection period: there are only 3 months’ worth of data
  • Inconsistencies: Some people don’t wear their watches daily; some may wear them for only part of an activity or part of their sleep and then remove them - we don’t know.
  • Data is not necessarily current as it is 6 years old. Trends may be different in current times.

Installing and loading packages:

Setting up my R environment by installing and loading the ‘tidyverse’ package, which bundles ‘readr’, ‘dplyr’, and ‘ggplot2’, so they do not need to be installed or loaded separately

# Install once per machine; tidyverse includes readr, dplyr, and ggplot2
install.packages("tidyverse")
# Attach the core tidyverse packages for this session
library(tidyverse)

Importing datasets:

The data was imported and turned into data frames with simplified names for a more straightforward analysis

# read_csv() comes from readr, which is already loaded above
steps <- read_csv("Zip Data/dailySteps_merged.csv")
activity <- read_csv("Zip Data/dailyActivity_merged.csv")
calories <- read_csv("Zip Data/dailyCalories_merged.csv")
intensities <- read_csv("Zip Data/dailyIntensities_merged.csv")
sleep <- read_csv("Zip Data/sleepDay_merged.csv")
weight <- read_csv("Zip Data/weightLogInfo_merged.csv")

Process

I already viewed and explored the data in Google Sheets, so here I only need to confirm that everything imported correctly using the view() and head() functions.

head(activity)
## # A tibble: 6 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
view(intensities)
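
Beyond eyeballing the first rows, a few programmatic checks can help confirm the import. The sketch below is only an illustration: toy_activity is a small made-up data frame (not the real file) that borrows the Id, ActivityDate, and TotalSteps column names, and the choice of checks (missing values, duplicate Id/date rows, date range) is my own assumption about what is worth verifying. It uses base R only.

```r
# Toy stand-in for the imported `activity` data frame (values are made up).
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016", "4/14/2016"),
  TotalSteps = c(13162, 10735, NA, 5000, 7200)
)

# Parse the month/day/year strings into real Date values.
toy_activity$ActivityDate <- as.Date(toy_activity$ActivityDate, format = "%m/%d/%Y")

# Count missing values in each column.
missing_per_column <- colSums(is.na(toy_activity))

# Count rows that repeat an earlier Id/date combination.
duplicate_rows <- sum(duplicated(toy_activity[, c("Id", "ActivityDate")]))

# First and last day covered by the data.
date_range <- range(toy_activity$ActivityDate)

print(missing_per_column)
print(duplicate_rows)
print(date_range)
```

The same three checks could be pointed at the real data frames once they are loaded; parsing the dates early also makes later date-based grouping and plotting sort chronologically.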

Exploring and Summarizing Data

Viewing Feature Use:

Using the n_distinct() function to count the distinct users in each table, which shows which Fitbit features were used more than others.

n_distinct(activity$Id)
## [1] 33
n_distinct(calories$Id)
## [1] 33
n_distinct(intensities$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(steps$Id)
## [1] 33
n_distinct(weight$Id)
## [1] 8

These counts show that all 33 users (100%) used the ‘Activity’, ‘Calories’, ‘Intensities’, and ‘Steps’ features. About 73% of users (24) used the ‘Sleep’ feature, and only 24% of users (8) used the ‘Weight Log’ feature. Note that the data contains 33 distinct user Ids, not the thirty stated in the dataset description.
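
The percentages quoted above can be reproduced from the n_distinct() results. This is a minimal sketch that hard-codes the counts reported in the output; the vector name feature_users is hypothetical, and the largest count (33) is assumed to represent all users.

```r
# Distinct user counts as reported by n_distinct() above.
feature_users <- c(Activity = 33, Calories = 33, Intensities = 33,
                   Sleep = 24, Steps = 33, Weight = 8)

# Share of all users (taken as the largest count) who used each feature,
# rounded to whole percentages.
feature_share <- round(feature_users / max(feature_users) * 100)
print(feature_share)
```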

Checking Out Some Summaries

activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes, Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900
sleep %>%
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

Some Interesting Notes From These Summaries:

  • The average total steps per day is 7,638, which is slightly lower than the CDC’s recommendation. The cited research found that taking 8,000 steps per day was associated with a 51% lower risk of all-cause mortality, and taking 12,000 steps per day with a 65% lower risk, both compared with taking 4,000 steps.

  • On average, participants sleep once per day for almost exactly 7 hours (419.5 minutes). This is right at the CDC’s recommended minimum of 7 hours of sleep for most adults.
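
The sleep figure comes from converting the summary means from minutes to hours. A small sketch of that arithmetic, using the two means from the summary() output above (the variable names are hypothetical):

```r
# Means taken from the sleep summary() output above, in minutes.
mean_minutes_asleep <- 419.5
mean_minutes_in_bed <- 458.6

# Convert the average time asleep to hours.
mean_hours_asleep <- mean_minutes_asleep / 60

# Average time spent in bed but not asleep, in minutes.
mean_minutes_awake_in_bed <- mean_minutes_in_bed - mean_minutes_asleep

print(round(mean_hours_asleep, 2))
print(round(mean_minutes_awake_in_bed, 1))
```

The second value suggests that, on average, participants spend a little under 40 minutes in bed without sleeping.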

Analyze

Classifying Users:

We can classify each daily record into the ‘Sedentary’, ‘Lightly Active’, ‘Fairly Active’ and ‘Very Active’ categories based on that day’s step total. This helps determine what types of people generally use health tracking devices.

# case_when() checks the conditions in order, so each threshold only needs
# an upper bound; anything unmatched (e.g. a missing value) is labelled "NA"
steps_new <- steps %>%
  mutate(Category = case_when(
    StepTotal < 5000   ~ "Sedentary",
    StepTotal < 7500   ~ "Lightly Active",
    StepTotal < 10000  ~ "Fairly Active",
    StepTotal >= 10000 ~ "Very Active",
    TRUE               ~ "NA"))
view(steps_new)

Calculating the percentage of daily records in each category to determine which activity level is the most common in this Fitbit user sample.

categories <- c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")
# Share of rows in each category, as a percentage rounded to 2 decimals
percentages <- sapply(categories, function(category) {
  round(mean(steps_new$Category == category) * 100, 2)
}, USE.NAMES = FALSE)
category_percentages <- data.frame(categories, percentages)

This shows that the two outer groups (‘Sedentary’ and ‘Very Active’) each account for about 32% of daily records, whereas the two inner groups (‘Lightly Active’ and ‘Fairly Active’) account for only 17-18% each.
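
The step thresholds used above can also be expressed with base R’s cut(), which bins a numeric vector in one call. The sketch below is an alternative illustration rather than the method used in this analysis; toy_steps is a made-up vector chosen to sit on the category boundaries.

```r
# Made-up daily step totals that sit on the category boundaries.
toy_steps <- c(0, 4999, 5000, 7499, 7500, 9999, 10000, 15000)

# right = FALSE makes each interval [lower, upper), so 5000 is
# "Lightly Active" and 10000 is "Very Active".
toy_category <- cut(
  toy_steps,
  breaks = c(0, 5000, 7500, 10000, Inf),
  labels = c("Sedentary", "Lightly Active", "Fairly Active", "Very Active"),
  right = FALSE
)

# Percentage of records in each category.
category_share <- round(prop.table(table(toy_category)) * 100, 2)
print(category_share)
```

Because cut() returns an ordered set of factor levels, the categories also keep their logical order in tables and plots instead of sorting alphabetically.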

Identifying Relationships Between Variables

Using the merge() function to join the two data frames (by default, on the columns they share) to determine whether daily steps and user category correlate with daily calories burned.

steps_calories <- merge(x = steps_new, y = calories, all = TRUE)

Using the cor() function to determine if there is a positive correlation between steps taken and calories burned.

cor(x = steps_calories$StepTotal, y = steps_calories$Calories)
## [1] 0.5915681
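
Two follow-up points are worth sketching here. First, squaring the coefficient gives the share of variance in calories that a linear relationship with steps would account for. Second, because the merge used all = TRUE, unmatched rows would introduce missing values, and cor() returns NA unless told how to handle them. The toy vectors below are made up purely to illustrate the use argument.

```r
# Correlation coefficient reported above.
r <- 0.5915681

# Coefficient of determination: proportion of variance explained.
r_squared <- r^2
print(round(r_squared, 2))

# Made-up data with one missing step count, perfectly linear otherwise.
toy_steps <- c(1000, 2000, 3000, 4000, NA)
toy_calories <- c(1500, 1600, 1700, 1800, 1900)

# With the default settings a single NA makes the result NA.
cor_with_na <- cor(toy_steps, toy_calories)

# use = "complete.obs" drops incomplete pairs before computing.
cor_complete <- cor(toy_steps, toy_calories, use = "complete.obs")

print(cor_with_na)
print(cor_complete)
```

In other words, steps alone account for roughly a third of the variation in calories burned in this sample, which leaves plenty of room for other factors.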

Share

Visualization

The correlation coefficient shows that these variables have a moderate positive correlation. This makes sense: the more active we are, the more calories we burn. Let’s use the ggplot() function to create a quick scatter plot to visualize the positive correlation.

steps_calories_plot <- ggplot(data = steps_calories, aes(x = StepTotal, 
                                                         y = Calories)) + 
  geom_point() + geom_smooth()
print(steps_calories_plot + labs(title="Steps Taken vs. Calories Burned"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Lastly, let’s see if people work out more when they wear the Fitbit by viewing the trend in average total steps from the start of the study to the end.

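steps_average <- steps %>% 
  # Parse the dates (assumed month/day/year strings, like ActivityDate above)
  # so the x-axis is chronological rather than alphabetical
  mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
  group_by(ActivityDay) %>%
  summarise(Average = mean(StepTotal))
steps_average_plot <- ggplot(data = steps_average, aes(x = ActivityDay, 
                                                     y = Average)) + 
  # geom_col() draws bars from precomputed values, avoiding the
  # "unknown parameters" warning from geom_histogram(stat = "identity")
  geom_col(fill = 'darkblue') +
  theme(axis.text.x = element_text(angle = 90))
print(steps_average_plot + labs(title = "Average Total Steps Per Day"))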

After visualizing Average Total Steps Per Day, I found no major upward or downward trend from the start of the study period to the end. This suggests that simply wearing the Fitbit doesn’t necessarily motivate users to be more active.
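
A visual check for a trend can be backed up with a simple linear fit of average steps against the date: a slope close to zero relative to the daily averages points to no meaningful trend. The sketch below uses made-up daily averages (toy_days and toy_average are hypothetical, not the study data) and base R’s lm().

```r
# Ten consecutive made-up days and their average step counts.
toy_days <- as.Date("2016-04-12") + 0:9
toy_average <- c(7500, 7700, 7400, 7600, 7550, 7650, 7450, 7600, 7500, 7550)

# Regress the daily average on the day number; the slope is the estimated
# change in average steps per additional day.
trend_model <- lm(toy_average ~ as.numeric(toy_days))
slope_per_day <- coef(trend_model)[[2]]

print(slope_per_day)
```

With the real steps_average data, summary() on the fitted model would also report whether the slope differs significantly from zero.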

Saving Visualizations For Further Use

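# ggsave() saves the most recently displayed plot by default, so pass each
# plot explicitly to avoid writing the same image to both files
ggsave("steps_calories_plot.png",
       plot = steps_calories_plot + labs(title = "Steps Taken vs. Calories Burned"))
## Saving 7 x 5 in image
ggsave("steps_average_plot.png",
       plot = steps_average_plot + labs(title = "Average Total Steps Per Day"))
## Saving 7 x 5 in image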

Act

Summarizing Recommendations

  • The most-used features were ‘Activity’, ‘Calories’, ‘Intensities’, and ‘Steps’. These would be beneficial features to include in Bellabeat’s products. The least-used feature was ‘Weight Log’.
  • Fitbit users ranged from ‘Sedentary’ to ‘Very Active’. This indicates that features and goals on Bellabeat products should be customizable to each individual’s activity level.
  • A positive correlation was found between total steps taken and calories burned. A reminder or prompt to get more steps throughout the day could help users burn more calories.
  • Further research geared more specifically to Bellabeat’s audience would be beneficial, especially with a sample of all women as trends, feature use, and recommendations may differ between men and women.