Strava Exploration
Analysis of my Running and Biking Data


Preface

Strava is a fitness-tracking application that I use to track runs, bike rides, and the occasional hike. The app allows users to download their activity history in bulk as individual GPX files, a GPS-enabled data format built on top of XML. Since I've been using the app for almost five years, I figured my collection of activities would be large enough to provide some interesting insights.
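For reference, a GPX file is just XML: each activity is a track containing segments of timestamped track points. A minimal hand-written example (not an actual Strava export, though the point values mirror the first row of my data) looks like:

```xml
<gpx version="1.1" creator="StravaGPX">
  <trk>
    <name>Morning Ride</name>
    <trkseg>
      <trkpt lat="34.123601" lon="-118.233569">
        <ele>170.4</ele>
        <time>2013-02-09T20:32:00Z</time>
      </trkpt>
    </trkseg>
  </trk>
</gpx>
```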

In this project I use Python to prepare my dataset, R to perform some exploratory analysis, and Leaflet.js to create an interactive map of my activities.

GPX Parsing

Python has a great library for parsing GPX files named gpxpy. Using this library along with pandas, I combined my collection of activity files (208 in total) into a tidy dataframe for further analysis.

import os
import pandas as pd
import gpxpy

# Function to parse an individual GPX file into a list of point dicts
def parseGPX(file):
    pointlist = []
    with open(file, 'r') as gpxfile:
        # Infer the activity type from the file name
        if "Run" in file:
            activity = "Run"
        elif "Ride" in file:
            activity = "Ride"
        elif "Hike" in file:
            activity = "Hike"
        else:
            activity = "NA"
        gpx = gpxpy.parse(gpxfile)
        for track in gpx.tracks:
            for segment in track.segments:
                for point in segment.points:
                    point_dict = {'Timestamp': point.time,
                                  'Latitude': point.latitude,
                                  'Longitude': point.longitude,
                                  'Elevation': point.elevation,
                                  'Activity': activity
                                  }
                    pointlist.append(point_dict)
    return pointlist

# Collect the GPX file names
gpx_dir = r'D:/strava/strava_rides'
files = os.listdir(gpx_dir)

# Call parseGPX on each file (joined to its directory) and combine into one dataframe
df = pd.concat([pd.DataFrame(parseGPX(os.path.join(gpx_dir, file))) for file in files],
               keys=files)
df.reset_index(level=0, inplace=True)
df.rename(columns={'level_0': 'File'}, inplace=True)
df.head()
  File                      Activity  Elevation  Latitude   Longitude    Timestamp
1 20130209-203200-Ride.gpx  Ride      170.4      34.123601  -118.233569  2013-02-09 20:32:00
2 20130209-203200-Ride.gpx  Ride      169.6      34.123588  -118.233539  2013-02-09 20:33:13
3 20130209-203200-Ride.gpx  Ride      169.1      34.123561  -118.23352   2013-02-09 20:33:18
4 20130209-203200-Ride.gpx  Ride      169.1      34.12351   -118.233521  2013-02-09 20:33:21
5 20130209-203200-Ride.gpx  Ride      169.6      34.123442  -118.233532  2013-02-09 20:33:24
6 20130209-203200-Ride.gpx  Ride      170.4      34.123368  -118.233528  2013-02-09 20:33:27

EDA

At this point I have a dataframe of temporal and location-based information for each activity, so to extract some real insights I'll need to add some additional variables. Using R's geosphere package I can calculate the haversine distance between consecutive lat/long coordinates for each activity. I'll also calculate differences in time and elevation between waypoints. I then create a summary dataframe with statistics on each activity, including:

  1. Total Miles Travelled
  2. Total Time Spent
  3. Average Speed (MPH)
  4. Elevation Gain/Loss
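As a cross-check on what geosphere is doing under the hood, the haversine formula itself is simple enough to sketch in Python (using 6378137 m, geosphere's default earth radius):

```python
import math

# Great-circle (haversine) distance in meters between two lat/long points,
# mirroring geosphere::distHaversine with its default radius of 6378137 m.
def haversine(lat1, lon1, lat2, lon2, r=6378137):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# One degree of longitude at the equator is roughly 111.3 km
print(round(haversine(0, 0, 0, 1)))  # 111319
```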

I'll also run a simple linear regression of time on distance to evaluate the overall consistency of both my runs and rides.

library(dplyr)
library(lubridate)
library(geosphere)

strava <- read.csv("D:/strava/strava_rides.csv")

strava <- strava %>%
  filter(Activity %in% c('Run', 'Ride')) %>%
  group_by(File) %>%
  mutate(long2 = ifelse(is.na(lag(Longitude)),Longitude,lag(Longitude)),
         lat2 = ifelse(is.na(lag(Latitude)),Latitude,lag(Latitude))) %>%
  rowwise() %>%
  mutate(dist = geosphere::distHaversine( c(Longitude, Latitude),
                                          c(long2, lat2)))

strava <- strava %>%  mutate(Timestamp = ymd_hms(Timestamp)) %>%
  group_by(File) %>%
  mutate(elev_chg = Elevation - ifelse(is.na(lag(Elevation)), Elevation, lag(Elevation)),
         time_lag = if_else(is.na(lag(Timestamp)), Timestamp, lag(Timestamp)),
         time_chg = as.numeric(difftime(Timestamp, time_lag, units = 'mins')))

# Named activity_summary so it doesn't mask base::summary
activity_summary <- strava %>%
  group_by(File, Activity) %>%
  summarise(miles = sum(dist) / 1609.34,
            start = min(Timestamp),
            end = max(Timestamp),
            mins = as.numeric(difftime(end, start, units = 'mins')),
            hrs = mins / 60,
            mph = miles / hrs,
            elev_gain = sum(elev_chg[elev_chg > 0]),
            elev_loss = sum(elev_chg[elev_chg < 0]))

ride_ols <- summary(lm(hrs ~ miles, data = filter(activity_summary, Activity == 'Ride')))
run_ols <- summary(lm(hrs ~ miles, data = filter(activity_summary, Activity == 'Run')))
rsq_df <- data.frame(Act = c('Ride', 'Run'),
                     rsq = c(round(ride_ols$r.squared, 2), round(run_ols$r.squared, 2)))

With this information I'm ready to plot the data to compare activities and get a sense of how consistent my efforts are. The plotting code is a little verbose, so it isn't shown below, but it can be viewed on GitHub.

Results

Alright! The year-by-year trend is a little sad to look at, but there are some hidden factors at play here. Biking was definitely my primary source of exercise until mid-2015, when I began training for a marathon, and after 2016 I more or less dropped biking completely in favor of indoor rowing and weight lifting. I'm sure I haven't seen the last of my bike, though!

A predictable but interesting trend is the variability of rides versus runs. It looks like I'm an extremely consistent runner when it comes to speed, validated by a 0.96 R-squared from the regression and the tight distribution of both speed and elevation changes seen in the box and violin plots. Rides are less consistent, but they're typically at least 2.5 times faster and 3-4 times longer than my runs.

Mapping

Of course, 500k rows of latitude and longitude coordinates were just begging to be mapped, so I decided to use Leaflet.js to create an interactive map. After loading a base map, Leaflet can be fed GeoJSON LineStrings that it converts to SVG paths. These paths can in turn be styled with generic CSS, which lets me use the same color scheme for rides and runs, among other things. Converting my data to GeoJSON took a little bit of effort, but the Leaflet implementation was extremely easy and is highly recommended for custom mapping projects. I used D3's JSON handler to load my GeoJSON files for rides and runs, all implemented in just a few lines of JavaScript.
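The GeoJSON conversion step might look something like this minimal Python sketch (the coordinate lists here are hypothetical stand-ins; the real script groups the dataframe by file and emits each activity's points, in order, as [longitude, latitude] pairs):

```python
import json

# Minimal sketch: one MultiLineString feature per activity type, where
# each activity's ordered (longitude, latitude) points form one line.
# These two sample points are hypothetical, not from the real dataset.
ride_lines = [
    [(-118.233569, 34.123601), (-118.233539, 34.123588)],
]

geojson = {
    "type": "Feature",
    "geometry": {
        "type": "MultiLineString",
        # GeoJSON expects [longitude, latitude] order
        "coordinates": [[list(pt) for pt in line] for line in ride_lines],
    },
    "properties": {"activity": "Ride"},
}

# json.dumps produces the text saved as rides_mls.json for the map
rides_mls = json.dumps(geojson)
```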

var ridesPath = {
  "color": "#00b2ee",
  "weight": 2,
  "opacity": 0.4
};
var runsPath = {
  "color": "#bcee68",
  "weight": 2,
  "opacity": 0.4
};

var map = new L.Map("map").setView([34.0195, -118.4912], 12);

map.addLayer(new L.TileLayer("https://cartodb-basemaps-{s}.global.ssl.fastly.net/dark_all/{z}/{x}/{y}.png"));

d3.json("rides_mls.json", function(error, collection) {
  if (error) throw error;
  map.addLayer(L.geoJSON(collection, {style: ridesPath}));
});
d3.json("runs_mls.json", function(error, collection) {
  if (error) throw error;
  map.addLayer(L.geoJSON(collection, {style: runsPath}));
});