Plot your health with Samsung Health and Pandas

Artwork by Sami Lee.

For the last 5+ years, I’ve been tracking various aspects of my personal health using Samsung Health. It tracks weight, calories, heart rate, stress, and exercise, and stores all of it in the app.

However, the app offers only basic, high-level charts and insights. Luckily, it lets you export your personal data as CSV files that you can import into your tool of choice for any kind of analysis. In this post, I’ll show how to export the data, load it into Zeppelin, and run some sample Pandas queries that will help you start building more complex queries yourself.

Download your Data

First, open Samsung Health on your mobile phone

Scroll down and tap “Download personal data”

Tap the Download button

Wait for it to download

After it completes downloading, tap View files.

Select the file, tap Compress, and save it as a zip file.

Then upload the file to the computer running Zeppelin. I upload it to OneDrive, SFTP it to the server, and then extract it into a folder.
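If you would rather script the extraction step, a minimal Python sketch might look like the following. The function name `extract_export` and the paths are placeholders of mine, not something Samsung Health produces.

```python
import zipfile
from pathlib import Path


def extract_export(zip_path, target_dir):
    """Extract an export archive and return the paths of the extracted files.

    Hypothetical helper: zip_path and target_dir are whatever you chose
    when you saved and uploaded the export.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Return only files (not directories), sorted for stable output
    return sorted(p for p in target.rglob("*") if p.is_file())
```

Calling `extract_export("export.zip", "/shealth")` would then leave the CSV files somewhere under `/shealth`, depending on how the archive was compressed.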

Working with the Data

The export should contain a number of interesting files in the main directory along with the json and files directories. The main files are CSV formatted.

Data Type Listing

File Name	Doc Link	Notes
com.samsung.health.ecg.{ts}.csv
com.samsung.health.floors_climbed.{ts}.csv	Link
com.samsung.health.food_info.{ts}.csv
com.samsung.health.food_intake.{ts}.csv
com.samsung.health.height.{ts}.csv
com.samsung.health.nutrition.{ts}.csv
com.samsung.health.sleep_stage.{ts}.csv	Link
com.samsung.health.user_profile.{ts}.csv
com.samsung.health.water_intake.{ts}.csv	Link
com.samsung.health.weight.{ts}.csv	Link
com.samsung.shealth.activity.day_summary.{ts}.csv
com.samsung.shealth.activity.goal.{ts}.csv
com.samsung.shealth.activity_level.{ts}.csv
com.samsung.shealth.best_records.{ts}.csv
com.samsung.shealth.blood_pressure.{ts}.csv	Link
com.samsung.shealth.breathing.{ts}.csv
com.samsung.shealth.caloric_balance_goal.{ts}.csv
com.samsung.shealth.calories_burned.details.{ts}.csv
com.samsung.shealth.exercise.{ts}.csv
com.samsung.shealth.exercise.weather.{ts}.csv
com.samsung.shealth.floor_goal.{ts}.csv
com.samsung.shealth.food_favorite.{ts}.csv
com.samsung.shealth.food_frequent.{ts}.csv
com.samsung.shealth.food_goal.{ts}.csv
com.samsung.shealth.goal.{ts}.csv
com.samsung.shealth.goal_history.{ts}.csv
com.samsung.shealth.insight.milestones.{ts}.csv
com.samsung.shealth.library_subscription.{ts}.csv
com.samsung.shealth.permission.{ts}.csv
com.samsung.shealth.preferences.{ts}.csv
com.samsung.shealth.report.{ts}.csv
com.samsung.shealth.rewards.{ts}.csv
com.samsung.shealth.sleep.{ts}.csv	Link
com.samsung.shealth.sleep_combined.{ts}.csv
com.samsung.shealth.sleep_data.{ts}.csv
com.samsung.shealth.sleep_goal.{ts}.csv
com.samsung.shealth.social.friends.{ts}.csv
com.samsung.shealth.social.leaderboard.{ts}.csv
com.samsung.shealth.social.public_challenge.{ts}.csv
com.samsung.shealth.social.public_challenge.detail.{ts}.csv
com.samsung.shealth.social.public_challenge.extra.{ts}.csv
com.samsung.shealth.social.public_challenge.history.{ts}.csv
com.samsung.shealth.social.service_status.{ts}.csv
com.samsung.shealth.stand_day_summary.{ts}.csv
com.samsung.shealth.step_daily_trend.{ts}.csv	Link
com.samsung.shealth.stress.{ts}.csv
com.samsung.shealth.stress.base_histogram.{ts}.csv
com.samsung.shealth.stress.histogram.{ts}.csv
com.samsung.shealth.tip.{ts}.csv
com.samsung.shealth.tracker.heart_rate.{ts}.csv
com.samsung.shealth.tracker.oxygen_saturation.{ts}.csv	Link
com.samsung.shealth.tracker.pedometer_day_summary.{ts}.csv
com.samsung.shealth.tracker.pedometer_recommendation.{ts}.csv
com.samsung.shealth.tracker.pedometer_step_count.{ts}.csv	Link
files/		Contains random files like your profile picture
json/		Contains the binned data which is higher resolution (e.g. minute level) data from workouts, sleep, etc.
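To see which of these files your own export actually contains, a short sketch like the one below can index the main directory by data type. It assumes the `{ts}` part of each file name is a run of digits, which you should check against your export. `list_export_tables` is a name I made up for illustration.

```python
import re
from pathlib import Path

# Assumption: file names look like "<data type>.<digits>.csv"
EXPORT_FILE_PATTERN = re.compile(r"^(?P<data_type>.+)\.(?P<ts>\d+)\.csv$")


def list_export_tables(export_dir):
    """Map each data type (e.g. 'com.samsung.health.weight') to its CSV path."""
    tables = {}
    for path in sorted(Path(export_dir).glob("*.csv")):
        match = EXPORT_FILE_PATTERN.match(path.name)
        if match:
            tables[match.group("data_type")] = path
    return tables
```

Printing the keys of the returned dictionary gives a quick inventory before you decide which files to load.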


Once the data is loaded and accessible to Zeppelin, create a new notebook. This first block of code provides some basic package imports for Pandas and charting.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from datetime import date, time, datetime

plt.style.use('ggplot')

I define a few utility methods that help me load data from the different CSV files.

ts = "{timestamp}"

def file_name(name):
    return "/shealth/samsunghealth_example_user_" + ts + "/" + name + "." + ts + ".csv"
    
def bin_file(type, uuid):
    return "/shealth/samsunghealth_example_user_" + ts + "/jsons/" + type + "/" + uuid[0] + "/" + uuid + ".binning_data.json"

The timestamps stored in time-based fields are in UTC, and the time offset is stored in a separate column. This makes certain kinds of investigation hard: if you travel, for example, you can’t easily see what time you go to bed on average.

The following code provides a load method, load_file(), that parses the dates together with their offsets.

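def date_parser(date, tz):
    # Join each timestamp with its offset and parse them together.
    # %f matches the fractional seconds.
    return pd.to_datetime(date + " " + tz, utc=False, format='%Y-%m-%d %H:%M:%S.%f %Z%z')

def load_file(name, date_cols=None, tz_col=None, **kwargs):
    # Map a temporary column name to each [date column, offset column] pair
    parse_dates_param = dict([('foo' + x, [x, tz_col]) for x in (date_cols or [])])

    # Note: date_parser and column-combining parse_dates are deprecated in recent pandas 2.x releases
    df = pd.read_csv(file_name(name), header=1, index_col=False, keep_date_col=False, parse_dates=parse_dates_param or False, date_parser=date_parser, **kwargs)

    # Move each parsed column back to its original name
    for key, val in parse_dates_param.items():
        df[val[0]] = df[key]
        df = df.drop(columns=key)

    return df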
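As a separate illustration of the UTC-plus-offset idea, here is a minimal sketch that converts UTC timestamps into local wall-clock time by adding the offset. It is not the method load_file() uses. It assumes the offset strings look like `UTC-0700`, which you should verify in your own files, and the helper names `parse_offset` and `to_local_wall_time` are hypothetical.

```python
import re

import pandas as pd

# Assumption: offsets are formatted as "UTC+HHMM" or "UTC-HHMM"
OFFSET_PATTERN = re.compile(r"^UTC([+-])(\d{2})(\d{2})$")


def parse_offset(offset):
    """Turn a string such as 'UTC-0700' into a pandas Timedelta."""
    match = OFFSET_PATTERN.match(offset)
    if match is None:
        raise ValueError("unexpected offset format: " + repr(offset))
    sign = -1 if match.group(1) == "-" else 1
    return sign * pd.Timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))


def to_local_wall_time(utc_times, offsets):
    """Add each row's offset to its UTC timestamp, giving naive local times."""
    utc = pd.to_datetime(utc_times)
    return utc + pd.to_timedelta(offsets.map(parse_offset))
```

The result is timezone-naive, which keeps a single datetime dtype even when the offsets differ from row to row. That makes questions like "what time did I go to bed" easier to answer across trips.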

Analytics

Below I show a few different examples of how to load and visualize interesting data sets.

Sleep Data

Sleep data can be loaded with the following code. This produces two data frames: sleep_data and sleep_data_by_day. If you take naps during the day or night, you’ll get one record per nap plus one record per sleep event. If you wake up at night and move around enough, Samsung Health may create a separate sleep record for when you go back to sleep.

sleep_data = load_file("com.samsung.shealth.sleep", date_cols=['com.samsung.health.sleep.start_time', 'com.samsung.health.sleep.end_time'], tz_col='com.samsung.health.sleep.time_offset')
sleep_data['start_time'] = sleep_data['com.samsung.health.sleep.start_time']
sleep_data = sleep_data.set_index('com.samsung.health.sleep.start_time').sort_index()

# Drop invalid data
sleep_data = sleep_data[sleep_data['efficiency'] > 0]

sleep_data['day'] = sleep_data['start_time'].dt.floor('d')
sleep_data_by_day = sleep_data.groupby(['day']).agg({'sleep_duration': 'sum', 'movement_awakening': 'sum', 'start_time': 'min', 'physical_recovery': 'mean'})
sleep_data_by_day['first_sleep'] = sleep_data_by_day['start_time'].apply(lambda dt: (dt - dt.replace(hour=0, minute=0, second=0, microsecond=0)).total_seconds() / 60 / 60)
sleep_data_by_day = sleep_data_by_day.sort_index()

With this, you can graph how many hours of sleep you get per night:

ax = (sleep_data_by_day[['sleep_duration']] / 60).plot()
ax.set_ylabel("hours")

Hours of sleep per day. The dips near zero are likely data quality issues.
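One thing to watch with the per-day grouping is that sleep starting after midnight lands on the next calendar day. If you prefer to count it toward the previous evening, a sketch like the following shifts the start time back before truncating to a date. The function name `assign_night` and the noon cutoff are my own choices, and it assumes the start times are a regular datetime column.

```python
import pandas as pd


def assign_night(start_times, cutoff_hour=12):
    """Label each sleep record with the date of the evening it belongs to.

    Anything that starts before cutoff_hour is counted toward the
    previous calendar day.
    """
    shifted = start_times - pd.Timedelta(hours=cutoff_hour)
    # normalize() sets the time of day to midnight, leaving just the date
    return shifted.dt.normalize()
```

You could then group by `assign_night(sleep_data['start_time'])` instead of the `day` column and compare the two charts.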

Sleep data is additionally broken out into a separate file for “sleep stages”. Sleep stages break down each night’s sleep into light, deep, REM, and awake stages. These stages help explain how good a night’s sleep you’re getting and whether you’re tossing and turning too much.

The following code will load sleep stage data into a data frame:

sleep_stages = pd.read_csv(file_name("com.samsung.health.sleep_stage"), skiprows=1, index_col=10)

sleep_stage_name_map = {40001: 'awake', 40002: 'light', 40003: 'deep', 40004: 'rem'}

# Parse the timestamps so the date arithmetic below works
sleep_stages['start_time'] = pd.to_datetime(sleep_stages['start_time'])
sleep_stages['end_time'] = pd.to_datetime(sleep_stages['end_time'])

sleep_stages = sleep_stages[sleep_stages['start_time'] > '2021-01-01']
sleep_stages = sleep_stages.sort_values(by='start_time')

# Data Massaging

# Convert Samsung ids into friendly names e.g. 40001 -> 'awake'
sleep_stages['stage_name'] = sleep_stages['stage'].replace(sleep_stage_name_map)
# Duration of each stage in minutes (plain numbers are easier to sum and fill than timedeltas)
sleep_stages['duration'] = (sleep_stages['end_time'] - sleep_stages['start_time']).dt.total_seconds() / 60
sleep_stages['day'] = sleep_stages['start_time'].dt.floor('d')

# Group by the friendly name so the resulting columns are 'awake', 'light', 'deep', and 'rem'
stage_by_day = sleep_stages.groupby(['day', 'stage_name']) \
  .agg({'duration': 'sum'}) \
  .unstack()

stage_by_day.columns = stage_by_day.columns.get_level_values(1)
stage_by_day = stage_by_day.fillna(0)
stage_by_day['total'] = stage_by_day['awake'] + stage_by_day['deep'] + stage_by_day['light'] + stage_by_day['rem']

There are a couple of interesting views of this data that I use. One is to plot the percentage of time spent in each stage over a period to see if my sleep is changing.

stage_by_day['awake%'] = stage_by_day['awake'] / stage_by_day['total']
stage_by_day['light%'] = stage_by_day['light'] / stage_by_day['total']
stage_by_day['rem%'] = stage_by_day['rem'] / stage_by_day['total']
stage_by_day['deep%'] = stage_by_day['deep'] / stage_by_day['total']

data = stage_by_day[['awake%', 'light%', 'rem%', 'deep%']]
ax = data.plot()
ax.set_title("% sleep time spent per stage")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1))
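Daily percentages can be noisy, so one option for spotting longer-term change is to aggregate by month first. The sketch below assumes a frame shaped like stage_by_day, with a datetime index and numeric awake, light, rem, and deep columns. `monthly_stage_share` is a hypothetical helper name.

```python
import pandas as pd


def monthly_stage_share(stage_by_day):
    """Return the fraction of sleep time spent in each stage, per month."""
    stages = ['awake', 'light', 'rem', 'deep']
    # Sum the time per stage within each calendar month
    totals = stage_by_day[stages].groupby(stage_by_day.index.to_period('M')).sum()
    # Divide each row by its own total so the shares add up to 1
    return totals.div(totals.sum(axis=1), axis=0)
```

Because it sums before dividing, long nights weigh more than short ones. Averaging the daily percentages instead would treat every night equally.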

Heart Rate

Heart rate data can be found in two locations: the main CSV file and the binned JSON files. The JSON files contain higher-resolution, per-minute data. The main CSV file contains either hourly summaries (when binned data exists) or individual data points. My watch can be configured to collect heart rate data either continuously or intermittently. If it’s continuous, then you’ll see the binned data.

The following block will load the summary data and additionally load any binned data that is associated with an hour block into a data frame.

# Load Heart Rate Data (non-binned)
heart_rate = pd.read_csv(file_name("com.samsung.shealth.tracker.heart_rate"), skiprows=1, index_col=False, dtype={'com.samsung.health.heart_rate.custom': str, 'com.samsung.health.heart_rate.binning_data': str})
heart_rate['start_time_tz'] = pd.to_datetime(heart_rate[['com.samsung.health.heart_rate.start_time', 'com.samsung.health.heart_rate.time_offset']].agg(' '.join, axis=1))
# start_time_tz stays a regular column because it is renamed and grouped on below

# Load the binned data
import glob
hr_bins = []
hr_simplified = heart_rate[heart_rate['com.samsung.health.heart_rate.binning_data'].isna()].rename(columns={
    "com.samsung.health.heart_rate.end_time": "end_time", 
    "start_time_tz": "start_time", 
    "com.samsung.health.heart_rate.heart_rate": "heart_rate", 
    "com.samsung.health.heart_rate.min": "heart_rate_min", 
    "com.samsung.health.heart_rate.max": "heart_rate_max"
})[['start_time', 'end_time', 'heart_rate_min', 'heart_rate_max', 'heart_rate']]
hr_bins.append(hr_simplified)
for file in glob.glob(bin_file("com.samsung.shealth.tracker.heart_rate", "*")):
    hr_bins.append(pd.read_json(file))

heart_rate_full = pd.concat(hr_bins)

heart_rate contains just the summaries: hourly buckets if you enable continuous heart rate monitoring, or single samples if you don’t.
heart_rate_full contains all available heart rate data at the finest granularity available.

Plot the minimum, maximum, and average heart rate per day:

heart_rate['day'] = heart_rate['start_time_tz'].dt.floor('d')
heart_rate \
  .groupby('day') \
  .agg({'com.samsung.health.heart_rate.heart_rate': ['mean', 'min', 'max']}) \
  .plot()
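Another view that can be useful is a typical-day profile, which groups samples by hour of day rather than by date. This is a minimal sketch that assumes a frame with a datetime start_time column and a numeric heart_rate column, like the renamed columns above. `hourly_profile` is a name I picked for illustration.

```python
import pandas as pd


def hourly_profile(samples):
    """Summarize heart rate by hour of day (0-23) across all days."""
    hours = samples['start_time'].dt.hour.rename('hour')
    return samples.groupby(hours)['heart_rate'].agg(['mean', 'min', 'max'])
```

Calling `.plot()` on the result shows how your heart rate tends to move over the course of a day.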

Weight

Tracking your weight is easy too. The weight is stored in kilograms. The code below converts it to pounds; if you want to keep kilograms, remove the conversion line.

weight = load_file("com.samsung.health.weight", date_cols=['start_time'], tz_col='time_offset')
weight = weight.set_index('start_time').sort_index()

# Convert to pounds
weight['weight'] = weight['weight'] * 2.20462

fig, ax = plt.subplots(2, figsize=(12, 8), sharex='all')
ax[0].plot(weight.index, weight[['weight']])
ax[1].plot(weight.index, weight[['body_fat']])

ax[0].set_xlabel('Time')
ax[0].set_ylabel('Weight')
ax[1].set_ylabel('Body Fat %')

plt.show()

That gives us a two-panel plot of weight and body fat over time.

Conclusion

Samsung Health gives a great export of data that you can analyze. I’ve given examples of how to leverage a few of the files. In the future, I plan to share some other analytics that I’ve done as I’ve worked to bring data analytics to my personal health.

If you’ve noticed any issues or have any other insights you’ve found, leave a comment below.

Comments

To give feedback, send an email to adam [at] this website url.
