Plot your health with Samsung Health and Pandas

A group of pandas exercising

Artwork by Sami Lee.

For the last 5+ years, I’ve been tracking various aspects of my personal health using Samsung Health. It tracks weight, calories, heart rate, stress, and exercise, and stores all of it in the app.

However, the app only gives some basic high-level charts and insights. Luckily, it lets you export your personal data as CSV files that you can import into your tool of choice for any kind of analysis. In this post, I’m going to show how to export it all and load it into Zeppelin, then walk through some sample Pandas queries that will help you start building more complex queries yourself.

Download your Data

First, open Samsung Health on your mobile phone

Scroll down and tap “Download personal data”

Tap the Download button

Log in to your account when it prompts you

Wait for it to download

After it completes downloading, tap View files.

Select the file, tap Compress, and save it as a zip file.

Then upload the file to the computer running Zeppelin. I upload it to OneDrive, SFTP it to the server, and then extract it into a folder.
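If you would rather do the extraction step from Python than from a shell, a minimal sketch could look like the following. The function name and the paths in the usage comment are hypothetical; adjust them to wherever you uploaded the zip file.

```python
import zipfile
from pathlib import Path


def extract_export(zip_path, target_dir):
    """Extract the export zip into target_dir and return that directory as a Path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Hypothetical usage (these paths are placeholders, not part of the export):
# export_dir = extract_export("/shealth/export.zip", "/shealth")
```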

Working with the Data

The export should contain a number of interesting files in the main directory along with the json and files directories. The main files are CSV formatted.

File Name | Notes
{name}.{ts}.csv | One CSV file per data type, where {ts} is the export timestamp
files/ | Contains random files like your profile picture
json/ | Contains the binned data, which is higher resolution (e.g. minute level) data from workouts, sleep, etc.
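To see which data types your own export contains, you can scan the main directory for files that follow the {name}.{ts}.csv pattern. This is a rough sketch: the helper name is hypothetical, and it assumes the timestamp part of the file name consists only of digits.

```python
import re
from pathlib import Path

# Assumption: file names look like "<name>.<digits>.csv"
CSV_PATTERN = re.compile(r"^(?P<name>.+)\.(?P<ts>\d+)\.csv$")


def list_data_types(export_dir):
    """Map each data type name to the path of its CSV file in export_dir."""
    found = {}
    for path in Path(export_dir).glob("*.csv"):
        match = CSV_PATTERN.match(path.name)
        if match:
            found[match.group("name")] = path
    return found
```

Printing sorted(list_data_types(export_dir)) gives a quick overview of the names you can pass to the loading helpers later in this post.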


Once the data is loaded and accessible to Zeppelin, create a new notebook. This first block of code provides some basic package imports for Pandas and charting.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from datetime import date, time, datetime

plt.style.use('ggplot')

I define a few utility methods that help me load data from the different CSV files.

ts = "{timestamp}"

def file_name(name):
    return "/shealth/samsunghealth_example_user_" + ts + "/" + name + "." + ts + ".csv"
def bin_file(type, uuid):
    return "/shealth/samsunghealth_example_user_" + ts + "/jsons/" + type + "/" + uuid[0] + "/" + uuid + ".binning_data.json"

The timestamps stored in time-based fields are in UTC, and the time offset is stored in a separate column. This makes certain kinds of investigation hard: if you travel across time zones, you can’t see things like what time you go to bed on average without first applying the offset.

The following code provides a load method, load_file(), that can be used to correctly parse the dates.

def date_parser(date, tz):
    # Join each timestamp with its offset value so both are parsed together (%Z%z matches the offset)
    return pd.to_datetime(date + " " + tz, utc=False, format='%Y-%m-%d %H:%M:%S.%f %Z%z')

def load_file(name, date_cols=None, tz_col=None, **kwargs):
    # Map a temporary column name to each [date column, offset column] pair
    parse_dates_param = dict([('foo' + x, [x, tz_col]) for x in (date_cols or [])])

    df = pd.read_csv(file_name(name), header=1, index_col=False, keep_date_col=False, parse_dates=parse_dates_param, date_parser=date_parser, **kwargs)

    # Move each parsed column back under its original name
    for key, val in parse_dates_param.items():
        df[val[0]] = df[key]
        df = df.drop(columns=key)

    return df
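To make the UTC-plus-offset layout concrete, here is a small standalone sketch that converts a single UTC timestamp to local time. It assumes the offset column holds strings shaped like "UTC-0500"; the helper names parse_offset and to_local are made up for this example and are not part of the export or of Pandas.

```python
from datetime import timedelta, timezone

import pandas as pd


def parse_offset(offset):
    """Turn a string such as "UTC-0500" into a fixed-offset timezone (assumed format)."""
    sign = -1 if offset[3] == "-" else 1
    hours, minutes = int(offset[4:6]), int(offset[6:8])
    return timezone(sign * timedelta(hours=hours, minutes=minutes))


def to_local(utc_text, offset):
    """Interpret utc_text as UTC, then convert it to the recorded local offset."""
    utc_time = pd.Timestamp(utc_text, tz="UTC")
    return utc_time.tz_convert(parse_offset(offset))


local = to_local("2021-03-01 04:30:00.000", "UTC-0500")
print(local)
```

With these inputs, a bedtime recorded as 04:30 UTC comes out as 23:30 on the previous evening, which is the value you want when asking what time you usually go to bed.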


Below I show a few different examples of how to load and visualize interesting data sets.

Sleep Data

Sleep data can be loaded with the following code. This produces two data frames: sleep_data and sleep_data_by_day. If you take naps during the day or night, you’ll get one record per nap plus one record per sleep event. If you wake up at night and move around enough, Samsung Health may create a separate sleep record for when you go back to sleep.

sleep_data = load_file("", date_cols=['', ''], tz_col='')
sleep_data['start_time'] = sleep_data['']
sleep_data = sleep_data.set_index('').sort_index()

# Drop invalid data
sleep_data = sleep_data[sleep_data['efficiency'] > 0]

sleep_data['day'] = sleep_data['start_time'].dt.floor('d')
sleep_data_by_day = sleep_data.groupby(['day']).agg({'sleep_duration': 'sum', 'movement_awakening': 'sum', 'start_time': 'min', 'physical_recovery': 'mean'})
sleep_data_by_day['first_sleep'] = sleep_data_by_day['start_time'].apply(lambda dt: (dt - dt.replace(hour=0, minute=0, second=0, microsecond=0)).total_seconds() / 60 / 60)
sleep_data_by_day = sleep_data_by_day.sort_index()
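The per-day roll-up is easier to follow on a tiny synthetic data set. The sketch below uses made-up values and only two of the columns, and it assumes sleep_duration is in minutes, as the division by 60 in the plot further down suggests.

```python
import pandas as pd

# Synthetic records: a night's sleep and a nap on 1 March, plus one night on 2 March
records = pd.DataFrame({
    "start_time": pd.to_datetime(["2021-03-01 00:30", "2021-03-01 14:00", "2021-03-02 01:15"]),
    "sleep_duration": [420, 45, 390],  # minutes (assumed unit)
})
records["day"] = records["start_time"].dt.floor("d")

# One row per day: total minutes slept and the earliest start time
by_day = records.groupby("day").agg({"sleep_duration": "sum", "start_time": "min"})

# Hours after midnight at which the first sleep of the day began
midnight = by_day["start_time"].dt.floor("d")
by_day["first_sleep"] = (by_day["start_time"] - midnight).dt.total_seconds() / 3600

print(by_day)
```

The nap and the night on 1 March collapse into a single row, which is why the daily totals can be higher than any single sleep record.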

With this, you can graph how many hours of sleep you get per night:

ax = (sleep_data_by_day[['sleep_duration']] / 60).plot()
ax.set_ylabel("hours")
Hours of sleep per day. The dips near zero are likely data quality issues.

Sleep data is additionally broken out into a separate file for “sleep stages”. Sleep stages break down each night’s sleep into light, deep, REM, and awake stages. These stages help explain how good a night’s sleep you’re getting and identify whether you’re tossing and turning too much.

The following code will load the sleep stage data into a data frame:

sleep_stages = pd.read_csv(file_name(""), skiprows=1, index_col=10)

sleep_stage_name_map = {40001: 'awake', 40002: 'light', 40003: 'deep', 40004: 'rem'}

sleep_stages = sleep_stages[sleep_stages['start_time'] > '2021-01-01']
sleep_stages = sleep_stages.sort_values(by='start_time')

# Data Massaging

# Convert Samsung ids into friendly names e.g. 40001 -> 'awake'
sleep_stages['stage_name'] = sleep_stages['stage'].replace(sleep_stage_name_map)
# The timestamps are read as strings, so convert them before doing date math
sleep_stages['start_time'] = pd.to_datetime(sleep_stages['start_time'])
sleep_stages['end_time'] = pd.to_datetime(sleep_stages['end_time'])
# Duration of each stage record, in minutes
sleep_stages['duration'] = (sleep_stages['end_time'] - sleep_stages['start_time']).dt.total_seconds() / 60
sleep_stages['day'] = sleep_stages['start_time'].dt.floor('d')

# Total minutes per day in each stage, with one column per stage name
stage_by_day = sleep_stages.groupby(['day', 'stage_name']) \
  .agg({'duration': 'sum'}) \
  .unstack()

stage_by_day.columns = stage_by_day.columns.get_level_values(1)
stage_by_day = stage_by_day.fillna(0)
stage_by_day['total'] = stage_by_day['awake'] + stage_by_day['deep'] + stage_by_day['light'] + stage_by_day['rem']

There are a couple of interesting views of this data that I use. One is to plot the percentage of time spent in each stage over a time period to see if my sleep is changing.

stage_by_day['awake%'] = stage_by_day['awake'] / stage_by_day['total']
stage_by_day['light%'] = stage_by_day['light'] / stage_by_day['total']
stage_by_day['rem%'] = stage_by_day['rem'] / stage_by_day['total']
stage_by_day['deep%'] = stage_by_day['deep'] / stage_by_day['total']

data = stage_by_day[['awake%', 'light%', 'rem%', 'deep%']]
ax = data.plot()
ax.set_title("% sleep time spent per stage")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1))
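Daily percentages tend to be noisy, so a rolling average can make a gradual change easier to spot. The sketch below is one possible approach rather than what I used for the chart above. It runs on randomly generated stand-in data, and the 7-day window is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two of the stage_by_day percentage columns
days = pd.date_range("2021-01-01", periods=28, freq="D")
rng = np.random.default_rng(0)
shares = pd.DataFrame({
    "deep%": rng.uniform(0.10, 0.25, size=len(days)),
    "rem%": rng.uniform(0.15, 0.30, size=len(days)),
}, index=days)

# Average each column over a trailing 7-day window keyed on the date index
smoothed = shares.rolling("7D").mean()

print(smoothed.tail())
```

Calling smoothed.plot() in place of data.plot() would then chart the smoothed series with the same formatting code.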

Heart Rate

Heart rate data can be found in two locations: the main CSV file and the binned JSON files. The JSON files contain higher-resolution, per-minute data. The main CSV file contains either hourly summaries (when binned data exists) or individual data points. My watch can be configured to collect heart rate data either continuously or intermittently. If it’s continuous, then you’ll see the binned data.

The following block will load the summary data and additionally load any binned data that is associated with an hour block into a data frame.

# Load Heart Rate Data (non-binned)
heart_rate = pd.read_csv(file_name(""), skiprows=1, index_col=False, dtype={'': str, u'': str})#, parse_dates=[''])
heart_rate['start_time_tz'] = pd.to_datetime(heart_rate[['', '']].agg(' '.join, axis=1))

# Load the binned data
import glob
hr_bins = []
hr_simplified = heart_rate[heart_rate[''].isna()].rename(columns={
    "": "end_time", 
    "start_time_tz": "start_time", 
    "": "heart_rate", 
    "": "heart_rate_min", 
    "": "heart_rate_max"
})[['start_time', 'end_time', 'heart_rate_min', 'heart_rate_max', 'heart_rate']]
for file in glob.glob(bin_file("", "*")):
    # Assumed loop body: read each binned JSON file into a data frame and collect it
    hr_bins.append(pd.read_json(file))

heart_rate_full = pd.concat(hr_bins)

  • heart_rate contains just the summaries, meaning either hourly buckets if you enable continuous heart rate monitoring or single samples if you don’t
  • heart_rate_full contains all available heart rate data at the finest resolution available
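For reference, here is a rough standalone sketch of reading a single binned file. It assumes each file is a JSON array of records whose start_time and end_time are epoch milliseconds and whose readings are named heart_rate, heart_rate_min, and heart_rate_max; check one of your own files before relying on this, since the layout is an assumption on my part. The helper name load_bin and the sample values are made up.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_bin(path):
    """Read one binned JSON file into a data frame with parsed UTC timestamps."""
    with open(path) as handle:
        frame = pd.DataFrame(json.load(handle))
    for col in ("start_time", "end_time"):
        # Assumption: timestamps are stored as epoch milliseconds
        frame[col] = pd.to_datetime(frame[col], unit="ms", utc=True)
    return frame


# Write a tiny synthetic file so the sketch runs without a real export
sample = [
    {"start_time": 1614556800000, "end_time": 1614556859000,
     "heart_rate": 61, "heart_rate_min": 58, "heart_rate_max": 65},
    {"start_time": 1614556860000, "end_time": 1614556919000,
     "heart_rate": 63, "heart_rate_min": 60, "heart_rate_max": 67},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.binning_data.json"
    path.write_text(json.dumps(sample))
    bins = load_bin(path)

print(bins)
```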

Plot the min, max, and average heart rate per day:

heart_rate['day'] = heart_rate['start_time_tz'].dt.floor('d')
heart_rate \
  .groupby('day') \
  .agg({'': ['mean', 'min', 'max']}) \
  .plot()


Tracking your weight is easy too. The weight is stored in kilograms. The code below converts it to pounds; if you want to keep kilograms, remove the conversion line.

weight = load_file("", date_cols=['start_time'], tz_col='time_offset')
weight = weight.set_index('start_time').sort_index()

# Convert to pounds
weight['weight'] = weight['weight'] * 2.20462

fig, ax = plt.subplots(2, figsize=(12, 8), sharex='all')
ax[0].plot(weight.index, weight[['weight']])
ax[1].plot(weight.index, weight[['body_fat']])

ax[1].set_ylabel('Body Fat %')

That gives us a plot similar to the below:


Samsung Health gives a great export of data that you can analyze. I’ve given examples of how to leverage a few of the files. In the future, I plan to share some other analytics that I’ve done as I’ve worked to bring data analytics to my personal health.

If you’ve noticed any issues or have any other insights you’ve found, leave a comment below.

2 thoughts on “Plot your health with Samsung Health and Pandas”

  1. Thank you for this article, great work. I was planning to use Pandas for the same purpose and I’m glad I don’t have to start from scratch.

  2. This is great — thank you! One issue, though, is that I think there is an error in your date parser. The time on the watch is already in UTC, and the time_offset column is UTC to the local time. Your parser ends up shifting in the wrong direction. I found that the following parser gives the correct offset:

    def date_parser(date, tz):
        return pd.to_datetime(date, utc=True, format='%Y-%m-%d %H:%M:%S.%f').tz_convert(tz)
