This is a notebook of the day to day activity of a look into three years of fasting data.
2021-06-27
Initial thoughts
- I have about 3 years of fasting data now and I would like to see if something interesting can be said about it. And if given some time series data, is it possible to predict the next day. Or at least something about the next day.
- I suspect the data is somewhat cyclical at least with respect to patterns of “restraint” and “indulgence” , but also maybe with respect to week and weekend.
- Initially I think I want to use frequency based features, but perhaps afterwards I can compare that with an LSTM approach.
- I also have some calorie data, but it is only for about 3 months or so, and maybe I can use that somehow too later.
- Perhaps I can even later on try using exercise data, sleep data if available. And there is of course temperature and weather data.
Next steps
- Write out some interesting questions I would like to answer.
- What is my feature list so far , as a sort of data dictionary.
- Can do some initial data exploration, with visualizations.
- Build a dataset based on some initial features,
Interesting Questions ?
- Given a sequence of days, is there an interesting target variable, such as the hours fasted, which can be predicted of the next day or even the next few days or next week?
- A core qualitative question is, whether a period of restraint (more fasting) is followed by a period of less restraint (less fasting)?
- And maybe an extension of the above is whether fasting appears to be cyclical and how would you even measure this?
- How does the number of prior days of activity improve the accuracy of the prediction of the next day?
Feature list initial
- most recent fast started before midnight
- most recent fast ended before 20:00
- most recent fast more than 18 hours
- most recent fast more than 16 hours
- most recent fast more than 14 hours
- last two fasts, average start time before midnight
- last two fasts, average end time before 20:00
- last two fasts, average more than 18 hours
- last two fasts, average more than 16 hours
- last two fasts, max is more than 18 hours
- last two fasts, min is more than 18 hours
- last two fasts, m in is more than 16 hours
Next
- Yea next I would like to start building out a dataset of these features
- And would be cool to visualize some of these features too, get some summary stats on them.
2021-06-28
build dataset…
import os
import pandas as pd
workdir = os.getenv("WORKDIR")
datadir = os.getenv("DATADIR")
loc = f"{datadir}/2021-06-26-zero-fast.csv"
df = pd.read_csv(loc)
In [3]: df.shape
Out[3]: (1109, 5)
In [4]: df.iloc[:4]
Out[4]:
Date Start End Hours Night Eating
0 6/26/21 01:57 NaN NaN NaN
1 6/25/21 00:14 17:17 17.0 NaN
2 6/24/21 01:50 18:39 16.0 NaN
3 6/23/21 01:02 19:21 18.0 NaN
- Probably the simples question, is what is a unit of fasting? A fast can start and end on two different dates or it can start and end on the same day.
- A Fast can also span multiple days and it can also happen while traveling, resulting in different time zones messing with the data.
- However this dataset doesn’t appear to have time zones, so I wonder if it kind of simplifies that.
- Simplest I think is to ignore the first row for example,
Date Start End Hours Night Eating
0 6/26/21 01:57 NaN NaN NaN
- Since this fast is on-going. So a fast unit has a start and an end.
In [6]: df[df.Hours.isnull()].shape, df.shape
Out[6]: ((1, 5), (1109, 5))
- Ok per above only the latest fast would be null then.
Next
- Start writing that dataset
- write a fast id perhaps as the “start-date-start-time” which would be unique, and has to have defined end of course.
- And figure out what does that “Night Eating” column mean.
- names for the features I started writing out.
2021-06-30
quick notes
Also realized that the time between fasts can be a feature too.
For each “row”, we can basically say the “End time” is a separator between what happened before a fasting event, which can be used to create the “X” or independent variables and what happens after is the “Y” or dependent variables.
So for each row, if using pandas for example, we want a way to take perhaps 4 or more rows surrounding it and that would be used to create the features.
Perhaps this feels like a job for rolling window , interesting article
# most recent fast started before midnight
updated feature list
Feature list initial
- most recent fast started before midnight
- most recent fast ended before 20:00
- most recent fast more than 18 hours
- most recent fast more than 16 hours
- most recent fast more than 14 hours
- last two fasts, average start time before midnight
- last two fasts, both start time before midnight
- last two fasts, average end time before 20:00
- last two fasts, average more than 18 hours
- last two fasts, average more than 16 hours
- last two fasts, max is more than 18 hours
- last two fasts, min is more than 18 hours
- last two fasts, m in is more than 16 hours
- last two feeding windows, average less than 4 hours
2021-07-07
tried this rolling calc
- Looks good just need to calibrate the direction.
df['RollingHoursMean2Fasts'] = df['Hours'].rolling(2).mean()
In [8]: df.iloc[:10]
Out[8]:
Date Start End Hours Night Eating RollingHoursMean2Fasts
0 6/26/21 01:57 NaN NaN NaN NaN
1 6/25/21 00:14 17:17 17.0 NaN NaN
2 6/24/21 01:50 18:39 16.0 NaN 16.5
3 6/23/21 01:02 19:21 18.0 NaN 17.0
4 6/22/21 00:06 17:52 17.0 NaN 17.5
5 6/21/21 01:46 20:55 19.0 NaN 18.0
6 6/20/21 04:39 13:28 8.0 NaN 13.5
7 6/19/21 02:37 18:21 15.0 NaN 11.5
8 6/18/21 01:41 18:41 17.0 NaN 16.0
9 6/17/21 01:02 21:38 20.0 NaN 18.5
2021-07-09
- Add index too…
import datetime
df['StartDt'] = df.apply(lambda x: datetime.datetime(int(f"20{x.Date.split('/')[2]}"),
int(x.Date.split("/")[0]),
int(x.Date.split("/")[1]),
int(x.Start.split(":")[0]),
int(x.Start.split(":")[1])
), axis=1)
df['id'] = df.apply(lambda x: x.StartDt.strftime("%Y-%m-%dT%H%M"), axis=1)
In [21]: df.iloc[:10]
Out[21]:
Date Start End Hours Night Eating RollingHoursMean2Fasts StartDt id
0 6/26/21 01:57 NaN NaN NaN NaN 2021-06-26 01:57:00 2021-06-26T0157
1 6/25/21 00:14 17:17 17.0 NaN NaN 2021-06-25 00:14:00 2021-06-25T0014
2 6/24/21 01:50 18:39 16.0 NaN 16.5 2021-06-24 01:50:00 2021-06-24T0150
3 6/23/21 01:02 19:21 18.0 NaN 17.0 2021-06-23 01:02:00 2021-06-23T0102
4 6/22/21 00:06 17:52 17.0 NaN 17.5 2021-06-22 00:06:00 2021-06-22T0006
5 6/21/21 01:46 20:55 19.0 NaN 18.0 2021-06-21 01:46:00 2021-06-21T0146
6 6/20/21 04:39 13:28 8.0 NaN 13.5 2021-06-20 04:39:00 2021-06-20T0439
7 6/19/21 02:37 18:21 15.0 NaN 11.5 2021-06-19 02:37:00 2021-06-19T0237
8 6/18/21 01:41 18:41 17.0 NaN 16.0 2021-06-18 01:41:00 2021-06-18T0141
9 6/17/21 01:02 21:38 20.0 NaN 18.5 2021-06-17 01:02:00 2021-06-17T0102
- Looking at above, I see that
RollingHoursMean2Fasts
for2021-06-24T0150
is16.5 = mean([17, 16])
, so the rows it is taking into account should be current row and next not current row and last
Next
- Make
RollingHoursMean2Fasts
be for current and previous fast , not current and future fast haha since cannot know the future!
2021-07-10
how to update the rolling window to reflect the last 2 fasts
df['foo'] = df['Hours'].shift(-1)
df['RollingHoursMean2Fasts'] = df['Hours'].shift(-1).rolling(2).mean()
- Nice, per the below, this is perfect.
In [33]: cols = ['id', 'Date', 'Start', 'End', 'Hours', 'RollingHoursMean2Fasts', 'foo']
...: df[cols].iloc[:10]
Out[33]:
id Date Start End Hours RollingHoursMean2Fasts foo
0 2021-06-26T0157 6/26/21 01:57 NaN NaN NaN 17.0
1 2021-06-25T0014 6/25/21 00:14 17:17 17.0 16.5 16.0
2 2021-06-24T0150 6/24/21 01:50 18:39 16.0 17.0 18.0
3 2021-06-23T0102 6/23/21 01:02 19:21 18.0 17.5 17.0
4 2021-06-22T0006 6/22/21 00:06 17:52 17.0 18.0 19.0
5 2021-06-21T0146 6/21/21 01:46 20:55 19.0 13.5 8.0
6 2021-06-20T0439 6/20/21 04:39 13:28 8.0 11.5 15.0
7 2021-06-19T0237 6/19/21 02:37 18:21 15.0 16.0 17.0
8 2021-06-18T0141 6/18/21 01:41 18:41 17.0 18.5 20.0
9 2021-06-17T0102 6/17/21 01:02 21:38 20.0 18.5 17.0
Ok, going to add two more features like this
- last two fasts, both start time before midnight, “LastTwoFastsStartedBeforeMidnight”
def func(x):
pass
import ipdb; ipdb.set_trace()
return all([18 <= int(a.split(":")[0]) <=23 for a in x])
df["LastTwoFastsStartedBeforeMidnight"] = df['Start'].shift(-1).rolling(2).apply(func)
- I encountered
DataError: No numeric types to aggregate
and deeper in there I also stumbled on
ValueError: could not convert string to float: '00:14'
- So got to convert first to a numerical..
def func(data):
return all([18 <= x <=23 for x in data])
df["StartHour"] = df["Start"].map(lambda x: int(x.split(":")[0]))
df["LastTwoFastsStartedBeforeMidnight"] = df['StartHour'].shift(-1).rolling(2).apply(func)
- Ok I think that worked ..
In [57]: df.LastTwoFastsStartedBeforeMidnight.value_counts()
Out[57]:
0.0 907
1.0 200
Name: LastTwoFastsStartedBeforeMidnight, dtype: int64
In [60]: cols = ['id', 'Date', 'Start', 'End', 'Hours', 'LastTwoFastsStartedBeforeMidnight',]
...: df[cols].iloc[35:45]
Out[60]:
id Date Start End Hours LastTwoFastsStartedBeforeMidnight
35 2021-05-21T2341 5/21/21 23:41 21:02 21.0 0.0
36 2021-05-21T0026 5/21/21 00:26 20:31 20.0 0.0
37 2021-05-20T0101 5/20/21 01:01 20:57 19.0 0.0
38 2021-05-19T0008 5/19/21 00:08 20:29 20.0 0.0
39 2021-05-17T2324 5/17/21 23:24 21:38 22.0 1.0
40 2021-05-16T2300 5/16/21 23:00 20:21 21.0 1.0
41 2021-05-15T2358 5/15/21 23:58 14:06 14.0 0.0
42 2021-05-15T0110 5/15/21 01:10 21:42 20.0 0.0
43 2021-05-14T0049 5/14/21 00:49 17:53 17.0 0.0
44 2021-05-13T0015 5/13/21 00:15 17:26 17.0 0.0
Next
- Ok at this point I should start throwing stuff into version control so creating these features and visualizing/analyzing them can be more deterministic/reproducible.
- And then I can try visualizing / understanding some features and see how predictive they are w.r.t. “does past behavior determine future behavior such as the length of the next fast”.
2021-07-11
ok started things off in a new repo
- New repo here , https://github.com/namoopsoo/fasting-analyze
import os
import core.dataset as cd
import pandas as pd
workdir = os.getenv("WORKDIR")
datadir = os.getenv("DATADIR")
loc = f"{datadir}/2021-06-26-zero-fast.csv"
df = pd.read_csv(loc)
datasetdf = cd.build_dataset(df)
In [11]: datasetdf.iloc[:10]
Out[11]:
Date Start End Hours ... id StartHour LastTwoFastsStartedBeforeMidnight RollingHoursMean2Fasts
0 6/26/21 01:57 NaN NaN ... 2021-06-26T0157 1 NaN NaN
1 6/25/21 00:14 17:17 17.0 ... 2021-06-25T0014 0 0.0 16.5
2 6/24/21 01:50 18:39 16.0 ... 2021-06-24T0150 1 0.0 17.0
3 6/23/21 01:02 19:21 18.0 ... 2021-06-23T0102 1 0.0 17.5
4 6/22/21 00:06 17:52 17.0 ... 2021-06-22T0006 0 0.0 18.0
5 6/21/21 01:46 20:55 19.0 ... 2021-06-21T0146 1 0.0 13.5
6 6/20/21 04:39 13:28 8.0 ... 2021-06-20T0439 4 0.0 11.5
7 6/19/21 02:37 18:21 15.0 ... 2021-06-19T0237 2 0.0 16.0
8 6/18/21 01:41 18:41 17.0 ... 2021-06-18T0141 1 0.0 18.5
9 6/17/21 01:02 21:38 20.0 ... 2021-06-17T0102 1 0.0 18.5
[10 rows x 10 columns]
In [12]: datasetdf.iloc[0]
Out[12]:
Date 6/26/21
Start 01:57
End NaN
Hours NaN
Night Eating NaN
StartDt 2021-06-26 01:57:00
id 2021-06-26T0157
StartHour 1
LastTwoFastsStartedBeforeMidnight NaN
RollingHoursMean2Fasts NaN
Name: 0, dtype: object
look at RollingHoursMean2Fasts briefly
import matplotlib.pyplot as plt
import pylab
import date_utils as du
col = "RollingHoursMean2Fasts"
loc = f"{workdir}/{du.utc_ts()}-{col}.png"
with plt.style.context('fivethirtyeight'):
plt.plot(datasetdf[col].tolist())
plt.title(f'{col}')
pylab.savefig(loc, bbox_inches='tight')
pylab.close()
Next
- I think the main next thing to do is to create a “y” dependent variable for this dataset, which can for example be something like “hours fasted in next fast” or “hours until next fast” or “proportion of hours fasted in next 7 days”.
- And with a dependent variable I can then see how predictive these features are, as a first iteration with only the twofeatures so far.
- And I can continue to add other features too.