Using synthetic data and machine learning to predict heart rate and enable lifestyle recommendations

Helana Lutfi; Thomas Spittler; Hassan Majid Ibrahim

doi:10.21037/jmai-24-35

Original Article


Helana Lutfi¹ , Thomas Spittler¹ , Hassan Majid Ibrahim²

¹Faculty European Campus Rottal-Inn, Deggendorf Institute of Technology, Pfarrkirchen, Germany; ²Faculty of Civil Engineering, Environmental and Infrastructure, RWTH Aachen University, Aachen, Germany

Contributions: (I) Conception and design: H Lutfi; (II) Administrative support: All authors; (III) Provision of study materials or patients: H Lutfi; (IV) Collection and assembly of data: H Lutfi; (V) Data analysis and interpretation: H Lutfi; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Helana Lutfi, MSc. Faculty European Campus Rottal-Inn, Deggendorf Institute of Technology, Eggenfeldener Str.55, 84347 Pfarrkirchen, Germany. Email: helana.lutfi@th-deg.de.

Background: Heart rate (HR) is an essential indicator of cardiovascular (CV) health and is known to be influenced by unhealthy lifestyle habits such as smoking and the consumption of alcohol and caffeinated drinks. However, the degree to which multiple lifestyle factors affect HR is, to date, unknown. The main goal of this study is to train a machine learning (ML) model to predict future HR over a 10-day period and then to generate lifestyle recommendations based on the predicted HR.

Methods: The proposed system consists of generating synthetic data, building a random forest (RF) ML model to predict HR, and evaluating the model’s performance against real participants’ data.

Results: We applied the system to 25 male participants. The results support the system’s ability to predict HR variations in response to lifestyle factors, including smoking, alcohol, and caffeinated drinks, at different times of the day. The most accurate predictions were observed for baseline HR readings, with a mean absolute error of 0.92 and an R² score of 0.953, meaning that the model captured around 95.3% of the variance in the actual data; the poorest performance was observed at noon.

Conclusions: The study presents the foundation for further studies to investigate the applicability of other ML techniques for predicting HR using synthetic data generation. Additionally, the model can be further explored to improve its accuracy by including additional behavioural factors that influence the HR variability.

Keywords: Heart rate (HR); prediction; random forest (RF); synthetic data; lifestyle

Received: 04 February 2024; Accepted: 14 June 2024; Published online: 30 July 2024.


Highlight box

Key findings

• The study validates the system’s ability to predict heart rate (HR) variations based on lifestyle factors, including smoking, alcohol and caffeinated drinks.

What is known and what is new?

• HR is a vital indicator of cardiovascular health. Few studies have reported the applicability of machine learning to HR prediction using actual datasets.

• The study introduces a new method that uses a random forest regressor model to predict future HR variability over a 10-day period, based on synthetic data generation.

What is the implication, and what should change now?

• The synthetic data generation framework can be used as the foundation to predict HR using several data analytics techniques.

• Additional factors influencing HR variability can be incorporated into the model to improve its accuracy.

Introduction

Cardiovascular disease (CVD) is the main cause of death worldwide (1). The diagnosis of acute myocardial infarction and the prognosis of heart failure are of particular importance, the former being the main cause of mortality and the latter the entity with the highest morbidity and the highest cost for the healthcare system (2). Heart rate (HR) is an essential indicator of cardiovascular health (3). Detecting deviations in HR early makes it easier to identify and manage irregularities in cardiac function (4). Timely identification of heart problems improves patients’ quality of life and increases their survival time (5). A CVD patient’s health is adversely impacted by behavioral risk factors (6). However, to date the degree to which multiple lifestyle factors affect HR has not been investigated (7).

Artificial intelligence (AI) plays a crucial role in identifying CVDs at an earlier stage, providing personalised treatment, and evaluating prognoses in healthcare systems (8). Moreover, AI allows a paradigm shift from the traditional identification of CVD risk factors to personalized prevention strategies tailored to the characteristics of each patient (9). Building on knowledge of the interactions between users and AI, prediction, and therefore prevention, will better involve patients in controlling their own health (10).

Recent studies conducted between 2022 and 2023 have explored the prediction of HR using various machine learning (ML) models. One study reported that the gated recurrent units (GRU) model performed best in predicting HR 5 minutes in advance using the Medical Information Mart for Intensive Care dataset (11). In contrast, another study concluded that the autoregressive integrated moving average (ARIMA) model and linear regression were effective in HR prediction for durations longer than 1 minute using accelerometer-derived datasets (12). Support vector machine (SVM) was identified as the top-performing model for HR prediction based on electrocardiogram (ECG) datasets in studies conducted in Canada, achieving an accuracy of 91.8% (13). Additionally, one study highlighted the predictive capabilities of long short-term memory (LSTM) and bidirectional LSTM (BiLSTM) models in forecasting resting HR during sleep, using datasets obtained from HR acquisition equipment; with these models, the root mean square error (RMSE) was less than 2 beats per minute (bpm) (14).

On the other hand, a study comparing various ML techniques for predicting HR has highlighted that the random forest (RF) regressor improves HR predictions when applied to HR data collected over an extended time period (15). Supporting findings have substantiated the efficacy of the RF regressor algorithm as the most suitable predictive model, owing to its demonstrated advantages in mitigating overfitting, its efficiency in training and its adaptability to new problems (16). These results were supported by a literature review, which reported that the RF regressor yields better predictions in healthcare applications than alternative ML algorithms in terms of accuracy, sensitivity and specificity (17). Based on these collective insights, the RF regressor algorithm was selected in this study for HR prediction.

To the best of our knowledge, none of the existing literature on HR prediction has used an RF regressor to predict future HRs from a synthetic dataset. For this reason, in this study we propose a new method that uses an RF regressor algorithm to estimate HR over a span of 10 days based on generated synthetic data. The objectives of this study are therefore to (I) generate a synthetic dataset based on evidence-based literature and established guidelines, (II) build an RF ML predictive model to capture future HRs over a 10-day period based on input features, (III) use the predicted HR values to represent the average HR for baseline and for three time periods: morning, noon and evening, and (IV) use the predicted average HRs to enable activity and lifestyle recommendations.

Methods

System architecture

In the proposed framework, an RF regressor trained on synthetically generated data is used to predict future HR time-series readings over a 10-day period. The 10-day predicted HR values are then used to present the average HR for a baseline and for three time periods: morning, noon and evening. Based on the predicted average HR values, recommendations on lifestyle and activities are proposed. Figure 1 describes the architecture of the proposed system. It is structured into five major stages: generating synthetic data, data pre-processing, data modelling, model training and evaluation.

Figure 1 System architecture of the proposed heart rate prediction model. MAE, mean absolute error; MAPE, mean absolute percentage error; MSE, mean squared error; RMSE, root mean square error.

Firstly, the RF regressor model was trained to make predictions about future HR using the synthetic datasets generated for the project. The synthetic dataset included vital, behavioural and contextual information. The input data for the RF model included vital signs (resting HR ranging from 66 to 71 bpm), behavioural factors (the daily consumption of cigarettes, beer cans, energy drink cans and coffee cups), contextual information (time of HR measurement and age), and the rise in HR after habit intake. The data were entered as features. Feature engineering was performed to enhance the performance and predictive power of the ML algorithm. The RF model was then trained using the available features and HR readings to predict HR readings for a duration of 10 days, around the clock, over four time periods: morning, noon, evening, and sleeping time. Additionally, the model aims to predict the average HR over 10 days at baseline and in the morning, noon, and evening periods. The optimized hyperparameters were applied to the model, and its performance was evaluated.

Generating synthetic data

Information on the study variables of interest was gathered from German and European guidance published by the German Federal Centre for Health Education (Bundeszentrale für gesundheitliche Aufklärung, BZgA), the European Food Safety Authority (EFSA), the German Cancer Research Center (Deutsches Krebsforschungszentrum, DKFZ) and the Federal Ministry of Health (Bundesministerium für Gesundheit). Supporting international guidelines included those of the National Health Service (NHS) and the World Health Organization (WHO). The search was guided by two questions: (I) what are the risk factors that affect HR? and (II) how do these risk factors affect HR?

A semi-structured interview with open-ended questions was additionally conducted with a specialist pulmonologist who works at Klinik Schillerhöhe GmbH, a general hospital in Gerlingen, a town in the district of Ludwigsburg, Baden-Württemberg, Germany. The interview was conducted to establish (I) what resting HR is and (II) when HR measurements become concerning in relation to the lifestyle factors (Table 1).

Table 1

Pulmonologist insights: heart rate explored via open-ended interview

Question	Answer
What is resting heart rate?	“Resting heart rate is an important biomarker for cardiovascular health, it represents heart rate beats at rest. Average heart rate for adult’s ranges from 60 to 100 beats per minute. When resting heart rate either too high or too low, it might be a sign of a medical concern.”
When do heart rate measurements become concerning during the lifestyle factors?	“When the heart pumps too fast, it may not pump enough blood to the rest of the body. Organs and tissues may not get enough oxygen resulting in hypoxia. Lowering heart rate requires lifestyle changes such as quitting smoking, controlling alcohol consumption per day and maintaining healthy lifestyle.”

To generate a synthetic dataset and derive insights, mathematical or logical operations were applied to the collected information, which was based on evidence-based literature and established guidelines. Attributes and other measurable variables relevant to the dataset were identified. These variables provide a comprehensive overview of the factors considered in the study, ranging from demographic characteristics such as age to lifestyle factors such as smoking and alcohol consumption, all of which can influence HR fluctuations. Maximum and minimum values were also assigned to each variable, as listed in Table 2.

Table 2

Study variables with Max. and Min. values

Variables	Min. and Max.
Age (years)	18 to 65+
Smoking (number of cigarettes per day)	0–20
Alcohol consumption (number of beer cans per day)	0–2
Energy drinking (number of energy cans per day)	0–3
Coffee drinking (number of cups per day)	0–4
Rest HR (bpm)	66–71
Time of HR measurement (24 h time interval)	1–3
Rising in HR (bpm)	–

HR, heart rate.

To provide a comprehensive picture of HR fluctuations over an extended period of time, the series of HR readings was evenly spaced over four time periods: morning (8:00–10:00), noon (13:00–17:00), evening (18:00–22:00), and sleeping time (23:00–7:00).

To reflect real-world data, a variable or fluctuating habit consumption model was created to indicate that habit consumption is subject to change and can vary from day to day (Table 3). Thus, the habit rate (R) taking place at a particular time was calculated according to the equation:

$R = H_{\text{peak}} / H_{\max}$ [1]

Table 3

Habit rate (R) at specific time

Time	Cigarettes (R)	Beer cans (R)	Energy drinks (R)	Coffee cups (R)
Morning	0.3	0	0	0.5
Noon	0.2	0	0.66	0.2
Evening	0.5	1.0	0.34	0.3

R represents the habit rate; H_peak is the peak habit consumption at a daytime; H_max is maximum daily limit of the habit.

Table 4 shows the peak consumption of the different habits (smoking, coffee, energy drinks, and alcohol) at different times of the day (morning, noon, and evening). The model in Table 3 was created based on a total habit duration of 10 hours: morning (2 hours), noon (4 hours), and evening (4 hours).

Table 4

Peak habit consumption throughout the day

Habit	Morning	Noon	Evening
Smoking	6 cigs.	4 cigs.	10 cigs.
Coffee	2 cups	1 cup	1 cup
Energy drinks	−	2 cans	1 can
Alcohol	−	−	2 cans

−, none; cigs., cigarettes.

Habit consumption in the morning, at noon and in the evening (the number of cigarettes, beer cans, energy drink cans or coffee cups) depends on the habit rate (R) at that time and on the average number of habit intakes per day. For example, a man who smokes 10 cigarettes per day on average consumes 3 cigarettes in the morning, 2 at noon, and 5 in the evening. This follows from the equation:

$C = R \times N$ [2]

C represents habit consumption; R represents the habit rate; N represents the average number of habit intake per day.
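As a minimal sketch of how Equations [1] and [2] combine, the snippet below reproduces the cigarette example from the text. The peak values come from Table 4 and the daily maximum of 20 cigarettes from Table 2; the function and variable names are illustrative assumptions, not taken from the study's code.

```python
# Peak cigarette consumption per time of day (Table 4) and the daily maximum (Table 2).
PEAK_CIGARETTES = {"morning": 6, "noon": 4, "evening": 10}
MAX_CIGARETTES_PER_DAY = 20


def habit_rate(h_peak, h_max):
    """Equation [1]: R = H_peak / H_max."""
    return h_peak / h_max


def habit_consumption(rate, daily_average):
    """Equation [2]: C = R x N."""
    return rate * daily_average


# Habit rate for each time of day, then consumption for a 10-cigarette-per-day smoker.
rates = {t: habit_rate(p, MAX_CIGARETTES_PER_DAY) for t, p in PEAK_CIGARETTES.items()}
consumption = {t: round(habit_consumption(r, 10), 2) for t, r in rates.items()}

print(rates)        # rates of 0.3, 0.2 and 0.5, matching the cigarette column of Table 3
print(consumption)  # 3, 2 and 5 cigarettes for morning, noon and evening
```

For the other habits, the rates listed in Table 3 can be passed to `habit_consumption` directly.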

For the number of cigarettes smoked per day over the 10-day period, values were drawn from a uniform probability distribution between n−3 and n+3, where n is the average daily intake. For the number of beer cans, energy drink cans and coffee cups per day over the same period, the values n and n−1 were assigned equal probability.
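The day-to-day variation described above could be sampled roughly as follows. This is a sketch under stated assumptions: draws are treated as integers, the cigarette count is clipped to the 0–20 range of Table 2, drink counts are kept non-negative, and the seed is fixed only for reproducibility. None of these details are specified in the text.

```python
import random


def sample_cigarettes(n, rng, low=0, high=20):
    """Draw a daily count uniformly from n-3..n+3, clipped to the Table 2 range (assumption)."""
    value = rng.randint(n - 3, n + 3)  # randint includes both endpoints
    return max(low, min(high, value))


def sample_drinks(n, rng):
    """Draw n or n-1 with equal probability, never below zero (assumption)."""
    return max(0, rng.choice([n, n - 1]))


rng = random.Random(42)  # fixed seed, chosen only so the example is repeatable
cigarettes_10_days = [sample_cigarettes(10, rng) for _ in range(10)]
coffee_10_days = [sample_drinks(2, rng) for _ in range(10)]

print(cigarettes_10_days)  # ten values between 7 and 13
print(coffee_10_days)      # ten values, each 1 or 2
```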

Over a period of 10 days, the average HR of the predictions is shown at four different times: baseline, morning, noon, and evening. This step involved the subsequent equation:

$\text{Average HR} = \dfrac{\sum \text{readings at a specific time}}{10\ \text{days}}$ [3]
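Equation [3] is an arithmetic mean over the 10 daily readings taken at one time of day. A short illustration follows; the readings are invented for the example and are not study data.

```python
def average_hr(readings):
    """Equation [3]: sum of the readings at one time of day divided by the number of days."""
    return sum(readings) / len(readings)


# Hypothetical evening readings (bpm) for 10 days, for illustration only.
evening_readings = [76, 74, 75, 77, 75, 73, 76, 75, 74, 75]

print(average_hr(evening_readings))  # 75.0
```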

Linear interpolation was used to estimate how long it takes for HR to return to its resting value after smoking cigarettes or consuming beer, coffee, or energy drinks. It was also used to simulate the HR value during the time between two events. The simulated effect depends on the individual’s HR at rest and on the intensity or frequency of habit intake per day.

A study on the acute effects of daily nicotine intake on HR reported the course of HR following smoking (18). The interpolated values in Table 5 show the relationship between the number of cigarettes smoked per day and the time, in hours, that it took for HR to return to its resting level.

Table 5

Recovery time for the heart rate after daily smoking

HR_rest	1–3 cigs.	4–8 cigs.	9–14 cigs.	15–20 cigs.
66 bpm	1.5 hours	5 hours	11.5 hours	13 hours
67 bpm	1 hour	4 hours	10.5 hours	12 hours
68 bpm	0.75 hours	3.5 hours	7 hours	11 hours
69 bpm	0.75 hours	2.75 hours	6 hours	10 hours
70 bpm	0.5 hours	2.5 hours	5.5 hours	8 hours
71 bpm	0.5 hours	2.5 hours	5 hours	7 hours

HR_rest, rest heart rate; cigs., cigarettes.
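One way to make Table 5 usable in a data generator is a plain lookup keyed by resting HR and cigarette bracket, sketched below. The function name and the choice to return zero hours for non-smokers are assumptions made for the example.

```python
# Recovery time in hours (Table 5), indexed by resting HR and cigarette bracket.
BRACKETS = [(1, 3), (4, 8), (9, 14), (15, 20)]
RECOVERY_HOURS = {
    66: [1.5, 5, 11.5, 13],
    67: [1, 4, 10.5, 12],
    68: [0.75, 3.5, 7, 11],
    69: [0.75, 2.75, 6, 10],
    70: [0.5, 2.5, 5.5, 8],
    71: [0.5, 2.5, 5, 7],
}


def smoking_recovery_hours(hr_rest, cigarettes_per_day):
    """Return the Table 5 recovery time for a resting HR (66-71 bpm) and a daily cigarette count."""
    if cigarettes_per_day == 0:
        return 0.0  # assumption: no smoking means no recovery period
    for (low, high), hours in zip(BRACKETS, RECOVERY_HOURS[hr_rest]):
        if low <= cigarettes_per_day <= high:
            return float(hours)
    raise ValueError("cigarettes_per_day is outside the 0-20 range of Table 2")


print(smoking_recovery_hours(67, 10))  # 10.5
```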

Physiological and pathological effects of alcohol on the cardiovascular system have been described in a study of alcohol intake (19). The interpolated values in Table 6 represent the time, in hours, that it took for HR to return to its resting level after consuming 1 or 2 cans of beer per day. Beer consumption was standardised using a can size of 341 mL with 5% alcohol by volume (ABV).

Table 6

Recovery time for heart rate after daily alcohol consumption

HR_rest	1 can	2 cans
66 bpm	2.5 hours	5.5 hours
67 bpm	2.5 hours	5.0 hours
68 bpm	2.5 hours	4.5 hours
69 bpm	2.5 hours	4.25 hours
70 bpm	2.5 hours	4.0 hours
71 bpm	2.5 hours	4.0 hours

HR_rest, rest heart rate.

A study on energy drinks and their acute effects on heart rhythm and electrocardiographic time intervals analysed the separate effects of different energy drinks at each interval to observe the mean HR variations at different time periods (20). The interpolated values in Table 7 represent the time that it took for HR to return to its resting level following consumption of 1 to 3 cans of energy drink per day. For energy drink consumption, we considered a standard 250 mL can containing 80 mg of caffeine.

Table 7

Recovery time for the heart rate after daily energy drink

HR_rest	1 can	2 cans	3 cans
66–71 bpm	1 hour	1.75 hours	2.75 hours

HR_rest, rest heart rate.

A study on the effect of caffeine intake on blood pressure and HR variability identified the cardiovascular behaviour at rest, before and after caffeine or placebo ingestion (21). Table 8 shows interpolated data representing the time, in hours, that it took for HR to return to its resting level after consuming 1 to 4 cups of coffee per day. For coffee intake, we used an instant coffee measurement, following the instructions on the glass can, where 1.5 spoons of coffee in 150 mL of water constituted one cup, with each cup containing approximately 27 mg of caffeine.

Table 8

Recovery time for the heart rate after the daily coffee intake

HR_rest	1 cup	2 cups	3 cups	4 cups
66–71 bpm	0.75 hours	1.5 hours	2.25 hours	3.0 hours

HR_rest, rest heart rate.

Additionally, the HR readings were influenced by the increase in bpm upon habit consumption. When consuming caffeine, HR can rise by 3 bpm (22). The German Cancer Research Center (Deutsches Krebsforschungszentrum, DKFZ) reported that the nicotine in cigarette smoke increases HR by 10 to 20 bpm; accordingly, each cigarette was taken to correspond to an increase of 1 bpm (23). The German Federal Centre for Health Education (Bundeszentrale für gesundheitliche Aufklärung, BZgA) indicated that a 341 mL can of beer with 5% alcohol contains 17.05 mL of alcohol and that every 17.05 mL of alcohol increases HR by 5 bpm. Therefore, a 500 mL can of beer contains 25 mL of alcohol and would increase HR by about 7 bpm (24).

The following equation quantifies how the heart responds to habit consumption, in bpm above the resting level, by adding the increase in bpm due to habit consumption to the HR at rest:

$\text{HR upon habit consumption} = \text{HR}_{\text{rest}} + \text{increase in bpm}$ [4]
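To illustrate how Equation [4] and the recovery times fit together, the sketch below raises HR by a fixed amount at intake and then brings it back to the resting value linearly over the recovery time. The straight-line return is one reading of the linear interpolation described earlier, not a confirmed detail of the study, and the function names are hypothetical. The example uses a resting HR of 67 bpm, the 3 bpm rise reported for caffeine, and the 0.75-hour recovery time for one cup of coffee from Table 8.

```python
def hr_upon_consumption(hr_rest, increase_bpm):
    """Equation [4]: HR upon habit consumption = HR_rest + increase in bpm."""
    return hr_rest + increase_bpm


def hr_during_recovery(hr_rest, increase_bpm, recovery_hours, hours_since_intake):
    """Linearly interpolate from the elevated HR back to the resting HR (assumed shape)."""
    if hours_since_intake >= recovery_hours:
        return float(hr_rest)  # fully recovered
    fraction_left = 1 - hours_since_intake / recovery_hours
    return hr_rest + increase_bpm * fraction_left


print(hr_upon_consumption(67, 3))            # 70
print(hr_during_recovery(67, 3, 0.75, 0.375))  # 68.5, halfway through recovery
print(hr_during_recovery(67, 3, 0.75, 1.0))    # 67.0, back at rest
```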

The lifestyle recommendations were structured based on a report released by FAU University Press in Erlangen, Germany (25). The recommendations were determined by two main factors: age and predicted evening HR. Age was categorized into three distinct age groups, and the predicted evening HR was categorized into HR ranges. The combination of age group and HR range guided the formulation of personalized lifestyle recommendations. For each combination, an output was provided on recommended activities and lifestyle, as depicted in Table 9. In this table, the activities are classified into three categories, denoted A, B, and C. Category A encompasses activities such as brisk walking, power walking, riding a bike on level ground or hills, playing doubles tennis, and engaging in water aerobics. Category B includes activities such as jogging, cycling, swimming, and participating in sport games such as basketball and soccer. Finally, Category C includes brisk walking, power walking, gardening, household chores, walking domestic animals, riding a bike on level ground or hills, and active involvement in sports with children. The lifestyle recommendations also included gradually reducing unhealthy habits.

Table 9

Recommended activities based on age and HR ranges

HR range (bpm)	Age 18 to 39 years	Age 40 to 64 years	Age ≥65 years
66 to 74	A + B	B	C
75 to 79
80 to 84
85 to 89
≥90

A, activities such as brisk walking, power walking, riding a bike on level ground or hills, playing doubles tennis, and engaging in water aerobics; B, activities such as jogging, cycling, swimming, and participating in sport games such as basketball and soccer; C, brisk walking, power walking, gardening, household chores, walking domestic animals, riding a bike on level ground or hills, and active involvement in sports with children. HR, heart rate.

Predictive modelling using RF

In this study, RF regression was used to predict HR values over a 10-day period using the ‘RandomForestRegressor’ class from the Scikit-learn library in a Python environment. The RF model was trained on a dataset of 15,000 synthetically generated HR records across morning, noon and evening. The input features for the RF model were ‘age’, ‘smoking’, ‘alcohol_consumption’, ‘energy_drinking’, ‘coffee_drinking’, ‘HR_rest’, ‘time (T) of HR measurement’, ‘rising in HR (bpm) during activity morning’, ‘rising in HR (bpm) during activity noon’, and ‘rising in HR (bpm) during activity evening’. During data preprocessing, the ‘LabelEncoder’ class was used to encode categorical data as numerical values to prepare the dataset for training. The RF model’s output provides estimates of the rise in HR for morning, noon, and evening times over a 10-day period. We also employed a grid search to fine-tune the hyperparameters, setting ‘n_estimators’ to [100], ‘max_depth’ to [8], and ‘max_features’ to [‘auto’]. Finally, the trained models were saved using joblib.

RF is an ensemble learning method that constructs multiple decision trees during training. The RF algorithm was employed to aggregate predictions from numerous weak learners (decision trees) into a robust learner with improved generalization performance (26). For regression, the aggregated prediction is the mean of the predictions of the individual trees.
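The training pipeline described above can be sketched with scikit-learn as follows. This is an illustrative outline, not the authors' code: the data are random placeholders drawn within the Table 2 ranges, the target is an arbitrary toy function, the feature set is shortened, and the file name is hypothetical. Recent scikit-learn releases no longer accept `max_features='auto'`; for regressors it meant "use all features", so the sketch uses the equivalent value `1.0`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 300  # small placeholder sample; the study used 15,000 synthetic records

# Placeholder features drawn within the Table 2 ranges (not the study's data).
age = rng.integers(18, 66, n)
cigarettes = rng.integers(0, 21, n)
beer = rng.integers(0, 3, n)
energy = rng.integers(0, 4, n)
coffee = rng.integers(0, 5, n)
hr_rest = rng.integers(66, 72, n)
time_label = rng.choice(["morning", "noon", "evening"], n)

# Encode the categorical time of measurement as integers.
time_code = LabelEncoder().fit_transform(time_label)

X = np.column_stack([age, cigarettes, beer, energy, coffee, hr_rest, time_code])
# Toy target: an arbitrary rise above rest, only so the model has something to fit.
y = hr_rest + 0.5 * cigarettes + 3 * (coffee > 0) + rng.normal(0, 1, n)

# Single-value grid mirroring the reported hyperparameters.
param_grid = {"n_estimators": [100], "max_depth": [8], "max_features": [1.0]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)

# Save the fitted model with joblib and reload it for prediction.
model_path = os.path.join(tempfile.mkdtemp(), "rf_hr.joblib")  # hypothetical file name
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
predictions = reloaded.predict(X[:5])
print(predictions)
```

With a single value per hyperparameter, the grid search only cross-validates one configuration; a wider grid would be needed for actual tuning.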

Performance evaluation of the RF model

Mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), RMSE, and the R-squared (R²) score were used to evaluate the performance of the RF regression model, that is, to measure how well its predictions align with the ground-truth pulse oximeter readings of real participants. Figure 2 gives an overview of the stepwise process of the performance evaluation.

Figure 2 Stepwise process overview of the performance evaluation. ML, machine learning; MAE, mean absolute error; MAPE, mean absolute percentage error; MSE, mean squared error; RMSE, root mean square error.
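For reference, the five evaluation metrics named above can be computed from paired actual and predicted values as in the sketch below. It uses the standard textbook definitions; the sample values are invented for the example and are not study data.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, MAPE (in %) and R² for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                     # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Hypothetical average HR values (bpm) for four subjects, for illustration only.
actual_hr = [70, 72, 75, 78]
predicted_hr = [71, 72, 74, 80]
metrics = regression_metrics(actual_hr, predicted_hr)
print(metrics)  # MAE is 1.0 and MSE is 1.5 for these values
```

Because RMSE is the square root of MSE, the two columns of a results table can be checked against each other.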

The actual values were obtained from real participants using a wireless pulse oximeter (iHealth) provided by the digital health laboratory at Deggendorf Institute of Technology. The device displays the pulse rate on a large digital LED screen and in a smartphone application and can store up to 100 measurements per user (27). Before testing, the participants were advised that, for accurate measurements, they should remain seated, standing, or lying down, without walking or excessive finger movement, while using the device on any finger except the thumb, preferably the middle or index finger. They were also advised to wash their hands before use and to remove any nail polish, especially dark shades, to ensure measurement accuracy.

A total of 25 participants with lifestyle habits were included in the study. Participants were recruited through convenience sampling. The inclusion criteria for this study were as follows: (I) men aged 18 to 65+ years, (II) self-reported lifestyle habits such as smoking or drinking coffee, energy drinks, or alcohol, (III) an above-average resting HR of 66 to 71 bpm, and (IV) willingness to provide informed consent. Additionally, the following exclusion criteria were applied: (I) subjects with significant medical conditions that could interfere with the study objectives, (II) subjects wearing nail polish or artificial nails, and (III) subjects who were unable to remain still during the measurement period. Figure 3 shows the participant recruitment process.

Figure 3 Flowchart of participant recruitment.

Two self-administered paper-based questionnaires were distributed face to face. The first questionnaire collected data such as age and the daily consumption of behavioural lifestyle factors, including the number of cigarettes smoked, beer cans, coffee cups, and energy drinks. These data were collected once, before the HR readings were taken. The second questionnaire was designed to collect the recorded HR measurements for 10 days. The majority of subjects were given a pulse oximeter for 10 days to record their pulse rate readings. Each subject was instructed to measure HR four times daily for 10 consecutive days. The baseline measurement was assigned to 7:00, whereas the other measurements were evenly spaced over three time periods: morning (8:00–10:00), noon (13:00–17:00), and evening (18:00–22:00). The data collection period extended over 40 days, from 5 June to 14 July 2023.

Data from the 25 subjects were used to evaluate the HR prediction model. Each subject’s lifestyle habits were used as input data for the trained RF model to predict future HR time-series readings for a 10-day period, and the predicted HR values were used to represent the average HR for baseline and for three time periods: morning, noon and evening. MAE, MAPE, MSE, RMSE, and the R² score were used as the evaluation metrics.

Statistical analysis

Time-series analysis was employed to provide insights into HR fluctuations throughout a 10-day period, showing how the selected lifestyle habits influenced these fluctuations. Moreover, residual analysis using a scatter plot was used to compare the actual data with the data predicted by the RF model for the 25 subjects in the study.

Results

The study used an RF regression model trained on synthetically generated data to predict HR over the following 10 days, and used the predicted HR values to represent the average HR for baseline and for three time periods: morning, noon and evening. To this end, we generated synthetic data to train the model and then conducted performance testing, which compared the actual data of 25 subjects over a span of 10 days with the data predicted by the model.

Figure 4 depicts a line plot illustrating the fluctuations in HR across various times of the day over a span of 10 days. The figure shows an example of the model outcome based on the synthetic dataset only, and highlights how these fluctuations align with the predicted average HR of 75 bpm for the same 10-day period. The data presented are an example of predicted HR readings generated by the RF regression model using the synthetic dataset. These predictions correspond to the following input features: an age of 31 years and a resting HR of 67 bpm, with a daily consumption of 10 cigarettes, 1 beer, 2 energy drinks, and 2 cups of coffee. Additionally, the average predicted HR values over the 10-day period are shown at four distinct time points: baseline, morning, noon, and evening. An example of an activity and lifestyle recommendation is also shown in Figure 4.

Figure 4 Assembled example of synthetic generated HR readings over time with average HR at baseline, morning, noon and evening with activities and lifestyle recommendations. HR, heart rate.

Twenty-five subjects were included in this study. Each participant took measurements four times a day for 10 days. The initial measurement was at 7:00, and the subsequent ones were evenly spaced over the morning, noon, and evening time periods. The information gathered from the subjects, including age, lifestyle habits, and baseline values, was used as input for the RF model. The RF model predicts the HR values for a duration of 10 days, around the clock, over four time periods: morning, noon, evening, and sleeping time. It also predicts the average HR over the 10 days and the average HR values at specific hours across the entire 10-day duration: baseline at 7:00, morning at 8:00, noon at 13:00, and evening at 22:00.

Figures 5 to 7 illustrate the comparison between the actual data measured by the pulse oximeter and the HR readings predicted by the model over a span of 10 days. Figure 5 shows a 42-year-old male with a daily consumption of approximately 10 cigarettes, 1 beer, 0 energy drinks, and 3 cups of coffee. Figure 6 represents a 54-year-old male with an average consumption of 12 cigarettes and 3 cups of coffee daily. Figure 7 shows a 26-year-old male with an average consumption of 20 cigarettes, 1 beer, and 3 cups of coffee daily. Each data point represents an HR measurement taken at one of four times of day: 7:00, 8:00, 13:00, and 22:00. The actual data points are derived from measurements observed with a pulse oximeter, while the predicted data points are generated by the RF predictive model. This comparison offers insight into the predictive model’s accuracy in estimating HR fluctuations. We selected these participants to exemplify the comparison between actual and predicted HR readings across 10 days because their profiles exhibit the most pronounced variations, primarily attributed to their significant cigarette consumption. The other trial results, omitted here, displayed less variability in HR data and therefore show the differences between the actual data and the data predicted by the RF model less clearly.

Figure 5 Comparison of actual versus predicted data across 10 days for a 42-year-old male who consumes an average of 10 cigarettes, 1 beer, 0 energy drinks, and 3 cups of coffee daily.

Figure 6 Comparison of actual versus predicted data across 10 days for a 54-year-old male who consumes an average of 12 cigarettes, 0 beer, 0 energy drinks, and 3 cups of coffee daily.

Figure 7 Comparison of actual versus predicted data across 10 days for a 26-year-old male who consumes an average of 20 cigarettes, 1 beer, 0 energy drinks, and 3 cups of coffee daily.

The scatter plot in Figure 8 shows how well the data predicted by the model fit the actual data. The ‘Perfect Prediction Line’ acts as a reference, highlighting disparities between the actual and predicted average HR values over a 10-day period for baseline and for three time periods (morning, noon, and evening) for the 25 subjects. The plot shows that the predicted and actual values are in reasonably close agreement.

Figure 8 Model prediction versus actual data alignment. (A) Highlights variations at baseline and time periods including morning, noon, and evening; (B) actual versus predicted average heart rate values for a 10-day period for 25 subjects.

Table 10 reports the R² score, MSE, MAPE, RMSE, and MAE used to evaluate the performance of the model. The results indicate that the baseline time period demonstrates the best performance across the key metrics, with lower MAE, MSE, and RMSE and a higher R² score, whereas the noon time period is the poorest in terms of MAE and R² score. For instance, the baseline predictions have an MAE of 0.92, compared with an MAE of 1.30 for the noon predictions. This indicates that, on average, the baseline predictions deviate less from the actual HR values. Similarly, the baseline MSE and RMSE values of 1.08 and 1.04, respectively, are substantially lower than the noon MSE of 4.84 and RMSE of 2.20, indicating a better overall fit to the data. Furthermore, the baseline predictions achieve an R² score of 0.953, meaning that around 95.3% of the variance in the actual data is captured by the model, compared with an R² score of 0.674 (67.4% of the variance) at noon.

Table 10

Prediction error in different time periods for the average heart rate across 10 days using random forest

Time period	Mean absolute error	Mean square error	Root mean square error	Mean absolute percentage error (%)	R² score
Baseline	0.92	1.08	1.04	1.34	0.953
Morning	1.08	1.88	1.37	1.64	0.857
Noon	1.30	4.84	2.20	1.41	0.674
Evening	1.20	6.80	2.61	1.40	0.896
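
The five metrics in Table 10 can be computed from paired actual and predicted readings as in the following minimal sketch. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the study; the sketch assumes the standard textbook definitions of each metric, which the paper does not spell out here.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, MAPE (%) and R^2 for paired readings."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mse = sum(e * e for e in errors) / n       # mean square error
    rmse = math.sqrt(mse)                      # root mean square error
    # Mean absolute percentage error, relative to the actual value
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape, "r2": r2}


# Hypothetical average HR values in bpm (illustrative only, not study data)
actual = [72.0, 68.0, 80.0, 75.0, 70.0]
predicted = [71.0, 69.0, 78.0, 75.0, 71.0]
metrics = regression_metrics(actual, predicted)
```

Because RMSE is the square root of MSE, the two columns in Table 10 can be cross-checked against each other (for example, the square root of the baseline MSE of 1.08 is approximately 1.04).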

Discussion

In this study, we used synthetic data generation for prediction, employing an RF regressor to predict the future HR time series over a 10-day period. HR data obtained this way may not be as accurate as HR data recorded by an ECG, but synthetic data generation is a viable alternative that allows data to be shared and used where privacy, safety, and regulatory concerns apply (28). Performance testing of the model compared the actual HR data recorded with a pulse oximeter from 25 subjects over a span of 10 days with the HR readings predicted by the model.
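
The general workflow described above (generate synthetic records, train an RF regressor on them, then score predictions) can be sketched as follows. This is an illustration under stated assumptions, not the authors' generator: the feature ranges, the linear generating rule, and all coefficients are placeholders invented for the example, and scikit-learn is assumed as the RF implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000

# Hypothetical daily lifestyle features; the ranges are illustrative only
cigarettes = rng.integers(0, 21, size=n)
beers = rng.integers(0, 4, size=n)
energy_drinks = rng.integers(0, 3, size=n)
coffees = rng.integers(0, 5, size=n)
X = np.column_stack([cigarettes, beers, energy_drinks, coffees])

# Made-up generating rule: a resting HR plus per-unit increments and noise.
# These coefficients are placeholders, NOT the values used in the study.
hr = 65 + 0.5 * cigarettes + 1.0 * beers + 1.5 * energy_drinks + 0.8 * coffees
hr = hr + rng.normal(0.0, 1.0, size=n)

# Train on the first 80% of the synthetic rows, evaluate on the remainder
split = int(0.8 * n)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], hr[:split])

predicted = model.predict(X[split:])
mae = mean_absolute_error(hr[split:], predicted)
```

In the study itself the trained model is evaluated against real pulse oximeter readings rather than held-out synthetic rows; the held-out split here only keeps the sketch self-contained.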

HR variability is influenced by various factors. Alcohol consumption, smoking, coffee, and energy drink consumption were selected as the key factors in this study for two reasons. First, previous studies have indicated that these factors have the most deleterious effect on HR variability, affecting short- and long-term health, whereas factors such as exercise have a beneficial effect on HR variability; changes in the intake of the selected factors can therefore produce an apparent alteration in HR variability (29,30). Second, previous findings have reported that behavioural risk factors contribute most significantly to worsening well-being compared with unmodifiable factors (12). The selected factors are behavioural factors that individuals can actively change and control. The proposed model therefore provides a foundation for predicting future HR based on the factors with the most significant effect on HR.

The study showed that the model performed best when predicting HR at baseline, with substantially lower error metrics (MAE, MSE, and RMSE) and a higher R² score than for the noon time period. Notably, the baseline predictions exhibit minimal deviation from the actual HR values, as reflected in the lower MAE. However, MSE increases during the morning, noon, and evening periods, indicating less accurate predictions at these times. Previous studies used various real datasets to develop ML models for predicting HR, and different ML models yielded varying predictive performance. For instance, the ARIMA model and linear regression emerged as the optimal choices in these studies (12,13), but only for predicting HR over recording durations of 1 minute and 30 minutes. A comparison of the RF regressor in this study with the corresponding study (12), which evaluated the performance of various ML techniques in the predictive analysis of HR, indicates that the developed technique performs better than the techniques used in that study, as shown by its lower MAE for predicting HR. Results by Bashar et al. (15) likewise demonstrated the efficacy of the RF regression algorithm in estimating HR from photoplethysmography (PPG), owing to the lower error rate achieved.
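
One pattern in Table 10 deserves a short worked calculation: the evening period has the highest MSE (6.80) yet a higher R² score (0.896) than noon (0.674). Assuming R² was computed as the usual coefficient of determination on the same values as the MSE (an assumption; the paper does not state this), R² = 1 − MSE / variance of the actual values, so the variance implied by each row can be recovered. The helper name `implied_variance` is hypothetical.

```python
# (MSE, R^2) per time period, copied from Table 10
table_10 = {
    "Baseline": (1.08, 0.953),
    "Morning": (1.88, 0.857),
    "Noon": (4.84, 0.674),
    "Evening": (6.80, 0.896),
}


def implied_variance(mse, r2):
    """Variance of the actual values implied by R^2 = 1 - MSE / variance."""
    return mse / (1.0 - r2)


implied = {
    period: implied_variance(mse, r2)
    for period, (mse, r2) in table_10.items()
}
```

Under that assumption the implied variances are roughly 23, 13, 15, and 65 bpm² for baseline, morning, noon, and evening. The evening readings would then be far more spread out than the noon readings, which would explain how a larger absolute error can coexist with a higher share of explained variance.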

This study proposes a new method of predicting HR based on synthetic data generation. The work was challenged by the limited literature available on the use of synthetic data for training ML algorithms, especially within the healthcare domain. The paper therefore contributes to the growing field of prediction research and algorithms in healthcare.

Conclusions

This study addressed the use of ML to predict HR across 10 days using synthetic data generation. We used an RF regressor to predict the future HR time series and to enable lifestyle and activity recommendations. The model was evaluated using the R² score, MSE, MAPE, RMSE, and MAE for the baseline, morning, noon, and evening time periods. The model performed best when predicting HR at baseline and poorest for the noon predictions. The results have two major implications for future research. First, the synthetic data generation framework can serve as the foundation for predicting HR using several data analytics techniques. Second, the accuracy of the model can be improved further by including additional factors that influence HR variability.

Acknowledgments

The first author thanks the supervisor, Prof. Dr. Thomas Spittler, for his guidance, constructive comments, and fruitful discussions, and for coordinating and supervising the research. Additional thanks go to Dr. Ibrahim for his contribution to the design of the paper.

Funding: None.

Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-35/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-35/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study does not involve any human experiment; thus, IRB approval and informed consent are waived.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Stark K, Massberg S. Interplay between inflammation and thrombosis in cardiovascular pathology. Nat Rev Cardiol 2021;18:666-82.
2. Barh D. Artificial Intelligence in Precision Health. Academic Press; 2020.
3. Saxena A, Minton D, Lee DC, et al. Protective role of resting heart rate on all-cause and cardiovascular disease mortality. Mayo Clin Proc 2013;88:1420-6.
4. Stegmann T, Koehler K, Schulze M, et al. Early detection of atrial fibrillation in patients with heart failure reduces the risk of subsequent hospitalization: a subanalysis of the randomized TIM-HF2 trial. Eur Heart J Digit Health 2022;3:218-27.
5. Sadasivuni KK. Predicting heart failure: invasive, non-invasive, machine learning and artificial intelligence based methods. Hoboken, NJ: John Wiley & Sons; 2022.
6. Sabzmakan L, Morowatisharifabad MA, Mohammadi E, et al. Behavioral determinants of cardiovascular diseases risk factors: A qualitative directed content analysis. ARYA Atheroscler 2014;10:71-81.
7. Luo M, Wu K. Heart rate prediction model based on neural network. IOP Conf Ser: Mater Sci Eng 2020;715:012060.
8. Moshawrab M, Adda M, Bouzouane A, et al. Cardiovascular Events Prediction using Artificial Intelligence Models and Heart Rate Variability. Procedia Computer Science 2022;203:231-8.
9. Berrouiguet S, Barrigón ML, Castroman JL, et al. Combining mobile-health (mHealth) and artificial intelligence (AI) methods to avoid suicide attempts: the Smartcrises study protocol. BMC Psychiatry 2019;19:277.
10. Nordlinger B, Villani C, Rus D. Healthcare and Artificial Intelligence. Springer Nature; 2020.
11. Alharbi A, Alosaimi W, Sahal R. Real-Time System Prediction for Heart Rate Using Deep Learning and Stream Processing Platforms Complexity. Complex 2021;22:1-9.
12. Oyeleye M, Chen T, Titarenko S, et al. A Predictive Analysis of Heart Rates Using Machine Learning Techniques. Int J Environ Res Public Health 2022;19:2417.
13. Dhyani S, Kumar A, Choudhury S. Analysis of ECG-based arrhythmia detection system using machine learning. MethodsX 2023;10:102195.
14. Lin H, Zhang S, Li Q, et al. A new method for heart rate prediction based on LSTM-BiLSTM-Att. Measurement 2023;207:112384.
15. Bashar SS, Miah MdS, Karim AHMZ, et al. A Machine Learning Approach for Heart Rate Estimation from PPG Signal using Random Forest Regression Algorithm [Internet]. IEEE Xplore 2019 [cited 2022 Mar 1]:[1–5p]. Available online: https://ieeexplore.ieee.org/abstract/document/8679356?casa_token=VkdZoX5416wAAAAA:GtPDuas_ssRMXwp53SYvNbqbnG5ejHUWbyj6VKKGI0eDm1epFRlXy1NYECJtV0Dgmci7CwG5rAovIdA
16. Borup D, Christensen BJ, Mühlbach NS, et al. Targeting predictors in random forest regression. International Journal of Forecasting 2023;39:841-68.
17. Pal S. A Comparative Analysis of Machine Learning Algorithms for Predictive Analytics in Healthcare. Heritage Research J 2024;72:10-25.
18. Gajewska M, Worth A, Urani C, et al. The acute effects of daily nicotine intake on heart rate--a toxicokinetic and toxicodynamic modelling study. Regul Toxicol Pharmacol 2014;70:312-24.
19. Kawano Y. Physio-pathological effects of alcohol on the cardiovascular system: its role in hypertension and cardiovascular disease. Hypertens Res 2010;33:181-91.
20. Mandilaras G, Li P, Dalla-Pozza R, et al. Energy Drinks and Their Acute Effects on Heart Rhythm and Electrocardiographic Time Intervals in Healthy Children and Teenagers: A Randomized Trial. Cells 2022;11:498.
21. Costa JB, Anunciação PG, Ruiz RJ, et al. Effect of caffeine intake on blood pressure and heart rate variability after a single bout of aerobic exercise: original research article. International Sportmed Journal 2012;13:109-21.
22. Clark R. How does caffeine affect your heart rate? Health Digest [Internet] [cited 2022 July 17]. Available online: https://www.healthdigest.com/961027/how-does-caffeine-affect-your-heart-rate/
23. DKFZ FZR mukoviszidose [Internet]. Sport und Rauchen-ein Widerspruch! [cited 2022 August 5]. Available online: https://www.dkfz.de/de/tabakkontrolle/download/Publikationen/FzR/FzR_Sport_und Rauchen_ein_Widerspruch.pdf
24. Ärztliches Manual zur Prävention und Behandlung von riskantem, schädlichem und abhängigem Alkoholkonsum bei Patientinnen und Patienten ansprechen. [Internet]. Bundesaerztekammer; 2021 [cited 2023 Aug 6]. Available online: https://www.bundesaerztekammer.de/fileadmin/user_upload/_old-files/downloads/pdf-Ordner/Pressemitteilungen/BZgA_Leitfaden_Alkoholkonsum.pdf
25. Rütten A. National recommendations for physical activity and physical activity promotion. Erlangen Fau University Press; 2016 [cited 2023 May 5]. Available online: https://www.sport.fau.de/files/2015/05/National-Recommendations-for-Physical-Activity-and-Physical-Activity-Promotion.pdf
26. Akinbo RS, Daramola OA. Ensemble Machine Learning Algorithms for Prediction and Classification of Medical Images. In: Sen J, editor. Machine Learning - Algorithms, Models and Applications. Artificial Intelligence. IntechOpen; 2021;67:59-78.
27. iHealth Labs Inc. IHealth Air Pulse Oximeter [Internet]. iHealth Labs Inc [cited 2023 June 9]. Available online: https://ihealthlabs.com/products/ihealth-air-pulse-oximeter
28. Lu Y, Shen M, Wang H, et al. Machine Learning for Synthetic Data Generation: A Review [Internet]. Synthical 2023 [cited 2024 Apr 22]. Available online: https://synthical.com/article/9cc1fcdc-d001-407f-8a92-4aaf51c28fc5
29. Fatisson J, Oswald V, Lalonde F. Influence diagram of physiological and environmental factors affecting heart rate variability: an extended literature overview. Heart Int 2016;11:e32-40.
30. Tiwari R, Kumar R, Malik S, et al. Analysis of Heart Rate Variability and Implication of Different Factors on Heart Rate Variability. Curr Cardiol Rev 2021;17:e160721189770.

doi: 10.21037/jmai-24-35
Cite this article as: Lutfi H, Spittler T, Ibrahim HM. Using synthetic data and machine learning to predict heart rate and enable lifestyle recommendations. J Med Artif Intell 2024;7:25.
