Chapter 4. Simulating Time Series Data

Up to this point, we have discussed where to find time series data and how to process it. Now we will look at how to create time series data via simulation.

Our discussion proceeds in three parts. First, we compare simulations of time series data to other kinds of data simulations, noting what new areas of particular concern emerge when we have to account for time passing. Second, we look at a few code-based simulations. Third, we discuss some general trends in the simulation of time series.

The bulk of this chapter will focus on specific code examples for generating various kinds of time series data. We will run through the following examples:

We simulate email opening and donation behavior of members of a nonprofit organization over the course of several years. This is related to the data we examined in “Retrofitting a Time Series Data Collection from a Collection of Tables”.
We simulate events in a taxicab fleet of 1,000 vehicles with various shift start times and hour-of-the-day-dependent passenger pickup frequencies over the course of a single day.
We simulate step-by-step state evolution of a magnetic solid for a given temperature and size using relevant laws of physics.

These three code examples correspond to three classes of time series simulations:

Heuristic simulations: We decide how the world should work, ensure it makes sense, and code it up, one rule at a time.
Discrete event simulations: We build individual actors with certain rules in our universe and then run those actors to see how the universe evolves over time.
Physics-based simulations: We apply physical laws to see how a system evolves over time.

Simulating time series can be a valuable analytical exercise and one we will also demonstrate in later chapters as it relates to specific models.

What’s Special About Simulating Time Series?

Simulating data is an area of data science that is rarely taught, but which is a particularly useful skill for time series data. This follows from one of the downsides of having temporal data: no two data points in the same time series are exactly comparable since they happen at different times. If we want to think about what could have happened at a given time, we move into the world of simulation.

Simulations can be simple or complex. On the simpler side, you will encounter synthetic data in any statistics textbook on time series, such as in the form of a random walk. These are usually generated as cumulative sums of a random process (such as R’s rnorm) or by a periodic function (such as a sine curve). On the more complex side, many scientists and engineers make their careers out of simulating time series. Time series simulations remain an active area of research—and a computationally demanding one—in many fields, including:

Meteorology
Finance
Epidemiology
Quantum chemistry
Plasma physics

In some of these cases, the fundamental rules of behavior are well understood, but it can still be difficult to account for everything that can happen due to the complexity of the equations (meteorology, quantum chemistry, plasma physics). In other cases, not all of the predictive variables can ever be known, and experts aren’t even sure that perfect predictions can be made due to the stochastic nonlinear nature of the systems studied (finance, epidemiology).
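Returning to the simpler end of that spectrum, the two textbook constructions mentioned earlier (a cumulative sum of random draws and a periodic function) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a recommended recipe; the series length, the period of 50 steps, the noise scale, and the seed are all arbitrary assumptions.

```python
import numpy as np

# Assumption: a seeded generator is used only so the sketch is repeatable.
rng = np.random.default_rng(seed=0)

n_steps = 500

# Random walk: the cumulative sum of independent standard normal steps
# (the NumPy analogue of cumsum(rnorm(n)) in R).
steps = rng.normal(loc=0.0, scale=1.0, size=n_steps)
random_walk = np.cumsum(steps)

# Periodic series: a sine curve with an assumed period of 50 time steps,
# plus a small amount of Gaussian noise.
t = np.arange(n_steps)
period = 50
seasonal = np.sin(2 * np.pi * t / period) + rng.normal(scale=0.1, size=n_steps)
```

Both arrays are indexed by an unlabeled step number; attaching real timestamps is a separate modeling decision.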

Simulation Versus Forecasting

Simulation and forecasting are similar exercises. In both cases you must form hypotheses about underlying system dynamics and parameters, and then extrapolate from these hypotheses to generate data points.

Nonetheless, there are important differences to keep in mind when learning about and developing simulations rather than forecasts:

It can be easier to integrate qualitative observations into a simulation than into a forecast.
Simulations are run at scale so that you can see many alternative scenarios (thousands or more), whereas forecasts should be more carefully produced.
Simulations have lower stakes than forecasts; there are no lives and no resources on the line, so you can be more creative and exploratory in your initial rounds of simulations. Of course, you eventually want to make sure you can justify how you build your simulations, just as you must justify your forecasts.

Simulations in Code

Next we look at three examples of coding up simulations of time series. As you read these examples, consider what a wide array of data can be simulated to produce a “time series,” and how the temporal element can be very specific and human-driven, such as days of the week and times of day of donations, but can also be very nonspecific and essentially unlabeled, such as the “nth step” of a physics simulation.

The three examples of simulation we will discuss in this section are:

Simulating a synthetic data set to test our hypotheses about how members of an organization may (or may not) have correlated behavior between receptiveness to organizational email and willingness to make donations. This is the most DIY example in that we hardcode relationships and generate tabular data with for loops and the like.
Simulating a synthetic data set to explore aggregate behavior in a fleet of taxis, complete with shift times and time-of-day-dependent frequency of passengers. In this data set, we make use of Python’s object-oriented features as well as generators, which are quite helpful when we want to set a system going and see what it does.
Simulating the physical process of a magnetic material gradually orienting its individual magnetic elements, which begin in disarray but ultimately coalesce into a well-ordered system. In this example, we see how physical laws can drive a time series simulation and insert natural temporal scaling into a process.

Doing the Work Yourself

When you are programming simulations, you need to keep in mind the logical rules that apply to your system. Here we walk through an example where the programmer does most of the work of making sure the data makes sense (for example, by not specifying events that happen in an illogical order).

We start by defining the membership universe—that is, how many members we have and when each joined the organization. We also pair each member with a member status:

## python
>>> import numpy as np
>>> import pandas as pd

>>> ## membership status
>>> years        = ['2014', '2015', '2016', '2017', '2018']
>>> memberStatus = ['bronze', 'silver', 'gold', 'inactive']

>>> memberYears = np.random.choice(years, 1000, 
>>>               p = [0.1, 0.1, 0.15, 0.30, 0.35])
>>> memberStats = np.random.choice(memberStatus, 1000, 
>>>               p = [0.5, 0.3, 0.1, 0.1])

>>> yearJoined = pd.DataFrame({'yearJoined': memberYears,
>>>                          'memberStats': memberStats})

Notice that there are already many rules/assumptions built into the simulation just from these lines of code. We impose specific probabilities of the years the members joined. We also make the status of the member entirely independent of the year they joined. In the real world, it’s likely we can already do better than this because these two variables should have some connection, particularly if we want to incentivize people to remain members.
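One way to connect the two variables is to draw each member's status from a distribution that depends on the year they joined. The sketch below shows the idea only; the `status_probs_by_year` table is hypothetical, and its values are invented for illustration rather than estimated from any data.

```python
import numpy as np
import pandas as pd

years        = ['2014', '2015', '2016', '2017', '2018']
memberStatus = ['bronze', 'silver', 'gold', 'inactive']

# Hypothetical status probabilities by join year (illustrative values only):
# we assume longer-tenured members are more likely to be gold or inactive.
status_probs_by_year = {
    '2014': [0.30, 0.30, 0.20, 0.20],
    '2015': [0.35, 0.30, 0.17, 0.18],
    '2016': [0.45, 0.30, 0.12, 0.13],
    '2017': [0.55, 0.30, 0.07, 0.08],
    '2018': [0.65, 0.27, 0.03, 0.05],
}

memberYears = np.random.choice(years, 1000,
                               p=[0.1, 0.1, 0.15, 0.30, 0.35])

# Draw each member's status conditional on the year they joined.
memberStats = [np.random.choice(memberStatus, p=status_probs_by_year[y])
               for y in memberYears]

yearJoined = pd.DataFrame({'yearJoined': memberYears,
                           'memberStats': memberStats})
```

A cross-tabulation such as `pd.crosstab(yearJoined.yearJoined, yearJoined.memberStats)` is a quick way to confirm that the dependence we built in actually shows up in the generated table.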

We make a table indicating when members opened emails each week. In this case, we define our organization’s behavior: we send three emails a week. We also define different patterns of member behavior with respect to email:

Never opening email
Constant level of engagement/email open rate
Increasing or decreasing level of engagement

We can imagine ways to make this more complex and nuanced depending on anecdotal observations from veterans or novel hypotheses we have about unobservable processes affecting the data:

## python
>>> NUM_EMAILS_SENT_WEEKLY = 3

>>> ## we define several functions for different patterns 
>>> def never_opens(period_rng):
>>>   return []

>>> def constant_open_rate(period_rng):
>>>   n, p = NUM_EMAILS_SENT_WEEKLY, np.random.uniform(0, 1)
>>>   num_opened = np.random.binomial(n, p, len(period_rng))
>>>   return num_opened

>>> def increasing_open_rate(period_rng):
>>>   return open_rate_with_factor_change(period_rng, 
>>>                                        np.random.uniform(1.01, 
>>>                                                          1.30))
 
>>> def decreasing_open_rate(period_rng):
>>>   return open_rate_with_factor_change(period_rng, 
>>>                                        np.random.uniform(0.5,  
>>>                                                          0.99))

>>> def open_rate_with_factor_change(period_rng, fac):
>>>     if len(period_rng) < 1 :
>>>         return [] 
>>>     times = np.random.randint(0, len(period_rng), 
>>>                                int(0.1 * len(period_rng)))    
>>>     num_opened = np.zeros(len(period_rng))
>>>     ## draw the starting open rate once, outside the loop, 
>>>     ## so that the factor change compounds over time
>>>     n, p = NUM_EMAILS_SENT_WEEKLY, np.random.uniform(0, 1)
>>>     for prd in range(0, len(period_rng), 2):  
>>>         try:
>>>             num_opened[prd:(prd + 2)] = np.random.binomial(n, p, 
>>>                                                            2)
>>>             p = max(min(1, p * fac), 0)
>>>         except ValueError:
>>>             ## odd number of periods: fill the final single week
>>>             num_opened[prd] = np.random.binomial(n, p)
>>>     ## zero out a random 10% of weeks in which nothing was opened
>>>     for t in range(len(times)):
>>>         num_opened[times[t]] = 0    
>>>     return num_opened

We have defined functions to simulate four distinct kinds of behavior:

Members who never open the emails we send them: (never_opens())
Members who open about the same number of emails each week: (constant_open_rate())
Members who open a decreasing number of emails each week: (decreasing_open_rate())
Members who open an increasing number of emails each week: (increasing_open_rate())

We ensure that those who grow increasingly engaged or disengaged over time are simulated in the same way with the open_rate_with_factor_change() function via the functions increasing_open_rate() and decreasing_open_rate().

We also need to come up with a system to model donation behavior. We don’t want to be totally naive, or our simulation will not give us insights into what we should expect. That is, we want to build into the model our current hypotheses about member behavior and then test whether the simulations based on those hypotheses match what we see in our real data. Here, we make donation behavior loosely but not deterministically related to the number of emails a member has opened:

## python
>>> ## donation behavior
>>> def produce_donations(period_rng, member_behavior, num_emails, 
>>>                       use_id, member_join_year):
>>>     donation_amounts = np.array([0, 25, 50, 75, 100, 250, 500, 
>>>                                  1000, 1500, 2000])
>>>     member_has = np.random.choice(donation_amounts)    
>>>     email_fraction = (num_emails  / 
>>>                        (NUM_EMAILS_SENT_WEEKLY * len(period_rng)))
>>>     member_gives = member_has * email_fraction
>>>     member_gives_idx = np.where(member_gives 
>>>                                  >= donation_amounts)[0][-1]
>>>     member_gives_idx = max(min(member_gives_idx, 
>>>                                len(donation_amounts) - 2), 
>>>                            1)
>>>     num_times_gave = (np.random.poisson(2) * 
>>>                        (2018 - member_join_year))
>>>     times = np.random.randint(0, len(period_rng), num_times_gave)
>>>     ## collect one row per donation; DataFrame.append was removed 
>>>     ## in pandas 2.0, so we build the table once at the end
>>>     rows = []
>>>     for n in range(num_times_gave):    
>>>         donation = donation_amounts[member_gives_idx 
>>>                      + np.random.binomial(1, .3)]
>>>         ts = str(period_rng[times[n]].start_time 
>>>                   + random_weekly_time_delta())
>>>         rows.append({'member'   : use_id,
>>>                      'amount'   : donation,
>>>                      'timestamp': ts})
>>>     dons = pd.DataFrame(rows, 
>>>                         columns = ['member', 'amount', 'timestamp'])
>>>     
>>>     ## we don't report zero donation events as this would not
>>>     ## be recorded in a real world database   
>>>     dons = dons[dons.amount != 0]
>>>                                    
>>>     return dons

There are a few steps we have taken here to make sure the code produces realistic behavior:

We make the overall number of donations dependent on how long someone has been a member.
We generate a wealth status per member, building in a hypothesis about behavior that donation amount is related to a stable amount a person would have earmarked for making donations.

Because our member behaviors are tied to a specific timestamp, we have to choose which weeks each member made donations and also when during that week they made the donation. We write a utility function to pick a random time during the week:

## python
>>> def random_weekly_time_delta():
>>>     days_of_week = [d for d in range(7)]
>>>     hours_of_day = [h for h in range(11, 23)]
>>>     minute_of_hour = [m for m in range(60)]
>>>     second_of_minute = [s for s in range(60)]
>>>     ## the outer parentheses let the sum continue across lines
>>>     return (pd.Timedelta(str(np.random.choice(days_of_week))   
>>>                             + " days" ) +
>>>          pd.Timedelta(str(np.random.choice(hours_of_day))     
>>>                             + " hours" )  +
>>>          pd.Timedelta(str(np.random.choice(minute_of_hour))   
>>>                             + " minutes") +
>>>          pd.Timedelta(str(np.random.choice(second_of_minute)) 
>>>                             + " seconds"))

You may have noticed that we only draw the hour of the timestamp from hours 11 through 22 (hours_of_day = [h for h in range(11, 23)]; Python’s range excludes its upper bound). We are postulating a universe with people in a very limited range of time zones or even in just a single time zone, as we do not allow hours outside the range given. Here we are building in more of our underlying model as to how users behave.

We thus expect to see unified behavior from our users as though they are all in one or a few adjoining time zones, and we are further postulating that reasonable donation behavior is for people to donate from late morning to late evening, but not overnight and not first thing when they wake up.

Finally, we put all the components just developed together to simulate a certain number of members and associated events in a way that ensures that events happen only once a member has joined and that a member’s email events have some relation (but not an unrealistically small relation) to their donation events:

## python
>>> behaviors        = [never_opens, 
>>>                    constant_open_rate,
>>>                    increasing_open_rate, 
>>>                    decreasing_open_rate]
>>> ## p must be passed by keyword: the third positional argument
>>> ## of np.random.choice is replace, not p
>>> member_behaviors = np.random.choice(behaviors, 1000, 
>>>                                    p = [0.2, 0.5, 0.1, 0.2])
                                    
>>> rng = pd.period_range('2015-02-14', '2018-06-01', freq = 'W')
>>> ## collect per-member tables and concatenate once at the end
>>> ## (DataFrame.append was removed in pandas 2.0)
>>> email_frames, donation_frames = [], []

>>> for idx in range(yearJoined.shape[0]):
>>>     ## randomly generate the date when a member would have joined
>>>     join_date = (pd.Timestamp(yearJoined.iloc[idx].yearJoined) + 
>>>                   pd.Timedelta(str(np.random.randint(0, 365)) + 
>>>                                   ' days'))
>>>     join_date = min(join_date, pd.Timestamp('2018-06-01'))
>>>  
>>>     ## member should not have action timestamps before joining;
>>>     ## compare week start times, since a PeriodIndex cannot be 
>>>     ## compared directly to a Timestamp
>>>     member_rng = rng[rng.start_time > join_date]    
>>>   
>>>     if len(member_rng) < 1:
>>>         continue
>>> 
>>>     info = member_behaviors[idx](member_rng)
>>>     if len(info) == len(member_rng):
>>>         email_frames.append(pd.DataFrame(
>>>            {'member': [idx] * len(info), 
>>>             'week': [str(r.start_time) for r in member_rng], 
>>>             'emailsOpened': info}))
>>>         dons = produce_donations(member_rng, member_behaviors[idx], 
>>>                                 sum(info), idx, join_date.year)
>>>         if len(dons) > 0:
>>>             donation_frames.append(dons)

>>> emails    = pd.concat(email_frames, ignore_index = True)
>>> donations = pd.concat(donation_frames, ignore_index = True)

We then look at the temporal behavior of the donations to get a sense of how we might use this data for further analysis or forecasting. We plot the total sum of donations we received for each month of the data set (see Figure 4-1):

## python
>>> df = donations.copy()
>>> df.set_index(pd.to_datetime(df.timestamp), inplace = True)
>>> df.sort_index(inplace = True)
>>> ## with pandas 2.2 or later, use freq = 'ME' for month-end bins
>>> df.groupby(pd.Grouper(freq='M')).amount.sum().plot()

It looks as though the number of donations and of emails opened rose over time from 2015 through 2018. This is not surprising, since the number of members also rose over time, as indicated in the cumulative sum of members and the year they joined. In fact, one built-in assumption of our model was that we got to keep a member indefinitely after they joined. We made no provision for termination other than allowing for members to open a decreasing number of emails. Even in that case, however, we left open the possibility of continued donations. We see this assumption of indefinitely continuing membership (and correlated donation behavior) in Figure 4-1. We should probably go back and refine our code, as indefinite membership and donation is not a realistic scenario.

This is not a classic time series simulation, so it may feel quite a bit more like an exercise in generating tabular data. It absolutely is that as well, but we did have to be time series–aware:

We had to make decisions about how many time zones our users were in.
We had to make decisions about what kinds of trends we would model over time:
- In the case of email, we decided to have three trends: stable, increasing, and decreasing email open rates.
- In the case of donations, we made donations a stable behavioral pattern related to how many emails the member had ever opened in their lifetime. This included a lookahead, but since we were generating data, this was a way of deciding that a member’s overall affinity for the organization, which would lead to more emails opened, would also increase the frequency of donations.
We had to be careful to make sure we did not have emails opened or donations made before the member joined the organization.
We had to make sure our data did not go into the future, to make it more realistic for consumers of the data. Note that for a simulation it is fine if our data goes into the future.

But it’s not perfect. The code presented here is ungainly, and it doesn’t produce a realistic universe. What’s more, since only the programmer checked the logic, they could have missed edge cases such that events take place in an illogical order. It would be good to establish external metrics and standards of validity before running the simulation as one protection against such errors.

We need software that enforces a logical and consistent universe. We will look at Python generators as a better option in the next section.

Building a Simulation Universe That Runs Itself

Sometimes you have a specific system and you want to set up the rules for that system and see how it rolls along. Perhaps you want to envision how a universe of independent members will use your application, or you want to attempt to validate an internal theory of decision making based on posited external behavior. In these cases, you are looking to see how individual agents contribute to your aggregate metrics over time. Python is an especially good fit for this job thanks to the availability of generators. When you start building software rather than staying purely in analysis, it makes sense to move to Python even if you are more comfortable in R.

Generators allow us to create a series of independent (or dependent!) actors and wind them up to watch what they do, without too much boilerplate code to keep track of everything.

In the next code example, we explore a taxicab simulation.¹ We want to imagine how a fleet of taxis, scheduled to begin their shifts at different times, might behave in aggregate. To do so, we want to create many individual taxis, set them loose in a cyber city, and have them report their activities back.

Such a simulation could be exceptionally complicated. For demonstration purposes, we accept that we will build a simpler world than what we imagine to truly be the case (“All models are wrong…”). We start by trying to understand what a Python generator is.

Let’s first consider a method I wrote to retrieve a taxi identification number:

## python
>>> import numpy as np

>>> def taxi_id_number(num_taxis):
>>>    arr = np.arange(num_taxis)
>>>    np.random.shuffle(arr)
>>>    for i in range(num_taxis):
>>>        yield arr[i]

For those who are not familiar with generators, here is the preceding code in action:

## python
>>> ids = taxi_id_number(10)
>>> print(next(ids))
>>> print(next(ids))
>>> print(next(ids))

which might print out:

7
2
5

This will iterate until it has emitted 10 numbers, at which point it will exit the for loop held within the generator, and the next call to next() will raise a StopIteration exception.

Each call to taxi_id_number() produces a single-use object; these objects are independent of one another and keep their own state. taxi_id_number() is a generator function, and the objects it returns are generators. You can think of generators as tiny objects that maintain their own small bundle of state variables, which is useful when you want many objects parallel to one another, each one minding its own variables.
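To make the point about independent state concrete, here is a small sketch that does not come from the taxi simulation itself. The `countdown()` generator function is a hypothetical example; two generators created from it advance separately, and an exhausted generator raises StopIteration.

```python
def countdown(start):
    """Generator function: each call returns a new, independent generator."""
    current = start
    while current > 0:
        yield current
        current -= 1

a = countdown(3)
b = countdown(2)

first_a  = next(a)   # 3
first_b  = next(b)   # 2
second_a = next(a)   # 2; advancing a has no effect on b

remaining_b = list(b)   # [1]; b is now exhausted

# Calling next() on an exhausted generator raises StopIteration.
try:
    next(b)
    exhausted = False
except StopIteration:
    exhausted = True
```

A for loop or a call to list() handles StopIteration for us, which is why we only see the exception when calling next() directly.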

In the case of this simple taxi simulation, we compartmentalize our taxis into different shifts, and we also use a generator to indicate shifts. We schedule more taxis in the middle of the day than in the evening or overnight shifts by setting different probabilities for starting a shift at a given time:

## python
>>> def shift_info():
>>>    start_times_and_freqs = [(0, 8), (8, 30), (16, 15)]
>>>    indices               = np.arange(len(start_times_and_freqs))
>>>    while True:
>>>        idx   = np.random.choice(indices, p = [0.25, 0.5, 0.25])
>>>        start = start_times_and_freqs[idx]
>>>        yield (start[0], start[0] + 7.5, start[1])

Pay attention to start_times_and_freqs. This is our first bit of code that will contribute to making this a time series simulation. We are indicating that different parts of the day have different likelihoods of having a taxi assigned to the shift. Additionally, different times of the day have a different mean number of trips.
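As a quick sanity check on those shift probabilities, we can draw a large number of shifts and count how often each start time appears. The sketch below repeats the shift_info() generator so that it runs on its own; the sample size of 10,000 is an arbitrary choice.

```python
import numpy as np
from collections import Counter
from itertools import islice

def shift_info():
    # Same generator as in the text, repeated so this sketch is self-contained.
    start_times_and_freqs = [(0, 8), (8, 30), (16, 15)]
    indices               = np.arange(len(start_times_and_freqs))
    while True:
        idx   = np.random.choice(indices, p=[0.25, 0.5, 0.25])
        start = start_times_and_freqs[idx]
        yield (start[0], start[0] + 7.5, start[1])

# islice takes a finite number of items from the infinite generator.
shifts = list(islice(shift_info(), 10_000))

# Count how many sampled shifts begin at each start time.
start_counts = Counter(start for start, end, mean_trips in shifts)
```

With these probabilities we would expect roughly half of the sampled shifts to start at hour 8 and about a quarter each at hours 0 and 16.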

Now we create a more complex generator that will use the preceding generators to establish individual taxi parameters as well as create individual taxi timelines:

## python
>>> def taxi_process(taxi_id_generator, shift_info_generator):
>>>    taxi_id = next(taxi_id_generator)
>>>    shift_start, shift_end, shift_mean_trips = next(
>>>                                    shift_info_generator)
>>>    ## cast to int so the result can be passed to range() below
>>>    actual_trips = int(round(np.random.normal(
>>>                                 loc   = shift_mean_trips, 
>>>                                 scale = 2)))
>>>    average_trip_time = 6.5 / shift_mean_trips * 60 
>>>    # convert mean trip time to minutes
>>>    between_events_time = 1.0 / (shift_mean_trips - 1) * 60
>>>    # this is an efficient city where cabs are seldom unused
>>>    time = shift_start
>>>    yield TimePoint(taxi_id, 'start shift', time)    
>>>    deltaT = np.random.poisson(between_events_time) / 60
>>>    time += deltaT
>>>    for i in range(actual_trips):
>>>        yield TimePoint(taxi_id, 'pick up    ', time)
>>>        deltaT = np.random.poisson(average_trip_time) / 60
>>>        time += deltaT
>>>        yield TimePoint(taxi_id, 'drop off   ', time)
>>>        deltaT = np.random.poisson(between_events_time) / 60
>>>        time += deltaT        
>>>    deltaT = np.random.poisson(between_events_time) / 60
>>>    time += deltaT        
>>>    yield TimePoint(taxi_id, 'end shift  ', time)

Here the taxi accesses generators to determine its ID number, shift start times, and mean number of trips for its start time. From there, it departs on its own individual journey as it runs through a certain number of trips on its own timeline and emits those to the client calling next() on this generator. In effect, this generator produces a time series of points for an individual taxi.

The taxi generator yields TimePoints, which are defined as follows:

## python
>>> from dataclasses import dataclass

>>> @dataclass
>>> class TimePoint:
>>>    taxi_id:    int
>>>    name: str
>>>    time: float

>>>    def __lt__(self, other):
>>>        return self.time < other.time

We use the relatively new dataclass decorator to simplify the code (this requires Python 3.7 or later). I recommend that all Python-using data scientists familiarize themselves with this new and data-friendly addition to Python.

Python’s Dunder Methods

Python’s dunder methods, whose names begin and end with two underscores, are a set of built-in methods for every class. Dunder methods are called automatically in the natural course of using a given object. There are predefined implementations that can be overridden when you define them for your class yourself. There are many reasons you might want to do this, such as in the case of the preceding code, where we want TimePoints to be compared only based on their time and not based on their taxi_id or name attributes.

Dunder originated as an abbreviation of “double under.”

In addition to the automatically generated initializer for TimePoint, we need only two other dunder methods, __lt__ (to compare TimePoints) and __str__ (to print out TimePoints, not shown here). We need comparison because we will take all TimePoints produced into a data structure that will keep them in order: a priority queue. A priority queue is an abstract data type into which objects can be inserted in any order but which will emit objects in a specified order based on their priority.
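The following sketch shows how these pieces fit together, using Python's standard queue.PriorityQueue. The `__str__` method here is a hypothetical implementation (the original is not shown); it assumes time is stored as hours since midnight and formats it to the nearest minute. The three sample TimePoints are invented for illustration.

```python
import queue
from dataclasses import dataclass

@dataclass
class TimePoint:
    taxi_id: int
    name: str
    time: float   # assumed to be hours since midnight

    def __lt__(self, other):
        # Order TimePoints by time only, ignoring taxi_id and name.
        return self.time < other.time

    def __str__(self):
        # Hypothetical formatting; the original __str__ is not shown.
        total_minutes = int(round(self.time * 60))
        hours, minutes = divmod(total_minutes, 60)
        return (f"id: {self.taxi_id:04d} name: {self.name} "
                f"time: {hours:02d}:{minutes:02d}")

# Insert events out of order; the queue hands them back in time order.
pq = queue.PriorityQueue()
for tp in [TimePoint(7, 'drop off   ', 9.5),
           TimePoint(2, 'start shift', 8.0),
           TimePoint(5, 'pick up    ', 8.25)]:
    pq.put(tp)

ordered = []
while not pq.empty():
    ordered.append(pq.get())

for tp in ordered:
    print(tp)
```

Because PriorityQueue compares the items themselves, defining `__lt__` on TimePoint is all that is needed for the queue to emit events chronologically.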

Abstract Data Type

An abstract data type is a computational model defined by its behavior, which consists of an enumerated set of possible actions and input data and what the results of such actions should be for certain sets of data.

One commonly known abstract data type is a first-in-first-out (FIFO) data type. This requires that objects are emitted from the data structure in the same order in which they were fed into the data structure. How the programmer elects to accomplish this is a matter of implementation and not a definition.

We have a simulation class to run these taxi generators and keep them assembled. This is not merely a dataclass because it has quite a bit of functionality, even in the initializer, to arrange the inputs into a sensible array of information and processing. Note that the only public-facing functionality is the run() function:

## python
>>> import queue

>>> class Simulator:
>>>    def __init__(self, num_taxis):
>>>        self._time_points = queue.PriorityQueue()
>>>        taxi_id_generator = taxi_id_number(num_taxis)
>>>        shift_info_generator = shift_info()
>>>        self._taxis = [taxi_process(taxi_id_generator, 
>>>                                    shift_info_generator) for 
>>>                                             i in range(num_taxis)]        
>>>        self._prepare_run()        

>>>    def _prepare_run(self):
>>>        for t in self._taxis:
>>>            while True:
>>>                try:
>>>                    e = next(t)
>>>                    self._time_points.put(e)
>>>                except StopIteration:
>>>                    break        

>>>    def run(self):
>>>        sim_time = 0
>>>        while sim_time < 24:
>>>            if self._time_points.empty():
>>>                break
>>>            p = self._time_points.get()
>>>            sim_time = p.time
>>>            print(p)

First, we create the number of taxi generators that we need to represent the right number of taxis. Then we run through each of these taxis while it still has TimePoints and push all these TimePoints into a priority queue. The priority of the object is determined for a custom class such as TimePoint by our implementation of a TimePoint’s __lt__, where we compare the time attribute. So, as the TimePoints are pushed into the priority queue, it will prepare them to be emitted in temporal order.

We run the simulation:

## python
>>> sim = Simulator(1000)
>>> sim.run()

Here’s what the output looks like (your output will be different, as we haven’t set a seed—and every time you run the code it will be different from the last iteration):

id: 0539 name: drop off    time: 23:58
id: 0318 name: pick up     time: 23:58
id: 0759 name: end shift   time: 23:58
id: 0977 name: pick up     time: 23:58
id: 0693 name: end shift   time: 23:59
id: 0085 name: end shift   time: 23:59
id: 0351 name: end shift   time: 23:59
id: 0036 name: end shift   time: 23:59
id: 0314 name: drop off    time: 23:59

Setting a Seed When Generating Random Numbers

When you write code that is generating random numbers, you might want to ensure that it’s reproducible (e.g., if you wanted to set up unit tests for code that is normally random or if you were trying to debug and wanted to narrow down sources of variation to make debugging easier). To ensure that random numbers come out in the same nonrandom order, you set a seed. This is a common operation, so there are guides on how to set a seed in any computer language.
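In NumPy, which we use throughout this chapter, a minimal sketch of seeding looks like the following. The seed value 42 is arbitrary. The first half seeds the legacy global state that functions such as np.random.choice draw from; the second half uses a separate Generator object, which keeps the random state local rather than global.

```python
import numpy as np

# Seed the global state used by np.random.choice, np.random.poisson, etc.
np.random.seed(42)
first = np.random.randint(0, 100, size=5)

# Resetting the same seed reproduces the same draws.
np.random.seed(42)
second = np.random.randint(0, 100, size=5)

# Alternative: independent Generator objects carrying their own seeded state.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
draws_a = rng_a.normal(size=3)
draws_b = rng_b.normal(size=3)
```

Calling np.random.seed() once before constructing the Simulator would make the taxi output repeatable from run to run, since all of its randomness comes from np.random functions.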

We have rounded to the nearest minute for display simplicity, although we do have more fine-grained data available. What temporal resolution we use will depend on our purposes:

If we want to make an educational display for people in our city of how the taxi fleet affects traffic, we might display hourly aggregates.
If we are a taxicab app and need to understand load on our server, we likely want to look at minute-by-minute data or even more highly resolved data to think about our infrastructure design and capacity.
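The choice of resolution amounts to a choice of bin width. As a rough sketch, the code below bins the same set of event times first by hour and then by minute. The event times here are hypothetical uniform draws standing in for the simulation's output, with time measured in hours since midnight.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical event times in hours since midnight, standing in for
# the time attribute of the TimePoints our simulation emits.
event_times = rng.uniform(0, 24, size=5000)

# Hourly aggregates: 24 bins with edges at 0, 1, ..., 24.
hourly_counts, _ = np.histogram(event_times, bins=np.arange(25))

# Minute-by-minute aggregates: 1,440 bins across the same day.
minute_edges = np.linspace(0, 24, 24 * 60 + 1)
minute_counts, _ = np.histogram(event_times, bins=minute_edges)
```

The hourly series is smooth enough for a public display, while the minute-level series exposes the short bursts that matter for capacity planning.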

We made the decision to report taxi TimePoints as they are “happening.” That is, we report the start of a taxi ride (“pick up”) without the time when the ride will end, even though we easily could have condensed this. This is one way of making the time series more realistic, in the sense that you likely would have recorded events in this way in a live stream.

Note that, as in the previous case, our time series simulation has not yet produced a time series. We have produced a log and can see our way through to making this a time series in a number of ways, however:

Output to a CSV file or time series database as we run the simulation.
Run some kind of online model hooked up to our simulation to learn how to develop a real-time streaming data processing pipeline.
Save the output down to a file or database and then do more post-processing to package the data in a convenient (but possibly risky vis-à-vis lookahead) form, such as pairing together start and end times of a given ride to study how the length of a taxi ride behaves at different times of the day.
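The last option, pairing the start and end of each ride, can be sketched with pandas as follows. The event log here is a small hypothetical table, not output from the simulation, and the event names are assumed to be unpadded strings. We number each taxi's pick ups and drop offs in time order and join the nth pick up to the nth drop off.

```python
import pandas as pd

# Hypothetical event log with the same fields as our TimePoints;
# time is in hours since midnight.
log = pd.DataFrame({
    'taxi_id': [1, 2, 1, 2, 1, 1],
    'name':    ['pick up', 'pick up', 'drop off',
                'drop off', 'pick up', 'drop off'],
    'time':    [8.10, 8.20, 8.35, 8.50, 8.60, 8.95],
})

log = log.sort_values(['taxi_id', 'time'])
pickups  = log[log.name == 'pick up'].copy()
dropoffs = log[log.name == 'drop off'].copy()

# Number each taxi's rides so the nth pick up pairs with the nth drop off.
pickups['ride']  = pickups.groupby('taxi_id').cumcount()
dropoffs['ride'] = dropoffs.groupby('taxi_id').cumcount()

rides = pickups.merge(dropoffs, on=['taxi_id', 'ride'],
                      suffixes=('_start', '_end'))
rides['duration_min'] = (rides.time_end - rides.time_start) * 60
```

Each row of rides now contains information that was not known until the ride ended, which is exactly the lookahead risk noted above if the table is later used to build a forecasting model.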

There are several advantages to simulating this data in addition to being able to test hypotheses about the dynamics of a taxi system. Here are a couple of situations where this synthetic time series data could be useful:

Testing the merits of various forecasting models relative to the known underlying dynamics of the simulation.
Building a pipeline for data you eventually expect to have based on your synthetic data while you await the real data.

You will be well served as a time series analyst by your ability to make use of generators and object-oriented programming. This section offers just one example of how such knowledge can simplify your life and improve the quality of your code.

For Extensive Simulations, Consider Agent-Based Modeling

The solution we coded here was all right, but it was a fair amount of boilerplate to ensure that logical conditions would be respected. If a simulation of discrete events based on the actions of discrete actors would be a useful source of simulated time series data, you should consider a simulation-oriented module. The SimPy module is a helpful option, with an accessible API and quite a bit of flexibility to do the sorts of simulation tasks we handled in this section.

A Physics Simulation

In another kind of simulation scenario, you may be in full possession of the laws of physics that define a system. This doesn’t have to be physics per se, however; it can also apply to a number of other areas:

Quantitative researchers in finance often hypothesize the “physical” rules of the market. So do economists, albeit at different timescales.
Psychologists posit the “psychophysical” rules of how humans make decisions. These can be used to generate “physical” rules about expected human responses to a variety of options over time.
Biologists research rules about how a system behaves over time in response to various stimuli.

One case of knowing some rules for a simple physical system is that of modeling a magnet. This is the case we are going to work on, via an oft-taught statistical mechanics model called the Ising model.² We will look at a simplified version of how to simulate its behavior over time. We will initialize a magnetic material so that its individual magnetic components are pointing in random directions. We will then watch how this system evolves into order where all the magnetic components point in the same direction, under the action of known physical laws and a few lines of code.

Next we discuss how such a simulation is accomplished via a Markov Chain Monte Carlo (MCMC) method, covering both how that method works in general and how it applies to this specific system.

Monte Carlo Simulations and Markov Chains

The idea behind a Monte Carlo simulation is to find clever ways to apply random numbers to situations that should in theory be solvable exactly, but in practice are much easier to solve probabilistically.

The Markov chain is a helpful addition to a general Monte Carlo simulation and is particularly applicable to time series simulation. A Monte Carlo simulation will help you figure out what a particular distribution or series of terms looks like, but not how those terms should evolve over time. This is where a Markov chain comes in. It calculates a probability of transitioning between states, and when we factor that in, we take “steps” rather than simply calculating a global integral. Now we can have a time series simulation rather than merely the calculation of an integral.
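
To make the idea of taking "steps" concrete, here is a minimal sketch of a two-state Markov chain. The state names and transition probabilities are invented for illustration and have nothing to do with the magnet model yet; the point is only that each new state is drawn using the current state alone, and that the ordered output is a time series.

```python
import random

# Hypothetical transition probabilities: P[current][next].
# Both the state names and the numbers are assumptions for illustration.
P = {
    "up":   {"up": 0.9, "down": 0.1},
    "down": {"up": 0.3, "down": 0.7},
}

def simulate_chain(start, n_steps, rng):
    """Generate a series of states, each depending only on the previous one."""
    state = start
    series = [state]
    for _ in range(n_steps):
        # draw the next state using only the current state's row of P
        state = "up" if rng.random() < P[state]["up"] else "down"
        series.append(state)
    return series

rng = random.Random(0)
series = simulate_chain("down", 10_000, rng)
frac_up = series.count("up") / len(series)
print(series[:10])
print(f"fraction of time spent 'up': {frac_up:.3f}")
```

For these made-up numbers, the long-run fraction of time spent in the `up` state should approach 0.3 / (0.1 + 0.3) = 0.75. A plain Monte Carlo estimate would give you that single number; the chain additionally gives you the path taken to get there.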

In physics, an MCMC simulation can be used, for example, to understand how quantum transitions in individual molecules can affect aggregate ensemble measurements of that system over time. In this case, we need to apply a few specific rules:

In a Markov process, the probability of a transition to a state in the future depends only on the present state (not on past information).
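We will impose a physics-specific condition of requiring a Boltzmann distribution for energy; that is, $T_{ij} / T_{ji} = e^{-b (E_{j} - E_{i})}$, where $T_{ij}$ is the probability of transitioning from state $i$ to state $j$, $E_{i}$ is the energy of state $i$, and $b$ is the inverse temperature (BETA in the code below). For most of us, this is just an implementation detail and not something nonphysicists need to worry about.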

We implement an MCMC simulation as follows:

Select the starting state of each individual lattice site randomly.
For each individual time step, choose an individual lattice site and flip its direction.
Calculate the change in energy that would result from this flip given the physical laws you are working with. In this case this means:
- If the change in energy is negative, you are transitioning to a lower energy state, which will always be favored, so you keep the switch and move on to the next time step.
- If the change in energy is not negative, you accept it with the acceptance probability of $e^{-(\text{energy change})}$. This is consistent with rule 2.

Continue steps 2 and 3 indefinitely until convergence to determine the most likely state for whatever aggregate measurement you are making.
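
As a quick numerical check that the acceptance rule in step 3 respects rule 2, the sketch below compares the ratio of the two acceptance probabilities with the Boltzmann factor. It assumes that a flip and its reverse are proposed equally often, and it scales the energy change by $b$ as the code later in this section does with BETA. The function name and the energies are hypothetical values chosen for illustration.

```python
import math

def acceptance_probability(delta_e, b):
    """Accept downhill moves always, uphill moves with probability e^(-b * delta_e)."""
    return 1.0 if delta_e < 0 else math.exp(-b * delta_e)

b = 2.0                  # assumed inverse temperature
e_i, e_j = -1.0, 3.0     # assumed energies of two states i and j

t_ij = acceptance_probability(e_j - e_i, b)   # i -> j (uphill)
t_ji = acceptance_probability(e_i - e_j, b)   # j -> i (downhill)

# the two printed numbers should agree
print(t_ij / t_ji, math.exp(-b * (e_j - e_i)))
```

The downhill move is always accepted, so the ratio reduces to the uphill acceptance probability, which is the Boltzmann factor by construction.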

Let’s take a look at the specific details of the Ising model. Imagine we have a two-dimensional material composed of a grid of objects, each one having what boils down to a mini-magnet that can point up or down. We put those mini-magnets randomly in an up or down spin at time zero, and we then record the system as it evolves from a random state to an ordered state at low temperature.³

First we configure our system, as follows:

## python
>>> import numpy as np
>>> 
>>> ### CONFIGURATION
>>> ## physical layout
>>> N           = 5 # width of lattice
>>> M           = 5 # height of lattice
>>> ## temperature settings
>>> temperature = 0.5
>>> BETA        = 1 / temperature

Then we have some utility methods, such as random initialization of our starting block:

## python
>>> def initRandState(N, M):
>>>     block = np.random.choice([-1, 1], size = (N, M))
>>>     return block

We also calculate the energy for a given center state alignment relative to its neighbors:

## python
>>> def costForCenterState(state, i, j, n, m):
>>>     centerS = state[i, j]
>>>     neighbors = [((i + 1) % n, j), ((i - 1) % n, j),
>>>                  (i, (j + 1) % m), (i, (j - 1) % m)]
>>>     ## notice the % n and % m: we impose periodic boundary conditions
>>>     ## ignore this if it doesn't make sense - it's merely a
>>>     ## physical constraint on the system saying the 2D system is like
>>>     ## the surface of a donut
>>>     interactionE = [state[x, y] * centerS for (x, y) in neighbors]
>>>     return np.sum(interactionE)
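
To see what the periodic boundary condition does in practice, the following sketch lists the neighbors of an interior site and of a corner site. The helper name `neighbor_sites` is hypothetical; it only reproduces the index arithmetic used in costForCenterState.

```python
def neighbor_sites(i, j, n, m):
    """Return the four nearest neighbors of site (i, j) on an n x m lattice
    with periodic (wrap-around) boundaries."""
    return [((i + 1) % n, j), ((i - 1) % n, j),
            (i, (j + 1) % m), (i, (j - 1) % m)]

# an interior site has the neighbors you would expect
print(neighbor_sites(2, 2, 5, 5))   # [(3, 2), (1, 2), (2, 3), (2, 1)]

# a corner site wraps around to the opposite edges
print(neighbor_sites(0, 0, 5, 5))   # [(1, 0), (4, 0), (0, 1), (0, 4)]
```

This works because Python's `%` operator returns a non-negative result for a positive modulus, so `(0 - 1) % 5` is 4 rather than -1.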

And we want to determine the magnetization of the entire block for a given state:

## python
>>> def magnetizationForState(state):
>>>     return np.sum(state)

Here’s where we introduce the MCMC steps discussed earlier:

## python
>>> def mcmcAdjust(state):
>>>     n = state.shape[0]
>>>     m = state.shape[1]
>>>     x, y = np.random.randint(0, n), np.random.randint(0, m)
>>>     centerS = state[x, y]
>>>     cost = costForCenterState(state, x, y, n, m)
>>>     if cost < 0:
>>>         centerS *= -1
>>>     elif np.random.random() < np.exp(-cost * BETA):
>>>         centerS *= -1
>>>     state[x, y] = centerS
>>>     return state

Now to actually run a simulation, we need some recordkeeping as well as repeated calls to the MCMC adjustment:

## python
>>> def runState(state, n_steps, snapsteps = None):
>>>     if snapsteps is None:
>>>         ## default: roughly one snapshot per 100 * M * N steps;
>>>         ## for short runs this rounds to zero and no snapshots are saved
>>>         snapsteps = np.linspace(0, n_steps,
>>>                                 num = round(n_steps / (M * N * 100)),
>>>                                 dtype = np.int32)
>>>     saved_states = []
>>>     sp = 0
>>>     magnet_hist = []
>>>     for i in range(n_steps):
>>>         state = mcmcAdjust(state)
>>>         magnet_hist.append(magnetizationForState(state))
>>>         if sp < len(snapsteps) and i == snapsteps[sp]:
>>>             saved_states.append(np.copy(state))
>>>             sp += 1
>>>     return state, saved_states, magnet_hist

And we run the simulation:

## python
>>> ### RUN A SIMULATION
>>> init_state = initRandState(N, M)
>>> print(init_state)
>>> ## runState returns three values, so we unpack all of them
>>> final_state, states, magnet_hist = runState(np.copy(init_state), 1000)

We can get some insights from this simulation by looking at the beginning and ending states (see Figure 4-2).

In Figure 4-2 we examine one randomly generated initial state. While you might expect to see the two states more mixed up, remember that probabilistically it’s not that likely to get a perfect checkerboard effect. Try generating the initial state many times, and you will see that the seemingly “random” or “50/50” checkerboard state is not at all likely. Notice, however, that we start with approximately half our sites in each state. Also realize that any patterns you find in the initial states are likely your brain following the very human tendency to see patterns even where there aren’t any.
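
A little arithmetic backs this up. The sketch below counts configurations of the 5 × 5 lattice, assuming each site is independently spin up or spin down with equal probability (as in initRandState) and ignoring the wrap-around boundary when counting checkerboards.

```python
from math import comb

n_sites = 5 * 5
n_configs = 2 ** n_sites          # each site is independently up or down

# ignoring the wrap-around boundary, there are two perfect checkerboards
# (one starting with spin up in the corner, one starting with spin down)
p_checkerboard = 2 / n_configs

# 25 sites cannot split exactly in half, so count 12-13 and 13-12 splits
p_near_even = (comb(n_sites, 12) + comb(n_sites, 13)) / n_configs

print(f"P(perfect checkerboard) = {p_checkerboard:.2e}")
print(f"P(12 or 13 sites spin up) = {p_near_even:.3f}")
```

A perfect checkerboard is vanishingly rare, while a nearly even split between the two states is common. The vast majority of those nearly even configurations look clumpy rather than regular.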

We then pass the initial state into the runState() function, allow 1,000 time steps to pass, and then examine the outcome in Figure 4-3.

This is a snapshot of the state taken at step 1,000. There are at least two interesting observations at this point. First, the dominant state has reversed compared to the initial state. Second, the dominant state is no more dominant numerically than was the other dominant state in the initial state. This suggests that the temperature may continue to flip sites out of the dominant state even when it might otherwise be favored. To better understand these dynamics, we should consider plotting overall aggregate measurements, such as magnetization, or make movies where we can view our two-dimensional data in a time series format.

We do this with magnetization over time for many independent runs of the previous simulation, as pictured in Figure 4-4:

## python
>>> import matplotlib.pyplot as plt
>>> 
>>> ## we collect each time series as a separate element in results list
>>> results = []
>>> for i in range(100):
>>>     init_state = initRandState(N, M)
>>>     final_state, states, magnet_hist = runState(init_state, 1000)
>>>     results.append(magnet_hist)
>>> 
>>> ## we plot each curve with some transparency so we can see
>>> ## curves that overlap one another
>>> for mh in results:
>>>     plt.plot(mh,'r', alpha=0.2)

The magnetization curves are just one example of how we could picture the system evolving over time. We might also consider recording 2D time series, such as a snapshot of the overall state at each point in time. Or there might be other interesting aggregate variables to measure at each step, such as a measure of layout entropy or a measure of total energy. Magnetization and entropy are related quantities, as each is a function of the geometric layout of the state at each lattice site, but each is a slightly different measure.
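
As one example of another aggregate variable, here is a minimal sketch of a total energy measurement. It assumes the standard nearest-neighbor Ising energy with the coupling constant set to 1 and periodic boundaries; the function name `total_energy` is hypothetical. It is written with plain indexing so that it accepts either nested lists or a NumPy array like the ones used above.

```python
def total_energy(state):
    """Total energy E = -sum of s_i * s_j over nearest-neighbor pairs,
    with periodic boundaries and the coupling constant set to 1 (an assumption)."""
    n, m = len(state), len(state[0])
    energy = 0
    for i in range(n):
        for j in range(m):
            # count each pair once: only the neighbor below and to the right
            energy -= state[i][j] * state[(i + 1) % n][j]
            energy -= state[i][j] * state[i][(j + 1) % m]
    return energy

aligned = [[1] * 4 for _ in range(4)]
checker = [[1 if (i + j) % 2 == 0 else -1 for j in range(4)] for i in range(4)]

print(total_energy(aligned))   # -32: every one of the 32 pairs is aligned
print(total_energy(checker))   # 32: every pair is anti-aligned
```

Recording this value at each step, next to the magnetization, would give a second time series describing the same run. The two can disagree in informative ways: a lattice split into two large domains has a magnetization near zero but an energy that is already quite low.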

We can use this data in similar ways to what we discussed with the taxicab data, even though the underlying system is quite different. For example, we could:

Use the simulated data as the impetus to set up a pipeline.
Test machine learning methods on this synthetic data to see if they can be helpful on physical data before we go to the trouble of cleaning up real-world data for such modeling.
Watch the movie-like imagery of important metrics to develop better physical intuitions about the system.

Final Notes on Simulations

We have looked at a number of very different examples of simulating measurements that describe behavior over time. We have looked at simulating data related to consumer behavior (NGO membership and donation), city infrastructure (taxicab pick-up patterns), and the laws of physics (the gradual ordering of a randomized magnetic material). These examples should leave you feeling comfortable enough to begin reading code examples of simulated data and also come up with ideas for how your own work could benefit from simulations.

Chances are that, in the past, you have made assumptions about your data without knowing how to test those or alternate possibilities. Simulations give you a route to do so, which means your conversations about data can expand to include hypothetical examples paired with quantitative metrics from simulations. This will ground your discussions while opening new possibilities, both in the time series domain and in other branches of data science.

Statistical Simulations

Statistical simulations are the most traditional route to simulated time series data. They are particularly useful when we know the underlying dynamics of a stochastic system and want to estimate a few unknown parameters or see how different assumptions would impact the parameter estimation process (we will see an example of this later in the book). Even for physical systems, sometimes the statistical simulation is better.

Statistical simulations of time series data are also quite valuable when we need to have a definitive quantitative metric to define our own uncertainty about the accuracy of our simulations. In traditional statistical simulations, such as an ARIMA model (to be discussed in Chapter 6), the formulas for the error are well established, meaning that to understand a system with a posited underlying statistical model, you do not need to run many simulations to make numerical assertions about error and variance.
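
To illustrate what a well-established formula buys you, the sketch below simulates a first-order autoregressive process, AR(1), which is one of the simplest building blocks of the ARIMA family. The parameter values are arbitrary choices for illustration. For this process, the long-run variance is known in closed form, so a simulation is only needed here to confirm the formula, not to discover the answer.

```python
import random

def simulate_ar1(phi, sigma, n_steps, rng):
    """Simulate x[t] = phi * x[t-1] + noise, with Gaussian noise of std sigma."""
    x = 0.0
    series = []
    for _ in range(n_steps):
        x = phi * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series

phi, sigma = 0.5, 1.0          # assumed parameters for illustration
rng = random.Random(42)
series = simulate_ar1(phi, sigma, 20_000, rng)

mean = sum(series) / len(series)
sample_var = sum((x - mean) ** 2 for x in series) / (len(series) - 1)
theoretical_var = sigma ** 2 / (1 - phi ** 2)   # known closed form for |phi| < 1

print(f"sample variance:      {sample_var:.3f}")
print(f"theoretical variance: {theoretical_var:.3f}")
```

The two numbers should be close, with the simulated value fluctuating around the theoretical one from run to run. For the heuristic, discrete event, and physics-based simulations earlier in this chapter there is usually no such formula, which is why we ran many independent simulations instead.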

Deep Learning Simulations

Deep learning simulations for time series are a nascent but promising field. The advantage of deep learning is that very complicated, nonlinear dynamics can be captured in time series data even without the practitioner fully understanding the dynamics. This is also a disadvantage, however, in that the practitioner has no principled basis for understanding the dynamics of the system.

Deep learning simulations also offer promise where privacy is a concern. For example, deep learning has been used to generate synthetic heterogeneous time series data for medical applications based on real time series data but without the potential to leak private information. Such a data set, if it can truly be produced without any privacy leaks, would be invaluable because researchers could have access to a large array of (otherwise expensive and privacy-violating) medical data.

More Resources

Cristóbal Esteban, Stephanie L. Hyland, and Gunnar Rätsch, “Real-Valued (Medical) Time Series Generation with Recurrent Conditional GANs,” unpublished manuscript, last revised December 4, 2017, https://perma.cc/Q69W-L44Z. The authors demonstrate how generative adversarial networks can be used to produce realistic-looking heterogeneous medical time series data. This is an example of how deep learning simulation can be used to create ethical, legal, and (hopefully) privacy-preserving medical data sets to enable wider access to useful data for machine learning and deep learning in the healthcare context.
Gordon Reikard and W. Erick Rogers, “Forecasting Ocean Waves: Comparing a Physics-based Model with Statistical Models,” Coastal Engineering 58 (2011): 409–16, https://perma.cc/89DJ-ZENZ. This article offers an accessible and practical comparison of two drastically different ways of modeling a system, with physics or with statistics. The researchers conclude that for the particular problem they address, the timescale of interest to the forecaster should drive decisions about which paradigm to apply. While this article is about forecasting, simulation is strongly related and the same insights apply.
Wolfgang Härdle, Joel Horowitz, and Jens-Peter Kreiss, “Bootstrap Methods for Time Series,” International Statistical Review / Revue Internationale de Statistique 71, no. 2 (2003): 435–59, https://perma.cc/6CQA-EG2E. A classic 2003 review of the difficulties of statistical simulation of time series data given temporal dependencies. The authors explain, in a highly technical statistics journal, why methods to bootstrap time series data lag behind methods for other kinds of data, as well as what promising methods were available at the time of writing. The state of the art has not changed too much, so this is a useful, if challenging, read.

¹ This example is heavily inspired by Luciano Ramalho’s book, Fluent Python (O’Reilly 2015). I highly recommend reading the full simulation chapter in that book to improve your Python programming skills and see more elaborate opportunities for agent-based simulation.

² The Ising model is a well-known and commonly taught classical statistical mechanical model of magnets. You can find many code examples and further discussion of this model online in both programming and physics contexts if you are interested in learning more.

³ The Ising model is more often used to understand what a ferromagnet’s equilibrium state is rather than to consider the temporal aspect of how a ferromagnet might make its way into an equilibrium state. However, we treat the evolution over time as a time series.

Practical Time Series Analysis by Aileen Nielsen

Figure 4-1. Total sum of donations received for each month of the data set.

Figure 4-2. Initial state of a 5 × 5 simulated ferromagnetic material, initialized with each state randomly selected to be spin up or spin down with equal probability.

Figure 4-3. Final low temperature state in one run of our simulation, as seen at 1,000 time steps.

Figure 4-4. 100 independent simulations of potential ways the system could enter a magnetized state at a low temperature even when each original lattice site was initialized randomly.
