As someone who regularly makes up scenarios to demonstrate concepts – both professionally and in my free time, I was very used to Db2’s answer to data generation – which is basically either fun with recursive SQL, pay for the additional IBM product Optim Test Data Generation, or use another language. As a result, most of us just scripted our own way of generating data when we had to, mostly using whatever language we are most comfortable with, and desperately trying to make the sample database work or use real data sets we have access to. I can even find an example in the IBM Db2 product documentation where perl is used to generate some data.

This is an area that I think some of the open source databases might do a bit better with, and I think of it as a “delighter”. It is something that you don’t absolutely need, and can work around, but when you see it done better, it seems nice.

I noticed when I was working through building a scenario out at work an interesting function that Snowflake has, and it made me want to explore the generation of data more deeply.

## My Interest in Data Generation

When I was working through some graduate courses from Georgia Tech, I took a course in Simulation. It was one of the best classes I’ve ever taken, and I learned so much. One of the big points of the course is that data rarely follows a uniform distribution – there is often more reason to any particular set of data. If we look at something like the timing of arrivals at a restaurant or the time between failures of a mechanical part – these things often follow a distribution. The type of distribution and the parameters of that distribution certainly vary, but much of data analytics involves using patterns we see in past data along with an appropriate distribution or two to predict what we’ll see in the future, or using distributions to look for deviations from what we might expect.

As a part of that class, I wrote SQL functions to generate numbers that fit a number of common distributions in Db2. This was an interesting learning experience for me. I was surprised to find that SQL did not already include a lot of these functions, though they are sometimes implemented by statistical packages like SPSS. In this work, I found the normal distribution to be one of the harder ones to generate, due to two factors. First, the trigonometric functions available left a bit to be desired, though I figured out enough trig to use other functions to get what I wanted.

The second issue was much more difficult. I was attempting to use a seed for my random numbers which are required inputs to the function. Using a seed means that the random numbers are repeatable. Start with the same seed, and you always get the same sequence of numbers in the same order. The functions used by nearly all computer systems to generate random numbers are really pseudo-random. This means there is no relationship between the numbers generated, but that generating the numbers is repeatable.

You might see how this would be useful for a simulation that you’d like to run many times and guarantee that either the set of random numbers being used is identical or alternately ensure you’re testing with a different set of random numbers than a previous run. Db2 would not allow me to specify different seeds for two random numbers in the same statement, and the functions I could find for mathematically generating random numbers that match a normal distribution all require two entirely unrelated random numbers. I ended up not being able to solve this problem in time for the class project.

Since then, of course, IBM has added DBMS_RANDOM.NORMAL, so this is no longer an issue in Db2. It does mean that as I researched my platform options when I was switching tracks, this was something I looked for, and many platforms had an answer to generate random numbers from at least the normal distribution.

That’s a bit more complicated than what I’m talking about in this article, but I think it provides some background for my interest in this topic. Most of my needs for generating data are fairly mundane. I get most of what I need from whatever sample database there is for the platform I’m working with. But even the extensive data available in SNOWFLAKE_SAMPLE_DATA doesn’t always meet my needs. A really good sample database that could be implemented in many platforms and that contained more than just basic use cases would go over well – let me know if anyone else knows of one!

## Functions and Table Functions for Data Generation

There is a table function in Snowflake called GENERATOR. This table function takes one or two arguments – either a `ROWCOUNT`

or a `TIMELIMIT`

or both. With the row count specified, we’re telling Snowflake to generate a static number of rows that we specify. With a time limit, we’re telling Snowflake to generate as many rows as it can within that time limit.

There are several sequence functions in Snowflake which allow us to generate sequences of numbers that are increasing in value of different possible widths. These sequences are guaranteed to not generate conflicting values and to generate values in increasing order, but they do not guarantee that there are no gaps in the sequences generated.

The RANDOM function generates a random 64-bit integer. It can be used alone or as an input to a distribution function like the ‘UNIFORM’ function or the ‘NORMAL’ function. It is important to remember that the values generated by random are not guaranteed to be unique across a data set, and the larger the number of values generated, the more likely a collision is. It’s also important to keep MPP principles in mind when working with random numbers. The random numbers used can vary if data is processed in a different order which is likely if the micro-partitions change or the warehouse size changes.

Combining the generator table function with sequences and random number generators allows us to generate interesting combinations of data.

## Generating Data with a Uniform Distribution

A uniform distribution means that every value in the range specified has an equal chance of being chosen. A simple query to generate some uniformly distributed values might look something like this:

```
SELECT SEQ4() as ID
, UNIFORM(1,10,RANDOM()) as rand_num
FROM TABLE(GENERATOR(ROWCOUNT => 10));
```

This will generate values from 1 to 10 chosen randomly to go along with each row id. The values generated will be different each time I run this. To get a more stable set of values that is repeated each time I can specify a seed value to the random function like this:

```
SELECT SEQ4() as ID
, UNIFORM(1,10,RANDOM(42)) rand_num
FROM TABLE(GENERATOR(ROWCOUNT => 10));
```

With this code I get the exact same 10 values for rand_num every time I run the statement. With a larger data set, remember that other factors can change what numbers are generated including the warehouse size and order in which associated data is processed if my query is more complex.

In any case, I find it easier to generate 10 random values here than I would in IBM Db2, where I would have to use a CTE or table function or other methodology.

## Generating Data with a Normal Distribution

Many tasks in analytics require generating random numbers that match a specific distribution. The normal distribution is a common one. Using the NORMAL function, we can generate data in a similar way as we did with the uniform distribution.

```
SELECT SEQ4() as ID
, NORMAL(5,1,RANDOM(42)) as rand_num
FROM TABLE(GENERATOR(ROWCOUNT => 10));
```

In this case, the values I supply to the normal function represent the mean and the standard deviation. Here, I’ve specified that the numbers should represent a normal distribution with a mean of 5 and a standard deviation of 1. This means it should be fairly rare that I see a number above 7 or below 3. My results look reasonable for that:

And the Snowsight interface shows me a distribution that looks about as much like a normal distribution as any 10 values could:

If I generate more values, it looks even more like a normal distribution:

```
SELECT SEQ4() as ID
, NORMAL(5,1,RANDOM(42)) as rand_num
FROM TABLE(GENERATOR(ROWCOUNT => 100000));
```

## Generating a Table with Specific-Interval Dates or Times

This is actually somewhat common to need to do. I want to play with some range joins because I run into them fairly often in my day job and I need to really understand them well to talk with customers about them and to offer different solutions. I am not finding data in the sample databases I regularly use that can simply fit the bill or be easily manipulated to fit the bill.

This is an area where many analytics databases may have a table of dates to work with. This can be true for other fairly static data as well, as this fits a number of use cases. Generating that table to begin with can be a bit of a hassle, though. I’ve certainly used recursive sql for this kind of thing, but I like some niceties in Snowflake SQL for this.

There are a number of solutions for this available in various places online, but I really dislike the ones that use a flatten or an array function. My favorite solution is adapted from this medium article. I tweak it a bit to generate a list of dates between two supplied dates like this:

```
set min_date=to_date('2024-02-07');
set max_date=to_date('2024-02-14');
SELECT dateadd('day', row_number() over (order by seq4()) - 1, $min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 30))
QUALIFY day <= $max_date;
```

There are a few things to be aware of here. The number specified for the row count in the `GENERATOR`

function today has to be a static number. I hope this is corrected in the future, for now it means this number must be greater than the number of dates you wish to generate.

At first, I thought that I could write this a bit more simply, and this is the code I came up with, which I DO NOT RECOMMEND USING:

```
SELECT dateadd('day', seq4(), $min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 30))
WHERE day <= $max_date;
```

The problem with this code is something that you might not run into at a small scale or with luck, but the general issue is that the sequences generated by the Snowflake SEQ functions are not guaranteed to be gap free. It is guaranteed to be increasing in value, so by using an order over the sequence, we can achieve the gap-free series that we really need for this task.

One reason a gap-free sequence is not guaranteed is simply a property of MPP systems. Many servers or containers working on the same sequence make problems with that sequence possible, if unlikely. Coordination requires a bit of communication and computation, so is therefore not always desirable.

## Generating Data with Random Dates/Times

Another common thing that may be needed is to generate random dates within a specific date range. This is relatively easy to do with a combination of the random values and date generation methods described above.

If I wanted to generate random days from a specific week, I might use something like this:

```
SET min_date = TO_DATE('2024-02-07');
SELECT dateadd('day', UNIFORM(0,6,RANDOM()), $min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 10))
order by day;
```

This certainly works with more granular times as well. If I instead wanted 10 random timestamps to the second from a specific week, I could use:

```
SET min_date = TO_DATE('2024-02-07');
SELECT dateadd('second', UNIFORM(0,604800,RANDOM()), $min_date) as day
FROM TABLE(GENERATOR(ROWCOUNT => 10))
order by day;
```

## Generating Strings

For strings, we have the `RANDSTR`

function. This function takes two arguments - the length of the string and a seed. Generating 10 random strings might look something like this:

```
SELECT randstr(5, random()) as my_string
FROM table(generator(rowCount => 10));
```

It is even possible to generate random strings of random lengths using something like this:

SELECT randstr(uniform(1,5,random()), random()) as my_string

FROM table(generator(rowCount => 10));

## Pulling it Together

Ultimately, we are often trying to generate more than one column at a time. Maybe we want to build a table with multiple columns. We can bring all of these methods together to do that.

```
SELECT SEQ4() as ID
, dateadd('day', row_number() over (order by seq4()) - 1, $min_date) as start_date
, dateadd('second', UNIFORM(0,604800,RANDOM()), $max_date) as end_time
, NORMAL(5,1,RANDOM()) as measurement
, randstr(uniform(1,5,random()), random()) || ' ' || randstr(uniform(1,5,random()), random()) as note
FROM TABLE(GENERATOR(ROWCOUNT => 10))
QUALIFY start_date <= $max_date;
```

This data could be directly inserted into a table via an `INSERT`

or a `CREATE TABLE ... AS ...`

.

The functionality and flexibility here is impressive to me.