A while back I wanted to generate a large volume of test data (a couple billion records), one column of which I needed to determine using a weighted random that had a large number of possible values.

A very simple weighted random can be generated in Oracle with an if or case block:

DECLARE n_random NUMBER; n_total_A NUMBER := 0; n_total_B NUMBER := 0; BEGIN FOR i IN 1 .. 1000 LOOP n_random := TRUNC(dbms_random.value(1, 5)); IF n_random > 1 THEN n_total_A := n_total_A + 1; ELSE n_total_B := n_total_B + 1; END IF; END LOOP; dbms_output.put_line('%A: ' || n_total_A / 10); dbms_output.put_line('%B: ' || n_total_B / 10); END; /

The value “A” will be returned roughly 75% of the time since the random numbers 2, 3, and 4 will equate to “B”. A version of something along these lines works alright for a small, static set of values.

However, if you need to generate a weighted random for an arbitrary number of possible values each with their own weight, a static approach like above would be impossible to manage.

As an alternative, start by creating a table with two columns. For example:

end_key | value |
---|---|

15 | A |

221 | B |

920 | C |

For a randomly generated key between 1 and the highest end_key in the table (920 in this example), the resulting weighted random value would be “A” if the key were between 1 and 15, “B” if it were between 16 and 221, and so on. The result of “A” would be produced roughly 1.3% of the time.

In PL/SQL, load this table into a sparsely populated collection indexed by the end_key and use it to return the weighted value:

DECLARE CURSOR weighted_key_cursor IS SELECT end_key, value FROM weighted_key_example; TYPE weighted_key_table IS TABLE OF weighted_key_cursor%ROWTYPE INDEX BY PLS_INTEGER; TYPE weighted_value_table IS TABLE OF VARCHAR2(5) INDEX BY PLS_INTEGER; t_keys weighted_key_table; t_weight weighted_value_table; FUNCTION get_weighted_random_value RETURN VARCHAR2 IS n_ind PLS_INTEGER; BEGIN -- Pick a random key between 1 and the last end_key in the table n_ind := TRUNC(dbms_random.value(1, t_weight.LAST + 1)); -- Return the value at that key, though it's pretty unlikely it will be found RETURN t_weight(n_ind); EXCEPTION WHEN no_data_found THEN -- Increment the key to the next known end_key and return that value n_ind := t_weight.NEXT(n_ind); RETURN t_weight(n_ind); END; BEGIN -- Fetch the contents of the weighted reference table OPEN weighted_key_cursor; FETCH weighted_key_cursor BULK COLLECT INTO t_keys; CLOSE weighted_key_cursor; -- Populate the collection that will be used to generate the randoms, keyed -- by the end_key column. Some additional error checking would be good here. FOR i IN 1 .. t_keys.LAST LOOP t_weight(t_keys(i).end_key) := t_keys(i).value; END LOOP; -- Delete the input collection since it isn't needed. t_keys.DELETE; -- Do the real work here, calling get_weighted_random_value as needed dbms_output.put_line(get_weighted_random_value); END; /

By only populating the collection at the specific “end points” of each value rather than everywhere in between, the amount of process memory used is minimized. When a random key is picked that is between two of those end points, the function will increment it to the next populated collection index and return that value (this may be the most likely scenario depending on how far apart the populated indexes are).

In the specific case where I used the above method, I needed to get a random 5 digit zip code weighted by an estimate of the population of that zip code (or something roughly proportional to population). This resulted in about 30,000 possible values.

Later,

Kevin