Weighted Random Values in PL/SQL

A while back I wanted to generate a large volume of test data (a couple billion records), one column of which I needed to determine using a weighted random that had a large number of possible values.

A very simple weighted random can be generated in Oracle with an if or case block:

```DECLARE
n_random   NUMBER;
n_total_A  NUMBER := 0;
n_total_B  NUMBER := 0;

BEGIN
FOR i IN 1 .. 1000 LOOP
n_random := TRUNC(dbms_random.value(1, 5));

IF n_random > 1 THEN
n_total_A := n_total_A + 1;
ELSE
n_total_B := n_total_B + 1;
END IF;
END LOOP;

dbms_output.put_line('%A: ' || n_total_A / 10);
dbms_output.put_line('%B: ' || n_total_B / 10);
END;
/
```

The value “A” will be returned roughly 75% of the time since the random numbers 2, 3, and 4 will equate to “B”. A version of something along these lines works alright for a small, static set of values.

However, if you need to generate a weighted random for an arbitrary number of possible values each with their own weight, a static approach like above would be impossible to manage.

As an alternative, start by creating a table with two columns. For example:

end_key value
15 A
221 B
920 C

For a randomly generated key between 1 and the highest end_key in the table (920 in this example), the resulting weighted random value would be “A” if the key were between 1 and 15, “B” if it were between 16 and 221, and so on. The result of “A” would be produced roughly 1.3% of the time.

In PL/SQL, load this table into a sparsely populated collection indexed by the end_key and use it to return the weighted value:

```DECLARE

CURSOR weighted_key_cursor
IS
SELECT end_key,
value
FROM   weighted_key_example;

TYPE weighted_key_table IS TABLE OF weighted_key_cursor%ROWTYPE
INDEX BY PLS_INTEGER;

TYPE weighted_value_table IS TABLE OF VARCHAR2(5)
INDEX BY PLS_INTEGER;

t_keys    weighted_key_table;
t_weight  weighted_value_table;

FUNCTION get_weighted_random_value
RETURN VARCHAR2
IS
n_ind  PLS_INTEGER;
BEGIN
-- Pick a random key between 1 and the last end_key in the table
n_ind := TRUNC(dbms_random.value(1, t_weight.LAST + 1));
-- Return the value at that key, though it's pretty unlikely it will be found
RETURN t_weight(n_ind);
EXCEPTION
WHEN no_data_found THEN
-- Increment the key to the next known end_key and return that value
n_ind := t_weight.NEXT(n_ind);
RETURN t_weight(n_ind);
END;

BEGIN

-- Fetch the contents of the weighted reference table
OPEN  weighted_key_cursor;
FETCH weighted_key_cursor
BULK COLLECT INTO t_keys;
CLOSE weighted_key_cursor;

-- Populate the collection that will be used to generate the randoms, keyed
-- by the end_key column. Some additional error checking would be good here.
FOR i IN 1 .. t_keys.LAST LOOP
t_weight(t_keys(i).end_key) := t_keys(i).value;
END LOOP;

-- Delete the input collection since it isn't needed.
t_keys.DELETE;

-- Do the real work here, calling get_weighted_random_value as needed
dbms_output.put_line(get_weighted_random_value);

END;
/
```

By only populating the collection at the specific “end points” of each value rather than everywhere in between, the amount of process memory used is minimized. When a random key is picked that is between two of those end points, the function will increment it to the next populated collection index and return that value (this may be the most likely scenario depending on how far apart the populated indexes are).

In the specific case where I used the above method, I needed to get a random 5 digit zip code weighted by an estimate of the population of that zip code (or something roughly proportional to population). This resulted in about 30,000 possible values.

Later,
Kevin

When I was introduced to threaded Oracle processes the basic idea for dividing the work was to explicitly assign each thread a set of work and let each go on their way until all were complete. Say you have a reference table called “example_control” that stores units of work to be completed. This might be primary key values or partitioning keys corresponding to data that you are trying to process:

 partitioning_key process_start process_end 1 NULL NULL 2 NULL NULL 3 NULL NULL 4 NULL NULL 5 NULL NULL

You might assign these units of work to threads by adding a column for thread number and creating a process to populate that column by attempting to determine the relative “weight” of each unit (the estimated time to complete). You might instead simply have each thread select from this control table and use a MOD function to divide it mathematically using the current thread’s number and the total number of threads:

```WHERE MOD(partitioning_key, i_max_threads) + 1 = i_cur_thread_num
```

The basic limitation with both approaches is that it can be difficult and/or time consuming to attempt to evenly divide the work beforehand. Either you spend time and resources to calculate a weight for each unit of work or you simply carve the work mathematically knowing that not all units will take the same time to process. Some threads inevitably will finish their assigned work before others and as such their processing power is lost. In some cases the difference in assigned work is extreme with a couple threads running for hours after others have completed.

The alternative I put together is essentially to say that no effort should be spent in dividing the work beforehand and threads themselves should look for work to complete and not end until there is no work left to process. Some people have referred to this as “buffet style” threading.

Say we have the same control table as above and a threaded PL/SQL procedure called “threaded_example_proc”. When that procedure is executed, it will:

1. Fetch all “unfinished” records from the control table into a local collection. These are ordered randomly so there is less competition between threads initially as well as to afford the process some self-correction (discussed later).
2. Attempt to lock the control record (select for update nowait) and process it if the lock is acquired, otherwise skip.
3. If the lock is acquired, set the start time (optional) and process the record. The lock needs to be maintained throughout that processing so any commits should be done autonomously.
4. Set the process end time and commit to release the lock on the control record. Some indication that the record has been successfully processed is necessary. It doesn’t have to be a date/time though that information is useful for metrics.

Example PL/SQL code:

```CURSOR control_cursor
IS
SELECT   partitioning_key,
ROWID AS row_id
FROM     example_control
WHERE    process_end IS NULL
ORDER BY DBMS_RANDOM.VALUE();

TYPE control_cursor_table IS TABLE OF control_cursor%ROWTYPE
INDEX BY PLS_INTEGER;

FUNCTION lock_control_record
(i_ctrl_rec  IN  control_cursor%ROWTYPE)
RETURN BOOLEAN
IS
l_lock_success  NUMBER(1);
resource_busy   EXCEPTION;
PRAGMA EXCEPTION_INIT(resource_busy, -54);
BEGIN
SELECT  1
INTO    l_lock_success
FROM    example_control
WHERE   ROWID = i_ctrl_rec.row_id
AND     process_end IS NULL
FOR UPDATE NOWAIT;

RETURN TRUE;

EXCEPTION
WHEN resource_busy THEN
-- This record is locked, skip
RETURN FALSE;
WHEN no_data_found THEN
-- This record already has a process_end
-- so another thread finished it, skip
RETURN FALSE;
END;

PROCEDURE update_process_start
(i_ctrl_rec  IN  control_cursor%ROWTYPE)
IS
BEGIN
UPDATE example_control
SET    process_start = SYSDATE
WHERE  ROWID = i_ctrl_rec.row_id;
END;

PROCEDURE update_process_end
(i_ctrl_rec  IN  control_cursor%ROWTYPE)
IS
BEGIN
UPDATE example_control
SET    process_end = SYSDATE
WHERE  ROWID = i_ctrl_rec.row_id;
END;

IS
t_ctrl  control_cursor_table;
BEGIN
OPEN  control_cursor;
FETCH control_cursor
BULK COLLECT INTO t_ctrl;
CLOSE control_cursor;

FOR i IN 1 .. t_ctrl.COUNT LOOP
IF lock_control_record(t_ctrl(i)) THEN
update_process_start(t_ctrl(i));

-- Do the processing --

update_process_end(t_ctrl(i));

COMMIT;
END IF;
END LOOP;
END;
```

• The overall process is not dependent on the number of threads running. Additional threads can be added or removed in the middle of the process if deemed necessary without the need to kill all threads and restart.
• There is no need to spend time calculating an even division of work between the threads. The processing actually being done may be difficult to estimate beforehand depending on the complexity of the process.
• If one unit of work takes ends up taking longer to process than another, the thread processing the larger unit will simply process less total control records. The total work will naturally balance out between the threads.
• There is some self-correction capability in that if a thread fails while processing a control record for a recoverable reason (e.g. network connection loss), that session’s lock will be released and another thread will be free to process is later when it gets to that record. This is another reason to initially order the control table collection randomly.

Performance can be further improved if there is physical separation between the work being performed by the threads (e.g. avoiding read by other session waits). I use the example of the control table containing partitioning or subpartitioning keys which helps to accomplish such separation assuming that the work can be mostly limited to an individual partition/subpartition of a table and a local index.

Additionally, the units of work should be sufficiently large such that the majority of the processing power of the database is not spent on overhead (i.e. locking and committing changes to control records). Primary key values might not be a good control unit unless they are simply reference data and correspond with a large amount of data/processing in other related tables.

If the control table contains both a start and end time for each record, the overall progress of the process can be monitored easily with a query such as:

```SELECT TO_CHAR (start_dt + (1/24), 'MM/DD HH:MI AM') start_time,
ROUND ((end_dt - start_dt) * 24, 2) hours_running,
ROUND ((((end_dt - start_dt) * 24) / (c_count / t_count)) - (end_dt - start_dt) * 24, 2) est_hours_until_complete,
TO_CHAR ((start_dt + (((end_dt - start_dt)) / (c_count / t_count))) + (1/24), 'MM/DD HH:MI AM') est_end_time,
ROUND (c_count / t_count * 100, 2) || '%' completed_percent,
c_count || ' / ' || t_count AS completed_recs
FROM  (SELECT SUM(CASE WHEN process_end IS NOT NULL THEN 1
ELSE 0
END) AS c_count,
COUNT(1) AS t_count,
MAX(process_end) end_dt,
MIN(process_start) start_dt
FROM   example_control);
```

Thanks,
Kevin

Automated Tablespace Defragmentation & Storage Reclamation

This is one of the more recent projects I worked on to address an issue I’ve faced in the past with Oracle tablespaces becoming fragmented and difficult to shrink after periods of rapid growth followed by drastic reductions in data retention. The culprits were mostly very large (4+ TB) date partitioned tables.

At one point, I put together a system where historically very large tablespaces (5-10+ TB) were broken up into many, much smaller tablespaces and partitioned tables were spread evenly over all available tablespaces. I added a process to a daily partition management job that would facilitate this by recalculating a subpartition template for subpartitioned tables each day based on the representative size of each subpartition value such that the resulting template would evenly spread data over a range of tablespaces. While this did successfully even out tablespace sizes in the database, it did not address the issue of tablespace fragmentation.

Inevitably at some point it would be decided that one or more large tables should be radically reduced in length of retention or dropped entirely. This resulted in a large amount of free space in the many small tablespaces and the theoretical advantage of no tablespace in the database being very large was that the effort to reorganize and shrink any individual tablespace would be small. However, in practice this wasn’t the case due to the fact that the large objects were highly partitioned and Oracle does not really provide an ability to move/rebuild multiple objects at the same time (apart from DataPump export/import which would be a lengthy outage). Each small subpartition had to be moved separately resulting in a huge quantity of expensive DDL statements (expensive due to the very large library footprint of the biggest tables). Ironically the time spent actually moving a subpartition was far less than the time spent parsing the move statement not to mention the contention of running the moves in parallel.

What I’ve proposed now is that rather than focus on keeping the tablespace sizes even, instead rotate partitioned tables through a set range of tablespaces such that every few days, one of the tablespaces in the range is emptied and easy to shrink. I’ve included some graphs of what this would look like. Basically, partitioned tables would be grouped based on similar length of retention and synced such that the same partition dates across all tables in the group would be in the same tablespace. As an example the below large example tables could be grouped together:

 Table Name Future Partitions Retention Min Tablespace Max Tablespace Tablespace Span Total Tablespace Span Total Assigned Tablespaces Maximum Potentional Retention EXAMPLE_TABLE_1 10 120 31 80 3 44 50 137 EXAMPLE_TABLE_2 3 120 31 80 3 42 50 144

The tables above are in the same logical group because they share the same tablespace range (TABLESPACE_31 through TABLESPACE_80). This introduces a concept which I refer to as tablespace span (there’s probably a less confusing name for this). This value is the number of date partitions that should be placed in the same tablespace. For example, a tablespace span of 3 as seen in the above table would look like:

 Partition Tablespace 9/21/2012 TABLESPACE_21 9/20/2012 TABLESPACE_21 9/19/2012 TABLESPACE_20 9/18/2012 TABLESPACE_20 9/17/2012 TABLESPACE_20 9/16/2012 TABLESPACE_19 9/15/2012 TABLESPACE_19 9/15/2012 TABLESPACE_19

The reasoning behind this is that a tablespace span of 1 would require that there be an available, non-repeated tablespace for each partition in the table. This would require hundreds of tablespaces in some cases which would be a great deal of overhead.

The total tablespace span of each table is the number of tablespaces that the table can occupy at one time based on the add offset days, drop offset days, and tablespace span. It is calculated as follows:

`total_tablespace_span = FLOOR((retention + future_partitions)/tablespace_span) + 1`

The idea here is to keep the total tablespace span less than the total assigned tablespaces in the range. That way there is always one tablespace that is emptied on a regular basis as the tables move out of one tablespace and into another. This gap moves through all tablespaces in the range over time and thereby free storage can be automatically reclaimed from all tablespaces without expensive reorganization. The frequency at which a tablespace is emptied is the same as the tablespace span assigned to the group. In the above example, every third day a tablespace could be shrunk. The reason you would want to assign more tablespaces to each group than required is so that there is a buffer in case it is decided that the retention of a table in the range should be increased. The maximum potential retention a table could be increased to and still support this model is calculated as:

`maximum_potential_retention = tablespace_span * (total_assigned_tablespaces – 1) – retention`

Below are several graphs to help visualize this:

Obviously this model only works for date partitioned tables. The same idea would work for certain other partitioning schemes that involve the regular addition and dropping of partitions from large tables. The assumption here being that if a table is not partitioned or is statically partitioned, it probably isn’t large enough to be of concern and could be placed in one of several tablespaces reserved for such tables. This would include mostly reference tables and other small stage and load tables that total an insignificant amount of storage when compared to the heavy hitters in the database.

Most of the work in implementing something like this is to carefully and methodically construct groups of tables of similar size and retention as well as account for possible changes in retention. The size of the tables in the group should also be taken into account so that the tablespace range is large enough such that the individual size of each tablespace doesn’t get too far out of hand.

Basic implementation of this process is to alter/setup a regularly scheduled partition management job that will add future date partitions to tables and drop date partitions older than the retention threshold. The future partitions would be added such that the tablespace for each partition would repeat for <tablespace span> days. I can provide some PL/SQL examples for this here if people are interested.

A second job could then be created that queries tablespaces for those that are “empty” and re-sizes their associated datafiles to the high water mark. Empty likely being defined as between two configurable values (less than a maximum used size and greater than a minimum total allocated size. This would prevent the process from attempting to calculate high water marks for and re-size datafiles that aren’t “empty” but also prevent the process from shrinking the same already shrunk tablespaces each day. An example query that could be included in a PL/SQL process:

```SELECT tablespace_name
FROM  (SELECT   d.tablespace_name,
ROUND((SUM(d.bytes) / 1024 / 1024) - (NVL(SUM(f.bytes) / 1024 / 1024,0))) used_mb,
ROUND(SUM(d.bytes) / 1024 / 1024) allocated_mb
FROM     dba_data_files d,
dba_free_space f
WHERE    d.tablespace_name = f.tablespace_name(+)
GROUP BY d.tablespace_name)
WHERE  used_mb <= gn_max_used_space_for_shrink
AND    allocated_mb >= gn_min_allocated_for_shrink
AND    REGEXP_LIKE(tablespace_name, gv_valid_tablespace_regex);
```

Thanks,
Kevin

Posted in Projects | 1 Comment

Start

I’m starting this blog as a way to publish some of my ideas, projects, tips, etc. that I’ve put together over the past several years since I began working with large (primarily Oracle) database implementations (40+ TB). This site is still very much a work in progress so expect the look and feel in particular to be updated. In any event, I hope others find the content here helpful and I look forward to feedback.

Thanks,
Kevin