Population synthesis is a method for creating a fully-enumerated population of the RIVCOM model region (persons and households) based on a population sample. The RIVCOM model has a PopSyn III (JavaPop) implementation of the population synthesizer. This implementation requires only Java to run and there is no dependency on a SQL backend. R software, which is included along with the model installation, is also required for the data processing of inputs and outputs outside of PopSyn.

Separate PopSyn procedures are run to synthesize the resident household population and the group-quarter (GQ) population. In its generic form, the population synthesizer takes two control files (targets): TAZ-level controls and regional controls. The RIVCOM model includes separate control files for the resident and GQ PopSyn setups.

Resident households

TAZ-level controls include households, population, households by size, households by income, households by workers, population by age, single-family dwelling units, and multi-family dwelling units for each TAZ in the model region.

Below are the categories defined in the RIVCOM PopSyn implementation. There are no constraints on which categories must be used and PopSyn is able to incorporate user-defined categories for the variables.

Control Name                    Control Categories
Household by Size               1, 2, 3, and 4+
Household by Income Category    “less than 40K”, “40K - 85K”, “85K - 170K”, “greater than 170K”
Household by Workers            0, 1, 2, and 3+
Person by Age                   “5 - 17”, “18 - 24”, “16 - 64”, and “65 +”
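
As an illustration of how the household categories in the table above could be assigned, here is a minimal Python sketch. The function names are hypothetical, and the treatment of the income boundaries (lower bound inclusive, upper bound exclusive) is an assumption; the source does not state which side of each break a boundary value falls on.

```python
# Hypothetical helpers; not part of the RIVCOM or PopSyn code base.

def size_category(persons: int) -> str:
    """Household size category: 1, 2, 3, or 4+."""
    if persons < 1:
        raise ValueError("a household needs at least one person")
    return str(persons) if persons < 4 else "4+"


def workers_category(workers: int) -> str:
    """Household workers category: 0, 1, 2, or 3+."""
    if workers < 0:
        raise ValueError("worker count cannot be negative")
    return str(workers) if workers < 3 else "3+"


# Assumption: an income equal to a break belongs to the higher category.
INCOME_BREAKS = [
    (40_000, "less than 40K"),
    (85_000, "40K - 85K"),
    (170_000, "85K - 170K"),
]


def income_category(income: float) -> str:
    """Household income category using the four labels from the table."""
    for upper, label in INCOME_BREAKS:
        if income < upper:
            return label
    return "greater than 170K"
```

Because PopSyn accepts user-defined categories, a different set of breaks would only require changing the lookup list.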

The regional controls file includes total regional employment for each of the 13 industry categories defined in the model. The PopSyn data preparation step uses a crosswalk between the model employment categories and NAICS codes.
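
A crosswalk of this kind can be sketched as a lookup from two-digit NAICS sector to model category. The mapping below is an assumption for illustration only: it follows a common 13-industry grouping of NAICS sectors, and the actual crosswalk shipped with the model may differ. The function names are hypothetical.

```python
from collections import Counter

# Illustrative mapping only (assumption); the model's own crosswalk may differ.
NAICS2_TO_MODEL = {
    "11": "Agricultural", "21": "Agricultural",
    "23": "Construction",
    "31": "Manufacturing", "32": "Manufacturing", "33": "Manufacturing",
    "42": "Wholesale",
    "44": "Retail", "45": "Retail",
    "22": "Transportation", "48": "Transportation", "49": "Transportation",
    "51": "Information",
    "52": "FIRE", "53": "FIRE",
    "54": "Professional", "55": "Professional", "56": "Professional",
    "61": "Educational", "62": "Educational",
    "71": "Art Entertainment", "72": "Art Entertainment",
    "81": "Other Services",
    "92": "Public Administration",
}


def model_category(naics_code: str) -> str:
    """Look up the model category from the first two digits of a NAICS code."""
    return NAICS2_TO_MODEL[naics_code[:2]]


def employment_by_category(jobs):
    """Sum (naics_code, job_count) pairs into model employment categories."""
    totals = Counter()
    for code, count in jobs:
        totals[model_category(code)] += count
    return dict(totals)


# Made-up job counts, for illustration only.
example = employment_by_category([("236115", 10), ("238", 5), ("4451", 7)])
```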

Resident group-quarter population

For the group-quarter population, the TAZ-level control file gives the GQ population in each TAZ, and the regional GQ population total acts as a further control on the total modeled GQ population.

PopSyn procedure

The PopSyn procedure is a multi-dimensional matrix balancing procedure that attempts to match each set of controls at each level of geography. Depending on the confidence in the targets, the algorithm can be set to prioritize certain controls over others. The validation section below discusses the priority used for each control.
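
The idea of balancing seed household weights against several controls can be illustrated with a simplified iterative proportional update. This is a sketch, not the PopSyn III algorithm: it treats every control as equally important (no priorities), handles a single geography, and uses made-up seed households and targets. The name `balance_weights` is hypothetical.

```python
def balance_weights(incidence, targets, iterations=100):
    """Adjust household weights so weighted control totals approach the targets.

    incidence[name][h] is the contribution of seed household h to control
    `name` (1/0 for household controls, a person count for person controls).
    """
    n_households = len(next(iter(incidence.values())))
    weights = [1.0] * n_households
    for _ in range(iterations):
        for name, target in targets.items():
            row = incidence[name]
            current = sum(w * x for w, x in zip(weights, row))
            if current == 0:
                continue  # no seed household contributes to this control
            factor = target / current
            # Scale every contributing household by the same factor.
            weights = [w * factor if x else w for w, x in zip(weights, row)]
    return weights


# Three made-up seed households: A (1 person, 1 senior),
# B (2 persons, no seniors), C (2 persons, 2 seniors).
incidence = {
    "households": [1, 1, 1],
    "hh_size_1": [1, 0, 0],
    "hh_size_2": [0, 1, 1],
    "pop_65_plus": [1, 0, 2],
}
targets = {"households": 100, "hh_size_1": 30, "hh_size_2": 70, "pop_65_plus": 90}
weights = balance_weights(incidence, targets)
```

In this example the targets are mutually consistent, so the weights settle at about 30, 40, and 30, the only combination that satisfies all four controls. When targets conflict, a procedure with priorities decides which controls are matched most closely.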

Seed Data

In addition to the control files described earlier, PopSyn takes Census data as input, specifically the Public Use Microdata Sample (PUMS). This census-based population sample is called the seed data. RIVCOM uses the 5-year ACS PUMS data for 2014-2018 as the seed for the population synthesizer. The Census Bureau releases PUMS data at the geographic level of Public Use Microdata Areas (PUMAs). The PUMS data for the state of California is filtered to the PUMAs that fall within Riverside County, Orange County, or San Diego County, and the result is used as the seed data.
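
The PUMA filter can be sketched as follows. The sample records and the PUMA codes in `KEEP_PUMAS` are made up for illustration; the actual list of PUMAs would come from a PUMA-to-county lookup. The column names follow the ACS PUMS household file (SERIALNO, PUMA, NP, HINCP, WGTP), and `filter_seed` is a hypothetical helper. The model itself performs this data processing in R.

```python
import csv
import io

# Made-up records in the layout of a PUMS household file.
SAMPLE_PUMS = """SERIALNO,PUMA,NP,HINCP,WGTP
2018HU0000001,06501,3,72000,25
2018HU0000002,03701,2,45000,18
2018HU0000003,05901,4,130000,30
"""

# Placeholder PUMA codes (assumption), standing in for the PUMAs
# that fall within the counties of interest.
KEEP_PUMAS = {"06501", "05901"}


def filter_seed(csv_text, keep_pumas):
    """Return only the household records whose PUMA is in keep_pumas."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["PUMA"] in keep_pumas]


seed = filter_seed(SAMPLE_PUMS, KEEP_PUMAS)
```

PUMA codes are kept as strings so that leading zeros survive the comparison.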

The model region covers parts of San Bernardino, Orange, Riverside, and San Diego Counties. Note that the PopSyn procedure is run for all zones except those in San Diego County, which are populated with a synthesized population from the SANDAG model. In the implementation, the synthesized population for the San Diego zones is extracted from SANDAG’s PopSyn outputs and stitched together with the RIVCOM PopSyn outputs (without the San Diego TAZs) to create the final PopSyn outputs for the region.
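
The stitching step can be sketched as below. The record layout (`hh_id`, `taz`), the function name `stitch_households`, and the renumbering of household IDs to avoid collisions between the two sources are all assumptions made for illustration; the source does not describe how IDs are handled. Person records would need the same treatment so that they keep pointing at the right household.

```python
def stitch_households(rivcom_households, sandag_households, san_diego_tazs):
    """Combine RIVCOM households outside San Diego with SANDAG households inside it."""
    outside = [h for h in rivcom_households if h["taz"] not in san_diego_tazs]
    inside = [h for h in sandag_households if h["taz"] in san_diego_tazs]
    combined = []
    # Assumption: renumber IDs so the two sources cannot collide.
    for new_id, household in enumerate(outside + inside, start=1):
        combined.append({**household, "hh_id": new_id})
    return combined


# Made-up records and TAZ numbers, for illustration only.
rivcom = [{"hh_id": 1, "taz": 10}, {"hh_id": 2, "taz": 11}]
sandag = [{"hh_id": 1, "taz": 900}, {"hh_id": 2, "taz": 901}]
final = stitch_households(rivcom, sandag, san_diego_tazs={900, 901})
```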

The outputs of PopSyn are further processed to produce the number of households by market segment. This summarized PopSyn output is located in popsyn/HHDisaggregation. PopSyn inputs and outputs are maintained in separate sub-folders within the master/popsyn folder.

Base Year PopSyn Validation

For the base year, PopSyn is validated by comparing the TAZ-level match between the control numbers and the synthesized numbers. In PopSyn, a priority (importance) can be set for each control depending on the confidence in that control category. Typically, confidence is highest for the TAZ-level total household control, which is also the most important predictive variable for trip generation models. Hence the priority for the total household control is set at the maximum level: 1 billion (1.0E9 in scientific notation). Priorities for the other TAZ-level controls are set at a lower level of 1000. The regional employment controls are given the lowest priority, with a value of 100. The table below shows the match between the synthesized number of households and the control number of households. The match is essentially exact because of the priority set on the total household control. Only TAZs within Riverside County are shown in the figures below, to prevent large TAZs in Orange County and San Bernardino County from dominating the plots.

The next priority level covers the remaining TAZ-level controls: household size, households by number of workers, households by income category, households by dwelling type, and population by age category. The scatter plots for these controls are shown below.

The tables below show a similar comparison for the number-of-workers categories used in population synthesis.

The tables below show a similar comparison for the household income categories used in population synthesis.

The tables below show the comparison for the household dwelling type controls used in population synthesis.

The tables below show the comparison for the population age categories used in population synthesis.

The table below summarizes the fit between the input controls and the synthesized outputs.

Validation Summary for TAZ controls
Controls                                     Input      Output    % Diff    % RMSE
Total Households                           722,562     722,562      0.00       0.0
Households Size: 1                         124,908     124,939      0.02      14.0
Households Size: 2                         198,047     198,083      0.02       5.1
Households Size: 3                         103,608     103,637      0.03       9.1
Households Size: 4+                        295,926     295,903     -0.01       7.0
Number of Workers: 0                       247,937     247,247     -0.28      11.3
Number of Workers: 1                       289,127     289,326      0.07       2.8
Number of Workers: 2                       150,583     151,061      0.32      10.3
Number of Workers: 3+                       34,745      34,928      0.53      30.6
Household Income: Low                      240,539     240,574      0.01       6.1
Household Income: Medium                   228,927     228,922      0.00       4.3
Household Income: High                     185,131     185,113     -0.01       5.6
Household Income: Very High                 67,918      67,953      0.05      12.3
Number of Single Family Dwelling Unit      493,663     493,870      0.04       5.2
Number of Multi Family Dwelling Unit       228,887     228,692     -0.09      11.6
Population in Age 5_17                     465,000     464,932     -0.01      19.2
Population in Age 18_24                    234,018     234,059      0.02       9.9
Population in Age 16_64                  1,488,656   1,487,643     -0.07      24.5
Population in Age 65P                      336,068     335,828     -0.07      26.7

In this table, the Input column shows the input control total for each control category and the Output column shows the synthesized total for the same category. The % Diff column is the percentage difference between the input control and the synthesized output. %RMSE is the percent root mean squared error, calculated across all the (Riverside County) TAZs. The table reinforces some of the earlier conclusions. Total households at the TAZ level are matched exactly. Controls with fewer observations, such as the Household Size: 1 category and the Number of Workers: 3+ category, are matched less well, as indicated by their relatively higher %RMSE.
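
The two fit measures can be sketched as follows. The % Diff formula, taken relative to the input control, reproduces the value reported in the table for Number of Workers: 3+. The %RMSE normalization (dividing the root mean squared error by the mean control value across TAZs) is an assumption, since the source does not give the formula. The function names are hypothetical.

```python
import math


def percent_diff(control_total, synthesized_total):
    """Percentage difference of the synthesized total relative to the control."""
    return 100.0 * (synthesized_total - control_total) / control_total


def percent_rmse(controls, synthesized):
    """Root mean squared error across TAZs as a percent of the mean control.

    Assumption: normalizing by the mean control value is one common
    definition; the source does not state which one is used.
    """
    n = len(controls)
    mse = sum((s - c) ** 2 for c, s in zip(controls, synthesized)) / n
    mean_control = sum(controls) / n
    return 100.0 * math.sqrt(mse) / mean_control


# Number of Workers: 3+ totals from the table above.
diff_workers_3plus = round(percent_diff(34_745, 34_928), 2)
```

A control can have a near-zero % Diff and still a high %RMSE, because positive and negative TAZ-level errors cancel in the regional total but not in the squared errors.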

Validation Summary for Regional controls
Control                                Input    Output    % Diff
Agricultural Employment               18,600    18,325     -1.50
Construction Employment              217,373   213,747     -1.70
Manufacturing Employment             258,822   255,485     -1.31
Wholesale Employment                 148,900   145,747     -2.16
Retail Employment                    323,532   317,941     -1.76
Transportation Employment            154,950   152,717     -1.46
Information Employment                38,010    37,320     -1.85
FIRE Employment                      174,606   171,184     -2.00
Professional Employment              499,512   488,105     -2.34
Educational Employment               697,676   683,565     -2.06
Art Entertainment Employment         395,691   387,249     -2.18
Other Services Employment            118,678   116,991     -1.44
Public Administration Employment      67,185    66,420     -1.15

The table above shows the match between the controls and the synthesized totals for the regional controls, that is, employment by industry category. Thirteen industry categories are used. The output totals are about 1 to 2% lower than the input control totals in every industry category. This is because the inputs are jobs and the outputs are workers: a single worker can hold multiple jobs, so total jobs (the input) will be higher than total workers (the output). Because of this difference between inputs and outputs, the employment controls are given the lowest priority in the population synthesis process, a value of 100.
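
The jobs-versus-workers gap can be illustrated with a toy example. The data and the function name `jobs_and_workers` are made up; the point is only that counting jobs always gives a total at least as large as counting the people who hold them.

```python
def jobs_and_workers(job_holdings):
    """Count total jobs and total workers.

    job_holdings maps a worker id to the list of industries of the jobs
    that person holds (an empty list means the person is not employed).
    """
    jobs = sum(len(industries) for industries in job_holdings.values())
    workers = sum(1 for industries in job_holdings.values() if industries)
    return jobs, workers


# Made-up example: one person holds two jobs, one holds none.
holdings = {
    "p1": ["Retail"],
    "p2": ["Retail", "Educational"],
    "p3": [],
}
jobs, workers = jobs_and_workers(holdings)
```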

Riverside County Model, 2020