Population synthesis is a method for creating a fully-enumerated population of the RIVCOM model region (persons and households) based on a population sample. The RIVCOM model has a PopSyn III (JavaPop) implementation of the population synthesizer. This implementation requires only Java to run and there is no dependency on a SQL backend. R software, which is included along with the model installation, is also required for the data processing of inputs and outputs outside of PopSyn.
Separate PopSyn procedures are run to synthesize the resident household and group-quarter (GQ) populations. In its generic form, the population synthesizer takes two control files (targets): TAZ-level controls and regional controls. The RIVCOM model includes separate control files for the resident and GQ PopSyn setups.
TAZ-level controls include households, population, households by size, households by income, households by workers, population by age, single-family dwelling units, and multi-family dwelling units for each TAZ in the model region.
Below are the categories defined in the RIVCOM PopSyn implementation. There are no constraints on which categories must be used and PopSyn is able to incorporate user-defined categories for the variables.
Control Name | Control Categories |
---|---|
Household by Size | 1, 2, 3, and 4+ |
Household by Income Category | “less than 40K”, “40K - 85K”, “85K - 170K”, “greater than 170K” |
Household by Workers | 0, 1, 2, and 3+ |
Person by Age | “5 - 17”, “18 - 24”, “16 - 64”, and “65 +” |
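As an illustration of how the household categories in the table above could be applied to a single seed household, here is a minimal Python sketch. The function names are hypothetical and not part of PopSyn. The treatment of values that fall exactly on an income boundary (40K, 85K, 170K) is an assumption, since the table does not say which bin they belong to.

```python
def size_category(persons: int) -> str:
    """Map a household's person count to the size control: 1, 2, 3, or 4+."""
    if persons < 1:
        raise ValueError("a household needs at least one person")
    return str(persons) if persons <= 3 else "4+"


def workers_category(workers: int) -> str:
    """Map a household's worker count to the workers control: 0, 1, 2, or 3+."""
    if workers < 0:
        raise ValueError("worker count cannot be negative")
    return str(workers) if workers <= 2 else "3+"


def income_category(income: float) -> str:
    """Map annual household income in dollars to the income control."""
    # ASSUMPTION: each bin includes its lower bound and excludes its upper
    # bound; the source table does not specify boundary handling.
    if income < 40_000:
        return "less than 40K"
    if income < 85_000:
        return "40K - 85K"
    if income < 170_000:
        return "85K - 170K"
    return "greater than 170K"


# A five-person household with two workers and $90,000 income.
example = (size_category(5), income_category(90_000), workers_category(2))
```

Because PopSyn accepts user-defined categories, the bin edges here would simply change if a different category scheme were adopted.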
The regional controls file includes total regional employment for the 13 industry categories defined in the model. The PopSyn data preparation step uses a crosswalk between the model employment categories and NAICS codes.
For the group-quarter population, the TAZ-level control file gives the GQ population within each TAZ, and the total GQ population in the region acts as a further control on the total modeled GQ population.
The PopSyn procedure is a multi-dimensional matrix balancing procedure that attempts to match each set of controls at each level of geography. Depending on the confidence in the targets, the algorithm can be set to prioritize certain controls over others. The validation discussion later in this section describes the priorities used for the different controls.
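To give a feel for what matrix balancing means, the sketch below implements classical iterative proportional fitting (IPF) on a two-dimensional table. This is a simplified stand-in, not PopSyn's actual algorithm: plain IPF treats all controls as equally binding and has no notion of priorities or multiple geographies. The seed table and targets are made-up illustrative numbers.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Scale a seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Step 1: scale each row to its target total.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            if row_sum > 0:
                factor = target / row_sum
                table[i] = [value * factor for value in table[i]]
        # Step 2: scale each column to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            if col_sum > 0:
                factor = target / col_sum
                for row in table:
                    row[j] *= factor
        # Columns now match exactly; stop once the rows also match.
        worst = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if worst < tol:
            break
    return table


# Hypothetical seed counts (rows: two size groups, columns: two income groups).
seed = [[1.0, 2.0], [3.0, 4.0]]
balanced = ipf(seed, row_targets=[40.0, 60.0], col_targets=[30.0, 70.0])
```

The balanced table keeps the correlation structure of the seed while reproducing both sets of marginal totals, which is the basic idea that a prioritized, multi-level balancing procedure builds on.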
In addition to the control files described earlier, PopSyn takes Census data as input, specifically the Public Use Microdata Sample (PUMS). This census-based population sample is called the seed data. RIVCOM uses the 5-year ACS PUMS data for 2014-2018 as the seed for the population synthesizer. The Census Bureau releases PUMS data at the geographic level of Public Use Microdata Areas (PUMAs). The PUMS data for the state of California is filtered to the PUMAs that fall within Riverside County, Orange County, or San Diego County, and the filtered records are used as the seed data.
The model region covers parts of San Bernardino County, Orange County, Riverside County, and San Diego County. Note that the PopSyn procedure is run for all zones except those in San Diego County, which are populated with a synthesized population from the SANDAG model. In the implementation, the synthesized population for the San Diego zones is extracted from SANDAG's PopSyn outputs and stitched together with the RIVCOM PopSyn outputs (which exclude the San Diego TAZs) to create the final PopSyn outputs for the region.
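The stitching step described above can be sketched roughly as follows. This is only an illustration: the record layout (`hh_id`, `taz`), the function name, and the renumbering of household IDs are all assumptions, and the actual processing is done by the scripts shipped with the model.

```python
def stitch_households(rivcom_households, sandag_households, san_diego_tazs):
    """Combine RIVCOM households with SANDAG households for San Diego TAZs."""
    # Keep RIVCOM households outside San Diego (defensive filter; the RIVCOM
    # run is described as already excluding these TAZs).
    kept = [(h, "rivcom") for h in rivcom_households
            if h["taz"] not in san_diego_tazs]
    # Take only the SANDAG households that fall in San Diego model TAZs.
    added = [(h, "sandag") for h in sandag_households
             if h["taz"] in san_diego_tazs]

    combined = []
    # ASSUMPTION: IDs are renumbered so the two sources cannot collide.
    for new_id, (hh, source) in enumerate(kept + added, start=1):
        record = dict(hh)                 # copy so the inputs are not mutated
        record["source"] = source
        record["source_hh_id"] = hh["hh_id"]
        record["hh_id"] = new_id
        combined.append(record)
    return combined


# Hypothetical records; TAZ 900 stands in for a San Diego zone.
rivcom = [{"hh_id": 1, "taz": 10}, {"hh_id": 2, "taz": 900}]
sandag = [{"hh_id": 1, "taz": 900}, {"hh_id": 2, "taz": 5000}]
stitched = stitch_households(rivcom, sandag, san_diego_tazs={900})
```

If household IDs are renumbered this way, the person records would need the same ID mapping applied so that persons stay attached to their households.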
The outputs of PopSyn are further processed to produce the number of households in each market segment. This summarized PopSyn output is located in: popsyn/HHDisaggregation. PopSyn inputs and outputs are maintained in separate sub-folders within the master/popsyn folder.
For the base year, PopSyn is validated by comparing the control numbers with the synthesized numbers at the TAZ level. In PopSyn, a priority (importance) can be set for each control according to the confidence in that control category. Confidence is typically highest for the TAZ-level total household control, which is also the most important predictive variable for trip generation models. The priority for the total household control is therefore set at the maximum level of 1 billion (1.0E9 in scientific notation). Priorities for the other TAZ-level controls are set lower, at 1000. The regional employment controls are given the lowest priority, with a value of 100.
The comparison below shows the match between the synthesized and control numbers of households. The match is essentially exact because of the priority set on the total household control. Only TAZs within Riverside County are shown in the figures below, to prevent large TAZs in Orange County and San Bernardino County from dominating the plot.
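To make the effect of these priority values concrete, the sketch below applies them as weights on a squared relative error. This penalty form is an illustrative assumption and is not PopSyn's actual objective function; only the three weight values come from the text above, and the dictionary keys are hypothetical names.

```python
# Priority (importance) values as described in the text.
IMPORTANCE = {
    "total_households": 1.0e9,     # TAZ-level total household control
    "other_taz_controls": 1000.0,  # size, workers, income, dwelling type, age
    "regional_employment": 100.0,  # regional employment by industry
}


def weighted_penalty(control: str, target: float, synthesized: float) -> float:
    """Illustrative penalty: importance times squared relative error."""
    # ASSUMPTION: this functional form is for illustration only.
    relative_error = (synthesized - target) / target
    return IMPORTANCE[control] * relative_error ** 2


# The same 1% miss costs very different amounts depending on the control.
penalties = {name: weighted_penalty(name, 100.0, 101.0) for name in IMPORTANCE}
```

Under any weighting of this kind, a balancing procedure would give up accuracy on the employment controls long before it tolerated a miss on total households, which is consistent with the essentially exact household match reported here.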
The TAZ-level controls at the next priority level are the household size, households by number of workers, households by income category, households by dwelling type, and population by age category controls. Scatter plots for these controls are shown below.
The tables below show a similar comparison for the number-of-workers categories used in population synthesis.
The tables below show a similar comparison for the household income categories used in population synthesis.
The tables below show the comparison for the household dwelling type controls used in population synthesis.
The tables below show the comparison for the population age categories used in population synthesis.
The table below summarizes the fit between the synthesized outputs and the input controls.
Controls | Input | Output | % Diff | % RMSE |
---|---|---|---|---|
Total Households | 722,562 | 722,562 | 0.00 | 0.0 |
Household Size: 1 | 124,908 | 124,939 | 0.02 | 14.0 |
Household Size: 2 | 198,047 | 198,083 | 0.02 | 5.1 |
Household Size: 3 | 103,608 | 103,637 | 0.03 | 9.1 |
Household Size: 4+ | 295,926 | 295,903 | -0.01 | 7.0 |
Number of Workers: 0 | 247,937 | 247,247 | -0.28 | 11.3 |
Number of Workers: 1 | 289,127 | 289,326 | 0.07 | 2.8 |
Number of Workers: 2 | 150,583 | 151,061 | 0.32 | 10.3 |
Number of Workers: 3+ | 34,745 | 34,928 | 0.53 | 30.6 |
Household Income: Low | 240,539 | 240,574 | 0.01 | 6.1 |
Household Income: Medium | 228,927 | 228,922 | 0.00 | 4.3 |
Household Income: High | 185,131 | 185,113 | -0.01 | 5.6 |
Household Income: Very High | 67,918 | 67,953 | 0.05 | 12.3 |
Number of Single Family Dwelling Unit | 493,663 | 493,870 | 0.04 | 5.2 |
Number of Multi Family Dwelling Unit | 228,887 | 228,692 | -0.09 | 11.6 |
Population in Age 5_17 | 465,000 | 464,932 | -0.01 | 19.2 |
Population in Age 18_24 | 234,018 | 234,059 | 0.02 | 9.9 |
Population in Age 16_64 | 1,488,656 | 1,487,643 | -0.07 | 24.5 |
Population in Age 65P | 336,068 | 335,828 | -0.07 | 26.7 |
In this table, the Input column shows the input control total used for each control category, and the Output column shows the synthesized total for the same category. The % Diff column is the percentage difference between the input control and the synthesized output. % RMSE is the percent root mean squared error, calculated across all the Riverside County TAZs. The table reinforces some of the earlier conclusions. Total households are matched exactly at the TAZ level. Controls with fewer observations, such as the Household Size: 1 and Number of Workers: 3+ categories, are matched less well, as indicated by their relatively higher % RMSE.
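The two fit statistics can be sketched in Python as follows. The source does not give the formulas, so two choices here are assumptions: % Diff is taken relative to the input control total, and % RMSE is the root mean squared TAZ-level error divided by the mean TAZ-level control. The function names are hypothetical.

```python
import math


def pct_diff(control_total: float, synthesized_total: float) -> float:
    """Percentage difference of the synthesized total from the control total."""
    # ASSUMPTION: the denominator is the input control total.
    return 100.0 * (synthesized_total - control_total) / control_total


def pct_rmse(controls, synthesized) -> float:
    """Percent RMSE across zones, normalized by the mean control value."""
    # ASSUMPTION: RMSE is divided by the mean of the zone-level controls.
    n = len(controls)
    squared_error = sum((s - c) ** 2 for c, s in zip(controls, synthesized))
    rmse = math.sqrt(squared_error / n)
    return 100.0 * rmse / (sum(controls) / n)


# Regional totals for the "Number of Workers: 0" row of the table.
workers0_diff = round(pct_diff(247_937, 247_247), 2)

# Two hypothetical zones whose errors cancel in the total but not by zone.
zone_controls, zone_synth = [10.0, 10.0], [9.0, 11.0]
total_diff = pct_diff(sum(zone_controls), sum(zone_synth))
zone_rmse = pct_rmse(zone_controls, zone_synth)
```

The two-zone example shows why both columns are reported: zone-level errors of opposite sign cancel in % Diff, while % RMSE still registers them. This is why a category can have a near-zero % Diff and a double-digit % RMSE in the table.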
Control | Input | Output | % Diff |
---|---|---|---|
Agricultural Employment | 18,600 | 18,325 | -1.50 |
Construction Employment | 217,373 | 213,747 | -1.70 |
Manufacturing Employment | 258,822 | 255,485 | -1.31 |
Wholesale Employment | 148,900 | 145,747 | -2.16 |
Retail Employment | 323,532 | 317,941 | -1.76 |
Transportation Employment | 154,950 | 152,717 | -1.46 |
Information Employment | 38,010 | 37,320 | -1.85 |
FIRE Employment | 174,606 | 171,184 | -2.00 |
Professional Employment | 499,512 | 488,105 | -2.34 |
Educational Employment | 697,676 | 683,565 | -2.06 |
Art Entertainment Employment | 395,691 | 387,249 | -2.18 |
Other Services Employment | 118,678 | 116,991 | -1.44 |
Public Administration Employment | 67,185 | 66,420 | -1.15 |
The table above shows the match between controls and synthesized totals for the regional controls, which are the employment numbers by industry category. Thirteen industry categories are used. The output totals are about 1% to 2% lower than the input control totals in every industry category. This is because the inputs are jobs and the outputs are workers. A single worker can hold multiple jobs, so total jobs (the input) will be higher than total workers (the output). Because of this difference between inputs and outputs, the employment controls are given the lowest priority in the population synthesis process, a value of 100.
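The arithmetic behind the regional table can be checked with a short Python sketch. The job and worker totals are copied from the table. Dividing the difference by the output (worker) total reproduces the published % Diff values to two decimals; that choice of denominator is inferred from the numbers and is not stated in the source. The implied jobs-per-worker ratio is a derived figure, not a model parameter.

```python
# (industry, input jobs, output workers, published % Diff) from the table.
ROWS = [
    ("Agricultural", 18_600, 18_325, -1.50),
    ("Construction", 217_373, 213_747, -1.70),
    ("Manufacturing", 258_822, 255_485, -1.31),
    ("Wholesale", 148_900, 145_747, -2.16),
    ("Retail", 323_532, 317_941, -1.76),
    ("Transportation", 154_950, 152_717, -1.46),
    ("Information", 38_010, 37_320, -1.85),
    ("FIRE", 174_606, 171_184, -2.00),
    ("Professional", 499_512, 488_105, -2.34),
    ("Educational", 697_676, 683_565, -2.06),
    ("Art Entertainment", 395_691, 387_249, -2.18),
    ("Other Services", 118_678, 116_991, -1.44),
    ("Public Administration", 67_185, 66_420, -1.15),
]


def pct_diff_vs_output(jobs: int, workers: int) -> float:
    """Percent difference of workers from jobs, relative to the worker total."""
    # ASSUMPTION: the denominator is the output total (inferred, not stated).
    return 100.0 * (workers - jobs) / workers


computed = {name: round(pct_diff_vs_output(jobs, workers), 2)
            for name, jobs, workers, _ in ROWS}

total_jobs = sum(jobs for _, jobs, _, _ in ROWS)
total_workers = sum(workers for _, _, workers, _ in ROWS)
jobs_per_worker = total_jobs / total_workers  # implied regional ratio
```

Across all thirteen categories the totals imply roughly 1.02 jobs per synthesized worker, which matches the multiple-jobholding explanation given above.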
Riverside County Model, 2020