In this Appendix, we describe the data analysis we conducted for our series of posts on California’s low-wage workers and minimum wage. We begin by discussing some strengths and weaknesses of the data sources we used. We then describe the methods used to construct many of the estimates presented in the posts. Finally, we describe how we matched household types to apartment types for the analysis of housing costs.

**Primary Data Source: Monthly Current Population Surveys (CPS).** Most of the estimates in our posts use data from the outgoing rotation groups from the Basic Monthly CPS. The survey asks these rotation groups about the hourly wage or usual weekly earnings (along with usual weekly hours) for each worker in the respondent’s household. For our analysis, key strengths of the CPS include:

Individual-level microdata are publicly available, enabling a high degree of customization.

Responses to the wage, earnings, and hours questions are relatively precise. The main exception is that high wages are top-coded: all wages above a certain amount are assigned the same value. In high-wage states like California, top-coding seriously hinders analyses that consider the upper portion of the wage distribution. Also, starting in April 2023, the Census Bureau has rounded wages and earnings to a greater degree than it did historically, although the wage measurements remain more precise than those in other publicly available data.

**Secondary Data Source: Occupational Employment and Wage Statistics (OEWS).** The CPS’s sample size is not large enough to produce reliable estimates of average wages for relatively small groups of California workers, such as those in narrowly defined occupations or geographic areas. To examine these issues, we use the OEWS. We limit our use of the OEWS to these cases because OEWS microdata are not publicly available and because the wage measurements are less precise than those in the CPS. The OEWS survey responses place wages into relatively wide bins. The publicly posted statistics incorporate assumptions that (1) each worker’s wage was equal to the average value within the bin (which in turn comes from another data source) and (2) all wages recently grew at the same rate as an aggregate employment cost index.

**Main Weakness: Survey Nonresponse.** When organizations conduct surveys, many of the people they contact do not provide complete responses. Some decline to respond to the survey altogether (“unit nonresponse”), while others respond to most parts of the survey but decline to answer certain questions (“item nonresponse”). Both types of nonresponse can make estimates of wages and of workers’ characteristics less accurate or less informative. To some extent, this is due simply to the smaller sample size. Much more serious problems, however, can arise when respondents differ systematically from nonrespondents. These concerns have grown as response rates for many surveys have declined substantially over the last decade. For example, the monthly CPS unit nonresponse rate grew from 10 percent in 2013 to 30 percent in 2023. Unit nonresponse rates for many other surveys are well above 30 percent.

**Another Weakness: Measurement Error.** Even when survey participants respond to questions, their responses sometimes do not provide totally accurate information about their wages or other characteristics. These measurement errors can reduce the accuracy of the resulting estimates.

**Strategy for Unit Nonresponse: Official Survey Weights.** To address unit nonresponse, we use the earnings weights calculated by the Census Bureau. This approach is widespread, but it has a key limitation: its accuracy depends on the assumption that outcomes for nonrespondents are “conditionally missing at random.” If the method used to construct the weights misses major systematic differences between respondents and nonrespondents, then the resulting estimates can be quite inaccurate.

Alternatively, we could take an approach that is more agnostic about the differences between respondents and nonrespondents. Such an approach relies on weaker assumptions, but because 30 percent of households do not respond to the survey, it tends to yield wide bounds rather than point estimates. Presenting such bounds alongside point estimates can provide helpful context, but we do not pursue that approach in these posts, as the resulting array of numbers could be difficult to interpret.

**Constructing Hourly Wage Variable.** We construct the hourly wage variable as follows:

We treat all imputed values for hourly wages, weekly earnings, and weekly hours as missing.

Except for the top-coding adjustment described below, we take each worker’s reported hourly wage, weekly earnings, and weekly hours at face value.

For respondents who report usual weekly earnings and hours but not hourly wages, we calculate the hourly wage by dividing usual weekly earnings by weekly hours. We default to using usual weekly hours, but we use last week’s hours if usual hours are not available.

Following Dey et al. (2022), we multiply top-coded wages for months through March 2023 by 1.4. (The Census Bureau changed its top-coding methods starting in April 2023.)

After we take these steps, roughly one-third of employees’ hourly wages remain missing. To construct each figure in the posts, we use one of two different strategies to address this item nonresponse problem, depending on the type of statistic we are trying to estimate. The following sections describe these strategies in more detail.
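The wage-construction steps above can be sketched as a small function. This is only an illustration: the argument names and flags are hypothetical stand-ins for the actual CPS variables, and the 1.4 multiplier is the Dey et al. (2022) top-code adjustment described above.

```python
# A sketch of the hourly-wage construction steps; field names are
# hypothetical, not the actual CPS variable names.
TOP_CODE_FACTOR = 1.4  # Dey et al. (2022) adjustment, pre-April 2023

def hourly_wage(hourly, weekly_earnings, usual_hours, last_week_hours,
                top_coded=False, imputed=False):
    """Return the constructed hourly wage, or None if it cannot be built."""
    if imputed:
        return None  # treat imputed values as missing
    if hourly is not None:
        # Reported hourly wage taken at face value, apart from top-coding
        return hourly * TOP_CODE_FACTOR if top_coded else hourly
    # Fall back to usual weekly hours, then last week's hours
    hours = usual_hours if usual_hours is not None else last_week_hours
    if weekly_earnings is not None and hours:
        wage = weekly_earnings / hours
        return wage * TOP_CODE_FACTOR if top_coded else wage
    return None
```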

In this section, we describe the method we use to construct the estimates displayed in Figures 2 through 5 in the post *Who Are California’s Low-Wage Workers?* and Figures 1 and 2 in the post *How Long Do People Stay in Low-Wage Work?* For these figures, we define low-wage workers as employees who made up to $17.50 per hour at their main job in 2023.

**Estimate Probit Regressions.** In the first step, we estimate a model that uses each worker’s observed characteristics to predict the probability that they respond to the wage questions. Similar to the use of survey weights described above, this model relies on the assumption that responses to the wage questions are conditionally missing at random. Dutz et al. (2022) use a Norwegian survey to assess the performance of a variety of models that predict survey nonresponse based on this type of assumption. They find that methods based on conventional econometric models perform at least as well as methods based on machine learning models. In light of this finding, we construct weights by estimating probit regressions.

**Model Selection.** We consider roughly 40 different probit specifications. We assess the pseudo-out-of-sample prediction errors for each specification using the monthly CPS from January to December 2022 as our training set and the monthly CPS from January to December 2023 as our test set. We select the specification with the smallest mean absolute prediction error in the test set. We estimate that model using all 24 months of data from 2022 and 2023.
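The selection step can be sketched as follows, assuming each candidate specification’s predicted response probabilities for the test sample are already in hand. The labels and data layout here are hypothetical, not our actual estimation code.

```python
def mean_abs_error(predicted, actual):
    """Mean absolute prediction error for one specification."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def select_specification(candidates, actual_responses):
    """Pick the specification whose test-set predictions have the smallest
    mean absolute error. `candidates` maps a specification label to the
    response probabilities it predicts for the test sample; `actual_responses`
    are the observed 0/1 response indicators."""
    return min(candidates,
               key=lambda s: mean_abs_error(candidates[s], actual_responses))
```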

**Estimate Characteristics by Weighting Respondents.** We construct the estimates in the figures using workers who provided a wage response in 2023. Using the probit estimates described above, we weight each worker by the predicted inverse probability of providing a wage response.
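Inverse-probability weighting can be illustrated with a toy example. The function below is a generic sketch of the idea, not the exact estimator applied to the CPS microdata: respondents with low predicted response probabilities stand in for more nonrespondents, so they receive larger weights.

```python
def ipw_mean(values, response_probs):
    """Weighted mean of an outcome among respondents, weighting each
    respondent by the inverse of their predicted response probability."""
    weights = [1.0 / p for p in response_probs]
    total = sum(w * v for w, v in zip(weights, values))
    return total / sum(weights)
```

For example, a respondent with a predicted response probability of 0.5 counts twice as much as one with a probability of 1.0, pulling the estimate toward the kinds of workers who are less likely to answer the wage questions.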

In this section, we summarize our approach to estimating percentiles of hourly wage distributions. We apply this method to monthly CPS data from January 2022 through December 2023 to construct the estimates that appear in Figures 3 through 7 in the post *Is California’s Minimum Wage High, Low, or Somewhere in Between?* (For this analysis, grouping data into two-year intervals strikes a good balance between accuracy and timeliness.) We apply a very similar method to two-year intervals of monthly CPS data from other time periods to construct some of the other estimates in those figures, as well as the estimates in Figure 3 in the post *How Long Do People Stay in Low-Wage Work?*.

**Strategy for Item Nonresponse: Arellano and Bonhomme (2017) Model.** Bollinger et al. (2019) study response patterns to earnings questions in the CPS’s Annual Social and Economic Supplement by linking the survey data to administrative data. They find that methods based on the assumption that earnings responses are conditionally missing at random—such as the conventional weighting approach described above—produce particularly inaccurate estimates for measures of inequality, such as percentiles that are far from the middle of the earnings distribution. They find that the quantile selection model developed by Arellano and Bonhomme (2017) noticeably outperforms the conventional weighting approach in this context. Accordingly, as described below, our wage percentile estimates are based on the Arellano and Bonhomme (2017) approach.

**Exclusion Restrictions.** In this context, the two main parts of the Arellano and Bonhomme (2017) model are (1) a set of equations that predict the conditional quantiles of the wage distribution, and (2) an equation that predicts whether each worker responds to the survey’s wage questions. To estimate the model, we need at least one variable that enters the survey response equation but not the wage equations. For most of our estimates, we use two variables inspired by Bollinger and Hirsch (2013):

**Rotation Group.** The first excluded variable indicates whether the respondent is in the fourth or eighth CPS rotation group. In a given month, the household’s rotation group reflects the number of times the survey has contacted them. This is purely an artifact of the survey design, so it should have no direct relationship to the household’s actual wages. On average, respondents in the eighth rotation group are more likely to respond to the wage questions than respondents in the fourth rotation group. This relationship is not strong enough for the rotation group to serve as the only excluded variable, but it is strong enough to be useful alongside the variable described below.

**Survey Month.** The second excluded variable indicates whether the response comes from the February or March survey, rather than one of the other ten months of the year. As described in Bollinger and Hirsch (2013), CPS enumerators have stronger performance incentives in February and March than in other months. Accordingly, respondents are more likely to answer the wage questions in February and March than in other months. Seasonal hiring patterns can vary across the wage distribution, so this exclusion restriction is not as airtight as the first one. That said, the most pronounced seasonal hiring patterns tend to occur at other times of year.

The F-statistic for this pair of regressors in the response equation generally is between 33 and 35, depending on the specification. The full-time median estimates in Figure 6 are the exception: when we limit the sample to full-time workers only, the F-statistic is around 22 to 23.

**Other Explanatory Variables.** The Arellano and Bonhomme (2017) model assumes that the probability of responding to the wage question is positive for all observed combinations of regressors. Although we assume that this requirement does not bind for continuous variables such as age, it still seriously limits the number of additional explanatory variables that we can include beyond the two described above. Accordingly, we focus almost exclusively on variables that have strong relationships both with survey response and with wages. All of our specifications include:

The worker’s age, age squared, and age cubed.

Whether the survey response comes directly from the worker, from the worker’s spouse, or from somebody else.

Each specification also includes some combination of:

Whether the worker usually works part-time.

Whether the worker graduated from high school.

Whether the worker lives in Los Angeles County.

Whether the worker belongs to a union.

Whether the worker works in the manufacturing industry.

Whether the worker holds multiple jobs.

The worker’s gender.

The worker’s race/ethnicity.

**Using the Model to Estimate Percentiles.** We consider five different specifications of the model for Figures 3 and 7, and four different specifications for Figures 4, 5, and 6. All specifications for the wage gaps between specific demographic groups include the indicators for those demographic groups. For each specification, the estimated model gives us selection-adjusted conditional wage quantile estimates across the entire wage distribution. We then follow the steps described in Bollinger et al. (2019) and Machado and Mata (2005) to produce unconditional percentile estimates. The figures display the median value of the estimates across the various specifications. For example, the left-most column in Figure 5 is the median of the 10th percentile estimates from the four specifications.
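The Machado and Mata (2005) simulation step can be sketched as below. Here `cond_quantile` is a hypothetical stand-in for the selection-adjusted conditional quantile function estimated from the Arellano and Bonhomme (2017) model; the sketch simply pools simulated draws and reads off the requested percentile.

```python
import random

def unconditional_percentile(cond_quantile, covariates, pct,
                             draws=10000, seed=0):
    """Machado-Mata-style simulation: repeatedly sample a covariate vector
    and a random quantile level u, evaluate the conditional quantile
    function at (u, x), then take the requested percentile of the pooled
    simulated wages. `cond_quantile(u, x)` is a hypothetical stand-in for
    the estimated selection-adjusted conditional quantile function."""
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        x = rng.choice(covariates)  # draw covariates from their sample
        u = rng.random()            # draw a quantile level uniformly
        sims.append(cond_quantile(u, x))
    sims.sort()
    return sims[int(pct / 100 * (len(sims) - 1))]
```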

For the housing cost analysis presented in Figure 2 in the post *Is California’s Minimum Wage High, Low, or Somewhere in Between?*, we assume the following:

A single adult without children shares a two-bedroom apartment with another adult.

Single parents with one, two, or three children live in one-bedroom, two-bedroom, or three-bedroom apartments, respectively.

An adult couple without children lives in a studio/efficiency.

An adult couple with one or two children lives in a two-bedroom apartment.
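These assumptions amount to a simple lookup from household composition to apartment type. The sketch below encodes the list above; the `(adults, children)` key format is our own illustration, not part of the original analysis.

```python
# Household composition -> assumed apartment type, mirroring the list above.
# Keys are (number of adults, number of children); chosen for illustration.
APARTMENT_TYPE = {
    (1, 0): "two-bedroom (shared with another adult)",
    (1, 1): "one-bedroom",
    (1, 2): "two-bedroom",
    (1, 3): "three-bedroom",
    (2, 0): "studio/efficiency",
    (2, 1): "two-bedroom",
    (2, 2): "two-bedroom",
}
```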

Arellano, Manuel and Stephane Bonhomme (2017). “Quantile Selection Models with an Application to Understanding Changes in Wage Inequality.” *Econometrica* 85(1).

Bollinger, Christopher and Barry Hirsch (2013). “Is Earnings Nonresponse Ignorable?” *The Review of Economics and Statistics* 95(2).

Bollinger, Christopher, Barry Hirsch, Charles Hokayem, and James Ziliak (2019). “Trouble in the Tails? What We Know About Earnings Nonresponse 30 Years After Lillard, Smith, and Welch.” *Journal of Political Economy* 127(5).

Dutz, Denis, Ingrid Huitfeldt, Santiago Lacouture, Magne Mogstad, Alexander Torgovitsky, and Winnie van Dijk (2022). “Selection in Surveys: Using Randomized Incentives to Detect and Account for Nonresponse Bias.” National Bureau of Economic Research (NBER) Working Paper 29549.

Machado, Jose and Jose Mata (2005). “Counterfactual Decomposition of Changes in Wage Distributions Using Quantile Regression.” *Journal of Applied Econometrics* 20: 445-465.