The Predictive Power of Work Industry
By Kevin Reuning (@KevinReuning)
In my previous blog post, we looked at how class is more than just income and education and showed that particular job types can help us predict support for a generic Democrat. Today we are going to take this a step further and try to answer a related question: how much does knowing someone’s work industry help us predict support for Democratic candidates?
This might sound like the question we already answered, but there is a subtle difference between the two. Work industry can be related to support without being a useful piece of information. If knowing someone’s work industry only changes the probability of Democratic support by 1 percentage point, then the two are still related, but the variable is unlikely to be worth the cost for activists and practitioners to collect when building models.
There are a variety of ways to test this. Here we will build several models of Democratic support using a random forest, train them on 90 percent of our data, and then use the remaining 10 percent to evaluate them. Random forests are a machine learning tool that is relatively simple to understand. They are based on decision trees, which use predictors to partition the data into sets that are as internally homogeneous as possible while keeping the sets as different from each other as possible. Ideally, a decision tree would identify a predictor that perfectly separates Democrats and Republicans. By themselves, decision trees can be problematic (a single tree tends to overfit the data it was trained on), so a random forest uses randomization and combines multiple constrained decision trees.
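The 90/10 split described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `holdout_split`, the seed, and the stand-in data are assumptions, not the code used for this post.

```python
import random

def holdout_split(rows, test_fraction=0.1, seed=0):
    """Shuffle the rows, then return (train, test) lists."""
    shuffled = list(rows)                   # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for survey respondents: just row ids here.
respondents = list(range(1000))
train, test = holdout_split(respondents)
print(len(train), len(test))
```

The point of holding out rows is that the model never sees them during training, so performance on them is a fairer guide to how it would do on new respondents.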
I created seven models in total. The first is a baseline that includes age, gender, race, and union/non-union status. It gives us an idea of the bare minimum of how well we should do. Next, I created three models that add either education, income, or occupation type to the baseline to see how each of these individually improves our predictions. Finally, I created three more with pairwise combinations of these variables to see how well the model does if we could include two out of the three. You might notice that I don’t include ideology; I’ll delve into that in a moment.
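To make the count of seven concrete, here is a small sketch that enumerates the predictor sets. The variable names are placeholders chosen for illustration, not the column names in the underlying survey data.

```python
from itertools import combinations

# Placeholder variable names (assumptions for illustration).
BASELINE = ["age", "gender", "race", "union"]
EXTRAS = ["education", "income", "occupation"]

def model_specs():
    """Baseline, baseline + one extra, and baseline + each pair of extras."""
    specs = {"baseline": list(BASELINE)}
    for k in (1, 2):
        for combo in combinations(EXTRAS, k):
            specs[" + ".join(combo)] = BASELINE + list(combo)
    return specs

specs = model_specs()
for name, predictors in specs.items():
    print(name, predictors)
```

One baseline, three single additions, and three pairs give the seven models.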
The plots below show Receiver Operating Characteristic (ROC) curves for each model on the held-out data. ROC curves are useful because the model gives us probability predictions, which have to be translated into either a 0 (not a Dem supporter) or a 1 (a Dem supporter). The cutoff could be at 0.5 or at any other point between 0 and 1, and where we place it determines the rate of true positives (a good thing) and false positives (a bad thing). ROC curves show how the true positive and false positive rates vary as the cutoff moves. What we want is a line that hugs the top left corner, which means a high true positive rate even at low false positive rates.
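As a rough sketch of how the points on an ROC curve come about, the snippet below sweeps the cutoff over a handful of predicted probabilities. The labels and scores are made up for illustration, and `roc_points` is a hypothetical helper, not a function from any particular library.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs,
    one per distinct cutoff, from the strictest cutoff to the loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # a cutoff above every score predicts no supporters
    for cutoff in sorted(set(scores), reverse=True):
        # Everyone with a score at or above the cutoff is predicted a supporter.
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy data: 1 = Dem supporter, 0 = not; scores are predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(labels, scores)
for fpr, tpr in pts:
    print(round(fpr, 2), round(tpr, 2))
```

Each lower cutoff labels more people as supporters, so both rates can only stay the same or rise. A better model climbs toward a true positive rate of 1 while the false positive rate is still small.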
The left plot shows the difference between the models with only one variable added. Adding education or income barely moves the needle compared to the baseline model. The model with occupation outperforms them all. On the right, we have the models that include two out of three of the variables. Here we see a substantial difference from the baseline, but only in models that include work as one of the variables. Work industry, then, is a more useful predictor than either income or education, which are commonly included in get-out-the-vote and persuasion modeling.
One caution: if we include ideology, the substantive differences shrink because ideology does such a good job of predicting support that there is little room left for improvement. One way of conceptualizing this is by looking at variable importance in a full random forest model. Mean decrease in Gini measures how much splitting on a variable increases the homogeneity of the resulting sets.
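For intuition, here is a minimal sketch of the Gini decrease from a single split, using made-up data. A forest-level importance score aggregates decreases like this over the many splits and trees that use a variable; this sketch shows only one split, and the helper names are assumptions.

```python
def gini(labels):
    """Gini impurity of 0/1 labels: 0 when the set is pure, 0.5 at an even mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def gini_decrease(labels, groups):
    """Impurity before the split minus the size-weighted impurity after it."""
    n = len(labels)
    parts = {}
    for y, g in zip(labels, groups):
        parts.setdefault(g, []).append(y)
    after = sum(len(part) / n * gini(part) for part in parts.values())
    return gini(labels) - after

# Toy data: eight respondents, 1 = Dem supporter.
support = [1, 1, 1, 1, 0, 0, 0, 0]
ideology = ["lib", "lib", "lib", "lib", "con", "con", "con", "con"]
coin = ["h", "t", "h", "t", "h", "t", "h", "t"]  # an uninformative variable

print(gini_decrease(support, ideology))
print(gini_decrease(support, coin))
```

In this toy case, splitting on ideology produces two pure sets and removes all of the impurity, while the uninformative variable removes none of it.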
Ideology does a lot of work in creating more homogeneous sets. Work type, though, does better than any other variable except ideology. It outperforms race, income, and education. The chart below shows another version of the model that does not include ideology. Here again, we see the importance of work type.
Academics have understood the importance of occupational differences for a long time, and it is time for political practitioners to embrace this as well. Part of the difficulty in using occupational categories is likely the large number of categories involved, which can be intimidating to work with. At Data for Progress we suspect that there might be a set of questions that get at relevant differences across occupations, such as whether someone has to be licensed, whether they deal directly with customers/clients, how often they use a computer in their work, and whether they have employees who report to them. We hope to test these questions in future research, so if you have other ideas, feel free to reach out.
Kevin Reuning (@KevinReuning) is an assistant professor of political science at Miami University.