You want what you know: replicating childhood family size (and fun with count variables!)

Does someone who grew up with lots of siblings want lots of children, and vice versa?  You could see this going either way – the stereotype of the lonely only child dreaming of a house full of children comes to mind. I investigated this question with the 2006 General Social Survey.  I also included the respondent’s level of education, race, marital status, and their region as control variables.

Some notes about the variables and how I expected them to relate to my dependent variable:

The number of children someone has is represented by the variable childs, and corresponds to the question “How many children have you ever had? Please count all that were born alive at any time (including any you had from a previous marriage).” The respondent could choose from none to eight or more. Most respondents had no children or two children. Only 35 respondents had eight or more.

The number of siblings someone had growing up is represented by the variable sibs, and corresponds to the question “How many brothers and sisters did you have? Please count those born alive, but no longer living, as well as those alive now. Also include stepbrothers and stepsisters, and children adopted by your parents.” The respondent could give the appropriate number. The minimum answer was 0 (145 responses) and the maximum was 34 (one response), while the median was 3 and the mean was 3.76. The most common number of siblings was two. I hypothesized that someone with more siblings would have more children: perhaps respondents liked growing up in a large family and wanted to recreate that experience as adults.

Education is represented in the variable educ. It is a scale from 0-20, roughly approximating the number of years of schooling the respondent has. The mean is 13.29, and the median is 13, which corresponds to about one year of college. I hypothesized that someone with more education would have fewer children, since they might make more reasoned choices about the financial and temporal costs of raising a child. They may also start their family later and therefore have fewer reproductive years if they are working on establishing their careers or attending graduate school.

The respondent’s race is encoded in the variable race, with categories for white, black, and other. I recoded race to a binary variable white, which is one if the respondent is white and zero otherwise. Of the respondents, 3,284 were white and 1,226 were black or other. I hypothesized that whites might have smaller families than blacks and other races, since stereotypically whites might have more access to education or be of a higher socioeconomic level, characteristics that are associated with smaller family size.

Whether someone is or was ever married is based on the variable marital, which represents someone’s marital status in five categories: married, widowed, divorced, separated, or never married. Since I did not expect someone who is currently married to necessarily have a different number of children than someone who is widowed (or in one of the other ever-married categories), I recoded this into a new binary variable evermar, which is one if the respondent chose one of the first four categories and zero if the respondent chose “never married.” In the data, 1,080 respondents had never married and 3,424 were currently or previously married.
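The two recodes described above can be sketched as follows. This is a minimal illustration, not the code used for the post: the string labels, the function names, and the two example respondents are hypothetical stand-ins (the GSS stores these variables as coded values), and missing answers are passed through as missing rather than counted as zero.

```python
# Hypothetical category labels; the real GSS file uses coded values.
EVER_MARRIED = {"married", "widowed", "divorced", "separated"}


def recode_white(race):
    """Return 1 for white, 0 for black or other, None if race is missing."""
    if race is None:
        return None
    return 1 if race == "white" else 0


def recode_evermar(marital):
    """Return 1 for the four ever-married categories, 0 for never married."""
    if marital is None:
        return None  # keep missing answers missing instead of coding them 0
    return 1 if marital in EVER_MARRIED else 0


# Two made-up respondents, only to show the recodes in action.
respondents = [
    {"race": "white", "marital": "divorced"},
    {"race": "other", "marital": "never married"},
]
recoded = [
    {"white": recode_white(r["race"]), "evermar": recode_evermar(r["marital"])}
    for r in respondents
]
print(recoded)
```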

Finally, region is a categorical variable that represents the region where the interview was held, with a total of 9 possible regions. Almost 40% of the respondents were interviewed in a southern region: South Atlantic (region 5), East South Central (region 6), or West South Central (region 7). This broad region stereotypically has more conservative and traditional values than the rest of the United States, and so I hypothesized that these three regions would be associated with larger families. In contrast, I hypothesized that New England (region 1) and the Middle Atlantic region (region 2), with more liberal values and cities with smaller living spaces, would have smaller families.

Considering that childs is a count variable, a Poisson model might be more appropriate than OLS. Moreover, the variance is larger than the mean (overdispersion) and a large share of respondents had no children, suggesting that a negative binomial model or a zero inflated negative binomial model might perform better still. I compared the outcomes of these different models to determine the best model.
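A quick way to check whether a count variable looks overdispersed is to compare its sample variance with its mean, since a Poisson distribution has the two equal. The sketch below uses a small made-up list of counts, not the GSS data, and the variable names are hypothetical.

```python
from statistics import mean, variance

# Toy counts, NOT the GSS data: several zeros plus a long right tail.
toy_childs = [0, 0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 8]

m = mean(toy_childs)
v = variance(toy_childs)  # sample variance (n - 1 denominator)

# Under a Poisson model this ratio should be close to 1;
# values well above 1 point to overdispersion.
dispersion_ratio = v / m

# Share of zeros, relevant when considering a zero-inflated model.
share_zero = toy_childs.count(0) / len(toy_childs)

print(f"mean={m:.2f} variance={v:.2f} ratio={dispersion_ratio:.2f}")
print(f"share of zeros={share_zero:.2f}")
```

A ratio near 1 would favor the plain Poisson; a ratio well above 1, or far more zeros than a Poisson with that mean would produce, motivates the negative binomial and zero-inflated alternatives.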

First, I started with OLS, regressing childs on sibs and my control variables. The OLS results are shown in Table 1. They indicate that, on average, one additional sibling is associated with 0.057 additional children, all else constant. It is highly statistically significant (p-value < 0.0001). This model only explains about 26% of the variation in childs, which is disappointing given the number of covariates. As an aside, the coefficient on educ is negative, the coefficient on white is negative, and the coefficient on evermar is positive, and all three coefficients are highly statistically significant (p-value < 0.0001). These results are in line with my initial hypotheses. None of the region variables are statistically significant. As is apparent at the end of Table 1 (which is the OLS R output), the OLS model is problematic because the range of predictions includes a negative number of children, which is not possible. The Poisson, negative binomial, and zero-inflated negative binomial should address this issue.

Table 1: OLS Model

[Image: R output for the OLS model]

To facilitate comparison to the Poisson, negative binomial, and zero inflated negative binomial models, I also estimated the OLS model by maximum likelihood to obtain an AIC (not shown). The primary results of interest are unsurprisingly still substantively the same: one additional sibling is associated with 0.057 additional children, on average and all else equal, and the coefficient is highly statistically significant. The AIC is 10,666.
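The post does not show how the AIC was obtained. As a minimal sketch, assuming the usual definition AIC = 2k - 2 ln L and a Gaussian likelihood with the error variance estimated by maximum likelihood, the AIC of a linear model can be computed from its residuals. The function name and the toy residuals are hypothetical.

```python
import math


def gaussian_aic(residuals, n_coef):
    """AIC of a linear model fitted by maximum likelihood with normal errors.

    n_coef counts the regression coefficients (including the intercept);
    one extra parameter is added for the error variance.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    sigma2 = rss / n  # maximum likelihood estimate of the error variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = n_coef + 1
    return 2 * k - 2 * log_lik


# Toy residuals, only to show the call; not taken from the GSS model.
toy_aic = gaussian_aic([1.0, -1.0, 1.0, -1.0], n_coef=2)
print(round(toy_aic, 3))
```

Because every model here is fitted to the same outcome, AIC values computed this way can be set side by side with those of the count models.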

I next tried the Poisson model, which, given that childs is a count variable, I expected to improve the model fit. The Poisson model is shown in Table 2 (again, the R output). It indicates that, on average, every additional sibling is associated with a 0.023 higher expected log count of children, all else constant. It is still highly statistically significant. The AIC has now improved to 9,760.2, and there are no more negative counts of children.

Table 2: Poisson Model

[Image: R output for the Poisson model]

Another possibility I considered was that the results of the Poisson model might be affected by the length of exposure to child-bearing years. For example, someone who has five siblings may be likely to have several children eventually, but an 18-year-old respondent has had little time to have them, which could distort the relationship between sibs and childs in the model. To see if this was the case, I ran the Poisson model again, this time including the log of age (measured in years in the GSS) as an offset. The results are shown in Table 3. This reduced the coefficient on sibs somewhat: on average, every additional sibling is associated with a 0.017 higher expected log count of children, net of other factors. It is still highly statistically significant. The model fit here has improved as well, with an AIC of 9,553.8. Exponentiating the coefficient gives exp(0.017) ≈ 1.017, so at a given age each additional sibling multiplies the expected count of children by about 1.017, an increase of roughly 1.7%, on average.

Table 3: Poisson Model, offsetting age

[Image: R output for the Poisson model with the log of age as an offset]
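Coefficients from a log-link count model are on the log-count scale, so exponentiating them gives a multiplicative rate ratio rather than an additive change. The sketch below applies this to the two sibs coefficients reported so far. The helper expected_count is a hypothetical illustration of what a log(age) offset does: it makes the expected count proportional to age.

```python
import math

# Coefficients on sibs as reported in the post (log-count scale).
sibs_coef = {
    "Poisson": 0.023,
    "Poisson with log(age) offset": 0.017,
}

# exp(coefficient) is the factor by which the expected count is multiplied
# for each additional sibling, holding the other covariates fixed.
rate_ratios = {model: math.exp(b) for model, b in sibs_coef.items()}

for model, ratio in rate_ratios.items():
    pct = (ratio - 1) * 100
    print(f"{model}: x{ratio:.3f} per sibling ({pct:.1f}% higher)")


def expected_count(age, linear_predictor):
    """Expected count with a log(age) offset: exp(log(age) + linear predictor)."""
    return age * math.exp(linear_predictor)
```

With the offset, doubling age doubles the expected count for the same covariates, which is how the model accounts for exposure time.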

As mentioned previously, perhaps the Poisson model is not wholly appropriate, since the mean is much smaller than the variance (indicating overdispersion of the data). To address this possibility, I ran a negative binomial regression. The results are shown in Table 4. It indicates that, on average, every additional sibling is associated with a 0.024 higher expected log count of children, net of other factors. It is still highly statistically significant. In count terms, every additional sibling multiplies the expected count of children by about exp(0.024) ≈ 1.024, an increase of roughly 2.4%, on average. The estimate of theta in Table 4 indicates how overdispersed the data are. Note that if theta is parameterized as in R's glm.nb, where the variance is mu + mu^2/theta, a large theta means the model is close to a Poisson, so it points to mild rather than severe overdispersion. The AIC is 9,736, slightly lower than the plain Poisson model (9,760.2) but higher than the Poisson model offsetting age (9,553.8), which suggests this model did not improve on the latter in overall fit.

Table 4: Negative Binomial

[Image: R output for the negative binomial model]
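To read theta it helps to look at the variance it implies. The post does not say which routine produced Table 4; the sketch below assumes the parameterization used by glm.nb in R's MASS package, where the variance is mu + mu^2 / theta. The mean of 2.0 and the theta values are hypothetical.

```python
def nb_variance(mu, theta):
    """Variance of a negative binomial with mean mu and dispersion theta.

    Assumes Var = mu + mu**2 / theta, so the distribution approaches
    a Poisson (variance equal to mu) as theta grows.
    """
    return mu + mu ** 2 / theta


mu = 2.0  # hypothetical mean count
for theta in (0.5, 2.0, 10.0, 1000.0):
    var = nb_variance(mu, theta)
    print(f"theta={theta:>7}: variance={var:.3f}, variance/mean={var / mu:.3f}")
```

Under this parameterization a small theta signals strong overdispersion, while a very large theta means the negative binomial adds little beyond the Poisson.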

A possible reason for the higher AIC is the large number of zeroes in the data, since, as mentioned previously, a large share of the respondents had no children. As a final model, I tried the zero inflated negative binomial, using marital status and age to predict whether the respondent had any children at all, since I considered these the factors most likely to influence that outcome. The results are shown in Table 5. They indicate that, on average, each additional sibling is associated with a 0.02 higher expected log count of children, net of other factors. It is still highly statistically significant. In count terms, each additional sibling multiplies the expected count of children by about exp(0.02) ≈ 1.02, an increase of roughly 2%, on average and net of other factors. The AIC is the lowest of all the models at 9,478.991.

Table 5: Zero Inflated Negative Binomial

[Image: R output for the zero inflated negative binomial model]
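A zero inflated negative binomial treats each observation as a mixture: with probability pi the respondent is a "structural" zero, and otherwise the count comes from a negative binomial. The sketch below shows how that changes the probability of a zero and the expected count. It assumes the same mean-and-theta parameterization as above, and the values of pi, mu, and theta are hypothetical, not estimates from Table 5.

```python
def nb_prob_zero(mu, theta):
    """P(Y = 0) for a negative binomial with mean mu and dispersion theta."""
    return (theta / (theta + mu)) ** theta


def zinb_prob_zero(pi, mu, theta):
    """P(Y = 0) under zero inflation: structural zeros plus NB zeros."""
    return pi + (1 - pi) * nb_prob_zero(mu, theta)


def zinb_mean(pi, mu):
    """Expected count under zero inflation: the NB mean scaled by (1 - pi)."""
    return (1 - pi) * mu


# Hypothetical values, chosen only for illustration.
pi, mu, theta = 0.2, 2.5, 5.0
print(f"NB   P(0) = {nb_prob_zero(mu, theta):.3f}")
print(f"ZINB P(0) = {zinb_prob_zero(pi, mu, theta):.3f}")
print(f"ZINB mean = {zinb_mean(pi, mu):.2f}")
```

In a fitted model, pi would come from a logit equation (here, on marital status and age) and mu from the count equation, so both vary by respondent.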

Now that I have established that the zero inflated negative binomial has the best model fit, I thought it would be interesting to examine the variation across regions more closely using predicted counts. Varying race, education, and number of siblings (see Tables 6 through 8, respectively) shows that my hypothesis that the Southern regions would have larger family sizes than the rest of the country was very inaccurate. Only for someone with the average number of siblings (about 3.75), 20 years of education, white, married, and at the average age (about 47) does the West South Central region have one of the highest predicted counts of children (Table 7). In some cases, such as for the person with the average number of siblings, average education, white, married, and at the average age, the South Atlantic and East South Central regions had predicted numbers of children lower than all the other regions (Table 6). The same is true for someone with five siblings, average education, white, married, and at the average age (Table 8). My hypotheses regarding New England and the Middle Atlantic were also inaccurate. Neither of these regions has the lowest predicted number of children in any of my scenarios; they are somewhere in the middle.

Table 6: Predicted Counts of Children Across Regions by race

Mean sib, mean educ, married, mean age; predicted count for white respondents and for black and other respondents:

Region                White    Black and other
New England           2.1      2.4
Middle Atlantic       2.2      2.5
East North Central    2.1      2.4
West North Central    2.4      2.7
South Atlantic        2.0      2.3
East South Central    1.9      2.2
West South Central    2.3      2.6
Mountain              2.4      2.7
Pacific               2.1      2.3

Table 7: Predicted Counts of Children Across Regions by education

Mean sib, white, married, mean age; predicted count at 0 and at 20 years of education:

Region                0 years    20 years
New England           4.0        1.5
Middle Atlantic       4.2        1.6
East North Central    4.0        1.6
West North Central    4.5        1.7
South Atlantic        3.8        1.5
East South Central    3.7        1.4
West South Central    4.3        1.7
Mountain              4.5        1.7
Pacific               4.0        1.5

Table 8: Predicted Counts of Children Across Regions by number of siblings

Mean educ, white, married, mean age; predicted count with 2 and with 5 siblings:

Region                2 siblings    5 siblings
New England           2.0           2.2
Middle Atlantic       2.1           2.3
East North Central    2.1           2.2
West North Central    2.3           2.5
South Atlantic        2.0           2.1
East South Central    1.9           2.0
West South Central    2.2           2.4
Mountain              2.3           2.5
Pacific               2.0           2.2
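As a rough consistency check on Table 8, the reported sibs coefficient can be used to project the 2-sibling predictions to 5 siblings. This sketch assumes that sibs enters only the count part of the zero inflated model (the post describes the zero part as driven by marital status and age), so the zero-inflation factor cancels when everything else is held fixed. Both the coefficient and the table values are rounded, so only approximate agreement should be expected.

```python
import math

sibs_coef = 0.02  # coefficient on sibs reported for the zero inflated model

# Predicted counts for 2 siblings, copied from Table 8 (rounded to one decimal).
two_sibs = {
    "New England": 2.0, "Middle Atlantic": 2.1, "East North Central": 2.1,
    "West North Central": 2.3, "South Atlantic": 2.0, "East South Central": 1.9,
    "West South Central": 2.2, "Mountain": 2.3, "Pacific": 2.0,
}

# Moving from 2 to 5 siblings multiplies the expected count by exp(3 * coef).
ratio = math.exp(sibs_coef * (5 - 2))

# Projected 5-sibling counts implied by the rounded inputs.
projected = {region: round(count * ratio, 2) for region, count in two_sibs.items()}

print(f"implied ratio: {ratio:.3f}")
for region, value in projected.items():
    print(f"{region:20s} {value}")
```

The projected values land within about a tenth of the 5-sibling column, which is what one would expect given the rounding in both the coefficient and the table.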

Although the zero inflated negative binomial model had the best fit (the lowest AIC), it is worth noting that, across all the models, the fundamental relationships stayed the same and retained their significance level. Generally speaking, having more siblings growing up is associated with having more children of your own; being more educated is associated with having fewer children; being white is associated with having fewer children; and being married is associated with having more children. All of these relationships are highly statistically significant. There is also no meaningful relationship between the region of the interview and the number of children someone has in any of the models.
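The AIC values reported throughout the post can be lined up in one place. The snippet below simply ranks them (lower is better) and shows each model's distance from the best one. The dictionary keys are descriptive labels chosen here, not names from the original R output.

```python
# AIC values as reported in the post.
aic = {
    "OLS (maximum likelihood)": 10666.0,
    "Poisson": 9760.2,
    "Poisson with log(age) offset": 9553.8,
    "Negative binomial": 9736.0,
    "Zero-inflated negative binomial": 9478.991,
}

# Sort from best (lowest AIC) to worst.
ranked = sorted(aic.items(), key=lambda item: item[1])
best_aic = ranked[0][1]

for name, value in ranked:
    print(f"{name:33s} AIC={value:10.1f}  delta={value - best_aic:8.1f}")
```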
