The absence of self-ruminative thoughts.
I’d like to claim the idea was originally mine but the truth is I first heard this phrase over a decade ago in a talk by Albert Bandura (yes, THAT Albert Bandura) and he said one of the differences between people who are content with their lives and those who are unhappy is that the happy group have “an absence of self-ruminative thoughts”.
There is a phrase I use a lot,
Not my circus, not my monkeys.
In other words, I don’t make everything about ME.
Here are my tips for not ruminating too much.
I do the best I can. When I meet with employees or students, I tell them what I think needs to be said, listen to what they have to say and then I don’t worry about whether I was too harsh or too wishy-washy, whether they respected my authority or thought I was incompetent. If random Joe on the Internet thinks I’m old and grey and should just shut up, well, as much as it pains me to have lost the good opinion of an anonymous person I have never met – oh, wait, no I don’t care.
If I screw up, I try to learn from it. If I don’t get a grant, or a person decides not to invest in our company or a school decides not to buy our games, I listen to their reasons and if it is a reasonable suggestion for a change I can make, I try to do it. If not, I don’t worry about it. I still remember the astonishment I felt seeing a colleague throw a grant review in the trash without reading it.
What are you doing? Why didn’t you read the comments?
I asked. He responded,
Shit, why should I read it? They didn’t like me. They don’t think I’m a researcher.
It’s more than just not taking things personally, though. It’s also a matter of not making everything about how other people are not acting as YOU think they should behave.
Your adult children aren’t raising their kids the way you think they should? The neighbors don’t maintain their yard the way you think it should be?
Not my circus, not my monkeys.
A few months ago, we had a really fascinating guest on the More Than Ordinary podcast, Jonathan Shaw. He’d just finished writing his autobiography, Scab Vendor, and he encouraged me to go away for a month and write my own autobiography. Jonathan’s book was interesting and his idea was intriguing. I randomly happened to be in an area known as a writer’s retreat in Lopinot, Trinidad and I tried for a bit. I have had a long strange trip around the world and back again, that’s for sure.
I just don’t get excited about the idea of looking back through all of the things that happened in my life. Jonathan said,
You’ll grow from the experience, but it will probably hurt – and I only say ‘probably’ to be nice.
Maybe if I went back and hung out in the mountains I would find myself.
we have arrived at MANOVA. If you skipped those three posts, feel shame at trying to take shortcuts, go back and read them.
Before we dive into coding, let’s take a look at some basic background on MANOVA.
The difference between ANOVA and MANOVA is simple: ANOVA has one dependent variable, and MANOVA has two or more.
How does that work? Think back to what you know about multiple correlation
In correlation, you are looking at the relationship between two variables, X and Y. You predict changes in Y from changes in X
Y = bX
In multiple correlation you are looking at the relationship between Y and MULTIPLE X variables.
You have an equation something like
Predicted Y = b0X0 + b1X1 + b2X2 + b3X3
And you are looking at how the Y variable changes in relation to the PREDICTED Y. Notice that predicted Y is a sum of all of your variables, each of which is multiplied by a regression coefficient.
The correlation between these predicted Ys and the actual Y is your multiple R and the multiple R-squared in ANOVA or regression is the square of the multiple R.
The multiple R-squared answers the question – how much of the variance in the dependent variable can be explained by variance in the independent variable(s)?
In the case of ANOVA, this variance is in group membership, so we are testing the null hypothesis that the mean of group1 = the mean of group 2 all the way to group N
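To make the link between multiple R and R-squared concrete, here is a minimal sketch. It assumes the SASHELP.CLASS sample data set, which is not part of this example; the output data set name reg_pred and the variable name yhat are simply names I made up for illustration.

```sas
* Predict weight from height and age, and save the predicted values ;
proc reg data=sashelp.class ;
  model weight = height age ;
  output out=reg_pred p=yhat ;
run ;
quit ;

* The correlation of the actual Y with the predicted Y is the multiple R ;
proc corr data=reg_pred ;
  var weight yhat ;
run ;
```

If you square the correlation between weight and yhat from PROC CORR, it should match the R-Square that PROC REG reports.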
With MANOVA, you have multiple variables on the Y side of the equation
The variable you are predicting/ explaining in this case is also a weighted sum
Dependent = w1Y1 + w2Y2 + w3Y3
Our null hypothesis is that the mean of this weighted combination is equal for groups 1, 2 and all the way up to group N
Instead of looking at a multiple R-squared in this case, we look at two other statistics, Wilks’ lambda and Pillai’s trace
Also note that in the case of a repeated measures ANOVA certainly assumption 1 and possibly assumption 3 are violated
When you have conducted your MANOVA, the first thing you should look at is the multivariate tests – Wilks’ lambda and Pillai’s trace. Rejecting the null hypothesis that the model does not explain the difference in the VECTOR of means then leads you to the second logical question: which of these dependent variables differs? So, if you don’t have a significant lambda, trace, etc., STOP. If you do, move on and check out the univariate F-tests. If your F is significant, go on to post hoc tests.
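Here is a rough sketch of that sequence in code. It assumes the SASHELP.IRIS sample data set rather than the survey data used in this series, so the variable names are only stand-ins for your own dependent variables and grouping variable.

```sas
proc glm data=sashelp.iris ;
  class species ;
  * Four dependent variables on the left, one grouping variable on the right ;
  model sepallength sepalwidth petallength petalwidth = species ;
  * Step 1: the multivariate tests, including Wilks lambda and Pillai trace ;
  manova h=species ;
  * Step 3: post hoc comparisons, worth reading only if the earlier tests are significant ;
  means species / tukey ;
run ;
quit ;
```

The univariate F-tests (step 2) come out of PROC GLM by default, one for each dependent variable named in the MODEL statement.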
ETA-squared is the variance accounted for IN THE LINEAR COMBINATION OF THE DEPENDENT VARIABLES by the model.
Mertler and Vannatta said it well.
“When the IV has only two categories, the F test for Pillai’s Trace, Wilks’ Lambda, and Hotelling’s Trace will be identical. When the IV has three or more categories, the F test for these three statistics will differ slightly but will maintain consistent significance or nonsignificance. Although these test statistics may vary only slightly, Wilks’ Lambda is the most commonly reported MANOVA statistic. Pillai’s Trace is used when homogeneity of variance-covariance is in question. If two or more IVs are included in the analysis, factor interaction must be evaluated before main effects.”
You promised there would be MANOVA! Now we’re in the third post!
First there was recoding of variables.
Then, there was creating scales.
Now, we’re looking at reliability.
Patience is a virtue.
Before we get to doing a MANOVA we want to be sure that our dependent and independent variables are reliable and valid. Let’s move on to reliability.
I’m going to do a correlation matrix and a Cronbach alpha, which is a measure of internal consistency. The rationale is that if items all measure the same construct – say, knowledge of health practices, or autonomy or acceptance of wife beating – then those items should be related to one another. An alpha of 0 would indicate the covariances among the items in the scale are zero, so, your scale sucks. An alpha of .95 would mean your scale is amazingly consistent.
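For reference, this is the usual textbook formula for the raw alpha. The symbols are my own labels, not anything from the SAS output: k is the number of items, \sigma_i^2 is the variance of item i, \sigma_{ij} is the covariance of items i and j, and \sigma_T^2 is the variance of the total score.

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_T^2}\right),
\qquad
\sigma_T^2 = \sum_{i=1}^{k}\sigma_i^2 + 2\sum_{i<j}\sigma_{ij}
```

If every covariance is zero, the total variance is just the sum of the item variances, the fraction equals 1 and alpha comes out to 0.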
So, I did three analyses for my three scales
Title "Health Variables" ;
proc corr data=example alpha ;
var hbs1 hbs3-hbs7 ;
run ;
Title "Wife beating variables" ;
proc corr data=example alpha ;
var GR34 - GR39 ;
run ;
Title "Decision Variables" ;
proc corr data=example alpha ;
VAR D_GR1A GR2A D_GR3A D_GR4A GR5a GR6A D_GR7A GR8A
D_GR9A GR9F D_GR10A D_GR12A GR10F GR12F ;
run ;
Let’s skip the simple statistics, mean, etc. you get from these analyses and go to the alpha
The alpha for the health scale is pretty bad. The value for the raw scores is .31, for standardized items, still really bad at .32. When we look at how deleting a variable would improve the alpha, if we dropped the first variable, the alpha would go up to .34 – but that is still awful.
For the wife-beating scale the raw value for alpha was .81 and also for the standardized value. So, that one was pretty good as far as reliability.
I put all of the decision variables together, the ones on whether the woman was involved in making decisions, could go places on her own, needed to ask permission to go places. The Cronbach alpha for the raw variables was .65, for standardized variables .81. Note that standardized variables are placed on the same metric, so my idea of some variables being much more important than others did not pan out.
So … I standardized the variables, then I read in that data set and created two scales, one that was a sum of the decision variables and the other that was the mean of the 6 wife-beating variables. There was no particular reason for using the mean of the six variables as opposed to just adding them up. I did both methods to show it was an option.
BEWARE THE SUM FUNCTION – Note, I did not use the sum function. If you add up the values, as shown below, and one of the variables has a missing value then the value of the sum is going to be missing. If you used the SUM function, the variables that have non-missing values would be added up, so the missing value would be treated as a zero. There are times where that is acceptable. This is not one of those times.
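The difference is easy to see in a tiny data step. This is only a sketch with made-up values; the data set name sum_demo and the variables a, b and c are hypothetical.

```sas
data sum_demo ;
  a = 1 ; b = . ; c = 3 ;
  * The plus operator propagates the missing value, so plus_total is missing ;
  plus_total = a + b + c ;
  * The SUM function skips the missing value, so sum_total is 4 ;
  sum_total = sum(a, b, c) ;
  * NMISS counts how many of the arguments are missing, here 1 ;
  n_missing = nmiss(a, b, c) ;
  put plus_total= sum_total= n_missing= ;
run ;
```

The PUT statement writes the three values to the log, so you can check them without printing the data set. NMISS is handy if you do decide to use SUM or MEAN but want to drop anyone who skipped too many items.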
While I’m at it, I want to check whether the scales have approximately normal distributions. A perfectly normal distribution would have skewness and kurtosis values of 0.
proc standard data=example mean=0 std=1 out=MAN_data;
run ;
Data create_manova ;
set man_data ;
* I could have used the mean function here, but I did not ;
decision = D_GR1A + GR2A + D_GR3A + D_GR4A + GR5a + GR6A + D_GR7A + GR8A +
D_GR9A + GR9F + D_GR10A + D_GR12A + GR10F + GR12F ;
beating = mean(of gr34-gr39);
run ;
proc univariate data=create_manova ;
var decision beating ;
run ;
The skewness values were relatively low: -1.3 and 0.2 for the two scales and kurtosis values were 2.0 and -1.2 . Since my scales aren’t a radical departure from normality, I’m now going on to MANOVA – finally!
First, I want to check that there are no obvious errors or other problems in my data.
PROC MEANS DATA=example ;
VAR gr2A -- gr39 hbs1 --d_gr12a ;
RUN ;
You could type in the variable names but that is a lot of typing. The double dashes mean to include all variables in the data set, in order, from the variable named before the dashes to the one named after the dashes. How do you know what order the variables are in? Click on the OUTPUT DATA tab at the top and look to the left under COLUMNS.
If you didn’t just run a program creating your data and hence don’t have an OUTPUT DATA tab, you can find your data file by clicking the MY LIBRARIES tab and then clicking on the library (directory) where your data are kept and clicking on the dataset to open it. You can also use the PROC CONTENTS procedure but today we are being all pointy and clicky with SAS Studio.
Sometimes you will see something like:
VAR item1 - item12 ;
The single dash is used for variables that end in a number and if you don’t have item1, item2 all the way through item12, it will give you an error and not run. Then you will be sad.
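A minimal sketch of the two kinds of variable lists, using a made-up data set (list_demo and its variables are hypothetical, chosen so that a variable without a number sits in the middle):

```sas
data list_demo ;
  item1 = 1 ; item2 = 2 ; score = 9 ; item3 = 3 ;
run ;

* VARNUM lists the variables in their position order in the data set ;
proc contents data=list_demo varnum ;
run ;

* Single dash: numbered range, so item1 item2 item3 ;
proc means data=list_demo ;
  var item1-item3 ;
run ;

* Double dash: positional range, so item1 item2 score item3 ;
proc means data=list_demo ;
  var item1--item3 ;
run ;
```

The PROC CONTENTS step is the non-pointy-clicky way to find out what order the variables are in.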
PROC MEANS will give you the N, mean, standard deviation, minimum and maximum.
Here are a few things to consider.
Okay, so my results from the means procedure look okay. Now what?
Next, I’m going to do a factor analysis to see if my supposition is supported of three scales related to health, beating your wife and autonomy.
Here is the code for my factor analysis.
PROC FACTOR DATA=example SCREE ROTATE=VARIMAX NFACTORS=5;
VAR gr2A -- gr39 hbs1 --d_gr12a ;
RUN ;
This is actually the second one I ran. In inspecting the results for the first, between the eigenvalues and scree plot, I decided that at most I should retain five factors. I’ve written a lot about factor analysis on this blog previously, so I’m not going to go into detail here. In short, the decision-making variables mostly loaded on the first factor with factor loadings of .70 and higher. The median communality estimate for those items was about .67. In short, considerable evidence for a decision-making factor. The wife-beating variables loaded on the second factor. All but one loaded above .67, and even that variable (Beating your wife if she had an extramarital affair – which 84% of the women said was accepted in their communities) loaded at .40. The variables regarding needing permission to go places loaded on the third factor and also had high communality estimates. The variables regarding going places by yourself loaded on the fourth factor and also had high communality estimates.
The health variables were a different story. Four out of six loaded between .47 and .67 on the fifth factor. The other two did not load on any factor.
At this point, it is starting to look like it is okay to retain the wife-beating items as a scale. The various measures of autonomy – decision-making, going places on your own and needing permission – seem to hang together within factors. I think it would be reasonable to put all three of these together in one scale. I talked about parceling in the past, and I could have done that as a step here, and then re-run the factor analysis to support (or not) my supposed autonomy factor. Since I have limited time and am simply doing this analysis for educational and illustrative purposes, I skipped over this to the next procedure, which is reliability analysis.
Since this post is pretty long already, I’ll save that for the next post.
I have the India Human Development Survey data on over 39,000 women and my hypothesis is that education is related to women’s rights issues, especially autonomy, health practices knowledge and domestic violence. I also think that mobility might be related, as women who get out of their native village might be exposed to new ideas.
Before I can test out my (supposedly) brilliant hypotheses, I need to create some variables because it turns out when they were collecting data in India in 2011 they were not thinking about my convenience. (Yes, I, too, am appalled by this lack of consideration.)
First, I will need to create my independent variables from
EW11 Differences in family by mobility
1= same village/ town
2= another village
3 = another town
4 = metro (since only 1% fall in here, I’m going to delete this category)
and education (see below)
HEALTH QUESTIONS
HB1 Milk harmful
HB3. 1st milk good for baby
Hb4 chulha smoke good
Hb5 child diarrhea drink more
Hb6 illness spread through water
Hb7 malaria spread
DECISIONS
The items below are scored 1 if the respondent decides, 0 if the respondent does not decide. (More than 1 person can decide, so if both husband and wife decide, the answer will be 1 for both. In this case, I just looked at if the wife had a say in the decision.)
The items below are scored 1 if the woman is allowed to do these things alone and 0 if she is not.
These items relate to whether the woman needs to ask permission for activities, with 0 = no, 1 = must inform someone and 2 = yes
WIFE BEATING QUESTIONS
GR34 – GR39 – All of these relate to under what circumstances it is acceptable, coded yes = 1 or 0 = no.
As you can see, well, I hope you can see, each of these presents a different data recoding problem.
So … here we go. The first thing we’re going to do is create categories. Notice I don’t do anything with the category 4 for mobility, so those people will just have a missing value for MOBILITY and be dropped from the analysis.
Also, a note on ELSE as opposed to just IF statements.
I could just use all IF statements but that would be inefficient. It doesn’t really matter here with 39,000 records but if I had millions it would slow down processing. The ELSE statement is only processed if the preceding IF statement is false.
NOTE!!! In the second set of IF- ELSE statements, I have
else if ew8 < 9 and ew8 ne . then education = "ELEM";
This statement is only executed IF the preceding IF statement was false. Without the ELSE, everything less than 9, including those who had 0 years of education, would be set to ELEM. Without the and ew8 ne . in this statement, anyone that had missing data would be set to ELEM along with anyone who had 1-8 years of education.
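You can check this behavior on a handful of made-up records before running it on 39,000. This is only a sketch: the data set else_demo and the variable no_check are hypothetical, and no_check deliberately leaves out the test for missing values.

```sas
data else_demo ;
  input ew8 ;
  length education no_check $4 ;
  * Version from this post, with the check for missing values ;
  if ew8 = 0 then education = "NONE" ;
  else if ew8 < 9 and ew8 ne . then education = "ELEM" ;
  else if ew8 > 8 then education = "HS +" ;
  * Same logic without the check: a missing value is less than 9 ;
  if ew8 = 0 then no_check = "NONE" ;
  else if ew8 < 9 then no_check = "ELEM" ;
  else if ew8 > 8 then no_check = "HS +" ;
  datalines ;
0
5
12
.
;
run ;

proc print data=else_demo ;
run ;
```

For the record with missing ew8, education stays blank while no_check is set to ELEM, because SAS treats a missing numeric value as smaller than any number.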
data example ;
set mydata.india ;
If EW11 = 1 then Mobility = "None" ;
else if EW11 = 2 then mobility = "Vill" ;
else if EW11 = 3 then mobility = "TOWN";
if ew8 = 0 then education = "NONE" ;
else if ew8 < 9 and ew8 ne . then education = "ELEM";
else if ew8 > 8 then education = "HS +";
*** The statements below recode the health items ;
*** For hb1 the correct answer is 0, so 1-hb1 will score respondents who said 0 as correct (= 1) and those who said 1 as incorrect (=0);
*** For hb3 the correct answer is 1, so respondents who said 1 are scored as correct (= 1) and those who said any number higher than 1 as incorrect (=0);
*** For hb4 – hb7, the correct answer is scored as correct (=1) and any numbers in the incorrect set scored as incorrect (=0);
*** HEALTH QUESTIONS ;
hbs1 = 1- hb1 ;
If hb3 = 1 then hbs3 = 1 ;
Else if hb3 > 1 then hbs3 = 0 ;
If hb4 = 2 then hbs4 = 1 ;
Else if hb4 in (1,3) then hbs4 = 0 ;
If hb5 = 2 then hbs5 = 1 ;
Else if hb5 in (1,3,4) then hbs5 = 0 ;
If hb6 = 2 then hbs6 = 1 ;
Else if hb6 in (1,3,4) then hbs6 = 0 ;
If hb7 = 3 then hbs7 = 1 ;
Else if hb7 in (1,2,4) then hbs7 = 0 ;
/* DECISION QUESTIONS */
/* ALSO INCLUDES ADDITIONAL ITEMS NOT RECODED */
**** Here, I multiplied items by a factor based on my estimation of importance ;
D_GR1A = GR1A * 0.5 ;
D_GR3A = GR3A * 10 ; * BECAUSE I THINK IT IS IMPORTANT ;
D_GR4A = GR4A * 2 ;
D_GR7A = GR7A * 2 ;
**** These items are subtracted from 3 so does not have to tell anyone = 2 ;
**** Needs to inform someone = 1 and needs to ask permission = 0 ;
D_GR9A = 3 - GR9A ;
D_GR10A = 3 - GR10A ;
D_GR12A = 3 - GR12A ;
**** KEEPS THE VARIABLES I PLAN TO USE ;
Keep EW8 EW5 Ew6 EW10 EW14a EW12a EW12b
HBS1 HBs3-HBS7 D_GR1A GR2A D_GR3A D_GR4A GR5a GR6A D_GR7A GR8A
D_GR9A GR9F D_GR10A D_GR12A GR10F GR12F GR34 - GR39 mobility education;
run ;
So, there we go. You might think I would dive into a Multivariate Analysis of Variance now but you would be wrong. The next thing I am going to do is check the validity of my scales through a combination of factor analysis, univariate statistics and reliability analysis. Only after that step will I do the MANOVA.
So, I have been doing a few videos here and there to refresh, for example, what is a repeated measures ANOVA and why you might want to do it.
If you are interested in being a beta tester for our first bilingual game that teaches statistics, please email info@7generationgames.com
For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV
Yes, when I was a brand-new baby professor I was sometimes rushing to write a lecture before class, but now that I have given lectures on repeated measures ANOVA approximately 132 times, I just update the examples to be relevant to the current cohort.
So, I was debating about using SAS or SPSS and I got a lot of recommendations, particularly on LinkedIn. A few people suggested using JMP, which I had not considered and hadn’t used in a few years. That sounded like a good idea except that I needed to have the syllabus done in a couple of days and start teaching the class the week after that.
In the end, I decided to go with SAS Studio for this class and investigate JMP for the future.
Interestingly, no one encouraged me to use SPSS, which surprised me because it’s not a terrible package, just overpriced.
Support my day job! Learn about Ojibwe history and culture. Practice multiplication and division
FREE GAME FOR iPad or Android tablets
Click over here to find links to Making Camp in the App Store or on Google Play. Yes, it’s free.
The video below might give you an idea of why I decided to go with SAS Studio. Maybe I was just lucky, but it was so easy to upload the data I wanted to use for the first week’s examples, the India Human Development Survey. Take a look:
Think about this carefully for a moment, if you are using quintiles, you are matching people by which group they fit into as far as probability of being in the treatment group. So, if your friend, Bob, has a predicted probability of 15% of being in the treatment group, his quintile would be 1, because he is in the lowest 20%, that is, the bottom fifth, or quintile. If your other friend, Luella, has a predicted probability of being in the treatment group of 57%, then she is in the third quintile.
Oh, if only there were a means of getting the predicted probability of being in a certain category – oh, wait, there is!
Let’s do binary logistic regression with SAS Studio
First, log into your SAS Studio account.
Second, you probably need to run a program with a LIBNAME statement to make your data available. I am going to skip that step because in this example I’m going to use one of the SASHELP data sets and create a data set in my WORK library, so I don’t need a LIBNAME for that but, as you will see, I do need it later. Here is the program I ran.
* The LIBNAME is not needed for this data step, only for saving output later ;
libname mydata "/courses/blahblah/c_123/" ;
data psm_ex ;
set sashelp.heart ;
WHERE weight_status ne "Underweight" ;
if smoking = 0 then smoker = 0 ;
else if smoking > 0 then smoker = 1;
run;
My question is: if I had people who had the same propensity to smoke, based on age, gender, etc., would smoking still be a factor in the outcome (in this case, death)? To answer that, I need propensity scores.
Third, in the window on the left, click on TASKS AND UTILITIES, then STATISTICS and select BINARY LOGISTIC REGRESSION, as shown below.
Next, choose the data set you want by clicking on the thing under the word DATA that looks like a table of data and selecting the library and data set in that library. Next, under RESPONSE, click the + sign and select the dependent variable for which you want to predict the probability. In this case, it’s whether the person is a smoker or not. Click the arrow next to EVENT OF INTEREST and pick which you want to predict, in this case, your choices are 0 or 1. I selected 1 because I want to predict if the person is a smoker.
Below that, select your classification variable,
There is also a choice for continuous variables (not shown) on the same screen. I selected AGEATSTART.
I’m going to select the defaults for everything but OUTPUT. Click the arrow at the top of the screen next to MODEL and keep clicking until you see the OUTPUT tab. Click on the box next to CREATE OUTPUT DATASET. Browse for a directory where you want to save it. I had set that directory in my LIBNAME statement (remember the LIBNAME statement) so it would be available to save the data. Select that directory and give the data set a name.
Click the arrow next to PREDICTED VALUES and in the 3 boxes that appear below it, click the box next to predicted values.
After this, you are ready to run your analysis. Click the image of the little running guy above. When your analysis runs you will have a data set with all of your original data plus your predicted scores.
Now, we just need to compute quintiles. You could find the quintiles by doing this:
PROC FREQ DATA=MYDATA.STATSPSM ;
tables pred_ ;
run ;
and look for the 20th, 40th, etc. percentile
However, an easier way if you have thousands of records is
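proc univariate data=mydata.statspsm ;
var pred_ ;
* OUT= names the data set of percentiles. Without it SAS picks a default name such as DATA1 ;
output out=quintiles pctlpre=P_ pctlpts= 20 to 80 by 20;
run ;
proc print data=quintiles ;
run ;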
Which will give you the percentiles.
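If you would rather do the whole thing in code, here is a rough sketch of the same steps. Some of it is assumption on my part: I use SEX as the classification variable, and the data set names psm_scores and psm_quintiles and the variable name quintile are made up for illustration. I kept pred_ as the name of the predicted probability to match the data set above.

```sas
data psm_ex ;
  set sashelp.heart ;
  where weight_status ne "Underweight" ;
  if smoking = 0 then smoker = 0 ;
  else if smoking > 0 then smoker = 1 ;
run ;

* Predicted probability of being a smoker is the propensity score ;
proc logistic data=psm_ex ;
  class sex / param=ref ;
  model smoker(event='1') = sex ageatstart ;
  output out=psm_scores predicted=pred_ ;
run ;

* GROUPS=5 splits the scores into fifths, numbered 0 through 4 ;
proc rank data=psm_scores groups=5 out=psm_quintiles ;
  var pred_ ;
  ranks quintile ;
run ;

proc freq data=psm_quintiles ;
  tables quintile * smoker ;
run ;
```

PROC RANK assigns each person to a quintile directly, so you do not have to read cut points off a percentile table and recode by hand. Note that it numbers the groups from 0, so the lowest fifth is quintile 0.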
So, today I’m going to recycle a couple of older posts that introduce you to propensity score matching. Then, tomorrow, I will show you how to get your propensity scores with just pointing and clicking with a FREE (as in free beer) version of SAS.
Propensity score matching has had a huge rise in popularity over the past few years. That isn’t a terrible thing, but in my not so humble opinion, many people are jumping on the bandwagon without thinking through if this is what they really need to do.
The idea is quite simple – you have two groups which are non-equivalent, say, people who attend a support group to quit being douchebags and people who don’t. At the end of the group term, you want to test for a decline in douchebaggery.
However, you believe that people who don’t attend the groups are likely different from those who do in the first place, bigger douchebags, younger, and, it goes without saying, more likely to be male.
The very, very important key phrase in that sentence is YOU BELIEVE.
Before you ever do a propensity score matching program you should test that belief and see if your groups really ARE different. If not, you can stop right now. You’d think doing a few ANOVAs, t-tests or cross-tabs in advance would be common sense. Let me tell you something, common sense suffers from false advertising. It’s not common at all.
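Testing that belief takes only a few lines. This sketch swaps in the SASHELP.HEART sample data, with smokers and non-smokers standing in for the two groups; the data set name psm_check and the choice of age and sex as the variables to compare are my assumptions, not part of the support group example.

```sas
data psm_check ;
  set sashelp.heart ;
  if smoking = 0 then smoker = 0 ;
  else if smoking > 0 then smoker = 1 ;
run ;

* Do the two groups differ on a continuous variable ;
proc ttest data=psm_check ;
  class smoker ;
  var ageatstart ;
run ;

* Do the two groups differ on a categorical variable ;
proc freq data=psm_check ;
  tables smoker * sex / chisq ;
run ;
```

If neither test shows a difference worth worrying about, matching on those variables is unlikely to buy you much.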
Even if there are differences between the groups, it may not matter unless it is related to your dependent variable, in this case, the Unreliable Measure of Douchebaggedness.
Once upon a time there were statisticians who thought the answer to everything was to be as precise, correct and “bleeding edge” as possible. If their analyses were precise to 12 decimal places instead of 5, of course they were better because as everyone knows, 12 is more than 5 (and statisticians knew it better, being better at math than most people).
Occasionally, people came along who suggested that newer was not always better, that perhaps sentences with the word “bleeding” in them were not always reflective of best practices, as in,
“I stuck my hand in the piranha tank and now I am bleeding.”
Such people had their American Statistical Association membership cards torn up by a pack of wolves and were banished to the dungeon where they were forced to memorize regular expressions in Perl until their heads exploded. Either that, or they were eaten by piranhas.
Perhaps I am exaggerating a tad bit, but it is true that there has been an over-emphasis on whatever is the shiniest, new technique on the block. Before my time, factor analysis was the answer to everything. I remember when Structural Equation Modeling was the answer to everything (yes, I am old). After that, Item Response Theory (IRT) was the answer to everything. Multiple imputation and mixed models both had their brief flings at being the answer to everything. Now it is propensity scores.
A study by Sturmer et al. (2006) is just one example of a few recent analyses that have shown an almost logarithmic growth in the popularity of propensity score matching, from a handful of studies in the late nineties to everybody and their brother.
You can read the rest of the post about choosing a method of propensity score matching here. If your clicking finger is tired, the take away message is this — quintiles, which are much simpler, faster to compute and easier to explain, are generally just as effective as more complex methods.
Now that we are all excited about quintiles, the next couple of posts will show you how to compute those in a mostly pointy-clicky manner.