Out of curiosity, I decided to perform a statistical analysis of the data that were collected from Allar’s stream.
Here are my results…
General information:
Here’s a table with the average scores of entries, broken down by submission category.
[table]
[thead]
[tr][th][/th][th]OnlineMP[/th][th]LocalMP[/th][th]VR[/th][th]&lt;100 MB[/th][th]DevRequested[/th][th]WNE[/th][th]PreReg[/th][/tr]
[/thead]
[tbody]
[tr][th]N[/th][td]7[/td][td]14[/td][td]19[/td][td]23[/td][td]63[/td][td]42[/td][td]115[/td][/tr]
[tr][th]Average (Allar)[/th][td]6.39[/td][td]5.00[/td][td]4.36[/td][td]5.09[/td][td]6.00[/td][td]4.21[/td][td]4.85[/td][/tr]
[tr][th]Average (Chat)[/th][td]6.38[/td][td]4.67[/td][td]4.28[/td][td]4.68[/td][td]5.79[/td][td]3.99[/td][td]4.62[/td][/tr]
[tr][th]Std. dev. (Allar)[/th][td]1.78[/td][td]1.92[/td][td]2.17[/td][td]2.05[/td][td]1.56[/td][td]1.92[/td][td]1.97[/td][/tr]
[tr][th]Std. dev. (Chat)[/th][td]1.87[/td][td]2.10[/td][td]2.21[/td][td]2.17[/td][td]1.66[/td][td]2.05[/td][td]2.05[/td][/tr]
[/tbody]
[/table]
This table contains summary statistics for selected variables (the entire sample, excluding disqualified entries).
[table]
[thead]
[tr][th][/th][th]Timestamp[/th][th]AllarScore[/th][th]ChatScore[/th][th]CoolCharScore[/th][th]TeamSize[/th][/tr]
[/thead]
[tbody]
[tr][th]Min.[/th][td]0.0023[/td][td]1[/td][td]0.38[/td][td]0[/td][td]1[/td][/tr]
[tr][th]1st Quartile[/th][td]0.0963[/td][td]3.25[/td][td]2.78[/td][td]2[/td][td]1[/td][/tr]
[tr][th]Median[/th][td]0.211[/td][td]4.75[/td][td]4.4[/td][td]4[/td][td]2[/td][/tr]
[tr][th]Mean[/th][td]0.2404[/td][td]4.632[/td][td]4.394[/td][td]4.037[/td][td]2.216[/td][/tr]
[tr][th]3rd Quartile[/th][td]0.3634[/td][td]6[/td][td]5.745[/td][td]6[/td][td]3[/td][/tr]
[tr][th]Max.[/th][td]0.6449[/td][td]9[/td][td]9.15[/td][td]9[/td][td]5[/td][/tr]
[tr][th]Standard Dev.[/th][td]0.1691[/td][td]1.921[/td][td]1.987[/td][td]2.219[/td][td]1.453[/td][/tr]
[/tbody]
[/table]
First, I wanted to assess whether the scores were normally distributed. This matters because many of the procedures I used assume normality. To check, I plotted a simple histogram, calculated the distribution density, and ran the Shapiro-Wilk test for normality. Interestingly, the Shapiro-Wilk test returned a p-value of 0.02, which suggests the scores may not be normally distributed, so I followed up with a QQ plot to verify.
Histogram:
Density:
As you can see from the above two images, there appears to be a skew toward the lower end of the score distribution. It’s minor, but it does exist.
QQ Plot:
Here, we can see where the departures from normality show up. At the lower and higher ends, scores deviate from what we would expect of a normal population. The red line, by the way, indicates where our observations should fall if the sample were normally distributed. Based on this QQ plot, I decided not to transform the scores in any way and to go ahead with my planned linear regression analyses. Frankly, with data collected from humans, it’s rare to see anything this close to normal.
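If you want to run the same checks yourself, here’s roughly what they look like in R. This is just a minimal sketch: the data frame name (jam), the CSV file name, and the use of AllarScore as the score column are placeholders rather than my actual script.
[code]
# Hypothetical setup: one row per entry, scores in a data frame called "jam"
jam <- read.csv("jam_scores.csv")     # placeholder file name

# Histogram and kernel density of the scores
hist(jam$AllarScore, breaks = 10, xlab = "Score", main = "Histogram of AllarScore")
plot(density(jam$AllarScore), main = "Density of AllarScore")

# Shapiro-Wilk test for normality (this is the test that returned p = 0.02)
shapiro.test(jam$AllarScore)

# QQ plot against a theoretical normal distribution
qqnorm(jam$AllarScore)
qqline(jam$AllarScore, col = "red")   # the red "expected under normality" line
[/code]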
Because I didn’t really start with any hypotheses, the next step was to determine which variables were correlated, so that I could follow up with further analyses. Here is a correlation matrix of all the available variables:
[table]
[thead]
[tr][th][/th][th]Post[/th][th]Timestamp[/th][th]AllarScore[/th][th]ChatScore[/th][th]CoolCharScore[/th][th]TeamSize[/th][th]OnlineMP[/th][th]LocalMP[/th][th]VR[/th][th]smallSize[/th][th]Requested[/th][th]WNE[/th][th]PreReg[/th][/tr]
[/thead]
[tbody]
[tr][th]Post[/th][td]1.000[/td][td]-0.167[/td][td]0.045[/td][td]0.080[/td][td]0.183[/td][td]0.282[/td][td]0.048[/td][td]0.009[/td][td]-0.001[/td][td]-0.086[/td][td]0.207[/td][td]-0.021[/td][td]0.103[/td][/tr]
[tr][th]Timestamp[/th][td]-0.167[/td][td]1.000[/td][td]-0.077[/td][td]-0.075[/td][td]0.011[/td][td]-0.073[/td][td]-0.043[/td][td]0.041[/td][td]0.088[/td][td]-0.066[/td][td]-0.169[/td][td]0.042[/td][td]-0.101[/td][/tr]
[tr][th]AllarScore[/th][td]0.045[/td][td]-0.077[/td][td]1.000[/td][td]0.952[/td][td]0.530[/td][td]0.195[/td][td]0.190[/td][td]0.058[/td][td]-0.051[/td][td]0.095[/td][td]0.544[/td][td]-0.124[/td][td]0.161[/td][/tr]
[tr][th]ChatScore[/th][td]0.080[/td][td]-0.075[/td][td]0.952[/td][td]1.000[/td][td]0.579[/td][td]0.222[/td][td]0.208[/td][td]0.042[/td][td]-0.021[/td][td]0.057[/td][td]0.540[/td][td]-0.117[/td][td]0.160[/td][/tr]
[tr][th]CoolCharScore[/th][td]0.183[/td][td]0.011[/td][td]0.530[/td][td]0.579[/td][td]1.000[/td][td]0.307[/td][td]0.163[/td][td]0.153[/td][td]-0.059[/td][td]-0.086[/td][td]0.223[/td][td]-0.218[/td][td]0.068[/td][/tr]
[tr][th]TeamSize[/th][td]0.282[/td][td]-0.073[/td][td]0.195[/td][td]0.222[/td][td]0.307[/td][td]1.000[/td][td]0.132[/td][td]-0.015[/td][td]-0.066[/td][td]-0.142[/td][td]0.196[/td][td]0.018[/td][td]0.078[/td][/tr]
[tr][th]OnlineMP[/th][td]0.048[/td][td]-0.043[/td][td]0.190[/td][td]0.208[/td][td]0.163[/td][td]0.132[/td][td]1.000[/td][td]-0.062[/td][td]-0.073[/td][td]0.005[/td][td]0.148[/td][td]-0.049[/td][td]0.081[/td][/tr]
[tr][th]LocalMP[/th][td]0.009[/td][td]0.041[/td][td]0.058[/td][td]0.042[/td][td]0.153[/td][td]-0.015[/td][td]-0.062[/td][td]1.000[/td][td]-0.106[/td][td]0.132[/td][td]-0.095[/td][td]-0.022[/td][td]-0.019[/td][/tr]
[tr][th]VR[/th][td]-0.001[/td][td]0.088[/td][td]-0.051[/td][td]-0.021[/td][td]-0.059[/td][td]-0.066[/td][td]-0.073[/td][td]-0.106[/td][td]1.000[/td][td]-0.085[/td][td]-0.039[/td][td]0.101[/td][td]-0.031[/td][/tr]
[tr][th]smallSize[/th][td]-0.086[/td][td]-0.066[/td][td]0.095[/td][td]0.057[/td][td]-0.086[/td][td]-0.142[/td][td]0.005[/td][td]0.132[/td][td]-0.085[/td][td]1.000[/td][td]0.125[/td][td]-0.185[/td][td]0.056[/td][/tr]
[tr][th]Requested[/th][td]0.207[/td][td]-0.169[/td][td]0.544[/td][td]0.540[/td][td]0.223[/td][td]0.196[/td][td]0.148[/td][td]-0.095[/td][td]-0.039[/td][td]0.125[/td][td]1.000[/td][td]-0.070[/td][td]0.171[/td][/tr]
[tr][th]WNE[/th][td]-0.021[/td][td]0.042[/td][td]-0.124[/td][td]-0.117[/td][td]-0.218[/td][td]0.018[/td][td]-0.049[/td][td]-0.022[/td][td]0.101[/td][td]-0.185[/td][td]-0.070[/td][td]1.000[/td][td]-0.065[/td][/tr]
[tr][th]PreReg[/th][td]0.103[/td][td]-0.101[/td][td]0.161[/td][td]0.160[/td][td]0.068[/td][td]0.078[/td][td]0.081[/td][td]-0.019[/td][td]-0.031[/td][td]0.056[/td][td]0.171[/td][td]-0.065[/td][td]1.000[/td][/tr]
[/tbody]
[/table]
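(If you want to build a matrix like this yourself, it’s essentially one call to cor() in R. Sketch below, again assuming the variables above live as numeric columns in the placeholder jam data frame.)
[code]
# Pearson correlation matrix of every pair of variables, rounded to 3 decimals
vars <- c("Post", "Timestamp", "AllarScore", "ChatScore", "CoolCharScore",
          "TeamSize", "OnlineMP", "LocalMP", "VR", "smallSize",
          "Requested", "WNE", "PreReg")
round(cor(jam[, vars]), 3)
[/code]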
There are a few interesting things in here. For example, teams with more members tended to submit later than teams with fewer members (reflected in the 0.28 correlation between TeamSize and Post number). For now, though, I’m going to focus on the scores assigned to the submissions. First, note the extremely high correlation between the scores Allar assigned to games and the scores assigned by his Twitch chat. It’s > 0.95!
So, for funsies, let’s look at a regression analysis of the relationship between Allar’s scores and the Twitch Chat scores.
The black line shows what the chat scores would look like if there were no relationship between the two (the null hypothesis). The red line shows the regression line fitted to what we actually observed. Interestingly, we can see from this that Allar’s scores were, on average, slightly higher than the chat’s scores (by roughly 0.25).
If you’re a stats nerd like me, here are the important bits:
[table]
[thead]
[tr][th]Residuals:[/th][th]Min[/th][th]1st Quartile[/th][th]Median[/th][th]3rd Quartile[/th][th]Max[/th][/tr]
[/thead]
[tbody]
[tr][th][/th][td]-1.42612[/td][td]-0.43028[/td][td]0.00472[/td][td]0.4043[/td][td]1.54436[/td][/tr]
[/tbody]
[/table]
[table]
[thead]
[tr][th]Coefficients:[/th][th]Estimate[/th][th]Std. Error[/th][th]t value[/th][th]Pr(>|t|)[/th][/tr]
[/thead]
[tbody]
[tr][th]Intercept[/th][td]-0.1664[/td][td]0.1218[/td][td]-1.366[/td][td]0.174[/td][/tr]
[tr][th]AllarScore[/th][td]0.9847[/td][td]0.0243[/td][td]40.525[/td][td]&lt;2e-16 ***[/td][/tr]
[/tbody]
[/table]
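For anyone who wants to reproduce this, the model and the plot are only a few lines of R. Another sketch against the placeholder jam data frame; the flat black baseline here is my shorthand for the no-relationship case, so it may not be drawn exactly like the plot above.
[code]
# Fit the Twitch chat score as a linear function of Allar's score
fit_chat <- lm(ChatScore ~ AllarScore, data = jam)
summary(fit_chat)   # estimates, residuals, t and p values as in the table above

# Scatterplot with the fitted line (red) and a flat "no relationship" baseline (black)
plot(jam$AllarScore, jam$ChatScore, xlab = "AllarScore", ylab = "ChatScore")
abline(h = mean(jam$ChatScore), col = "black")   # expected chat score if the two were unrelated
abline(fit_chat, col = "red")                    # fitted regression line
[/code]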
But what I was most curious about was the relationship between team size and score; I initially thought of teaming up as a potentially large advantage. Before I removed the disqualified entries, the relationship between score and team size was not significant. After removing the DQs, however, the correlation rose to 0.19. Not huge, but enough to pique my curiosity. So, here’s a linear regression predicting overall score from TeamSize.
This graph isn’t too surprising, given what we found in the correlation matrix. So let’s dig into the juicy, numbery bits.
[table]
[thead]
[tr][th]Residuals:[/th][th]Min[/th][th]1st Quartile[/th][th]Median[/th][th]3rd Quartile[/th][th]Max[/th][/tr]
[/thead]
[tbody]
[tr][th][/th][td]-4.3502[/td][td]-1.3257[/td][td]0.1825[/td][td]1.4161[/td][td]4.4243[/td][/tr]
[/tbody]
[/table]
[table]
[thead]
[tr][th]Coefficients:[/th][th]Estimate[/th][th]Std. Error[/th][th]t value[/th][th]Pr(>|t|)[/th][/tr]
[/thead]
[tbody]
[tr][th]Intercept[/th][td]4.05938[/td][td]0.26410[/td][td]15.370[/td][td]&lt;2e-16 ***[/td][/tr]
[tr][th]TeamSize[/th][td]0.25817[/td][td]0.09974[/td][td]2.588[/td][td]0.0105 *[/td][/tr]
[/tbody]
[/table]
Look at those p values! As it turns out, team size does have a real relationship with score, but the effect isn’t nearly as large as I suspected.
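(For the curious, that last step looks roughly like this in R. The Disqualified column name is invented for the sketch; the point is just to drop the DQ’d rows before refitting.)
[code]
# Drop disqualified entries, then regress score on team size
jam_ok   <- subset(jam, !Disqualified)   # "Disqualified" is a placeholder logical column
fit_team <- lm(AllarScore ~ TeamSize, data = jam_ok)
summary(fit_team)                        # slope ~0.258, p ~0.01, matching the table above
[/code]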
Okay. So this is all fine, but how can we use these data to make the best game jam games that we possibly can? I’m glad you asked, dear reader!
From here, I took the information gleaned from these analyses and ran a couple of multiple regression analyses. If you’re not familiar, the basic idea is that we use several inputs at once to predict a single output. With enough inputs, it’s a pretty good method for identifying which behaviors are associated with the outcome we want. In this case, we want to know what we can do to improve the quality of our game jam submissions. Here we go.
To start, I took the three largest correlates with AllarScore (the ones that made sense, anyway) and threw them into a regression equation. Here’s the output:
[table]
[thead]
[tr][th]Coefficients:[/th][th]Intercept[/th][th]TeamSize[/th][th]OnlineMP[/th][th]CoolCharScore[/th][/tr]
[/thead]
[tbody]
[tr][th][/th][td]2.74931[/td][td]0.03543[/td][td]1.00644[/td][td]0.43655[/td][/tr]
[/tbody]
[/table]
Basically, this tells us that we can calculate an expected score with the following formula: Score = 2.74931 + TeamSize * 0.03543 + CoolCharScore * 0.43655
- You can add 1.00644 to that if you include online multiplayer.
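In R, that model is just lm() with three predictors, and predict() does the formula arithmetic for you. Another sketch: I’m assuming OnlineMP is coded 0/1 and reusing the cleaned jam_ok data frame from above.
[code]
# Multiple regression: three predictors of the score at once
fit3 <- lm(AllarScore ~ TeamSize + OnlineMP + CoolCharScore, data = jam_ok)
coef(fit3)   # ~2.749 (intercept), 0.035 (TeamSize), 1.006 (OnlineMP), 0.437 (CoolCharScore)

# Expected score for a hypothetical 3-person team with online MP and a character coolness of 7
predict(fit3, newdata = data.frame(TeamSize = 3, OnlineMP = 1, CoolCharScore = 7))
# 2.74931 + 3*0.03543 + 1*1.00644 + 7*0.43655 = ~6.92
[/code]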
But let’s take it a step further and add another predictor variable (pre-registration).
[table]
[thead]
[tr][th]Coefficients:[/th][th]Intercept[/th][th]TeamSize[/th][th]OnlineMP[/th][th]CoolCharScore[/th][th]PreReg[/th][/tr]
[/thead]
[tbody]
[tr][th][/th][td]2.46454[/td][td]0.02645[/td][td]0.92965[/td][td]0.43258[/td][td]0.48158[/td][/tr]
[/tbody]
[/table]
With this new multiple regression, we can come up with a new formula: Score = 2.46454 + TeamSize * 0.02645 + CoolCharScore * 0.43258
- You can add 0.92965 to that if you include online multiplayer.
- You can add 0.48158 to that if you preregister for the jam.
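(Same sketch with the fourth predictor added; PreReg is also assumed to be coded 0/1.)
[code]
# Add pre-registration as a fourth predictor
fit4 <- lm(AllarScore ~ TeamSize + OnlineMP + CoolCharScore + PreReg, data = jam_ok)
coef(fit4)

# e.g. a pre-registered 3-person team with online MP and a character coolness of 7
predict(fit4, newdata = data.frame(TeamSize = 3, OnlineMP = 1,
                                   CoolCharScore = 7, PreReg = 1))
# 2.46454 + 3*0.02645 + 1*0.92965 + 7*0.43258 + 1*0.48158 = ~6.98
[/code]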
So what does all of this tell us?
- Find a team or add reliable members to your team. The more the merrier.
- Make sure you pay attention to character design. Clearly it’s important. One could make an argument here that abstract characters (like glowing spheres) aren’t as interesting to players. Or it could just be a spurious correlation.
- If it makes sense, add online multiplayer. Apparently, people like playing with other people. Who would’ve guessed?
- Preregister for the jam. This is linked to a broader point - be prepared and informed. Know when a jam is going to happen and jump in as soon as you can.
I realize that some of these conclusions might be fairly obvious, but I had fun playing with the numbers. Hope it helped shed some light on the results of our jam!