Statistical Analysis of Megajam 2016 (Graphs!)

Out of curiosity, I decided to perform a statistical analysis of the data collected from Allar's stream.

Here are my results…

General information:

Here’s a table of average scores, broken down by submission category.

[table]
[thead]
[tr][th][/th][th]OnlineMP[/th][th]LocalMP[/th][th]VR[/th][th]<100 MB[/th][th]DevRequested[/th][th]WNE[/th][th]PreReg[/th][/tr]
[/thead]
[tbody]
[tr][th]N[/th][td]7[/td][td]14[/td][td]19[/td][td]23[/td][td]63[/td][td]42[/td][td]115[/td][/tr]
[tr][th]Average (Allar)[/th][td]6.393[/td][td]5.004[/td][td]4.355[/td][td]5.091[/td][td]5.997[/td][td]4.214[/td][td]4.847[/td][/tr]
[tr][th]Average (Chat)[/th][td]6.384[/td][td]4.670[/td][td]4.276[/td][td]4.680[/td][td]5.794[/td][td]3.989[/td][td]4.616[/td][/tr]
[tr][th]SD (Allar)[/th][td]1.779[/td][td]1.918[/td][td]2.170[/td][td]2.050[/td][td]1.564[/td][td]1.920[/td][td]1.975[/td][/tr]
[tr][th]SD (Chat)[/th][td]1.867[/td][td]2.099[/td][td]2.206[/td][td]2.172[/td][td]1.662[/td][td]2.051[/td][td]2.051[/td][/tr]
[/tbody]
[/table]
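For anyone who wants to reproduce these numbers, here's roughly how the per-category averages look in R (the data frame name `jam` is a stand-in for the actual dataset; each category is assumed to be a 0/1 flag column):

[code]
# Hypothetical setup: `jam` has one row per entry, an AllarScore column,
# and 0/1 flag columns per category. Entries can belong to several
# categories at once, so we subset rather than group:
mean(jam$AllarScore[jam$OnlineMP == 1])  # the OnlineMP average (~6.39 above)
sd(jam$AllarScore[jam$OnlineMP == 1])    # and its standard deviation
sum(jam$OnlineMP == 1)                   # N for the category
[/code]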

This table contains relevant statistics about selected variables (the entire sample, excluding disqualified entries).

[table]
[thead]
[tr][th][/th][th]Timestamp[/th][th]AllarScore[/th][th]ChatScore[/th][th]CoolCharScore[/th][th]TeamSize[/th][/tr]
[/thead]
[tbody]
[tr][th]Min.[/th][td]0.0023[/td][td]1[/td][td]0.38[/td][td]0[/td][td]1[/td][/tr]
[tr][th]1st Quartile[/th][td]0.0963[/td][td]3.25[/td][td]2.78[/td][td]2[/td][td]1[/td][/tr]
[tr][th]Median[/th][td]0.211[/td][td]4.75[/td][td]4.4[/td][td]4[/td][td]2[/td][/tr]
[tr][th]Mean[/th][td]0.2404[/td][td]4.632[/td][td]4.394[/td][td]4.037[/td][td]2.216[/td][/tr]
[tr][th]3rd Quartile[/th][td]0.3634[/td][td]6[/td][td]5.745[/td][td]6[/td][td]3[/td][/tr]
[tr][th]Max.[/th][td]0.6449[/td][td]9[/td][td]9.15[/td][td]9[/td][td]5[/td][/tr]
[tr][th]Standard Dev.[/th][td]0.1691074[/td][td]1.921189[/td][td]1.986699[/td][td]2.219327[/td][td]1.453179[/td][/tr]
[/tbody]
[/table]
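A summary like this is close to what R's summary() prints; here's a sketch with the same stand-in data frame:

[code]
vars <- c("Timestamp", "AllarScore", "ChatScore", "CoolCharScore", "TeamSize")
summary(jam[, vars])     # Min., quartiles, Median, Mean, Max.
sapply(jam[, vars], sd)  # the Standard Dev. row
[/code]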

First, I wanted to assess whether the scores were normally distributed. This is important because many of the procedures used here assume normality. For that, I plotted a simple histogram, calculated the distribution density, and ran the Shapiro-Wilk test for normality. Interestingly, the Shapiro-Wilk test returned a p-value of 0.02, which suggests the scores may not be normally distributed, so I followed up with a QQ plot to verify.
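If you want to run the same checks yourself, here's roughly how they look in R (again, `jam` and its columns are stand-ins for the actual dataset):

[code]
hist(jam$AllarScore, main = "Histogram of AllarScore")  # histogram below
plot(density(jam$AllarScore), main = "Density")         # density below
shapiro.test(jam$AllarScore)        # returned p ~ 0.02 on these data
qqnorm(jam$AllarScore)              # the QQ plot further down
qqline(jam$AllarScore, col = "red") # the red reference line
[/code]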

Histogram:

Density:

As you can see from the above two images, there appears to be a skew toward the lower end of the score distribution. It’s minor, but it does exist.

QQ Plot:

Here we can see where the variance shows itself: at the lower and higher ends, scores deviate from what we would expect of a normal population. The red line, by the way, marks where our observations should land if our sample is normally distributed. Based on this QQ plot, I decided not to transform the distribution in any way and to go ahead with my planned tests for linear regression. Frankly, with data collected from Humans, it’s rare to see data that appear this close to normal.

Because I didn’t really start with any hypotheses, the next step was to determine where our data correlated so that I could follow up with more statistics. Here is a correlation matrix containing all of the available data:

[table]
[thead]
[tr][th][/th][th]Post[/th][th]Timestamp[/th][th]AllarScore[/th][th]ChatScore[/th][th]CoolCharScore[/th][th]TeamSize[/th][th]OnlineMP[/th][th]LocalMP[/th][th]VR[/th][th]smallSize[/th][th]Requested[/th][th]WNE[/th][th]PreReg[/th][/tr]
[/thead]
[tbody]
[tr][th]Post[/th][td]1.000[/td][td]-0.167[/td][td]0.045[/td][td]0.080[/td][td]0.183[/td][td]0.282[/td][td]0.048[/td][td]0.009[/td][td]-0.001[/td][td]-0.086[/td][td]0.207[/td][td]-0.021[/td][td]0.103[/td][/tr]
[tr][th]Timestamp[/th][td]-0.167[/td][td]1.000[/td][td]-0.077[/td][td]-0.075[/td][td]0.011[/td][td]-0.073[/td][td]-0.043[/td][td]0.041[/td][td]0.088[/td][td]-0.066[/td][td]-0.169[/td][td]0.042[/td][td]-0.101[/td][/tr]
[tr][th]AllarScore[/th][td]0.045[/td][td]-0.077[/td][td]1.000[/td][td]0.952[/td][td]0.530[/td][td]0.195[/td][td]0.190[/td][td]0.058[/td][td]-0.051[/td][td]0.095[/td][td]0.544[/td][td]-0.124[/td][td]0.161[/td][/tr]
[tr][th]ChatScore[/th][td]0.080[/td][td]-0.075[/td][td]0.952[/td][td]1.000[/td][td]0.579[/td][td]0.222[/td][td]0.208[/td][td]0.042[/td][td]-0.021[/td][td]0.057[/td][td]0.540[/td][td]-0.117[/td][td]0.160[/td][/tr]
[tr][th]CoolCharScore[/th][td]0.183[/td][td]0.011[/td][td]0.530[/td][td]0.579[/td][td]1.000[/td][td]0.307[/td][td]0.163[/td][td]0.153[/td][td]-0.059[/td][td]-0.086[/td][td]0.223[/td][td]-0.218[/td][td]0.068[/td][/tr]
[tr][th]TeamSize[/th][td]0.282[/td][td]-0.073[/td][td]0.195[/td][td]0.222[/td][td]0.307[/td][td]1.000[/td][td]0.132[/td][td]-0.015[/td][td]-0.066[/td][td]-0.142[/td][td]0.196[/td][td]0.018[/td][td]0.078[/td][/tr]
[tr][th]OnlineMP[/th][td]0.048[/td][td]-0.043[/td][td]0.190[/td][td]0.208[/td][td]0.163[/td][td]0.132[/td][td]1.000[/td][td]-0.062[/td][td]-0.073[/td][td]0.005[/td][td]0.148[/td][td]-0.049[/td][td]0.081[/td][/tr]
[tr][th]LocalMP[/th][td]0.009[/td][td]0.041[/td][td]0.058[/td][td]0.042[/td][td]0.153[/td][td]-0.015[/td][td]-0.062[/td][td]1.000[/td][td]-0.106[/td][td]0.132[/td][td]-0.095[/td][td]-0.022[/td][td]-0.019[/td][/tr]
[tr][th]VR[/th][td]-0.001[/td][td]0.088[/td][td]-0.051[/td][td]-0.021[/td][td]-0.059[/td][td]-0.066[/td][td]-0.073[/td][td]-0.106[/td][td]1.000[/td][td]-0.085[/td][td]-0.039[/td][td]0.101[/td][td]-0.031[/td][/tr]
[tr][th]smallSize[/th][td]-0.086[/td][td]-0.066[/td][td]0.095[/td][td]0.057[/td][td]-0.086[/td][td]-0.142[/td][td]0.005[/td][td]0.132[/td][td]-0.085[/td][td]1.000[/td][td]0.125[/td][td]-0.185[/td][td]0.056[/td][/tr]
[tr][th]Requested[/th][td]0.207[/td][td]-0.169[/td][td]0.544[/td][td]0.540[/td][td]0.223[/td][td]0.196[/td][td]0.148[/td][td]-0.095[/td][td]-0.039[/td][td]0.125[/td][td]1.000[/td][td]-0.070[/td][td]0.171[/td][/tr]
[tr][th]WNE[/th][td]-0.021[/td][td]0.042[/td][td]-0.124[/td][td]-0.117[/td][td]-0.218[/td][td]0.018[/td][td]-0.049[/td][td]-0.022[/td][td]0.101[/td][td]-0.185[/td][td]-0.070[/td][td]1.000[/td][td]-0.065[/td][/tr]
[tr][th]PreReg[/th][td]0.103[/td][td]-0.101[/td][td]0.161[/td][td]0.160[/td][td]0.068[/td][td]0.078[/td][td]0.081[/td][td]-0.019[/td][td]-0.031[/td][td]0.056[/td][td]0.171[/td][td]-0.065[/td][td]1.000[/td][/tr]
[/tbody]
[/table]
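A matrix like this is a single call in R; here's a sketch with the same stand-in names:

[code]
vars <- c("Post", "Timestamp", "AllarScore", "ChatScore", "CoolCharScore",
          "TeamSize", "OnlineMP", "LocalMP", "VR", "smallSize",
          "Requested", "WNE", "PreReg")
round(cor(jam[, vars]), 3)  # Pearson correlations, rounded as above
[/code]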

There are a few interesting things in here. For example, teams with more members tended to submit later than teams with fewer members (reflected by the 0.28 correlation between TeamSize and Post). For now, though, I’m going to focus on the scores assigned to the submissions. First, pay attention to the extremely high correlation between the scores Allar assigned to games and the scores assigned by his Twitch chat. It’s > 0.95!

So, for funsies, let’s look at a regression analysis of the relationship between Allar's scores and the Twitch chat scores.

The black line shows where the chat scores would land if they matched Allar's scores exactly; the red line shows a fitted regression line of what we actually observed. Interestingly, we can see from this that Allar's scores were, on average, slightly higher than the chat scores (by about 0.25, to be more precise).

If you’re a stats nerd like me, here are the important bits:

[table]
[thead]
[tr][th]Coefficients:[/th][th]Intercept[/th][th]AllarScore[/th][/tr]
[/thead]
[tbody]
[tr][td][/td][td]-0.1664[/td][td]0.9847[/td][/tr]
[/tbody]
[/table]

[table]
[thead]
[tr][th]Residuals:[/th][th]Min[/th][th]1st Quartile[/th][th]Median[/th][th]3rd Quartile[/th][th]Max[/th][/tr]
[/thead]
[tbody]
[tr][td][/td][td]-1.42612[/td][td]-0.43028[/td][td]0.00472[/td][td]0.4043[/td][td]1.54436[/td][/tr]
[/tbody]
[/table]

[table]
[thead]
[tr][th][/th][th]Coef[/th][th]SE Coef[/th][th]t[/th][th]Pr(>|t|)[/th][/tr]
[/thead]
[tbody]
[tr][th]Intercept[/th][td]-0.1664[/td][td]0.1218[/td][td]-1.366[/td][td]0.174[/td][/tr]
[tr][th]AllarScore[/th][td]0.9847[/td][td]0.0243[/td][td]40.525[/td][td]<2e-16 ***[/td][/tr]
[/tbody]
[/table]
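Here's roughly how that fit and the plot above look in R, with the same stand-in names:

[code]
cfit <- lm(ChatScore ~ AllarScore, data = jam)  # chat predicted from Allar
summary(cfit)                                   # the coefficient table above
plot(jam$AllarScore, jam$ChatScore)
abline(0, 1)               # black line: chat matching Allar exactly
abline(cfit, col = "red")  # red line: the fitted regression
[/code]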

But, what I was most curious about was the relationship between team size and score - I initially thought of teaming as a potentially large advantage. Before I removed the disqualified entries, the relationship between score and team size was not significant. However, after removing DQs, the correlation rose to .19. Not huge, but it was enough to pique my curiosity. So, here’s a linear regression predicting overall score based on TeamSize.

This graph isn’t too surprising, given what we found in the correlation matrix. So let’s dig into the juicy, numbery bits.

[table]
[thead]
[tr][th]Coefficients:[/th][th]Intercept[/th][th]TeamSize[/th][/tr]
[/thead]
[tbody]
[tr][td][/td][td]4.0594[/td][td]0.2582[/td][/tr]
[/tbody]
[/table]

[table]
[thead]
[tr][th]Residuals:[/th][th]Min[/th][th]1st Quartile[/th][th]Median[/th][th]3rd Quartile[/th][th]Max[/th][/tr]
[/thead]
[tbody]
[tr][td][/td][td]-4.3502[/td][td]-1.3257[/td][td]0.1825[/td][td]1.4161[/td][td]4.4243[/td][/tr]
[/tbody]
[/table]

[table]
[thead]
[tr][th][/th][th]Coef[/th][th]SE Coef[/th][th]t[/th][th]Pr(>|t|)[/th][/tr]
[/thead]
[tbody]
[tr][th]Intercept[/th][td]4.05938[/td][td]0.26410[/td][td]15.370[/td][td]<2e-16 ***[/td][/tr]
[tr][th]TeamSize[/th][td]0.25817[/td][td]0.09974[/td][td]2.588[/td][td]0.0105 *[/td][/tr]
[/tbody]
[/table]
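The same pattern in R, plus a confidence interval on the TeamSize slope (which the tables above don't show):

[code]
tfit <- lm(AllarScore ~ TeamSize, data = jam)
summary(tfit)  # TeamSize: p ~ 0.0105, as above
confint(tfit)  # 95% CI on the 0.258 slope estimate
[/code]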

Look at those p-values! As it turns out, team size does have an effect on score, but it’s nowhere near as large as I suspected.

Okay. So this is all fine, but how can we use these data to make the best game jam games that we possibly can? I’m glad you asked, dear reader!

From here, I took the information gleaned from these analyses and performed a couple of multiple regression analyses. If you’re not familiar, the basic idea is that we use multiple inputs to predict a single output. With enough inputs, it’s a pretty good method for identifying which behaviors drive a desired outcome. In this case, we want to know what we can do to improve the quality of our game jam submissions. Here we go.

To start, I took the three largest correlates with AllarScore (the ones that made sense, anyway) and threw them into a regression equation. Here’s the output:

[table]
[thead]
[tr][th]Coefficients:[/th][th]Intercept[/th][th]TeamSize[/th][th]OnlineMP[/th][th]CoolCharScore[/th][/tr]
[/thead]
[tbody]
[tr][td][/td][td]2.74931[/td][td]0.03543[/td][td]1.00644[/td][td]0.43655[/td][/tr]
[/tbody]
[/table]
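In R this is just a multi-term lm(); the prediction below is for a hypothetical team, not one drawn from the data:

[code]
mfit <- lm(AllarScore ~ TeamSize + OnlineMP + CoolCharScore, data = jam)
coef(mfit)  # the coefficients above
# Expected score for a hypothetical 3-person team with online MP and a
# character-coolness score of 7: 2.749 + 3*0.035 + 1.006 + 7*0.437 ~ 6.92
predict(mfit, newdata = data.frame(TeamSize = 3, OnlineMP = 1,
                                   CoolCharScore = 7))
[/code]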

Basically, this tells us that we can calculate an expected score with the following formula: Score = 2.74931 + 0.03543 * TeamSize + 0.43655 * CoolCharScore

  •  You can add 1.00644 to that, if you include online multiplayer.
    

But, let’s take it a step further and add another predictor variable (pre-Registration).

[table]
[thead]
[tr][th]Coefficients:[/th][th]Intercept[/th][th]TeamSize[/th][th]OnlineMP[/th][th]CoolCharScore[/th][th]PreReg[/th][/tr]
[/thead]
[tbody]
[tr][td][/td][td]2.46454[/td][td]0.02645[/td][td]0.92965[/td][td]0.43258[/td][td]0.48158[/td][/tr]
[/tbody]
[/table]
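In R, update() refits the previous model with the extra predictor without retyping the formula:

[code]
mfit2 <- update(mfit, . ~ . + PreReg)  # add the preregistration flag
coef(mfit2)                            # matches the table above
[/code]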

With this new multiple regression, we can come up with a new formula: Score = 2.46454 + 0.02645 * TeamSize + 0.43258 * CoolCharScore

  •  You can add 0.92965 to that if you include online multiplayer.
    
  •  You can add 0.48158 to that if you preregister for the jam.
    

So what does all of this tell us?

  1. Find a team or add reliable members to your team. The more the merrier.
  2. Make sure you pay attention to character design. Clearly it’s important. One could make an argument, here, that abstract characters (like glowing spheres) aren’t as interesting to players. Or, it could just be a spurious correlation.
  3. If it makes sense, add online multiplayer. Apparently, people like playing with other people. Who would’ve guessed?
  4. Preregister for the jam. This is linked to a broader point - be prepared and informed. Know when a jam is going to happen and jump in as soon as you can.

I realize that some of these conclusions might be fairly obvious, but I had fun playing with the numbers. Hope it helped shed some light on the results of our jam!

I am in awe of this amazing breakdown of the MegaJam! I really like the conclusions you’ve added at the end as well. Did you try to analyze if there were biases in scoring based on genre?

This is really quite fantastic and a great tribute to a great stream/jam.

It seemed pretty obvious to me that people could have done better on character design.
I wonder if there was a feel-good factor about playing a person rather than the game problems that were inherent to the games.
Team size should win out, but obviously it shows how it can also be a problem; in practice it played back into the hands of the smaller teams and gave them some mercy.

Actually my thoughts were:
1) Make a 1-level game.
2) Give it at least 3 areas.
3) Have about 3 varied mechanics; they only have to be variations rather than completely different mechanics.
4) Include clear instructions, as player intuition levels were low. (No one would quite know what to expect or how to play.)
5) Get a character artist.
6) Get a musician, or at least some decentish music that plays for the whole game length.

I thought about analyzing genre (and perhaps a couple of other things) but we don’t have those numbers readily available, so I’d have to go back through all of the entries and tally those up. I might do that in the next couple of days, though, if there’s sufficient interest. Would you like those analyses for the October 27th livestream?

“Frankly, with data collected from Humans, it’s rare to see data that appear this close to normal.”

The one problem you have here is the audience being aware of Allar's bias ahead of time.

I mentioned it to him and he said it didn’t seem to make a difference if he said a number before the audience voted, or not.

I disagree. People are nothing but biased. They hear his number and the majority will flock around that value.

Next time, I suggest a separate poll setup that isn’t visible until the voting has closed.

That would at least eliminate the entire bias of the reviewer + audience.

Otherwise, crazy good analysis based on the information at hand.

Cheers.

Indeed. It does seem fairly obvious to improve upon character design, but it’s nice to have more concrete numbers. These data suggest that an absolutely amazing character design can contribute as much as 3-4 points to a game’s overall score, which is far more than I would’ve expected. I think there definitely is a “feel good” factor here. We tend to empathize with characters that are more like ourselves and, well, it’s hard to empathize with an abstraction.

I agree with your thoughts, but I would actually modify your fourth point. Instead of making clear instructions, try to make a game that requires no instructions at all, or 2-3 lines of text, at the absolute maximum. If text is absolutely necessary, then include any text instructions in the first few seconds of game play. Most players won’t read instructions and, even when they do, instructions tend to be forgotten very quickly. (FWIW: I’m an academic with a background in psychology, robotics, and user interfaces, so I’m pretty darn familiar with training, memory, and attention as they relate to conveying information in digital media.)

I agree. Fortunately for the analyses presented above, the bias you mentioned would only apply to the correlation/regression found between Allar's scores and the chat scores. In psychology, the bias you’re referring to is called Anchoring (or Focalism), and it can be a powerful tool for negotiation. Unfortunately, I don’t have the necessary data to estimate the potential effect of the stated bias. :frowning:

However, because AllarScore was the only dependent variable used in the second regression and the multiple linear regression models, ChatScore would likely have had little-to-no effect on the relationships between AllarScore and the other variables analyzed. Unless, of course, the chat piped up and offered a score before Allar did, potentially biasing him… but based on my observation of the stream, it was very rare for chat to offer a score first.

Oh, and thanks! I’m glad people appreciate these statistics.

:smiley: way to go above and beyond doctor :smiley:

This is awesome.

I did plan to write a chat bot for voting that would have made the two completely independent; unfortunately, I didn’t get around to it due to Steam Dev Days.

I wish there was a way to automate all this data for future jams / streamers, lots of good information here.

I thought my scores were far less evenly distributed than they turned out to be, so, yay!

The world needs more stats!