I’m working on creating a new corpus for testing general-purpose data compression codecs and, since I know games are major users of compression, I’d like to make sure they are represented. However, as my knowledge of game design is limited (and I’m being generous there), I’m hoping for some advice on choosing data for the corpus.
What I’m looking for right now is information on what types of files people need to compress. For example, I put together a quick summary of what Unvanquished stores in its *.pk3 files, which includes the extension, the total size of all files with that extension, the number of files with that extension, and the average size of files with that extension. If you could provide similar information for any game(s) you have access to, it would be very helpful. I’m sure each game will be different, so as I get information from different games I hope to get a better understanding of the average composition.
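If it helps, here is a minimal sketch in Python of how a summary like that could be generated. It assumes your game's archives are ordinary zip files (as *.pk3 files are) and it counts uncompressed sizes, which is my assumption about the most useful figure. The function names `summarize_archives` and `format_report` are just illustrative; other container formats would need their own reader.

```python
import os
import sys
import zipfile
from collections import defaultdict


def summarize_archives(paths):
    """Return {extension: [total_bytes, file_count]} across the given zip archives."""
    stats = defaultdict(lambda: [0, 0])
    for path in paths:
        # Skip anything that is not a readable zip archive.
        if not zipfile.is_zipfile(path):
            continue
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Group case-insensitively; files without an extension share one bucket.
                ext = os.path.splitext(info.filename)[1].lower() or "(none)"
                stats[ext][0] += info.file_size  # uncompressed size (assumption)
                stats[ext][1] += 1
    return dict(stats)


def format_report(stats):
    """Render tab-separated rows, largest total size first."""
    lines = ["extension\ttotal_bytes\tfiles\taverage_bytes"]
    for ext, (total, count) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
        lines.append(f"{ext}\t{total}\t{count}\t{total / count:.1f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Usage: python summarize.py pak0.pk3 pak1.pk3 ...
    print(format_report(summarize_archives(sys.argv[1:])))
```

Running it over all of a game's archives at once gives a single combined table, which is the form that is easiest for me to compare across games.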
If you’re not comfortable posting the information publicly, feel free to contact me via PM (or e-mail, IRC, etc.).
Once I have a handle on which types of files are common, I plan to start looking for representative data that is licensed in a usable way. It doesn’t have to be open-source, though that would be best; as long as it is redistributable for the purposes of benchmarking, codec development, etc., it’s acceptable. I’m getting a bit ahead of myself, but if anyone has some data they could share I’d be very interested.
Note: I’m also trying to contact people using other engines. If you’re interested in the overall results, I intend to summarize everything at https://github.com/nemequ/squash-corpus/issues/6.