Data compression usage

I’m working on a new corpus for testing general-purpose data compression codecs and, since I know games are major users of compression, I’d like to make sure they are represented. However, my knowledge of game design is limited (and I’m being generous there), so I’m hoping for some advice on choosing data for the corpus.

What I’m looking for right now is information on what types of files people need to compress. For example, I put together a quick summary of what Unvanquished stores in its *.pk3 files: for each extension, the total size of all files with that extension, the number of such files, and their average size. If you could provide similar information for any game(s) you have access to, it would be very helpful. I’m sure each game will be different, so as information comes in from different games I hope to build up a better picture of the average composition.
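If it helps, here is a rough sketch of how such a summary can be generated. Since *.pk3 files are ordinary zip archives, Python's standard-library `zipfile` module can read them directly; the function name `summarize` and the output format are just my own choices for illustration:

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

def summarize(pk3_path):
    """Tally per-extension stats for one .pk3 (a plain zip archive).

    Returns a dict mapping extension -> [file count, total uncompressed bytes].
    """
    stats = defaultdict(lambda: [0, 0])
    with zipfile.ZipFile(pk3_path) as z:
        for info in z.infolist():
            if info.is_dir():
                continue
            # Zip member names always use forward slashes.
            ext = PurePosixPath(info.filename).suffix or "(none)"
            stats[ext][0] += 1
            stats[ext][1] += info.file_size  # uncompressed size
    return dict(stats)
```

To print a table like the one I made, sort by total size and derive the average from the count and total, e.g. `for ext, (n, total) in sorted(summarize("map.pk3").items(), key=lambda kv: -kv[1][1]): print(ext, n, total, total // n)`.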

If you’re not comfortable posting the information publicly, feel free to contact me via PM (or e-mail, IRC, etc.).

Once I have a handle on the types of things which are common I plan to start looking for representative data I could use which is licensed in a usable way. It doesn’t have to be open-source, though that would be best; as long as it is redistributable for the purposes of benchmarking, codec development, etc., it’s acceptable. I’m getting a bit ahead of myself, but if anyone has some data they could share I’d be very interested.

Note: I’m also trying to contact people using other engines. If you’re interested in the overall results, I intend to summarize everything at

Obviously I was hoping for more of a response. Perhaps I shouldn’t have posted on a Friday evening, or perhaps it would help to explain why helping with this may benefit you in the long term… TL;DR: if you participate, future compression codecs will be better tuned for the type of data you are using.

The main benefit is that people developing compression algorithms will be using this corpus to help tune their codecs. As soon as the corpus is ready I’ll be switching the Squash Compression Benchmark over to it, and I know several codec developers have been using that benchmark to help tune their codecs. There are also other benchmarks, and of course codec developers download the data to run their own tests.

It’s important to understand that compression algorithms vary wildly in compression ratio, speed, and memory usage depending on the type of data they are processing. This isn’t just a matter of plain text compressing much better than, say, a JPEG (though that is true, of course). Some codecs compress text very well and quickly but can’t compress images nearly as well as other codecs, or take vastly more time/memory to do so. Some codecs do a great job with images but have very poor results for text. Some codecs work well for small files, some for large, and so on.
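A quick way to see this for yourself, using only Python's standard-library `zlib` (DEFLATE): highly repetitive text compresses many times over, while random bytes barely compress at all. The exact ratios depend on the codec and data, but the gap is always dramatic:

```python
import random
import zlib

random.seed(0)
# Highly redundant text vs. incompressible random noise of the same length.
text = b"the quick brown fox jumps over the lazy dog. " * 500
noise = bytes(random.getrandbits(8) for _ in range(len(text)))

for name, data in (("text", text), ("noise", noise)):
    packed = zlib.compress(data, 9)
    print(f"{name}: {len(data)} -> {len(packed)} bytes "
          f"({len(data) / len(packed):.1f}x)")
```

Real codecs differ in far more ways than this toy comparison shows (match finding, entropy coding, window size, …), which is exactly why tuning against representative data matters.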

The current standard corpus, the Silesia Compression Corpus, was developed in 2003 and doesn’t really reflect modern usage, but it’s what people still use to benchmark and tune codecs because there isn’t anything else.

So, if you can provide a brief summary of what types of content you are (or would like to be) compressing, there is an excellent chance that future codecs (and possibly future versions of existing codecs) will compress your data better, faster, and/or using less memory. If, on the other hand, game developers don’t help, the new corpus will probably not include any data relevant to games, and people will optimize compression codecs for the data it does include, quite possibly at the expense of the data you want to compress.