You use RAM to do the pattern matching and then a database to store the number of occurrences of the patterns that are of interest.
Remember that you should decompress the data in RAM before you do the pattern matching. Trying to compress already-compressed data usually results in more data, not less.
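Just as a rough sketch of that division of labor (the fixed pattern length, the SQLite file, and the table layout here are all my own assumptions, not anyone's actual implementation):

```python
import sqlite3

PATTERN_LEN = 8  # assumed fixed pattern size in bytes

def count_patterns(payload: bytes) -> dict:
    """Scan the (already decompressed) payload in RAM and count
    every fixed-length byte pattern it contains."""
    counts = {}
    for i in range(len(payload) - PATTERN_LEN + 1):
        chunk = payload[i:i + PATTERN_LEN]
        counts[chunk] = counts.get(chunk, 0) + 1
    return counts

def store_interesting(counts: dict, min_hits: int = 4) -> None:
    """Persist only the patterns that repeat enough to be interesting."""
    db = sqlite3.connect("patterns.db")
    db.execute("CREATE TABLE IF NOT EXISTS hits (pattern BLOB PRIMARY KEY, n INTEGER)")
    for pattern, n in counts.items():
        if n >= min_hits:
            db.execute(
                "INSERT INTO hits (pattern, n) VALUES (?, ?) "
                "ON CONFLICT(pattern) DO UPDATE SET n = n + excluded.n",
                (pattern, n),
            )
    db.commit()
    db.close()
```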
said by yaplej:
You realize you are just describing almost any general data compression algorithm out there. See for reference: pkzip, zlib, deflate, etc.
So let's take this in another direction. How about analyzing each Wireshark stream individually for any possible patterns and only storing those, if any? Then analyze the collection of matched patterns for the most common ones, roughly as in the sketch below.
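A minimal version of that per-stream idea, using Python's collections.Counter; the stream payloads are assumed to have been extracted already (e.g. with Wireshark's "Follow Stream" or tshark), and the pattern length and thresholds are again assumptions:

```python
from collections import Counter

def patterns_in_stream(payload: bytes, plen: int = 8, min_hits: int = 2) -> set:
    """Return the byte patterns that repeat within one stream."""
    c = Counter(payload[i:i + plen] for i in range(len(payload) - plen + 1))
    return {p for p, n in c.items() if n >= min_hits}

def top_common_patterns(streams: list, k: int = 10) -> list:
    """Count, across all streams, how many streams each repeating
    pattern showed up in, and return the k most common."""
    across = Counter()
    for payload in streams:
        across.update(patterns_in_stream(payload))
    return across.most_common(k)
```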
So after a few packets you notice a pattern that could be substituted. First, it's a lousy codec if that pattern is obvious rather than a function of the particular voice characteristics of that one call. But even besides that, by the time you realize "Hey, there's a pattern here," the packets should have already been sent. If they haven't, they are now useless as part of the conversation.
It's a lot less storage if you only find a few patterns per session.
But even if you do realize that 001010...1010 is a repeating pattern, you now must tell the other end that this particular pattern repeats, and that if they see a particular bit-pattern "key" come across, it should be replaced with the expanded value, as in the toy sketch below. Of course, this takes some extra data across the line, but it might save us some in the long run.
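To make that overhead concrete, here is a toy sketch of the key-substitution scheme. The escape byte and one-byte key space are assumptions, and a real implementation would also have to handle the escape byte occurring naturally in the data (escaping it, which itself costs extra bytes):

```python
ESC = 0xFE  # assumed escape byte marking "next byte is a dictionary key"

def encode(payload: bytes, dictionary: dict) -> bytes:
    """Replace each known pattern with ESC + its one-byte key.
    `dictionary` maps key (int) -> pattern (bytes) and must already
    have been shared with the far end, which is extra data on the line."""
    out = payload
    for key, pattern in dictionary.items():
        out = out.replace(pattern, bytes([ESC, key]))
    return out

def decode(data: bytes, dictionary: dict) -> bytes:
    """Expand ESC + key back into the original pattern."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data):
            out += dictionary[data[i + 1]]
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```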
But that particular bit pattern that we've noticed repeating... may never repeat again. We have no way of knowing, because this is a real-time protocol: we can't look ahead or go back in the past; we can only look at a very small window of time. And the overhead of dictionary-based compression (system resources, sharing the dictionary across the link, and encoding the substituted values) becomes far more than what we would save.
I would suggest reading up on data compression algorithms first. I have a feeling there is a lot more to compression than what you might know already.
Seems like if you had 1,000 calls to analyze, you could quickly get an idea of whether there are any common patterns for a particular codec.
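Offline, that analysis is easy enough to run. A short usage sketch, reusing the top_common_patterns function from the earlier snippet and grouping captures by codec (the capture list and codec labels are hypothetical):

```python
def common_patterns_by_codec(captures: list) -> dict:
    """captures: list of (codec_name, payload_bytes) pairs, e.g. from
    1,000 recorded calls. Returns the top repeating patterns per codec."""
    by_codec = {}
    for codec, payload in captures:
        by_codec.setdefault(codec, []).append(payload)
    return {codec: top_common_patterns(streams)
            for codec, streams in by_codec.items()}
```

Of course, even if this turns up codec-specific patterns, it's an offline batch job; it doesn't answer the real-time objection above.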