I wrote about data leaks before . A painful problem like this is an opportunity for some and we now have quite a few startups selling products to monitor data leaving a company's network for sensitive information. Vontu and Reconnex are a couple of them. Port Authority was another that was acquired by WebSense.
It is interesting to see how one can go about solving this problem. [Note: In this writeup I focus only on detecting leaks through the company's network. There is another ways in which information can be leaked: through storage devices like hard disks and USB flash. The techniques I mention here do not work for them.]
First we need to define sensitive data. A few items like social security numbers can be easily defined as regular expressions (ddd-dd-dddd, where d is a digit) and one can scan all network data for anything that looks like a social security number. But what about other information? We can apply the pattern matching approach to other structured information like patient records in a healthcare facility, account information in a bank etc. What about unstructured information like design documents or patent ideas communicated between team members in emails? It does not follow a pre-defined pattern. Is there a way of monitoring it? Fortunately, there is. One method is to use rabin fingerprints . Calculate these fingerprints for all potentially sensitive data and match it with the fingerprints calculated for network traffic. This method works well because even if the data was changed a little (like a section from a document was copy-pasted into an email etc) it is matched by the fingerprints.
An approach that combines pattern matching for known and/or structured data and fingerprinting for unstructured data works well in detecting unintended accidental data leaks in information passing through a company's network. A report says that 60% of the leaks reported so far are of this nature. So it is a useful approach. What about the other 40% intentional data theft? I will write about it another day but the first thing that will come into mind is to apply some kind of "locks" and "alarms". Locks in the digital world are cryptographic techniques and alarms are data access, modification and transmission logs.
[Some readers will note that I did not mention "watermarks". I consider that as a subset of structured data]
Add to Technorati Favorites