The recent Panama Papers news stories wouldn’t have been possible without big data. If humans had to sort, process and analyse the reported 11.5 million files by hand, some of which were paper based, we would still be waiting this time next year for juicy details about David Cameron’s family tax affairs.
Or maybe we would never have heard about these revelations at all, because reading millions of documents is one thing; reviewing terabytes of data to uncover links, trends and relationships between many thousands of pieces of information is quite another.
Big data has revolutionised information discovery in situations like this. It enables data to be dissected in a matter of seconds and insights to be pulled out with keyword searches, within a system that instantly links related information and starts to surface interesting relationships between the individuals mentioned in the data.
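As a toy sketch of what that linking looks like (this is not the ICIJ’s actual system, and the documents and names below are invented for illustration): if two names co-occur in the same document, record that as a relationship, and the pairs that co-occur most often become the relationships worth investigating.

```python
from collections import defaultdict
from itertools import combinations

# Toy example: link individuals who are mentioned in the same document.
# Document contents and names here are invented placeholders.
documents = {
    "doc1.txt": {"A. Smith", "B. Jones"},
    "doc2.txt": {"B. Jones", "C. Garcia"},
    "doc3.txt": {"A. Smith", "B. Jones", "C. Garcia"},
}

# Map each pair of names to the documents that mention both of them.
links = defaultdict(list)
for doc, names in documents.items():
    for a, b in combinations(sorted(names), 2):
        links[(a, b)].append(doc)

# The pair that co-occurs in the most documents is the strongest link.
strongest = max(links, key=lambda pair: len(links[pair]))
```

Real tools do far more (entity extraction, fuzzy name matching, graph visualisation), but the underlying idea of turning co-occurrence into relationships is the same.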
The Panama Papers leak contained a reported 5 million emails, 3 million database files, 2 million PDFs, 1 million images and over 320,000 text files. Some of these files are searchable as they stand, but many are PDFs or images that cannot be searched directly. Big data technology and complementary tools mean that documents can be processed with Optical Character Recognition (OCR), where images containing written information are ‘read’ and converted into text – making images searchable and open to analysis. Additionally, many document management systems collect metadata associated with each file and store it as searchable data – such as who saved the file, when it was saved, and where it was stored (i.e. was it in a folder marked “The Camerons” 😉 ).
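A minimal sketch of that kind of metadata capture, using only Python’s standard library (the field names are illustrative, not those of any particular document management system):

```python
import os
import time

def collect_metadata(path):
    """Collect basic, searchable metadata for one file (illustrative fields)."""
    stat = os.stat(path)
    return {
        "path": path,                         # full location of the file
        "folder": os.path.dirname(path),      # e.g. a tellingly named folder
        "size_bytes": stat.st_size,           # file size on disk
        "modified": time.strftime(            # when it was last saved (UTC)
            "%Y-%m-%d", time.gmtime(stat.st_mtime)
        ),
    }

def index_tree(root):
    """Walk a directory tree and collect a metadata record for every file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            records.append(collect_metadata(os.path.join(dirpath, name)))
    return records
```

Filesystem metadata is only part of the picture – office documents and emails carry their own embedded metadata (authors, senders, timestamps) – but the principle of storing it all as searchable records is the same.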
The data collected in the leak amounted to 2.6TB, with a date range of 1977 to 2015. It showed that Mossack Fonseca worked with over 14,000 banks, law firms and middlemen to deliver its services – now how would even a department of 100 analysts begin to link those 11.5 million files to 14,000 clients within a reasonable timeframe?
And how would that data be stored in a way that trends and relationships could be easily spotted?
It’s no wonder that in large-scale investigations (such as the notable Enron case), big data tools are the go-to solution for making sense of the data at hand.
And that is exactly what big data is all about – making sense of huge volumes of information, turning seemingly unimportant reams of emails into a story when they are linked together and layered against other contextual information.
The International Consortium of Investigative Journalists (ICIJ), which handled the Panama Papers, fed all of the documents into its big data tool, extracting text and file metadata from each one. It took roughly two weeks from starting that ingestion to having a database that could be queried and searched.
Once the information was centralised and in a sensible format, it became possible to run keyword searches against the database, which is probably how key public figures were uncovered as being involved in ‘unusual’ tax activities.
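To make the idea concrete, here is a small sketch of a centralised, keyword-searchable document store using SQLite’s built-in full-text search – a stand-in for whatever the ICIJ actually used, with invented placeholder documents:

```python
import sqlite3

# Build a tiny full-text index in an in-memory SQLite database (FTS5).
# The filenames and contents below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(filename, body)")
conn.executemany(
    "INSERT INTO docs (filename, body) VALUES (?, ?)",
    [
        ("memo_001.txt", "Transfer approved for the offshore holding company."),
        ("report_2010.txt", "Annual report mentions the holding fund in passing."),
        ("invoice_443.txt", "Invoice for registering a shell entity in Panama."),
    ],
)

def search(conn, term):
    """Return filenames of documents matching a keyword, best match first."""
    rows = conn.execute(
        "SELECT filename FROM docs WHERE docs MATCH ? ORDER BY rank", (term,)
    )
    return [r[0] for r in rows]

print(search(conn, "offshore"))   # prints ['memo_001.txt']
```

A single laptop can do this over thousands of files in seconds; the big data part is doing the same thing over millions of files and terabytes of extracted text.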
Without big data’s ability to look across millions of files and pick out every instance of a particular name – whether it appears many times in a clearly marked document, or just once, buried deep within a 30-page report – the data could not have been linked together so quickly, and the key insights we have read about in the past few days charting the Cameron family’s money flow through offshore tax havens would not have been possible.
Coming back to my earlier point about Enron: there is now an Enron Corpus, a publicly accessible database of over 600,000 emails generated by over 150 Enron employees, which researchers and anyone with a keen interest in data analytics can get their hands on.
Maybe this is where the Panama Papers will end up once the news stories die down?