The Panama Papers leak has taken the world by storm. It has made the rounds, creating a strong buzz and shaking the world. While everyone else focused on the famous personalities named in these papers, for me it was all about the 11.5 million confidential documents, with details of more than 214,000 offshore companies, that made up this gigantic 2.6 TB database.
Ever since this unpleasant Panama Papers episode began, part of me has felt it was necessary. A fiasco of this scale was needed as an eye-opener for enterprises, small, medium or big. It has set an example, which I think will be remembered for a long time, of the importance of data processing in document management. The Panama Papers leak seems to have made everyone realize that adequate and appropriate approaches, backed by professional data processing expertise, are a must for handling present and future data challenges.
By now you may be curious about what a database of celebrities has to do with data processing and data management. What makes this leak so important from my point of view is its contents: about 320,000 text documents, 3 million database entries, 4.8 million emails, 2.15 million PDF files, 1.1 million images and more. It is considered one of the largest leaked data sets that journalists have ever had their hands on.
What was the role of data processing in document management?
The leaked database was processed end to end, with the help of the latest technology, by providers of expert data processing services so that it could be analyzed. This episode, I strongly feel, emphasizes not only the role technology played in helping the International Consortium of Investigative Journalists (ICIJ) expose one of the biggest money laundering scandals, but also the expertise and aptness of the data processing service providers.
The biggest challenge ICIJ faced was the variety of information handed over to them. Almost all the data they received was unstructured, as the image above shows, and very little or none of it was in a structured format. This led to the challenge of how to analyze unstructured data. To solve it, they had to bring in data processing professionals, who converted the entire database into a form that was consumable, queryable and searchable.
Data indexing was at the core of this intensive data processing effort.
- Some of the unstructured data was in the form of text files and HTML files. Search engines handle such formats easily, as they are essentially plain text.
- Compared with plain text, common word processor documents and presentations were somewhat more complex, since they also contain metadata and embedded content. These popular formats, though complex, did not need extra effort to index.
- However, container-like structures, such as folders and compressed archive files, hold several embedded objects along with metadata. Some search engines find it difficult to extract items from inside such containers and index them.
- Large, complex containers, such as Microsoft SharePoint, content management systems (CMSs) and email archives, hold even more embedded items. Metadata and embedded components may be stored separately, which is an early sign of indexing trouble and makes these systems a challenge for most search engines.
- The most severe indexing challenge came from compliance-based storage systems. They are highly secure and designed to obscure their output, so content is hard to read back once it has been stored in them.
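To make the container problem concrete, here is a minimal sketch in Python of how text items might be pulled out of an archive, including archives nested inside archives, and placed into a simple inverted index. It is an illustration only, not the tooling used on the Panama Papers. The function names, the list of text file suffixes and the demo archive contents are all assumptions made for this example.

```python
import io
import re
import zipfile
from collections import defaultdict

# Assumption: only these suffixes are treated as "basically plain text".
TEXT_SUFFIXES = (".txt", ".html", ".htm")


def iter_text_items(data, prefix=""):
    """Yield (path, text) for text-like members of a zip archive,
    recursing into zip files nested inside it."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            payload = archive.read(name)
            path = prefix + name
            if name.lower().endswith(".zip"):
                # A container inside a container: unpack it as well.
                yield from iter_text_items(payload, prefix=path + "!/")
            elif name.lower().endswith(TEXT_SUFFIXES):
                yield path, payload.decode("utf-8", errors="replace")
            # Anything else (images, PDFs) would need OCR or a parser first.


def build_index(items):
    """Map each lowercase token to the set of paths that contain it."""
    index = defaultdict(set)
    for path, text in items:
        plain = re.sub(r"<[^>]+>", " ", text)  # crude HTML tag removal
        for token in re.findall(r"[a-z0-9]+", plain.lower()):
            index[token].add(path)
    return index


def search(index, query):
    """Return the paths that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def make_demo_archive():
    """Build a small in-memory archive with made-up contents."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as z:
        z.writestr("memo.txt", "Shell company registered offshore")
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as z:
        z.writestr("notes.html", "<p>Offshore account opened</p>")
        z.writestr("scan.png", b"\x89PNG")  # not text, so it is skipped
        z.writestr("nested.zip", inner.getvalue())
    return outer.getvalue()


index = build_index(iter_text_items(make_demo_archive()))
print(sorted(search(index, "offshore")))
```

The image file in the demo archive is silently skipped, which mirrors the point made in the list: anything that is not plain text needs an extra extraction step before it can be indexed.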
These challenges were addressed with the latest technology. However, given the gravity of the episode and the confidentiality of the project, understanding the legal and financial documents was a bigger challenge. All this made it mandatory for the journalists to have a robust system for the far from trivial task of transforming unstructured data into a fully indexed, searchable database. Only an experienced data processing service provider with proven delivery experience, project management skills, execution capability and scalability can be a partner for this kind of data transformation, which turned out to be a nightmare for celebrities across the globe.
Image Processing and Optical Character Recognition
Not all files and documents were available in a textual, digital format. Images and PDFs make matters worse when it comes to processing data for analytics. The challenge with them is to work out what they contain, which language they are in, and how the text in one block connects with the text in other blocks. For example, a retrieval system cannot take a scanned newspaper page as direct input.
Such documents were therefore converted into a digital format, and the text was then extracted from them. This is exactly where OCR enters the workflow. It is the technology that reads handwritten, typewritten or printed text from images and produces a text file that an indexing engine can index.
Only data processing or data entry services equipped with state-of-the-art OCR tools can handle custom specifications of font name, size, spacing and so on, and adjust to different aspect ratios and scales. They extract text from images and PDFs with high precision.
The practices these journalists used to compile lists of high-profile individuals seem very simple, but they were not. Nor did the journalists do it all themselves; they were wise enough to hire data management professionals, and with the help of their data entry and data processing services, they successfully identified the links between these individuals and offshore entities.
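Once names and entities have been extracted, identifying links between them can be thought of as a lookup over pairs. The sketch below, in Python, is one simple way to picture it and is not a description of the system the journalists used. The records are invented placeholders, and the function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical, made-up (person, offshore entity) pairs for illustration.
records = [
    ("Person A", "Entity X Ltd"),
    ("Person B", "Entity X Ltd"),
    ("Person B", "Entity Y Corp"),
    ("Person C", "Entity Z Inc"),
]


def build_links(pairs):
    """Index the pairs in both directions for quick lookups."""
    entities_by_person = defaultdict(set)
    persons_by_entity = defaultdict(set)
    for person, entity in pairs:
        entities_by_person[person].add(entity)
        persons_by_entity[entity].add(person)
    return entities_by_person, persons_by_entity


def co_linked(person, entities_by_person, persons_by_entity):
    """Return other persons who share at least one entity with `person`."""
    others = set()
    for entity in entities_by_person.get(person, ()):
        others |= persons_by_entity[entity]
    others.discard(person)
    return others


entities_by_person, persons_by_entity = build_links(records)
print(sorted(entities_by_person["Person B"]))
print(sorted(co_linked("Person A", entities_by_person, persons_by_entity)))
```

Keeping the mapping in both directions means either question, which entities a person is tied to or which persons sit behind an entity, is a single dictionary lookup.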
We all know that what made the Panama Papers leak matter is the information those documents contained. The lesson, however, is that the right data management and data security measures, applied by the right data management company, are the answer. Data mismanagement and inadequate data security measures are what enabled people to carry out this operation. Choosing a data management outsourcing company that follows the right practices for data security, transfer and management is the right step forward.
Image via icij.org