Digitalization Alone Does Not Enlighten Dark Data


Data silos are persistent, and often aren’t resolved just by digitalization technologies.

For a number of years, it has been fashionable to speak about data silos and dark data, which have piled up in the research and development departments of chemical business-to-business (B2B) organizations. The root cause of these data silos was the step-by-step digitalization of R&D processes beginning in the 1970s and 1980s. If we look at the history of dark data in the chemical industry, we can see that even as companies try to catch up on digitalizing R&D processes, the inertia of the status quo is very difficult to overcome. We have found that the key components for overcoming data isolation include legal aspects, social aspects such as business transformation through a change in traditional thinking, and technology aspects such as platform approaches.

The History of Dark Data

In the life science and chemical industries, digitalization began with several independent parallel streams. The electronic drawing, storage and searching of chemical structures (first 2D, then 3D) and the computational prediction of chemical properties from chemical structures led to computational molecular modeling and to the concept of storing chemicals and associated information in chemical databases. These took the form of enterprise registration systems, chemical inventory systems, and many specialized local material and formulation databases.

Independently, the digitalization of transactional sample management took place. For the processing and distribution of samples to analytical labs, and the recording of analytical and test results for such samples, laboratory information management systems (LIMS) became very successful in managing the sample flow between laboratories. Subsequently, electronic laboratory notebooks (ELNs) flourished, capturing all relevant intellectual property (IP) in research data in a more flexible and descriptive manner. The data stored in ELNs is more easily understood by humans, which helps to generate patents and protect the company's IP.

As the boundaries between chemistry and biology research are vanishing, we should also add the various informatics systems that biologists use to the mix of digital solutions. Finally, there are many systems replacing regular paper documents, such as local file shares and document management systems, as well as all the public information and data available outside the company's firewall.

Now it becomes obvious what data silos are: each of the digital systems mentioned has its own technology with its own proprietary way of storing data, and accessing that data requires special knowledge. Large companies usually have hundreds of such data sources, making it difficult to leverage the data effectively in a holistic manner. It is difficult to contextualize scientific results by associating them with related scientific data from other, segregated data sources. Such information, sitting in a “data silo” that is separated from other data sources, is often called dark data, since it is invisible to a wider audience. The data therefore eludes interpretation and stays “dark”.

Within research departments alone, that is a tremendous issue, since it is hard to leverage all the historical data collected over previous decades. If we broaden our view, we realize that there are often strict borders between research, development, manufacturing and marketing departments, and between the IT systems within those departments. The transfer and exchange of information and data is crucial for fast innovation in times of shorter product life cycles. There is a tremendous amount of unstructured data exchange, very often carried out manually in email programs, which are not made for long-term storage. Such information, which is often the basis of fundamental business decisions, is easily lost over time.

The Status Quo in the Business-to-Business (B2B) Chemical Industry

Compared to the life sciences industry, the chemical industry lags behind in digitalizing R&D processes, but chemical companies are speeding up the digitalization of R&D labs by introducing ELN and LIMS technologies, and are capturing more and more data electronically. This is necessary since the chemical industry needs to react more quickly to the rapidly changing demands of customers and consumers. With that, we are facing a severe new issue with R&D digitalization:

The question of how to preserve data security while preventing dark data.

This problem is unique for two reasons:

First, data captured in paper notebooks and dark data in data silos (e.g. lab-specific local material databases) are by their nature secure, since such data is not available to anybody outside a lab.

Second, in large chemical B2B companies the problem is often even more severe than in the life sciences industry, where the key is to protect IP against external threats and competition. This is surprising, because it is well known that in pharma a drug candidate can be worth multiple billions, whereas in chemical companies such huge financial amounts are normally not tied to a single molecule or formula. But due to the contractual situation with business partners, scientists in the chemical industry not only need to hide and secure data from the outside world but also need to keep R&D data confidential from other departments. Very often, this confidentiality restricts information even within their own department, or from laboratory colleagues sitting right next to them. Contracts are often so tight that researchers who are allowed to see confidential data need to be individually listed in them.


More than a hundred years of paper-based workflows, data silos and dark data, together with strict B2B confidentiality contracts, have led to the set-in-stone mindset that “data cannot and should not be shared”. Each researcher is the master of their results and decides on their own what data is shared with whom and when (if at all), by handing over a document with a confidentiality notice and carefully pre-selected data.

At the turn of the millennium, companies in the life sciences industry had to overcome a similar mindset. When they moved from paper to electronic lab journals, the prevailing belief was that only paper signed in ink was secure enough to be used as evidence in court to protect IP. Although electronic and digital signature technologies existed and were well supported by electronic lab journals, it took 10 to 15 years until companies stopped printing and wet-signing their already electronically stored R&D experiments. Today, even more sophisticated data security technologies such as blockchain are being discussed, but even if this highly secure modern technology had been available 20 years ago, it would not have helped to overcome the traditional way of thinking.

The ELN history shows that it is naïve to believe today's security concerns over sharing information in B2B industries can be easily solved by technology and digitalization alone. While one of the key benefits of ELN implementations is sharing and learning from R&D data, we have observed ELN implementations with overwhelming security functions that replicate traditional paper-based workflows and security behaviors, leading to hundreds of data silos within a single enterprise system. Because fine-grained security is possible, some companies use it for data isolation rather than for data sharing and analysis, and thereby continue their traditional way of working as if on paper. The notion is that it is better not to disclose anything than to risk a contract breach.

This mindset works against all sharing and collaboration initiatives, and thereby hobbles modern paperless, data-driven, single-source-of-truth IT strategies and platform approaches, which support the mission of immediate access to data and information and thereby foster innovation through collaboration.

How Can We Solve the Dark Data Issue?

Many strategies and technologies for accessing and contextualizing data silos exist. Typical approaches are federated searching versus building data warehouses or data lakes and, more recently, the innovative indexing of data. All these methods enable modern business analytics. However, as discussed before, things start to get complex and confusing if we try to add the sophisticated, fine-grained source data security of a state-of-the-art ELN system and then connect to data from other data silos with completely different security systems and access permissions. Some software companies offer technology to carry over the security of each single data source, but this adds a lot of overhead, because the document data access matrix of the sources has to be maintained for each single scientist. Such strategies put the effort into even more data separation and security rather than into data sharing, and therefore do not solve the dark data issue. Data often stays dark in federated databases if source data permissions are kept.

Now it is more than obvious that the chemical industry has to overcome both a legal and traditional mindset issue as well as a technology issue.

Changing this legal mindset is much more difficult than implementing a software solution, since it requires strong leadership involvement to change historically grown data protection behavior, and to convince legal departments and customers that increased collaboration will lead to faster innovation and therefore better products for business partners. Establishing an information-sharing culture across multiple departments likewise requires strong leadership and top-level decisions that go far beyond the competencies of software implementation project teams. Companies have to discuss and agree not only on global master data to reach standardization, but also on standards for whether and how data can be disclosed, which is even more difficult. Ideally, such a new business philosophy is established prior to lab digitalization initiatives.

The question is where to draw the line and which data can be disclosed and should be accessible for colleagues.

We have worked with companies that would not even allow the statistical analysis of the relationships between samples and their performance criteria, since it is already seen as a security risk if somebody finds out that a sample with a specific characteristic exists within the company, even if nothing reveals which compound, material or other entity the sample was taken from. In the chemical B2B industry, not only the material but also the customer and the project need to stay secret. The current mindset is that it is easiest and most secure not to disclose anything, which is the root cause of dark data and data silos.

A much better approach would be to set up simple, clear rules such as the following:

  1. An ELN experiment combines sample background info (material definition, process parameters, …) with performance test results as well as customer info (project, customer name, …) and therefore should only be disclosed if the author does not see a security problem or contract breach. That is normally easy to achieve, since ELNs provide security mechanisms.
  2. It should be possible to extract material information plus test and performance results if needed for a lab or business unit, while hiding client data if necessary.
  3. Samples can be managed in LIMS systems with only minimum source information (only a link to the originating ELN experiment) together with test and performance results, and this data should be globally accessible. Access to the source information remains restricted to the original owner.
  4. There should be dedicated data analyst teams who are able to access data independently from data source security restrictions.

The benefit would be tremendous: it would be possible for departments to feed their data into machine learning algorithms to create their own local models for predicting the next, better performing material. With that, there could also be global searching and data analytics available for specialists in order to find materials with the required performance across all internal department barriers. Samples with interesting profiles would show the link to the associated experiment and its author, so scientists could request the disclosure of the experimental data.

Establishing and introducing such rules sounds easy but is certainly not simple since it requires involvement of legal departments to review current contracts with customers and adjust them accordingly.
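
To make the rules above more concrete, here is a minimal Python sketch of how such disclosure levels could be expressed. It is an illustration under stated assumptions, not a description of any particular ELN or LIMS product: the record layout, the field names and the functions `business_unit_view`, `global_sample_view` and `find_candidates` are all hypothetical.

```python
from dataclasses import dataclass, asdict

# Hypothetical record layout for one ELN experiment (an assumption for this
# sketch; real ELN data models differ).
@dataclass
class ExperimentRecord:
    experiment_id: str        # link back to the originating ELN experiment
    author: str               # owner who decides on full disclosure (rule 1)
    material: str             # sample background info
    process_parameters: dict  # sample background info
    test_results: dict        # performance test results
    customer: str             # client data
    project: str              # client data

# Fields treated as client data in this sketch.
CLIENT_FIELDS = {"customer", "project"}

def business_unit_view(rec: ExperimentRecord) -> dict:
    """Rule 2: material info plus test results, with client data hidden."""
    return {k: v for k, v in asdict(rec).items() if k not in CLIENT_FIELDS}

def global_sample_view(rec: ExperimentRecord) -> dict:
    """Rule 3: only a link to the experiment, its author, and the results."""
    return {
        "experiment_link": rec.experiment_id,
        "author": rec.author,
        "test_results": dict(rec.test_results),
    }

def find_candidates(records, metric: str, minimum: float) -> list:
    """Global search over the shared sample views only.

    Returns the views whose result for `metric` is at least `minimum`.
    Each hit names the experiment and author, so a scientist can request
    disclosure of the full experiment from its owner.
    """
    hits = []
    for rec in records:
        view = global_sample_view(rec)
        value = view["test_results"].get(metric)
        if value is not None and value >= minimum:
            hits.append(view)
    return hits
```

The point of the design is that the global view never contains material, customer or project details, so it can be searched across department barriers, while the decision to disclose the full experiment stays with its author.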

The Future: Business Transformation and Introducing a Digital Platform

In recent years, some of the most profitable and successful companies have been those that adopted a digital platform model, a strategy in which disparate groups interact over a platform to co-create value by using various technologies in combination; for example, recruiters and employees on LinkedIn, and drivers and customers on Uber.

With the 3DEXPERIENCE platform, Dassault Systèmes puts the consumer as well as the researcher at the center. For the first time, this successful PLM concept is now utilized by the life sciences and chemical industries through scientific R&D solutions on the 3DEXPERIENCE platform. But to be successful on the 3DEXPERIENCE platform, the chemical industry must open up its data protection behavior in order to benefit from the single source of truth and data continuity paradigms deeply implemented in a platform approach. It makes no sense to carry the historical boundaries over into modern communication platforms like the 3DEXPERIENCE platform in the cloud.


The chemical industry will be most successful and innovative if it does not dismiss the digitalization wave, often called Industry 4.0. This requires the introduction of modern R&D platforms that allow the communication and sharing of data in modern digital systems in the cloud. As we learn from other industries, the platform approach is a critical and proven success factor. What is often neglected by IT and digitalization projects is that it has to go along with a real change in mindset and in company culture, especially concerning historically grown security and protective data separation behaviors, and with the required business transformation, which is a hugely underestimated problem. Without changing and opening up data-protective thinking, such digitalization projects might fail. This requires leadership teams from legal and R&D to discuss new contracts with their customers that allow data visibility within the company, while simultaneously introducing simple data security rules, training employees on such data security concepts and contracts, and establishing company non-disclosure policies toward outside companies. Then chemical companies can achieve tremendous advantages from the modern platforms introduced in digitalization projects.

Björn Loeprecht

Dr. Björn Loeprecht is an Industry Process Consultant at BIOVIA, Dassault Systèmes. He has helped build strategic relationships with a number of key Dassault Systèmes partners by designing scientific innovation and information strategies. He has consulted for many chemical and life science companies on streamlining and digitalizing their R&D processes. Dr. Loeprecht holds a Ph.D. in Theoretical Chemistry from Leipzig University.
