DataSynthesis Platform - Synthetic data building, generating platform for multiple business types

Project-Herophilus/DataSynthesis

This project remains here for existing users who have worked with it before. We are in the midst of moving it to its own standalone project.

Background

As we thought about how to help healthcare, we kept returning to the belief that data is the asset and must be core to our mindset. A key part of that is a focus on a wide variety of data enablement capabilities. Our logic is simple: for years, companies have focused on most aspects of development, from tooling to building the next generation of solutions that support their business needs and provide value. However, building great software for today's needs requires data, in many cases massive amounts of it. It is a HUGE business and technical benefit if that data closely resembles production data. Since data is the electricity that powers business and the cornerstone of companies’ success in the digital era, we wanted to take a more comprehensive approach to enabling organizations around synthetic data.

Synthetic data is defined as "any production data applicable to a given situation that are not obtained by direct measurement" according to the McGraw-Hill Dictionary of Scientific and Technical Terms, while Craig S. Mullins, an expert in data management, defines production data as "information that is persistently stored and used by professionals to conduct business processes." With these definitions, it is easy to see that creating synthetic data is an involved process that can be achieved in numerous ways. Our way was to create a platform to synthesize data (Data Synthesis) for multiple needs, based on items like industry standards, coded ontologies, vendor data models, and custom-defined models, all in an on-demand manner. With a focus on data, and specifically synthetic data, we wanted the platform's name to clearly express that focus; the name we settled on was DataSynthesis.

The idea for DataSynthesis is in NO WAY new or unique; its purpose is to help reduce and/or remove the struggle that every organization experiences around its data needs. What we believe makes this platform unique is our perspective and approach.

  • While there are numerous offerings out there, both open source and paid, we wanted to build something that could support not only data integration needs but also application development and integration needs.
  • As part of the Project Herophilus community, the intent is for it to support and enable the development of other capabilities. A complete list of components, from Connectivity, Data Real-Time Assets, Data Simulators, Data De-Identification and Anonymization components and more, can be found here.
  • Simplicity built for complex data and dataset needs. From its inception, the DataSynthesis platform has been designed to generate and/or build upon a concept of data attributes. There are currently 21 different data attributes it can use to create data structures.
  • Our focus is on enabling massive amounts of data to be used immediately or very quickly. We feel this helps reduce data breaches and information exposure. Why should organizations risk data breaches or the potential leakage of PHI (in healthcare) or PII (in any other industry)? In today's technology world we wanted to enable a new and different way to innovate within a data-driven organization, an extensible
  • Working with industry-based data implementations. Our focus is also on enriching the generated data with codes and code sets to ensure it matches existing data systems.
  • Generating industry standards. For healthcare specifically, this means HL7, FHIR, EDI and so forth. We are actively working on implementing FHIR and improving HL7.
  • Helping to create and grow "Data Driven Organizations". Being a data-driven organization requires an overarching information culture driven by data. An information culture involves not only deep knowledge of the organization's data but also a strong understanding of how that data relates to any specific testing needed, together with broad access and data literacy and appropriate data-driven decision-making governance and guidance processes. While it sounds complicated, it is really about providing businesses a means for data collection, cleansing, hosting and maintenance while mitigating the risk of a data breach through comprehensive testing processes and practices. Data-driven organizations can innovate continuously because they understand and can embrace new business models quickly. The focus of tooling in these organizations is typically to enable them.
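
The data-attribute idea in the list above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the platform's actual implementation or API: the generator functions, the value pools, the PERSON structure, and build_records are all hypothetical names invented here. It shows one possible way that independent attribute generators might be composed into a data structure and used to produce records on demand.

```python
import random
from typing import Callable, Dict, List

# Hypothetical attribute generators. The names and value pools are
# illustrative assumptions, not the platform's 21 actual data attributes.
AttributeGenerator = Callable[[random.Random], str]


def first_name(rng: random.Random) -> str:
    return rng.choice(["Alex", "Jordan", "Morgan", "Riley"])


def last_name(rng: random.Random) -> str:
    return rng.choice(["Nguyen", "Patel", "Garcia", "Smith"])


def phone_number(rng: random.Random) -> str:
    # 555-0100 through 555-0199 are reserved for fictional use.
    return f"555-01{rng.randint(0, 99):02d}"


def build_records(
    structure: Dict[str, AttributeGenerator], count: int, seed: int = 0
) -> List[Dict[str, str]]:
    """Build `count` synthetic records by calling each field's generator."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [
        {field: generate(rng) for field, generate in structure.items()}
        for _ in range(count)
    ]


# A hypothetical data structure composed from three attributes.
PERSON: Dict[str, AttributeGenerator] = {
    "first_name": first_name,
    "last_name": last_name,
    "phone": phone_number,
}

records = build_records(PERSON, count=3, seed=42)
```

Because no value is drawn from a real system, records built this way carry no PHI or PII, which is the point the list makes about reducing exposure.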

DataSynthesis Philosophy

DataSynthesis has always been intended to operate under the open/community source model. Its open source license is Apache-2.0. Our model is not a "freemium" or offering-based model with versions and scaled capabilities. Our approach is to provide the assets and rely on community enhancements and improvements to support the growth of the platform's underlying needs, including data access capabilities. The core assets provided include a highly flexible and extensible data tier and APIs that enable the platform to be both accessed and extended; at some point there will also be a WebUI.

DataSynthesis: Getting More Familiar

DataSynthesis consists of three core modules: data, APIs, and a web interface.

Area            | Sub-Module
Data            | DataTier
APIs            | DataTier APIs
User Interface  | DataTier Web User Interfaces

Enjoy and Happy Coding!!!