In this blog post, we are going to examine the problem of Data Integration.
In previous parts of this series we covered the following levels of Maturity model of productive analytics platform:
- Data Engineering
- Data Curation
- Data Quality
To get curated and cleaned data into the system the first thing that pops up is technical integration. Keep in mind that data may be coming not only from single source, but from several and the way how it could be transferred is different: by streaming or sending downloaded data (batch). Loaded data can be added to a new table, appended to a table, or can overwrite a table. To load data you must check that you have read access to the data source and make sure you have write access to the destination table. While building the interface for integration don’t forget the basics: security, reliability and monitoring.
Massive amounts of data require very careful selection of the data format. If you want to load your data to different data analytics platforms, you need to know that your data is optimized for that use. Best data format for the platform is usually one from the top three: CSV, JSON, Parquet.
Even though best data formats are known, there is a chain of checks and decision points that you need to go through before making the final decision.
Use the points below as a checklist:
- Data is easily readable and there is no problem in understanding of data.
- Data format is compliant with tools for extracting, transforming and loading the data (ETL process) that are used or will be used in the project.
- Data format is compliant with the tool used to run the majority of queries and analyze the data.
- No conflict with machine learning algorithms that may require input data in specific format.
Mind that your selection would impact allocated memory capacity, time needed to process data, platform performance and required storage space.
At John Snow Labs we recommend running several sets of tests in your system before fully committing to using particular format in production. Choosing the best data format is one of the most critical performance drivers for data analysis platforms. To run the tests, take several data formats that passed through your pre-selection process. Use the query tool that you are targeting to have in production. Prepare test data with different scope, e.g. run test with small dataset, with medium dataset and with large amount of data. File size should be similar to what you expect during real platform usage. Run the routine operations that will be typical on your platform (write data, perform data analysis, etc.). This will allow you to check performance and memory consumption while using different formats.
It is better to invest in analysis and exploration of possible data format options in the beginning of the project than later to be forced to write additional data parsers and converters that will add unwanted complexity to the data analytics platform (not to mention efforts, budget and missed project due dates).
Sometimes you may require data enrichment – adding calculated fields that are not trivial to calculate. For example, geo enrichment when addresses are converted to latitude and longitude to ease the visualization on the map.
Another big thing is integration of multiple datasets. Different sources support different kinds of schema and data types. E.g. databases support primitive data types, while JSON files allow users to have complex types such as maps (key-value pairs), nest objects, etc. When external sources give you the datasets most of the time you cannot use them together because of discrepancy in data formats. Being not able to join the data means being not able to use it. You need to enrich or map the data.
Solution that we use in John Snow Labs is defining unified schema and type system. All newly curated datasets must be mapped to it. Most problems are around dates, null values, currencies, etc.
Even when you have the datasets that can be joined, check the data for semantic interoperability especially if you want to train the model on this data. Very often data needs to be normalized, e.g. to correspond to the same scale. Here is an example from healthcare industry: two laboratories have the same equipment, but it is calibrated differently and scale of results that is considered as normal are not the same. When for one laboratory test results in range 45-80 is normal, for other laboratory this threshold is different, e.g. 40-75. And when you exchange the data with values on the boundary range, e.g. 40-45 or 75-80 the interpretation is completely different. That is why you need to carefully read description of all the fields and make sure they are compliant before training your model on this data.