Here at John Snow Labs, we are delighted to announce that all datasets are now also available in the new highly optimized Apache Parquet format, which delivers an order of magnitude faster query speeds, as well as substantial storage savings, according to multiple industry benchmarks.
The new format drastically accelerates queries on common benchmarks. It also reduces disk space, bandwidth and CPU usage. It is available alongside the existing CSV and JSON data formats and is included in all subscriptions.
Apache Parquet is an efficient, general-purpose columnar file format. It is self-describing and language-independent, and it supports multiple compression algorithms, partitioning for big data sets, and nested data structures. John Snow Labs is the first to deliver a data repository in Parquet format in the healthcare space, which is experiencing fast-growing adoption of big data analytics technologies.
Parquet was designed for Apache Hadoop and has been adopted by Apache Spark, Cloudera Impala, Hive, Presto and Apache Drill. The majority of big data analytics platforms now recommend it as the most efficient, highest-performing data format. Here are recent publicly available benchmarks:
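To make the columnar idea concrete, here is a toy sketch in plain Python. It is not Parquet and uses none of Parquet's actual encodings or APIs; the sample records, field names and helper functions are all invented for illustration. It only shows why a query that needs one field can read far fewer bytes when each column is stored and compressed on its own.

```python
import json
import zlib

# Hypothetical sample records; the field names are invented for illustration.
ROWS = [
    {"patient_id": i, "state": "CA" if i % 2 == 0 else "NY", "age": 30 + i % 5}
    for i in range(1000)
]


def to_row_layout(rows):
    """Serialize record by record, the way CSV or JSON lines do."""
    return [json.dumps(r).encode("utf-8") for r in rows]


def to_column_layout(rows):
    """Store each column as one contiguous, separately compressed chunk."""
    columns = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        columns[name] = zlib.compress(json.dumps(values).encode("utf-8"))
    return columns


def average_age_from_rows(row_blobs):
    """Row layout: every record is parsed in full to reach a single field."""
    ages = [json.loads(blob)["age"] for blob in row_blobs]
    return sum(ages) / len(ages)


def average_age_from_columns(columns):
    """Column layout: only the 'age' chunk is decompressed and parsed."""
    ages = json.loads(zlib.decompress(columns["age"]))
    return sum(ages) / len(ages)


row_blobs = to_row_layout(ROWS)
columns = to_column_layout(ROWS)

bytes_read_rows = sum(len(blob) for blob in row_blobs)
bytes_read_columns = len(columns["age"])

print("average age (rows):   ", average_age_from_rows(row_blobs))
print("average age (columns):", average_age_from_columns(columns))
print("bytes read, row layout:   ", bytes_read_rows)
print("bytes read, column layout:", bytes_read_columns)
```

Both functions return the same answer, but the columnar version touches only the bytes of the one column it needs. Real Parquet files add typed encodings, row groups and embedded schema metadata on top of this basic layout.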
IBM evaluated multiple data formats for Spark SQL and found that Parquet delivered:
- Queries 11 times faster than on text files
- 75% less data storage thanks to built-in compression
- Error-free queries on large files (1 TB in size), the only format to do so
- Higher scan throughput on Spark 1.6
Cloudera examined different queries and discovered that Parquet was:
- 2 to 15 times faster than Avro, and far faster than CSV
- 72% smaller on a wide table and 25% smaller on a narrow table
United Airlines also reported that Parquet was:
- 10 times faster than CSV on Presto and 3 times faster than CSV on Hive
According to the founding team, “Our customers expect us to optimize and test the data we provide for whichever analytics platform they use – often for multiple ones. For big data platforms, Apache Parquet is emerging as the gold standard, and we are thrilled to be the first to support it across our entire data catalog. Our customers benefit in two ways. They get turnkey data in an optimized format and do not need to spend time and effort on reformatting, plus they get the day-to-day productivity boost from screaming fast query performance.”
We provide turnkey data for scientists across 15 areas of healthcare. Our service supports the analysis of healthcare data and specializes in data engineering to optimize storage, bandwidth and data access performance. We also invest in optimizing and testing clean, current and enriched healthcare data sets on the latest big data platforms. Our current partners include Cloudera and Hortonworks in big data, Atigeo and Turi in data science, and the open-source projects Spark, Presto and ElasticSearch.
Here at John Snow Labs, we believe that data science will be a major driver of progress for 21st-century medicine. We contribute by providing quality DataOps and by finding, cleaning, formatting, updating and publishing turnkey data for technology companies, healthcare providers, research, government and non-profit organizations.