
Datasets vs. Algorithms – A Breakthrough in AI 6x Faster

Recent years have witnessed the strong emergence of dataset and algorithm repositories, and this emergence has been accompanied by questions. A growing body of market research has begun to investigate which matters more for the development of Artificial Intelligence (AI), and which segments are in highest demand and likely to capture the greater market share in the future.

By reviewing the timeline of artificial intelligence (AI) breakthroughs over roughly 30 years, Alexander Wissner-Gross found that the availability of high-quality datasets, not algorithms, was the key limiting factor for AI advances.

He also found that, on average, breakthroughs followed the availability of their key datasets by about three years, while they followed the proposal of their key algorithms by about eighteen years. In other words, high-quality dataset availability can bring about a breakthrough in AI roughly six times faster (18 / 3 = 6) than algorithms.

In 1994, a breakthrough was achieved in human-level spontaneous speech recognition. The related dataset, spoken Wall Street Journal articles and other texts, first became available in 1991 (3 years prior to the breakthrough), while the related algorithm, the Hidden Markov Model, was first proposed in 1984 (10 years prior to the breakthrough).

The same time ratios were observed in three of Google's projects:

  • GoogLeNet object classification at near-human performance
  • Google's DeepMind achieved human parity in playing 29 Atari games by learning general control from video
  • Google's Arabic- and Chinese-to-English translation

Moreover, the same time ratios appeared again in two of IBM's projects:

  • IBM Deep Blue defeated Garry Kasparov
  • IBM Watson became the world Jeopardy! champion.

The full story is available at:

https://www.edge.org/response-detail/26587

A table summarizing the timeline of these AI breakthroughs is also available at:

http://www.kdnuggets.com/2016/05/datasets-over-algorithms.html

Which algorithms could be in highest demand?

This is a common question on IT professionals' websites, especially blogs and forums.

From my point of view, advances in medical imaging (PACS/RIS and surgical guidance systems), Clinical Decision Support Systems (CDSS), and modern approaches to prediction have increased the need for advanced and complex algorithms. Software houses developing Pain Management Systems will need advanced algorithms such as the Coping Strategies Questionnaire (CSQ) scoring algorithm and other prognostic algorithms. Developers working in medical imaging, such as guided implantology for dental planning, will need algorithms for collision detection (the same class of algorithms used in plane and missile games), anatomization, image filtering, image segmentation, and mandibular or maxillary curve detection; a minimal collision-detection sketch follows below.
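To illustrate what a basic collision-detection routine looks like, here is a minimal sketch in Python using axis-aligned bounding boxes (AABB). The Box class, the boxes_collide function, and the implant/nerve-canal coordinates are hypothetical examples for illustration, not code from any real implant-planning product.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box (AABB): min/max corners in 3D space."""
    min_x: float
    min_y: float
    min_z: float
    max_x: float
    max_y: float
    max_z: float

def boxes_collide(a: Box, b: Box) -> bool:
    """Two AABBs overlap iff their intervals overlap on every axis."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x and
            a.min_y <= b.max_y and b.min_y <= a.max_y and
            a.min_z <= b.max_z and b.min_z <= a.max_z)

# Hypothetical example: does a planned implant's bounding box intersect
# a nerve canal's bounding box? (Coordinates are made up for illustration.)
implant = Box(0, 0, 0, 4, 4, 10)
nerve_canal = Box(3, 3, 8, 6, 6, 20)
print(boxes_collide(implant, nerve_canal))  # True: the boxes overlap
```

Real planning systems typically refine such coarse bounding-box checks with finer tests on the actual anatomical meshes, but the interval-overlap idea is the usual first stage.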

Speech recognition and translation are in continuous demand for advanced algorithms. The Defense Advanced Research Projects Agency (DARPA) has been behind much of the great progress in speech recognition since the 1970s.

Which datasets could be in highest demand?

This is another common question on well-known IT blogs and forums.

Decision Support Systems depend heavily on the availability of high-quality datasets, especially in the healthcare and military fields. So I think that healthcare datasets are currently in highest demand and will drive AI applications in healthcare. In addition, the evolution of Geographical Information Systems (GIS) has increased the demand for spatial and geographical data.

The role of datasets in Software Quality Assurance (QA)

Another use for datasets is in the quality assurance of software.

The spread of quality assurance certificates and standards such as Kaizen, Lean, Six Sigma, and MMS has increased the demand for intensive and thorough testing procedures.

Before the final release of any software, the application has to pass through a testing phase, an important part of the software development lifecycle (SDLC).

Professional testers write test cases that use trial data; the test cases can be run manually or with automated testing tools.

For example, to issue a release of an HL7 parser, we will need massive numbers of HL7 messages. There are 133 types of HL7 messages, and the testers will need large numbers of messages covering different scenarios for each type. The messages must be completely anonymized to comply with HIPAA guidelines. Producing such data is a great burden for any software house, so finding a high-quality dataset of HL7 messages in a dataset repository would be like finding a real treasure. A minimal sketch of such a test follows below.
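To make the testing scenario concrete, here is a minimal sketch in Python of the kind of automated test a software house might run over such a dataset. The parse_hl7 helper is a hypothetical stand-in for the parser under test, and the ADT^A01 message is a fabricated, fully anonymized example, not real patient data.

```python
def parse_hl7(message: str) -> dict:
    """Hypothetical stand-in for the parser under test: split an HL7 v2
    message into {segment_id: [list of field lists]}."""
    segments = {}
    for segment in message.strip().split("\r"):  # HL7 v2 segments end with <CR>
        fields = segment.split("|")              # '|' is the default field separator
        segments.setdefault(fields[0], []).append(fields)
    return segments

def test_adt_a01_message_type():
    # Fabricated, anonymized ADT^A01 (patient admission) message.
    msg = ("MSH|^~\\&|SENDING_APP|SENDING_FAC|RECV_APP|RECV_FAC|"
           "20240101120000||ADT^A01|MSG00001|P|2.5\r"
           "PID|1||123456^^^HOSP^MR||DOE^JOHN||19700101|M\r"
           "PV1|1|I|WARD^101^1")
    parsed = parse_hl7(msg)
    # MSH-9 (message type) lands at index 8 because MSH-1 is the '|' itself.
    assert parsed["MSH"][0][8] == "ADT^A01"
    assert parsed["PID"][0][5] == "DOE^JOHN"   # PID-5: patient name

test_adt_a01_message_type()
print("HL7 test passed")
```

In practice, assertions like these would run under a test framework such as pytest over thousands of dataset messages rather than a single hand-written one.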

Famous algorithm repositories:

  • http://chorochronos.datastories.org/
  • http://aima.cs.berkeley.edu/code.html (Online Code Repository)
  • http://www.ccd.pitt.edu/algorithm-data-warehouses/ (biomedical only)
  • http://www3.cs.stonybrook.edu/~algorith/ (The Stony Brook Algorithm Repository)
  • https://www.cs.cmu.edu/Groups/AI/html/other/ga.html (CMU Artificial Intelligence Repository, genetic algorithms only)
  • http://www.algorithmist.com/index.php/Main_Page
  • https://xlinux.nist.gov/dads/
  • http://rosettacode.org/wiki/Rosetta_Code

Famous dataset repositories:

KDnuggets published a useful blog post collecting links to the most famous dataset repositories:

http://www.kdnuggets.com/datasets/index.html
