
Forging Partnerships for a Better World – John Snow Labs and Open Knowledge International Collaboration

Exactly a year ago, John Snow Labs formed an alliance with other organizations that share its passion and vision: making graded datasets readily available for the enrichment of society and the betterment of communities.

John Snow Labs, DataHub and Open Knowledge International share the same keenness for data and bear the same responsibility towards humanity. They have come together to make sure that fewer and fewer people are left behind in access to the tools and information that make life better and ease preventable hardships.

To mark the one-year anniversary of the partnership, John Snow Labs updated 209 datasets on the platform. These datasets are now cleaned and normalized for users to explore, scrutinize, study and analyze. They fall into 8 categories: Climate, Demographics, Economy, Geography, Health, Internet, Pharma, and Transportation.

What’s Inside These Categories?

  • Climate datasets are carefully gathered as a reference point for climatologists’ studies and research on past, present and future climatic changes and climate anomalies. There is data on water quality, both at beaches and in US residential areas. There is also information on sea level change, wastewater treatment, available water services, monitoring, distribution, and withdrawals. Temperature analysis and anomalies from the National Aeronautics and Space Administration (NASA) are also available in this category. Other topics of interest for climate research include atmospheric CO2 emissions and trends, garbage collection and management along with the penalties for non-compliance, and statistics on tree preservation, trimming and debris removal requests from US residents.
  • Demographics datasets cover school statistics such as enrollment counts by institution, progress report card data, school attendance and graduation outcomes, and examination results for English, Math and the Regents exams. Population and death statistics are also available by US state, city, ZIP code, race, country, citizenship or ethnicity code. These population-based datasets include age, sex and race values, among others. This category is very useful to both the public and the private sector for research and policy development.
  • The Economy category showcases a pool of rich datasets from both the United Kingdom and the United States that can be used in natural language processing applications in finance. Consumer price indexes, inflation rates, gross domestic product, gold prices, consumer purchasing power, US and UK bond yields, mutual funds, gas prices, and annual tax collections are some of the statistics that can be explored. Both the public and private sectors can also look into business owners’ information and licensing data, business permits, the list of Standard & Poor’s 500 companies including their earnings, and the Federal Deposit Insurance Corporation (FDIC) bank list and insured institutions, among others. Other available data include US and UK housing developments, pricing, and properties for sale. Government and private program information on school services, child welfare, community services, employment, and unemployment rates is also accessible. Lastly, datasets such as the Belgium COFOG Nomenclature, GHEITI Data, Euribor Rates, IMF World Economic Outlook Database, International Chamber of Commerce Incoterms, NJ GUDPA Funds Certificate List and NYCHA Development Data Book have been cleaned and standardized for users. These datasets will help the government and the private sector determine demand and allocate the supply of resources, supporting a healthy economic system.
  • Geography datasets describe places, environments and their relationship with people. This category includes latitude and longitude data, countries’ geographical territories, country codes, a list of major cities of the world, US states and territories, Geoname IDs, airport codes, International Maritime Organization (IMO) and International Maritime Dangerous Goods (IMDG) classification codes, and the Ship Message Design Group (SMDG) master terminal facilities list.
  • Health data is in ever-growing demand as patients continue to call for better treatment and management outcomes from providers. To add value to the Open Knowledge data source, this category includes treatment outcomes, records of adult and child vaccinations, emergency response teams and services, current health IT partners, diabetes prevalence, breast cancer statistics, leading causes of death, a list of childcare facilities, volunteering opportunities, health codes, ordering and referring physicians and non-physicians, drug use safety, and mosquito-borne West Nile virus data.
  • Internet datasets, although readily available on the World Wide Web, need some filtering to ensure that the information comes from factual sources. The selected datasets here are useful for statistical and marketing analysis and include Internet top-level domain names, a list of Internet media types and subtypes, IPv4 geolocation data, Google Analytics data, membership in international copyright treaties, and social media usage.
  • Pharma datasets include information on food and drugs, covering their manufacture, use, and sale. For food, this category includes the healthy eating index, food affordability, food consumption, a health and nutrition survey, and a dataset on the adoption of herbicide-tolerant and insect-resistant crops. For drugs, one dataset has been added: the UK controlled drugs list.
  • Transportation datasets include safety data on vehicle collisions, injuries, accidents and fatalities across different modes of transportation. They cover the number of persons involved, incidents associated with alcohol use, vehicle speed and weather condition variables, and occupant and non-occupant fatalities by vehicle type. This category also has aviation information on general safety, air taxi safety, air carrier fatal accidents, commuter air carrier fatal accidents, and near midair collisions by degree of hazard.

All these datasets are now available for download. They are expected to save researchers, scientists and innovators a lot of time in searching and cleaning, giving them more time to analyze the data and make it more useful for consumers and end users alike. Adding these datasets marks another milestone in the effort to build up quality, usable data in the open space, bringing us a little closer to a better world.

