The easiest step in dealing with big datasets is downloading the files. What you do after downloading them is often the challenging part. Opening, editing, or converting the files can be problematic if we are dealing with a huge number of records (big data) or if the file format needs a special browser or tool to open it in a readable form.

Most big data specialists working on data scraping, curation, or cleansing need different toolsets to open, navigate, or convert data from one format to another. There might be other needs, such as merging several tables or importing and exporting data from tables whatever the format (XLS spreadsheet, CSV, TXT, XML database, JSON, SQL database, etc.).

In this blog post, I have tried to gather most of the tools that might be needed during data curation or in a data scientist's day-to-day work.

To keep it simple, I have grouped them into six categories:

          1. Tailor-made (custom-made) tools: this category includes most of the specialized browsers developed by particular organizations to handle specific datasets in specific formats.

          2. File format converters: converting files from XML to CSV, or changing the delimiter from tab to semicolon or comma, is sometimes needed. Other file format conversions might also be needed.

          3. NoSQL databases: non-relational databases such as XML databases are becoming common. They can be challenging for professionals who are used to traditional SQL database management systems (DBMS), especially if the number of records is enormous.

          4. CSV/XLS data sets: editing files, changing the delimiter, and merging or appending CSV and XLS files are day-to-day work for anyone doing data curation.

          5. Text editors: opening and editing big text files might need tools other than the traditional Notepad.

          6. Hadoop-like software: handling big data with good memory management needs special software. I list only a couple of them here.

 

1. TAILOR-MADE (CUSTOM-MADE) TOOLS:

          a. Amazon S3 Browser

S3 Browser is a freeware Windows client for Amazon S3. It can be used to store and retrieve big data. It lets you upload, download, delete, or rename your files and folders without visiting the Amazon website. You cannot rename folders through the web interface, but you can do so through S3 Browser.

pic 1

 

           b. Beyond 2020 Professional Browser

pic 2

Features:

  • Works with files larger than 2 GB
  • Saving as CSV uses regional settings for separators

Related formats the tool can handle: *.ivt

Available at: http://www.beyond2020.com/

Freeware/shareware: Free

 

          c. Protégé

Protégé is the best tool for ontology data sets, especially those available in the (*.owl) or (*.nt) formats, both of which need a special browser to open.

About the product: A free, open-source ontology editor and framework for building intelligent systems

pic 3

Related formats the tool can handle: RDF/XML, OWL/XML, OBO, OWL, NT, Turtle (ttl)

Available at: http://protege.stanford.edu/

Freeware/shareware: Free

 

          d. FDA databases:

Outside the FDA website, dealing with FDA data sets is usually very difficult. If you download the complete FDA data sets for medical devices, drug adverse events, and drug approvals, you will most probably not be able to open them with traditional spreadsheet software or relational database management systems (DBMS).

Here are some recommended tools that could help with the complex FDA data sets:

          1. Pragmatic Validator:

The FDA drug label database can be opened in a readable format with a tool called “Pragmatic Structured Product Labeling Editor” (“SPL XForms”).

This tool is available at:

http://validator.pragmaticdata.com/validator-lite/validator/spl-form/

You can open either the zip files or the XML files of the drug label database. The zip files contain JPEG files in addition to the XML files. Both the images and the text can be retrieved in the SPL view.

 

pic 4

pic 5

 

          2. RxNav:

RxNorm is a normalized nomenclature for clinical drugs developed by the National Library of Medicine (NLM).

RxNav is a browser for several drug information sources, including RxNorm and RxTerms. The new version has a function to retrieve National Drug Code (NDC) properties for an NDC or a Structured Product Label (SPL).

pic 6

pic 7

          3. OpenVigil

OpenVigilFDA is a web-based user interface to the FDA Adverse Event Reporting System (AERS) database.

This tool can help generate hypotheses about new adverse drug reactions, drug-drug interactions, and safety comparisons.

Queries can be run and the results downloaded in HTML, CSV, JSON, or XML format through an online tool called “OpenVigil”, available at:

http://webcl5top.rz.uni-kiel.de/pharmacology/pvt/openvigilfda.php

Another tool is available at:

http://www.is.informatik.uni-kiel.de/pvt/OpenVigil2.1/search/

If a username and password are needed, they are: dgpt, dgpt

 

2. FILE FORMAT CONVERTERS:

          a. Advanced XML Converter:

Advanced XML Converter helps you convert XML to other database and document formats: HTML, CSV, DBF, XLS and SQL.

The software is available for download at:

http://www.xml-converter.com/

pic 8

pic 9

 

          b. Moor.XmlToCsvConverter

This tool helps you convert an XML database to CSV format.

The tool is available at:

https://xmltocsv.codeplex.com/downloads/get/806200
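
For readers who prefer scripting, the same kind of XML-to-CSV flattening can be sketched with Python's standard library. This is only an illustration under simple assumptions, not how Moor.XmlToCsvConverter works internally: the sample document, the `record` tag, and the `xml_to_csv` helper are all made up for the example, and each record is assumed to be flat (one level of child elements).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; real XML exports will have their own tag names.
SAMPLE_XML = """<records>
  <record><id>1</id><name>Aspirin</name></record>
  <record><id>2</id><name>Ibuprofen</name></record>
</records>"""


def xml_to_csv(xml_text, record_tag, out_file, delimiter=","):
    """Write one CSV row per <record_tag> element, one column per child tag."""
    root = ET.fromstring(xml_text)
    records = root.findall(f".//{record_tag}")

    # Collect column names in first-seen order across all records.
    columns = []
    for rec in records:
        for child in rec:
            if child.tag not in columns:
                columns.append(child.tag)

    writer = csv.DictWriter(
        out_file, fieldnames=columns, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        # Child tags missing from a record are left as empty cells.
        writer.writerow({child.tag: (child.text or "") for child in rec})


buffer = io.StringIO()
xml_to_csv(SAMPLE_XML, "record", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

This sketch parses the whole document in memory, so for very large XML files an incremental parser such as `xml.etree.ElementTree.iterparse` would probably be a better starting point.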

 

pic 10

         

          c. OpenRefine:

The tool was formerly named Google Refine. It was developed at Google and is now maintained as an open-source community project.

It can convert XML to CSV.

It allows you to:

  • Explore data
  • Clean and transform data
  • Reconcile and match data

TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents are all supported.

pic 11

   

          d. For converting txt to csv and vice versa:

Of course, you can use OpenOffice or Excel, but for big txt files I recommend the following tools:

               1. ConvertXLS tool:

This tool can convert XLS to CSV and vice versa. It also lets you apply special processing to XLS files, as shown in the next screenshot:

pic 12

 

It is available at:

http://www.softinterface.com/Convert-XLS/Features/Convert-Fixed-Width-Text-File-To-CSV.htm

               2. You can also use CSVed:

This tool lets you merge or append CSV files, convert text files to CSV, and change the delimiter (for example, from tab to comma or semicolon). I strongly recommend it if you work with CSV files.

CSVed is available at:

http://download.cnet.com/CSVed/3000-2351_4-52868.html
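
If you need to script the same operations, appending files and changing the delimiter can be sketched in a few lines of Python. This is a minimal illustration that assumes all inputs share the same header row; the `merge_csv` helper and the sample data are hypothetical and have nothing to do with CSVed itself.

```python
import csv
import io


def merge_csv(sources, out_file, in_delimiter="\t", out_delimiter=";"):
    """Append several delimited inputs with identical headers into one output."""
    writer = csv.writer(out_file, delimiter=out_delimiter, lineterminator="\n")
    header = None
    for src in sources:
        reader = csv.reader(src, delimiter=in_delimiter)
        first_row = next(reader, None)
        if first_row is None:
            continue  # skip empty inputs
        if header is None:
            # The first non-empty input defines the header, written once.
            header = first_row
            writer.writerow(header)
        elif first_row != header:
            raise ValueError("input files do not share the same header")
        # Stream the remaining rows without loading the whole file.
        writer.writerows(reader)


# Hypothetical in-memory inputs standing in for two tab-delimited files.
part1 = io.StringIO("id\tname\n1\tAspirin\n")
part2 = io.StringIO("id\tname\n2\tIbuprofen\n")
merged = io.StringIO()
merge_csv([part1, part2], merged)
merged_text = merged.getvalue()
print(merged_text)
```

With real files, pass objects opened with `open(path, newline="")`, as the `csv` module documentation recommends. Because rows are streamed one at a time, memory use stays small even for large inputs.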

pic 13

 

               3. You can also try reCsvEditor:

This tool can handle big CSV files, export the data to different formats, and change the delimiter (tab, comma, or semicolon).

The tool is available for download at:

https://sourceforge.net/projects/recsveditor/?source=typ_redirect

pic 14

 

 

3. NoSQL DATABASE (XML DATABASE)

NoSQL databases are non-relational databases. In other words, the data is represented by means other than traditional relational tables.

XML databases are one of the best-known forms of NoSQL database.

BaseX and eXist-db are among the best-known tools for handling XML databases.
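
To give a feel for what "other means than relational tables" looks like in practice, here is a small sketch that queries a nested XML document with a path expression. It uses Python's standard library and its limited XPath subset, not BaseX, eXist-db, or XQuery, and the sample document is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: nested elements instead of rows and columns.
DOC = """<drugs>
  <drug id="d1">
    <name>Aspirin</name>
    <ingredients><ingredient>acetylsalicylic acid</ingredient></ingredients>
  </drug>
  <drug id="d2">
    <name>Co-codamol</name>
    <ingredients>
      <ingredient>codeine</ingredient>
      <ingredient>paracetamol</ingredient>
    </ingredients>
  </drug>
</drugs>"""

root = ET.fromstring(DOC)

# Path expression: the drug element whose id attribute is "d2".
drug = root.find(".//drug[@id='d2']")
name = drug.findtext("name")

# A record can hold a variable-length list of children, which a single
# relational row cannot do without a second table.
ingredients = [el.text for el in drug.findall("./ingredients/ingredient")]
print(name, ingredients)
```

A dedicated XML database adds indexing, a full query language, and client/server access on top of this basic idea.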

 

          a. BaseX

pic 15

Features:

  • XQuery editor
  • Interactive visualization
  • Powerful Client/Server architecture

Available at: http://basex.org/

 

          b. eXist-db

pic 16

Features:

  • Browser-based IDE
  • Rich Stack of Libraries
  • Rapid Prototyping
  • Schema-less Database

Available at:

http://exist-db.org/exist/apps/homepage/index.html

 

4. CSV/XLS DATASETS

You can use one of the following well-known programs to handle your data set:

a. CSVed

b. CSV Buddy

c. MS-Excel

d. Open Office

e. Gnumeric

  • Gnumeric can handle files in the following formats:
    • .gnumeric / .gnm / .xml / .as
    • For comma-, tab-, or semicolon-separated values, it can handle .csv / .tsv files

pic 17

 

5. TEXT EDITORS

          a. Notepad++

Features:

  • Syntax Highlighting and Syntax Folding
  • Multi-Document (Tab interface)
  • Auto-completion
  • Bookmark
  • Multi-view

The software is available at:

https://notepad-plus-plus.org/features/

 

          b. EmEditor

Features:

pic 18
The software is available at:

https://www.emeditor.com/download/

 

          c. EditPad Lite

Features:

  • Sort lines alphabetically and delete duplicate lines
  • Open and edit many text files at the same time with no limit
  • Extensive auto-save and backup options
  • Unlimited undo and redo even after saving

The software is available at:

http://www.editpadlite.com/download.html
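
The first feature in the list above, sorting lines and deleting duplicates, is also easy to script when you would rather not open the file in an editor. The sketch below is a generic illustration, not EditPad Lite's implementation; `unique_sorted_lines` and the sample text are hypothetical names and data.

```python
import io


def unique_sorted_lines(lines):
    """Return the distinct lines of an iterable, sorted alphabetically."""
    # Reading line by line avoids loading the raw file at once; only the
    # set of distinct lines is kept in memory.
    distinct = {line.rstrip("\n") for line in lines}
    return sorted(distinct)


# Hypothetical in-memory input standing in for a text file.
sample = io.StringIO("pear\napple\npear\nbanana\napple\n")
result = unique_sorted_lines(sample)
print(result)
```

With a real file, pass the object returned by `open(path, encoding="utf-8")`. If the number of distinct lines is itself too large for memory, an external sort would be needed instead.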

pic 19

 

6. HADOOP-LIKE SOFTWARE:

          a. Talend Data Preparation – Free Desktop

About the product: Helps save time when exploring, cleansing, and combining big data from different sources.

pic 20

Features:

  • Single user, desktop-based application
  • Import, export, and combine CSV and Excel files
  • Auto-discovery, smart suggestions, and data visualization
  • Cleansing and enrichment functions
  • Completely free

Related formats the tool can handle: CSV, XLSX, Tableau

Available at: https://www.talend.com/products/data-preparation

Freeware/shareware: Free

 

          b. Hortonworks Sandbox on Oracle VM VirtualBox

The sandbox is used for data management, data access, data governance, integration, security, and operations.

It can efficiently handle big data files in the following formats: CSV, XLSX, and JSON.

 

Oracle VM VirtualBox is available at:

https://www.virtualbox.org/wiki/Downloads

 

Hortonworks Sandbox is available for download at:

http://hortonworks.com/products/sandbox/

 

pic 21

 

pic 22

pic 23