The easiest step in dealing with big datasets is downloading the files. What you do after downloading them is often the challenging part. Opening the files, editing them, or converting their format can be problematic if we are dealing with a huge number of records (big data) or if the file format needs a special browser or tool to open it in a readable form.
Most big data specialists working on data scraping, curation or cleansing need different toolsets to open, navigate or convert data from one format to another. There might be other needs, such as merging several tables or importing and exporting data from tables whatever the format is (XLS spreadsheet, CSV, TXT, XML database, JSON, SQL database, etc.).
In this blog post, I have tried to gather most of the tools that might be needed during data curation or in a data scientist's day-to-day work.
To make it simple, I categorized them into six categories:
1. Tailor-made (custom-made) tools: this category includes most of the specialized browsers developed by certain organizations to handle specific datasets in specific formats.
2. File format converters: changing files from XML to CSV, or changing the delimiter from tab to semicolon or comma, may sometimes be needed. Other file format conversions might also be needed.
3. NoSQL databases: non-relational databases such as XML databases are becoming a common form of database. This can be challenging for professionals who are used to traditional SQL database management systems (DBMS), especially if the number of records is enormous.
4. CSV/XLS datasets: editing files, changing the delimiter, and merging or appending CSV and XLS files is day-to-day work for anyone working in data curation.
5. Text editors: opening and editing big text files might need special tools other than the traditional Notepad.
6. Hadoop-like software: handling big data with good memory management in mind needs special software. I am listing just a couple of them here.
1. TAILOR-MADE (CUSTOM-MADE) TOOLS:
a. Amazon S3 Browser
S3 Browser is a freeware Windows client for Amazon S3. It can be used to store and retrieve big data. It allows you to upload, download, delete or rename your files and folders without having to go to the Amazon website. You cannot rename folders through the web interface, but you can do that through S3 Browser.
b. Beyond 2020 Professional Browser
|Features||– Works with files larger than 2 GB; – Saving as CSV uses regional settings for separators|
|related formats the tool can handle||*.ivt|
c. Protégé
This is the best tool for ontology datasets, especially those available in (*.owl) or (*.nt) formats. Both formats need a special browser called “Protégé”.
|About the product||A free, open-source ontology editor and framework for building intelligent systems|
|related formats the tool can handle||RDF/XML, OWL/XML, OBO, OWL, NT, Turtle (ttl)|
d. FDA databases:
Outside the FDA website, dealing with FDA datasets is usually very difficult. If you download the complete FDA datasets for medical devices, drug adverse events and drug approvals, you will most probably not be able to open them using traditional spreadsheet software or relational database management systems (DBMS).
Here are some recommended tools that could help with the complex FDA data sets:
1. Pragmatic Validator:
The FDA Drug Label database can be opened in a readable format through a tool called “Pragmatic Structured Product Labeling Editor (“SPL XForms”)”.
This tool is available at:
You can open either the ZIP files or the XML files of the drug labels database. The ZIP files contain JPEG files in addition to the XML files. Both the image files and the text can be retrieved in the SPL view.
RxNorm is a normalized nomenclature for clinical drugs developed by the National Library of Medicine (NLM).
RxNav is a browser for several drug information sources, including RxNorm and RxTerms. The new version has a function to retrieve National Drug Code (NDC) properties for an NDC or a Structured Product Label (SPL).
OpenVigilFDA is a web-based user interface to the FDA Adverse Event Reporting System (AERS) database.
This tool can help in generating hypotheses for new adverse drug reactions, drug-drug interactions and safety comparisons.
You can run queries and download the data in HTML, CSV, JSON or XML format through an online tool called “OpenVigil”, available at:
Another tool is available at:
If a username and password are needed, they are: dgpt, dgpt
2. FILE FORMAT CONVERTERS:
a. Advanced XML Converter:
Advanced XML Converter helps you convert XML to other database and document formats: HTML, CSV, DBF, XLS and SQL.
The software is available for download at:
This tool can help you convert an XML database into CSV format.
The tool is available at:
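If you prefer a script to a GUI converter, the same kind of XML-to-CSV conversion can be sketched with Python's standard library. This is only a minimal sketch under stated assumptions: the function name `xml_to_csv`, the record tag and the field names are hypothetical and must be adapted to the structure of your own XML file. It reads the file incrementally, so it does not need to hold the whole document in memory.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_csv(xml_path, csv_path, record_tag, fields):
    """Stream <record_tag> elements from xml_path into a CSV file.

    record_tag and fields are hypothetical names supplied by the caller;
    they must match the structure of the actual XML file.
    Returns the number of records written.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(fields)  # header row
        # iterparse reads the file incrementally instead of loading it whole
        for _event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == record_tag:
                # findtext returns the default ("") when a child is missing
                writer.writerow([elem.findtext(f, default="") for f in fields])
                count += 1
                elem.clear()  # drop the children and text of the finished record
    return count
```

For example, `xml_to_csv("labels.xml", "labels.csv", "drug", ["name", "dose"])` would write one CSV row per `<drug>` element, assuming the file really uses those (made-up) tag names.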
c. Open Refine:
The tool was formerly named Google Refine.
It was formerly maintained by Google and can convert XML to CSV.
It allows you to:
- Explore data
- Clean and transform data
- Reconcile and match data
TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents are all supported.
d. For converting txt to csv and vice versa:
Of course, you can use OpenOffice or Excel, but for big TXT files I recommend the following tools:
1. ConvertXLS tool:
The tool can convert XLS to CSV and vice versa. Moreover, it allows you to do special processing on XLS files. This can be shown in the next screenshot:
It is available at:
2. You can also use CSVed:
This tool allows you to merge or append different CSV files, convert text files to CSV, and change the delimiter (for example from tab to comma or semicolon). It is strongly recommended if you are dealing with CSV files.
CSVed is available at:
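Appending CSV files, as described above, can also be scripted when the files are too many or too large to handle by hand. The following is a minimal Python sketch, not a replacement for the tool: the function name `append_csv_files` is hypothetical, and it assumes that all input files share exactly the same header row.

```python
import csv


def append_csv_files(paths, out_path):
    """Concatenate CSV files that share the same header into one file.

    Raises ValueError if a file's header differs from the first file's.
    Returns the number of data rows written.
    """
    header = None
    total = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"header mismatch in {path}")
                # copy the data rows one at a time to keep memory use low
                for row in reader:
                    writer.writerow(row)
                    total += 1
    return total
```

Rows are copied one at a time, so memory use stays roughly constant regardless of file size.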
3. You can also try reCsvEditor:
This tool can handle big CSV files, export the data into different formats, and change the delimiter (tab, comma, or semicolon).
The tool is available for download at:
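Changing the delimiter, which the tools above do through a GUI, is also straightforward to script. Here is a minimal Python sketch using the standard `csv` module; the function name `change_delimiter` and its default arguments are my own choices for illustration. Going through `csv.reader` and `csv.writer`, rather than a plain search-and-replace, means fields that contain the new delimiter get quoted correctly.

```python
import csv


def change_delimiter(src_path, dst_path, old="\t", new=";"):
    """Rewrite a delimited text file with a different delimiter, row by row.

    Returns the number of rows written (including any header row).
    """
    rows = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=old)
        writer = csv.writer(dst, delimiter=new)
        for row in reader:
            # the writer quotes any field that contains the new delimiter
            writer.writerow(row)
            rows += 1
    return rows
```

For instance, `change_delimiter("data.tsv", "data.csv", old="\t", new=",")` would turn a tab-separated file into a comma-separated one.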
3. NoSQL DATABASE (XML DATABASE)
NoSQL databases are non-relational databases. In other words, the data is represented by means other than the traditional relational tables.
XML databases are one of the best-known forms of NoSQL databases.
BaseX and eXist-db are two of the best-known tools for handling XML databases.
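To make the difference from relational tables concrete, here is a small Python sketch that queries a nested XML document with the standard library. The sample document, the tag names and the function `reactions_for` are all made up for illustration. In BaseX or eXist-db you would typically write a similar path expression in XQuery instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: each drug nests a variable number of
# <reaction> children, which a relational design would put in a second table.
SAMPLE = """
<drugs>
  <drug name="aspirin">
    <reaction>nausea</reaction>
    <reaction>rash</reaction>
  </drug>
  <drug name="ibuprofen">
    <reaction>headache</reaction>
  </drug>
</drugs>
"""


def reactions_for(xml_text, drug_name):
    """Return the reaction names nested under the drug with the given name."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset, including [@attr='value']
    path = f"./drug[@name='{drug_name}']/reaction"
    return [node.text for node in root.findall(path)]
```

Calling `reactions_for(SAMPLE, "aspirin")` walks the hierarchy directly, with no join between tables.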
a. BaseX
- XQuery editor
- Interactive visualization
- Powerful Client/Server architecture
Available at: http://basex.org/
b. eXist-db
- Browser-based IDE
- Rich Stack of Libraries
- Rapid Prototyping
- Schema-less Database
4. CSV/XLS DATASETS
You can use one of the following well-known software packages to handle your dataset:
b. CSV buddy
d. Open Office
- Gnumeric can handle files with the following formats:
- .gnumeric / .gnm / .xml/.as
- For Comma/Tab/Semicolon Separated Values it can handle files with .csv/.tsv format
5. TEXT EDITORS
- Syntax Highlighting and Syntax Folding
- Multi-Document (Tab interface)
The software is available at:
- Customizable interface
- Large file support: it easily handles files up to 248 GB
- Split/Combine Files
- Syntax Highlighting
The software is available at:
c. EditPad Lite
- Sort lines alphabetically and delete duplicate lines
- Open and edit many text files at the same time with no limit
- Extensive auto-save and backup options
- Unlimited undo and redo even after saving
The software is available at:
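Some of the editor features listed above, such as splitting files, can also be scripted when a file is too big to open comfortably. Below is a minimal Python sketch that splits a large text file into smaller parts by line count. The function name `split_text_file`, the default part size and the `part_0001.txt` naming scheme are arbitrary choices for this sketch. It reads one line at a time, so the whole file never has to fit in memory.

```python
import os


def split_text_file(src_path, out_dir, lines_per_part=1_000_000):
    """Split a large text file into numbered parts without loading it whole.

    The part naming scheme (part_0001.txt, ...) is an arbitrary choice
    made for this sketch. Returns the list of part paths created.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    try:
        # newline="" keeps the original line endings untouched
        with open(src_path, newline="", encoding="utf-8") as src:
            # iterating over the file object reads one line at a time
            for index, line in enumerate(src):
                if index % lines_per_part == 0:
                    if out is not None:
                        out.close()
                    name = f"part_{len(parts) + 1:04d}.txt"
                    part_path = os.path.join(out_dir, name)
                    out = open(part_path, "w", newline="", encoding="utf-8")
                    parts.append(part_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

Each part can then be opened in an ordinary editor, and concatenating the parts in order gives back the original file.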
6. HADOOP-LIKE SOFTWARE:
a. Talend Data Preparation – Free Desktop
|About the product||Helps save time when exploring, cleansing, and combining big data from different sources.|
|related formats the tool can handle||CSV, XLSX, Tableau|
b. Hortonworks Sandbox on Oracle VM VirtualBox
Used for data management, data access, data governance, integration, security and operations.
The tool can efficiently handle big data files in the following formats: CSV, XLSX, JSON.
Oracle VM VirtualBox is available at:
Hortonworks Sandbox is available for download at: