The ETL process (Extract, Transform, Load) is not easy. It can take up to 65% of a data scientist's time. Every data scientist knows that data cleansing is hard and exhausting, especially when it is done manually or with conventional tools such as MS Excel. Traditional spreadsheets and relational database management systems were not designed to handle big data sets.
As data science has grown and demand for data management tools has risen sharply, different applications have evolved to bridge the gap.
Most of these tools are too expensive for a startup company or a freelance data scientist.
In this blog post, I present five open-source tools that can help you handle and tame your big data sets. These tools can markedly improve your ETL process and, in turn, help keep your project on time and on budget.
Jupyter Notebook is an interactive web application where you can combine code execution, rich text, mathematics, plots and rich media. You can also share documents that contain live code, equations, visualizations and explanatory text.
Jupyter notebooks help you and others understand your code's workflow and logic by providing an environment where you can write code in Python, Scala or R, check the output, and add documentation with styled text and HTML.
Starting with IPython 4.0, the language-agnostic parts of the project moved to a new project called Jupyter. Jupyter is an open-source, interactive data science programming tool.
Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. Moreover, Jupyter has many advanced and powerful features:
- Support for more than 40 programming languages (such as Python, R, Julia, and Scala)
- Notebooks can be shared easily through email, Dropbox, GitHub or the Jupyter Notebook Viewer
- Code in Jupyter can produce rich, interactive output: HTML, images, videos, LaTeX, and custom MIME types
- A powerful interactive shell.
- A kernel for Jupyter.
- Support for interactive data visualization and use of GUI toolkits.
- Flexible, embeddable interpreters to load into your own projects.
- Easy to use, high-performance tools for parallel computing.
- Interactive widgets can be used to manipulate and visualize data in real-time.
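The rich-output feature in the list above can be illustrated with a small sketch. Jupyter looks for a `_repr_html_` method on the object a cell evaluates to and, if it finds one, displays the returned HTML. The `MissingValueReport` class and the sample rows below are hypothetical, made up for illustration only.

```python
from html import escape


class MissingValueReport:
    """Counts empty cells per column and renders the counts as an HTML table."""

    def __init__(self, rows):
        # rows: a list of dicts sharing the same keys (hypothetical sample data)
        self.counts = {}
        for row in rows:
            for column, value in row.items():
                is_missing = value is None or str(value).strip() == ""
                self.counts[column] = self.counts.get(column, 0) + int(is_missing)

    def _repr_html_(self):
        # Jupyter calls this hook, when present, to get an HTML representation.
        body = "".join(
            f"<tr><td>{escape(column)}</td><td>{count}</td></tr>"
            for column, count in self.counts.items()
        )
        return f"<table><tr><th>column</th><th>missing</th></tr>{body}</table>"


rows = [
    {"name": "Alice", "city": "Cairo"},
    {"name": "", "city": "Giza"},
    {"name": "Bob", "city": None},
]
report = MissingValueReport(rows)
report  # as the last expression in a notebook cell, this displays as a table
```

Outside a notebook the last line does nothing visible; calling `report._repr_html_()` returns the same HTML as a plain string.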
Apache Zeppelin is a completely open, web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
It brings data ingestion, data discovery, data analytics, data visualization and collaboration to Hadoop and Spark.
Various languages are supported through Zeppelin language interpreters, such as Apache Spark, Python, JDBC, Markdown and Shell.
Apache Zeppelin provides full Apache Spark integration with no need to build a separate module, plugin or library.
This integration provides:
- Automatic SparkContext and SQLContext injection
- Runtime jar dependency loading from the local filesystem or a Maven repository.
- Job cancellation and progress display
- For data visualization, Apache Zeppelin already includes some basic charts. Visualizations are not limited to SparkSQL queries; any output from any language backend can be recognized and visualized. It also provides pivot charts with simple drag and drop.
- You can share your notebook with your collaborators, much as you would a Google Docs document.
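As a rough sketch of how output from a language backend becomes a chart: Zeppelin's display system treats printed output that starts with `%table` as tab-separated table data, which can then be switched between the built-in chart types. The helper name `to_zeppelin_table` and the sample figures are assumptions for illustration; in Zeppelin the code would run inside a Python paragraph.

```python
def to_zeppelin_table(header, rows):
    """Build a string that Zeppelin's display system can render as a table."""
    lines = ["\t".join(header)]
    for row in rows:
        lines.append("\t".join(str(cell) for cell in row))
    # Columns are separated by tabs and rows by newlines after the %table marker.
    return "%table " + "\n".join(lines)


# Hypothetical sample data.
output = to_zeppelin_table(["country", "orders"], [("Egypt", 120), ("Jordan", 45)])
print(output)
```

Run outside Zeppelin, this simply prints the raw text; inside a notebook paragraph the same text is what the table and chart views are built from.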
RStudio IDE is an open-source, web-based tool that lets you analyze data, make use of many statistical packages, and easily produce data visualizations.
As its name suggests, RStudio IDE is dedicated to R.
RStudio IDE offers a script editor and a console, where you can define variables and display your graphs and plots. It also gives you access to Apache Spark by just typing “sc” in the console.
You can import different packages and manage them. It also offers detailed help documentation.
Through RStudio IDE you can make use of some powerful libraries. One that you can use for data visualization is “Shiny”.
The Shiny library lets you build interactive web-based applications.
RStudio is easy to use for R experts.
Powered by Apache Spark, Seahorse is a data analytics platform that helps you create complex dataflows for ETL (Extract, Transform and Load) and machine learning without writing code.
With the end user in mind, Seahorse offers, through its simple interface, an easy-to-learn way to solve big data problems.
It takes a visual programming approach, where the user can explore and understand the nature of the problem and the logic behind the solution.
Although Seahorse does not require you to write code, you can still customize and predefine sets of actions using Python or R.
Seahorse displays the application workflow as a graph in its simple, clean web-based interface.
Working with Seahorse involves adding a new operation to the workflow, executing the part that has already been created, and checking the results.
The user can monitor and track each step during the whole process, after which the workflow can be exported and deployed as a standalone Spark application on production clusters.
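To make the idea of a workflow graph concrete, here is a conceptual sketch in plain Python. It is not Seahorse's API or implementation: the operation names, the sample data and the `run_workflow` helper are all hypothetical. Each operation is a node, the nodes run in dependency order, and every intermediate result is kept so it can be checked after each step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical operations; each receives the results of its upstream nodes.
def read_rows(_inputs):
    return [" alice ", "BOB", "", "carol"]


def drop_empty(inputs):
    return [value for value in inputs["read"] if value.strip()]


def normalize(inputs):
    return [value.strip().title() for value in inputs["drop_empty"]]


operations = {"read": read_rows, "drop_empty": drop_empty, "normalize": normalize}
# Maps each node to the set of nodes it depends on.
dependencies = {"read": set(), "drop_empty": {"read"}, "normalize": {"drop_empty"}}


def run_workflow(operations, dependencies):
    """Run every node after its dependencies and keep all intermediate results."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies[name]}
        results[name] = operations[name](upstream)
    return results


results = run_workflow(operations, dependencies)
print(results["drop_empty"])  # inspect an intermediate step
print(results["normalize"])   # inspect the final step
```

Adding a new operation means adding one entry to `operations` and one to `dependencies`, which mirrors the add, execute and check loop described above.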
OpenRefine (formerly known as Google Refine) is one of the tools you will need to tame your big data. The ETL (extract, transform, load) process requires data cleaning before it can proceed. OpenRefine lets you import and explore data in different file formats and transform it from one format into another.
OpenRefine has many powerful features that a data scientist may need: it supports clustering (grouping together similar values), editing cells with multiple values, and extending data through web services. It also lets you link different datasets.
You can work with OpenRefine much as you would with MS Excel or MS Access, filtering and partitioning your data using regular expressions.
These features support a thorough cleansing process, especially the powerful clustering feature.
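To give a feel for how clustering groups similar values, here is a minimal sketch of key-collision clustering with a fingerprint key, the general idea behind one of OpenRefine's clustering methods. It is an approximation written for illustration, not OpenRefine's actual code, and the sample city names are made up.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a cell value to a normalized key so that near-duplicates collide."""
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Trim, lowercase and remove punctuation.
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, drop duplicates and sort so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = ["New York", "new york ", "York, New", "Boston", "boston.", "Chicago"]
clusters = cluster(names)
print(clusters)
```

Each cluster is a candidate for merging into a single canonical value; in OpenRefine that decision is left to the user, who reviews the suggested groups before applying them.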