BigSheets-Spread sheet like interface to HDFS files
Bigsheets is a browser-based tool that is included in the BigInsights data scientist package or data analyst package, to analyze and visualize big data.BigSheets uses a spreadsheet-like interface that can model, filter, combine, and chart data collected from multiple sources, such as an application work on big data environment. Since big sheets is a service running on hadoop cluster, user does not need to worry about connectivity.
In this demo we will see how to
- Create a master workbook from existing file in HDFS
- Tailor data by creating child workbook
- Create columns after grouping data
- How to create quick charts
- Export data to other formats
Bigsheets is available in application tab on IBM® InfoSphere® BigInsights™ Enterprise Edition.
Click on Bigsheets tab and launch the application.
Bigsheets works on Hadoop ecosystem files and tables (hive). Unlike MS excel bigsheets does not store data in application, data is always stored in HDFS by default or table (if you save). To start working on sheets we need to create a master work book. In ETL analogy this can be considered as source file.
- Sheets can work on files on HDFS as well as tables. Files on HDFS can be viewed on DFS Files and tables can be viewed from Catalog Tables.
- Choose the file which you want to work on navigating through the folders. Once the file is selected you can see a preview of data on right hand side window. Name the workbook with an option
description. Description will help while you are working on multiple files. File used for demo is available on spatialkey-sample-csv-data
- In the preview, the data is selected as a single column. This format can be modified by choosing appropriate file type from line reader. Click on icon near line reader and choose the type of file you are working on. Sheets give the ability to work in variety of data. Sample file used here is a CSV file with a header.Choose CSV data.
- Check header option if header is available in data. If no header available uncheck the field. Sheets will create default column names to start with which can be modified later.
- Click on the dropdown near field names to validate datatype. You can see that sheets has inferred the datatype based on data. This can be modified if you need to work on with different datatypes than default.
- A master workbook is created and to continue working on this data, we need to create child workbooks. Sheets does not manipulate the original files instead
it create a copy of the parent data file and let you modify.
Create child workbook for Analysis
- Click on build new workbook. This will let you create a child workbook.
- Once child book is created data can be manipulated using formula (spreadsheet like) on fields. New fields can be derived from existing ones. To add a new field click on the dropdown button and Insert field to Right. Once field added, click on the fx window to write formula. I added a field calledSateCodeExpanded.
- Please refer to knowledgecenter for bigsheets function help. Function can be auto searched by typing on formula bar.
- Other fields in the sheet can be accessed by prefixing # with column name. In fx add
- Wide range of functions are available for Math and String manipulation. Refer to knowledgecenterfor all available functions. Once data formatting is done save the child worksheet. You can optionally change the name of worksheet clicking on
Analytical functions on data
- Sheets give ability to run wide range of analytical function on data. To add a new function click on add sheets. When a function is applied, sheets create a new child sheet and so on.Common used functions are Join, Union and Group etc. Click on Group.
- In this demo I am showing how to find number of policies per County (Field name). Name sheet asCount_of_Policy_Per_County and add fields to group b column list. Choose county andstatecodeExpanded (field we added in last step)
- Once columns are added to group by list, move to calculate tab.
- Once again, we are trying to find policy count per county and State combination. In order to do that type a new column name and click button.
- Adding an aggregate field gives option of wide range of aggregate functions like SUM, MAX, MIN etc. Click on COUNT option on field PolicyID and check this will create result data. The result can be saved to a new file.
- Click on overview button, this will give you an overview of the transformation done on different stages.
- Big Sheets under the hood run map reduce and pig scripts. So every time the data changes in file you may need to run the job. Go back to home page where the final data set was saved.
- Big sheets under the hood run map reduce and pig scripts. So every time the data changes in file you may need to run the job. Go back to home page where the final data set was saved. Reopen the workbook you will get an option to run the job again. Once Ran result are displayed.
Exporting Data and Saving data to tables
- Big sheet master and child workbook can be saved to different data formats. Open new or existing workbook. Click on export data to save data. Data can be exported to JSON , CSV , TSV etc
- Alternatively data can also be saved to hive tables. This can be accessed via SAS /Tableu for analysis. Click on the create table option and choose schema name and table name. Sheets create table automatically with inferred datatype.
Creating Quick Charts
- Once the data is analyzed and summarized quick charts can be created from this data. Bigsheets come with a few out of box options to create charts. Click on create chart at the bottom of the page (Add chart). Add chart options will be visible in property bar on page or near the workbook edit option
- Click on the chart option and choose the type of chart
- Chart chosen here is Bar type. Fill in the options for Bar type. Options for other charts will vary obviously. Pick the data columns to represent x and y coordinates from the summary data.
- Once saved, you can see a preview chart is built up. Click on the run button to generate chart with real underlying data.
All set, chart is plotted .
Hope this article helps to get a quick start on BigSheets.