100 tips for BigData

What is Big Data

Big Data is characterized by three Vs:

  • Volume
  • Variety
  • Velocity

    Volume

    The amount of data is large, and the data is more granular.

    Variety

    There are a variety of data sources: social, web, transaction, IoT, etc.

    Velocity

    Data arrives at high velocity. It is also difficult to design a schema ahead of time.

    -------------

    What is Hadoop

    Hadoop is a distributed, scalable system that runs on commodity hardware. It consists of:

  • HDFS
  • MapReduce
  • Others: HBase, R, Pig, Hive, Flume, Mahout, etc.

    HDFS is a distributed file system. MapReduce is a programming model.

    Advantages of Hadoop

  • High availability due to data replication
  • Scales horizontally
  • Uses commodity hardware
  • Reduces I/O because data is stored locally
  • Open source
  • Connectors available in many languages
  • Resource manager abstracts the underlying work

    In BI, ETL is used to Extract, Transform and Load data. In Hadoop, ITL is used. ITL refers to:

  • Ingest structured/unstructured data from heterogeneous systems
  • Transform data to a structured format
  • Load into a database
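
    The three ITL steps above can be sketched in a few lines of Python. This is only an illustration, not how a Hadoop pipeline is actually written: the two input sources, the field names, and the sales table are all made up, and an in-memory SQLite database stands in for the target database.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs from two different systems (made up for illustration).
WEB_LOG_JSON = '[{"user": "Alice", "amount": "12.50"}, {"user": "bob", "amount": "7"}]'
POS_CSV = "user,amount\ncarol,3.25\nalice,1.00\n"


def ingest():
    """Ingest: read records from heterogeneous sources as plain dicts."""
    records = json.loads(WEB_LOG_JSON)
    records += list(csv.DictReader(io.StringIO(POS_CSV)))
    return records


def transform(records):
    """Transform: coerce every record into one structured (user, amount) shape."""
    return [(r["user"].strip().lower(), float(r["amount"])) for r in records]


def load(rows, conn):
    """Load: write the structured rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (user TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(ingest()), conn)

# Once loaded, the data can be queried like any other structured table.
totals = dict(conn.execute("SELECT user, SUM(amount) FROM sales GROUP BY user"))
print(totals)
```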

    -------------

    Hadoop I/O and computation

    Hadoop jobs are generally bottlenecked by I/O. Compressing the data reduces the I/O bottleneck and also speeds up data transfer. However, compression comes at a cost, because CPU usage and processing time increase. One should trade off between I/O and computation.
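
    The trade-off can be observed on a small scale with Python's standard zlib module. This is a rough sketch, not a Hadoop benchmark: the log line is made up and highly repetitive, so real data will compress differently, and zlib simply stands in for whatever codec a cluster actually uses.

```python
import time
import zlib

# Made-up, highly repetitive log data; real data will compress differently.
raw = b"2015-01-01 INFO user=42 action=click page=/home\n" * 50_000


def measure(level):
    """Return (compressed size in bytes, CPU seconds) for one zlib level."""
    start = time.process_time()
    compressed = zlib.compress(raw, level)
    elapsed = time.process_time() - start
    # Round-trip check: compression must be lossless.
    assert zlib.decompress(compressed) == raw
    return len(compressed), elapsed


for level in (1, 6, 9):
    size, cpu = measure(level)
    ratio = len(raw) / size
    print(f"level={level} size={size} bytes ratio={ratio:.1f}x cpu={cpu:.4f}s")
```

    Fewer bytes mean less disk and network I/O, while the CPU column shows what that saving costs in computation.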

    -------------

    MapReduce in Hadoop

    Map distributes the computation; Reduce combines the results.

    Hadoop is like the engine and MapReduce is the driver.

    MapReduce is easy to understand, scalable, and flexible.

    However, it works well only for batch processing and is not the best solution for complex workflows.

    MapReduce also integrates with R.
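
    The idea that Map distributes the computation and Reduce combines the results can be sketched as a word count in plain Python. This is a single-process illustration of the programming model, not Hadoop's API: the function names and sample documents are made up, and the shuffle step that Hadoop performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Each document could be mapped on a different node; here they run in a loop.
documents = ["big data is big", "data is everywhere"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```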

    -------------

    R Language

    R is an open source procedural language that is optimized for data science and statistics work. R also provides a data visualization framework.

    Advantages

  • Suitable for data science and statistics work
  • Code once, deploy anywhere
  • Open source, big community, and universities teach the R language

    R Limitations

  • Holds all data in memory
  • Memory management is not efficient
  • Dynamically typed language
  • Interpreted
  • Garbage collection is poor
  • Lacks parallel computation

    -------------

    Microsoft Revolution R

    R's memory limitation has two potential solutions:

  • Native R solution
  • Use external tools, e.g. Hadoop, RDBMS

    Revolution R helps you do both.

    You continue to get the advantages of R while getting better speed and better scale.

    In addition, Microsoft Revolution R provides a web-service-based integration platform, a powerful IDE, and integration.

    Revolution R also has a high-performance math library for linear algebra functions.

    -------------

    Data storage options - Microsoft Technology

    Understand the data and the requirements

  • Who will use the data
  • How the data will be used
  • Where it will be used


    If the data is in a text file:
    - Search
    - Process using PowerShell
    - Process using GNU tools (grep, awk, overlay)

    If the data is structured and stored locally (only 1 person uses it), the options are:
    - Excel, Access, SQL Server Compact Edition, SQL Server Express, SQL Server LocalDB
    - Excel - searchable, data model for Power BI, reporting
    - Access - provides tables and views and makes them queryable. Good for reporting
    - SQL Server Compact - good for SQL features, queries
    - SQL Server Express - SQL queries, security
    - SQL Server LocalDB - great for speed and lightweight

    If the data is structured and needs to be used by multiple users:
    - SQL Server - for relational databases. Provides ACID, security
    - APS - suitable for big scale, common query, dedicated hardware
    - Azure Tables - key-value pairs, suitable for semi-structured data
    - Access - can be used by multiple users, with limitations
    - SharePoint - suitable for semi-structured & FAST structured

    If the data is unstructured:
    - SharePoint - great for documents, FAST Search
    - APS - suitable for cross query, fast/large
    - HDP - multi-use, multi-function, multi-platform component
    - HDInsight - multi-use, multi-function, multi-platform component
    - Microsoft Azure Search - standard web query, index based
    - DocumentDB - schema-free, RESTful, write optimized, transactional JavaScript
    -------------

    Types of analysis on data

  • Descriptive
  • Predictive
  • Prescriptive

    Descriptive

    Descriptive analytics is traditional BI. It uses historical data and shows reports, scorecards, and dashboards. Aggregates, Power BI, and DWH methods are used for descriptive analytics.

    Predictive

    Predictive analytics finds patterns in data and uses those past patterns to predict the future. Generally, machine learning, forecasting functions, and statistical analysis are used for predictive analytics.

    Prescriptive

    Prescriptive analytics goes one step beyond predictive analytics. It suggests what should be done to get a desired outcome. Machine learning is used for prescriptive analytics too.
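
    The difference between the three types can be shown on one tiny dataset. The sketch below is an assumption-laden toy: the spend and sales figures are invented, and a hand-written least-squares line stands in for the machine learning or forecasting methods a real project would use.

```python
# Made-up monthly figures: advertising spend and the sales that followed.
spend = [10.0, 20.0, 30.0, 40.0]
sales = [25.0, 45.0, 65.0, 85.0]


def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Descriptive: summarize what already happened.
average_sales = mean(sales)

# Predictive: use the past pattern to estimate sales at a new spend level.
slope, intercept = fit_line(spend, sales)
predicted_sales = slope * 50.0 + intercept

# Prescriptive: work backwards from a desired outcome to a recommended action.
target_sales = 125.0
recommended_spend = (target_sales - intercept) / slope

print(average_sales, predicted_sales, recommended_spend)
```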

    -------------

    What is the life cycle for analyzing Big Data

  • Business case determination
  • Identify data and sources
  • Acquire data
  • Extract data
  • Filter and clean data
  • Aggregate
  • Do data analysis
  • Provide visualization
  • Integrate with business process
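
    The middle steps of this life cycle (cleaning, filtering, aggregating) can be sketched as a small chain of Python functions. Everything here is hypothetical: the sensor readings, the field names, and the plausible temperature range are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical raw sensor readings; None and -999 stand in for bad data.
raw_readings = [
    {"sensor": "a", "temp": "21.5"},
    {"sensor": "a", "temp": None},
    {"sensor": "b", "temp": " 19.0 "},
    {"sensor": "b", "temp": "-999"},
    {"sensor": "a", "temp": "22.5"},
]


def clean(readings):
    """Cleaning: drop missing values and convert text to numbers."""
    for reading in readings:
        if reading["temp"] is not None:
            yield {"sensor": reading["sensor"], "temp": float(reading["temp"])}


def keep_valid(readings, low=-50.0, high=60.0):
    """Filtering: keep readings inside an assumed plausible range."""
    return (r for r in readings if low <= r["temp"] <= high)


def aggregate(readings):
    """Aggregation: average temperature per sensor."""
    totals = defaultdict(lambda: [0.0, 0])
    for reading in readings:
        totals[reading["sensor"]][0] += reading["temp"]
        totals[reading["sensor"]][1] += 1
    return {sensor: total / count for sensor, (total, count) in totals.items()}


averages = aggregate(keep_valid(clean(raw_readings)))
print(averages)
```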

    -------------

    What are the deployment options for a Big Data solution

  • Local deployment
  • On-premise cluster
  • Cloud cluster

    Local deployment

    It is great for initial analysis where 1 person is doing the work. One can use a personal computer with an external hard disk, and there is no additional cost.

    On-premise cluster

    To operationalize a big data solution for an enterprise, you need a cluster. It costs money and is not easy to scale: increasing capacity can take many days. The cost of managing it is also very high.

    Cloud cluster

    A cloud cluster can be set up in hours, and there is no initial setup cost, but you pay a monthly fee. It is easy to scale if you are using a PaaS solution.

    -------------