Hadoop - Big Data Development

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.

The Apache Hadoop framework is composed of the following modules:

Hadoop Common – contains libraries and utilities needed by other Hadoop modules.

Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.

Hadoop MapReduce – a programming model for large-scale data processing.

All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines or racks of machines) are common and should therefore be handled automatically in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.
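The MapReduce programming model named above can be illustrated with a minimal, single-process sketch. This is plain Python rather than the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It mimics the three steps the framework performs: map each record to key/value pairs, group the pairs by key, and reduce each group.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


# Illustrative input only; a real job would read its input from HDFS.
counts = word_count(["big data on Hadoop", "Hadoop stores big data"])
print(counts)
```

In a real cluster the map and reduce functions run as separate tasks on different machines, and the shuffle moves data across the network; the sketch keeps everything in one process to show only the data flow.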

Basics of Hadoop:
    Motivation for Hadoop
    Large-scale systems
    Survey of data storage literature
    Survey of data processing literature
    Networking constraints
    Requirements for a new approach
    Basic concepts of Hadoop

What is Hadoop?
    Distributed file system of Hadoop
    How MapReduce works in Hadoop
    Hadoop cluster and its anatomy
    Hadoop daemons
    Master daemons
    NameNode
    JobTracker
    Secondary NameNode
    Slave daemons
    TaskTracker
    HDFS (Hadoop Distributed File System)
    Splits and blocks
    Input splits
    HDFS splits
    Replication of data
    Hadoop rack awareness
    High availability of data
    Block placement and cluster architecture
    Best practices and performance tuning
    Development of MapReduce programs
    Local mode
    Running without HDFS
    Pseudo-distributed mode
    All daemons running on a single node
    Fully distributed mode
    Dedicated nodes and daemon running
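The topics on blocks and replication above come down to simple arithmetic, sketched here under stated assumptions. The helper name `block_layout` is hypothetical. The defaults of a 128 MiB block size and a replication factor of 3 are assumptions matching common Hadoop 2.x settings (`dfs.blocksize` and `dfs.replication`); a given cluster may be configured differently.

```python
MIB = 1024 * 1024


def block_layout(file_size_bytes, block_size_bytes=128 * MIB, replication=3):
    """Return (blocks, block_replicas, raw_bytes_stored) for one file.

    The defaults are assumptions, not values taken from any specific cluster.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication at least 1")
    # Integer ceiling division: the last block may be only partially filled.
    blocks = -(-file_size_bytes // block_size_bytes)
    # HDFS does not pad the final block, so raw usage is size times replication.
    raw_bytes = file_size_bytes * replication
    return blocks, blocks * replication, raw_bytes


# A 300 MiB file splits into blocks of 128, 128, and 44 MiB.
blocks, replicas, raw = block_layout(300 * MIB)
print(blocks, replicas, raw // MIB)
```

With rack awareness enabled, the replicas of each block are spread across racks so that the loss of a single rack does not remove every copy.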

Hadoop administration
    Setting up a Hadoop cluster with the Cloudera, Apache, Greenplum, and Hortonworks distributions
    Set up a full Hadoop cluster on a single desktop
    Install and configure Apache Hadoop on a multi-node cluster
    Install and configure the Cloudera distribution in distributed mode
    Install and configure the Hortonworks distribution in fully distributed mode
    Configure the Greenplum distribution in fully distributed mode
    Monitor the cluster
    Get familiar with the Hortonworks and Cloudera management consoles
    NameNode in safe mode
    Data backup
    Case studies
    Monitoring of clusters

Hadoop Development:
    Writing a MapReduce program
    Sample MapReduce program
    API concepts and their basics
    Driver code
    Hadoop Streaming API
    Running several Hadoop jobs
    The configure and close methods
    Sequence files
    RecordReader
    RecordWriter
    Role of the Reporter
    OutputCollector
    Accessing HDFS
    ToolRunner
    Using the DistributedCache

Several MapReduce jobs (in detail)
    Identification of mapper
    Identification of reducer
    Exploring the problems using this application
    Debugging MapReduce programs
    MRUnit testing
    Debugging strategies
    Advanced MapReduce Programming
    Secondary sort
    Input and output format customization
    MapReduce joins
    Monitoring & debugging on a Production Cluster
    Skipping Bad Records
    Running in local mode
    MapReduce performance tuning
    Reducing network traffic with a combiner
    Reducing the input data
    Using Compression
    Reusing the JVM
    Running speculative execution
    Performance Aspects
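The topic of reducing network traffic with a combiner, listed above, can be shown with a small simulation. This is a plain-Python sketch, not Hadoop code; the names `map_words`, `combine`, and `run` are hypothetical. Each inner list stands for one input split handled by one map task, and the number of records handed to the shuffle is counted with and without a map-side combiner.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one map task's output before it is shuffled."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def run(splits, use_combiner):
    """Return (final totals, number of records sent through the shuffle)."""
    shuffled = []
    for split in splits:  # one simulated map task per split
        pairs = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    totals = Counter()
    for key, value in shuffled:  # simulated reduce step: sum per key
        totals[key] += value
    return dict(totals), len(shuffled)


# Illustrative input: two splits of short lines.
splits = [["a b a", "a c"], ["b b a"]]
plain_totals, plain_records = run(splits, use_combiner=False)
combined_totals, combined_records = run(splits, use_combiner=True)
print(plain_records, combined_records)
```

The totals are identical in both runs because summation is associative and commutative; a combiner is only safe when the reduce operation has that property.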

CDH4 Enhancements:
    1. NameNode high availability
    2. NameNode federation
    3. Fencing
    4. MapReduce 2

    1. Concepts of Hive
    2. Hive and its architecture
    3. Install and configure Hive on a cluster
    4. Types of tables in Hive
    5. Hive library functions
    6. Buckets
    7. Partitions
    8. Joins
        1. Inner joins
        2. Outer joins
    9. Hive UDFs

    1. Pig basics
    2. Install and configure Pig
    3. Pig library functions
    4. Pig vs. Hive
    5. Writing sample Pig Latin scripts
    6. Modes of running
        1. Grunt shell
        2. Java program
    7. Pig UDFs
    8. Pig macros
    9. Debugging Pig

    1. Differences between Impala, Hive, and Pig
    2. Does Impala give good performance?
    3. Exclusive features
    4. Impala and its challenges
    5. Use cases

    1. HBase
    2. HBase concepts
    3. HBase architecture
    4. Basics of HBase
    5. Server architecture
    6. File storage architecture
    7. Column access
    8. Scans
    9. HBase use cases
    10. Installation and configuration of HBase on a multi-node cluster
    11. Create a database; develop and run sample applications
    12. Access data stored in HBase using clients such as Python, Java, and Perl
    13. MapReduce client
    14. HBase and Hive integration
    15. HBase administration tasks
    16. Defining a schema and its basic operations
    17. Cassandra Basics
    18. MongoDB Basics

Ecosystem Components
    1. Sqoop
    2. Install and configure Sqoop
    3. Connecting to an RDBMS
    4. Installation of MySQL
    5. Importing data from Oracle/MySQL into Hive
    6. Exporting data to Oracle/MySQL
    7. Internal mechanism

    1. Oozie and its architecture
    2. XML file
    3. Install and configure Apache Oozie
    4. Specifying the workflow
    5. Action nodes
    6. Control nodes
    7. Job coordinator

Avro, Scribe, Flume, Chukwa, Thrift
    1. Concepts of Flume and Chukwa
    2. Use cases of Scribe, Thrift and Avro
    3. Installation and configuration of Flume
    4. Creation of a sample application

Challenges of Hadoop
    1. Hadoop recovery
    2. Use cases suited to Hadoop