
NAME

Apache::Hadoop::Config - Perl extension for Hadoop node configuration

SYNOPSIS

  use Apache::Hadoop::Config;
  # Hadoop configuration setup
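  # a typical session, a minimal sketch using the methods
  # documented in DESCRIPTION below
  my $h = Apache::Hadoop::Config->new;
  $h->basic_config;                           # required
  $h->memory_config;                          # recommended
  $h->print_config;
  $h->write_config (confdir=>'etc/hadoop');   # directory must be writable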

Configuring Apache Hadoop with default settings makes it easy to build a cluster, but those defaults do not suit the wide variety of hardware it may run on. This Perl package proposes optimal values for a number of configuration parameters, based on the hardware configuration and user requirements.

It is primarily designed to extract the hardware configuration from the /proc file system to determine CPU cores, system memory and disk layout. These parameters can also be supplied manually as arguments to generate the recommended settings.

The package can create namenode and datanode repositories, set appropriate permissions, and generate configuration XML files with the recommended settings.

DESCRIPTION

The Perl extension Apache::Hadoop::Config is designed to address Hadoop deployment and configuration practices, enabling rapid provisioning of a Hadoop cluster with customization. It has two distinct capabilities: (1) generating configuration files, and (2) creating namenode and datanode repositories.

Ideally this package is installed on at least one node of the cluster, assuming all nodes have identical hardware. Alternatively, it can be installed on any other machine; the required hardware information is then supplied as arguments, and the generated configuration files are copied to the actual cluster nodes.

To create the namenode and datanode repositories, however, the package must be installed on ALL Hadoop cluster nodes.

Create a new Apache::Hadoop::Config object, either using the system configuration or by supplying values from command-line arguments.

         my $h = Apache::Hadoop::Config->new; 

Basic configuration and memory settings are generated by two methods. Calling basic_config is required, while memory_config is recommended.

        $h->basic_config;
        $h->memory_config;

The package can print the configuration or write the XML configuration files independently, using the print_config and write_config methods. A writable conf directory must be supplied via the confdir argument in order to write the XML files.

        $h->print_config;
        $h->write_config (confdir=>'etc/hadoop');

Additional configuration parameters can be supplied at the time of creating the object.

        my $h = Apache::Hadoop::Config->new (
            config=> {
                'mapred-site.xml' => {
                    'mapreduce.task.io.sort.mb' => 256,
                },
                'core-site.xml'   => {
                    'hadoop.tmp.dir' => '/tmp/hadoop',
                },
            },
        );

These parameters override any automatically generated values built into the package.
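A complete flow with an override might look like the following (a minimal sketch, re-using the mapred-site.xml override from above):

        my $h = Apache::Hadoop::Config->new (
            config => {
                'mapred-site.xml' => {
                    # user-supplied value; takes precedence over the
                    # recommendation computed by memory_config
                    'mapreduce.task.io.sort.mb' => 256,
                },
            },
        );
        $h->basic_config;
        $h->memory_config;
        $h->print_config;                          # overridden value is shown
        $h->write_config (confdir=>'etc/hadoop');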

The package creates the namenode and datanode volumes, and sets permissions on hadoop.tmp.dir and the log directories. The disk information can be supplied at object construction time.

        my $h = Apache::Hadoop::Config->new (
            hdfs_name_disks => [ '/hdfs/namedisk1', '/hdfs/namedisk2' ],
            hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
            hdfs_tmp        => '/hdfs/tmp',
            hdfs_logdir     => [ '/logs', '/logs/userlog' ],
            );

Note that hdfs_name_disks and hdfs_data_disks each take a reference to an array. The following methods create all the namenode and datanode volumes, along with the log and tmp directories:

        $h->create_hdfs_name_disks;
        $h->create_hdfs_data_disks;
        $h->create_hdfs_tmpdir;
        $h->create_hadoop_logdir;

Permissions are set as appropriate. It is strongly recommended that this package and its associated scripts be executed as the Hadoop admin user (hduser).
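Putting these calls together, a node-preparation script might look like the following (a sketch; the disk paths are illustrative and should match each node's actual mounts, and the script is assumed to run as hduser):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Apache::Hadoop::Config;

        # illustrative disk layout; adjust per node
        my $h = Apache::Hadoop::Config->new (
            hdfs_name_disks => [ '/hdfs/namedisk1', '/hdfs/namedisk2' ],
            hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
            hdfs_tmp        => '/hdfs/tmp',
            hdfs_logdir     => [ '/logs', '/logs/userlog' ],
        );

        # create volumes and directories, setting permissions
        $h->create_hdfs_name_disks;
        $h->create_hdfs_data_disks;
        $h->create_hdfs_tmpdir;
        $h->create_hadoop_logdir;

        exit(0);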

Some of the basic configuration can be customized through constructor arguments. The namenode, secondary namenode, and proxy node can each be set; the default for each is localhost.

        my $h = Apache::Hadoop::Config->new (
            namenode => 'nn.myorg.com',
            secondary=> 'nn2.myorg.com',
            proxynode=> 'pr.myorg.com',
            proxyport=> '8888', # default, optional
            );

These arguments are optional and are needed only when the secondary namenode or proxy node differs from the primary namenode.
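For example, to generate configuration for a remote cluster from a machine outside it (a sketch; the hostnames and hardware figures are illustrative, and the hardware arguments are the same meminfo, cpuinfo and diskinfo used in EXAMPLES below):

        use strict;
        use warnings;
        use Apache::Hadoop::Config;

        my $h = Apache::Hadoop::Config->new (
            namenode  => 'nn.myorg.com',
            secondary => 'nn2.myorg.com',
            proxynode => 'pr.myorg.com',
            meminfo   => 64,    # GB of memory per node (illustrative)
            cpuinfo   => 16,    # cores per node
            diskinfo  => 6,     # data disks per node
        );
        $h->basic_config;
        $h->memory_config;

        # write locally, then copy the XML files to the cluster nodes
        $h->write_config (confdir => '.');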

EXAMPLES

Below are a few examples of different uses. The first example creates the recommended configuration for the localhost, or for hardware described on the command line:

        #!/usr/bin/perl -w
        use strict;
        use warnings;
        use Apache::Hadoop::Config;
        use Getopt::Long;
        
        # hardware description from the command line (all optional)
        my %opts;
        GetOptions (\%opts, 'disks=s', 'memory=s', 'cores=s');
        
        my $h = Apache::Hadoop::Config->new (
                meminfo=>$opts{'memory'} || undef,
                cpuinfo=>$opts{'cores'} || undef,
                diskinfo=>$opts{'disks'} || undef,
                );
        
        # setup configs
        $h->basic_config;
        $h->memory_config;
        
        # print and save
        $h->print_config;
        $h->write_config (confdir=>'.');
        
        exit(0);

If no arguments are supplied, the above produces output like the following:

        min cont size (mb)    : 256
        num of containers     : 7
        mem per container (mb): 368
         disk : 4
          cpu : 4
          mem : 3.52075958251953
        ---------------
        hdfs-site.xml
          dfs.namenode.secondary.http-address: 0.0.0.0:50090
          dfs.replication: 1
          dfs.datanode.data.dir: file:///hdfs/data1,file:///hdfs/data2,file:///hdfs/data3,file:///hdfs/data4
          dfs.namenode.secondary.https-address: 0.0.0.0:50091
          dfs.namenode.name.dir: file:///hdfs/name1,file:///hdfs/name2
        yarn-site.xml
          yarn.web-proxy.address: localhost:8888
          yarn.nodemanager.aux-services: mapreduce_shuffle
          yarn.scheduler.minimum-allocation-mb: 368
          yarn.scheduler.maximum-allocation-mb: 2576
          yarn.nodemanager.aux-services.mapreduce.shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
          yarn.nodemanager.resource.memory-mb: 2576
        core-site.xml
          hadoop.tmp.dir: /hdfs/tmp
          fs.defaultFS: http://localhost:9000
        mapred-site.xml
          mapreduce.reduce.java.opts: -Xmx588m
          mapreduce.map.memory.mb: 368
          mapreduce.map.java.opts: -Xmx294m
          mapreduce.framework.name: yarn
          mapreduce.reduce.memory.mb: 736
        ---------------
        -> writing to ./hdfs-site.xml ...
        -> writing to ./yarn-site.xml ...
        -> writing to ./core-site.xml ...
        -> writing to ./mapred-site.xml ...

If arguments are supplied, for instance to describe a different cluster's hardware, the configuration files can still be generated:

        $ perl hadoop_config.pl --cores 16 --memory 64 --disks 6
        min cont size (mb)    : 2048
        num of containers     : 10
        mem per container (mb): 5734
         disk : 6
          cpu : 16
          mem : 64
        ---------------
        hdfs-site.xml
          dfs.namenode.secondary.http-address: 0.0.0.0:50090
          dfs.replication: 1
          dfs.datanode.data.dir: file:///hdfs/data1,file:///hdfs/data2,file:///hdfs/data3,file:///hdfs/data4
          dfs.namenode.secondary.https-address: 0.0.0.0:50091
          dfs.namenode.name.dir: file:///hdfs/name1,file:///hdfs/name2
        yarn-site.xml
          yarn.web-proxy.address: localhost:8888
          yarn.nodemanager.aux-services: mapreduce_shuffle
          yarn.scheduler.minimum-allocation-mb: 5734
          yarn.scheduler.maximum-allocation-mb: 57340
          yarn.nodemanager.aux-services.mapreduce.shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
          yarn.nodemanager.resource.memory-mb: 57340
        core-site.xml
          hadoop.tmp.dir: /hdfs/tmp
          fs.defaultFS: http://localhost:9000
        mapred-site.xml
          mapreduce.reduce.java.opts: -Xmx9174m
          mapreduce.map.memory.mb: 5734
          mapreduce.map.java.opts: -Xmx4587m
          mapreduce.framework.name: yarn
          mapreduce.reduce.memory.mb: 11468
        ---------------
        -> writing to ./hdfs-site.xml ...
        -> writing to ./yarn-site.xml ...
        -> writing to ./core-site.xml ...
        -> writing to ./mapred-site.xml ...

Further customization is possible through the object's constructor arguments.

SEE ALSO

hadoop.apache.org - the Hadoop documentation and the authoritative source for Apache Hadoop and its components.

AUTHOR

Snehasis Sinha, <snehasis@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015 by Snehasis Sinha

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.