Apache::Hadoop::Config - Perl extension for Hadoop node configuration
use Apache::Hadoop::Config;
Apache Hadoop makes it easy to build a cluster with default settings, but those defaults are not suitable for a wide variety of hardware configurations. This Perl package proposes optimal values for key configuration parameters based on the hardware configuration and user requirements.
It is primarily designed to extract hardware information from the /proc file system, covering CPU cores, system memory, and disk layout. These parameters can also be supplied manually as arguments to generate recommended settings.
This perl package can create namenode and datanode repositories, set appropriate permissions and generate configuration XML files with recommended settings.
The Perl extension Apache::Hadoop::Config is designed to address Hadoop deployment and configuration practices, enabling rapid provisioning of a Hadoop cluster with customization. It has two distinct capabilities: (1) generating configuration files, and (2) creating namenode and datanode repositories.
Ideally, this package needs to be installed on at least one node in the cluster, assuming all nodes have identical hardware. Alternatively, it can be installed on any other machine: the required hardware information can then be supplied as arguments, and the generated configuration files copied to the actual cluster nodes.
To create repositories for the namenode and datanodes, however, the package must be installed on ALL Hadoop cluster nodes.
Create a new Apache::Hadoop::Config object, either using the system configuration or by supplying command-line arguments.
my $h = Apache::Hadoop::Config->new;
Basic configuration and memory settings are generated by two functions. Calling the basic configuration function is required, while the memory configuration function is recommended.
    $h->basic_config;
    $h->memory_config;
The package can print the configuration or write it to XML files independently, using the print and write functions. A writable conf directory must be provided when writing the configuration XML files.
    $h->print_config;
    $h->write_config (confdir=>'etc/hadoop');
Additional configuration parameters can be supplied at the time of creating the object.
    my $h = Apache::Hadoop::Config->new (
        config => {
            'mapred-site.xml' => {
                'mapreduce.task.io.sort.mb' => 256,
            },
            'core-site.xml' => {
                'hadoop.tmp.dir' => '/tmp/hadoop',
            },
        },
    );
These parameters override any automatically generated values built into the package.
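For instance, a value supplied through the config argument takes precedence over the package's computed default (dfs.replication defaults to 1 in the generated hdfs-site.xml). A minimal sketch, assuming the module is installed:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Apache::Hadoop::Config;

    # Override the generated dfs.replication (default 1) for a
    # cluster with three-way replication.
    my $h = Apache::Hadoop::Config->new (
        config => {
            'hdfs-site.xml' => {
                'dfs.replication' => 3,
            },
        },
    );
    $h->basic_config;     # generate defaults; supplied values win
    $h->print_config;     # hdfs-site.xml should show dfs.replication: 3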
The package creates namenode and datanode volumes, and sets the permissions of hadoop.tmp.dir and the log directories. The disk information can be supplied at object construction time.
    my $h = Apache::Hadoop::Config->new (
        hdfs_name_disks => [ '/hdfs/namedisk1', '/hdfs/namedisk2' ],
        hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
        hdfs_tmp        => '/hdfs/tmp',
        hdfs_logdir     => [ '/logs', '/logs/userlog' ],
    );
Note that the name disks and data disks arguments each take an array reference. The package creates all the namenode and datanode volumes, along with the log and tmp directories.
    $h->create_hdfs_name_disks;
    $h->create_hdfs_data_disks;
    $h->create_hdfs_tmpdir;
    $h->create_hadoop_logdir;
Permissions are set as appropriate. It is strongly recommended that this package and its associated scripts be executed by the Hadoop admin user (hduser).
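Since repository creation sets directory ownership and permissions, a driver script may want to refuse to run under the wrong account. A sketch, where the hduser account name is site-specific and the disk paths are illustrative:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Apache::Hadoop::Config;

    # Abort unless invoked as the Hadoop admin user.
    my $user = getpwuid($>);
    die "run this as hduser, not $user\n" unless $user eq 'hduser';

    my $h = Apache::Hadoop::Config->new (
        hdfs_name_disks => [ '/hdfs/namedisk1', '/hdfs/namedisk2' ],
        hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
    );
    $h->create_hdfs_name_disks;
    $h->create_hdfs_data_disks;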
Some of the basic configuration can be customized externally using object arguments. The namenode, secondary namenode, and proxy node can each be customized; the default for each is localhost.
    my $h = Apache::Hadoop::Config->new (
        namenode  => 'nn.myorg.com',
        secondary => 'nn2.myorg.com',
        proxynode => 'pr.myorg.com',
        proxyport => '8888',          # default, optional
    );
These are optional and needed only when the secondary namenode or proxy node is different from the primary namenode.
Below are a few examples of different uses. The first example creates recommended configurations for the localhost, or from data provided on the command line:
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Apache::Hadoop::Config;
    use Getopt::Long;

    my %opts;
    GetOptions (\%opts, 'disks=s', 'memory=s', 'cores=s');

    my $h = Apache::Hadoop::Config->new (
        meminfo  => $opts{'memory'} || undef,
        cpuinfo  => $opts{'cores'}  || undef,
        diskinfo => $opts{'disks'}  || undef,
    );

    # setup configs
    $h->basic_config;
    $h->memory_config;

    # print and save
    $h->print_config;
    $h->write_config (confdir=>'.');

    exit(0);
If no arguments are supplied, the above produces output like this:
    min cont size (mb)    : 256
    num of containers     : 7
    mem per container (mb): 368
    disk : 4
    cpu  : 4
    mem  : 3.52075958251953
    ---------------
    hdfs-site.xml
        dfs.namenode.secondary.http-address: 0.0.0.0:50090
        dfs.replication: 1
        dfs.datanode.data.dir: file:///hdfs/data1,file:///hdfs/data2,file:///hdfs/data3,file:///hdfs/data4
        dfs.namenode.secondary.https-address: 0.0.0.0:50091
        dfs.namenode.name.dir: file:///hdfs/name1,file:///hdfs/name2
    yarn-site.xml
        yarn.web-proxy.address: localhost:8888
        yarn.nodemanager.aux-services: mapreduce_shuffle
        yarn.scheduler.minimum-allocation-mb: 368
        yarn.scheduler.maximum-allocation-mb: 2576
        yarn.nodemanager.aux-services.mapreduce.shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
        yarn.nodemanager.resource.memory-mb: 2576
    core-site.xml
        hadoop.tmp.dir: /hdfs/tmp
        fs.defaultFS: http://localhost:9000
    mapred-site.xml
        mapreduce.reduce.java.opts: -Xmx588m
        mapreduce.map.memory.mb: 368
        mapreduce.map.java.opts: -Xmx294m
        mapreduce.framework.name: yarn
        mapreduce.reduce.memory.mb: 736
    ---------------
    -> writing to ./hdfs-site.xml ...
    -> writing to ./yarn-site.xml ...
    -> writing to ./core-site.xml ...
    -> writing to ./mapred-site.xml ...
When supplied with arguments, for instance for a different cluster, the configuration files can still be generated:
    $ perl hadoop_config.pl --cores 16 --memory 64 --disks 6
    min cont size (mb)    : 2048
    num of containers     : 10
    mem per container (mb): 5734
    disk : 6
    cpu  : 16
    mem  : 64
    ---------------
    hdfs-site.xml
        dfs.namenode.secondary.http-address: 0.0.0.0:50090
        dfs.replication: 1
        dfs.datanode.data.dir: file:///hdfs/data1,file:///hdfs/data2,file:///hdfs/data3,file:///hdfs/data4
        dfs.namenode.secondary.https-address: 0.0.0.0:50091
        dfs.namenode.name.dir: file:///hdfs/name1,file:///hdfs/name2
    yarn-site.xml
        yarn.web-proxy.address: localhost:8888
        yarn.nodemanager.aux-services: mapreduce_shuffle
        yarn.scheduler.minimum-allocation-mb: 5734
        yarn.scheduler.maximum-allocation-mb: 57340
        yarn.nodemanager.aux-services.mapreduce.shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
        yarn.nodemanager.resource.memory-mb: 57340
    core-site.xml
        hadoop.tmp.dir: /hdfs/tmp
        fs.defaultFS: http://localhost:9000
    mapred-site.xml
        mapreduce.reduce.java.opts: -Xmx9174m
        mapreduce.map.memory.mb: 5734
        mapreduce.map.java.opts: -Xmx4587m
        mapreduce.framework.name: yarn
        mapreduce.reduce.memory.mb: 11468
    ---------------
    -> writing to ./hdfs-site.xml ...
    -> writing to ./yarn-site.xml ...
    -> writing to ./core-site.xml ...
    -> writing to ./mapred-site.xml ...
Further customization is possible through the object's constructor arguments.
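Putting the pieces together, a single constructor call can combine node names, disk layout, and property overrides before generating the files. A sketch, where the host names and paths are illustrative only:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Apache::Hadoop::Config;

    my $h = Apache::Hadoop::Config->new (
        namenode  => 'nn.myorg.com',
        secondary => 'nn2.myorg.com',
        proxynode => 'pr.myorg.com',
        hdfs_name_disks => [ '/hdfs/namedisk1' ],
        hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
        config => {
            'core-site.xml' => { 'hadoop.tmp.dir' => '/tmp/hadoop' },
        },
    );

    # generate settings, then write XML files to a writable conf dir
    $h->basic_config;
    $h->memory_config;
    $h->write_config (confdir=>'etc/hadoop');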
hadoop.apache.org - The Hadoop documentation and authoritative source for Apache Hadoop and its components.
Snehasis Sinha, <snehasis@cpan.org>
Copyright (C) 2015 by Snehasis Sinha
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.
To install Apache::Hadoop::Config, copy and paste the appropriate command into your terminal.
cpanm
cpanm Apache::Hadoop::Config
CPAN shell
    perl -MCPAN -e shell
    install Apache::Hadoop::Config
For more information on module installation, please visit the detailed CPAN module installation guide.