Net::Amazon::EMR - API for Amazon's Elastic Map-Reduce service
  use Net::Amazon::EMR;

  my $emr = Net::Amazon::EMR->new(
      AWSAccessKeyId  => $AWS_ACCESS_KEY_ID,
      SecretAccessKey => $SECRET_ACCESS_KEY,
      ssl             => 1,
  );

  # start a job flow
  my $id = $emr->run_job_flow(
      Name      => "Example Job",
      Instances => {
          Ec2KeyName                  => 'myKeyId',
          InstanceCount               => 10,
          KeepJobFlowAliveWhenNoSteps => 1,
          MasterInstanceType          => 'm1.small',
          Placement                   => { AvailabilityZone => 'us-east-1a' },
          SlaveInstanceType           => 'm1.small',
      },
      BootstrapActions => [{
          Name => 'Bootstrap-configure',
          ScriptBootstrapAction => {
              Path => 's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
              Args => [ '-m', 'mapred.compress.map.output=true' ],
          },
      }],
      Steps => [{
          ActionOnFailure => 'TERMINATE_JOB_FLOWS',
          Name            => "Set up debugging",
          HadoopJarStep   => {
              Jar  => 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
              Args => [ 's3://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch' ],
          },
      }],
  );

  print "Job flow id = " . $id->JobFlowId . "\n";

  # Get details of just-launched job
  $result = $emr->describe_job_flows(JobFlowIds => [ $id->JobFlowId ]);

  # or get details of all jobs created after a given time
  $result = $emr->describe_job_flows(CreatedAfter => '2012-12-17T07:19:57Z');

  # or use DateTime
  $result = $emr->describe_job_flows(
      CreatedAfter => DateTime->new(year => 2012, month => 12, day => 17));

  # See the details of the typed result
  use Data::Dumper;
  print Dumper($result);

  # or dispense with types and see the details as a perl hash
  use Data::Dumper;
  print Dumper($result->as_hash);

  # Flexible Booleans - 1, 0, undef, 'true', 'false'
  $emr->set_visible_to_all_users(JobFlowIds => $id, VisibleToAllUsers => 1);
  $emr->set_termination_protection(
      JobFlowIds           => [ $id->JobFlowId ],
      TerminationProtected => 'false',
  );

  # Add map-reduce steps and execute
  $emr->add_job_flow_steps(
      JobFlowId => $job_id,
      Steps     => [{
          ActionOnFailure => 'CANCEL_AND_WAIT',
          Name            => "Example",
          HadoopJarStep   => {
              Jar  => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
              Args => [
                  '-input',   's3://my-bucket/my-input',
                  '-output',  's3://my-bucket/my-output',
                  '-mapper',  '/path/to/mapper-script',
                  '-reducer', '/path/to/reducer-script',
              ],
              Properties => [
                  { Key => 'reduce_tasks_speculative_execution', Value => 'false' },
              ],
          },
      },
      ...
  ]);
This is an implementation of the Amazon Elastic Map-Reduce API.
new() is the constructor. Options are as follows:
AWSAccessKeyId (required)
Your AWS access key.
SecretAccessKey (required)
Your secret key.
base_url (optional)
The base URL for your chosen Amazon region; see http://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region. If not specified, the default URL is used (which implies region us-east-1).
  my $emr = Net::Amazon::EMR->new(
      AWSAccessKeyId  => $AWS_ACCESS_KEY_ID,
      SecretAccessKey => $SECRET_ACCESS_KEY,
      base_url        => 'https://elasticmapreduce.us-west-2.amazonaws.com',
  );
ssl (optional)
If set to a true value, the default base_url will use https:// instead of http://. Defaults to true.
The ssl flag is not used if base_url is set explicitly.
max_failures (optional)
Number of times to retry if a communications failure occurs, before raising an exception. Defaults to 5.
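Putting the options together, a typical constructor call might look like the following sketch (the key values are placeholders; ssl is omitted because it is ignored when base_url is given explicitly):

```perl
use Net::Amazon::EMR;

# All constructor options together; credential values are placeholders.
my $emr = Net::Amazon::EMR->new(
    AWSAccessKeyId  => 'AKIDEXAMPLE',
    SecretAccessKey => 'secret',
    base_url        => 'https://elasticmapreduce.us-west-2.amazonaws.com',
    max_failures    => 3,   # retry up to 3 times on communications failure
);
```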
Detailed information on each of the methods can be found in the Amazon EMR API documentation. Each method takes a hash of parameters using the names given in the documentation. Parameter passing uses the following rules:
Array inputs such as InstanceGroups.member.N use their primary name and a Perl ArrayRef, i.e. InstanceGroups => [ ... ] in this example.
Either hashes or object instances may be passed in; e.g both of the following forms are acceptable:
  $emr->run_job_flow(
      Name      => "API Test Job",
      Instances => {
          Ec2KeyName    => 'xxx',
          InstanceCount => 1,
      },
  );

  $emr->run_job_flow(
      Name      => "API Test Job",
      Instances => Net::Amazon::EMR::JobFlowInstancesConfig->new(
          Ec2KeyName    => 'xxx',
          InstanceCount => 1,
      ),
  );
Otherwise, the names of parameters are exactly as found in the Amazon documentation for API version 2009-03-31.
AddInstanceGroups adds an instance group to a running cluster. Returns a Net::Amazon::EMR::AddInstanceGroupsResult object.
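For example, to add a group of task nodes to a running cluster (the job flow id is hypothetical, and the field names follow the InstanceGroupConfig structure in the Amazon API documentation):

```perl
# Add 4 on-demand task nodes to a running cluster.
my $result = $emr->add_instance_groups(
    JobFlowId      => 'j-3UN6WX5RRO2AG',   # hypothetical job flow id
    InstanceGroups => [{
        Name          => 'Extra task nodes',
        InstanceRole  => 'TASK',
        InstanceType  => 'm1.small',
        InstanceCount => 4,
        Market        => 'ON_DEMAND',
    }],
);
```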
AddJobFlowSteps adds new steps to a running job flow. Returns 1 on success.
DescribeJobFlows returns a Net::Amazon::EMR::DescribeJobFlowsResult object that describes the job flows that match all of the supplied parameters.
ModifyInstanceGroups modifies the number of nodes and configuration settings of an instance group. Returns 1 on success.
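For example, to resize an instance group to 8 nodes (the instance group id is hypothetical):

```perl
# Resize an existing instance group.
$emr->modify_instance_groups(
    InstanceGroups => [{
        InstanceGroupId => 'ig-1S2OPYYYY0404',  # hypothetical instance group id
        InstanceCount   => 8,
    }],
);
```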
RunJobFlow creates and starts running a new job flow. Returns a Net::Amazon::EMR::RunJobFlowResult object that contains the job flow ID.
SetTerminationProtection locks a job flow so that the Amazon EC2 instances in the cluster cannot be terminated by user intervention, an API call, or a job-flow error. Returns 1 on success.
SetVisibleToAllUsers sets whether all AWS Identity and Access Management (IAM) users under your account can access the specified job flows. Returns 1 on success.
TerminateJobFlows terminates a list of job flows. Returns 1 on success.
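For example (the job flow id is hypothetical):

```perl
# Shut down a cluster that is no longer needed.
$emr->terminate_job_flows(JobFlowIds => [ 'j-3UN6WX5RRO2AG' ]);
```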
If an error occurs in any of the methods, the error will be logged and an Exception::Class exception of type Net::Amazon::EMR::Exception will be thrown.
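A sketch of catching such an exception with a plain eval; the message accessor comes from the Exception::Class conventions, so check the generated exception class for the fields actually available:

```perl
# The job flow id here is hypothetical.
my $result = eval {
    $emr->describe_job_flows(JobFlowIds => [ 'j-3UN6WX5RRO2AG' ]);
};
if (my $err = $@) {
    if (ref $err && $err->isa('Net::Amazon::EMR::Exception')) {
        warn "EMR call failed: " . $err->message . "\n";
    }
    else {
        die $err;   # not an EMR error; re-throw
    }
}
```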
Logging is implemented with Log::Log4perl. The simplest way to enable debug output is:

  use Log::Log4perl qw/:easy/;
  Log::Log4perl->easy_init($DEBUG);
Log::Log4perl provides great flexibility and there are many ways to set it up. A favourite of my own is to use Config::General format to specify all configuration parameters including logging, and to initialise in the following manner:
  use Config::General qw/ParseConfig/;

  my %opts = ParseConfig(-ConfigFile  => 'my.conf',
                         -SplitPolicy => 'equalsign',
                         -UTF8        => 1);
  ...
  unless (Log::Log4perl->initialized) {
      if ($opts{log4perl}) {
          Log::Log4perl::init($opts{log4perl});
      }
      else {
          Log::Log4perl->easy_init();
      }
  }
And a typical configuration in Config::General format might look like this:
  <log4perl>
      log4perl.rootLogger = DEBUG, Screen, Logfile
      log4perl.appender.Logfile = Log::Log4perl::Appender::File
      log4perl.appender.Logfile.filename = debug.log
      log4perl.appender.Logfile.layout = Log::Log4perl::Layout::PatternLayout
      log4perl.appender.Logfile.layout.ConversionPattern = "%d %-5p %c - %m%n"
      log4perl.appender.Screen = Log::Log4perl::Appender::ScreenColoredLevels
      log4perl.appender.Screen.stderr = 1
      log4perl.appender.Screen.layout = Log::Log4perl::Layout::PatternLayout
      log4perl.appender.Screen.layout.ConversionPattern = "[%d] [%p] %c %m%n"
  </log4perl>
At DEBUG level, the output can be very lengthy. To see only important messages for Net::Amazon::EMR whilst debugging other parts of your code, you could raise the threshold just for Net::Amazon::EMR by adding the following to your Log4perl configuration:
  log4perl.logger.Net.Amazon.EMR = WARN
This is somewhat beyond the scope of the documentation for using Net::Amazon::EMR. Nevertheless, here are a few notes about using EMR with Perl.
Undoubtedly, to run any serious processing, you will need to install additional libraries on the map-reduce servers. A practical way to do this is to pre-configure all of the libraries using local::lib and use a bootstrap task to install them when the servers boot, using steps similar to the following:
Start an interactive EMR job on a single instance using the same machine architecture (e.g. m1.large) that you plan to use for running your jobs.
ssh to the instance.
Set up CPAN, then download and install local::lib.
Add the environment variables required by local::lib to your .bashrc.
Install all of the other modules you need via cpan.
Clean up files from .cpan that you don't need, such as the build and source directories.
Create a tar file, e.g. tar cfz local-perl5.tar.gz perl5 .cpan .bashrc
Copy the tar file to your bucket on S3.
Set up a bootstrap script to copy back the tar file from S3 and untar it into the hadoop home directory, e.g.
  #!/bin/bash
  set -e
  bucket=mybucketname
  tarfile=local-perl5.tar.gz
  arch=large
  cd $HOME
  hadoop fs -get s3://$bucket/$arch/$tarfile .
  tar xfz $tarfile
Put the bootstrap script on S3 and use it when creating a new job flow.
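Tying the steps together, a job flow that runs the bootstrap script might be started like this (the bucket name, script path and instance settings are placeholders):

```perl
# Start a cluster that installs the pre-built Perl bundle at boot time.
my $id = $emr->run_job_flow(
    Name      => "Perl streaming job",
    Instances => {
        InstanceCount      => 4,
        MasterInstanceType => 'm1.large',
        SlaveInstanceType  => 'm1.large',
    },
    BootstrapActions => [{
        Name                  => 'Install local Perl libraries',
        ScriptBootstrapAction => {
            # placeholder path; upload the script from the previous step here
            Path => 's3://mybucketname/bootstrap-perl5.sh',
        },
    }],
);
```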
Assuming the reader is familiar with the basic principles of map-reduce: when implemented in Perl with hadoop-streaming.jar, a mapper or reducer is simply a script that reads from STDIN and writes to STDOUT, typically line by line, with a tab-separated key and value pair on each line. The main loop of any mapper/reducer script is therefore usually of the form:
  while (my $line = <>) {
      chomp $line;
      my ($key, $value) = split(/\t/, $line);
      # ... do something with key and value
      print "$newkey\t$newvalue\n";
  }
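As a concrete (hypothetical) instance of that loop, a minimal word-count mapper emits each word with a count of 1; the corresponding reducer would sum the counts per key:

```perl
#!/usr/bin/env perl
# wordcount-mapper.pl - emit "word<TAB>1" for every word read from STDIN
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    for my $word (split /\s+/, $line) {
        next unless length $word;
        print "$word\t1\n";
    }
}
```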
Scripts can be uploaded to S3 using the web interface, or placed in the bootstrap bundle described above, or uploaded to the master instance using scp and distributed using the hadoop-streaming.jar -file option, or no doubt by many other mechanisms. If due care is taken with quoting, a script can even be specified using the -mapper and -reducer options directly; for example:
  Args => [ '-mapper', '"perl -e MyClass->new->mapper"', ... ]
Jon Schutz
http://notes.jschutz.net
Please report any bugs or feature requests to bug-net-amazon-emr at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Net-Amazon-EMR. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Net::Amazon::EMR
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Net-Amazon-EMR
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Net-Amazon-EMR
CPAN Ratings
http://cpanratings.perl.org/d/Net-Amazon-EMR
Search CPAN
http://search.cpan.org/dist/Net-Amazon-EMR/
The core interface code was adapted from Net::Amazon::EC2.
Copyright 2012 Jon Schutz.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://dev.perl.org/licenses/ for more information.
Amazon EMR API: http://docs.amazonwebservices.com/ElasticMapReduce/latest/APIReference/Welcome.html
To install Net::Amazon::EMR, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Net::Amazon::EMR
CPAN shell
  perl -MCPAN -e shell
  install Net::Amazon::EMR
For more information on module installation, please visit the detailed CPAN module installation guide.