NAME

Apache::Hadoop::WebHDFS - interface to Hadoop's WebHDFS API that supports GSSAPI/SPNEGO (secure) access.

VERSION

Version 0.04

SYNOPSIS

Hadoop's WebHDFS API is a REST interface to HDFS. This module provides a Perl interface to that API, allowing one to both read and write files on HDFS. Because Apache::Hadoop::WebHDFS supports GSSAPI, it can be used to interface with secure Hadoop clusters. This module also supports WebHDFS connections to unsecure grids.

Apache::Hadoop::WebHDFS is a subclass of WWW::Mechanize, so one can call WWW::Mechanize methods if needed. Note that WWW::Mechanize is itself a subclass of LWP::UserAgent, meaning it's possible to also call LWP methods from Apache::Hadoop::WebHDFS. For example, to debug the GSSAPI calls used during a request, enable LWP::Debug by adding 'use LWP::Debug qw(+);' to your script.

Content returned from WebHDFS is left in its native JSON format. Using your favorite JSON module, such as JSON::Any, will help with managing the JSON output. To get at the content stored in your Apache::Hadoop::WebHDFS object, use the methods provided by WWW::Mechanize, such as 'success', 'status', and 'content'. Please see 'EXAMPLE' below for how these are used.
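For instance, a minimal sketch of pulling the JSON content into a Perl data structure with the core JSON::PP module (any JSON module providing a decode_json() function would work the same way):

  use JSON::PP qw(decode_json);

  # After a request, content() holds the raw JSON string returned by WebHDFS.
  if ( $hdfsclient->success() ) {
      my $result = decode_json( $hdfsclient->content() );
      # $result is now a hash reference mirroring the JSON structure.
  }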

METHODS

  • new() - creates a new WebHDFS object. Required keys are 'user', 'namenode', 'namenodeport', and 'authmethod'. Default values for 'namenode' and 'namenodeport' are shown below. The default value for 'authmethod' is 'gssapi', which is used on grids where SPNEGO has been enabled. The 'doasuser' key is optional and is intended for use when proxying the WebHDFS request as another user.

           my $hdfsclient = Apache::Hadoop::WebHDFS->new({ namenode     => "localhost",
                                                           namenodeport => "50070",
                                                           authmethod   => "gssapi|unsecure|doas",
                                                           user         => 'user1',
                                                           doasuser     => 'user2',
                                                         });
     
  • getdelegationtoken() - gets a delegation token from the namenode. This token is stored within the WebHDFS object and automatically appended to each WebHDFS request. Delegation tokens are used on grids with security enabled.

           $hdfsclient->getdelegationtoken();
  • renewdelegationtoken() - renews a delegation token from the namenode.

           $hdfsclient->renewdelegationtoken();
  • canceldelegationtoken() - informs the namenode to invalidate the delegation token as it's no longer needed. When calling this method, the delegation token is also removed from the perl WebHDFS object.

           $hdfsclient->canceldelegationtoken();
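    A typical token lifecycle on a secure grid, sketched below with a placeholder path: fetch the token once, issue requests (the token is appended automatically), then cancel it when done.

           $hdfsclient->getdelegationtoken();
           $hdfsclient->liststatus({ path => '/user/someuser' });  # token sent automatically
           $hdfsclient->canceldelegationtoken();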
  • Open() - opens a file on HDFS and returns its content. The only required value for Open() is 'file'; all others are optional. The values 'offset', 'length', and 'buffersize' are specified in bytes. A chunked-read sketch follows the example below.

            $hdfsclient->Open({ file=>'/path/to/my/hdfs/file',
                                offset=>'1024',    
                                length=>'2048',
                                buffersize=>'1024',
                               });
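    As a sketch of how 'offset' and 'length' can be combined, the loop below reads a large file in 1 MB chunks, assuming each call leaves the requested byte range in content(); the path and chunk size are placeholders.

            my $chunk  = 1048576;   # 1 MB per request
            my $offset = 0;
            while (1) {
                $hdfsclient->Open({ file   => '/path/to/my/hdfs/file',
                                    offset => $offset,
                                    length => $chunk });
                last unless $hdfsclient->success();
                my $data = $hdfsclient->content();
                last unless defined $data && length $data;
                print $data;
                $offset += $chunk;   # advance to the next chunk
            }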
  • create() - creates and writes to a file on HDFS. Required values for create() are 'srcfile', which is a local path, and 'dstfile', which is the path for the new file on HDFS. 'blocksize' is represented in bytes and 'overwrite' has two valid values, 'true' and 'false'. While not required, if 'permission' is not provided it will default to '000'.

             $hdfsclient->create({ srcfile=>'/my/local/file.txt',
                                   dstfile=>'/my/hdfs/location/file.txt',
                                   blocksize=>'524288',
                                   replication=>'3',
                                   buffersize=>'1024',
                                   overwrite=>'true|false',
                                   permission=>'644',
                                  });
  • rename() - renames a file on HDFS. Required values for rename are 'srcfile' and 'dstfile', both of which represent HDFS filenames.

             $hdfsclient->rename({ srcfile=>'/my/old/hdfs/file.txt',
                                    dstfile=>'/my/new/hdfs/file.txt',
                                 });
  • getfilestatus() - returns a JSON structure containing the status of a file or directory, including access times, blocksize, and permissions. The required input is an HDFS path.

             $hdfsclient->getfilestatus({ file=>'/path/to/my/hdfs/file.txt' });
  • liststatus() - returns a JSON structure describing the contents of a directory. Note the timestamps are Java timestamps (milliseconds since the epoch), so divide by 1000 to convert them to Unix epoch seconds before formatting, as sketched after the example below.

             $hdfsclient->liststatus({ path=>'/path/to/my/hdfs/directory' });
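    A sketch of the conversion, assuming the core JSON::PP module and the standard WebHDFS 'FileStatuses' response layout:

             use JSON::PP qw(decode_json);

             $hdfsclient->liststatus({ path => '/path/to/my/hdfs/directory' });
             my $listing = decode_json( $hdfsclient->content() );
             for my $entry ( @{ $listing->{FileStatuses}{FileStatus} } ) {
                 # Java timestamps are milliseconds; divide by 1000 for epoch seconds.
                 my $mtime = localtime( int( $entry->{modificationTime} / 1000 ) );
                 print "$entry->{pathSuffix}  $mtime\n";
             }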
  • mkdirs() - creates a directory on HDFS. The only required input value is 'path'. There is an optional input value named 'permissions' which, if not provided, defaults to '000'.

             $hdfsclient->mkdirs({ path=>'/path/to/my/hdfs/directory',
                                   permissions=>'755', 
              });
  • getfilechecksum() - gets the HDFS checksum of a file. Note this is the CRC32 checksum that HDFS uses to detect file corruption, not a checksum of the file contents itself. The only required input value is 'file'.

             $hdfsclient->getfilechecksum({ file=>'/path/to/my/hdfs/file' });
  • Delete() - removes a file or directory from HDFS. The only required input value is 'path'. The optional value 'recursive' takes a 'true|false' argument.

             $hdfsclient->Delete({ path=>'/path/to/my/hdfs/directory',
                                   recursive=>'true|false',
             });
  • getcontentsummary() - lists metadata for a directory, including file count and quota usage. The only input value is the path to an HDFS directory.

             $hdfsclient->getcontentsummary({ directory=>'/path/to/my/hdfs/directory' });
  • gethomedirectory() - returns path to the home directory for the user or 'proxy user'. There is no input for this method.

             $hdfsclient->gethomedirectory();
  • setowner() - changes owner and group ownership on a file or directory on HDFS. The only required input is 'path'.

             $hdfsclient->setowner({ path=>'/path/to/my/hdfs/directory',
                                     user=>'cartman',
                                     group=>'fifthgraders',
                                   });
  • setpermission() - changes permissions on a file or directory on HDFS. 'path' is required; 'permission' is optional.

             $hdfsclient->setpermission({ path=>'/path/to/my/hdfs/directory',
                                          permission=>'640',
                                        });
  • setreplication() - changes replication count for a file on HDFS. Path is required, replication is optional.

             $hdfsclient->setreplication({ path=>'/path/to/my/hdfs/directory',
                                           replication=>'10',
             });
  • settimes() - changes the access and modification times of a file or directory on HDFS. 'path' is required; both 'accesstime' and 'modificationtime' are optional. Remember these times are in Java time (milliseconds since the epoch), so convert Unix epoch seconds to Java time by multiplying by 1000, as sketched below.

             $hdfsclient->settimes({ path=>'/path/to/my/hdfs/directory',
                                     modificationtime=>$mymodtime,
                                     accesstime=>$myatime,
                                   });
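    For example, to set both times to 'now', convert epoch seconds to Java milliseconds first (a minimal sketch; the path is a placeholder):

             # Unix epoch seconds * 1000 = Java milliseconds.
             my $now_java = time() * 1000;
             $hdfsclient->settimes({ path             => '/path/to/my/hdfs/file',
                                     modificationtime => $now_java,
                                     accesstime       => $now_java,
                                   });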

REQUIREMENTS

 Carp                   is used for various warnings and errors.
 WWW::Mechanize         is needed as this is a subclass.
 LWP::Debug             is required for debugging GSSAPI connections.
 LWP::Authen::Negotiate is the magic sauce for working with secure Hadoop clusters.
 parent                 is included with Perl 5.10.1 and newer, or found on CPAN for older versions of Perl.
 File::Map              is required for reading file contents into mmap'ed memory space instead of Perl's symbol table.

EXAMPLE

List an HDFS directory on a secure Hadoop cluster:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Data::Dumper;
  use Apache::Hadoop::WebHDFS;
  my $username=getlogin();
  my $hdfsclient = Apache::Hadoop::WebHDFS->new( {namenode        =>"mynamenode.example.com",
                                                  namenodeport    =>"50070",
                                                  authmethod      =>"gssapi",
                                                  user            =>$username,
                                                 });
  $hdfsclient->liststatus( {path=>"/user/$username"} );
  if ($hdfsclient->success()) {
     print "Request SUCCESS: ", $hdfsclient->status() , "\n\n";
     print "Dumping content:\n";
     print Dumper $hdfsclient->content() ;
  } else {
     print "Request FAILED: ", $hdfsclient->status() , "\n";
  } 


AUTHOR

Adam Faris, <apache-hadoop-webhdfs at mekanix.org>

BUGS

  Please use github to report bugs and feature requests 
  https://github.com/opsmekanix/Apache-Hadoop-WebHDFS/issues

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Apache::Hadoop::WebHDFS

You can also look for information at the project's GitHub page: https://github.com/opsmekanix/Apache-Hadoop-WebHDFS

ACKNOWLEDGEMENTS

I would like to acknowledge Andy Lester and the numerous people who have worked on WWW::Mechanize, Achim Grolms and team for providing LWP::Authen::Negotiate, and the contributors to LWP. Thanks for providing awesome modules.

LICENSE AND COPYRIGHT

Copyright 2013 Adam Faris.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.