The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

PDL - Perl Data Language extension module

Version 1.03 alpha

"Why is it that we entertain the belief that for every purpose odd numbers are the most effectual?" - Pliny the Elder.

Karl Glazebrook, AAO, 28/11/1996. [kgb@aaoepp.aao.gov.au]

DESCRIPTION

The perlDL concept is to give standard perl5 the ability to COMPACTLY store and SPEEDILY manipulate the large N-dimensional data sets which are the bread and butter of scientific computing. e.g. $a=$b+$c can add two 2048x2048 images in only a fraction of a second.

It is hoped to eventually provide tons of useful functionality for scientific and numeric analysis.

Introduction

The fundamental perl data structures are scalar variables, e.g. $x, which can hold numbers or strings, lists or arrays of scalars, e.g. @x, and associative arrays/hashes of scalars, e.g. %x.

perl v5 introduces to perl data structures and objects. A simple scalar variable $x now be a user-defined data type or full blown object*.

The fundamental idea behind perlDL is to allow $x to hold a whole 1D spectrum, or a 2D image, a 3D data cube, and so on up to large N-dimensional data sets. These can be manipulated all at once, e.g. $a = $b + 2 does a vector operation on each value in the spectrum/image/etc.

You may well ask: "Why not just store a spectrum as a simple perl @x style list with each pixel being a list item?" The two key answers to this are MEMORY and SPEED. Because we know our spectrum consists of pure numbers we can compactly store them in a single block of memory corresponding to a C style numeric array. This takes up a LOT less memory than the equivalent perl list. It is then easy to pass this block of memory to a fast addition routine, or to any other C function which deals with arrays. As a result perlDL is very fast --- for example one can mulitiply a 2048*2048 image in exactly the same time as it would take in C or FORTRAN (0.1 sec on my SPARC). A further advantage of this is that for simple operations (e.g. $x += 2) one can manipulate the whole array without caring about its dimensionality.

I find when using perlDL it is most useful to think of standard perl @x variables as "lists" of generic "things" and PDL variables like $x as "arrays" which can be contained in lists or hashes. Quite often in my perlDL scripts I have @x contain a list of spectra, or a list of images (or even a mix!). Or perhaps one could have a hash (e.g. %x) of images... the only limit is memory!

perlDL variables support a range of data types - arrays can be bytes, short intgers (signed or unsigned), long integers, floats or double precision floats.

* It actually holds a reference (a smart "pointer") to this but that is not relevant for ordinary use of perlDL.

Usage

perlDL is loaded into your perl script using these commands:

   use PDL;  # use the standard perlDL modules (Core Examples Io Graphics::PG)
   
   use PDL::Examples; # use only the Examples module (this will load 
                      # internally whatever other modules it needs).
   
   % perldl  # Invoke interactive shell from system command line.

The default is to import all the function names from a module. If you only want certain names imported just say:

   use PDL::Io qw(rfits rgrep) # Get only rfits() and rgrep from PDL::Io

Also see below on "Object-Orientation".

To create a new PDL variable

Here are some ways of creating a PDL variable:

   $a = pdl [1..10];             # 1D array
   $a = pdl (1,2,3,4);           # Ditto
   $b = pdl [[1,2,3],[4,5,6]];   # 2D 3x2 array
   $b = pdl 42                   # 0-dimensional scalar
   $c = pdl $a;                  # Make a new copy
   
   $d = byte [1..10];            # See "Type conversion"
   $e = zeroes(3,2,4);           # 3x2x4 zero-filled array
   
   $c = rfits $file;             # Read FITS file 
   
   @x = ( pdl(42), zeroes(3,2,4), rfits($file) ); # Is a LIST of PDL variables!

The pdl() function is used to initialise a PDL variable from a scalar, list, list reference or another PDL variable.

In addition all PDL functions automatically convert normal perl scalars to PDL variables on-the-fly.

(also see "Type Conversion" and "Input/Output" sections below)

Arithmetic

 $a = $b + 2; $a++; $a = $b / $c; # Etc.
 
 $c=sqrt($a); $d = log10($b+100); # Etc
 
 $e = $a>42; # Vector condtional (like MATLAB) - 
             # note I think this is much nicer than IDLs WHERE(), e.g.:
 
 $e = 42*($a>42) + $a*($a<=42); # Cap top
 
 $a = $a / ( max($a) - min($a) );
 
 print $a; # $a in string context prints it in a N-dimensional format

(and other perl operators/functions)

Matrix functions

'x' is hijacked as the matrix multiplication operator. e.g. $c = $a x $b;

perlDL is row-major not column major so this is actually c(i,j) = sum_k a(k,j) b(i,k) - but when matrices are printed the results will look right. Just remember the indices are reversed. e.g.:

      $a = [                   $b = [
            [ 1  2  3  0]            [1 1]
            [ 1 -1  2  7]            [0 2]
            [ 1  0  0  1]            [0 2]
           ]                         [1 1]
                                    ]

      gives $c = [
                  [ 1 11]
                  [ 8 10]
                  [ 2  2]
                 ]

Note: transpose() does what it says and is a convenient way to turn row vectors into column vectors. It is bound to the unary operator '~' for convenience.

How to write a simple function

   sub dotproduct { 
       my ($a,$b) = @_;
       return sum($a*$b) ;
   }
   1;

If put in file dotproduct.pdl would be autoloaded (see below).

Type Conversion

Default for pdl() is double. Conversions are:

   $a = float($b); 
   $c = long($d);   # "long" is 4 byte int
   $d = byte($a);

Also double(), short(), ushort().

These routines also automatically convert perl lists to allow the convenient shorthand:

   $a = byte [[1..10],[1..10]];  # Create 2D byte array
   $a = float [1..1000];         # Create 1D float array

etc.

Rules for automatic conversion during arithmetic:

  If INT = any of byte/short/ushort/int and X is generic op

  For VECTOR x SCALAR these rules avoid overpromotion of vector types:

  VECTOR INT   X SCALAR INT            Return is same type as VECTOR
  VECTOR INT   X SCALAR float/double   Return float
  VECTOR float X SCALAR float/double   Return float


  For other VECTORxSCALAR and VECTORxVECTOR returns "highest" of
  two data types. i.e. VECTOR double x float returns float etc.

Printing

Automatically expands array in N-dimensional format:

   print $a;  
    
   $b = "Answer is = $a ";

Sections

perlDL betrays its perl/C heritage in that arrays are zero-offset. Thus a 100x100 image has indices 0..99,0..99.

Further I adopt the convention that the center of the pixel (0,0) IS at coordinate (0.0,0.0). Thus the above image ranges from -0.5..99.5, -0.5..99.5 in real space. All perlDL graphics functions conform to this defintion and hide away the unit-offsetness of, for example, the PGPLOT FORTRAN library.

Again following the usual convention coordinate (0,0) is displayed at the bottom left when displaying an image. It appears at the top right when using "print $a" etc.

   $b  = sec($a,  $x1, $x2, $y1, $y2, $z1, $z2, ... ) # Take subsection
   $newimage = ins($bigimage,$smallimage,$x,$y,$z...) # Insert at x,y,z
   
   $c  = nelem ($a); # Number of pixels
   
   $val = at($object, $x,$y,$z...)    # Pixel value at position
   set($myimage, $x, $y, ... $value)  # Set value in image 
   
   $b = xvals($a); # Fill array with X-coord values (also yvals(), zvals(),
                   # axisvals($x,$axis) and rvals() for radial distance 
                   # from centre).

(Note: I hope to enable syntax like $$a{'0..200,3..200'} using tie() but I am still thinking about the ramifications of this)

Input/Output

The PDL::Io module currently implements the following useful I/O functions:

   $a  = rfits($file)  # Read a FITS file into a PDL variable
                       # (only IEEE float machines as yet)

   wfits ($a, $file)  # Write FITS file 
   
   ([$xaxis],$data) = rdsa($file)   # Read a STARLINK/FIGARO file using
                                    # perl DSA module (available seperately)
   

Read ASCII columns into $x, $y, etc.:

   ($x,$y,...) = rcols($file,[[$pattern],[$col1, $col2,] ...)  

Read $1, $2, etc. pattern matches into $x, $y, etc.

   ($x,$y,...) = rgrep($file, $pattern)    
   

e.g.:

   ($x,$y) = rcols $file, '/Mumble/', 2,3;
   ($a,$b) = rgrep $file, '/Foo (.*) Bar (.*) Mumble/';
   

Graphics

The philosophy behind perlDL is to make it work with a variety of existing graphics libraries since no single package will satisfy all needs and all people and this allows one to work with packages one already knows and likes. Obviously there will be some overlaps in functionality and some lack of consistency and uniformity. This also saves the author from too much work in time he doesn't have!

  1. PGPLOT

    PGPLOT provdes a simple library for line graphics and image display.

    There is an easy interface to this in the internal module PDL::Graphics::PG. (This calls routines in the separately available PGPLOT top-level module.)

    Current display commands:

       imag         -  Display an image (uses pgimag()/pggray() as appropriate)
       ctab         -  Load an image colour table
       line         -  Plot vector as connected points
       points       -  Plot vector as points
       errb         -  Plot error bars
       cont         -  Display image as contour map
       bin          -  Plot vector as histogram ( e.g. bin(hist($data)) )
       hi2d         -  Plot image as 2d histogram (not very good IMHO...)
       poly         -  Draw a polygon
       vect         -  Display 2 images as a vector field

    Device manipulation commands:

       hold         -  Hold current plot window range - allows overlays etc.
       release      -  Release back to autoscaling of new plot window for each command
       rel          -  short alias for 'release'
       env          -  Define a plot window, put on 'hold'
       dev          -  Explicitly set a new PGPLOT graphics device

    The actual PGPLOT module is loaded only when the first of these commands is executed.

    e.g:

       perldl> $a = pdl [1..100]
       perldl> $b = sqrt($a)
       perldl> line $b      
       perldl> hold
       Graphics on HOLD
       perldl> $c = sin($a/10)*2 + 4
       perldl> line $c     

    Notes: $transform for image/cont etc. is used in the same way as the TR() array in the underlying PGPLOT FORTRAN routine but is, fortunately, zero-offset.

    It is also hoped to use other graphic libraries to enable more sophisticated plots then is possible with PGPLOT. Some ideas:

  2. IIS

    Many astronomers like to use SAOimage and Ximtool (or there derivations/clones). These are useful free widgets for inspection and visualisation of images. (They are not provided with perlDL but can easily be obtained from their official sites off the Net.)

    The PDL::Graphics::IIS package provides allows one to display images in these ("IIS" is the name of an ancient item of image display hardware whose protocols these tools conform to.)

    Commands are:

       iis         - display image
       iiscur      - return a cursor position
       iiscirc     - draw circles on image display
       saoimage    - start SAOimage 
       ximtool     - start Ximtool

    Variables are:

       $stdimage  - frame buffer configuration
       $iisframe  - frame buffer number to display in

    The frame buffer configuration is set by the variable $stdimage (analagous to iraf) whose default is "imt1024". System and user imtoolrc files are parsed so if you know about these you can do the same tricks as you can in with IRAF.

  3. Karma

    To come?

Autoloading

If a PDL function, e.g. foo(), is currently undefined a file "foo.pdl" is searched for in the current directory, and any directories in $PDLLIB, $PERL5LIB and $PERLLIB enviroment variables. (These are ":" seperated lists of directories.)

If you want to change the path within perldl simply change the lists @PDLLIB and @INC.

Note: "foo.pdl" is require'd so it must return a true value (see "require" perl documentation).

Call External

This provides a simple way to pass the data arrays from pdl variables to external C routines. It uses perl's built-in dynamic loader to load compiled C code.

The syntax is:

   callext($file,$symbol, @pdl_list) 

@pdl_list is a list of pdl variables. Numbers get converted automatically. The file must be dynamically loadable object code - how the C compiler generates this will be different from system to system so see your man pages.

The C routine takes args (int nargs, pdl *args). The C type "pdl" is a simple data structure representing the perl pdl variable. It is defined in file "pdl.h" which is included in the perlDL distribution and has no perl dependencies. It is trivial to cast the data array (pdl.data) to (float), (double) etc. and pass to any other C routine.

This is all demonstrated in the files "testcallext.*" in the perlDL distribution.

Note: This is only intended as a quick and dirty prototyping interface for the scientist/hacker. perlDL developers should write a module along the lines of the example PDL::Examples.

perldl shell

The perl script perldl provides a simple command line - if the latest Readlines/ReadKey modules have beeen installed perldl detects this and enables command line recall and editing.

e.g.:

   % perldl
   ReadLines enabled
   perldl> $a = rfits "foo.fits"
   BITPIX =  -32  size = 88504 pixels 
   Reading  354016 bytes
   BSCALE =  &&  BZERO = 
   
   perldl> imag log($a+400)
   Displaying 299 x 296 image from 4.6939525604248 to 9.67116928100586 ...

You can also run it from the perl debugger (perl -MPDL -d -e 1) if you want.

Miscellaneous shell features:

  1. The command perldl -oo starts perldl in Object-Oriented mode. It does use PDL::OO instead of use PDL.

  2. The shell aliases p to be a convenient short form of print, e.g.

       perldl> p ones 5,3
        
       [
        [1 1 1 1 1]
        [1 1 1 1 1]
        [1 1 1 1 1]
       ]
  3. The files ~/.perldlrc and local.perldlrc (in the current directory) are sourced if found. This allows the user to have global and local PDL code for startup.

  4. Any line starting with the # character is treated as a shell escape. This character is configurable by setting the perl variable $PERLDL_ESCAPE. This could, for example, be set in ~/.perldlrc.

Overload operators

I have overloaded the following builtin perl operators and functions in order that they work on PDL variables:

   + - * / > < >= <= << >> & | ^ == != <=> ** % ! ~
   sin log abs atan2 sqrt cos exp 

[All the unary functions (sin etc.) may be used with inplace() - see "Memory" below.]

Object-Orientation and perlDL

[Astronomers can ignore this bit! :-)]

pdl variables such as $x are implemented via Perl objects. However I have chosen to use an all-functional approach to perlDL syntax yo be more astronomer friendly.

However you can use perlDL in an OO fashion. In fact if you say:

   use PDL::OO;

It will load PDL functions as OO methods in the PDL class. This means you can say things like:

   $a = PDL->rfits('m51.fits');
   $b = PDL->new([1,2,2,1],[1,2,2,1],[1,2,2,1],[1,2,2,1]);
   $smooth = $a->convolve($b);
   $smooth->iis;

You can start the perldl shell in this mode with "perldl -oo".

Note: as you can see from the above all functions which create pdl variables are used with construct syntax in the OO mode. Finally you can even use both forms by simply saying "use PDL; use PDL::OO".

You can inherit from PDL methods (e.g. to a class Foo) by simply saying:

   @Foo::ISA = ('PDL');               # Method path
   %Foo::OVERLOAD = %PDL::OVERLOAD;   # Copy overload

Then PDL methods will work on Foo objects as long as you simply build on the existing PDL data structure (see below) components.

So it would be possible to provide USER written modules to do really cool stuff for specific application areas, e.g. PDL::Spectrum might provide a $a which understands X-axes and error bars and +-/+ etc. might be overriden to do the Right Thing (tm). And writing the module would not be rocket science - just some cool perl hacking.

And you would not have to even use method syntax - if $a came out of my hypothetical PDL::Spectrum all the standard pdl functions (like hist() to give a concreate example) would work on it in the standard way provided they simply built on the existing PDL data structure (which means simply containing a $$a{Data} etc. PDL::Spectrum could even export it's own hist() function to override the built-in which might do something more sophisticated using the X-axis for example.

If you were feeling really ambitious you might do PDL::Ir::Spectrum which understood about the gaps between the J H and K bands!

Memory usage and references

Messing around with really huge data arrays may require some care. perlDL provides some facilities to let you perform operations on big arrays without generating extra copies though this does require a bit more thought are care from the programmer.

NOTE: On some most systems it is better to configure perl (during the build options) to use the system malloc() function rather than perl's built-in one. This is because perl's one is optimised for speed rather than consumption of virtual memory - this can result in a factor of two improvement in the amount of memory storage you can use.

  1. Simple arithmetic

    If $a is a big image (e.g. occupying 10MB) and I say:

       $a = $a + 1;

    then the total malloc()'d memory usage grows to 20MB. This is because the expression "$a+1" creates a temporary copy of $a to hold the result, then $a is assigned a reference to that. It is obviously done this way so "$c=$a+1" works as expected.

    Also if one says:

       $b = $a;     # $b and $a now point to same data
       $a = $a + 1;

    Then $b and $a end up being different, as one naively expects, because a new reference is created and $a is assigned to it.

    However if $a was a huge memory hog (e.g. a 3D volume) creating a copy of it may not be a good thing. One can avoid this memory overhead in the above example by saying:

       $a++;

    The operations ++,+=,--,-=, etc. all call a special "in-place" version of the arithmetic subroutine. This means no more memory is needed - the downside of this is that if $b=$a then $b is also incremented. To force a copy explicitly:

       $b = pdl $a; # Real copy
  2. Functions

    Most functions, e.g. log(), return a result which is a transformation of their argument. This makes for good programming practice. However many operations can be done "in-place" and this may be required when large arrays are in use and memory is at a premium. For these circumstances the operator inplace() is provided which prevents the extra copy and allows the argument to be modified. e.g.:

       $x = log($array);          # $array unaffected
       log( inplace($bigarray) ); # $bigarray changed in situ

WARNINGS:

1. The usual caveats about duplicate references apply.
2. Obviously when used with some functions which can not be applied in situ (e.g. convolve()) unexpected effects may occur! I try to indicate inplace() safe functions below.
3. Type conversions [e.g. float()] may cause hidden copying.

Data Structure Guts

(For born fiddlers only.)

The data structure for $a is implemented by a hash (associative array) which $a is a (blessed) reference too.

PDL reserves for it's own use:

   $$a{Data} ; # The DATA (byte list) - can be passed directly to F77/C
               # subroutine as long as type matches. e.g. line() does a
               # float() and then calls PGPLOT::pgline_r (bypassing packing)
   
   $$a{Datatype}; # Holds numeric data type, $PDL_F, $PDL_D, etc...
   
   $$a{Dims} ; # List reference holding dimensions. 
               # E.g. @mydims = @{ $$a{Dims} };
   
   $$a{Hdr}  ; # Optional extra hash reference holding header, e.g.
               # $airmass = $$a{Hdr}{'AIRMASS'}; %myhdr = %{ $$a{Hdr} };
               # rfits() populates this from the FITS header.
   
   $$a{Inplace}; # Flag - inplace() sets this. Next time a copy is attempted
                 # it does not occur and the flag is unset.

   $$a{PDL};  # Pointer to cached C values of above. This allows
              # C routines to get at the data FAST.

Anything else stored in the structure will be copied to new objects (e.g. by $b = $a + 1) automatically as long as PDL knows how to copy it. [If it is a reference to another object PDL tries the ->copy method.]

If your perl routine manipulates the data structure guts directly, you don't want it to blow up in your face if you pass it a simple number rather than a PDL variable. Simply call the function topdl() first to make it safe. e.g.:

   sub myfiddle { my $pdl = topdl(shift); $$pdl{Data} = ... }

topdl() does NOT perform a copy if a pdl variable is passed - it just falls through - which is obviosuly the desired behaviour. The routine is not of course necessary in normal user defined functions which do not care about internals.

If you have changed with the guts you ********MUST******** call the method $pdl-flush()> to update the C cache from the new perl values otherwise PDL will get very confused (Mnemonic: 'flush the piddle after a fiddle').

Finally there is no reason why the data structure should not contain another PDL variable!

Complete List of Exported Functions

Defined in PDL::Core

 byte short ushort long float double convert   - Type Conversions
 
 pdl          - Create/copy a pdl 
 topdl        - Coerce to pdl if scalar
 howbig       - Size of pdl datatype in bytes
 nelem        - Number of elements 
 dims         - Return list of dimensions, e.g. @mydims = dims($x);
 list         - Convert pdl to list - e.g. for (list $x) {..}
 listindices  - Return list of index values (1D) - e.g. for $i 
                (listindices $x) {..}
 log10*       - Take log base 10
 min max sum  - Min/max/sum of pdl array
 zeroes/ones  - Create zero/one-filled pdl array
 sequence     - Create sequence-filled pdl array
 reshape      - reshape the dimensions of a pdl array
 sec          - subsection
 ins* / set   - insertion / setting
 at           - return pixel value at (x,y,z...)
 
 axisvals* xvals* yvals* zvals* - Fill pdl with axis values
 
 rvals        - Fill pdl with distance from it's center
 callext      - Call external C routine in dynamically loadable object
 convolve     - convolve image with kernel (real space)
 inplace      - Flag for inplace operation
 hist         - histogram of data 
 stats        - return mean + standard deviation
 transpose    - matrix transpose

Defined in PDL::Examples

This contains examples of how to add C functiions via XS including use of the generic preprocessor (.g files are automatically converted to .c files with code automatically generated for each datatype).

 fibonacci*    - Compute Fibonacci series (simple 1D example)
 cc8compt*     - Connected 8-component labelling (2D example)

Defined in PDL::Io

[See "Io" section above]

Defined in PDL::Graphics::*

[See "Graphics" section above]

Footnotes:

* = indicates inplace() safe & useful with this function

Complete List of PDL methods

For convenience (the ->meth syntax looks more appropriate) some pdl routines are defined as methods:

$pdl->flush - Update C cache from perl values