NAME

Data::Downloader::Config

SYNOPSIS

# command line :
dado config init --filename=my_config_file.txt
dado config update --filename=my_config_file_modified.txt

Module :

use Data::Downloader;
Data::Downloader::Config->init( filename => "my_config_file.txt" );
Data::Downloader::Config->init( yaml => qq[...some yaml...] );

DESCRIPTION

Configure Data::Downloader.

Data::Downloader uses SQLite to store both its configuration and data about the files which are downloaded. (For the location of this SQLite file, see Data::Downloader::DB.) The configuration describes URL patterns, file metadata, RSS feeds, and how to create trees of symbolic links to various subsets of the files that are downloaded.

Data::Downloader::Config can also update the configuration by reading a new file and determining which changes have been made. Any changes will _only_ affect the configuration; they will not change any of the metadata that has been stored, or the location of any of the files on disk. Certain configuration changes may be invalid if they would cause the database to become inconsistent. In such cases, to force a configuration change, you may need to either remove the database file and start from scratch, or else use SQL commands to manually update the configuration within the database to reflect a re-organization of items on disk.

FORMAT

Data::Downloader uses YAML to read configuration files; please see the YAML page for documentation about the syntax.

A configuration file is a collection of YAML "documents": sequences of lines separated by lines containing only three dashes ("---"). Each of these documents represents a Data::Downloader::Repository. A repository is a collection of files stored in a common root directory. The first few fields of a repository are :

---
name: my_repo

An arbitrary name for this repository. The name should reflect the character of the data, e.g. "images", "videos", "web_pages", "ozone_data". (required)

storage_root: /path/to/root/storage

The root directory for the storage of files. (required)

file_url_template: 'http://somehost.com/<variable1>/<variable2>/<date_variable:%Y/%m/%d>'

This is a String::Template style string for downloading files. This is required if data will be downloaded directly from urls using this template. The variables listed in the template will become required command-line arguments to dado, e.g.

dado download file --variable1=foo --variable2=bar --date_variable='2001-03-04'

URLs may also come from RSS feeds (below), in which case the file_url_template is not relevant.
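
To make the substitution concrete, here is a minimal sketch of how such a template could be expanded. It is an illustration only, not String::Template's implementation: the helper name expand_template is hypothetical, and the sketch assumes that date values arrive in YYYY-MM-DD form, as in the command line above.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper (not part of Data::Downloader): expands <name> and
# <name:strftime-format> placeholders using the supplied values.
sub expand_template {
    my ($template, %vars) = @_;
    $template =~ s{<(\w+)(?::([^>]+))?>}{
        my ($name, $format) = ($1, $2);
        die "missing value for '$name'\n" unless defined $vars{$name};
        # Assumption: values for date placeholders are given as YYYY-MM-DD.
        defined $format
            ? Time::Piece->strptime($vars{$name}, '%Y-%m-%d')->strftime($format)
            : $vars{$name};
    }ge;
    return $template;
}

my $url = expand_template(
    'http://somehost.com/<variable1>/<variable2>/<date_variable:%Y/%m/%d>',
    variable1     => 'foo',
    variable2     => 'bar',
    date_variable => '2001-03-04',
);
print "$url\n";
```

In this sketch, every variable named in the template must be supplied, which mirrors the way template variables become required command-line arguments.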

cache_strategy: LRU

The strategy for cache expiration. Currently only LRU is supported. (required)

cache_max_size: 1073741824

The approximate maximum size (in bytes) for the cache. The cache size is checked before downloading files (this may change to be less frequent). (required)

disks:
  - root: disk1/
  - root: disk2/
  - root: disk3/

These are top level subdirectories of "storage_root" in which to place files. In practice, these may be located on different devices. Currently new files will be placed in the directory whose device has the most free space (as determined by "df"). If two partitions have the same amount of free space, the new file will be placed on the one which has the most free space within that directory (i.e. the sum of DD's files is the smallest). If those are the same, a random one will be used.
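
The selection order described above can be sketched as follows. This is a rough illustration under stated assumptions, not Data::Downloader's actual code: the function name pick_disk and the fields device_free and stored_bytes are hypothetical, and the free-space figures are passed in directly rather than read from "df".

```perl
use strict;
use warnings;
use List::Util qw(max min);

# Hypothetical input: one record per disk, giving the free space on its
# device and the total size of the files already stored under it (bytes).
sub pick_disk {
    my @disks = @_;

    # 1. Keep the disks whose device has the most free space.
    my $most_free = max map { $_->{device_free} } @disks;
    my @best = grep { $_->{device_free} == $most_free } @disks;

    # 2. Break ties by the smallest total of stored files.
    my $least_used = min map { $_->{stored_bytes} } @best;
    @best = grep { $_->{stored_bytes} == $least_used } @best;

    # 3. If several candidates remain, choose one at random.
    return $best[ int rand @best ];
}

my $chosen = pick_disk(
    { root => 'disk1/', device_free => 500, stored_bytes => 40 },
    { root => 'disk2/', device_free => 900, stored_bytes => 70 },
    { root => 'disk3/', device_free => 900, stored_bytes => 10 },
);
print "$chosen->{root}\n";
```

Here disk2/ and disk3/ tie on device free space, so the disk holding fewer bytes of stored files is chosen.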

feeds:

If there are RSS feeds that describe the locations (and/or metadata) of the files, they may be listed in a "feeds" section. Each feed is a Data::Downloader::Feed. The syntax is simplest if there is only one feed, but it is possible to specify multiple feeds (see EXAMPLES).

name: georss

Each feed also has an arbitrary "name", used to identify it. The name should correspond to the source of the RSS feed.

feed_template: 'http://example.com/some/feed/<var1>/<var2>/<var3>'

This is a String::Template string (or just a string if there are no variables) which describes the url for the RSS feed. Variables in the template will become required command-line arguments to dado when refreshing the feed, e.g.

dado feeds refresh --var1=foo --var2=bar --var3=baz

It is also possible to assign default values to some of the parameters in the template, in which case they will be optional. This happens like so :

feed_parameters:
    - name: var1
      default_value: 'foo'
    - name: var2
      default_value: 'bar'

With the above defaults, var1 and var2 could be omitted from the command line :

dado feeds refresh --var3=baz
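
The interaction between defaults and command-line values can be pictured with the sketch below. It is an illustration only: the helper fill_feed_template is hypothetical, and it assumes that a value given on the command line takes precedence over a declared default.

```perl
use strict;
use warnings;

# Defaults as they would be declared under "feed_parameters".
my %defaults = ( var1 => 'foo', var2 => 'bar' );

# Hypothetical helper: fills a feed template, letting supplied values
# override defaults and rejecting variables that have neither.
sub fill_feed_template {
    my ($template, %args) = @_;
    my %values = ( %defaults, %args );   # later pairs win, so %args overrides
    $template =~ s{<(\w+)>}{
        defined $values{$1} ? $values{$1} : die "missing required --$1\n"
    }ge;
    return $template;
}

my $feed_url = fill_feed_template(
    'http://example.com/some/feed/<var1>/<var2>/<var3>',
    var3 => 'baz',
);
print "$feed_url\n";
```

Only var3 has no default here, so it is the only value that must be supplied.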

An RSS feed contains various items in <item></item> tags. An Atom feed uses "entry" instead of "item". Within these tags, there may be information about the location of the files to be downloaded, as well as various pieces of metadata that should be stored (so that they may be used to construct symbolic links and search for files).

file_source:
    filename_xpath: 'some_xpath_within_item'
    md5_xpath: 'another_one'
    url_xpath: 'yet_another_one'

These lines describe where to find the filename, MD5 checksum, and URL of an individual file within the <item> (or <entry>) tags. For the example above, the full (document-level) xpath for the filename would be //item/some_xpath_within_item.

Note that if an RSS feed contains tags with namespaces, then (per the xpath specification) all of the tags need namespaces. Data::Downloader assigns tags with no namespace to a namespace named "default". For example, if the RSS feed contains <link> within an <item> (or <entry>), but there are also tags like <datacasting:orbit>, then the xpath for <link> will be //default:item/default:link, and url_xpath, above, would be "default:link". See XML::LibXML::Node for a discussion of this.

metadata_sources:
    - name: metadata_var1
      xpath: metadata_var1s_xpath_in_an_item
    - name: metadata_var2
      xpath: another_xpath_in_an_item

These are the xpaths within an //item for pieces of metadata to be stored for each file. The above indicates that //item/metadata_var1s_xpath_in_an_item describes a piece of data that should be called "metadata_var1". Keep reading to see how to use these.

linktrees:

This section (one per data source, not one per feed) describes a list of trees (each is a Data::Downloader::Linktree) of symbolic links to be maintained; the symlinks will point to data within the repository.

- root: /some/path/where/these/symlinks/go
  condition: '{ metadata_var1 => "a value for this piece of data"}'
  path_template: some/subdir/that/uses/vars/<metadata_var1>/<metadata_var2>
- root: /another/path/for/more/symlinks
  condition: '{ metadata_var2 => { ">=" => 42, "<=" => 99 } }'
  path_template: anothersubdir/<metadata_var2>

Each linktree has a "root" (an absolute path), a "condition" (an SQL::Template style clause that limits which files get symlinks under this path; use "~" to get all files), and a "path_template" (a String::Template string for laying out the symlinks).

METHODS

init

Inserts information about repositories and feeds from a config file.

Parameters :

filename: the name of a config file
yaml: yaml content of the file (can be sent instead of file)
update_ok: allow updates, not just initialization

update

Update the config.

dump

Dump the config.

Parameters :

format: the output format (yaml or array)

EXAMPLES

Here's a sample configuration file :

---
name: my_images
storage_root: /some/where
feeds: [ { name          : flickr,
           feed_template : 'http://api.flickr.com/services/feeds/photos_public.gne?tags=<tags>&lang=en-us&format=rss_200',
           file_source   : {
                url_xpath      : 'media:content/@url',
                filename_xpath : 'media:content/@url',
                filename_regex : '/([^/]*)$'
           },
           metadata_sources: [
               { name: 'date_taken', xpath: 'dc:date.Taken'  },
               { name: 'tags',       xpath: 'media:category' } ]
         },
         { name             : smugmug,
           feed_template    : TODO,
           file_source      : TODO,
           metadata_sources : TODO
         }
       ]

linktrees :
     - root: /images
       condition: ~
       path_template: '<date_taken:%Y/%m/%d>'

SEE ALSO

Data::Downloader

Data::Downloader::DB