sets - set operations in Perl
version 0.978
sets [--usage] [--help] [--man] [--version]
sets [--binmode|-b <string>] [--cache-sorted|-S <suffix>] [--internal-sort|-I] [--loglevel|-l <level>] [--sorted|-s] [--trim|-t] expression...
# intersect two files
sets file1 ^ file2

# things are speedier when files are sorted
sets -s sorted-file1 ^ sorted-file2

# you can use a bit of caching, generating sorted files
# automatically for possible multiple or later reuse. For example,
# the following is the symmetric difference where the sorting of
# the input files will be performed two times only
sets -S .sorted '(file1 - file2) + (file2 - file1)'

# In the example above, note that expressions with grouping need to be
# specified in a single string.

# sometimes leading and trailing whitespace only leads to trouble, so
# you can trim data on-the-fly
sets -t file1-unix - file2-dos
This program lets you perform set operations working on input files.
The set operations that can be performed are the following:
intersection
the binary operation that selects all the elements that are in both the left and the right hand operand. This operation can be specified with several operators; the examples in this document use ^.
union
the binary operation that selects all the elements that are in either the left or the right hand operand. This operation can be specified with several operators; the examples in this document use +.
difference
the binary operation that selects all the elements that are in the left but not in the right hand operand. This operation can be specified with several operators; the examples in this document use -.
Expressions can be grouped with parentheses, so that you can set the precedence of the operations and create complex aggregations. For example, the following expression computes the symmetric difference between the two sets:
(set1 - set2) + (set2 - set1)
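The meaning of the three operations, and of the symmetric difference built from them, can be sketched with the standard sort and comm utilities. This is only an illustration of the semantics, not how sets works internally: the sample data is made up, and the use of sort -u assumes that duplicates do not matter, which this document does not state.

```shell
# Hypothetical sample data. comm needs sorted input, and LC_ALL=C keeps
# sort and comm agreeing on the ordering.
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' cherry apple banana | sort -u > "$tmp/set1"
printf '%s\n' date banana cherry  | sort -u > "$tmp/set2"

both=$(comm -12 "$tmp/set1" "$tmp/set2")        # intersection: in both operands
either=$(sort -u "$tmp/set1" "$tmp/set2")       # union: in either operand
left_only=$(comm -23 "$tmp/set1" "$tmp/set2")   # difference: in left, not in right

# (set1 - set2) + (set2 - set1): the symmetric difference
symdiff=$( { comm -23 "$tmp/set1" "$tmp/set2"
             comm -13 "$tmp/set1" "$tmp/set2"; } | sort -u )
rm -rf "$tmp"

printf '%s\n' "intersection:" "$both" "union:" "$either" \
              "difference:" "$left_only" "symmetric difference:" "$symdiff"
```

With this data the intersection is banana and cherry, the difference is apple, and the symmetric difference is apple and date.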
Expressions should normally be entered as a single string that is then parsed. In case of simple operations (e.g. one operation on two sets) you can also provide multiple arguments. In other words, the following invocations should be equivalent:
sets 'set1 - set2'
sets set1 - set2
Options can be specified only as the first parameters. If your first set begins with a dash, use a double dash to explicitly terminate the list of options, e.g.:
sets -- -first-set ^ -second-set
In general, the first non-option argument also terminates the list of options, so the example above would work without the -- as well. In the pathological case of a file named -s, however, you do need to terminate the options explicitly with --. You get the idea.
Files with spaces and other unusual characters in their names can be specified by means of quotes or escapes. The following are all valid ways of subtracting the file named "to remove" from the file named "input file":
sets "'input file' - 'to remove'"
sets '"input file" - "to remove"'
sets 'input\ file - to\ remove'
sets "input\\ file - to\\ remove"
sets input\ file - to\ remove
The first two examples use single and double quoting inside the expression. The third example uses a backslash to escape the spaces. The fourth does the same, but the escape character is repeated because of the interpolation rules of the shell inside double quotes. The last example relies on the shell's own escaping rules and on the fact that simple expressions like this one can be specified as multiple arguments instead of a single string.
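To see what each quoting style actually hands to the program, you can print the arguments the shell produces. The sketch below does not call sets at all: show_args is a hypothetical helper that prints each argument it receives between angle brackets.

```shell
# show_args: hypothetical helper, prints every argument it receives as <arg>.
show_args() { printf '<%s>\n' "$@"; }

single=$(show_args "'input file' - 'to remove'")   # one argument, inner quotes kept
escaped=$(show_args 'input\ file - to\ remove')    # one argument, backslashes kept
doubled=$(show_args "input\\ file - to\\ remove")  # shell turns \\ into a single \
multi=$(show_args input\ file - to\ remove)        # three arguments, spaces unescaped

printf '%s\n' "$single" "$escaped" "$doubled" "$multi"
# <'input file' - 'to remove'>
# <input\ file - to\ remove>
# <input\ file - to\ remove>
# <input file>
# <->
# <to remove>
```

In the single-string forms, the quotes and backslashes reach the program unchanged and are left for the expression parser to resolve. In the multiple-argument form, the shell has already removed them.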
--binmode | -b string
set the string used when calling binmode on STDOUT. The default is :raw:encoding(UTF-8), which normalizes newline handling and expects UTF-8 data.
--cache-sorted | -S suffix
input files are sorted and saved into a file with the same name and the suffix appended; if this file already exists, it is used instead of the input file. In this way it is possible to generate sorted files on the fly and reuse them if available. For example, suppose that you want to remove the items in removeme from files file1 and file2; in the following invocations:
sets file1 - removeme > file1.filtered
sets file2 - removeme > file2.filtered
the file removeme would be sorted in both calls, whereas in the following ones:
sets -S .sorted file1 - removeme > file1.filtered
sets -S .sorted file2 - removeme > file2.filtered
it would be sorted only in the first call, which generates removeme.sorted; the second call then reuses that file. Of course you're trading disk space for speed here, but most of the time that is exactly what you want when you have disk space but little time to wait. This means that most of the time you'll want to use this option, unless you're willing to wait longer or you already know that the input files are sorted (in which case you would use --sorted | -s instead).
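The caching behaviour described for --cache-sorted can be sketched in plain shell. This is an illustration under assumptions, not the code sets runs: sorted_copy is a hypothetical helper, the sample file is made up, and sort under LC_ALL=C stands in for whatever ordering sets uses.

```shell
# sorted_copy FILE SUFFIX: hypothetical helper, not part of sets.
# Prints the name of a sorted copy of FILE, creating it only if it is missing.
sorted_copy() {
    cached="$1$2"
    if [ ! -e "$cached" ]; then
        LC_ALL=C sort "$1" > "$cached"
    fi
    printf '%s\n' "$cached"
}

tmp=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmp/removeme"

first=$(sorted_copy "$tmp/removeme" .sorted)    # sorts and writes removeme.sorted
printf '%s\n' zebra > "$tmp/removeme"           # the original changes afterwards
second=$(sorted_copy "$tmp/removeme" .sorted)   # cached file exists: no new sort

cached_content=$(cat "$second")
rm -rf "$tmp"
printf '%s\n' "$cached_content"
```

The second call prints apple, fig and pear, not zebra. If the existing file is used whenever it is present, as the description says, a cached copy that is older than its input is still reused, so it is worth deleting the cached files when the inputs change.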
--help
print a somewhat more verbose help, showing the usage, this description of the options and some examples from the synopsis.
--internal-sort | -I
force the use of the internal sorting facility even if the external sort program is available. A rough benchmark showed that this is about 7% slower than using the external sort, so avoid it if you can.
--loglevel | -l level
set the verbosity of the logging subsystem. Allowed values (from most to least verbose): TRACE, DEBUG, INFO, WARN, ERROR and FATAL.
--man
print out the full documentation for the script.
--sorted | -s
in normal mode, input files are sorted on the fly before being used. If you know that all your input files are already sorted, you can skip the extra sorting step by using this option:
sets -s file1.sorted ^ file2.sorted
--trim | -t
if you happen to have leading and/or trailing whitespace (including tabs, carriage returns, etc.) that you want to get rid of, you can turn this option on. This is particularly useful if some files come from the UNIX world and others from the DOS world, because they have different ideas about how to terminate a line.
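Why trimming matters can be seen with standard tools. The sketch below is an illustration, not how sets implements --trim: the trim function is a hypothetical stand-in built on sed and the POSIX [[:space:]] class, and the data is made up.

```shell
# trim: hypothetical stand-in that strips leading and trailing whitespace
# (spaces, tabs, carriage returns) from every line.
trim() { sed 's/^[[:space:]]*//; s/[[:space:]]*$//'; }

unix_line=$(printf 'apple\n')
dos_line=$(printf '  apple\t\r\n')    # same item, padded and CRLF-terminated

if [ "$unix_line" = "$dos_line" ]; then raw=equal; else raw=different; fi

trimmed=$(printf '%s\n' "$dos_line" | trim)
if [ "$unix_line" = "$trimmed" ]; then after=equal; else after=different; fi

printf 'raw: %s, trimmed: %s\n' "$raw" "$after"
# raw: different, trimmed: equal
```

Without trimming, the carriage return left over from the DOS line ending makes two otherwise identical items compare as different, so an intersection or difference between such files would give surprising results.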
--usage
print a concise usage line and exit.
--version
print the version of the script.
Some options can be set from the environment:
SETS_CACHE
the same as specifying --cache-sorted | -S suffix on the command line. The content of SETS_CACHE is used as the suffix.
SETS_INTERNAL_SORT
the same as specifying --internal-sort | -I on the command line.
SETS_MAX_FILES
maximum number of (temporary) files to keep when using the internal sorting facility.
SETS_MAX_RECORDS
maximum number of input records to keep in memory when using the internal sorting facility.
SETS_SORTED
the same as specifying --sorted | -s on the command line.
SETS_TRIM
the same as specifying --trim | -t on the command line.
Flavio Poletti <polettix@cpan.org>
Copyright (C) 2011-2016 by Flavio Poletti polettix@cpan.org.
This module is free software. You can redistribute it and/or modify it under the terms of the Artistic License 2.0.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.