NAME

File::OSS::Scan - Scan a project's repository and detect any OSS (Open Source Software) files

VERSION

version 0.04

SYNOPSIS

    use File::OSS::Scan qw(:scan);

    scan_init( 'verbose' => 0, 'inflate' => 1 );

    scan_execute($proj_dir);
    my $ret = scan_result();

DESCRIPTION

This module scans a project directory against a set of predefined but customizable rules to detect source files that originate from OSS (or commercial software). Unlike some commercial OSS management solutions, it does not maintain an OSS code database. It therefore performs no code snippet matching and relies entirely on pattern matching: looking for particular types of file (e.g. COPYING, LICENSE) or for specific strings in file content (e.g. Copyright, LGPL License).

ATTRIBUTES

scan_init() takes a set of options. These options are printed to STDOUT when the module runs in chatty mode ( 'verbose' => 2 ).

ruleset_config

Specifies the path of your own config file for File::OSS::Scan::Ruleset, where you can write your own rules for OSS detection. If not specified, the module checks the value of $ENV{OSSSCAN_CONFIG} and ./.ossscan.rc. If no valid configuration file is found in any of these places, it falls back to the embedded rules in the __DATA__ section of File::OSS::Scan::Ruleset.
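
The lookup order described above can be sketched as follows. This is only an illustration: resolve_ruleset_config is a hypothetical helper, not part of File::OSS::Scan, and treating "valid" as "a readable regular file" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::OSS::Scan): returns the first
# usable config file in the documented lookup order, or undef when the
# module would fall back to its embedded rules.
# Assumption: "valid" is approximated as "readable regular file".
sub resolve_ruleset_config {
    my ($explicit) = @_;
    my @candidates = grep { defined($_) && length($_) }
        ( $explicit, $ENV{OSSSCAN_CONFIG}, './.ossscan.rc' );
    for my $path (@candidates) {
        return $path if ( -f $path and -r $path );
    }
    return;    # undef: the embedded __DATA__ rules would apply
}

my $config = resolve_ruleset_config( $ARGV[0] );
print defined $config
    ? "would use config file: $config\n"
    : "would use the embedded rules\n";
```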

verbose

[0|1|2]. Sets the verbosity level: 0 is silent, 1 outputs only messages about detected matches, and 2 is chatty. Defaults to 1 if not specified.

cache

[0|1|2]. Sets the cache mode: 0 disables the cache, 1 uses the cache, and 2 refreshes the cache. Defaults to 0 (cache disabled) if not specified. With 1, every file is checked against the records in the cache to see whether it has changed since the last scan; unchanged files are skipped. With 2, files are not checked for changes, so every file is processed and its cache record is refreshed.
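
The effect of the three modes on a single file can be summarized in a small decision function. This is a sketch only: needs_scan is a hypothetical name, and comparing modification times is an assumption, since the module may detect changes in another way.

```perl
use strict;
use warnings;

# Hypothetical decision helper, not the module's own code.
# Assumption: a file counts as "changed" when its modification time is
# newer than the one stored in the cache record.
sub needs_scan {
    my ( $cache_mode, $cached_mtime, $current_mtime ) = @_;
    return 1 if $cache_mode == 0;          # no cache: always scan
    return 1 if $cache_mode == 2;          # refresh: always scan and rewrite the record
    return 1 if !defined $cached_mtime;    # mode 1, file not in the cache yet
    return $current_mtime > $cached_mtime ? 1 : 0;    # mode 1: scan only if changed
}

for my $mode ( 0, 1, 2 ) {
    printf "cache => %d, unchanged file: %s\n", $mode,
        needs_scan( $mode, 1000, 1000 ) ? 'scan' : 'skip';
}
```

Only mode 1 ever skips a file; modes 0 and 2 differ in whether a cache record is written afterwards.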

inflate

[0|1]. Indicates whether compressed or archived files should be inflated and their extracted content scanned. Defaults to 0 if not specified. Supported file types: .jar, .tar, .gz, .zip, .Z.

working_dir

Specifies the working directory used for inflating files. If not specified, it defaults to a directory named .working under the current directory where the program is running (created if it does not exist). Careful! scan_init() empties this directory with an rm -rf command every time it is called, so be very cautious about any value assigned to this option and make sure it does not clash with an existing directory where important data is stored.
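
One way to stay clear of existing data is to hand scan_init() a freshly created temporary directory. The sketch below uses File::Temp for that; the directory name template is arbitrary, and the guard around the module call is there only so the snippet also runs where File::OSS::Scan is not installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A fresh, empty directory under the system temp dir that nothing else
# uses; CLEANUP => 1 removes it again when the program exits.
my $working_dir = tempdir( 'ossscan-XXXXXX', TMPDIR => 1, CLEANUP => 1 );

my %options = (
    inflate     => 1,
    working_dir => $working_dir,
);

# Only call into the module when it is actually installed.
if ( eval { require File::OSS::Scan; 1 } ) {
    File::OSS::Scan->import(':scan');
    scan_init(%options);
}
else {
    print "would call scan_init with working_dir => $working_dir\n";
}
```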

strings

Path of the strings command. Defaults to /bin/strings if not specified. If no executable strings command can be found, any binary files encountered are skipped.

jar

Path of the jar command. Defaults to /bin/jar if not specified. If no executable jar command can be found, any .jar files encountered are skipped.

tar

Path of the tar command. Defaults to /bin/tar if not specified. If no executable tar command can be found, any .tar files encountered are skipped.

gunzip

Path of the gunzip command. Defaults to /bin/gunzip if not specified. If no executable gunzip command can be found, any .gz files encountered are skipped.

unzip

Path of the unzip command. Defaults to /bin/unzip if not specified. If no executable unzip command can be found, any .zip files encountered are skipped.

uncompress

Path of the uncompress command. Defaults to /bin/uncompress if not specified. If no executable uncompress command can be found, any .Z files encountered are skipped.
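
On systems where these tools do not live under /bin, the six command options can be filled in by searching PATH. The following is a minimal sketch; find_command is a hypothetical helper, not something the module provides.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: returns the first executable file named $cmd
# found in the directories listed in PATH, or undef if there is none.
sub find_command {
    my ($cmd) = @_;
    for my $dir ( File::Spec->path ) {
        my $candidate = File::Spec->catfile( $dir, $cmd );
        return $candidate if ( -f $candidate and -x $candidate );
    }
    return;
}

# Collect a value for every command option that can be resolved.
my %options;
for my $cmd (qw(strings jar tar gunzip unzip uncompress)) {
    my $path = find_command($cmd);
    if ( defined $path ) {
        $options{$cmd} = $path;
    }
    else {
        warn "$cmd not found on PATH; the matching file types would be skipped\n";
    }
}

print "$_ => $options{$_}\n" for sort keys %options;
```

The resulting hash can then be passed along with the other options, for example scan_init( %options, 'inflate' => 1 ).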

METHODS

scan_init(%params)

    use File::OSS::Scan qw( :scan );

    scan_init(
        'verbose' => 2,     # chatty output
        'inflate' => 1,     # inflate archived files
        'cache'   => 1      # enable cache
    );

Performs the initialization required before running a scan: checks that the needed commands are available, initializes the working directory, and creates a File::OSS::Scan::Ruleset and a File::OSS::Scan::Matches instance. Accepted parameters are described in detail in the "ATTRIBUTES" section.

scan_execute($proj_dir)

    use File::OSS::Scan qw( :scan );

    scan_init();    # we are fine with defaults
    scan_execute($proj_dir);

Performs the actual scan of the given project directory. Any detected OSS files are recorded in the File::OSS::Scan::Matches instance and can be fetched later via scan_result(). The only required parameter is $proj_dir, the project directory you want to scan.

scan_result($format)

    use File::OSS::Scan qw( :scan );

    scan_init();    # we are fine with defaults
    scan_execute($proj_dir);

    my $ret_hash = scan_result();
    my $ret_text = scan_result('txt');
    my $ret_html = scan_result('html');
    my $ret_json = scan_result('json');

Returns all detected matches on files within the project directory. The $format parameter can be one of txt (plain text), html (formatted HTML tables) or json (JSON string). If not specified, the raw data hash is returned.

clear_cache()

    use File::OSS::Scan qw( :all );

    clear_cache();

Removes all cached results from the file system.

SCAN RULES

Scan rules can be configured in the config file specified via the ruleset_config parameter, in the file named by $ENV{OSSSCAN_CONFIG}, or in the file .ossscan.rc under the current directory where the program is running. If none of them exists, the module falls back to the __DATA__ section of File::OSS::Scan::Ruleset. The following types of rules are currently supported. If you are not sure how to compose a rule, the best approach is to refer to the __DATA__ section of File::OSS::Scan::Ruleset.

[SECTION]

    # section for file check
    [FILE]
        ...

    # section for line check
    [LINE]
        ...

Declares the section that all following rules belong to. Valid sections are GLOBAL, DIRECTORY, FILE and LINE.

filename_match

    100% filename_match COPYING\.\w+
    50%  filename_match AUTHOR[S]?

Detects OSS files based on a filename check. The first element is the certainty level, ranging from 0% to 100%. The second element is the name of the function that is called to process the rule. The remainder is the pattern (regex) used for searching.
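
To make the three-part layout of a rule line concrete, here is a rough sketch that splits a rule into its elements and tries the pattern against a few file names. parse_rule is a hypothetical function, and the unanchored, case-sensitive match is an assumption; the module's own parser and matching options may differ.

```perl
use strict;
use warnings;

# Hypothetical parser for one rule line: certainty, function name, and
# everything after that as the rule's argument (here: a regex).
sub parse_rule {
    my ($line) = @_;
    my ( $certainty, $function, $rest ) =
        $line =~ /^\s*(\d{1,3})%\s+(\w+)\s+(.+?)\s*$/
        or return;
    return if $certainty > 100;    # certainty must stay within 0..100
    return { certainty => $certainty, function => $function, args => $rest };
}

my $rule = parse_rule('100% filename_match COPYING\.\w+');

# Assumption: the pattern is applied unanchored and case-sensitively.
for my $name (qw(COPYING.txt COPYING README.md)) {
    my $hit = $name =~ /$rule->{args}/ ? 'match' : 'no match';
    print "$name: $hit (certainty $rule->{certainty}%)\n";
}
```

Note that this particular pattern requires a dot and at least one word character after COPYING, so a bare COPYING file does not match it under these assumptions.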

content_match

    100% content_match MIT\W*Licen[cs]e
    100% content_match Artistic\W*Licen[cs]e

Detects OSS files by checking whether the file's content matches one of the license strings. The first element is the certainty level, ranging from 0% to 100%. The second element is the name of the function that is called to process the rule. The remainder is the pattern (regex) used for searching.
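
The following sketch shows what the two example patterns accept when applied to individual lines as plain Perl regexes. matching_licenses is a hypothetical helper, and case-sensitive matching is an assumption of the sketch, not a documented property of the module.

```perl
use strict;
use warnings;

# The two patterns from the rules above, compiled as plain Perl regexes.
# Assumption: matching is case-sensitive.
my %license_patterns = (
    'MIT'      => qr/MIT\W*Licen[cs]e/,
    'Artistic' => qr/Artistic\W*Licen[cs]e/,
);

# Hypothetical helper: names of all patterns that match the given line.
sub matching_licenses {
    my ($line) = @_;
    my @hits = grep { $line =~ $license_patterns{$_} }
        sort keys %license_patterns;
    return @hits;
}

my @sample_lines = (
    'Released under the MIT License.',
    'See the Artistic-Licence file for details.',
    'No license information here.',
);

for my $line (@sample_lines) {
    my @hits = matching_licenses($line);
    printf "%-45s => %s\n", $line, @hits ? join( ', ', @hits ) : 'no match';
}
```

The \W* part tolerates spaces or punctuation between the words, and [cs] covers both spellings of "license".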

copyright_match

    50%  copyright_match MY_COMPANY MyCompany

Detects OSS files by checking whether the file contains a copyright declaration. The first element is the certainty level, ranging from 0% to 100%. The second element is the name of the function that is called to process the rule. The remainder is a list of names to be excluded. Usually you specify your own company's name here, so that when a copyright statement like the following is found:

    Copyright (C) 1998 - 2012, MyCompany, <xxx@xxx>

the module knows that this is proprietary code and excludes it from the detected matches.
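
The exclusion logic can be pictured roughly as below. This is a sketch under stated assumptions: is_foreign_copyright is a hypothetical function, and both the case-insensitive search for the word "copyright" and the case-insensitive substring comparison of names are guesses at the behaviour, not the module's documented implementation.

```perl
use strict;
use warnings;

# Hypothetical check: true when the line carries a copyright notice that
# does not mention any of our own names.
# Assumptions: "copyright" is searched case-insensitively, and own names
# are compared as case-insensitive substrings.
sub is_foreign_copyright {
    my ( $line, @own_names ) = @_;
    return 0 unless $line =~ /copyright/i;
    for my $name (@own_names) {
        return 0 if index( lc $line, lc $name ) >= 0;
    }
    return 1;
}

my @own = qw(MY_COMPANY MyCompany);
for my $line (
    'Copyright (C) 1998 - 2012, MyCompany, <xxx@xxx>',
    'Copyright (c) 2010 Some Vendor',
    'A line without any notice',
    )
{
    printf "%-50s => %s\n", $line,
        is_foreign_copyright( $line, @own ) ? 'report' : 'ignore';
}
```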

exclude_dir

    exclude_dir: data

This is a global setting, so it should be defined under the GLOBAL section or at the very beginning of the configuration file. It accepts a list of directory names; these directories are skipped during scanning.

exclude_file

    exclude_file: Makefile Build\.PL

This is a global setting, so it should be defined under the GLOBAL section or at the very beginning of the configuration file. It accepts a list of file names (or patterns); these files are skipped during scanning.

exclude_extension

    exclude_extension: png jpg gif pdf doc docx html htm xml json xls

This is a global setting, so it should be defined under the GLOBAL section or at the very beginning of the configuration file. It accepts a list of file extensions; files with the listed extensions are skipped during scanning.
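
Taken together, the three exclusion settings act as a filter on the paths visited during a scan. The sketch below illustrates that idea with the example values from above. is_excluded is a hypothetical function, and several details are assumptions: directory names are compared exactly against each path component, file patterns are anchored to the whole file name, and extensions are compared case-insensitively.

```perl
use strict;
use warnings;
use File::Spec;

# Example values copied from the settings above.
my %exclude = (
    dir       => [qw(data)],
    file      => [ 'Makefile', 'Build\.PL' ],
    extension => [qw(png jpg gif pdf doc docx html htm xml json xls)],
);

# Hypothetical filter: true when a relative path would be skipped.
sub is_excluded {
    my ($path) = @_;
    my @parts = File::Spec->splitdir($path);
    my $file  = pop @parts;

    # Assumption: any path component equal to an excluded dir name counts.
    for my $dir (@parts) {
        return 1 if grep { $dir eq $_ } @{ $exclude{dir} };
    }

    # Assumption: file patterns must match the whole file name.
    return 1 if grep { $file =~ /^(?:$_)$/ } @{ $exclude{file} };

    # Assumption: extensions are compared case-insensitively.
    my ($ext) = $file =~ /\.([^.]+)$/;
    if ( defined $ext ) {
        return 1 if grep { lc($ext) eq $_ } @{ $exclude{extension} };
    }
    return 0;
}

for my $path (qw(src/data/sample.c Build.PL docs/logo.png src/main.c)) {
    printf "%-20s %s\n", $path, is_excluded($path) ? 'skipped' : 'scanned';
}
```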

SEE ALSO

AUTHOR

Harry Wang <harry.wang@outlook.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2014 by Harry Wang.

This is free software, licensed under:

    Artistic License 1.0