NAME

File::FormatIdentification::RandomSampling - methods to identify the content of devices or media files using random sampling

VERSION

version 0.006

SYNOPSIS

This module estimates the content of media (or files). It reads randomly chosen sectors and applies heuristics to them to determine the content types.

To check the base type of a given binary string:

  my $ff = File::FormatIdentification::RandomSampling->new(); # basic instantiation
  my $type = $ff->calc_type($buffer); # calc type of given binary string
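The random sampling of sectors mentioned above could look roughly like the following. This is a minimal sketch in core Perl, not code taken from the module: the helper name sample_sectors, its parameters, and the 512-byte default sector size are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): read $count randomly
# chosen sectors of $sector_size bytes from a file or image.
sub sample_sectors {
    my ($path, $count, $sector_size) = @_;
    $sector_size //= 512;    # assumed default sector size

    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $size    = -s $fh;
    my $sectors = int($size / $sector_size);

    my @buffers;
    return @buffers if $sectors < 1;    # too small to hold a full sector

    for (1 .. $count) {
        # Pick a sector-aligned offset at random.
        my $offset = int(rand($sectors)) * $sector_size;
        seek $fh, $offset, 0 or die "cannot seek in $path: $!";
        my $buffer;
        my $read = read $fh, $buffer, $sector_size;
        die "cannot read from $path: $!" unless defined $read;
        push @buffers, $buffer;
    }
    close $fh;
    return @buffers;
}
```

Each returned buffer can then be handed to calc_type as shown in the synopsis. Note that -s reports the size of regular files; for block devices the size may have to be determined another way.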

TOOLS

The following tools are supplied with this module and are presented below:

crazy_fast_image_scan.pl

This script scans devices or images very quickly using random sampling and reports what kind of content was found.

For detailed documentation, see the POD included in the script.

cfi_create_training_data.pl

This script scans a set of files, calculates the most frequent one- and bigrams, and stores them in a CSV file.
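The counting step can be pictured as follows. This is a minimal sketch in core Perl, not the script's actual code: the function names count_ngrams and most_frequent, the encoding of a bigram as a single integer, and the comma-separated output layout are all assumptions.

```perl
use strict;
use warnings;

# Count every 1-byte and 2-byte sequence in a binary string.
# A bigram is stored as first_byte * 256 + second_byte (assumed encoding).
sub count_ngrams {
    my ($buffer) = @_;
    my (%onegram, %bigram);
    my @bytes = unpack 'C*', $buffer;
    for my $i (0 .. $#bytes) {
        $onegram{ $bytes[$i] }++;
        $bigram{ $bytes[$i] * 256 + $bytes[ $i + 1 ] }++ if $i < $#bytes;
    }
    return (\%onegram, \%bigram);
}

# Return the $n most frequent keys; ties are broken by numeric key value.
sub most_frequent {
    my ($counts, $n) = @_;
    my @sorted =
      sort { $counts->{$b} <=> $counts->{$a} || $a <=> $b } keys %$counts;
    $n = @sorted if $n > @sorted;
    return @sorted[ 0 .. $n - 1 ];
}

# Print the top three one- and bigrams of a sample string as one CSV row.
my ($one, $bi) = count_ngrams("abracadabra");
print join(',', most_frequent($one, 3), most_frequent($bi, 3)), "\n";
```

A real training run would emit one such row per file, together with a label for the known file type.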

cfi_learn_model.pl

This script reads the CSV file and, using AI::DecisionTree, prints a new model module in the style of File::FormatIdentification::RandomSampling::Model.

SOURCE

The current development version is available at https://art1pirat.spdns.org/art1/crazy-fast-image-scan

METHODS

init_bytegrams

resets the internal bytegram state. Also called when the object is instantiated

update_bytegram

$buffer - updates the internal bytegram state using the contents of $buffer

calc_histogram

uses the first 8 most significant bytegram entries to form a histogram, returned as a hash reference

is_uniform

returns true if the 1-byte bytegrams are uniformly distributed

is_empty

returns true if the 1-byte bytegrams indicate an empty buffer

is_text

returns true if the 1-byte bytegrams are typical for text

is_video

returns true if the 1-byte bytegrams are typical for MPEG/QuickTime videos

calc_type

returns a string indicating the type of a given buffer
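To illustrate how 1-byte statistics alone can separate empty, text, and uniformly distributed buffers, here is a minimal sketch in core Perl. It is not the module's implementation: the function name guess_type, the returned type names, and both thresholds (95% printable bytes, 7.5 bits of entropy) are assumptions chosen for illustration, and video detection is left out.

```perl
use strict;
use warnings;

# Illustrative only: thresholds and type names are assumptions, not the
# values used by File::FormatIdentification::RandomSampling.
sub guess_type {
    my ($buffer) = @_;
    my $total = length $buffer;
    return 'empty' if $total == 0;

    # Count how often each byte value occurs.
    my @count = (0) x 256;
    $count[$_]++ for unpack 'C*', $buffer;

    # A single byte value fills the whole buffer (e.g. all 0x00 or 0xFF).
    my ($max) = sort { $b <=> $a } @count;
    return 'empty' if $max == $total;

    # Mostly printable ASCII plus tab, line feed and carriage return.
    my $printable = 0;
    $printable += $count[$_] for 9, 10, 13, 32 .. 126;
    return 'text' if $printable / $total > 0.95;

    # Shannon entropy close to 8 bits per byte means a nearly flat
    # distribution, as seen in compressed or encrypted data.
    my $entropy = 0;
    for my $c (grep { $_ > 0 } @count) {
        my $p = $c / $total;
        $entropy -= $p * log($p) / log(2);
    }
    return 'uniform' if $entropy > 7.5;

    return 'unknown';
}
```

The order of the checks matters: the cheap, unambiguous cases (empty, text) are tested first, and the entropy test only decides what is left.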

AUTHOR

Andreas Romeyke <pause@andreas-romeyke.de>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2020 by Andreas Romeyke.

This is free software, licensed under:

  The GNU General Public License, Version 3, June 2007