#!/usr/bin/env perl
# ABSTRACT: efficiently count unique tokens from a file
# PODNAME: uniq_wc
use autodie;
use strict;
use utf8;
use warnings;

our $VERSION = '0.014'; # VERSION

use Getopt::Long;
use Pod::Usage;
use Text::SpeedyFx;

GetOptions(
    q(help)     => \my $help,
    q(length=i) => \my $length,
    q(seed=i)   => \my $seed,
    q(bits=i)   => \my $bits,
) or pod2usage(-verbose => 1);
pod2usage(-verbose => 2)
    if $help or $#ARGV != 0;

# defaults: 1 MB feature vector (converted to bits below), 8-bit character table
$length = 1 unless $length;
$length *= 2 ** 20 << 3;
$seed = 0x4c53_4820 unless $seed;
$bits = 8 unless $bits;

# slurp the whole file through the :mmap layer
my $data;
{
    local $/ = undef;
    open my $fh, q(<:mmap), shift @ARGV;
    $data = <$fh>;
    close $fh;
}

# hash every token into the feature vector, then count the set bits
my $feature_vector = Text::SpeedyFx
    ->new($seed, $bits)
    ->hash_fv($data, $length);
my $hamming_weight = unpack q(%32b*) => $feature_vector;

# linear counting estimate: n = -m * ln(Vn), where Vn is the ratio of empty bits
printf qq(%0.0f\n), -$length * log(1 - $hamming_weight / $length);

__END__

=pod

=encoding UTF-8

=head1 NAME

uniq_wc - efficiently count unique tokens from a file

=head1 VERSION

version 0.014

=head1 SYNOPSIS

    uniq_wc [options] FILE

=head1 DESCRIPTION

The I<Linear Probabilistic Counter> is space-efficient and allows the implementer to specify the desired level of accuracy.
This algorithm is useful when space efficiency is important but you need to be able to control the error in your results.

This algorithm works in a two-step process.
The first step allocates a bitmap in memory, initialized to all zeros.
A hash function is then applied to each entry in the input data.
The result of the hash function maps the entry to a bit in the bitmap, and that bit is set to 1.
In the second step, the algorithm counts the number of empty bits and uses that number as input to the following equation to get the estimate:

    n = -m ln Vn

In the equation, I<m> is the size of the bitmap and I<Vn> is the ratio of empty bits over the size of the map.

The important thing to note is that the size of the original bitmap can be much smaller than the expected maximum cardinality.
How much smaller depends on how much error you can tolerate in the result.
Because the size of the bitmap, I<m>, is smaller than the total number of distinct elements, there will be collisions.
These collisions are required for space efficiency, but they are also the source of the error in the estimation.
So by controlling the size of the original map we can estimate the number of collisions and therefore the amount of error we will see in the end result.
(L<source|http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html>)
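
Expressed as code, a minimal pure-Perl sketch of the same estimator might look like the following; the bitmap size, the use of C<Digest::MD5> as the hash function, and the C<\w+> tokenization are illustrative assumptions only, whereas C<uniq_wc> itself delegates the hashing step to L<Text::SpeedyFx>:

    #!/usr/bin/env perl
    # Linear counting sketch: read text from STDIN, estimate unique tokens.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $m      = 2 ** 20;            # bitmap size in bits (assumed)
    my $bitmap = "\0" x ($m / 8);    # all bits start at zero

    while (my $line = <STDIN>) {
        for my $token ($line =~ /(\w+)/g) {
            # map the token to one of the $m bits and set it
            my ($h) = unpack q(N) => md5(lc $token);
            vec($bitmap, $h % $m, 1) = 1;
        }
    }

    # Vn = ratio of empty bits; estimate n = -m * ln(Vn)
    my $ones = unpack q(%32b*) => $bitmap;
    printf qq(%0.0f\n), -$m * log(1 - $ones / $m);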

=head1 OPTIONS

=over 4

=item --help

This help text.

=item --length

Feature vector length (in MB, default: 1).

=item --seed

Custom seed (integer).

=item --bits

How many bits represent one character.
The default value, 8, sacrifices Unicode handling but is fast and has a low memory footprint.
The value of 18 encompasses the I<Basic Multilingual>, I<Supplementary Multilingual> and I<Supplementary Ideographic> planes.
See the sketch after this list.

=back
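
For instance, to handle text beyond Latin-1 the same pipeline can be driven directly from Perl with an 18-bit character table; the file name and the 8 MB vector length below are placeholders, and the seed is the script's default:

    # Sketch: programmatic equivalent of "uniq_wc --bits 18 --length 8 FILE".
    use strict;
    use warnings;
    use Text::SpeedyFx;

    my $length = 8 * (2 ** 20 << 3);    # 8 MB expressed in bits, as uniq_wc does
    my $data   = do {
        local $/ = undef;
        open my $fh, '<:mmap', 'corpus_utf8.txt' or die $!;
        <$fh>;
    };

    my $fv   = Text::SpeedyFx->new(0x4c53_4820, 18)->hash_fv($data, $length);
    my $ones = unpack q(%32b*) => $fv;
    printf qq(%0.0f\n), -$length * log(1 - $ones / $length);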

=head1 EXAMPLES

Use:

    $ time uniq_wc -l 8 enwik8
    361181

    real    0m1.262s
    user    0m1.220s
    sys     0m0.036s

Instead of:

    $ time perl -mbytes -lne '++$u{lc$1}while/(\w+)/g}{print~~keys%u' enwik8
    361990

    real    0m6.798s
    user    0m6.744s
    sys     0m0.028s

=head1 SEE ALSO

=over 4

=item *

L<Text::SpeedyFx>

=item *

L<Big Data Counting: How To Count A Billion Distinct Objects Using Only 1.5KB Of Memory|http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html>

=item *

L<Probabilistic Data Structures for Web Analytics and Data Mining|http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/>

=item *

L<The first 10^8 bytes of the English Wikipedia dump on Mar. 3, 2006|https://cs.fit.edu/~mmahoney/compression/textdata.html>

=back

=head1 AUTHOR

Stanislaw Pusep <stas@sysd.org>

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 2021 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut