NAME

Algorithm::CurveFit::Simple - Convenience wrapper around Algorithm::CurveFit

SYNOPSIS

    use Algorithm::CurveFit::Simple qw(fit);

    my ($max_dev, $avg_dev, $src) = fit(xdata => \@xdata, ydata => \@ydata, ..options..);

    # Alternatively pass xdata and ydata together:
    my ($max_dev, $avg_dev, $src) = fit(xydata => [\@xdata, \@ydata], ..options..);

    # Alternatively pass data as array of [x,y] pairs:
    my ($max_dev, $avg_dev, $src) = fit(xydata => [[1, 2], [2, 5], [3, 10]], ..options..);

DESCRIPTION

This is a convenience wrapper around Algorithm::CurveFit. Given a body of (x, y) data points, it will generate a polynomial formula f(x) = y which fits that data.

Its main differences from Algorithm::CurveFit are:

  • It synthesizes the initial formula for you,

  • It allows for a time limit on the curve-fit instead of an iteration count,

  • It implements the formula as source code (or as a perl coderef, if you want to use the formula immediately in your program).

Additionally it returns a maximum deviation and average deviation of the formula vs the xydata, which is more useful (to me, at least) than Algorithm::CurveFit's square residual output. Closer to 1.0 indicates a better fit. Play with terms => # until these deviations are as close to 1.0 as possible, and beware overfitting.

SUBROUTINES

There is only one public subroutine, fit(). It must be given either an xydata parameter or both xdata and ydata parameters. All other parameters are optional.

It returns three values: the maximum deviation, the average deviation, and the formula implementation.

Options

fit(xdata => \@xdata, ydata => \@ydata)

The data points the formula will fit. Same as Algorithm::CurveFit parameters of the same name.

fit(xydata => [[1, 2, 3, 4], [10, 17, 26, 37]])
fit(xydata => [[1, 10], [2, 17], [3, 26], [4, 37]])

A more convenient way to provide data points. fit() will try to detect how the data points are organized -- list of x and list of y, or list of [x,y].
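
When the shape of the data could be read either way (for example, a two-element list of two-element lists), one option is to split the pairs yourself and pass xdata and ydata explicitly. The following is a minimal sketch in plain Perl; pairs_to_columns is a hypothetical helper written for this illustration and is not part of Algorithm::CurveFit::Simple.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by the module): split a list of
# [x, y] pairs into two parallel arrays.
sub pairs_to_columns {
    my ($pairs) = @_;
    my (@x, @y);
    for my $pair (@$pairs) {
        die "expected an [x, y] pair"
            unless ref $pair eq 'ARRAY' && @$pair == 2;
        push @x, $pair->[0];
        push @y, $pair->[1];
    }
    return (\@x, \@y);
}

my ($xdata, $ydata) = pairs_to_columns([[1, 10], [2, 17], [3, 26], [4, 37]]);
print "x: @$xdata\n";   # x: 1 2 3 4
print "y: @$ydata\n";   # y: 10 17 26 37
```

The two arrayrefs can then be passed as fit(xdata => $xdata, ydata => $ydata), which sidesteps layout detection entirely.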

fit(terms => 3)

Sets the order of the polynomial, which will be of the form k + a*x + b*x**2 + c*x**3 .... The default is 3 and the limit is 10.

There is no need to specify initial k. It will be calculated from xydata.
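
To make the polynomial form concrete, here is a small sketch that evaluates k + a*x + b*x**2 + c*x**3 in plain Perl, once term by term and once with Horner's method. The coefficients (5, 4, 1, 0) are hand-picked for illustration because they reproduce the sample points (1, 10), (2, 17), (3, 26), (4, 37) exactly; they are not output from fit(), and both subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Coefficients in ascending order of power: k, a, b, c.
# Hand-picked for illustration: y = 5 + 4*x + x**2 (cubic term is zero).
my @coef = (5, 4, 1, 0);

# Term-by-term evaluation of k + a*x + b*x**2 + c*x**3.
sub poly_direct {
    my ($x, @c) = @_;
    my $y = 0;
    $y += $c[$_] * $x**$_ for 0 .. $#c;
    return $y;
}

# Horner's method: the same polynomial with fewer multiplications.
sub poly_horner {
    my ($x, @c) = @_;
    my $y = 0;
    $y = $y * $x + $_ for reverse @c;
    return $y;
}

for my $x (1 .. 4) {
    printf "x=%d direct=%d horner=%d\n",
        $x, poly_direct($x, @coef), poly_horner($x, @coef);
}
# x=1 direct=10 horner=10
# x=2 direct=17 horner=17
# x=3 direct=26 horner=26
# x=4 direct=37 horner=37
```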

fit(time_limit => 3)

If a time limit is given (in seconds), fit() will spend no more than that long trying to fit the data. It may return in much less time. The default is 3.

fit(iterations => 10000)

If an iteration count is given, fit() will ignore any time limit and iterate up to iterations times trying to fit the curve. Same as Algorithm::CurveFit parameter of the same name.

fit(inv => 1)

Setting inv inverts the sense of the fit. Instead of f(x) = y the formula will fit f(y) = x.

fit(impl_lang => "perl")

Sets the programming language in which the formula will be implemented. Currently supported languages are "C", "coderef" and the default, "perl".

When impl_lang => "coderef" is specified, a code reference is returned instead which may be used immediately by your perl script:

    my($max_dev, $avg_dev, $x2y) = fit(xydata => \@xy, impl_lang => "coderef");

    my $y = $x2y->(42);

More implementation languages will be supported in the future.

fit(impl_name => "x2y")

Sets the name of the function implementing the formula. The default is "x2y". Has no effect when used with impl_lang => "coderef".

    my($max_dev, $avg_dev, $src) = fit(xydata => \@xy, impl_name => "converto");

    print "$src\n";

    sub converto {
        my($x) = @_;
        my $y = -5340.93059104837 + 249.23009968947 * $x + -3.87745746448 * $x**2 + 0.02114780993 * $x**3;
        return $y;
    }

fit(bounds_check => 1)

When set, the implementation will include logic for checking whether the input is out-of-bounds, per the highest and lowest x points in the data used to fit the formula. For implementation languages which support exceptions, an exception will be thrown. For others (like C), -1.0 will be returned to indicate the error.

For instance, if the highest x in $xydata is 83.8 and the lowest x is 60.8:

    my($max_dev, $avg_dev, $src) = fit(xydata => \@xy, bounds_check => 1);

    print "$src\n";

    sub x2y {
        my($x) = @_;
        die "x out of bounds (high)" if ($x > 83.80000000000);
        die "x out of bounds (low)"  if ($x < 60.80000000000);
        my $y = -5340.93059104837 + 249.23009968947 * $x + -3.87745746448 * $x**2 + 0.02114780993 * $x**3;
        return $y;
    }
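
Because the Perl implementation signals out-of-bounds input with die, callers that prefer a soft failure can trap the exception. The sketch below copies the generated x2y from the example above and wraps it with eval; safe_x2y is a hypothetical wrapper added for illustration, not something fit() generates.

```perl
use strict;
use warnings;

# Generated implementation, copied from the example output above.
sub x2y {
    my($x) = @_;
    die "x out of bounds (high)" if ($x > 83.80000000000);
    die "x out of bounds (low)"  if ($x < 60.80000000000);
    my $y = -5340.93059104837 + 249.23009968947 * $x + -3.87745746448 * $x**2 + 0.02114780993 * $x**3;
    return $y;
}

# Hypothetical wrapper: returns undef instead of dying when x is
# outside the range the formula was fitted on.
sub safe_x2y {
    my ($x) = @_;
    my $y = eval { x2y($x) };   # eval yields undef if x2y died
    return $y;
}

for my $x (70, 90) {
    my $y = safe_x2y($x);
    print defined $y ? "x=$x is in bounds\n" : "x=$x is out of bounds\n";
}
# x=70 is in bounds
# x=90 is out of bounds
```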

fit(round_result => 1)

When set, the implementation will round the output to the nearest whole number. When the implementation language is "C" this adds an #include <math.h> directive to the source code, which will have to be compiled against libm -- see man 3 round.

    my($max_dev, $avg_dev, $src) = fit(xydata => \@xy, round_result => 1);

    print "$src\n";

    sub x2y {
        my($x) = @_;
        my $y = -5340.93059104837 + 249.23009968947 * $x + -3.87745746448 * $x**2 + 0.02114780993 * $x**3;
        $y = int($y + 0.5);
        return $y;
    }
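
One detail worth knowing about the Perl output shown above: int() truncates toward zero, so int($y + 0.5) rounds to the nearest whole number only when the result is non-negative. If your fitted formula can yield negative values, a post-processing step along the following lines may help. This is a sketch; both subroutine names are hypothetical, and round_half_away mimics the behaviour of C's round() using the core POSIX module.

```perl
use strict;
use warnings;
use POSIX qw(floor ceil);

# The rounding expression used in the generated Perl shown above.
sub round_generated {
    my ($y) = @_;
    return int($y + 0.5);
}

# Hypothetical alternative: round half away from zero, which also
# handles negative values.
sub round_half_away {
    my ($y) = @_;
    return $y >= 0 ? floor($y + 0.5) : ceil($y - 0.5);
}

print round_generated(2.7),  "\n";   # 3
print round_generated(-2.7), "\n";   # -2 (truncated toward zero)
print round_half_away(2.7),  "\n";   # 3
print round_half_away(-2.7), "\n";   # -3
```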

fit(suppress_includes => 1)

When set and impl_lang => "C", any #include directives which the implementation might need will be suppressed.

VARIABLES

The class variable %STATS_H contains various intermediate values which might be helpful. For instance, $STATS_H{deviation_max_offset_datum} contains the x data point which corresponds to the maximum deviation returned.

The contents of %STATS_H are subject to change and might not be fully documented in future versions. The current fields are:

deviation_max_offset_datum: The x data point corresponding to the returned maximum deviation.
fit_calib_parar: Arrayref of formula parameters as returned by Algorithm::CurveFit after a short fitting attempt used for timing calibration.
fit_calib_time: The number of seconds Algorithm::CurveFit spent in the calibration run.
fit_iter: The iterations parameter passed to Algorithm::CurveFit.
fit_parar: Arrayref of formula parameters as returned by Algorithm::CurveFit.
fit_time: The number of seconds Algorithm::CurveFit actually spent fitting the formula.
impl_exception: The exception thrown when the implementation was used to calculate the deviations, or the empty string if none.
impl_formula: The formula part of the implementation.
impl_source: The implementation source string.
iter_mode: One of "time" or "iter", indicating whether a time limit was used or an iteration count.
xdata: Arrayref of x data points as passed to Algorithm::CurveFit.
ydata: Arrayref of y data points as passed to Algorithm::CurveFit.

CAVEATS

  • Only simple polynomial functions are supported. Sometimes you need something else. Use Algorithm::CurveFit for such cases.

  • If xydata is very large, iterating over it to calculate deviations can take more time than permitted by time_limit.

  • The dangers of overfitting are real! https://en.wikipedia.org/wiki/Overfitting

  • Using too many terms can dramatically reduce the accuracy of the fitted formula.

  • Sometimes calling Algorithm::CurveFit with a ten-term polynomial causes it to hang.

TO DO

  • Support more programming languages for formula implementation: R, MATLAB, Python

  • Calculate the actual term sigfigs and set precision appropriately in the formula implementation instead of just "%.11f".

  • Support trying a range of terms and returning whatever gives the best fit.

  • Support piecewise output formulas.

  • Work around Algorithm::CurveFit's occasional hang problem when using ten-term polynomials.

SEE ALSO

Algorithm::CurveFit

curvefit