NAME

AI::Embedding - Perl module for working with text embeddings using various APIs

VERSION

Version 1.11

SYNOPSIS

use AI::Embedding;

my $embedding = AI::Embedding->new(
    api => 'OpenAI',
    key => 'your-api-key'
);

my $csv_embedding  = $embedding->embedding('Some sample text');
my $test_embedding = $embedding->test_embedding('Some sample text');
my @raw_embedding  = $embedding->raw_embedding('Some sample text');

my $cmp = $embedding->comparator($csv_embedding2);

my $similarity = $cmp->($csv_embedding1);
my $similarity_with_other_embedding = $embedding->compare($csv_embedding1, $csv_embedding2);

DESCRIPTION

The AI::Embedding module provides an interface for working with text embeddings using various APIs. It currently supports the OpenAI Embeddings API. This module allows you to generate embeddings for text, compare embeddings, and calculate cosine similarity between embeddings.

Embeddings allow the meaning of passages of text to be compared for similarity. This is more natural and useful to humans than traditional keyword-based comparison.

An Embedding is a multi-dimensional vector representing the meaning of a piece of text. The Embedding vector is created by an AI model. The default model (OpenAI's text-embedding-ada-002) produces a 1536-dimensional vector. The resulting vector can be obtained as a Perl array or as a comma-separated (CSV) string. As the Embedding will typically be used as a single homogeneous unit, having it as a CSV string is usually more convenient, and that form is suitable for storing in a TEXT field of a database.
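
Converting between the array form and the string form is plain string handling. The following is a minimal sketch, assuming the CSV form is simply the vector elements joined with commas and no surrounding whitespace. The three-element vector is made up for illustration; a real text-embedding-ada-002 vector has 1536 elements.

```perl
use strict;
use warnings;

# Hypothetical, shortened embedding; real ones have 1536 elements.
my @vector = (0.0123, -0.4567, 0.891);

# Array to CSV string, e.g. for storing in a TEXT column.
my $csv = join ',', @vector;

# CSV string back to an array of numbers.
my @restored = split /,/, $csv;

printf "%d elements: %s\n", scalar @restored, $csv;
```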

Comparator

Embeddings are used to compare similarity of meaning between two passages of text. A typical use case is to store a number of pieces of text (e.g. articles or blogs) in a database and compare each one to some user-supplied search text. AI::Embedding provides a compare method to compare two Embeddings.

Alternatively, the comparator method can be called with one Embedding. The comparator returns a reference to a method that takes a single Embedding to be compared to the Embedding from which the Comparator was created.

When comparing multiple Embeddings to the same Embedding (such as search text) it is faster to use a comparator.
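
One plausible reason a comparator can be faster is that work on the fixed Embedding (parsing the CSV string and computing its vector length) is done once rather than on every comparison. The sketch below illustrates that idea with a closure in plain Perl. It is an assumption about the general technique, not the module's actual implementation; make_comparator is a hypothetical helper, and both vectors are assumed to have the same number of elements.

```perl
use strict;
use warnings;

# Build a closure around one fixed embedding (CSV string).
# Hypothetical helper, not part of the AI::Embedding API.
sub make_comparator {
    my ($fixed_csv) = @_;
    my @fixed = split /,/, $fixed_csv;

    # Work done once: the norm (length) of the fixed vector.
    my $fixed_norm = 0;
    $fixed_norm += $_ * $_ for @fixed;
    $fixed_norm = sqrt($fixed_norm);

    return sub {
        my ($other_csv) = @_;
        my @other = split /,/, $other_csv;

        # Work done per call: dot product and the other vector's norm.
        my ($dot, $other_norm) = (0, 0);
        for my $i (0 .. $#fixed) {
            $dot        += $fixed[$i] * $other[$i];
            $other_norm += $other[$i] * $other[$i];
        }
        $other_norm = sqrt($other_norm);

        # Avoid division by zero for all-zero vectors.
        return 0 unless $fixed_norm && $other_norm;
        return $dot / ($fixed_norm * $other_norm);
    };
}

my $cmp = make_comparator('1,0,0');
printf "%.3f\n", $cmp->($_) for '1,0,0', '0,1,0', '-1,0,0';
```

The three calls compare the fixed vector with an identical, an orthogonal and an opposite vector in turn.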

CONSTRUCTOR

new

my $embedding = AI::Embedding->new(
    api         => 'OpenAI',
    key         => 'your-api-key',
    model       => 'text-embedding-ada-002',
);

Creates a new AI::Embedding object. The 'key' parameter, the API key issued by the service provider, is required.

Parameters:

  • key - Required. The API key.

  • api - The API to use. Currently only 'OpenAI' is supported and this is the default.

  • model - The language model to use. Defaults to text-embedding-ada-002 - see OpenAI docs

METHODS

success

Returns true if the last method call was successful.

error

Returns the last error message, or an empty string if success returned true.

embedding

my $csv_embedding = $embedding->embedding('Some text passage', [$verbose]);

Generates an embedding for the given text and returns it as a comma-separated string. The text to generate the embedding for is the only required parameter.

Returns a (rather long) string that can be stored in a TEXT database field.

If the method call fails it sets the "error" message and returns undef. If the optional verbose parameter is true, the complete HTTP::Tiny response object is also returned to aid with debugging issues when using this module.

raw_embedding

my @raw_embedding = $embedding->raw_embedding('Some text passage', [$verbose]);

Generates an embedding for the given text and returns it as an array. The text to generate the embedding for is the only required parameter.

It is not normally necessary to use this method as the Embedding will almost always be used as a single homogeneous unit.

If the method call fails it sets the "error" message and returns undef. If the optional verbose parameter is true, the complete HTTP::Tiny response object is also returned to aid with debugging issues when using this module.

test_embedding

my $test_embedding = $embedding->test_embedding('Some text passage', $dimensions);

Used for testing code without making a chargeable call to the API.

Provides a CSV string of the same size and format as embedding but with meaningless random data.

Returns a random embedding. Both parameters are optional. If a text string is provided, the same random embedding is returned every time for that text; otherwise the embedding is random and different on every call. The dimensions parameter controls the number of elements in the returned CSV string. If omitted, the string will have the text-embedding-ada-002 default of 1536 elements.
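
To show how a repeatable "random" embedding can be derived from text, here is a minimal sketch that seeds Perl's random number generator from a digest of the text. This is an assumption about one possible approach and not the module's actual implementation; fake_embedding is a hypothetical name, and the value range and number formatting are arbitrary choices.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical stand-in for a test embedding generator.
# Not the module's actual implementation.
sub fake_embedding {
    my ($text, $dimensions) = @_;
    $dimensions //= 1536;

    # With text: seed from a digest of the text so the output repeats.
    # Without text: let Perl pick a fresh seed.
    # Note that srand changes the random state of the whole program.
    if (defined $text) {
        srand( unpack('N', md5($text)) );
    }
    else {
        srand();
    }

    # Values in the range [-1, 1), formatted to six decimal places.
    return join ',', map { sprintf '%.6f', rand(2) - 1 } 1 .. $dimensions;
}

my $first  = fake_embedding('Some text passage', 8);
my $second = fake_embedding('Some text passage', 8);
print $first eq $second ? "repeatable\n" : "different\n";
```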

comparator

$embedding->comparator($csv_embedding2);

Sets a vector as a comparator for future comparisons and returns a reference to a method for using the comparator.

The comparator method takes a single parameter, the comma-separated Embedding string to use as the comparator.

The following two are functionally equivalent. However, where multiple Embeddings are to be compared to a single Embedding, using a Comparator is significantly faster.

my $similarity = $embedding->compare($csv_embedding1, $csv_embedding2);


my $cmp = $embedding->comparator($csv_embedding2);
my $similarity = $cmp->($csv_embedding1);

See "Comparator"

The returned method reference returns the cosine similarity between the Embedding used to call the comparator method and the Embedding supplied to the method reference. See compare for an explanation of the cosine similarity.

compare

my $similarity_with_other_embedding = $embedding->compare($csv_embedding1, $csv_embedding2);

Compares two embeddings and returns the cosine similarity between them. The compare method takes two parameters: $csv_embedding1 and $csv_embedding2 (both comma-separated embedding strings).

Returns the cosine similarity as a floating-point number between -1 and 1, where 1 represents identical embeddings, 0 represents no similarity, and -1 represents opposite embeddings.
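
Cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below computes it in plain Perl for two CSV strings, to make the three reference values above concrete. cosine_similarity is a hypothetical helper written for illustration, not the module's own code, and it assumes non-zero vectors of equal length.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Cosine similarity of two CSV embedding strings of equal length.
# Hypothetical helper for illustration, not the module's own code.
sub cosine_similarity {
    my ($csv_x, $csv_y) = @_;
    my @x = split /,/, $csv_x;
    my @y = split /,/, $csv_y;

    my $dot    = sum( map { $x[$_] * $y[$_] } 0 .. $#x );
    my $norm_x = sqrt( sum( map { $_ * $_ } @x ) );
    my $norm_y = sqrt( sum( map { $_ * $_ } @y ) );

    return $dot / ($norm_x * $norm_y);
}

printf "%.2f\n", cosine_similarity('1,2,3', '1,2,3');     # identical
printf "%.2f\n", cosine_similarity('1,0',   '0,1');       # unrelated
printf "%.2f\n", cosine_similarity('1,2,3', '-1,-2,-3');  # opposite
```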

The absolute number is not usually relevant for text comparison. It is usually sufficient to rank the comparison results from high to low, so that they run from the best match to the worst match.
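
Ranking by similarity needs nothing more than a descending numeric sort. In this sketch the titles and scores are made up for illustration; in practice each score would come from compare or from a comparator.

```perl
use strict;
use warnings;

# Made-up similarity scores keyed by article title, for illustration only.
my %score = (
    'Growing tomatoes' => 0.83,
    'Perl closures'    => 0.71,
    'Train timetables' => 0.76,
);

# Highest similarity first, so the best match is at the top.
my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;

printf "%.2f  %s\n", $score{$_}, $_ for @ranked;
```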

SEE ALSO

https://openai.com - OpenAI official website

AUTHOR

Ian Boddison <ian at boddison.com>

BUGS

Please report any bugs or feature requests to bug-ai-embedding at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=bug-ai-embedding. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc AI::Embedding

You can also look for information at:

ACKNOWLEDGEMENTS

Thanks to the help and support provided by members of Perl Monks https://perlmonks.org/.

Especially Ken Cotterill (KCOTT) for assistance with unit tests and Hugo van der Sanden (HVDS) for suggesting the current comparator implementation.

COPYRIGHT AND LICENSE

This software is copyright (c) 2023 by Ian Boddison.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.