smlar

smlar: Effective similarity search

Overview

ID    Extension  Package  Version  Category  License     Language
1850  smlar      smlar    1.0      RAG       PostgreSQL  C

Attribute  Has Binary  Has Library  Need Load  Has DDL  Relocatable  Trusted
--s-d-r    No          Yes          No         Yes      yes          no

Relationships

See Also: pg_similarity, fuzzystrmatch, pg_trgm, intarray, vector, pg_bigm, unaccent, vchord

Note: the PostgreSQL 18 build break is fixed in the fork at https://github.com/Vonng/smlar

Packages

Type  Repo    Version  PG Major Compatibility  Package Pattern      Dependencies
EXT   PIGSTY  1.0      18, 17, 16, 15, 14      smlar                -
RPM   PIGSTY  1.0      18, 17, 16, 15, 14      smlar_$v             -
DEB   PIGSTY  1.0      18, 17, 16, 15, 14      postgresql-$v-smlar  -

Linux / PG     PG18        PG17        PG16        PG15        PG14
el8.x86_64     PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
el8.aarch64    PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
el9.x86_64     PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
el9.aarch64    PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
el10.x86_64    PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
el10.aarch64   PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
d12.x86_64     PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
d12.aarch64    PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
d13.x86_64     PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
d13.aarch64    PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
u22.x86_64     PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
u22.aarch64    PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
u24.x86_64     PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0
u24.aarch64    PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0  PIGSTY 1.0

Source

pig build pkg smlar;		# build rpm/deb

Install

Make sure the PGDG and PIGSTY repositories are available:

pig repo add pgsql -u   # add both repos and update the cache

Install this extension with pig:

pig install smlar;		# install via package name, for the active PG version

pig install smlar -v 18;   # install for PG 18
pig install smlar -v 17;   # install for PG 17
pig install smlar -v 16;   # install for PG 16
pig install smlar -v 15;   # install for PG 15
pig install smlar -v 14;   # install for PG 14

Create this extension with:

CREATE EXTENSION smlar;
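
To confirm that the extension was registered in the current database, a quick check against the standard pg_extension catalog can help. This is only a sketch; the version string reported depends on the package that was installed.

```sql
-- List the installed smlar extension and its version, if present.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'smlar';
```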

Usage

smlar: Effective similarity search for PostgreSQL arrays. Source: README

The smlar extension provides effective similarity search on PostgreSQL arrays using configurable similarity formulas, GiST and GIN index support, and TF/IDF weighting.


Functions

float4 smlar(anyarray, anyarray)

Computes the similarity of two arrays. Both arrays must be of the same type.

float4 smlar(anyarray, anyarray, bool useIntersect)

Computes the similarity of two arrays of a composite type. The composite type should look like:

CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);

When useIntersect is true, only the elements in the intersection of the two arrays are used in the denominator.
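
A minimal sketch of the weighted form, assuming a hypothetical composite type named weighted_tag with fields tag and weight (these names are illustrative and not part of the extension):

```sql
-- Hypothetical composite type: an element plus its FLOAT4 weight.
CREATE TYPE weighted_tag AS (tag text, weight float4);

-- Compare two weighted arrays; the third argument (useIntersect = true)
-- asks smlar to use only the shared elements in the denominator.
SELECT smlar(
    ARRAY[ROW('postgres', 0.9)::weighted_tag, ROW('index', 0.4)::weighted_tag],
    ARRAY[ROW('postgres', 0.8)::weighted_tag, ROW('array', 0.5)::weighted_tag],
    true
);
```

Passing false as the third argument instead lets the two variants be compared on the same data.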

float4 smlar(anyarray a, anyarray b, text formula)

Computes the similarity of two arrays using the given formula. The following variables are predefined in the formula:

  • N.i – number of common elements in both arrays (intersection)
  • N.a – number of unique elements in first array
  • N.b – number of unique elements in second array
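
The variables above can be combined into other set-similarity measures. The sketch below assumes the formula string accepts ordinary arithmetic and that the variables behave as floating-point numbers; neither assumption is confirmed by this page. For the sample arrays, N.i = 2, N.a = 3 and N.b = 3.

```sql
-- Jaccard-style measure: shared elements divided by the size of the union.
SELECT smlar('{1,4,6}'::int[], '{5,4,6}'::int[], 'N.i / (N.a + N.b - N.i)');

-- Overlap-style measure: just the number of shared elements.
SELECT smlar('{1,4,6}'::int[], '{5,4,6}'::int[], 'N.i');
```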

Example:

SELECT smlar('{1,4,6}'::int[], '{5,4,6}');
SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)');
-- These two calls are equivalent.

anyarray % anyarray

Returns true if the similarity of the two arrays is greater than the limit set by smlar.threshold.
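
As an illustrative sketch, the operator can be tried in a session after setting the threshold. With the default cosine formula the sample arrays share 2 of 3 elements each, so the similarity works out to 2 / sqrt(3 * 3), roughly 0.667, which is above the 0.6 limit used here.

```sql
-- Session-level threshold used by the % operator.
SET smlar.threshold = 0.6;

-- Check whether the two arrays count as similar under that threshold.
SELECT '{1,4,6}'::int[] % '{5,4,6}'::int[] AS is_similar;
```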

text[] tsvector2textarray(tsvector)

Converts a tsvector value to a text array.
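
One possible use, sketched here with the built-in to_tsvector function and the 'english' configuration, is to compare two pieces of text by their lexemes:

```sql
-- Turn a document into an array of lexemes.
SELECT tsvector2textarray(to_tsvector('english', 'The quick brown fox'));

-- Compare two short texts by the lexemes they share.
SELECT smlar(
    tsvector2textarray(to_tsvector('english', 'quick brown fox')),
    tsvector2textarray(to_tsvector('english', 'lazy brown dog'))
);
```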

anyarray array_unique(anyarray)

Sorts the array and removes duplicate elements.
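
A short sketch of a call on an array that contains duplicates in arbitrary order:

```sql
-- Input has repeated, unsorted elements; the result should be sorted and free of duplicates.
SELECT array_unique('{3,1,2,3,1}'::int[]);
```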

float4 inarray(anyarray, anyelement)

Returns 0 if the second argument is not present in the first one, and 1.0 otherwise.

float4 inarray(anyarray, anyelement, float4, float4)

Returns the fourth argument if the second argument is not present in the first one, and the third argument otherwise.
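
Both forms can be sketched as follows. The explicit float4 casts are a precaution added for this example rather than a documented requirement.

```sql
-- Two-argument form: 1.0 when the element is present, 0 when it is not.
SELECT inarray('{1,2,3}'::int[], 2) AS present,
       inarray('{1,2,3}'::int[], 7) AS absent;

-- Four-argument form: custom values for the present and absent cases.
SELECT inarray('{1,2,3}'::int[], 2, 0.9::float4, 0.1::float4) AS present,
       inarray('{1,2,3}'::int[], 7, 0.9::float4, 0.1::float4) AS absent;
```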


GUC Configuration Variables

smlar.threshold  FLOAT

Arrays whose similarity is lower than this threshold are not considered similar by the % operator.

smlar.persistent_cache  BOOL

When enabled, the cache of global statistics is stored in transaction-independent memory.

smlar.type  STRING

Type of similarity formula: cosine (default), tfidf, overlap.
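
As a sketch, the formula type can be switched for the current session and then restored, which makes it easy to compare results on the same data:

```sql
-- Use the overlap formula for this session.
SET smlar.type = 'overlap';
SELECT smlar('{1,4,6}'::int[], '{5,4,6}'::int[]);

-- Return to the configured default (cosine unless changed elsewhere).
RESET smlar.type;
SELECT smlar('{1,4,6}'::int[], '{5,4,6}'::int[]);
```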

smlar.stattable  STRING

Name of the table that stores set-wide statistics. The table should be defined as:

CREATE TABLE table_name (
    value   data_type UNIQUE,                 -- element type of the arrays being compared
    ndoc    int4 NOT NULL CHECK (ndoc > 0)    -- bigint may be used instead of int4
);

The row whose value is NULL holds the total number of documents. Used only for smlar.type = 'tfidf'.
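
One way such a table might be populated is sketched below. The tables articles and articles_stat and their columns are hypothetical names chosen for this example; only the value/ndoc layout comes from the definition above.

```sql
-- Hypothetical source table: one row per document, with an array of tags.
CREATE TABLE articles (id serial PRIMARY KEY, tags text[]);

-- Statistics table in the layout expected by smlar.stattable.
CREATE TABLE articles_stat (
    value text UNIQUE,
    ndoc  int4 NOT NULL CHECK (ndoc > 0)
);

-- Document frequency: in how many articles each tag appears.
INSERT INTO articles_stat (value, ndoc)
SELECT u.tag, count(DISTINCT a.id)
FROM articles AS a
CROSS JOIN LATERAL unnest(a.tags) AS u(tag)
WHERE u.tag IS NOT NULL
GROUP BY u.tag;

-- Total number of documents, stored in the row whose value is NULL.
-- The HAVING clause skips the insert when the source table is empty.
INSERT INTO articles_stat (value, ndoc)
SELECT NULL, count(*) FROM articles
HAVING count(*) > 0;

-- Point smlar at the statistics table and switch to TF/IDF.
SET smlar.stattable = 'articles_stat';
SET smlar.type = 'tfidf';
```

The statistics are a snapshot, so in this sketch they would need to be rebuilt or maintained as the source table changes.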

smlar.tf_method  STRING

Calculation method for term frequency. Values:

  • "n" – simple counting of entries (default)
  • "log" – 1 + log(n)
  • "const" – TF is equal to 1

Used only for smlar.type = 'tfidf'.

smlar.idf_plus_one  BOOL

If false (default), IDF is calculated as log(d/df); if true, as log(1 + d/df). Used only for smlar.type = 'tfidf'.

It is highly recommended to add the following to postgresql.conf:

smlar.threshold = 0.6  # or any other value > 0 and < 1

GiST/GIN Index Support

The % and && operators are supported by GiST and GIN indexes for many array types:

Array Type      GIN operator class      GiST operator class
bit[]           _bit_sml_ops
bytea[]         _bytea_sml_ops          _bytea_sml_ops
char[]          _char_sml_ops           _char_sml_ops
cidr[]          _cidr_sml_ops           _cidr_sml_ops
date[]          _date_sml_ops           _date_sml_ops
float4[]        _float4_sml_ops         _float4_sml_ops
float8[]        _float8_sml_ops         _float8_sml_ops
inet[]          _inet_sml_ops           _inet_sml_ops
int2[]          _int2_sml_ops           _int2_sml_ops
int4[]          _int4_sml_ops           _int4_sml_ops
int8[]          _int8_sml_ops           _int8_sml_ops
interval[]      _interval_sml_ops       _interval_sml_ops
macaddr[]       _macaddr_sml_ops        _macaddr_sml_ops
money[]         _money_sml_ops
numeric[]       _numeric_sml_ops        _numeric_sml_ops
oid[]           _oid_sml_ops            _oid_sml_ops
text[]          _text_sml_ops           _text_sml_ops
time[]          _time_sml_ops           _time_sml_ops
timestamp[]     _timestamp_sml_ops      _timestamp_sml_ops
timestamptz[]   _timestamptz_sml_ops    _timestamptz_sml_ops
timetz[]        _timetz_sml_ops         _timetz_sml_ops
varbit[]        _varbit_sml_ops
varchar[]       _varchar_sml_ops        _varchar_sml_ops
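
A minimal sketch of indexing a text[] column with one of the operator classes listed above. The table docs and its columns are hypothetical; in practice one of the two index types would normally be chosen rather than both.

```sql
-- Hypothetical table with an array column to search.
CREATE TABLE docs (id serial PRIMARY KEY, tags text[]);

-- GIN index using the smlar operator class for text[].
CREATE INDEX docs_tags_gin ON docs USING gin (tags _text_sml_ops);

-- Alternative: GiST index with the corresponding operator class.
CREATE INDEX docs_tags_gist ON docs USING gist (tags _text_sml_ops);

-- Find rows similar to a query array (subject to smlar.threshold)
-- and rank them by similarity.
SELECT id, smlar(tags, '{postgres,index}'::text[]) AS sml
FROM docs
WHERE tags % '{postgres,index}'::text[]
ORDER BY sml DESC
LIMIT 10;
```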