slugify() for postgres (almost)
-
Comments:
- here.
A recent discussion in #django suggested “what we need is a PG slugify function”.
The actual algorithm in Django for this is fairly simple, and easy to follow. Shouldn’t be too hard to write it in SQL.
Function slugify(value, allow_unicode=False)
.
- Convert to ASCII if
allow_unicode
is false - Remove characters that aren’t alphanum, underscores, hyphens
- Strip leading/trailing whitespace
- Convert to lowercase
- Convert spaces to hyphens
- Remove repeated hyphens
(As an aside, the comment in the django function is slightly misleading: if you followed the algorithm there, you’d get a different result with respect to leading trailing whitespace. I shall submit a PR).
We can write an SQL function that uses the Postgres unaccent
extension to get pretty close:
CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$
WITH "normalized" AS (
SELECT CASE
WHEN "allow_unicode" THEN "value"
ELSE unaccent("value")
END AS "value"
),
"remove_chars" AS (
SELECT regexp_replace("value", E'[^\\w\\s-]', '', 'gi') AS "value"
FROM "normalized"
),
"lowercase" AS (
SELECT lower("value") AS "value"
FROM "remove_chars"
),
"trimmed" AS (
SELECT trim("value") AS "value"
FROM "lowercase"
),
"hyphenated" AS (
SELECT regexp_replace("value", E'[-\\s]+', '-', 'gi') AS "value"
FROM "trimmed"
)
SELECT "value" FROM "hyphenated";
$$ LANGUAGE SQL STRICT IMMUTABLE;
I’ve used a CTE to get each step as a seperate query: you can do it with just two levels if you don’t mind looking at nested function calls:
CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$
WITH "normalized" AS (
SELECT CASE
WHEN "allow_unicode" THEN "value"
ELSE unaccent("value")
END AS "value"
)
SELECT regexp_replace(
trim(
lower(
regexp_replace(
"value",
E'[^\\w\\s-]',
'',
'gi'
)
)
),
E'[-\\s]+', '-', 'gi'
) FROM "normalized";
$$ LANGUAGE SQL STRICT IMMUTABLE;
To get the default value for the second argument, we can have an overloaded version with only a single argument:
CREATE OR REPLACE FUNCTION slugify(TEXT)
RETURNS TEXT AS 'SELECT slugify($1, false)' LANGUAGE SQL IMMUTABLE STRICT;
Now for some tests. I’ve been using pgTAP lately, so here’s some tests using that:
BEGIN;
SELECT plan(7);
SELECT is(slugify('Hello, World!', false), 'hello-world');
SELECT is(slugify('Héllø, Wørld!', false), 'hello-world');
SELECT is(slugify('spam & eggs', false), 'spam-eggs');
SELECT is(slugify('spam & ıçüş', true), 'spam-ıçüş');
SELECT is(slugify('foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify(' foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('你好', true), '你好');
SELECT * FROM finish();
ROLLBACK;
And we get one failing test:
=# SELECT is(slugify('你好', true), '你好');
is
──────────────────────
not ok 7 ↵
# Failed test 7 ↵
# have: ↵
# want: 你好
(1 row)
Time: 2.004 ms
It seems there is no way to get the equivalent to the python re.U
flag on a postgres regular expression function, so that is as close as we can get.