slugify() for postgres (almost)
- 
      Comments:
      
 - here.
 
A recent discussion in #django suggested “what we need is a PG slugify function”.
The actual algorithm in Django for this is fairly simple, and easy to follow. Shouldn’t be too hard to write it in SQL.
Function slugify(value, allow_unicode=False).
- Convert to ASCII if 
allow_unicodeis false - Remove characters that aren’t alphanum, underscores, hyphens
 - Strip leading/trailing whitespace
 - Convert to lowercase
 - Convert spaces to hyphens
 - Remove repeated hyphens
 
(As an aside, the comment in the django function is slightly misleading: if you followed the algorithm there, you’d get a different result with respect to leading trailing whitespace. I shall submit a PR).
We can write an SQL function that uses the Postgres unaccent extension to get pretty close:
CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$
  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  ),
  "remove_chars" AS (
    SELECT regexp_replace("value", E'[^\\w\\s-]', '', 'gi') AS "value"
    FROM "normalized"
  ),
  "lowercase" AS (
    SELECT lower("value") AS "value"
    FROM "remove_chars"
  ),
  "trimmed" AS (
    SELECT trim("value") AS "value"
    FROM "lowercase"
  ),
  "hyphenated" AS (
    SELECT regexp_replace("value", E'[-\\s]+', '-', 'gi') AS "value"
    FROM "trimmed"
  )
  SELECT "value" FROM "hyphenated";
$$ LANGUAGE SQL STRICT IMMUTABLE;
I’ve used a CTE to get each step as a seperate query: you can do it with just two levels if you don’t mind looking at nested function calls:
CREATE OR REPLACE FUNCTION slugify("value" TEXT, "allow_unicode" BOOLEAN)
RETURNS TEXT AS $$
  WITH "normalized" AS (
    SELECT CASE
      WHEN "allow_unicode" THEN "value"
      ELSE unaccent("value")
    END AS "value"
  )
  SELECT regexp_replace(
    trim(
      lower(
        regexp_replace(
          "value",
          E'[^\\w\\s-]',
          '',
          'gi'
        )
      )
    ),
    E'[-\\s]+', '-', 'gi'
  ) FROM "normalized";
$$ LANGUAGE SQL STRICT IMMUTABLE;
To get the default value for the second argument, we can have an overloaded version with only a single argument:
CREATE OR REPLACE FUNCTION slugify(TEXT)
RETURNS TEXT AS 'SELECT slugify($1, false)' LANGUAGE SQL IMMUTABLE STRICT;
Now for some tests. I’ve been using pgTAP lately, so here’s some tests using that:
BEGIN;
SELECT plan(7);
SELECT is(slugify('Hello, World!', false), 'hello-world');
SELECT is(slugify('Héllø, Wørld!', false), 'hello-world');
SELECT is(slugify('spam & eggs', false), 'spam-eggs');
SELECT is(slugify('spam & ıçüş', true), 'spam-ıçüş');
SELECT is(slugify('foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('    foo ıç bar', true), 'foo-ıç-bar');
SELECT is(slugify('你好', true), '你好');
SELECT * FROM finish();
ROLLBACK;
And we get one failing test:
=# SELECT is(slugify('你好', true), '你好');
          is
──────────────────────
 not ok 7            ↵
 # Failed test 7     ↵
 #         have:     ↵
 #         want: 你好
(1 row)
Time: 2.004 ms
It seems there is no way to get the equivalent to the python re.U flag on a postgres regular expression function, so that is as close as we can get.