Tiny Tiny RSS: Community

[SOLVED] Af_Zz_Img_Phash, Postgres, and count-bits

I was trying to track down a hamming function w/o compiling and installing the count-bits project (not much fun on windows where I have my dev instance :/). The following works and is quite performant:

CREATE OR REPLACE FUNCTION bit_count(value bigint) RETURNS integer AS $$
   SELECT length(replace(value::bit(64)::text,'0','')); $$
LANGUAGE SQL IMMUTABLE STRICT;

I ran bit_count on ~1000 phash-es and it completed in a millisecond or 2. Since bit_count is called on article rendering, there wouldn’t be any perceptual difference between a compiled extension and the SQL function.

The current method isn’t broken, so not sure this is worth a PR. But it is something that I am going to maintain on my end. It would at least lower the bar to entry for the plugin on postgres installs.

Thoughts?

R

Just timed this in psql:

tt-rss=> select sum(bit_count(x)) from generate_series(1,1000000) x;
   sum
---------
 9884999
(1 row)

Time: 2174.508 ms (00:02.175)

Or 460 computations / ms. And this in MariaDB:

select sum(bit_count(0 ^ phash)), count(*) from ttrss_plugin_img_phash_urls;
+---------------------------+----------+
| sum(bit_count(0 ^ phash)) | count(*) |
+---------------------------+----------+
|                     32543 |     1163 |
+---------------------------+----------+

SHOW PROFILES;
+----------+------------+---------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                           |
+----------+------------+---------------------------------------------------------------------------------+
|        3 | 0.00235660 | select sum(bit_count(0 ^ phash)), count(*) from ttrss_plugin_img_phash_urls |
+----------+------------+---------------------------------------------------------------------------------+

Or 494 computations / ms (the time drops if I run it again b/c caching so MariaDB is much faster than this shows).

Not a perfect comparison but good enough to show that the SQL function isn’t going to ruin the user experience.

well…

ttrss_db=> explain analyze select count(*) from ttrss_plugin_img_phash_urls where bit_count(6362624334959546233 # phash) < 10;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1252.87..1252.88 rows=1 width=8) (actual time=149.363..149.364 rows=1 loops=1)
   ->  Seq Scan on ttrss_plugin_img_phash_urls  (cost=0.00..1230.32 rows=9022 width=0) (actual time=0.089..149.355 rows=1 loops=1)
         Filter: (length(replace(((('6362624334959546233'::bigint # phash))::bit(64))::text, '0'::text, ''::text)) < 10)
         Rows Removed by Filter: 27126
 Planning time: 0.422 ms
 Execution time: 149.463 ms
(6 rows)

vs

ttrss_db=> explain analyze select count(*) from ttrss_plugin_img_phash_urls where unique_1bits(6362624334959546233, phash) < 10;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=914.54..914.55 rows=1 width=8) (actual time=7.475..7.475 rows=1 loops=1)
   ->  Seq Scan on ttrss_plugin_img_phash_urls  (cost=0.00..891.99 rows=9022 width=0) (actual time=0.033..7.453 rows=1 loops=1)
         Filter: (unique_1bits('6362624334959546233'::bigint, phash) < 10)
         Rows Removed by Filter: 27126
 Planning time: 0.108 ms
 Execution time: 7.535 ms
(6 rows)

this 150ms doesn’t sound like a lot but it’s for every image URL in every rendered article. 21 times slower.

i suppose this could be implemented as a not-recommended fallback enabled by a config.php knob (with a performance warning attached).

it would be probably easier to check if unique_1bits is not there and switch to this function while displaying a performance warning in the plugin UI.

e: then again maybe it should be explicitly enabled - imagine migrating to a different server and suddenly everything is lagging and you don’t know why (who would bother looking in some random plugin options in the prefs UI)

Or, if a quicker fallback is desirable; I also tried this solution:

abs('$phash' - phash)

: )

it works and flagged similar images that were of different size and had different visual compression noise. But it worked, so the hashing alone was sufficient to flag them.

So that might be a fast (but limited) fallback option.

tldr: no, you can’t just do that and get reliable results.

tldr: no, you can’t just do that and get reliable results.

right, it wouldn’t be a hamming distance at all; it would just be a direct phash comparison. So if the hashes match then it would flag as similar. But it works for very similar photos. For example, NASA’s social media account often have the same photo and it successfully id’ed them.

e: abs is just a thought as an easy but fast alternative. The sql function is an alternative but slow. Not sure this needs fixing at all; maybe just a comment in the README would be enough to show users how to get around a compile/install.

it’s not really an alternative because it would only, sorta-kinda, work for a subset of images with fitting sequence of bits set

on the other hand that is actually a working alternative even though a slow one, which can be optionally supported