[Mimedefang] Spamassassin 3.1 and improved bayes/sql.

Wed Aug 17 15:27:53 EDT 2005

David F. Skoll wrote:
> Matthew Schumacher wrote:
> 
> Have you thought about simply doing:
> 
>      SELECT * FROM bayes WHERE token in ('tok1', 'tok2', ..., 'tokN')
> 
> It seems to me that should be just as fast, and not rely on PostgreSQL
> features or stored procedures.
> 
> You have to be careful with messages that have extremely large numbers
> of tokens; you might need to split the query into chunks of 1000
> tokens each or something like that.
> 

Yes, we tried that.  I attached another version of the proc that gets
rid of the looping altogether, but believe it or not, it's slower.
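
For anyone following along, the chunked IN query David describes might
look roughly like the sketch below. It is Python against an in-memory
SQLite database purely so it runs standalone. The table layout (token,
spam_count, ham_count), the function name fetch_tokens and the chunk size
are assumptions for illustration, not the SpamAssassin schema or code.

```python
import sqlite3

def fetch_tokens(conn, tokens, chunk_size=500):
    """Look up tokens with IN (...) queries, at most chunk_size per query.

    The post suggests chunks of about 1000; 500 is used here only because
    older SQLite builds cap bound parameters at 999.
    """
    rows = {}
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # One "?" placeholder per token, so values are bound, not interpolated.
        placeholders = ",".join("?" * len(chunk))
        sql = ("SELECT token, spam_count, ham_count FROM bayes "
               "WHERE token IN (%s)" % placeholders)
        for token, spam, ham in conn.execute(sql, chunk):
            rows[token] = (spam, ham)
    return rows

# Hypothetical test data: 2500 known tokens.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bayes (token TEXT PRIMARY KEY, "
             "spam_count INTEGER, ham_count INTEGER)")
conn.executemany("INSERT INTO bayes VALUES (?, ?, ?)",
                 [("tok%d" % i, i, 2 * i) for i in range(2500)])
conn.commit()

# A "message" with 1500 tokens, some of which are not in the table.
wanted = ["tok%d" % i for i in range(0, 3000, 2)]
found = fetch_tokens(conn, wanted)
```

Tokens that are not in the table simply do not appear in the result, so
the caller has to treat a missing key as "never seen".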

The reason we rely on procs is that the SA code doesn't have
transactions yet.  By passing the tokens in as an array we get
transactions, because pgsql treats each run of a proc as a transaction.
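
The effect should be much the same as wrapping all of a message's token
updates in one explicit transaction, so there is one commit per message
instead of one per token. Here is a rough sketch of that idea, again in
Python with SQLite so it is self-contained. The learn_tokens function and
the table layout are hypothetical, and this is not the attached proc.

```python
import sqlite3

def learn_tokens(conn, tokens, spam_delta=1):
    """Update every token of one message inside a single transaction."""
    with conn:  # commits once on success, rolls back if anything raises
        for token in tokens:
            cur = conn.execute(
                "UPDATE bayes SET spam_count = spam_count + ? "
                "WHERE token = ?", (spam_delta, token))
            if cur.rowcount == 0:
                # Token not seen before: create its row.
                conn.execute(
                    "INSERT INTO bayes (token, spam_count, ham_count) "
                    "VALUES (?, ?, 0)", (token, spam_delta))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bayes (token TEXT PRIMARY KEY, "
             "spam_count INTEGER, ham_count INTEGER)")

learn_tokens(conn, ["viagra", "free", "viagra"])
counts = dict(conn.execute("SELECT token, spam_count FROM bayes"))
```

If any statement in the loop fails, the whole message's updates are
rolled back together, which is the property the proc gives us on pgsql.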

> 
>>Here is the new benchmark:
>>http://wiki.apache.org/spamassassin/BayesBenchmarkResults
> 
> 
> Those results are extremely surprising.  Our CanIt benchmarks show
> Berkeley DB outperforming PostgreSQL by a factor of 6 to 10, but your
> benchmarks show them about equal.  Something is funny there...  I
> wonder if it could be that CanIt never locks the BDB files, whereas
> SpamAssassin does?  If that's the case, then there's still tremendous
> room for improvement on the BDB side.

Yeah, the SA bdb code is a locking headache.  I have never really coded
against bdb so I have no idea if it's good, bad, or otherwise.

> 
> Also, I don't think the "fsync=false" column should even be presented.
> Nobody who cares about his/her data runs PostgreSQL like that, so the
> timings in that column are unachievable in real-world situations.

I agree that it's useless in the real world, but it is interesting to
see how much time the sync takes.  Before grouping tokens into procs
(read: transactions) sync was 25x slower.  Now that they are pretty close
speed-wise, that tells you that we are not spending too much time doing a
sync.

> 
> Ironically, just as SpamAssassin is making strides with a centralized
> SQL database, in CanIt, we've revised our thinking and started moving
> to distributed BDB databases. :-)

You mean having a separate bdb for each key/val pair?  I proposed this
to the SA people for the AWL and they shot it down.  Like I said, I'm
ignorant when it comes to bdb.

As much as I love pgsql and think it's light years beyond mysql in
features, stability, flexibility, and even performance (when doing
complex queries against large tables), I am now testing mysql for my
bayes store.  Mysql has one good thing going for it: raw speed on simple
queries, which is really the only requirement for bayes.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: procs.sql
Type: text/x-sql
Size: 2002 bytes
Desc: not available
URL: <https://lists.mimedefang.org/pipermail/mimedefang_lists.mimedefang.org/attachments/20050817/2613b8aa/attachment.sql>