[Mimedefang] Spamassassin 3.1 and improved bayes/sql.

Wed Aug 17 14:14:02 EDT 2005

Matthew Schumacher wrote:

> I thought I would mention to the MD users that spamassassin 3.1 which is
> in rc1 has much better bayes/sql support.

Cool.

> I was working with the SA guys and the people on the pgsql performance
> list where Tom Lane came up with a way to pass the tokens as an array
> via a pgsql proc.  The result is grouping all of the tokens from an
> email into a single transaction, which is the difference between pgsql
> being unusable and being almost as fast as mysql.

Have you thought about simply doing:

     SELECT * FROM bayes WHERE token IN ('tok1', 'tok2', ..., 'tokN')

It seems to me that should be just as fast, and it doesn't rely on
PostgreSQL-specific features or stored procedures.

You have to be careful with messages that have extremely large numbers
of tokens; you might need to split the query into chunks of 1000
tokens each or something like that.
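
To make the chunking idea concrete, here is a rough sketch in Python.
It is only an illustration, not what SpamAssassin or CanIt actually
does.  The table name "bayes" and the column "token" come from the
query above; the spam_count/ham_count columns, the function names, and
the use of an in-memory SQLite database as a stand-in for the real
server are all assumptions made so the example runs on its own.

```python
import sqlite3
from typing import Iterator, List, Sequence, Tuple


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_tokens(conn: sqlite3.Connection,
                 tokens: Sequence[str],
                 chunk_size: int = 1000) -> List[Tuple]:
    """Look up tokens with one IN (...) query per chunk.

    Hypothetical helper.  The "?" placeholder is SQLite's style; other
    drivers use a different marker, and each database has its own limit
    on the number of bound parameters, so chunk_size may need tuning.
    """
    rows: List[Tuple] = []
    for chunk in chunked(tokens, chunk_size):
        # Bind every token as a parameter instead of pasting it into
        # the SQL text, so odd characters in tokens cannot break it.
        placeholders = ", ".join("?" for _ in chunk)
        sql = ("SELECT token, spam_count, ham_count "
               "FROM bayes WHERE token IN (" + placeholders + ")")
        rows.extend(conn.execute(sql, list(chunk)).fetchall())
    return rows


def demo() -> List[Tuple]:
    """Build a tiny throwaway table and query it in chunks of 4."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bayes ("
                 "token TEXT PRIMARY KEY, "
                 "spam_count INTEGER, ham_count INTEGER)")
    conn.executemany("INSERT INTO bayes VALUES (?, ?, ?)",
                     [("tok%d" % i, i, 2 * i) for i in range(10)])
    # tok10..tok14 are not in the table and simply return no rows.
    message_tokens = ["tok%d" % i for i in range(5, 15)]
    return sorted(fetch_tokens(conn, message_tokens, chunk_size=4))


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Tokens that are not in the table just produce no row, so the caller
treats anything missing from the result as an unseen token.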

> Here is the new benchmark:
> http://wiki.apache.org/spamassassin/BayesBenchmarkResults

Those results are extremely surprising.  Our CanIt benchmarks show
Berkeley DB outperforming PostgreSQL by a factor of 6 to 10, but your
benchmarks show them about equal.  Something is funny there...  I
wonder if it could be that CanIt never locks the BDB files, whereas
SpamAssassin does?  If that's the case, then there's still tremendous
room for improvement on the BDB side.

Also, I don't think the "fsync=false" column should even be presented.
Nobody who cares about his/her data runs PostgreSQL like that, so the
timings in that column are unachievable in real-world situations.

Ironically, just as SpamAssassin is making strides with a centralized
SQL database, in CanIt, we've revised our thinking and started moving
to distributed BDB databases. :-)

Regards,

David.