[Mimedefang] IP Reputation data collection (announcement, Internet draft)

Fri Apr 30 16:01:40 EDT 2010

Hi Dave,

Passionate technical debate follows ;-)

DFS, I believe my comments below also address your comments which I 
received slightly later.

In summary, I'd recommend you go with the broader, more flexible RFC.  
This is a great idea IMO either way, though!

regards,
KAM

>> 1 - including the product / version used for auto-ham/spam and the  
>> automated score & threshold of a spam
>
> I see some of this as best handled out of band.  You already need to 
> negotiate a username and shared secret before events can be reported 
> to the aggregator, so that's probably the best time to communicate 
> product and version information.
As versions are always changing, you might want to know that someone is 
using SpamAssassin 3.X and another person is using IHateSpam, etc.

> The issue of scores is tougher, particularly in situations where 
> end-user configuration can change the score at any time.  Here, it may 
> make sense to return the score and threshold with the event, but those 
> two points of data may not provide enough information to be useful.  
> For example, two users (or CanIt streams, or filtering systems, or...) 
> could have the same threshold and arrive at the same score for a 
> nearly identical message, but for entirely different reasons.  It's 
> probably enough for the purposes of reputation tracking to know that 
> someone or something thought they saw a spam event from a given address.
I agree it's not a complete snapshot but the information could be 
invaluable.  How valuable is debatable but my point is that some "extra 
data" per event is likely a good idea.  And, for example, emails that 
score really high on SA are something that could be weighted.  I might 
not even pay attention to the spam threshold as much as the spam score, 
for example.

From RP's perspective, knowing that 1.2.3.4 is sending a LOT of emails 
all marked 15 and higher by SA could lend a lot more credibility than 
a bunch of emails marked 1% over the threshold.
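
A minimal sketch of that weighting idea, in Python.  The function name, 
the linear rule and the cap are all assumptions for illustration; nothing 
here comes from the draft.

```python
def event_weight(score, threshold, cap=3.0):
    """Weight a spam event by how far the score exceeds the threshold.

    Hypothetical rule: the weight is the ratio score/threshold, clamped
    to at most `cap`; events below the threshold get no weight at all.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if score < threshold:
        return 0.0
    return min(score / threshold, cap)


# A message scoring 15 against a threshold of 5 counts roughly three
# times as much as one that is 1% over the threshold.
strong = event_weight(15.0, 5.0)
marginal = event_weight(5.05, 5.0)
```

An aggregator that ignores the threshold entirely could apply the same 
idea to the raw score instead of the ratio.
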
>> 2 - including virii/malware as a note
>
> Another event type for "virus or malware seen" might be a good 
> addition, but I don't see any value in communicating back anything 
> more detailed than that for calculating reputation.   Differentiation 
> between "virus" and other malware might be useful, too.
The virus type would be useful in identifying breakouts, etc.  Again 
though, this isn't a debate of the value of the data because that 
shouldn't be a goal of the RFC. The goal is to provide something that is 
a framework lots of people might use both as aggregators and sensors.  
Towards that end, I would encourage RP to consider packaging the 
aggregator code as well since it's my basic belief

>> 3 - dangerous attachments and a filename
>> 4 - dangerous content
>
> I guess the usefulness of this depends on the definition of 
> "dangerous".  What are you looking for here?
One example: a lot of phishing emails are sent with bad PDFs and EXEs.

Dangerous content could refer to phishing attacks via social engineering 
that don't have attachments.  Perhaps something like the ClamAV phishing 
signatures.
>> 5 - reverse DNS failures
>
> This might be good, but handling transient failures due to local or 
> upstream DNS issues vs. failure to configure rDNS for a host might be 
> necessary.
IMO, you are debating what an aggregator should do with the data rather 
than the process of sending / receiving the data.  However, I think we 
can agree that rDNS is an important component in the email ecosystem.

From RP's perspective, tracking senders that consistently use invalid 
rDNS, especially when reported by multiple sensors, would yield valuable 
data, particularly if the failures span a period of time long enough to 
remove DNS outages from consideration.
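
As a rough sketch of that filtering on the aggregator side (Python; the 
report tuple layout, the function name and both thresholds are 
assumptions, not anything defined in the draft):

```python
from collections import defaultdict


def persistent_rdns_failures(reports, min_sensors=2, min_span=86400):
    """Return IPs whose rDNS failures look persistent rather than transient.

    reports: iterable of (ip, sensor_id, unix_timestamp) tuples.
    An IP qualifies when at least `min_sensors` distinct sensors reported
    it and its first and last reports are at least `min_span` seconds apart.
    """
    sensors = defaultdict(set)
    first = {}
    last = {}
    for ip, sensor, ts in reports:
        sensors[ip].add(sensor)
        first[ip] = min(ts, first.get(ip, ts))
        last[ip] = max(ts, last.get(ip, ts))
    return sorted(
        ip for ip in sensors
        if len(sensors[ip]) >= min_sensors and last[ip] - first[ip] >= min_span
    )


reports = [
    ("192.0.2.1", "sensor-a", 0),
    ("192.0.2.1", "sensor-b", 90000),  # second sensor, more than a day later
    ("192.0.2.2", "sensor-a", 0),
    ("192.0.2.2", "sensor-a", 600),    # one sensor, ten minutes: likely an outage
]
flagged = persistent_rdns_failures(reports)
```

Only 192.0.2.1 is flagged here; the single-sensor, ten-minute burst for 
192.0.2.2 is treated as a possible DNS outage.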

>> 6 - improper HELO/EHLO statements
>
> This is probably a good one to add.
Hooray.  We'll always have Paris.

Seriously though, please realize that this was my first pass at a 
response to the RFC.  I think you should poll for brainstorms on EVENTS 
to consider. There have got to be a lot more I haven't thought 
of/remembered/etc.

>> 7 - invalid MX records
>
> That's not terribly useful for a sending IP address, as there's no 
> legitimate reason the sending IP needs to be an MX of the sender's 
> domain.
While RP's use of the aggregated data is an IP-based index, others might 
use it for a sending email address index, etc.  But knowing that IP 
1.2.3.4 sent me an email from a from address with an invalid MX record 
(which includes checking A records, etc.) is quite useful in real-world 
anti-spam.

>> I liked that in #3 the REPUTATION database is not specific to  
>> indexing by IPv4 or IPv6.  The system should be extensible to report  
>> more data such as the email address of the sender or recipient, the  
>> subject of the email, etc.  In theory, the system could even replace  
>> Razor so it could include a hash of the email, etc.  But I would 
>> likely caveat the first sentence with "index by IPv4 or IPv6 address 
>> as one example".
>
> That's probably a bit of scope creep.  The idea here is that filters can
> communicate IP reputation information with a low-overhead UDP 
> protocol. Sender address reputation might be worth investigating in a 
> future iteration (the extensibility is there), but let's concentrate 
> on the IP reputation case for now.
Sure, replacing Razor is feature creep, so that's an extreme case.  But 
adding more data to the packet is likely necessary to make this more 
extensible, though I did scope it to fairly short bits of data like 
to/from/subject and hash values.

Plus, playing devil's advocate, the RFC says specifically that IP 
reputation is NOT the only goal:

"Note that the exact format of the reputation database as well as what 
constitutes "reputation" are beyond the scope of this document.  We are 
concerned only with a standard for reporting events."

So while I'm happy to address it more narrowly, my editorial feedback on 
this version would be to remove that statement if it isn't your goal to 
extend this beyond IP reputation.

> The use of port 6568 could be expanded to state something like 
> unless  the AGGREGATOR utilizes an alternate port or something.  I 
> have other  listeners on 6568 already, for example.
>
> Well, it's an RFC, so "SHOULD" pretty much covers that.
Agreed and I was happy you added the RFC-eeze description but it never 
hurts to be explicitly flexible and even require that alternate ports be 
possible.
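
For illustration, a sensor-side sketch of "default port unless configured 
otherwise" in Python.  The function name and the host:port string format 
are assumptions; only the port number 6568 comes from the discussion 
above, and IPv6 literals are not handled.

```python
DEFAULT_PORT = 6568  # default port discussed above


def parse_aggregator(address, default_port=DEFAULT_PORT):
    """Split 'host' or 'host:port' into a (host, port) tuple.

    Falls back to `default_port` when no port is given.
    """
    host, sep, port = address.rpartition(":")
    if not sep:
        return address, default_port
    return host, int(port)


# Hypothetical aggregator hostnames.
default_target = parse_aggregator("aggregator.example.org")
custom_target = parse_aggregator("aggregator.example.org:7000")
```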

> 4.2 would be best organized into 4.2.0 for reserved, 4.2.1 for  
> GREYLISTED, etc. so that all event types have a clear report  
> restriction.  Then 4.2 should be restrictions for all events like IPv4

Makes sense though if you end up adding a bazillion more EVENT types, 
grouping them could become troublesome.  I was mostly looking for some 
semblance of a 1:1 restriction for each EVENT type to help ensure that 
an EVENT type isn't forgotten in the years to come.

> Does " a priori knowledge" mean something or is it a grammar/spelling 
> issue?
>
> http://en.wikipedia.org/wiki/A_priori_and_a_posteriori#Use_of_the_terms

Thanks.  I wasn't sure if there was some other meaning than how I read 
it originally.

So knowing that, my underlying question is: what is the reason that a 
sensor should only send 492 bytes?  I read "a priori knowledge" as "with 
prior knowledge", which seems a fair paraphrase, and that meant to me 
that the very next statement constituted prior knowledge that the 
aggregator has to accept reports larger than 492 bytes.  In short, 
sentence one's caveat is met by sentence two's requirement that the 
aggregator MUST handle reports of up to 65507 bytes, i.e. greater than 
492 bytes.  This removes the need for sentence one completely, which I 
imagine isn't what you want.
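
One way the two limits could be kept distinct is to treat "prior 
knowledge" as an explicit per-aggregator setting on the sensor.  A sketch 
in Python; the flag and function names are assumptions, and this is only 
one possible reading of the draft, not its stated intent.

```python
SAFE_PAYLOAD = 492       # limit for sensors without prior knowledge
MAX_UDP_PAYLOAD = 65507  # 65535 - 20 (IPv4 header) - 8 (UDP header)


def max_report_size(aggregator_accepts_large=False):
    """Return the largest report a sensor should send, in bytes.

    `aggregator_accepts_large` stands for knowledge obtained out of band
    that this aggregator is reachable with larger datagrams.
    """
    return MAX_UDP_PAYLOAD if aggregator_accepts_large else SAFE_PAYLOAD


def fits(report, aggregator_accepts_large=False):
    """Check whether a report (bytes) is within the applicable limit."""
    return len(report) <= max_report_size(aggregator_accepts_large)
```
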
>> I would include an extract definition of [GREY] in section 7 in 
>> addition  to the reference.  It's a term that confuses a lot of 
>> people that I  discuss anti-spam with that aren't anti-spam researchers.
>
> Possibly a good idea, though I don't expect too many people who aren't 
> involved in anti-spam activities will be interested in this RFC
Touche.  I agree with this statement 100%.  I forgot to consider the 
audience.

Regards,
KAM