Data Science DC – Naive Bayes and Logistic Regression
Attending the Data Science DC meetup; I will be live-blogging.
7:04 PM: First up, we have the introductions and sponsor messages by Harlan Harris.
7:18 PM: Elena Zheleva, from Living Social, starts the actual presentation. She opens with two examples:
Example 1: Classification of mail as Spam or Not-Spam
Example 2: Classification of a voter as Republican or Democrat
Talks about features and attributes, and kinds of attributes (continuous, discrete, nominal, etc.)
The basics of Naive Bayes: The idea of Naive Bayes is of course simple enough. We would like to find P(Y | X), where X is the input and Y is the class label. X is typically composed of many, many attributes, so this may be better written as: P(Y | X1, X2, ..., Xn)
Directly estimating this would require a very large training set (with binary attributes X1, X2, ..., Xn there are 2^n combinations to cover). So, using Bayes' theorem, we can rewrite it as:
P(Y | X) = P(X | Y) P(Y) / P(X)
P(X | Y) can in turn be written as P(X1, X2, ..., Xn | Y), and now, using the assumption that the attributes are conditionally independent given the class (hence the name "Naive"), we can factor this as:
P(X | Y) = P(X1, X2, ..., Xn | Y) = P(X1 | Y) * P(X2 | Y) * ... * P(Xn | Y)
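To make the product rule concrete, here is a minimal Python sketch of that computation on a made-up two-word spam example. The data, the Laplace smoothing choice, and all the function names are mine, not from the talk:

```python
# Minimal Naive Bayes sketch: argmax over Y of P(Y) * prod_i P(Xi | Y).
# P(X) in the denominator is the same for every class, so it cancels.
from collections import defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_dict, label) pairs."""
    label_counts = defaultdict(int)
    # attr_counts[label][attr][value] = how often attr took value given label
    attr_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for attrs, label in examples:
        label_counts[label] += 1
        for attr, value in attrs.items():
            attr_counts[label][attr][value] += 1
    return label_counts, attr_counts

def predict(attrs, label_counts, attr_counts):
    total = sum(label_counts.values())
    best_label, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / total                    # P(Y)
        for attr, value in attrs.items():        # product of P(Xi | Y)
            # Laplace smoothing so an unseen value does not zero the product
            score *= (attr_counts[label][attr][value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Illustrative data: does a mail contain the word "free" / "meeting"?
data = [({"free": 1, "meeting": 0}, "spam"),
        ({"free": 1, "meeting": 0}, "spam"),
        ({"free": 0, "meeting": 1}, "ham"),
        ({"free": 0, "meeting": 1}, "ham")]
model = train_naive_bayes(data)
print(predict({"free": 1, "meeting": 0}, *model))  # -> "spam"
```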
Next, she talks about the difference in approach between Naive Bayes and Logistic Regression. The paper by Andrew Ng and Michael Jordan (not that Michael Jordan, but a famous one nevertheless), "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes", is a helpful resource in that regard.
Question: How does NB work when the attributes are continuous, not binary?
Answer: If we can assume the distribution of each attribute is Gaussian (Normal), then we can learn its parameters (mu and sigma) for each class. (The "sex classification" example in the Wikipedia article on Naive Bayes classifiers illustrates this.)
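A rough sketch of that answer in Python, with invented height data (the numbers, the equal priors, and the single attribute are purely illustrative):

```python
# Gaussian Naive Bayes sketch: estimate per-class mean (mu) and standard
# deviation (sigma), then use the normal density as P(Xi | Y).
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_class(samples):
    """Estimate mu and sigma from the values observed for one class."""
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / (len(samples) - 1)
    return mu, math.sqrt(var)

# Illustrative: height in cm for two classes
heights = {"male":   [180.0, 175.0, 178.0, 183.0],
           "female": [163.0, 158.0, 165.0, 160.0]}
params = {label: fit_class(vals) for label, vals in heights.items()}

x = 170.0
priors = {"male": 0.5, "female": 0.5}
scores = {label: priors[label] * gaussian_pdf(x, *params[label]) for label in params}
print(max(scores, key=scores.get))
```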
7:54 PM: On to Logistic Regression. She talks about the problem of overfitting, which can occur when there are few samples; that was covered under the title of "Regularization" in a separate meetup.
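Since regularization only got a passing mention, here is a hedged sketch of L2-regularized logistic regression trained by plain gradient descent. The learning rate, penalty strength, and toy data are all arbitrary choices of mine, not from the talk:

```python
# Logistic regression with an L2 penalty: the penalty pulls the weights
# toward zero, which guards against overfitting when samples are few.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lam=0.1, lr=0.1, epochs=2000):
    """X: list of feature lists, y: list of 0/1 labels."""
    n, m = len(X[0]), len(X)
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the L2 penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

# Illustrative two-feature data
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.0]]
y = [0, 0, 1, 1]
w, b = train(X, y)
print(sigmoid(sum(wj * xj for wj, xj in zip(w, [2.5, 0.0])) + b))  # ~ P(y=1)
```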
8:13 PM: Time for acknowledgements and a list of available software.
After-meetup notes: Weka has a good reference implementation of Naive Bayes. Here is a snapshot of one of the examples. (I modified the data file a little bit, so your results may differ slightly.)