17-08-28 Interesting Facts in Machine Learning (Logistic Regression)
Category: Idea Lists (Upon Request)
1. Two major ways to do multiclass classification:
- Softmax loss (multinomial logistic regression)
- One-vs-all with the binary (logistic) function
- Naming -
- “Logistic” regression, after the sigmoid (logistic) function
- “Softmax” regression, after the softmax function
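A minimal NumPy sketch (the logits are hypothetical) contrasting the two: softmax yields one coupled distribution over all classes that sums to 1, while independent one-vs-all sigmoids do not.

```python
import numpy as np

# Hypothetical raw scores (logits) from a 3-class linear model, for one example.
logits = np.array([2.0, 1.0, 0.1])

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Softmax ("multinomial") approach: one coupled distribution over classes.
p_softmax = softmax(logits)

# One-vs-all approach: an independent sigmoid per class.
p_ovr = sigmoid(logits)

print(p_softmax, p_softmax.sum())  # sums to 1
print(p_ovr, p_ovr.sum())          # does NOT sum to 1; needs renormalizing
```

Both rank the classes the same way here; the difference is that softmax probabilities are directly comparable across classes, while one-vs-all scores must be renormalized.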
- No closed-form solution, despite convexity
- Many, many optimizers:
- Newton / Newton-CG
- BFGS
- L-BFGS
- IRLS
- Trust Region Conjugate Gradient
- Gradient Descent
- GD + Line Search
- Stochastic Average Gradient
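To see why so many optimizers coexist, a sketch on synthetic data (the dataset, step sizes, and small L2 penalty are all made up for illustration) comparing plain gradient descent with Newton's method, which for logistic regression is equivalent to IRLS. Convexity means both land on the same optimum; Newton just gets there in far fewer iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic binary problem (hypothetical data).
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -2.0])
y = (X @ w_true + rng.normal(scale=0.5, size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lam = 1e-2  # small L2 penalty keeps the optimum finite and unique

# Plain gradient descent on the regularized negative log-likelihood.
w_gd = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ w_gd) - y) / len(y) + lam * w_gd
    w_gd -= 0.5 * grad

# Newton's method (equivalently IRLS): uses the Hessian, needs few steps.
w_nt = np.zeros(2)
for _ in range(20):
    p = sigmoid(X @ w_nt)
    grad = X.T @ (p - y) / len(y) + lam * w_nt
    H = (X.T * (p * (1 - p))) @ X / len(y) + lam * np.eye(2)
    w_nt -= np.linalg.solve(H, grad)

print(w_gd, w_nt)  # both optimizers reach the same unique optimum
```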
- Bayesian treatments are difficult (the logistic likelihood has no convenient conjugate prior)
- Discriminative: learns P(Y|X) directly, rather than first modeling the joint P(Y, X) and then conditioning on X (the generative approach)
- Without regularization, the weights can grow arbitrarily large (they diverge on linearly separable data), damaging generalization. Penalties matter more here than in the linear regression setting.
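A sketch of the divergence on a tiny, perfectly separable toy dataset (made up for illustration): with no penalty the weight keeps growing the longer you train, since pushing it toward infinity keeps lowering the loss, while a small L2 penalty pins it to a finite optimum.

```python
import numpy as np

# Perfectly separable 1-D data (hypothetical).
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(lam, steps=20000, lr=0.1):
    """Gradient descent on the (optionally L2-penalized) logistic loss."""
    w = np.zeros(1)
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w
        w -= lr * grad
    return w[0]

w_unreg = fit(lam=0.0)  # keeps growing the longer you train
w_l2 = fit(lam=0.1)     # converges to a finite value

print(w_unreg, w_l2)
```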
- You can get better generalization with a stochastic solver [https://arxiv.org/pdf/1708.05070.pdf]
- The reason scaling can still be important is the optimizer - the problem is convex, so you will reach the same solution either way, but poorly scaled features make it badly conditioned and slow to solve
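A sketch of the conditioning point, with two hypothetical features on very different scales: the condition number of X^T X, which governs how slowly first-order solvers converge, is huge until the columns are standardized, even though standardizing does not change what the optimum is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on wildly different scales (think age vs. income).
X = rng.normal(size=(500, 2)) * np.array([1.0, 1000.0])

# Condition number of X^T X: large values mean slow, ill-conditioned solves.
raw_cond = np.linalg.cond(X.T @ X)

# Standardizing each column fixes the conditioning without changing the
# convex problem's optimum, only how fast solvers reach it.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_cond = np.linalg.cond(Xs.T @ Xs)

print(raw_cond, scaled_cond)  # raw is orders of magnitude worse conditioned
```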
- Linear models generalize more strongly than almost every alternative on unstructured data (trees and neural networks overfit more easily)
- Every relationship between your feature and the label should be as close to linear as possible
- You can use the Box-Cox transform to automatically get close to linear
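A sketch using SciPy's `scipy.stats.boxcox` on a made-up right-skewed feature: it fits the power (lambda) that makes the feature most nearly normal, which in turn tends to make its relationship with the label more linear. Note it requires strictly positive inputs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A heavily right-skewed hypothetical feature (log-normal, e.g. income).
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Box-Cox jointly returns the transformed feature and the fitted lambda.
x_t, lam = stats.boxcox(x)

print(stats.skew(x), stats.skew(x_t), lam)  # skew is much closer to 0 after
```

For log-normal data like this, the fitted lambda comes out near 0, which corresponds to a plain log transform.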
Source: Original Google Doc