Ticket #1118 (new bug)
Lasso Regression is slow on auto-mpg dataset.
| Reported by: | anze | Owned by: | lan |
|---|---|---|---|
| Milestone: | 2.6 | Component: | library |
| Severity: | major | Keywords: | |
| Cc: | lanz | Blocking: | |
| Blocked By: | | | |
Description
Training the Lasso Regression learner on the auto-mpg dataset is very slow. It also generates division-by-zero warnings. Please check and fix if possible.
Code to reproduce (run python as "python -Wall" to enable warnings):

```python
import Orange

auto = Orange.data.Table("auto-mpg")
lr = Orange.regression.lasso.LassoRegressionLearner(n_boot=2, n_perm=2)
lr(auto)
```
Attachments
Change History
comment:1 Changed 2 years ago by lanz
- Type changed from check to bug
- Severity changed from minor to major
- Milestone changed from 2.6 to 2.5
Not just slow: on my system (latest Orange, scipy 0.10.1) LASSO is completely BROKEN! The optimizer it uses (scipy.optimize.fmin_cobyla) does not work properly on this problem.

With more than ~20 variables the optimization does not converge; the solution it returns is not even feasible (!) and, as far as I saw in a couple of tests, is largely independent of the parameter t.

Additionally, even on very small problems where it reports successful convergence, it has in fact not found the optimal solution.

I tried some minor tweaks, such as setting the initial approximation to all zeros instead of random values (all zeros is at least guaranteed to be feasible), but that didn't help.
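For reference, the failing setup looks roughly like this (a minimal sketch of the constrained formulation, not Orange's exact code; X and y are assumed to be numpy float arrays and lasso_cobyla is an illustrative name):

```python
import numpy as np
from scipy.optimize import fmin_cobyla

def lasso_cobyla(X, y, t=1.0):
    # Constrained lasso: minimize ||y - X b||^2  s.t.  sum(|b_j|) <= t.
    def objective(b):
        r = y - X.dot(b)
        return np.dot(r, r)

    def constraint(b):
        # COBYLA expects inequality constraints c(b) >= 0 at feasible points;
        # on this problem it often stops where this is still negative.
        return t - np.abs(b).sum()

    b0 = np.zeros(X.shape[1])  # all zeros: guaranteed feasible start
    return fmin_cobyla(objective, b0, [constraint], rhoend=1e-6)
```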
fmin_cobyla also fails with a different parametrization of lasso (regularization term in the objective, no constraints). With that formulation, however, it is possible to use scipy.optimize.fmin_l_bfgs_b, which does work.

With this change the auto-mpg dataset now trains about twice as fast (and actually converges): 34 s down to 17 s with no permutations. The method still performs terribly for n < m, though. The objective is of course not differentiable, so fmin_l_bfgs_b has to approximate gradients numerically (approx_grad=1), which can't be fast, I guess.
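A minimal sketch of that penalized formulation with fmin_l_bfgs_b, under the same assumptions (numpy arrays X and y; lasso_lbfgs and the default t are illustrative):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def lasso_lbfgs(X, y, t=1.0):
    # Penalized lasso: minimize ||y - X b||^2 + t * sum(|b_j|).
    # Note: t here weights the penalty; it is not the constraint bound above.
    def objective(b):
        r = y - X.dot(b)
        return np.dot(r, r) + t * np.abs(b).sum()

    b0 = np.zeros(X.shape[1])  # deterministic start
    # The |b_j| terms make the objective non-differentiable, so let
    # l_bfgs_b estimate the gradient numerically (approx_grad=1).
    b, fval, info = fmin_l_bfgs_b(objective, b0, approx_grad=1)
    return b
```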
One possibility is to use cvxopt. It would be another dependency, but at least it does convex optimization properly (e.g. ~100x faster than the above for some n < m problems).
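For illustration only, the penalized problem can be rewritten as a quadratic program over the split b = p - n with p, n >= 0, which cvxopt's solvers.qp can handle. This is an untested sketch, not the attached fix; the ridge term and names are my assumptions:

```python
import numpy as np
from cvxopt import matrix, solvers

def lasso_qp(X, y, lam=1.0):
    # Split b = p - n with p, n >= 0 and stack u = [p; n], so that
    # ||y - X b||^2 + lam * sum(|b_j|)  becomes  (1/2) u'Pu + q'u, u >= 0.
    m = X.shape[1]
    A = X.T.dot(X)
    c = X.T.dot(y)
    P = 2.0 * np.vstack([np.hstack([A, -A]), np.hstack([-A, A])])
    P += 1e-8 * np.eye(2 * m)  # tiny ridge (assumption) to keep the KKT system nonsingular
    q = np.concatenate([lam - 2.0 * c, lam + 2.0 * c])
    G = -np.eye(2 * m)         # -u <= 0, i.e. u >= 0
    h = np.zeros(2 * m)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:m] - u[m:]       # recover b = p - n
```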
Anze, Lan (or anybody else): if you have an older version of scipy, could you check whether it is broken there as well? Did this ever work properly?
Changed 2 years ago by lanz
- attachment lasso_lbfgs.diff added

Another optimization function. Note that the parameter t here is not the same as before: in the old constrained formulation t bounds the sum of absolute coefficients, while in the new penalized formulation it weights the regularization term in the objective.

comment:2 Changed 2 years ago by lanz
- Cc lanz added

I attached a possible (temporary) fix. In the future, someone could implement the LARS algorithm to see how much it improves performance. For now I think cvxopt is not necessary, since we probably don't need it for anything else.