Linear regression (linear)¶
Linear regression is a statistical regression method which predicts the value of a continuous response (class) variable based on the values of several predictors. The model assumes that the response variable is a linear combination of the predictors; the task of linear regression is therefore to fit the unknown coefficients.
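A fitted model therefore predicts the response as the intercept plus a weighted sum of the predictor values. A minimal plain-Python sketch (the coefficient values below are made up for illustration, not fitted from data):

```python
# Linear model prediction: y_hat = beta0 + sum(beta_i * x_i).
# Illustrative coefficients, not estimates from a real data set.
intercept = 2.0
coefficients = [0.5, -1.2]  # one weight per predictor

def predict(x):
    """Return intercept + dot product of coefficients and predictors."""
    return intercept + sum(b * xi for b, xi in zip(coefficients, x))

print(predict([4.0, 1.0]))  # 2.0 + 0.5*4.0 - 1.2*1.0 = 2.8
```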
To fit the regression parameters on the housing data set, use the following code:
import Orange
housing = Orange.data.Table("housing")
learner = Orange.regression.linear.LinearRegressionLearner()
classifier = learner(housing)
- class Orange.regression.linear.LinearRegressionLearner(name=linear regression, intercept=True, compute_stats=True, ridge_lambda=None, imputer=None, continuizer=None, use_vars=None, stepwise=False, add_sig=0.05, remove_sig=0.2, **kwds)¶
Fits the linear regression model, i.e. learns the regression parameters. The class is derived from Orange.regression.base.BaseRegressionLearner, which preprocesses the data (continuization and imputation) before the regression parameters are fitted.
- __call__(table, weight=None, verbose=0)¶
Parameters: - table (Orange.data.Table) – data instances.
- weight (None or list of Orange.feature.Continuous which stores weights for instances) – the weights for instances. Default: None, i.e. all data instances are equally important in fitting the regression parameters
- __init__(name=linear regression, intercept=True, compute_stats=True, ridge_lambda=None, imputer=None, continuizer=None, use_vars=None, stepwise=False, add_sig=0.05, remove_sig=0.2, **kwds)¶
Parameters: - name (string) – name of the linear model, default ‘linear regression’
- intercept (bool) – if True, the intercept beta0 is included in the model
- compute_stats (bool) – if True, statistical properties of the estimators (standard error, t-scores, significances) and statistical properties of the model (sum of squares, R2, adjusted R2) are computed
- ridge_lambda (int or None) – if not None, ridge regression is performed with the given lambda parameter controlling the regularization
- use_vars (list of Orange.feature.Descriptor or None) – the list of independent variables included in the regression model. If None (default) all variables are used
- stepwise (bool) – if True, stepwise regression based on F-test is performed. The significance parameters are add_sig and remove_sig
- add_sig (float) – lower bound of significance for which a variable is included in the regression model; default value = 0.05
- remove_sig (float) – upper bound of significance for which a variable is excluded from the regression model; default value = 0.2
- class Orange.regression.linear.LinearRegression(class_var=None, domain=None, coefficients=None, F=None, std_error=None, t_scores=None, p_vals=None, dict_model=None, fitted=None, residuals=None, m=None, n=None, mu_y=None, r2=None, r2adj=None, sst=None, sse=None, ssr=None, std_coefficients=None, intercept=None)¶
Linear regression predicts value of the response variable based on the values of independent variables.
- F¶
F-statistics of the model.
- coefficients¶
Regression coefficients stored in a list. If the intercept is included, the first item corresponds to the estimated intercept.
- std_error¶
Standard errors of the coefficient estimators, stored in a list.
- t_scores¶
List of t-scores for the estimated regression coefficients.
- p_vals¶
List of p-values for the null hypothesis that each regression coefficient equals 0, based on the t-scores and a two-sided alternative hypothesis.
- dict_model¶
Statistical properties of the model stored in a dictionary: keys are the names of the independent variables (or “Intercept”); values are tuples (coefficient, standard error, t-value, p-value).
- fitted¶
Estimated values of the dependent variable for all instances from the training table.
- residuals¶
Differences between estimated and actual values of the dependent variable for all instances from the training table.
- m¶
Number of independent (predictor) variables.
- n¶
Number of instances.
- mu_y¶
Sample mean of the dependent variable.
- r2¶
Coefficient of determination.
- r2adj¶
Adjusted coefficient of determination.
- sst, sse, ssr¶
Total sum of squares, explained sum of squares and residual sum of squares, respectively.
- std_coefficients¶
Standardized regression coefficients.
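The relationships between these statistics can be illustrated in plain Python (this is an illustrative computation, not Orange's internals; the data values are made up, and the closed-form estimates below hold only for the single-predictor case):

```python
# Plain-Python illustration of the statistics a fitted model stores,
# for a single predictor x. Data values are invented for the example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, m = len(x), 1          # number of instances, number of predictors

mu_x = sum(x) / n
mu_y = sum(y) / n         # sample mean of the dependent variable

# Closed-form ordinary-least-squares estimates for one predictor
beta1 = (sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
         / sum((xi - mu_x) ** 2 for xi in x))
beta0 = mu_y - beta1 * mu_x                    # the intercept

fitted = [beta0 + beta1 * xi for xi in x]      # estimated values
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sst = sum((yi - mu_y) ** 2 for yi in y)        # total sum of squares
sse = sum((fi - mu_y) ** 2 for fi in fitted)   # explained sum of squares
ssr = sum(r ** 2 for r in residuals)           # residual sum of squares

r2 = 1 - ssr / sst                             # coefficient of determination
r2adj = 1 - (1 - r2) * (n - 1) / (n - m - 1)   # adjusted for m predictors
```

With an intercept in the model, sst = sse + ssr always holds, which is why r2 can equivalently be written as sse / sst.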
- __call__(instance, result_type=0)¶
Parameters: instance (Instance) – data instance for which the value of the response variable will be predicted
- __init__(class_var=None, domain=None, coefficients=None, F=None, std_error=None, t_scores=None, p_vals=None, dict_model=None, fitted=None, residuals=None, m=None, n=None, mu_y=None, r2=None, r2adj=None, sst=None, sse=None, ssr=None, std_coefficients=None, intercept=None)¶
Parameters: model (LinearRegressionLearner) – fitted linear regression model
- to_string()¶
Pretty-prints linear regression model, i.e. estimated regression coefficients with standard errors, t-scores and significances.
Utility functions¶
- Orange.regression.linear.stepwise(table, weight, add_sig=0.05, remove_sig=0.2)¶
Performs stepwise linear regression on table and returns the list of remaining independent variables which fit a significant linear regression model.
Parameters: - table (Orange.data.Table) – data instances.
- weight (None or list of Orange.feature.Continuous which stores the weights) – the weights for instances. Default: None, i.e. all data instances are equally important in fitting the regression parameters
- add_sig (float) – lower bound of significance for which a variable is included in the regression model; default value = 0.05
- remove_sig (float) – upper bound of significance for which a variable is excluded from the regression model; default value = 0.2
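Stepwise selection decides whether to add or drop a variable by comparing the model with and without it through an F-test. A sketch of the partial F statistic behind that comparison (the function name and structure are illustrative, not Orange's implementation):

```python
# Partial F statistic comparing a reduced model (without the candidate
# variable) against the full model (with it). Illustrative sketch only.
def partial_f(ssr_reduced, ssr_full, n, m_full, added=1):
    """F = ((SSRred - SSRfull) / added) / (SSRfull / (n - m_full - 1))."""
    num = (ssr_reduced - ssr_full) / float(added)
    den = ssr_full / float(n - m_full - 1)
    return num / den

# A candidate whose inclusion sharply reduces the residual sum of squares
# yields a large F (small p-value) and is kept when p < add_sig.
print(partial_f(120.0, 80.0, 50, 3))  # (40 / 1) / (80 / 46) = 23.0
```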
Examples¶
Prediction¶
Predict the values of the first five data instances:
# prediction for five data instances and comparison to actual values
for ins in housing[:5]:
print "Actual: %3.2f, predicted: %3.2f " % (ins.get_class(), classifier(ins))
The output of this code is
Actual: 24.00, predicted: 30.00
Actual: 21.60, predicted: 25.03
Actual: 34.70, predicted: 30.57
Actual: 33.40, predicted: 28.61
Actual: 36.20, predicted: 27.94
Properties of the fitted model¶
Print the regression coefficients with standard errors, t-scores, p-values and significance codes:
print classifier
The code output is
Variable Coeff Est Std Error t-value p
Intercept 36.459 5.103 7.144 0.000 ***
CRIM -0.108 0.033 -3.287 0.001 **
ZN 0.046 0.014 3.382 0.001 ***
INDUS 0.021 0.061 0.334 0.738
CHAS 2.687 0.862 3.118 0.002 **
NOX -17.767 3.820 -4.651 0.000 ***
RM 3.810 0.418 9.116 0.000 ***
AGE 0.001 0.013 0.052 0.958
DIS -1.476 0.199 -7.398 0.000 ***
RAD 0.306 0.066 4.613 0.000 ***
TAX -0.012 0.004 -3.280 0.001 **
PTRATIO -0.953 0.131 -7.283 0.000 ***
B 0.009 0.003 3.467 0.001 ***
LSTAT -0.525 0.051 -10.347 0.000 ***
Stepwise regression¶
To use stepwise regression, initialize the learner with stepwise=True. The lower and upper bounds for significance are controlled with add_sig and remove_sig.
learner2 = Orange.regression.linear.LinearRegressionLearner(stepwise=True,
add_sig=0.05,
remove_sig=0.2)
classifier = learner2(housing)
print classifier
As you can see from the output, the non-significant coefficients have been removed from the model.
Variable Coeff Est Std Error t-value p
Intercept 36.341 5.067 7.171 0.000 ***
LSTAT -0.523 0.047 -11.019 0.000 ***
RM 3.802 0.406 9.356 0.000 ***
PTRATIO -0.947 0.129 -7.334 0.000 ***
DIS -1.493 0.186 -8.037 0.000 ***
NOX -17.376 3.535 -4.915 0.000 ***
CHAS 2.719 0.854 3.183 0.002 **
B 0.009 0.003 3.475 0.001 ***
ZN 0.046 0.014 3.390 0.001 ***
CRIM -0.108 0.033 -3.307 0.001 **
RAD 0.300 0.063 4.726 0.000 ***
TAX -0.012 0.003 -3.493 0.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 empty 1