# source: orange/orange/Orange/statistics/contingency.py @ 9550:bd71f96b33d5

Revision 9550:bd71f96b33d5, 17.6 KB checked in by lanz <lan.zagar@…>, 2 years ago (diff) |
---|

Line | |
---|---|

1 | """ |

2 | .. index:: Contingency table |

3 | |

4 | ================= |

5 | Contingency table |

6 | ================= |

7 | |

8 | Contingency tables contain conditional distributions. Unless explicitly |

9 | 'normalized', they contain absolute frequencies, that is, the number of |

10 | instances with a particular combination of two variables' values. If they are |

11 | normalized by dividing each cell by the row sum, they represent conditional |

12 | probabilities of the column variable (here denoted as ``innerVariable``) |

13 | conditioned by the row variable (``outerVariable``). |

14 | |

15 | Contingency tables are usually constructed for discrete variables. Tables |

16 | for continuous variables have certain limitations described in a :ref:`separate |

17 | section <contcont>`. |

18 | |

19 | The example below loads the monks-1 data set and prints out the conditional |

20 | class distribution given the value of `e`. |

21 | |

22 | .. literalinclude:: code/statistics-contingency.py |

23 | :lines: 1-7 |

24 | |

25 | This code prints out:: |

26 | |

27 | 1 <0.000, 108.000> |

28 | 2 <72.000, 36.000> |

29 | 3 <72.000, 36.000> |

30 | 4 <72.000, 36.000> |

31 | |
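The counts above can be cross-checked without Orange. The following is a minimal sketch, assuming the standard MONK's feature value sets (`a`, `b`, `d` with three values, `c` and `f` with two, `e` with four) and the target concept quoted later in this documentation, y := (e=1) or (a=b). The names `FEATURE_VALUES` and `monks1_class` are hypothetical helpers, not part of Orange.

```python
from itertools import product

# Assumed value sets of the six monks-1 features (standard MONK's description).
FEATURE_VALUES = {
    "a": (1, 2, 3),
    "b": (1, 2, 3),
    "c": (1, 2),
    "d": (1, 2, 3),
    "e": (1, 2, 3, 4),
    "f": (1, 2),
}
NAMES = sorted(FEATURE_VALUES)


def monks1_class(inst):
    """Target concept quoted in this documentation: y := (e=1) or (a=b)."""
    return int(inst["e"] == 1 or inst["a"] == inst["b"])


# counts[e_value] = [number of instances with y=0, number with y=1]
counts = {e: [0, 0] for e in FEATURE_VALUES["e"]}
for values in product(*(FEATURE_VALUES[n] for n in NAMES)):
    inst = dict(zip(NAMES, values))
    counts[inst["e"]][monks1_class(inst)] += 1

for e in sorted(counts):
    print(e, counts[e])
```

Enumerating the whole domain gives 432 instances, 108 per value of `e`, which is consistent with the absolute frequencies printed above.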

32 | Contingencies behave like lists of distributions (in this case, class |

33 | distributions) indexed by values (of `e`, in this |

34 | example). Distributions are, in turn, indexed by values (class values, |

35 | here). The variable `e` from the above example is called the outer |

36 | variable, and the class is the inner. This can also be reversed. It is |

37 | also possible to use features for both the outer and the inner variable, so |

38 | the table shows distributions of one variable's values given the |

39 | value of another. There is a corresponding hierarchy of classes: |

40 | :obj:`Table` is a base class for :obj:`VarVar` (both |

41 | variables are attributes) and :obj:`Class` (one variable is |

42 | the class). The latter is the base class for |

43 | :obj:`VarClass` and :obj:`ClassVar`. |

44 | |

45 | The most commonly used of the above classes is :obj:`VarClass` which |

46 | can compute and store conditional probabilities of classes given the feature value. |

47 | |

48 | Contingency tables |

49 | ================== |

50 | |

51 | .. class:: Table |

52 | |

53 | Provides a base class for storing and manipulating contingency |

54 | tables. Although it is not abstract, it is seldom used directly but rather |

55 | through more convenient derived classes described below. |

56 | |

57 | .. attribute:: outerVariable |

58 | |

59 | Outer variable (:class:`Orange.data.variable.Variable`) whose values are |

60 | used as the first, outer index. |

61 | |

62 | .. attribute:: innerVariable |

63 | |

64 | Inner variable (:class:`Orange.data.variable.Variable`) whose values are |

65 | used as the second, inner index. |

66 | |

67 | .. attribute:: outerDistribution |

68 | |

69 | The marginal distribution (:class:`Distribution`) of the outer variable. |

70 | |

71 | .. attribute:: innerDistribution |

72 | |

73 | The marginal distribution (:class:`Distribution`) of the inner variable. |

74 | |

75 | .. attribute:: innerDistributionUnknown |

76 | |

77 | The distribution (:class:`Distribution`) of the inner variable for |

78 | instances for which the outer variable was undefined. This is the |

79 | difference between the ``innerDistribution`` and the (unconditional) |

80 | distribution of the inner variable. |

81 | |

82 | .. attribute:: varType |

83 | |

84 | The type of the outer variable (:obj:`Orange.data.Type`, usually |

85 | :obj:`Orange.data.variable.Discrete` or |

86 | :obj:`Orange.data.variable.Continuous`); equals |

87 | ``outerVariable.varType`` and ``outerDistribution.varType``. |

88 | |

89 | .. method:: __init__(outer_variable, inner_variable) |

90 | |

91 | Construct an instance of contingency table for the given pair of |

92 | variables. |

93 | |

94 | :param outer_variable: Descriptor of the outer variable |

95 | :type outer_variable: Orange.data.variable.Variable |

96 | :param inner_variable: Descriptor of the inner variable |

97 | :type inner_variable: Orange.data.variable.Variable |

98 | |

99 | .. method:: add(outer_value, inner_value[, weight=1]) |

100 | |

101 | Add an element to the contingency table by adding ``weight`` to the |

102 | corresponding cell. |

103 | |

104 | :param outer_value: The value for the outer variable |

105 | :type outer_value: int, float, string or :obj:`Orange.data.Value` |

106 | :param inner_value: The value for the inner variable |

107 | :type inner_value: int, float, string or :obj:`Orange.data.Value` |

108 | :param weight: Instance weight |

109 | :type weight: float |

110 | |

111 | .. method:: normalize() |

112 | |

113 | Normalize all distributions (rows) in the table to sum to ``1``:: |

114 | |

115 | >>> cont.normalize() |

116 | >>> for val, dist in cont.items(): |

117 | ...     print val, dist |

118 | |

119 | Output: :: |

120 | |

121 | 1 <0.000, 1.000> |

122 | 2 <0.667, 0.333> |

123 | 3 <0.667, 0.333> |

124 | 4 <0.667, 0.333> |

125 | |

126 | .. note:: |

127 | |

128 | This method does not change the ``innerDistribution`` or |

129 | ``outerDistribution``. |

130 | |
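As an illustration of what row normalization means, here is a minimal plain-Python sketch. It does not use Orange; `normalize_rows` is a hypothetical helper, and the counts are the ones printed for monks-1 earlier in this documentation. The marginal sums are computed separately from the raw counts, mirroring the note that normalization leaves the marginal distributions untouched.

```python
def normalize_rows(table):
    """Return a copy of `table` with every non-empty row scaled to sum to 1."""
    normalized = {}
    for outer_value, row in table.items():
        total = float(sum(row))
        normalized[outer_value] = [c / total if total else 0.0 for c in row]
    return normalized


# Absolute frequencies of the class (inner) for each value of e (outer).
counts = {"1": [0, 108], "2": [72, 36], "3": [72, 36], "4": [72, 36]}

# Marginals come from the raw counts and are not affected by normalization.
outer_marginal = {key: sum(row) for key, row in counts.items()}
inner_marginal = [sum(column) for column in zip(*counts.values())]

probs = normalize_rows(counts)
for key in sorted(probs):
    print(key, "<%s>" % ", ".join("%.3f" % p for p in probs[key]))
```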

131 | With respect to indexing, a contingency table is a cross between a dictionary |

132 | and a list. It supports standard dictionary methods ``keys``, ``values`` and |

133 | ``items``. :: |

134 | |

135 | >>> print cont.keys() |

136 | ['1', '2', '3', '4'] |

137 | >>> print cont.values() |

138 | [<0.000, 108.000>, <72.000, 36.000>, <72.000, 36.000>, <72.000, 36.000>] |

139 | >>> print cont.items() |

140 | [('1', <0.000, 108.000>), ('2', <72.000, 36.000>), |

141 | ('3', <72.000, 36.000>), ('4', <72.000, 36.000>)] |

142 | |

143 | Although keys returned by the above functions are strings, contingency can |

144 | be indexed by anything that can be converted into values of the outer |

145 | variable: strings, numbers or instances of ``Orange.data.Value``. :: |

146 | |

147 | >>> print cont[0] |

148 | <0.000, 108.000> |

149 | >>> print cont["1"] |

150 | <0.000, 108.000> |

151 | >>> print cont[orange.Value(data.domain["e"], "1")] |

152 | |

153 | The length of the table equals the number of values of the outer |

154 | variable. However, iterating through a contingency table |

155 | does not return keys, as with dictionaries, but distributions. :: |

156 | |

157 | >>> for i in cont: |

158 | ... print i |

159 | <0.000, 108.000> |

160 | <72.000, 36.000> |

161 | <72.000, 36.000> |

162 | <72.000, 36.000> |

163 | |

164 | |
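The dictionary/list behaviour described above can be mimicked in a few lines of plain Python. This is only a sketch of the indexing semantics, not Orange's implementation; `TinyContingency` is a hypothetical class, and rows are plain lists instead of distribution objects.

```python
class TinyContingency(object):
    """Hypothetical stand-in that is indexed like a dict but iterates like a list."""

    def __init__(self, outer_values, rows):
        self._outer_values = list(outer_values)  # symbolic values, e.g. "1".."4"
        self._rows = list(rows)                  # one row per outer value

    def _position(self, key):
        # Integers are treated as positions, anything else as a symbolic value.
        if isinstance(key, int):
            return key
        return self._outer_values.index(key)

    def __getitem__(self, key):
        return self._rows[self._position(key)]

    def __len__(self):
        return len(self._rows)

    def __iter__(self):
        # Unlike a dict, iteration yields the rows, not the keys.
        return iter(self._rows)

    def keys(self):
        return list(self._outer_values)

    def values(self):
        return list(self._rows)

    def items(self):
        return list(zip(self._outer_values, self._rows))


cont = TinyContingency(["1", "2", "3", "4"],
                       [[0, 108], [72, 36], [72, 36], [72, 36]])
print(cont[0], cont["1"], len(cont))
for row in cont:
    print(row)
```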

165 | |

166 | .. class:: Class |

167 | |

168 | An abstract base class for contingency tables that contain the class, |

169 | either as the inner or the outer variable. |

170 | |

171 | .. attribute:: classVar (read only) |

172 | |

173 | The class attribute descriptor; always equal to either |

174 | :obj:`Table.innerVariable` or :obj:`Table.outerVariable`. |

175 | |

176 | .. attribute:: variable |

177 | |

178 | Variable; always equal to either ``innerVariable`` or ``outerVariable``. |

179 | |

180 | .. method:: add_var_class(variable_value, class_value[, weight=1]) |

181 | |

182 | Add an element to contingency by increasing the corresponding count. The |

183 | difference between this and :obj:`Table.add` is that the variable |

184 | value is always the first argument and class value the second, |

185 | regardless of which one is inner and which one is outer. |

186 | |

187 | :param variable_value: Variable value |

188 | :type variable_value: int, float, string or :obj:`Orange.data.Value` |

189 | :param class_value: Class value |

190 | :type class_value: int, float, string or :obj:`Orange.data.Value` |

191 | :param weight: Instance weight |

192 | :type weight: float |

193 | |
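The argument-order convention of ``add_var_class`` can be sketched as follows. This is a hypothetical plain-Python model, not Orange code: `TinyClassTable` and its `class_is_outer` flag are assumptions made for illustration only.

```python
class TinyClassTable(object):
    """Hypothetical model of a table in which one variable is the class."""

    def __init__(self, class_is_outer):
        self.class_is_outer = class_is_outer
        self.cells = {}  # (outer_value, inner_value) -> accumulated weight

    def add(self, outer_value, inner_value, weight=1.0):
        # Generic form: arguments are given in (outer, inner) order.
        key = (outer_value, inner_value)
        self.cells[key] = self.cells.get(key, 0.0) + weight

    def add_var_class(self, variable_value, class_value, weight=1.0):
        # Fixed (variable, class) order; swapped internally when the class is outer.
        if self.class_is_outer:
            self.add(class_value, variable_value, weight)
        else:
            self.add(variable_value, class_value, weight)


var_class = TinyClassTable(class_is_outer=False)  # variable outside, class inside
class_var = TinyClassTable(class_is_outer=True)   # class outside, variable inside
for table in (var_class, class_var):
    table.add_var_class("2", "0", weight=2.0)

print(var_class.cells)
print(class_var.cells)
```

The same call lands in cell ("2", "0") of the first table and in cell ("0", "2") of the second.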

194 | |

195 | .. class:: VarClass |

196 | |

197 | A class derived from :obj:`Class` in which the variable is |

198 | used as :obj:`Table.outerVariable` and class as the |

199 | :obj:`Table.innerVariable`. This form is suitable for |

200 | computation of conditional class probabilities given the variable value. |

201 | |

202 | Calling :obj:`VarClass.add_var_class(v, c)` is equivalent to |

203 | :obj:`Table.add(v, c)`. Like :obj:`Table`, |

204 | :obj:`VarClass` can compute contingency from instances. |

205 | |

206 | .. method:: __init__(feature, class_variable) |

207 | |

208 | Construct an instance of :obj:`VarClass` for the given pair of |

209 | variables. Inherited from :obj:`Table`. |

210 | |

211 | :param feature: Outer variable |

212 | :type feature: Orange.data.variable.Variable |

213 | :param class_variable: Class variable; used as ``innerVariable`` |

214 | :type class_variable: Orange.data.variable.Variable |

215 | |

216 | .. method:: __init__(feature, data[, weightId]) |

217 | |

218 | Compute the contingency table from data. |

219 | |

220 | :param feature: Outer variable |

221 | :type feature: Orange.data.variable.Variable |

222 | :param data: A set of instances |

223 | :type data: Orange.data.Table |

224 | :param weightId: meta attribute with weights of instances |

225 | :type weightId: int |

226 | |

227 | .. method:: p_class(value) |

228 | |

229 | Return the probability distribution of classes given the value of the |

230 | variable. |

231 | |

232 | :param value: The value of the variable |

233 | :type value: int, float, string or :obj:`Orange.data.Value` |

234 | :rtype: Orange.statistics.distribution.Distribution |

235 | |

236 | |

237 | .. method:: p_class(value, class_value) |

238 | |

239 | Returns the conditional probability of the class_value given the |

240 | feature value, p(class_value|value) (note the order of arguments!) |

241 | |

242 | :param value: The value of the variable |

243 | :type value: int, float, string or :obj:`Orange.data.Value` |

244 | :param class_value: The class value |

245 | :type class_value: int, float, string or :obj:`Orange.data.Value` |

246 | :rtype: float |

247 | |

248 | .. literalinclude:: code/statistics-contingency3.py |

249 | :lines: 1-23 |

250 | |

251 | The inner and the outer variable and their relations to the class are |

252 | as follows:: |

253 | |

254 | Inner variable: y |

255 | Outer variable: e |

256 | |

257 | Class variable: y |

258 | Feature: e |

259 | |

260 | Distributions are normalized, and probabilities are elements from the |

261 | normalized distributions. Knowing that the target concept is |

262 | y := (e=1) or (a=b), distributions are as expected: when e equals 1, class 1 |

263 | has a 100% probability, while for the rest, probability is one third, which |

264 | agrees with a probability that two three-valued independent features |

265 | have the same value. :: |

266 | |

267 | Distributions: |

268 | p(.|1) = <0.000, 1.000> |

269 | p(.|2) = <0.662, 0.338> |

270 | p(.|3) = <0.659, 0.341> |

271 | p(.|4) = <0.669, 0.331> |

272 | |

273 | Probabilities of class '1' |

274 | p(1|1) = 1.000 |

275 | p(1|2) = 0.338 |

276 | p(1|3) = 0.341 |

277 | p(1|4) = 0.331 |

278 | |

279 | Distributions from a matrix computed manually: |

280 | p(.|1) = <0.000, 1.000> |

281 | p(.|2) = <0.662, 0.338> |

282 | p(.|3) = <0.659, 0.341> |

283 | p(.|4) = <0.669, 0.331> |

284 | |

285 | |

286 | .. class:: ClassVar |

287 | |

288 | :obj:`ClassVar` is similar to :obj:`VarClass` except |

289 | that the class is outside and the variable is inside. This form of |

290 | contingency table is suitable for computing conditional probabilities of |

291 | variable given the class. All methods get the two arguments in the same |

292 | order as :obj:`VarClass`. |

293 | |

294 | .. method:: __init__(feature, class_variable) |

295 | |

296 | Construct an instance of :obj:`ClassVar` for the given pair of |

297 | variables. Inherited from :obj:`Table`, except for the reversed |

298 | order of arguments. |

299 | |

300 | :param feature: Inner variable |

301 | :type feature: Orange.data.variable.Variable |

302 | :param class_variable: Class variable |

303 | :type class_variable: Orange.data.variable.Variable |

304 | |

305 | .. method:: __init__(feature, data[, weightId]) |

306 | |

307 | Compute contingency table from the data. |

308 | |

309 | :param feature: Descriptor of the inner variable |

310 | :type feature: Orange.data.variable.Variable |

311 | :param data: A set of instances |

312 | :type data: Orange.data.Table |

313 | :param weightId: meta attribute with weights of instances |

314 | :type weightId: int |

315 | |

316 | .. method:: p_attr(class_value) |

317 | |

318 | Return the probability distribution of variable given the class. |

319 | |

320 | :param class_value: The class value |

321 | :type class_value: int, float, string or :obj:`Orange.data.Value` |

322 | :rtype: Orange.statistics.distribution.Distribution |

323 | |

324 | .. method:: p_attr(value, class_value) |

325 | |

326 | Returns the conditional probability of the value given the |

327 | class, p(value|class_value). |

328 | |

329 | :param value: Value of the variable |

330 | :type value: int, float, string or :obj:`Orange.data.Value` |

331 | :param class_value: Class value |

332 | :type class_value: int, float, string or :obj:`Orange.data.Value` |

333 | :rtype: float |

334 | |

335 | .. literalinclude:: code/statistics-contingency4.py |

336 | :lines: 1-27 |

337 | |

338 | The roles of the feature and the class are reversed compared to |

339 | :obj:`VarClass`:: |

340 | |

341 | Inner variable: e |

342 | Outer variable: y |

343 | |

344 | Class variable: y |

345 | Feature: e |

346 | |

347 | Distributions given the class can be printed out by calling :meth:`p_attr`. |

348 | |

349 | .. literalinclude:: code/statistics-contingency4.py |

350 | :lines: 30-31 |

351 | |

352 | will print:: |

353 | p(.|0) = <0.000, 0.333, 0.333, 0.333> |

354 | p(.|1) = <0.500, 0.167, 0.167, 0.167> |

355 | |

356 | If the class value is '0', the attribute `e` cannot be `1` (the first |

357 | value), while distribution across other values is uniform. If the class |

358 | value is `1`, `e` is `1` for exactly half of instances, and distribution of |

359 | other values is again uniform. |

360 | |
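The two distributions above follow directly from the absolute frequencies printed at the beginning of this documentation. A minimal sketch in plain Python, without Orange, assuming those counts: each class column is divided by its own total, which is what conditioning on the class amounts to.

```python
# counts[e_value] = (instances with class 0, instances with class 1)
counts = {"1": (0, 108), "2": (72, 36), "3": (72, 36), "4": (72, 36)}
e_values = sorted(counts)

by_class = {}
for class_index in (0, 1):
    # Take one column of the table and normalize it by the column sum.
    column = [counts[e][class_index] for e in e_values]
    total = float(sum(column))
    by_class[class_index] = [c / total for c in column]

for class_index in sorted(by_class):
    formatted = ", ".join("%.3f" % p for p in by_class[class_index])
    print("p(.|%d) = <%s>" % (class_index, formatted))
```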

361 | .. class:: VarVar |

362 | |

363 | Contingency table in which none of the variables is the class. The class |

364 | is derived from :obj:`Table`, and adds an additional constructor and |

365 | method for getting conditional probabilities. |

366 | |

367 | .. method:: __init__(outer_variable, inner_variable) |

368 | |

369 | Inherited from :obj:`Table`. |

370 | |

371 | .. method:: __init__(outer_variable, inner_variable, data[, weightId]) |

372 | |

373 | Compute the contingency from the given instances. |

374 | |

375 | :param outer_variable: Outer variable |

376 | :type outer_variable: Orange.data.variable.Variable |

377 | :param inner_variable: Inner variable |

378 | :type inner_variable: Orange.data.variable.Variable |

379 | :param data: A set of instances |

380 | :type data: Orange.data.Table |

381 | :param weightId: meta attribute with weights of instances |

382 | :type weightId: int |

383 | |

384 | .. method:: p_attr(outer_value) |

385 | |

386 | Return the probability distribution of the inner variable given the |

387 | outer variable value. |

388 | |

389 | :param outer_value: The value of the outer variable |

390 | :type outer_value: int, float, string or :obj:`Orange.data.Value` |

391 | :rtype: Orange.statistics.distribution.Distribution |

392 | |

393 | .. method:: p_attr(outer_value, inner_value) |

394 | |

395 | Return the conditional probability of the inner_value |

396 | given the outer_value. |

397 | |

398 | :param outer_value: The value of the outer variable |

399 | :type outer_value: int, float, string or :obj:`Orange.data.Value` |

400 | :param inner_value: The value of the inner variable |

401 | :type inner_value: int, float, string or :obj:`Orange.data.Value` |

402 | :rtype: float |

403 | |

404 | The following example investigates which material is used for |

405 | bridges of different lengths. |

406 | |

407 | .. literalinclude:: code/statistics-contingency5.py |

408 | :lines: 1-17 |

409 | |

410 | Short bridges are mostly wooden or iron, and the longer (and most of the |

411 | middle sized) are made from steel:: |

412 | |

413 | SHORT: |

414 | WOOD (56%) |

415 | IRON (44%) |

416 | |

417 | MEDIUM: |

418 | WOOD (9%) |

419 | IRON (11%) |

420 | STEEL (79%) |

421 | |

422 | LONG: |

423 | STEEL (100%) |

424 | |

425 | Like all other contingency tables, this one can also be computed "manually". |

426 | |

427 | .. literalinclude:: code/statistics-contingency5.py |

428 | :lines: 18- |

429 | |

430 | |

431 | Contingencies for entire domain |

432 | =============================== |

433 | |

434 | A list of contingency tables, either :obj:`VarClass` or |

435 | :obj:`ClassVar`. |

436 | |

437 | .. class:: Domain |

438 | |

439 | .. method:: __init__(data[, weightId=0, classIsOuter=0|1]) |

440 | |

441 | Compute a list of contingency tables. |

442 | |

443 | :param data: A set of instances |

444 | :type data: Orange.data.Table |

445 | :param weightId: meta attribute with weights of instances |

446 | :type weightId: int |

447 | :param classIsOuter: `True`, if class is the outer variable |

448 | :type classIsOuter: bool |

449 | |

450 | .. note:: |

451 | |

452 | ``classIsOuter`` cannot be given as positional argument, |

453 | but needs to be passed by keyword. |

454 | |

455 | .. attribute:: classIsOuter (read only) |

456 | |

457 | Tells whether the class is the outer or the inner variable. |

458 | |

459 | .. attribute:: classes |

460 | |

461 | Contains the distribution of class values on the entire dataset. |

462 | |

463 | .. method:: normalize() |

464 | |

465 | Call normalize for all contingencies. |

466 | |

467 | The following script prints the contingency tables for features |

468 | "a", "b" and "e" for the dataset Monk 1. |

469 | |

470 | .. literalinclude:: code/statistics-contingency8.py |

471 | :lines: 9 |

472 | |

473 | Contingency tables of type :obj:`VarClass` give |

474 | the conditional distributions of classes, given the value of the variable. |

475 | |

476 | .. literalinclude:: code/statistics-contingency8.py |

477 | :lines: 12- |

478 | |

479 | .. _contcont: |

480 | |

481 | Contingency tables for continuous variables |

482 | =========================================== |

483 | |

484 | If the outer variable is continuous, the index must be one of the |

485 | values that do exist in the contingency table; other values raise an |

486 | exception: |

487 | |

488 | .. literalinclude:: code/statistics-contingency6.py |

489 | :lines: 1-4,17- |

490 | |

491 | Since even rounding can be a problem, the only safe way to get the key |

492 | is to take it from the contingency's ``keys``. |

493 | |
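Why rounding matters can be shown with an ordinary dictionary keyed by floats. This sketch is plain Python and does not touch Orange; the dictionary `table` merely stands in for a contingency table with a continuous outer variable.

```python
# A table keyed by a continuous value, as a plain dict for illustration.
table = {0.3: "distribution for 0.3"}

computed = 0.1 + 0.2          # differs from 0.3 by a rounding error
print(computed == 0.3)        # False
print(computed in table)      # False: a direct lookup would fail

# Safe approach: pick the key from the table's own keys.
key = min(table.keys(), key=lambda k: abs(k - computed))
print(table[key])
```

Choosing the nearest existing key is one possible policy; the point is only that the index used is a value taken from ``keys``.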

494 | Contingency tables with a discrete outer variable and a continuous inner variable |

495 | are more useful, since methods :obj:`VarClass.p_class` |

496 | and :obj:`ClassVar.p_attr` use the primitive density estimation |

497 | provided by :obj:`Orange.statistics.distribution.Distribution`. |

498 | |

499 | For example, :obj:`ClassVar` on the iris dataset can return the |

500 | probability of the sepal length 5.5 for different classes: |

501 | |

502 | .. literalinclude:: code/statistics-contingency7.py |

503 | |

504 | The script outputs:: |

505 | |

506 | Estimated frequencies for e=5.5 |

507 | f(5.5|Iris-setosa) = 2.000 |

508 | f(5.5|Iris-versicolor) = 5.000 |

509 | f(5.5|Iris-virginica) = 1.000 |

510 | |

511 | """ |

512 | |

513 | from Orange.core import Contingency as Table |

514 | from Orange.core import ContingencyAttrAttr as VarVar |

515 | from Orange.core import ContingencyClass as Class |

516 | from Orange.core import ContingencyAttrClass as VarClass |

517 | from Orange.core import ContingencyClassAttr as ClassVar |

518 | |

519 | from Orange.core import DomainContingency as Domain |
