source: orange/docs/widgets/rst/visualize/scatterplot.rst @ 11050:e3c4699ca155

Revision 11050:e3c4699ca155, 10.3 KB checked in by Miha Stajdohar <miha.stajdohar@…>, 16 months ago

Widget docs From HTML to Sphinx.

.. _Scatter Plot:

Scatter Plot
============

.. image:: ../icons/Distributions.png

A standard scatterplot visualization with explorative analysis and intelligent data visualization enhancements.

Signals
-------

Inputs:
   - Examples (ExampleTable)
      Input data set.
   - Example Subset (ExampleTable)
      A subset of data instances from Examples.


Outputs:
   - Selected Examples (ExampleTable)
      A subset of examples that the user has manually selected from the scatterplot.
   - Unselected Examples (ExampleTable)
      All other examples (examples not included in the user's selection).


Description
-----------

The Scatterplot widget provides a standard 2-dimensional scatterplot visualization for both continuous and discrete-valued attributes. The data is displayed as a collection of points, each positioned on the horizontal axis by the value of the :obj:`X-axis attribute` and on the vertical axis by the value of the :obj:`Y-axis attribute`. Various properties of the graph, like the color, size and shape of the points, are controlled through the appropriate settings in the :obj:`Main` pane of the widget, while others (like legends and axis titles, maximum point size and jittering) are set in the :obj:`Settings` pane. The snapshot below shows a scatterplot of the Iris data set, with the size of the points proportional to the value of the sepal width attribute, and coloring matching that of the class attribute.

.. image:: images/Scatterplot-Iris.png

In the case of discrete attributes, jittering (:obj:`Jittering options`) should be used to circumvent the overlap of points that have the same value on both axes, and to obtain a plot where the density of points in a particular region corresponds better to the density of the data with that particular combination of values. As an example of such a plot, the scatterplot for the Titanic data reporting on the gender of the passenger and the traveling class is shown below; without jittering, the scatterplot would display only eight distinct points.

.. image:: images/Scatterplot-Titanic.png
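The effect of jittering can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the widget's own code: the ``jitter`` helper, the noise range and the integer encoding of the discrete values are all assumptions made for the example.

```python
import random


def jitter(values, amount=0.3, seed=None):
    """Return the values with uniform noise from [-amount, amount] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical encoding: eight instances that share one combination of
# two discrete values, e.g. sex = 0 and status = 2.
sex = [0] * 8
status = [2] * 8

plain = list(zip(sex, status))
jittered = list(zip(jitter(sex, seed=1), jitter(status, seed=2)))

print(len(set(plain)))     # number of distinct positions without jittering
print(len(set(jittered)))  # number of distinct positions with jittering
```

Without jittering all eight instances are drawn on top of each other; with jittering each one gets its own position near the shared value, so a denser cloud signals more instances.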

Most of the scatterplot options are quite standard, like those for selecting attributes for point colors, labels, shape and size (:obj:`Main` pane), or those that control the display of various elements in the graph like axis titles, grid lines, etc. (:obj:`Settings` pane). Beyond these, Orange's scatterplot also implements an intelligent visualization technique called VizRank, which is invoked through the :obj:`VizRank` button in the :obj:`Main` tab.
39
40Intelligent Data Visualization

If a data set has many (many!) attributes, it is impossible to manually scan through all the pairs of attributes to find interesting scatterplots. Intelligent data visualization techniques are about finding such visualizations automatically. Orange's Scatterplot includes one such tool called VizRank (Leban et al., 2006), which in the current implementation can be used only with classification data sets, that is, data sets where instances are labeled with a discrete class. The task of optimization is to find those scatterplot projections where instances with different class labels are well separated. For example, for the data set `brown-selected.tab <http://orange.biolab.si/doc/datasets/brown-selected.tab>`_ (comes with the Orange installation), the two attributes that best separate instances of different classes are displayed in the snapshot below, where we have also switched on the :obj:`Show Probabilities` option from Scatterplot's :obj:`Settings` pane. Notice that this projection appears at the top of :obj:`Projection list, most interesting first`, followed by a list of other potentially interesting projections. Selecting any of these changes the projection displayed in the scatterplot, so the list and associated projections can be inspected in this way.

.. image:: images/Scatterplot-VizRank-Brown.png
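The idea of ranking projections by class separation can be illustrated with a small, self-contained sketch. It is not VizRank's implementation: it scores every attribute pair exhaustively with a leave-one-out k-nearest neighbor estimate, and all function names and the toy data are assumptions made for the example.

```python
from itertools import combinations


def correct_class_probability(points, labels, i, k):
    """Share of the k nearest neighbours of point i (itself excluded) that have its class."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], other)), j)
        for j, other in enumerate(points)
        if j != i
    )
    neighbour_labels = [labels[j] for _, j in by_distance[:k]]
    return neighbour_labels.count(labels[i]) / len(neighbour_labels)


def score_projection(data, labels, pair, k):
    """Average probability assigned to the correct class in one two-attribute projection."""
    points = [(row[pair[0]], row[pair[1]]) for row in data]
    probs = [correct_class_probability(points, labels, i, k) for i in range(len(points))]
    return sum(probs) / len(probs)


def rank_projections(data, labels, k=3):
    """Score every attribute pair and list the pairs, most interesting first."""
    pairs = combinations(range(len(data[0])), 2)
    return sorted(((score_projection(data, labels, p, k), p) for p in pairs), reverse=True)


# Hypothetical toy data: the classes are separable only when attributes 0 and 1
# are viewed together; attribute 2 is constant and carries no information.
data = [
    (0.0, 0.0, 0.5), (0.1, 0.1, 0.5), (1.0, 1.0, 0.5), (0.9, 0.9, 0.5),  # class "A"
    (0.0, 1.0, 0.5), (0.1, 0.9, 0.5), (1.0, 0.0, 0.5), (0.9, 0.1, 0.5),  # class "B"
]
labels = ["A"] * 4 + ["B"] * 4

ranking = rank_projections(data, labels, k=1)
for score, pair in ranking:
    print(pair, round(score, 2))
```

In this toy case the pair (0, 1) comes out on top because each instance's nearest neighbor in that projection belongs to the same class, while in the other two projections instances of different classes coincide.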

The number of different projections that can be considered by VizRank may be quite high, so VizRank searches the space of possible projections heuristically. The search is invoked by pressing :obj:`Start Evaluating Projections` and may be stopped at any time. A search through modifications of top-rated projections (replacing one of the two attributes with another one) is invoked by pressing the :obj:`Locally Optimize Best Projections` button.

.. image:: images/Scatterplot-VizRank-Settings.png
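A local search of this kind, which swaps one of the two attributes for another, might look like the sketch below. The helper names and the score table are hypothetical, and repeating the swap until no neighbor improves the score is an assumption of this example rather than a description of what the button does internally.

```python
def neighbours(pair, n_attrs):
    """All pairs that differ from `pair` in exactly one attribute."""
    return sorted(
        {tuple(sorted((keep, new))) for keep in pair for new in range(n_attrs) if new not in pair}
    )


def locally_optimize(pair, n_attrs, score):
    """Keep moving to the best neighbouring pair while the score improves."""
    current, current_score = pair, score(pair)
    while True:
        best = max(neighbours(current, n_attrs), key=score)
        if score(best) <= current_score:
            return current, current_score
        current, current_score = best, score(best)


# Hypothetical scores for all pairs of four attributes.
scores = {(0, 1): 0.5, (0, 2): 0.6, (0, 3): 0.4, (1, 2): 0.7, (1, 3): 0.9, (2, 3): 0.8}

result = locally_optimize((0, 1), 4, scores.get)
print(result)
```

Starting from the pair (0, 1), the search moves to (1, 3) and stops there, since none of the pairs reachable by one more swap scores higher.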

VizRank's options are quite elaborate, and if you are not an expert in machine learning it is best to leave them at their defaults. The options are grouped according to the different aspects of the method as described in (Leban et al., 2006). The projections are evaluated by testing a selected classifier (:obj:`Projection evaluation method`; the default is k-nearest neighbor classification) using some standard evaluation technique (:obj:`Testing method`). For very large data sets, use sampling to speed up the evaluation (:obj:`Percent of data used`). Visualizations are then ranked according to the prediction accuracy (:obj:`Measure of classification success`); in our own tests, :obj:`Average Probability Assigned to the Correct Class` worked somewhat better than more standard measures like :obj:`Classification Accuracy` or :obj:`Brier Score`. To avoid an exhaustive search for data sets with many attributes, the attributes are ranked by heuristics (:obj:`Measure for attribute ranking`), so that the most likely projection candidates are tested first. The number of items in the list of projections is controlled by :obj:`Maximum length of projection list`.
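To make the difference between the three measures concrete, here is a minimal sketch that computes them from predicted class probabilities. The function names and the two hypothetical sets of predictions are assumptions for illustration; the Brier score is taken in its usual form (squared differences summed over classes and averaged over instances, lower is better).

```python
def classification_accuracy(probs, actual):
    """Share of instances whose most probable class is the actual class."""
    hits = sum(1 for p, c in zip(probs, actual) if max(p, key=p.get) == c)
    return hits / len(actual)


def average_probability(probs, actual):
    """Average probability assigned to the correct class."""
    return sum(p.get(c, 0.0) for p, c in zip(probs, actual)) / len(actual)


def brier_score(probs, actual):
    """Mean over instances of the squared error summed over classes."""
    total = 0.0
    for p, c in zip(probs, actual):
        for cls in set(p) | {c}:
            target = 1.0 if cls == c else 0.0
            total += (p.get(cls, 0.0) - target) ** 2
    return total / len(actual)


# Hypothetical predictions for three instances whose actual class is "A".
actual = ["A", "A", "A"]
confident = [{"A": 0.9, "B": 0.1}, {"A": 0.9, "B": 0.1}, {"A": 0.4, "B": 0.6}]
hesitant = [{"A": 0.6, "B": 0.4}, {"A": 0.6, "B": 0.4}, {"A": 0.4, "B": 0.6}]

for name, probs in (("confident", confident), ("hesitant", hesitant)):
    print(name,
          round(classification_accuracy(probs, actual), 3),
          round(average_probability(probs, actual), 3),
          round(brier_score(probs, actual), 3))
```

Both sets of predictions get the same classification accuracy, but the probability-based measures tell them apart, which is one way to see why a probability-based score can rank projections more finely.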

A set of tools for the management and post-analysis of the list of projections is available under the :obj:`Manage & Save` tab. Here you may decide which classes the visualizations should separate (by default, all the classes). The projection list can be saved (:obj:`Save` in the :obj:`Manage projections` group) or loaded (:obj:`Load`), and a set of best visualizations may be saved (:obj:`Saved Best Graphs`). :obj:`Reevaluate Projections` is used when you have loaded the list of best projections from a file, but the actual data has changed since the last evaluation. For evaluating the current projection without engaging the projection search there is an :obj:`Evaluate Projection` button. Projections are evaluated based on the performance of k-nearest neighbor classifiers, and the results of these evaluations, in terms of which data instances were correctly or incorrectly classified, are available through the two :obj:`Show k-NN` buttons.

.. image:: images/Scatterplot-VizRank-ManageSave.png

Based on a set of interesting projections found by VizRank, a number of post-analysis tools are available. :obj:`Attribute Ranking` displays a graph which shows how many times each attribute appears in the top-rated projections. Bars can be colored according to the class with the maximal average value of the attribute. :obj:`Attribute Interactions` displays a heat map showing how many times each pair of attributes appeared together in the top-rated projections. :obj:`Graph Projection Scores` displays the distribution of projection scores.

.. image:: images/Scatterplot-VizRank-AttributeHistogram.png

.. image:: images/Scatterplot-VizRank-Interactions.png

.. image:: images/Scatterplot-VizRank-Scores.png
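The counts behind the attribute ranking and attribute interaction displays can be sketched as follows. The list of top projections and the attribute names are hypothetical, and the code only illustrates the counting, not how the widget draws or stores its results.

```python
from collections import Counter

# Hypothetical list of top-rated projections, each a pair of attribute names.
top_projections = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "d")]

# Attribute ranking: how many times each attribute appears in the list.
attribute_counts = Counter(attr for pair in top_projections for attr in pair)

# Attribute interactions: how many times each unordered pair appears.
pair_counts = Counter(tuple(sorted(pair)) for pair in top_projections)

print(attribute_counts.most_common())
print(pair_counts.most_common())
```

Here attribute ``a`` appears most often, and the pair ``a``/``b`` is the only one that occurs more than once.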

The list of best-rated projections may also be used for the search and analysis of outliers. The idea is that outliers are those data instances which are incorrectly classified in many of the top visualizations. For example, the class of the 33rd instance in `brown-selected.tab <http://orange.biolab.si/doc/datasets/brown-selected.tab>`_ should be Resp, but this instance is quite often misclassified as Ribo. The snapshot below shows one particular visualization displaying why such misclassification occurs. Perhaps the most important part of the :obj:`Outlier Identification` window is the list in the lower left (:obj:`Show predictions for all examples`) of candidates for outliers, sorted by the probabilities of classification to the right class. In our case, the most likely outlier is instance 171, followed by instance 33, both with probabilities of classification to the right class below 0.5.

.. image:: images/Scatterplot-VizRank-Outliers.png
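One simple way to turn this idea into a ranking is sketched below: for every instance, average the probability assigned to its correct class over the top projections, then sort in ascending order. Averaging is an assumption of this sketch (the text above only says that candidates are sorted by these probabilities), and the numbers are made up.

```python
def outlier_candidates(correct_class_probs):
    """Rank instances by their average correct-class probability, lowest first.

    correct_class_probs[p][i] is the probability that projection p assigns
    to the true class of instance i.
    """
    n_projections = len(correct_class_probs)
    n_instances = len(correct_class_probs[0])
    averages = [
        sum(proj[i] for proj in correct_class_probs) / n_projections
        for i in range(n_instances)
    ]
    return sorted(enumerate(averages), key=lambda item: item[1])


# Hypothetical probabilities: three top projections, four instances.
probs = [
    [0.9, 0.2, 0.8, 0.6],
    [1.0, 0.4, 0.7, 0.3],
    [0.8, 0.3, 0.9, 0.5],
]

ranked = outlier_candidates(probs)
suspects = [i for i, avg in ranked if avg < 0.5]
print(ranked)
print(suspects)
```

Instances at the head of the list, especially those whose average falls below 0.5, are the ones worth inspecting as possible outliers.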

Explorative Data Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: images/Scatterplot-ZoomSelect.png

Scatterplot, together with the rest of Orange's widgets, provides an explorative data analysis environment by supporting zooming in and out of a part of the plot and selection of data instances. These functions are enabled through the :obj:`Zoom/Select` toolbox. The default tool is zoom: left-click and drag on the plot area to define the rectangular area to zoom in to. Right-click to zoom out. The next two buttons in this toolbar are rectangular and polygon selection. Selections are stacked and can be removed in order from the last one defined, or all at once (the back-arrow and cross buttons in the toolbar). The last button in the toolbar is used to resend the data from this widget. Since this is done automatically after every change of the selection, this last function is not particularly useful. An example of a simple schema where we selected data instances from two polygon regions and sent them to the Data Table widget is shown below. Notice that by counting the dots in the scatterplot there should be 12 data instances selected, whereas the data table shows 17. This is because some data instances overlap (have the same values of the two attributes used); we could use jittering to expose them.

.. image:: images/Scatterplot-Iris-Selection.png
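The mismatch between visible dots and selected instances is easy to reproduce outside the widget. The sketch below uses a handful of made-up (x, y) values, not the actual selection from the snapshot, and simply counts how many instances share each plotted position.

```python
from collections import Counter

# Hypothetical (x, y) values of the selected instances.
selected = [(5.1, 1.4), (5.1, 1.4), (4.9, 1.5), (5.0, 1.4), (5.0, 1.4), (5.0, 1.4)]

dots = Counter(selected)  # plotted position -> number of instances drawn there
overlapping = {point: n for point, n in dots.items() if n > 1}

print(len(selected))  # number of selected data instances
print(len(dots))      # number of distinct dots visible in the plot
print(overlapping)
```

The number of instances exceeds the number of distinct dots whenever several instances share both attribute values.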


Examples
--------

Scatterplot can be nicely combined with other widgets that output a list of selected data instances. For example, a combination of classification tree and scatterplot, as shown below, makes for a nice exploratory tool displaying data instances pertinent to a chosen classification tree node (clicking on any node of the classification tree sends the set of selected data instances to the scatterplot, updating the visualization and marking the selected instances with filled symbols).

.. image:: images/Scatterplot-ClassificationTree.png


References
----------

   - Leban G, Zupan B, Vidmar G, Bratko I. VizRank: Data Visualization Guided by Machine Learning. Data Mining and Knowledge Discovery 13(2): 119-136, 2006.
   - Mramor M, Leban G, Demsar J, Zupan B. Visualization-based cancer microarray data classification analysis. Bioinformatics 23(16): 2147-2154, 2007.