source: orange/docs/widgets/rst/visualize/scatterplot.rst @ 11359:8d54e79aa135

Revision 11359:8d54e79aa135, 10.1 KB checked in by Ales Erjavec <ales.erjavec@…>, 14 months ago

Cleanup of 'Widget catalog' documentation.

Fixed rst text formating, replaced dead hardcoded reference links (now using
:ref:), etc.

.. _Scatter Plot:

Scatter Plot
============

.. image:: ../icons/Distributions.png

A standard scatterplot visualization with explorative analysis and intelligent
data visualization enhancements.

Signals
-------

Inputs:
   - Examples (ExampleTable)
      Input data set.
   - Example Subset (ExampleTable)
      A subset of data instances from Examples.


Outputs:
   - Selected Examples (ExampleTable)
      A subset of examples that the user has manually selected from the
      scatterplot.
   - Unselected Examples (ExampleTable)
      All other examples (examples not included in the user's selection).


Description
-----------

The Scatterplot widget provides a standard two-dimensional scatterplot
visualization for both continuous and discrete-valued attributes. The data is
displayed as a collection of points, with the value of the
:obj:`X-axis attribute` determining each point's position on the horizontal
axis and the value of the :obj:`Y-axis attribute` determining its position on
the vertical axis. Various properties of the graph, like the color, size and
shape of the points, are controlled through the appropriate settings in the
:obj:`Main` pane of the widget, while others (like legends and axis titles,
maximum point size and jittering) are set in the :obj:`Settings` pane. The
snapshot below shows a scatterplot of the Iris data set, with the size of the
points proportional to the value of the sepal width attribute, and coloring
matching that of the class attribute.

.. image:: images/Scatterplot-Iris.png

In the case of discrete attributes, jittering (:obj:`Jittering options`)
should be used to circumvent the overlap of points that have the same value on
both axes, and to obtain a plot where the density of points in a particular
region corresponds better to the density of the data with that particular
combination of values. As an example of such a plot, the scatterplot for the
Titanic data reporting on the gender of the passenger and the traveling class
is shown below; without jittering, the scatterplot would display only eight
distinct points.
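The idea behind jittering can be sketched in a few lines of plain Python. This
is only an illustration, not the widget's implementation: the helper
``jitter``, its ``amount`` and ``seed`` parameters, and the toy data (two
integer-coded discrete attributes with 2 and 4 values) are assumptions made
for this example.

```python
import random

def jitter(points, amount=0.3, seed=0):
    """Return a copy of `points` with uniform noise in [-amount, amount]
    added to both coordinates (hypothetical helper, not an Orange function)."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount)) for x, y in points]

# Toy data: every combination of a 2-valued and a 4-valued attribute,
# repeated 25 times, so many instances share exactly the same position.
points = [(g, c) for g in range(2) for c in range(4)] * 25

distinct_before = len(set(points))   # 2 * 4 = 8 positions on the plot
jittered = jitter(points)
distinct_after = len(set(jittered))  # overlapping instances are spread apart
```

Keeping ``amount`` below half the distance between neighboring attribute
values keeps each jittered point closest to its original position.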

.. image:: images/Scatterplot-Titanic.png

Most of the scatterplot options are quite standard, like those for selecting
attributes for point colors, labels, shape and size (:obj:`Main` pane), or
those that control the display of various elements in the graph like axis
titles, grid lines, etc. (:obj:`Settings` pane). Beyond these, Orange's
scatterplot also implements an intelligent visualization technique called
VizRank, which is invoked through the :obj:`VizRank` button in the :obj:`Main`
tab.

Intelligent Data Visualization

If a data set has many attributes, it is impractical to manually scan through
all the pairs of attributes to find interesting scatterplots. Intelligent data
visualization techniques are about finding such visualizations automatically.
Orange's Scatterplot includes one such tool called VizRank ([Leban2006]_),
which in the current implementation can be used only with classification data
sets, that is, data sets where instances are labeled with a discrete class.
The task of optimization is to find those scatterplot projections where
instances with different class labels are well separated. For example, for the
data set
`brown-selected.tab <http://orange.biolab.si/doc/datasets/brown-selected.tab>`_
(which comes with the Orange installation), the two attributes that best
separate instances of different classes are displayed in the snapshot below,
where we have also switched on the :obj:`Show Probabilities` option from
Scatterplot's :obj:`Settings` pane. Notice that this projection appears at the
top of :obj:`Projection list, most interesting first`, followed by a list of
other potentially interesting projections. Selecting any of these changes the
projection displayed in the scatterplot, so the list and associated
projections can be inspected in this way.
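The ranking of projections by class separation can be illustrated with a small
self-contained sketch. It is not VizRank's code: the function names, the
exhaustive enumeration of all attribute pairs, the use of leave-one-out 1-NN
accuracy as the score, and the toy data are all assumptions chosen to keep the
example short.

```python
from collections import Counter
from itertools import combinations
from math import dist

def knn_loo_accuracy(points, labels, k=1):
    """Leave-one-out accuracy of a k-NN classifier on 2-D projected points."""
    correct = 0
    for i, p in enumerate(points):
        # Sort all other instances by their distance to instance i.
        others = sorted((dist(p, q), labels[j])
                        for j, q in enumerate(points) if j != i)
        votes = Counter(label for _, label in others[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(points)

def rank_projections(rows, labels, k=1):
    """Score every attribute pair; return (score, pair) tuples, best first."""
    scored = []
    for a, b in combinations(range(len(rows[0])), 2):
        projected = [(row[a], row[b]) for row in rows]
        scored.append((knn_loo_accuracy(projected, labels, k), (a, b)))
    return sorted(scored, key=lambda item: -item[0])

# Toy data: the class depends on attributes 0 and 1 jointly,
# attribute 2 carries no class information.
rows = [(0.0, 0.0, 1.0), (0.2, 0.1, 2.0), (5.0, 5.0, 1.1), (5.2, 5.1, 2.1),
        (0.1, 5.0, 1.05), (0.3, 5.2, 2.05), (5.1, 0.0, 1.15), (5.3, 0.2, 2.15)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

ranked = rank_projections(rows, labels)
best_score, best_pair = ranked[0]
```

On this toy data only the projection onto attributes 0 and 1 separates the
two classes, so it ends up at the top of the list.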

.. image:: images/Scatterplot-VizRank-Brown.png

The number of different projections that can be considered by VizRank may be
quite high, so VizRank searches the space of possible projections
heuristically. The search is invoked by pressing
:obj:`Start Evaluating Projections` and may be stopped at any time. A search
through modifications of the top-rated projections (replacing one of the two
attributes with another one) is invoked by pressing the
:obj:`Locally Optimize Best Projections` button.
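One plausible reading of such a local search is a simple hill climb over
attribute pairs, sketched below. The function names, the integer attribute
indices and the ``score`` callback are hypothetical; the actual strategy used
by the widget may differ.

```python
def neighbor_projections(pair, n_attrs):
    """All pairs obtained by replacing exactly one attribute of `pair`."""
    a, b = pair
    neighbors = set()
    for c in range(n_attrs):
        if c not in pair:
            neighbors.add(tuple(sorted((c, b))))  # replace a with c
            neighbors.add(tuple(sorted((a, c))))  # replace b with c
    return sorted(neighbors)

def locally_optimize(pair, n_attrs, score):
    """Hill-climb from `pair`: move to the best-scoring neighbor until no
    neighbor improves on the current projection."""
    best, best_score = pair, score(pair)
    improved = True
    while improved:
        improved = False
        for candidate in neighbor_projections(best, n_attrs):
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
    return best, best_score

# Made-up projection scores for four attributes, for illustration only.
toy_scores = {(0, 1): 0.5, (0, 2): 0.6, (0, 3): 0.4,
              (1, 2): 0.55, (1, 3): 0.3, (2, 3): 0.9}
result = locally_optimize((0, 1), 4, toy_scores.get)
```

Starting from the pair (0, 1), the climb first moves to (0, 2) and then to
(2, 3), where no single-attribute replacement scores higher.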

.. image:: images/Scatterplot-VizRank-Settings.png
   :align: left

VizRank's options are quite elaborate, and if you are not an expert in machine
learning it would be best to leave them at their defaults. The options are
grouped according to the different aspects of the method as described in
[Leban2006]_. The projections are evaluated by testing a selected classifier
(:obj:`Projection evaluation method`; the default is k-nearest neighbor
classification) using some standard evaluation technique
(:obj:`Testing method`). For very large data sets, use sampling to speed up
the evaluation (:obj:`Percent of data used`). Visualizations are then ranked
according to the prediction accuracy
(:obj:`Measure of classification success`); in our own tests,
:obj:`Average Probability Assigned to the Correct Class` worked somewhat
better than more standard measures like :obj:`Classification Accuracy` or
:obj:`Brier Score`. To avoid an exhaustive search for data sets with many
attributes, the attributes are ranked by a heuristic
(:obj:`Measure for attribute ranking`), so that the most likely projection
candidates are tested first. The number of items in the list of projections is
controlled by :obj:`Maximum length of projection list`.
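To make the three measures of classification success concrete, the sketch
below computes them from predicted class probabilities. The function names and
the toy predictions are assumptions, and the Brier score is written in its
common multi-class form (mean squared difference between the predicted
probabilities and the 0/1 class indicator); the widget's exact definitions may
differ in detail.

```python
def classification_accuracy(probs, truth):
    """Fraction of instances whose most probable class is the true class."""
    hits = sum(max(p, key=p.get) == t for p, t in zip(probs, truth))
    return hits / len(truth)

def average_probability(probs, truth):
    """Mean probability assigned to the true class."""
    return sum(p[t] for p, t in zip(probs, truth)) / len(truth)

def brier_score(probs, truth):
    """Mean squared error of the probabilities (lower is better)."""
    total = 0.0
    for p, t in zip(probs, truth):
        total += sum((p[c] - (1.0 if c == t else 0.0)) ** 2 for c in p)
    return total / len(truth)

truth = ["A", "B"]
# Two hypothetical projections: both classify correctly, one more confidently.
confident = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}]
hesitant = [{"A": 0.6, "B": 0.4}, {"A": 0.45, "B": 0.55}]
```

Both toy projections reach the same classification accuracy, while the average
probability and the Brier score tell them apart.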


.. image:: images/Scatterplot-VizRank-ManageSave.png
   :align: left

A set of tools that deal with management and post-analysis of the list of
projections is available under the :obj:`Manage & Save` tab. Here you may
decide which classes the visualizations should separate (the default is to
separate all the classes). The projection list can be saved (:obj:`Save` in
the :obj:`Manage projections` group) or loaded (:obj:`Load`), and a set of the
best visualizations may be saved (:obj:`Saved Best Graphs`).
:obj:`Reevaluate Projections` is used when you have loaded the list of best
projections from a file, but the actual data has changed since the last
evaluation. To evaluate the current projection without engaging the
projection search, use the :obj:`Evaluate Projection` button. Projections
are evaluated based on the performance of k-nearest neighbor classifiers, and
the results of these evaluations, in terms of which data instances were
correctly or incorrectly classified, are available through the two
:obj:`Show k-NN` buttons.


Based on the set of interesting projections found by VizRank, a number of
post-analysis tools are available. :obj:`Attribute Ranking` displays a graph
which shows how many times each attribute appears in the top-rated
projections. Bars can be colored according to the class with the maximal
average value of the attribute. :obj:`Attribute Interactions` displays a heat
map showing how many times each pair of attributes appears together in the
top-rated projections. :obj:`Graph Projection Scores` displays the
distribution of projection scores.
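The counts behind the attribute ranking and attribute interaction displays can
be sketched as follows. The function names and the attribute names in the toy
list of top-rated projections are made up for illustration.

```python
from collections import Counter

def attribute_counts(projections):
    """How many top-rated projections each attribute appears in."""
    return Counter(attr for pair in projections for attr in pair)

def interaction_counts(projections):
    """How many top-rated projections each unordered attribute pair forms."""
    return Counter(frozenset(pair) for pair in projections)

# Hypothetical list of top-rated projections (attribute names are made up).
top_projections = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "d")]

ranking = attribute_counts(top_projections).most_common()
interactions = interaction_counts(top_projections)
```

Using ``frozenset`` makes the pair ``("a", "b")`` count the same as
``("b", "a")``, which is what a symmetric heat map needs.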

.. image:: images/Scatterplot-VizRank-AttributeHistogram.png

.. image:: images/Scatterplot-VizRank-Interactions.png

.. image:: images/Scatterplot-VizRank-Scores.png

The list of best-rated projections may also be used for the search and
analysis of outliers. The idea is that outliers are those data instances that
are incorrectly classified in many of the top visualizations. For example, the
class of the 33rd instance in `brown-selected.tab
<http://orange.biolab.si/doc/datasets/brown-selected.tab>`_ should be Resp,
but this instance is quite often misclassified as Ribo. The snapshot below
shows one particular visualization displaying why such misclassification
occurs. Perhaps the most important part of the :obj:`Outlier Identification`
window is the list in the lower left
(:obj:`Show predictions for all examples`) of candidates for outliers, sorted
by the probability of classification to the correct class. In our case, the
most likely outlier is instance 171, followed by instance 33, both with
probabilities of classification to the correct class below 0.5.
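A minimal sketch of this kind of outlier ranking is given below. It assumes
that, for every instance, we already have the probability that each top-rated
projection assigns to the instance's true class, and that these probabilities
are simply averaged; both the averaging and the function name are assumptions,
and the numbers are invented.

```python
def outlier_candidates(correct_class_probs):
    """Return (instance index, mean probability of the true class) pairs,
    least likely instances first.

    correct_class_probs[i][j] is the probability that projection j assigns
    to the true class of instance i.
    """
    means = [(sum(row) / len(row), i)
             for i, row in enumerate(correct_class_probs)]
    return [(i, mean) for mean, i in sorted(means)]

# Invented probabilities for four instances over three top projections.
probs = [[0.9, 0.8, 0.95],
         [0.3, 0.4, 0.2],
         [0.7, 0.6, 0.8],
         [0.45, 0.5, 0.4]]

candidates = outlier_candidates(probs)
likely_outliers = [i for i, mean in candidates if mean < 0.5]
```

Instances whose mean probability falls below 0.5 are more often assigned to a
wrong class than to the correct one, which makes them natural candidates for
closer inspection.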

.. image:: images/Scatterplot-VizRank-Outliers.png

Explorative Data Analysis

.. image:: images/Scatterplot-ZoomSelect.png

Scatterplot, together with the rest of Orange's widgets, provides an
explorative data analysis environment by supporting zooming in and out of a
part of the plot and selection of data instances. These functions are enabled
through the :obj:`Zoom/Select` toolbox. The default tool is zoom: left-click
and drag on the plot area to define the rectangular area to zoom in on.
Right-click to zoom out. The next two buttons in this toolbar are rectangular
and polygon selection. Selections are stacked and can be removed in order from
the last one defined, or all at once (the back-arrow and cross buttons in the
toolbar). The last button in the toolbar is used to resend the data from this
widget. Since this is done automatically after every change of the selection,
this last function is not particularly useful. An example of a simple schema
where we selected data instances from two polygon regions and sent them to the
:ref:`Data Table` widget is shown below. Notice that counting the dots in the
scatterplot suggests that 12 data instances are selected, whereas the data
table shows 17. This is because some data instances overlap (have the same
values of the two attributes used); we could use jittering to expose them.
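The difference between the number of visible dots and the number of selected
instances can be reproduced with a short sketch. It uses rectangular regions
instead of polygons to stay brief, and the function names and data are
assumptions; the two returned lists loosely correspond to the widget's
Selected Examples and Unselected Examples outputs.

```python
def in_rectangle(point, rect):
    """True if `point` lies inside rect = (x_min, y_min, x_max, y_max)."""
    x, y = point
    x_min, y_min, x_max, y_max = rect
    return x_min <= x <= x_max and y_min <= y <= y_max

def select(points, rectangles):
    """Split points into those inside any of the stacked selections
    and all the others."""
    selected, unselected = [], []
    for p in points:
        inside = any(in_rectangle(p, r) for r in rectangles)
        (selected if inside else unselected).append(p)
    return selected, unselected

# Invented data in which several instances share the same position.
points = [(1, 1), (1, 1), (1, 2), (2, 2), (2, 2), (2, 2), (5, 5), (6, 1)]
selected, unselected = select(points, [(0, 0, 3, 3)])

visible_dots = len(set(selected))    # distinct positions inside the selection
selected_instances = len(selected)   # what a data table would list
```

Here three dots are visible inside the selection, but six instances are
selected, because overlapping instances are drawn on top of each other.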

.. image:: images/Scatterplot-Iris-Selection.png


Examples
--------

Scatterplot can be nicely combined with other widgets that output a list of
selected data instances. For example, a combination of a classification tree
and a scatterplot, as shown below, makes for a nice exploratory tool displaying
data instances pertinent to a chosen classification tree node (clicking on any
node of the classification tree sends a set of selected data instances to the
scatterplot, updating the visualization and marking the selected instances
with filled symbols).

.. image:: images/Scatterplot-ClassificationTree.png


References
----------

.. [Leban2006] Leban G, Zupan B, Vidmar G, Bratko I. VizRank: Data
   Visualization Guided by Machine Learning. Data Mining and Knowledge
   Discovery 13(2): 119-136, 2006.