..
   Source: orange/docs/widgets/rst/visualize/scatterplot.rst @ 11477:c1fdcc4deb7b
   Revision 11477:c1fdcc4deb7b, 9.9 KB, checked in by Miha Stajdohar
   <miha.stajdohar@…>, 12 months ago. Commit message: Removed broken links.

.. _Scatter Plot:

Scatter Plot
============

.. image:: ../icons/Distributions.png

A standard scatterplot visualization with explorative analysis and intelligent
data visualization enhancements.

Signals
-------

Inputs:
   - Examples (ExampleTable)
      Input data set.
   - Example Subset (ExampleTable)
      A subset of data instances from Examples.


Outputs:
   - Selected Examples (ExampleTable)
      A subset of examples that the user has manually selected from the
      scatterplot.
   - Unselected Examples (ExampleTable)
      All other examples (examples not included in the user's selection).


Description
-----------

The Scatterplot widget provides a standard two-dimensional scatterplot
visualization for both continuous and discrete-valued attributes. The data is
displayed as a collection of points, with the value of :obj:`X-axis attribute`
determining each point's position on the horizontal axis and the value of
:obj:`Y-axis attribute` determining its position on the vertical axis.
Properties of the graph such as the color, size and shape of the points are
controlled through the corresponding settings in the :obj:`Main` pane of the
widget, while others (such as legends, axis titles, maximum point size and
jittering) are set in the :obj:`Settings` pane. The snapshot below shows a
scatterplot of the Iris data set, with the size of the points proportional to
the value of the sepal width attribute and the coloring matching the class
attribute.

.. image:: images/Scatterplot-Iris.png

In the case of discrete attributes, jittering (:obj:`Jittering options`)
should be used to circumvent the overlap of points that have the same value on
both axes, and to obtain a plot in which the density of points in a particular
region corresponds better to the density of the data with that particular
combination of values. As an example of such a plot, the scatterplot for the
Titanic data reporting on the gender of the passenger and the traveling class
is shown below; without jittering, the scatterplot would display only eight
distinct points.

.. image:: images/Scatterplot-Titanic.png

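The effect of jittering on discrete attributes can be illustrated outside the
widget. The following is a minimal sketch in plain Python, not the widget's
own code; the ``jitter`` helper, its ``amount`` parameter and the toy
passenger records are assumptions made for illustration.

```python
import random

def jitter(codes, amount=0.3, seed=0):
    """Add uniform noise in [-amount, amount] to integer-coded discrete values."""
    rng = random.Random(seed)
    return [c + rng.uniform(-amount, amount) for c in codes]

# Toy records (hypothetical): 2 gender codes x 4 class codes, 5 passengers each.
records = [(g, c) for g in (0, 1) for c in (0, 1, 2, 3)] * 5
xs = [g for g, _ in records]
ys = [c for _, c in records]

jx = jitter(xs, seed=1)
jy = jitter(ys, seed=2)

# Without jittering, records with equal values collapse onto one position.
plain_positions = set(zip(xs, ys))
jittered_positions = set(zip(jx, jy))

print(len(records), "records")
print(len(plain_positions), "distinct positions without jittering")
print(len(jittered_positions), "distinct positions with jittering")
```

Keeping ``amount`` below 0.5 leaves every jittered point closer to its own
integer code than to the neighbouring one, so the groups stay visually apart.
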
Most of the scatterplot options are quite standard, like those for selecting
attributes for point colors, labels, shape and size (:obj:`Main` pane), or
those that control the display of various elements in the graph, like axis
titles, grid lines, etc. (:obj:`Settings` pane). Beyond these, Orange's
scatterplot also implements an intelligent visualization technique called
VizRank, which is invoked through the :obj:`VizRank` button in the
:obj:`Main` tab.

Intelligent Data Visualization

If a data set has many attributes, it is impossible to manually scan through
all the pairs of attributes to find interesting scatterplots. Intelligent data
visualization techniques are about finding such visualizations automatically.
Orange's Scatterplot includes one such tool called VizRank ([1]_), which in
its current implementation can be used only with classification data sets,
that is, data sets where instances are labeled with a discrete class. The task
of optimization is to find those scatterplot projections in which instances
with different class labels are well separated. For example, for the data set
brown-selected.tab (which comes with the Orange installation), the two
attributes that best separate instances of different classes are displayed in
the snapshot below, where we have also switched on the
:obj:`Show Probabilities` option from Scatterplot's :obj:`Settings` pane.
Notice that this projection appears at the top of
:obj:`Projection list, most interesting first`, followed by a list of other
potentially interesting projections. Selecting any of these changes the
projection displayed in the scatterplot, so the list and the associated
projections can be inspected in this way.

.. image:: images/Scatterplot-VizRank-Brown.png

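The idea of ranking projections by how well a classifier separates the classes
can be sketched in a few lines of plain Python. This is an illustrative sketch
only, not VizRank's implementation: it scores every attribute pair
exhaustively rather than heuristically, uses leave-one-out testing with a
fixed ``k``, and estimates the probability of the correct class as the share
of the ``k`` nearest neighbours carrying that class. The function names and
the toy data are assumptions.

```python
from itertools import combinations
from math import dist

def avg_prob_correct(points, labels, k=3):
    """Leave-one-out k-NN: mean share of the k nearest neighbours that carry
    the instance's own class label."""
    total = 0.0
    for i, p in enumerate(points):
        others = [(dist(p, q), labels[j]) for j, q in enumerate(points) if j != i]
        others.sort(key=lambda item: item[0])
        nearest = others[:k]
        total += sum(1 for _, lab in nearest if lab == labels[i]) / k
    return total / len(points)

def rank_projections(data, labels, names, k=3):
    """Score every attribute pair; return (score, (name_x, name_y)), best first."""
    scored = []
    for a, b in combinations(range(len(names)), 2):
        points = [(row[a], row[b]) for row in data]
        scored.append((avg_prob_correct(points, labels, k), (names[a], names[b])))
    return sorted(scored, reverse=True)

# Toy data (hypothetical): the classes lie on two parallel diagonals, so only
# the pair (x, y) separates them; z repeats the same pattern in both classes.
ts = [i * 0.5 for i in range(8)]
data = [(t + 2, t, t % 2) for t in ts] + [(t, t + 2, t % 2) for t in ts]
labels = ["A"] * 8 + ["B"] * 8
names = ["x", "y", "z"]

ranking = rank_projections(data, labels, names)
for score, pair in ranking:
    print(round(score, 3), pair)
```

In this toy data neither ``x`` nor ``y`` separates the classes on its own, which
is why the pair ``(x, y)`` ends up at the top of the ranking.
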
The number of different projections that can be considered by VizRank may be
quite high, so VizRank searches the space of possible projections
heuristically. The search is invoked by pressing
:obj:`Start Evaluating Projections` and may be stopped at any time. A search
through modifications of the top-rated projections (replacing one of the two
attributes with another one) is invoked by pressing the
:obj:`Locally Optimize Best Projections` button.

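A search that modifies a projection by replacing one of its two attributes can
be pictured as a simple hill climb. The sketch below is an assumption about
how such a search might look, not the widget's actual procedure; the
``neighbours`` and ``locally_optimize`` helpers and the toy scoring function
are hypothetical.

```python
def neighbours(projection, attributes):
    """All projections obtained by replacing exactly one of the two attributes."""
    a, b = projection
    result = []
    for new in attributes:
        if new in (a, b):
            continue
        result.append((new, b))
        result.append((a, new))
    return result

def locally_optimize(projection, attributes, score):
    """Hill climbing: keep moving to a better neighbour until none improves."""
    best, best_score = projection, score(projection)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(best, attributes):
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
    return best, best_score

# Toy score (hypothetical): each attribute has a fixed usefulness weight.
weights = {"a": 1, "b": 2, "c": 5, "d": 3}

def toy_score(projection):
    return weights[projection[0]] + weights[projection[1]]

best, best_score = locally_optimize(("a", "b"), list(weights), toy_score)
print(best, best_score)
```
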
.. image:: images/Scatterplot-VizRank-Settings.png
   :align: left

VizRank's options are quite elaborate, and if you are not an expert in machine
learning it is best to leave them at their defaults. The options are grouped
according to the different aspects of the method as described in [1]_. The
projections are evaluated by testing a selected classifier
(:obj:`Projection evaluation method`; the default is k-nearest neighbor
classification) using some standard evaluation technique
(:obj:`Testing method`). For very large data sets, use sampling to speed up
the evaluation (:obj:`Percent of data used`). Visualizations are then ranked
according to the prediction accuracy
(:obj:`Measure of classification success`). In our own tests,
:obj:`Average Probability Assigned to the Correct Class` worked somewhat
better than more standard measures like :obj:`Classification Accuracy` or
:obj:`Brier Score`. To avoid an exhaustive search on data sets with many
attributes, the attributes are ranked by a heuristic
(:obj:`Measure for attribute ranking`), so that the most likely projection
candidates are tested first. The number of items in the list of projections is
controlled by :obj:`Maximum length of projection list`.


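The three measures of classification success named above can be compared on a
handful of predictions. The sketch below uses common textbook definitions
(the Brier score is taken here as the squared error summed over classes and
averaged over instances); whether the widget uses exactly these formulas is an
assumption, and the function names and toy predictions are hypothetical.

```python
def classification_accuracy(probs, truth):
    """Share of instances whose most probable class is the true class."""
    hits = sum(1 for p, y in zip(probs, truth) if max(p, key=p.get) == y)
    return hits / len(truth)

def brier_score(probs, truth):
    """Mean over instances of the squared error summed over classes (lower is better)."""
    total = 0.0
    for p, y in zip(probs, truth):
        total += sum((p[c] - (1.0 if c == y else 0.0)) ** 2 for c in p)
    return total / len(truth)

def average_probability_correct(probs, truth):
    """Mean probability that the classifier assigned to the true class."""
    return sum(p[y] for p, y in zip(probs, truth)) / len(truth)

# Toy predictions (hypothetical): class probability distributions per instance.
probs = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.6, "B": 0.4},
    {"A": 0.4, "B": 0.6},
    {"A": 0.2, "B": 0.8},
]
truth = ["A", "A", "A", "B"]

ca = classification_accuracy(probs, truth)
brier = brier_score(probs, truth)
apc = average_probability_correct(probs, truth)
print(ca, round(brier, 3), round(apc, 3))
```

Classification accuracy only counts hits and misses, while the other two
measures also reward confident correct predictions, which gives a finer
ordering of projections with similar accuracy.
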
.. image:: images/Scatterplot-VizRank-ManageSave.png
   :align: left

A set of tools for managing and post-analyzing the list of projections is
available under the :obj:`Manage & Save` tab. Here you may decide which
classes the visualizations should separate (by default, all the classes). The
projection list can be saved (:obj:`Save` in the :obj:`Manage projections`
group) or loaded (:obj:`Load`), and a set of best visualizations may be saved
(:obj:`Saved Best Graphs`). :obj:`Reevaluate Projections` is used when you
have loaded the list of best projections from a file, but the actual data has
changed since the last evaluation. To evaluate the current projection without
engaging the projection search, use the :obj:`Evaluate Projection` button.
Projections are evaluated based on the performance of k-nearest neighbor
classifiers, and the results of these evaluations, in terms of which data
instances were correctly or incorrectly classified, are available through the
two :obj:`Show k-NN` buttons.


Based on the set of interesting projections found by VizRank, a number of
post-analysis tools are available. :obj:`Attribute Ranking` displays a graph
which shows how many times each attribute appears in the top-rated
projections. Bars can be colored according to the class with the maximal
average value of the attribute. :obj:`Attribute Interactions` displays a heat
map showing how many times each pair of attributes appeared together in the
top-rated projections. :obj:`Graph Projection Scores` displays the
distribution of projection scores.

.. image:: images/Scatterplot-VizRank-AttributeHistogram.png

.. image:: images/Scatterplot-VizRank-Interactions.png

.. image:: images/Scatterplot-VizRank-Scores.png

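The counts behind the attribute ranking and attribute interaction displays can
be reproduced from a list of top-rated projections. This is a minimal sketch
of the counting step only; the helper names and the attribute names in the
toy list are hypothetical.

```python
from collections import Counter

def attribute_counts(projections):
    """How many top-rated projections each attribute appears in."""
    return Counter(attr for pair in projections for attr in pair)

def interaction_counts(projections):
    """How many top-rated projections each unordered attribute pair appears in."""
    return Counter(frozenset(pair) for pair in projections)

# Toy list of top-rated projections (hypothetical attribute names).
top_projections = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "d")]

per_attribute = attribute_counts(top_projections)
per_pair = interaction_counts(top_projections)

print(per_attribute.most_common())
print({tuple(sorted(pair)): n for pair, n in per_pair.items()})
```
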
The list of best-rated projections may also be used for the search and
analysis of outliers. The idea is that outliers are those data instances that
are incorrectly classified in many of the top visualizations. For example, the
class of the 33rd instance in brown-selected.tab should be Resp, but this
instance is quite often misclassified as Ribo. The snapshot below shows one
particular visualization displaying why such misclassification occurs. Perhaps
the most important part of the :obj:`Outlier Identification` window is the
list in the lower left (:obj:`Show predictions for all examples`) of
candidates for outliers, sorted by the probability of classification to the
right class. In our case, the most likely outlier is instance 171, followed by
instance 33, both with probabilities of classification to the right class
below 0.5.

.. image:: images/Scatterplot-VizRank-Outliers.png

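Sorting instances by the probability of classification to the right class can
be sketched as follows. Averaging the probability over the top projections and
the 0.5 threshold are assumptions made for this illustration, as are the
helper name, the instance identifiers and the probability values.

```python
def outlier_candidates(prob_correct, threshold=0.5):
    """Rank instances by mean probability of the correct class, lowest first,
    and keep those whose mean falls below the threshold.

    prob_correct maps an instance id to a list with one probability per
    top-rated projection.
    """
    means = {i: sum(ps) / len(ps) for i, ps in prob_correct.items()}
    ranked = sorted(means.items(), key=lambda item: item[1])
    return [(i, m) for i, m in ranked if m < threshold]

# Toy input (hypothetical ids and values): three top projections per instance.
prob_correct = {
    1: [0.9, 0.8, 1.0],
    2: [0.2, 0.4, 0.3],
    3: [0.6, 0.7, 0.5],
    4: [0.1, 0.2, 0.3],
}

candidates = outlier_candidates(prob_correct)
for instance_id, mean_prob in candidates:
    print(instance_id, round(mean_prob, 2))
```
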
Explorative Data Analysis

.. image:: images/Scatterplot-ZoomSelect.png

Scatterplot, together with the rest of Orange's widgets, provides an
explorative data analysis environment by supporting zooming in and out of
part of the plot and selection of data instances. These functions are enabled
through the :obj:`Zoom/Select` toolbox. The default tool is zoom: left-click
and drag on the plot area to define the rectangular area to zoom into, and
right-click to zoom out. The next two buttons in this toolbar are rectangular
and polygon selection. Selections are stacked and can be removed in order,
starting from the last one defined, or all at once (the back-arrow and cross
buttons in the toolbar). The last button in the toolbar is used to resend the
data from this widget. Since this is done automatically after every change of
the selection, this last function is not particularly useful. An example of a
simple schema where we selected data instances from two polygon regions and
sent them to the :ref:`Data Table` widget is shown below. Notice that by
counting the dots in the scatterplot there should be 12 data instances
selected, whereas the data table shows 17. This is because some data instances
overlap (they have the same values of the two attributes used); we could use
jittering to expose them.

.. image:: images/Scatterplot-Iris-Selection.png


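The gap between the number of visible dots and the number of selected
instances comes from instances sharing the same pair of values. A minimal
sketch of how such overlaps could be counted is given below; the helper name
and the toy coordinates are hypothetical and do not reproduce the selection in
the snapshot.

```python
from collections import Counter

def overlap_report(points):
    """Return (number of instances, number of distinct positions,
    positions shared by more than one instance)."""
    counts = Counter(points)
    shared = {p: n for p, n in counts.items() if n > 1}
    return len(points), len(counts), shared

# Toy selection (hypothetical values of the two plotted attributes).
selected = [(1.4, 0.2)] * 3 + [(1.5, 0.2)] * 2 + [(1.3, 0.2), (4.7, 1.4)]

n_instances, n_dots, shared = overlap_report(selected)
print(n_instances, "instances drawn as", n_dots, "dots")
print(shared)
```
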
Examples
--------

Scatterplot can be nicely combined with other widgets that output a list of
selected data instances. For example, a combination of a classification tree
and a scatterplot, as shown below, makes for a nice exploratory tool that
displays the data instances pertinent to a chosen classification tree node
(clicking on any node of the classification tree sends the set of selected
data instances to the scatterplot, updating the visualization and marking the
selected instances with filled symbols).

.. image:: images/Scatterplot-ClassificationTree.png


References
----------

.. [1] Leban G, Zupan B, Vidmar G, Bratko I. VizRank: Data
   Visualization Guided by Machine Learning. Data Mining and Knowledge
   Discovery 13(2): 119-136, 2006.