source: orange/Orange/doc/widgets/Visualize/Scatterplot.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 12.3 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

4<link rel=stylesheet href="../../../style.css" type="text/css" media=screen>
5<link rel=stylesheet href="style-print.css" type="text/css" media=print></link>
12<img class="screenshot" src="../icons/Distributions.png">
13<p>A standard scatterplot visualization with explorative analysis and  intelligent data visualization enhancements.</p>
19<DL class=attributes>
20<DT>Examples (ExampleTable)</DT>
21<DD>Input data set.</DD>
22<DT>Example Subset (ExampleTable)</DT>
23<DD>A subset of data instances from Examples.</DD>
28<dl class=attributes>
29  <DT>Selected Examples (ExampleTable)</DT>
30<DD>A subset of examples that user has manually selected from the scatterplot.</DD>
31<DT>Unselected Examples (ExampleTable)</DT>
32<DD>All other examples (examples not included in the user's selection).</DD>
38<p>Scatterplot widget provides a standard 2-dimensional scatterplot visualization for both continuous and discrete-valued attributes. The data is displayed as a collection of points, each having the value of <span class="option">X-axis attribute</span> determining the position on the horizontal axis and the value of <span class="option">Y-axis attribute</span> determining the position on the vertical axis. Various properties of the graph, like color, size and shape of the  points are controlled through the appropriate setting in the <span class="option">Main</span> pane of the widget, while other (like legends and axis titles, maximum point size and jittering) are set in the <span class="option">Settings</span> pane.</p> A snapshot below shows a scatterplot of an Iris data set, with the size of the points proportional to the value of sepal width attribute, and coloring matching that of the class attribute.</p>
40<p><img class="screenshot" src="Scatterplot-Iris.png" alt="Scatterplot widget"></p>
42<p>In the case of discrete attributes, jittering (<span class="option">Jittering options</span> ) should be used to circumvent the overlap of the points with the same value for both axis, and to obtain a plot where density of the points in particular region corresponds better to the density of the data with that particular combination of values. As an example of such a plot, the scatterplot for the Titanic data reporting on the gender of the passenger and the traveling class is shown below; withouth jittering, scatterplot would display only eight distinct points.</p>
44<p><img class="screenshot" src="Scatterplot-Titanic.png" img="Scatterplot with categorical attributes"></p>
46<p>Most of the scatterplot options are quite standard, like those for selecting attributes for point colors, labels, shape and size (<span class="option">Main</span> pane), or those that control the display of various elements in the graph like axis title, grid lines, etc. (<span class="option">Settings</span> pane). Beyond these, the Orange's scatterplot also implements an intelligent visualization technique called VizRank that is invoked through <span class="option">VizRank</span> button in <span class="option">Main</span> tab.</p>
48<h3>Intelligent Data Visualization</h2>
50<p>If a data set has many (many!) attributes, it is impossible to manually scan through all the pairs of attributes to find interesting scatterplots. Intelligent data visualizations techniques are about finding such visualizations automatically. Orange's Scatterplot includes one such tool called VizRank <a href="#Leban2006" title="">(Leban et al., 2006)</a>, that can be in current implementation used only with classification data sets, that is, data sets where instances are labeled with a discrete class. The task of optimization is to find those scatterplot projections, where instances with different class labels are well separated. For example, for a data set <a href=""></a> (comes with Orange installation) the two attributes that best separate instances of different class are displayed in the snapshot below, where we have also switched on the <span class="option">Show Probabilities</span> option from Scatterplot's <span class="option">Settings</span> pane. Notice that this projection appears at the top of <span class="option">Projection list, most interesting first</span>, followed by a list of other potentially interesting projections. Selecting each of these would change the projection displayed in the scatterplot, so the list and associated projections can be inspected in this way.</p>
52<p><img class="screenshot" src="Scatterplot-VizRank-Brown.png" alt="VizRank and scatterplot"></p>
54<p>The number of different projections that can be considered by VizRank may be quite high. VizRank searches the space of possible projections heuristically. The search is invoked by pressing <span class="option">Start Evaluating Projections</span>, which may be stopped anytime. Search through modification of top-rated projections (replacing one of the two attributes with another one) is invoked by pressing a <span class="option">Locally Optimize Best Projections</span> button.</p>
57<td valign="top">
58<img class="screenshot" src="Scatterplot-VizRank-Settings.png" alt="VizRank settings" border=0 img="VizRank settings">
61<td valign="top">
62<p>VizRank's options are quite elaborate, and if you are not the expert in machine learning it would be best to leave them at their defaults. The options are grouped according to the different aspects of the methods as described in <a href="#Leban2006" title="">(Leban et al., 2006)</a>. The projections are evaluated through testing a selected classifier (<span class="option">Projection evaluation method</span> default is k-nearest neighbor classification) using some standard evaluation technique (<span class="option">Testing method</span>). For very large data set use sampling to speed-up the evaluation (<span class="option">Percent of data used</span>). Visualizations will then be ranked according to the prediction accuracy (<span class="option">Measure of classification success</span>), in our own tests <span class="option">Average Probability Assigned to the Correct Class</span> worked somehow better than more standard measures like <span class="option">Classification Accuracy</span> or <span class="option">Brier Score</span>. To avoid exhaustive search for data sets with many attributes, these are ranked by heuristics (<span class="option">Measure for attribute ranking</span>), testing most likely projection candidates first. Number of items in the list of projections is controlled in <span class="option">Maximum length of projection list</span>.</p>
65<p>A set of tools that deals with management and post-analysis of list of projections is available under <span class="option">Manage &amp; Save</span> tab. Here you may decide which classes the visualizations should separate (default set to separation of all the classes). Projection list can saved (<span class="option">Save</span> in <span class="option">Manage projections</span> group), loaded (<span class="option">Load</span>), a set of best visualizations may be saved (<span class="option">Saved Best Graphs</span>). <span class="option">Reevalutate Projections</span> is used when you have loaded the list of best projections from file, but the actual data has changed since the last evaluation. For evaluating the current projection without engaging the projection search there is an <span class="option">Evaluate Projection</span> button. Projections are evaluated based on performance of k-nearest neighbor classifiers, and the results of these evaluations in terms of which data instances were correctly or incorrectly classified is available through the two <span class="option">Show k-NN</span> buttons.</p>
67<img class="screenshot" src="Scatterplot-VizRank-ManageSave.png" alt="VizRank manage and save">
69<p>Based on a set of interesting projections found by VizRank, a number of post-analysis tools is available. <span class="option">Attribute Ranking</span> displays a graph which show how many times the attributes appear in the top-rated projections. Bars can be colored according to the class with maximal average value of the attribute. <span class="option">Attribute Interactions</span> displays a heat map showing how many times the two attributes appeared in the top-rated projections. <span class="option">Graph Projection Scores</span> displays the distribution of projection scores.</p>
71<p><img class="screenshot" src="Scatterplot-VizRank-AttributeHistogram.png" alt="VizRank attribute use histogram"></p>
72<p><img class="screenshot" src="Scatterplot-VizRank-Interactions.png" alt="VizRank and attribute interactions"></p>
73<p><img class="screenshot" src="Scatterplot-VizRank-Scores.png" alt="VizRank and attribute scoring"></p>
75<p>List of best-rated projections may also be used for the search and analysis of outliers. The idea is that the outliers are those data instances, which are incorrectly classified in many of the top visualizations. For example, the class of the 33-rd instance in <a href=""></a> should be Resp, but this instance is quite often misclassified as Ribo. The snapshot below shows one particular visualization displaying why such misclassification occurs. Perhaps the most important part of the <span class="option">Outlier Identification</span> window is a list in the lower left (<span class="option">Show predictions for all examples</span>) with a list of candidates for outliers sorted by the probabilities of classification to the right class. In our case, the most likely outlier is the instance 171, followed by an instance 33, both with probabilities of classification to the right class below 0.5.<p>
77<p><img class="screenshot" src="Scatterplot-VizRank-Outliers.png" alt="Outliers"></p>
79<h3>Explorative Data Analysis</h3>
81<p><img class="screenshot" src="Scatterplot-ZoomSelect.png" alt="Zoom/Select tool bar"></p>
83<p>Scatterplot, together with the rest of the Orange's widget, provides for a explorative data analysis environment by supporting zooming-in and out of the part of the plot and selection of data instances. These functions are enabled through <span class="option">Zoom/Select</span> toolbox. The default tool is zoom: left-click and drag on the plot area defines the rectangular are to zoom-in. Right click to zoom out. Next two buttons in this tool bar are rectangular and polygon selection. Selections are stacked and can be removed in order from the last one defined, or all at once (back-arrow and cross button from the tool bar). The last button in the tool bar is used to resend the data from this widget. Since this is done automatically after every change of the selection, this last function is not particularly useful. An example of a simple schema where we selected data instances from two polygon regions and send them to the Data Table widget is shown below. Notice that by counting the dots from the scatterplot there should be 12 data instances selected, whereas the data table shows 17. This is because some data instances overlap (have the same value of the two attributes used) - we could use Jittering to expose them.</p>
85<p><img class="screenshot" src="Scatterplot-Iris-Selection.png" alt="Brushing in scatterplot"></p>
90<p>Scatterplot can be nicely combined with other widgets that output a list of selected data instances. For example, a combination of classification tree and scatterplot, as shown below, makes for a nice exploratory tool displaying data instances pertinent to a chosen classification tree node (clicking on any node of classification tree would send a set of selected data instances to scatterplot, updating the visualization and marking selected instances with filled symbols).</p>
92<p><img class="screenshot" src="Scatterplot-ClassificationTree.png" alt="Scatterplot and classification trees"></p>
97<p id="Leban2006">Leban G, Zupan B, Vidmar G, Bratko I. VizRank: Data Visualization Guided by Machine Learning. Data Mining and Knowledge Discovery 13(2): 119-136, 2006. <a href="">[PDF]</a></p>
99<p id="Mramor2007">Mramor M, Leban G, Demsar J, Zupan B. Visualization-based cancer microarray data classification analysis. Bioinformatics 23(16): 2147-2154, 2007. <a href="">[PDF]</a></p>
Note: See TracBrowser for help on using the repository browser.