--- title: "A Step-by-step Tutorial for Interaction Graphs Package 'integr'" author: "Petar Markovic" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{A Tutorial for Interaction Graphs Package integr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction This vignette provides a step-by-step tutorial for using the Interaction graphs package "integr". The package is an implementation of Aleks Jakulin's Interaction Analysis methodology (http://stat.columbia.edu/~jakulin/Int/) inspired by implementation in Orange 2 data mining software (https://orange.biolab.si/). # The Concept^[See http://stat.columbia.edu/~jakulin/Int/ for more details on the methodology] In the context of supervised machine learning, an interaction (i.e statistically relevant dependence) between two attributes $X$ and $Y$, in the presence of the context (i.e. class) atribute $C$, is called **3-way interaction**. A strength of such interaction is measured with **3-way Interaction gain**: $I(X;Y;C) = I(X,Y;C) − I(X;C) − I(Y;C)$. Here, $I(X,Y;C) = I(X,Y|C) = H(X|C) + H(Y|C) − H(X,Y|C)$ is conditional Information gain (i.e. conditional Mutual information) between $X$ and $Y$ in the context $C$, and $I(X;Y) = H(X) + H(Y) − H(X,Y)$ is measure of dependence (i.e. "correlation") between $X$ and $Y$ regardless of context, where $H(X) = P_i \sum_{i}log_{2}P_i$ is Shannon's entropy measured in bits, and $P_i$ the probability of the $i-th$ class; **2-way Interaction gains** of the single attributes $X$ and $Y$ is represented with $I(X;C) = InfoGain_{c}(X) = \sum_{x}\sum_{c}P(x,c)log\frac{P(x,c)}{P(x)P(c)}$ and $I(Y;C) = InfoGain_{c}(Y) = \sum_{y}\sum_{c}P(y,c)log\frac{P(y,c)}{P(y)P(c)}$, respectively. **Interaction graphs** (Figure 1) are a graphical representation of the $k$-most significant 3-way interactions ($2 \leq k \leq 20$). The graph consists of nodes which represent interracting attributes (and their 2-way interactions indicated below the name), and weighted edges which represent the strength of 3-way interaction. There are two types of edges: * The positively interacting (i.e. green) edges indicate that the observed pair of attributes provides more information for making a decision if observed together, rather than observed alone. E.g. (Figure 1): _Outlook_ alone explains 24.69% of the entropy, _Windy_ alone explains 4.79% of the entropy, whilst combined, they additionally explain 30.59% of the entropy. Thus, if observed together, they explain 24.69% + 4.79% + 30.59% = 60.07% of the entropy of the dataset. * The negatively interacting (i.e. red) edges indicate that the the observed pair of attributes repeat the same information and should not be combined. E.g. (Figure 1): _Outlook_ alone explains 24.69% of the entropy, _Others_ alone explains 94.02% of the entropy, whilst combined, they repeat 24.69% of the previous information (i.e. this is why the edge _Outlook_ - _Others_ is negative: -24.69%). Thus, if observed together, they explain 24.69% + 94.02% - 24.69% = 94.02% of the entropy of the dataset. ```{r, fig.show='hold', echo=FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.cap="Figure 1: Interaction graph based on the toy-dataset 'Golf'", fig.height=4, fig.align='center'} library(integr) integr::plotIntGraph(integr::interactionGraph(integr::golf, classAtt = "Play", intNo = 10)) ``` Hence, interaction graphs can be used as a tool for understanding the most important interactions and selection of the attributes suitable for grouping/including in a machine learning model. # The toy-data description In this tutorial, the **'Golf'** toy-dataset will be used. It is included in the package, and its structure is presented in the Table below. It represents a 14-row discrete data.frame (i.e. all columns are factors) with 6 discrete attributes of which 5 are input, and 1 is the class attribute. The input attributes are used to determine whether a game of golf was played given the conditions, and the decision is recorded in the class attribute: * __Outlook:__ values: Overcast, Rainy, Sunny (_input attribute_) * __Temperature:__ values: Cool, Hot, Mild (_input attribute_) * __Humidity:__ values: High, Normal (_input attribute_) * __Windy:__ values: True, False (_input attribute_) * __Others:__ artificially added attribute indicating whether the players on the other courts were playing the golf at the given time, values: Yes, No (_input attribute_) * __Play:__ indicating whether the decision was to play or not to play a party of golf, values: Yes, No (_class attribute_) ```{r, echo=FALSE, results='asis', fig.cap="Table 1: The 'Golf' toy-dataset"} knitr::kable(integr::golf) ``` # Step-by-step tutorial ## Reading the data First the 'integr' package, and a dataset needs to be loaded. The dataset needs to be discrete, and to have a class attribute. Here the 'Golf' toy-dataset will be used: ```{r, eval=FALSE} #load integr package (needs to be installed first!) library("integr") #read Golf toy-dataset data("golf") ``` ## Generating the interaction graph object When the data is loaded, an interaction graph object needs to be created. A data.frame containing the data needs to be provided, as well as the name of the class attribute as a string: ```{r, eval=FALSE} #create an Interaction graph object g <- interactionGraph(golf, classAtt = "Play", intNo = 10, speedUp = FALSE) ``` The additional parameters _intNo_ (_integer_) and _speedUp_ (_boolean_) are optional. The first indicates the desired number of interactions to be displayed on the interaction graph (2 <= _intNo_ <= 20, default 16), whilst the latter indicates if during the interactions computation all attributes that have 2-way interaction gain equal to zero (on the 4th decimal) should be pruned; this speeds up computation for larger datasets but it can lead to less precise results so it is turned off (i.e. set to FALSE) by default. In case the __intNo__ parameter is set to an inappropriate value (i.e <2, >20 or larger than theoretically possible number of interactions for the given dataset) it is automatically adjusted to fit and a warning message is printed. ## Plotting the interaction graph object After the interaction graph object has been obtained, it can be plotted using _plotIntGraph()_: ```{r, eval=FALSE} #plot an Interaction graph object (in RStudio!) plotIntGraph(g) ``` It only requires an interaction graph object as an input. Here the result of the previous step is used. The result of this comand is Figure 1. ## Exporting the interaction graph object Integr package allows interaction graphs to be export to a binary file. The supported formats are: a Graphviz graph, SVG image, PNG image, PostScript (PS) file, or PDF. The code for exporting the corresponding binary file is provided below. ### Export to a Graphviz binary file ```{r, eval=FALSE} #export an Interaction graph object to a Graphviz file igToGrViz(g, path = "myFolder", fName = "myInteractionGraph") ``` _g_ is the interaction graph object; _path_ parameter is a string indicating the path (folder) in which the output should be saved. _fName_ parameter is a string indicating the name of the output. It should be defined without extension and without spaces. If not specified differently, 'InteractionGraph' by default. ### Export to a SVG image ```{r, eval=FALSE} #export an Interaction graph object to a SVG image igToSVG(g, path = "myFolder", fName = "myInteractionGraph", h = 2000) ``` _g_ is the interaction graph object; _path_ parameter is a string indicating the path (folder) in which the output should be saved. _fName_ parameter is a string indicating the name of the output. It should be defined without extension and without spaces. If not specified differently, 'InteractionGraph' by default; _h_ is the desired height of the output image in pixels. If not defined differently, 2000 by default. ### Export to a PNG image ```{r, eval=FALSE} #export an Interaction graph object to a PNG image igToPNG(g, path = "myFolder", fName = "myInteractionGraph", h = 2000) ``` _g_ is the interaction graph object; _path_ parameter is a string indicating the path (folder) in which the output should be saved. _fName_ parameter is a string indicating the name of the output. It should be defined without extension and without spaces. If not specified differently, 'InteractionGraph' by default; _h_ is the desired height of the output image in pixels. If not defined differently, 2000 by default. ### Export to a PDF image ```{r, eval=FALSE} #export an Interaction graph object to a PDF image igToPDF(g, path = "myFolder", fName = "myInteractionGraph", h = 2000) ``` _g_ is the interaction graph object; _path_ parameter is a string indicating the path (folder) in which the output should be saved. _fName_ parameter is a string indicating the name of the output. It should be defined without extension and without spaces. If not specified differently, 'InteractionGraph' by default; _h_ is the desired height of the output image in pixels. If not defined differently, 2000 by default. ### Export to a PS image ```{r, eval=FALSE} #export an Interaction graph object to a PS image igToPS(g, path = "myFolder", fName = "myInteractionGraph", h = 2000) ``` _g_ is the interaction graph object; _path_ parameter is a string indicating the path (folder) in which the output should be saved. _fName_ parameter is a string indicating the name of the output. It should be defined without extension and without spaces. If not specified differently, 'InteractionGraph' by default; _h_ is the desired height of the output image in pixels. If not defined differently, 2000 by default.