COmmunity Detection Algorithm COMparator (CoDACom)

Invocation

Main operations

To call CoDACom, the main script is :

sh execute.sh [-pmac] [varpath]

The options correspond to the subparts that are executed, if no option is provided all the subparts will be run sequentially. The optional argument is the path to the vars.sh configuration file, found by default in the execution folder. Please note that, without inputs and a configuration step, the program will do nothing. You need to :

Provide graphs by placing them in the appropriate folders, raw/inputs by default
The inputs are handled with whitelists. You need to update them so that CoDACom recognises your inputs, by using the update_whitelists.sh script or directly modifying the configuration file vars.sh.

Main files

To give instructions to the program, the three following files are important.

vars.sh is a configuration script that is included in most scripts, and that contains system and behaviour configuration. You may set the number of parallel jobs, a timeout, a verbose level, etc. You may notably set the list of the files on which the computation will be carried on.
update_whitelists.sh is a utilitary script that updates vars.sh with folder content. You may only update part some of the whitelists, the default behaviour being to update everything.
execute.sh is the main execution script. You may run it using a subset of the four options "-pmac", which will only run the corresponding subparts. Please note that the files generated by the preprocessing step are not automatically added to the corresponding whitelist, you will therefore need to update it between the preprocessing and the method subpart.

First, one need to provide the graphs to the program by placing them in the appropriate folders. It corresponds to the RAW_INPUT_DIR variable in vars.sh, which is raw/inputs by default. The corresponding ground truth file goes to the RAW_GROUND_TRUTH folder, raw/ground_truth by default.

The numerical results may be found in the RESULT_DIR folder, specified in vars.sh. The default folder is the subfolder results. Clusterings may be found in the OUTPUT_DIR, default being clusterings.

Subparts

Four purposes have been identified for CoDACom, corresponding to four different behaviors.

Preprocessing

In order to normalise and to avoid errors, the inputs graphs are :

checked for consistency
reduced to their largest connected component
reordered to get consecutive ids
normalised to an edgelist format

This reordering uses a BFS algorithm that improves most of the processing thanks to caching.

A ground-truth may also be provided. It is preprocessed as well in order to keep consistency with the graph ids. A translation file is also outputted, which tracks the translation from old to new id.

Methods

The implementation of various community detection algorithms is executed. Sometimes, some change of the graph format from the internal one is needed to match the expectation of the specific implementation.

Analysis

In this subpart, comparison methods and quality functions are used. Comparison methods are functions that compare the result of two clusterings, without taking the graph topology into account. Quality functions quantify the quality of clusterings.

First, a simple comparison of all pairs of clusterings is processed. Second, the qualities of each clustering are computed. If the graphs feature a ground-truth, the clusterings are compared to this ground-truth, and Spearman's coefficient is computed between the ranking of the clusters from the perspective of the quality functions and the ranking of the ground-truth.

Conversion

You may convert the clusterings in a graphml format, which can be visualised using any modern graph visualisation software, such as Gephi. The clusterings are represented as edges labels (internal/external edges) and node labels (cluster id).

All the outputs of the program are originally available as space separated table files. To make things easier, a tabular LaTeX conversion is provided.

The quotient graph is constituted by the clusters as nodes and an edge exist if there is at least one edge connecting the clusters in the base graph. Useful data about the clusters and their connections is available in the labels.

Available components

When using CoDACom for a scientific publication, it would be kind of you to cite the original author of the used components.

Methods

walktrap : The walktrap algorithm, from Pons&Latapy, 2005
spinglass : The spinglass algorithm, from Reichardt&Bornholdt, 'Statistical mechanics of community detection', 2006
scd : The Scalable Community Detection, from Prat-Pérez&al., 'High Quality, Scalable and Parallel Community Detection for Large Real Graphs', 2014
louvain : The louvain method, from Blondel&al., 'Fast unfolding of communities in large networks', 2008
leading_eigenvector : The leading eigenvector method, from Newman, 'Finding community structure in networks using the eigenvectors of matrices', 2006
label_prop : The label propagation algorithm, from Raghavan et al., 'Near linear time algorithm to detect community structures in large-scale networks', 2007
infomap : The infomap algorithm from Rosvall&Bergstrom, 'Maps of random walks on complex networks reveal community structure', 2008
conclude : Conclude algorithm, from De Meo&al., 'Mixing local and global information for community detection in large networks', 2014
clauset : Clauset's greedy modularity algorithm, from Clauset et al., 'Finding community structure in very large networks', 2004
betweenness : Betweenness centrality method, from Girvan&Newman, 'Finding and evaluating community structure in networks', 2004

Comparison

fbcubed : The overlapping F-BCubed measure. See Amigo et al., 'A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints', 2009. Warning : the method uses 18000 random samples if the number of elements is bigger
nmi : The NMI measure. See Lanichinetti et al., 'Detecting the overlapping and hierarchical community structure in complex networks', 2009.
omega_index : The Omega index measure. See Xie et al., 'Overlapping community detection in networks: The state-of-the-art and comparative study', 2013. Warning : the method uses 18000 random samples if the binomial of the number of elements is bigger.

Quality functions

modularity : The modularity, from Newman, 'Mixing patterns in networks', 2003
local_modularity : The local modularity, from Muff et al., 'Local modularity measure for network clusterizations', 2005
modularity_density : The modularity density, from Chen et al., 'A new metric for quality of network community structure', 2013
clust_coef : The average internal clustering coefficient
permanence : The average permanence, from Chakraborty et al., 'On the permanence of vertices in network communities', 2014
s_compactness : The compactness, from Creusefond et al., 'Finding compact communities in large graphs', 2015
s_1-cond : One minus the conductance (normalised by the number of nodes)
s_FOMD : Fraction Over Median Degree, from Yang&Leskovec, 'Defining and Evaluating Network Communities based on Ground-truth', 2012
sur : Surprise, from Aldecoa and Marin, 'Surprise maximization reveals the community structure of complex networks', 2013. With the asymptotic approximation from the 2015 article (the m factor is removed)
sign : Signifiance, from Traag et al., 'Significant Scales in Community Structure', 2013
s_f-odf : Flake Out-Degree Fraction, from Flake et al., 'Efficient identificiation of web communities', 2000
s_cut_ratio : the cut ratio

Restrictions

CoDACom does not (yet) support directed or weighted graphs nor overlapping community detection.

Formats

Accepted graph formats (see the related igraph documentation for more information) :

gml
graphml
ncol
lgl

Accepted ground-truth format : list of nodes and membership

node_id1 cluster_id1 cluster_id2 ...
node_id2 cluster_id1 cluster_id3 ...
...

Translation format : list of nodes id

old_id new_id
...

Internal graph format : list of edges

node_id1 node_id2
...

Internal ground-truth format : list of nodes and membership

node_id1 cluster_id1 cluster_id2 ...
node_id2 cluster_id1 cluster_id3 ...
...

File hierarchy breakdown

src : Source directory
- analysis : The code used for the analysis phase
- conversion : The code used for the conversion phase
- libcomparison : The library implementing comparison methods
- libmisc : Miscalleanous utility functions
- libquality : The library implementing quality functions
- preprocessing : The code used for the analysis phase
- tools : scripts, tools used by methods or external scripts
html : This website
methods : The community detection methods. Any subdir containing execute.sh will be considered as a method
preprocessed : The output of the preprocessing phase
raw : The input folder
tmp : Internal temporary folder