Invocation
Main operations
To call CoDACom, the main script is :
sh execute.sh [-pmac] [varpath]
The options correspond to the subparts that are executed, if no option is provided all the subparts will be run sequentially.
The optional argument is the path to the vars.sh configuration file, found by default in the execution folder.
Please note that, without inputs and a configuration step, the program will do nothing.
You need to :
- Provide graphs by placing them in the appropriate folders, raw/inputs by default
- The inputs are handled with whitelists. You need to update them so that CoDACom recognises your inputs, by using the update_whitelists.sh script or directly modifying the configuration file vars.sh.
Main files
To give instructions to the program, the three following files are important.
- vars.sh is a configuration script that is included in most scripts, and that contains system and behaviour configuration. You may set the number of parallel jobs, a timeout, a verbose level, etc. You may notably set the list of the files on which the computation will be carried on.
- update_whitelists.sh is a utilitary script that updates vars.sh with folder content. You may only update part some of the whitelists, the default behaviour being to update everything.
- execute.sh is the main execution script. You may run it using a subset of the four options "-pmac", which will only run the corresponding subparts. Please note that the files generated by the preprocessing step are not automatically added to the corresponding whitelist, you will therefore need to update it between the preprocessing and the method subpart.
First, one need to provide the graphs to the program by placing them in the appropriate folders.
It corresponds to the RAW_INPUT_DIR variable in vars.sh, which is raw/inputs by default.
The corresponding ground truth file goes to the RAW_GROUND_TRUTH folder, raw/ground_truth by default.
The numerical results may be found in the RESULT_DIR folder, specified in vars.sh.
The default folder is the subfolder results.
Clusterings may be found in the OUTPUT_DIR, default being clusterings.
Subparts
Four purposes have been identified for CoDACom, corresponding to four different behaviors.
Preprocessing
In order to normalise and to avoid errors, the inputs graphs are :
- checked for consistency
- reduced to their largest connected component
- reordered to get consecutive ids
- normalised to an edgelist format
This reordering uses a BFS algorithm that improves most of the processing thanks to caching.
A ground-truth may also be provided.
It is preprocessed as well in order to keep consistency with the graph ids.
A translation file is also outputted, which tracks the translation from old to new id.
Methods
The implementation of various community detection algorithms is executed.
Sometimes, some change of the graph format from the internal one is needed to match the expectation of the specific implementation.
Analysis
In this subpart, comparison methods and quality functions are used.
Comparison methods are functions that compare the result of two clusterings, without taking the graph topology into account.
Quality functions quantify the quality of clusterings.
First, a simple comparison of all pairs of clusterings is processed.
Second, the qualities of each clustering are computed.
If the graphs feature a ground-truth, the clusterings are compared to this ground-truth, and Spearman's coefficient is computed between the ranking of the clusters from the perspective of the quality functions and the ranking of the ground-truth.
Conversion
You may convert the clusterings in a graphml format, which can be visualised using any modern graph visualisation software, such as Gephi.
The clusterings are represented as edges labels (internal/external edges) and node labels (cluster id).
All the outputs of the program are originally available as space separated table files.
To make things easier, a tabular LaTeX conversion is provided.
The quotient graph is constituted by the clusters as nodes and an edge exist if there is at least one edge connecting the clusters in the base graph.
Useful data about the clusters and their connections is available in the labels.
Available components
When using CoDACom for a scientific publication, it would be kind of you to cite the original author of the used components.
Methods
- walktrap : The walktrap algorithm, from Pons&Latapy, 2005
- spinglass : The spinglass algorithm, from Reichardt&Bornholdt, 'Statistical mechanics of community detection', 2006
- scd : The Scalable Community Detection, from Prat-PĂ©rez&al., 'High Quality, Scalable and Parallel Community Detection for Large Real Graphs', 2014
- louvain : The louvain method, from Blondel&al., 'Fast unfolding of communities in large networks', 2008
- leading_eigenvector : The leading eigenvector method, from Newman, 'Finding community structure in networks using the eigenvectors of matrices', 2006
- label_prop : The label propagation algorithm, from Raghavan et al., 'Near linear time algorithm to detect community structures in large-scale networks', 2007
- infomap : The infomap algorithm from Rosvall&Bergstrom, 'Maps of random walks on complex networks reveal community structure', 2008
- conclude : Conclude algorithm, from De Meo&al., 'Mixing local and global information for community detection in large networks', 2014
- clauset : Clauset's greedy modularity algorithm, from Clauset et al., 'Finding community structure in very large networks', 2004
- betweenness : Betweenness centrality method, from Girvan&Newman, 'Finding and evaluating community structure in networks', 2004
Comparison
- fbcubed : The overlapping F-BCubed measure. See Amigo et al., 'A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints', 2009. Warning : the method uses 18000 random samples if the number of elements is bigger
- nmi : The NMI measure. See Lanichinetti et al., 'Detecting the overlapping and hierarchical community structure in complex networks', 2009.
- omega_index : The Omega index measure. See Xie et al., 'Overlapping community detection in networks: The state-of-the-art and comparative study', 2013. Warning : the method uses 18000 random samples if the binomial of the number of elements is bigger.
Quality functions
- modularity : The modularity, from Newman, 'Mixing patterns in networks', 2003
- local_modularity : The local modularity, from Muff et al., 'Local modularity measure for network clusterizations', 2005
- modularity_density : The modularity density, from Chen et al., 'A new metric for quality of network community structure', 2013
- clust_coef : The average internal clustering coefficient
- permanence : The average permanence, from Chakraborty et al., 'On the permanence of vertices in network communities', 2014
- s_compactness : The compactness, from Creusefond et al., 'Finding compact communities in large graphs', 2015
- s_1-cond : One minus the conductance (normalised by the number of nodes)
- s_FOMD : Fraction Over Median Degree, from Yang&Leskovec, 'Defining and Evaluating Network Communities based on Ground-truth', 2012
- sur : Surprise, from Aldecoa and Marin, 'Surprise maximization reveals the community structure of complex networks', 2013. With the asymptotic approximation from the 2015 article (the m factor is removed)
- sign : Signifiance, from Traag et al., 'Significant Scales in Community Structure', 2013
- s_f-odf : Flake Out-Degree Fraction, from Flake et al., 'Efficient identificiation of web communities', 2000
- s_cut_ratio : the cut ratio
Restrictions
CoDACom does not (yet) support directed or weighted graphs nor overlapping community detection.
Accepted graph formats (see the related igraph documentation for more information) :
Accepted ground-truth format : list of nodes and membership
node_id1 cluster_id1 cluster_id2 ...
node_id2 cluster_id1 cluster_id3 ...
...
Translation format : list of nodes id
old_id new_id
...
Internal graph format : list of edges
node_id1 node_id2
...
Internal ground-truth format : list of nodes and membership
node_id1 cluster_id1 cluster_id2 ...
node_id2 cluster_id1 cluster_id3 ...
...
File hierarchy breakdown
- src : Source directory
- analysis : The code used for the analysis phase
- conversion : The code used for the conversion phase
- libcomparison : The library implementing comparison methods
- libmisc : Miscalleanous utility functions
- libquality : The library implementing quality functions
- preprocessing : The code used for the analysis phase
- tools : scripts, tools used by methods or external scripts
- html : This website
- methods : The community detection methods. Any subdir containing execute.sh will be considered as a method
- preprocessed : The output of the preprocessing phase
- raw : The input folder
- tmp : Internal temporary folder