Which open source software is best for network data analysis?
I want to start analyzing data to illustrate relationships between persons and institutions, but I am not sure which software to choose. I have to use free and open source software. Please advise me.
In my opinion the best software to use for network analysis will really depend on a number of factors, including:
What skills do you have? In particular, do you have development skills and, if so, in which languages?
How large is the network in question?
Are you more focussed on visualization or the computation of metrics such as centrality, betweenness, etc.?
What is your budget, if any?
It's a bit of a generalization and the below is not an exhaustive list, but in my mind the available tools can be divided into four categories:
Focused Desktop Tools
Gephi: Probably the most popular network visualization package out there. Gephi doesn't require any programming knowledge. Its strength is that it is able to produce very high quality visualizations. It can also handle relatively large graphs - the actual size will depend on your infrastructure (particularly RAM) but you should be able to go up to 100,000 nodes without a problem. It does have the ability to calculate a few of the more common metrics such as degree, centrality, etc. but it's a stronger tool for visualization than analysis. [Open Source]
NodeXL: NodeXL is an Excel add-in, so you will need Excel to use it, which is a bit of a limitation for Mac users, for example. It doesn't have all of the flexibility of Gephi in terms of visualization but can produce some quality visualizations. It also interfaces directly with the SNAP library for analysis, which gives it access to a nice set of efficient algorithms for metric calculations. The main advantage of NodeXL, though, is neither in its visualization nor its analysis functionality but rather in its data collection - it interfaces with the Twitter API nicely, for example, and in my experience many of the use cases for NodeXL involve the visualization and analysis of social media data. [Open Source, but as I write this they have announced that the Open Source functionality will be limited and that a Commercial option will be introduced]
Cytoscape: There is both a desktop version and a JavaScript version for developers (see cytoscape.js). In my experience it's primarily used in the biology domain but can certainly be used outside of it and is capable of producing high quality visualizations. [Open Source]
Ucinet: In my experience Ucinet is most widely used in academic circles. It's very strong on analytics with a large number of metrics. However, it is quite weak on visualization in my view (really thinking about its cousin Netdraw here), i.e. it can calculate both the common metrics as well as some quite arcane metrics but it's not great at turning those results into a well presented visualization. It also requires Windows for installation so Mac users have to be creative by using an emulator for example. [Commercial]
Pajek: At a high-level, not too dissimilar to Ucinet in that it is quite strong on analytics but relatively weak on visualization. There is also a version called Pajek-XXL which is specifically designed for very large graphs - if you have a network with millions of nodes and no programming skills and would like to analyse it then this wouldn't be a bad starting place. [Commercial]
General Desktop / On-Premise Solutions
Palantir: Very expensive on-premise solution. Not specifically designed for traditional network analysis but rather making sense of network data in a much more general way. Used by intelligence agencies and the like. [Commercial]
IBM i2 Analyst's Workstation: Similar to Palantir. There is a whole history here - i2 used to be a standalone company that was acquired by IBM and there was some interesting litigation between Palantir and IBM prior to this. [Commercial]
Cloud Based Tools
Polinode: Polinode is software-as-a-service for network analysis, i.e. you can upload networks to the cloud and then visualize them there like you do with Gephi but with the key advantage that you can share them with other people without having them download software. It also includes analysis capabilities including 20 of the most common metrics for network analysis - it doesn't have all the metrics of say a Ucinet but for most use cases should have enough. Since it runs in your browser, if your network is very large (e.g. >50,000 nodes) then you are probably better off using one of the developer tools or a desktop tool designed for very large networks. Since it's cloud-based, Polinode is also able to integrate relationship-based surveys for the collection of network data. [Commercial - full disclosure: I'm the founder]
Developer Tools
NetworkX: An active community and terrific if you have some Python knowledge. If you have a large dataset (>100,000 nodes say) then this is a great place to start as many of the computationally intensive metrics now make use of sparse matrices. [Open Source]
iGraph: Also good for large graphs and if you prefer R over Python for your data analysis and have a solid knowledge of R then you may want to use iGraph. [Open Source]
SNAP: Written in C++ but with a Python interface this is the Stanford Network Analysis Project. Not for the faint of heart! A great framework to build on if you have something very technical / custom to build that needs the speed that C++ can provide but be prepared to invest a lot of time in getting up to speed. [Open Source]
sigma.js: For all the web developers. sigma.js is a JavaScript library that provides flexible functionality for visualization. It's light on analysis - you would need to calculate centrality, etc. externally. There are actually a lot of other JavaScript libraries out there - see ngraph, cytoscape.js, d3, arbor, alchemy and dracula for example. [Open Source]
By no means an exhaustive list but rather a few of the more commonly used applications to illustrate the trade-offs.
I would suggest NetworkX. I have also tried Network Workbench (http://nwb.cns.iu.edu/), which is a good tool, but unfortunately it is not updated as regularly as NetworkX, iGraph and Gephi.
I am teaching a course on SNA. The students use Pajek - it is a good introduction to SNA and the program has an extensive user manual, a book actually. For my research, I am leaning toward Gephi, but in previous research we relied on Cytoscape.
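For anyone weighing the NetworkX suggestion, here is a minimal sketch of what a person-institution analysis might look like. The names and affiliations are made-up sample data, and treating each affiliation as a plain undirected edge is an assumption for illustration, not the only way to model this kind of network.

```python
import networkx as nx

# Hypothetical sample data: (person, institution) affiliations.
affiliations = [
    ("Alice", "University A"),
    ("Bob", "University A"),
    ("Bob", "Institute B"),
    ("Carol", "Institute B"),
]

# Build an undirected graph; the "kind" attribute records the node type.
G = nx.Graph()
for person, institution in affiliations:
    G.add_node(person, kind="person")
    G.add_node(institution, kind="institution")
    G.add_edge(person, institution)

# Two of the common metrics mentioned in this thread.
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

# List nodes from most to least "in between" others.
for node in sorted(G.nodes, key=betweenness.get, reverse=True):
    print(f"{node:15s} degree={degree[node]:.2f} betweenness={betweenness[node]:.2f}")
```

In this toy data Bob links the two institutions, so he comes out with the highest betweenness. With real data you would replace the `affiliations` list with your own edge list.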
For a quick check of the data, I would recommend Pajek or Gephi.
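If your data starts out as a simple list of relationships, a quick way to get it into Gephi is an edge-list CSV with `Source` and `Target` columns, which Gephi's spreadsheet import can read. The sketch below uses only the Python standard library; the sample affiliations and the `edges_to_csv` helper name are hypothetical.

```python
import csv
import io

# Hypothetical sample data: (person, institution) affiliations.
affiliations = [
    ("Alice", "University A"),
    ("Bob", "University A"),
    ("Bob", "Institute B"),
    ("Carol", "Institute B"),
]

def edges_to_csv(edges):
    """Return an edge list as CSV text with a Source,Target header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()

csv_text = edges_to_csv(affiliations)
print(csv_text)

# To save it for import, write csv_text to a file, for example:
# with open("edges.csv", "w", newline="") as f:
#     f.write(csv_text)
```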
BTW, there is a good free online course on Social Network Analysis by Lada Adamic coming up in March. She uses Gephi and iGraph. https://www.coursera.org/course/sna
There are a few I work with. NodeXL is a Microsoft Excel template and is very useful; get it from here: http://nodexl.codeplex.com/. The other is Social Networks Visualizer (http://socnetv.sourceforge.net/). Both are free and very easy to use.
Ashis that is more of a question than an answer. But for truly large networks (> 1m nodes) you should look at Pajek-XXL or Pajek-3XL: http://mrvar.fdv.uni-lj.si/pajek/PajekXXL.htm. From a development perspective (i.e. code required) Apache Spark's GraphX is also an option (not for visualisation though): http://spark.apache.org/graphx/. You can also look at neo4j but you won't find out-of-the-box algorithms going down that route. Oh, and if your network isn't that large then, depending on the resources on your computer, Gephi can be workable up to a few million nodes.
Agreed, I would use Apache Spark, or Oracle PGX, or Titan to do this. You may find advantages with a PGX approach over Spark if you don't have a lot of high-end nodes to run things on.
Andrew Pitts provides a great response with lots of reference systems. I'd like to add one more to the mix, although it is not open source but rather more of an "open standards" platform: DataWalk (https://datawalk.com/). There are a lot of foundational data representation capabilities in "link charts": it is not simply a matter of presenting a network diagram, but rather of how the underlying representation is defined and the ability to create the objects (and links). Many of these concepts are overviewed in my books. However, transitioning the theory into real-world practice is not always efficient, practical, or consistent. Therefore, how accurate, true, complete, and reliable are the results generated? There are some good videos at https://datawalk.com/resources/
I used to be a HUGE fan of Gephi -- it was easy to use, had a pretty intuitive UI, generated all sorts of handy SNA metrics, etc. However, it has not worked properly with Windows 10 for years (check the various user forums if you don't believe me). I've tried every suggested fix (most of which have to do with making sure that the Java home path is specified correctly)--edit the config file, uninstall and reinstall, clear out the temp folder, etc., etc. Nothing has worked, and this has been an ongoing issue for at least four years if not longer. NodeXL is an *okay* workaround, but the free version is more limited in terms of metrics (won't calculate modularity class, for example) and the graphical output is nowhere near as nice as Gephi. BLERG.
We've been using Gephi, when it isn't crashing for no reason - we're looking for an alternative - might just have to go back to python & coding it all.
My favorites are Gephi (easy and beautiful graphs), iGraph (vast number of resources, good for larger networks), Statnet (great option if you want to run Exponential Random Graph Models, or ERGMs), and Pajek and Ucinet (good for smaller networks, but their graphs are not as pretty as the previous options).