This repository has been archived by the owner on May 29, 2018. It is now read-only.

Building the dependency graph #3

Open
bmcfee opened this issue Oct 11, 2014 · 3 comments

Comments

@bmcfee
Member

bmcfee commented Oct 11, 2014

Most research software does not actually get cited directly. For example, a paper might cite sklearn but not numpy, or numpy but not BLAS, etc. Consequently, most research software is only cited implicitly.

To try and fill in the implied citation network, we can extract software dependencies from known repositories. This can take a few forms:

  • Python packages that use setuptools define their dependencies explicitly, and these are stored in a well-structured object that's easy to parse.
  • What about R?
  • What about MATLAB?
  • What about C/C++?
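For the Python case, a minimal sketch of reading declared dependencies might look like the following. It assumes the package is installed locally and relies on the standard-library `importlib.metadata` module; the helper names `requirement_name` and `direct_dependencies` are made up for illustration, and environment markers and extras are not filtered out.

```python
import re
from importlib import metadata

# Leading project name of a requirement string such as "numpy>=1.8".
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_name(requirement):
    """Return the bare project name from a requirement string."""
    match = _NAME.match(requirement)
    if match is None:
        raise ValueError(f"unrecognized requirement: {requirement!r}")
    return match.group(1)


def direct_dependencies(dist_name):
    """List the declared dependencies of an installed distribution.

    Raises importlib.metadata.PackageNotFoundError if it is not installed.
    """
    # metadata.requires returns None when no requirements are declared.
    declared = metadata.requires(dist_name) or []
    return sorted({requirement_name(r) for r in declared})
```

This only covers the "direct" edges for one package; R, MATLAB, and C/C++ would each need their own extractor.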

Alternatively, once we have a list of top-level packages, we can start crawling package management hierarchies:

  • Debian/Ubuntu/etc.
  • PyPI
  • MathWorks File Exchange?
  • What about Mac users: Anaconda? brew? ports?
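Whatever the source, the crawl itself could be a plain breadth-first traversal. The sketch below assumes a `lookup` callable that returns the direct dependencies of a package; `TOY_INDEX` is a hypothetical stand-in for a real query against PyPI, apt, or similar, and its edges are illustrative only.

```python
from collections import deque


def crawl(roots, lookup):
    """Breadth-first crawl; returns {package: sorted direct dependencies}."""
    graph = {}
    queue = deque(roots)
    while queue:
        package = queue.popleft()
        if package in graph:
            continue  # already visited; also guards against cycles
        deps = sorted(set(lookup(package)))
        graph[package] = deps
        queue.extend(deps)
    return graph


# Hypothetical stand-in for a real package index query.
TOY_INDEX = {
    "sklearn": ["numpy", "scipy"],
    "scipy": ["numpy"],
    "numpy": ["blas"],
    "blas": [],
}

graph = crawl(["sklearn"], lambda name: TOY_INDEX.get(name, []))
```

Keeping the index behind a callable means each package manager can plug in its own lookup without changing the traversal.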

Once we have a full tree, we'll have to prune it back to some reasonable level. It might be useful to include something like boost, but libc would obviously be a step too far. Where do we draw the line? Can this be automated?
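One way the pruning could be automated, at least partially, is a stop list combined with a depth limit. This is only a sketch under those assumptions: the `prune` function, the stop list contents, and the toy graph `FULL` are all hypothetical, and choosing the stop list is exactly the open question above.

```python
from collections import deque


def prune(graph, roots, stop=frozenset(), max_depth=None):
    """Keep packages reachable from roots, dropping stop-listed packages
    and anything further than max_depth edges from a root."""
    depth_of = {}
    queue = deque((root, 0) for root in roots if root not in stop)
    while queue:
        package, depth = queue.popleft()
        if package in depth_of:
            continue  # breadth-first order: first visit is the shallowest
        depth_of[package] = depth
        if max_depth is not None and depth >= max_depth:
            continue  # keep this package but do not descend further
        queue.extend(
            (dep, depth + 1)
            for dep in graph.get(package, [])
            if dep not in stop
        )
    # Restrict every edge list to the packages that survived.
    return {
        package: [dep for dep in graph.get(package, []) if dep in depth_of]
        for package in depth_of
    }


# Toy graph; "mytool" and its edges are illustrative only.
FULL = {
    "mytool": ["numpy", "boost"],
    "numpy": ["blas"],
    "boost": ["libc"],
    "blas": ["libc"],
    "libc": [],
}

without_libc = prune(FULL, ["mytool"], stop={"libc"})
shallow = prune(FULL, ["mytool"], max_depth=1)
```

Here `without_libc` keeps boost and BLAS but drops libc, while `shallow` keeps only the direct dependencies of the root.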

@sbenthall

Can I request that this dependency tracking be implemented in such a way that it can be imported as a module into another project?

I ask because I've been intending to do something similar for a collaboration analysis tool my team has been working on:
https://github.com/sbenthall/bigbang

One thing I'd like to suggest (though it might be scope creep) is to think about how this integrates with version control. Software dependencies change over time.

@sbenthall

In the interest of reducing redundant effort, just putting a pointer here to the related feature request in BigBang

https://github.com/sbenthall/bigbang/issues/109

You might be interested in MetricsGrimoire, which has a project, CVSAnalY, for importing version control data:

http://metricsgrimoire.github.io/

@bmcfee
Member Author

bmcfee commented Nov 11, 2014

Yes, that's an excellent point. For something like PyPI or Debian, dynamic dependency tracking would be pretty straightforward since all packages are versioned. For the other, more esoteric sources (MathWorks?), this seems pretty treacherous, but maybe solvable via timestamps.
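For the versioned sources, tracking change over time could be as simple as storing one dependency list per release and diffing neighbours. A rough sketch, where `HISTORY`, the package name, and the version numbers are all invented for illustration:

```python
def dependency_changes(old, new):
    """Return (added, removed) dependency names between two releases."""
    old, new = set(old), set(new)
    return sorted(new - old), sorted(old - new)


# Hypothetical history: (package, version) -> declared dependencies.
HISTORY = {
    ("mytool", "0.1"): ["numpy"],
    ("mytool", "0.2"): ["numpy", "scipy"],
    ("mytool", "0.3"): ["scipy", "six"],
}

added, removed = dependency_changes(
    HISTORY[("mytool", "0.2")],
    HISTORY[("mytool", "0.3")],
)
```

For sources without version numbers, the key could presumably be a timestamp instead of a version string.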

I definitely like the idea of implementing that as a standalone module. I worry a little about having common identifiers across modules if it gets split up, but canonical naming can be part of the functionality of the dependency tracking module.
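As a starting point for canonical naming, PyPI project names can be normalized with the rule from PEP 503 (lowercase, with runs of `-`, `_`, and `.` collapsed to a single `-`). Mapping names across ecosystems would need something more; the `ALIASES` table below is a hypothetical, hand-maintained example and not an existing data source.

```python
import re

# Hypothetical cross-ecosystem aliases, keyed by normalized name.
ALIASES = {
    "python-numpy": "numpy",
    "sklearn": "scikit-learn",
}


def canonical_name(name, aliases=ALIASES):
    """Normalize a name with the PEP 503 rule, then apply known aliases."""
    normalized = re.sub(r"[-_.]+", "-", name).lower()
    return aliases.get(normalized, normalized)
```

If the dependency tracker owns this function, other modules can share identifiers by always passing names through it first.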
