Family Tree of Data: Provenance and Neo4j
Details
David Allen will talk about Provenance and Neo4j:
Provenance is “information that helps determine the derivation history of a data product”: essentially, the family tree of data creation and manipulation. Data provenance improves many kinds of information by adding context: where did this information come from, and how was it generated? Users who need to consume many data sources can make better trust decisions when they are given the information they need to assess how much to trust data from an unfamiliar source. Provenance also helps users understand the impact of erroneous data or defective processes, and provides records for forensic investigations that seek to link available data, processing chains, and outcomes.
Provenance also happens to be quite naturally modeled as a graph. My company (MITRE) has been doing systems engineering research on provenance for a few years, and has tried every kind of storage model under the sun for provenance. Only relatively recently have we moved to using Neo4j to manage provenance stores. This talk is going to be about provenance, and all of the different options we tried for modeling, storing, and querying graph data: not just Neo4j, but also relational and XML stores, along with their relative strengths and weaknesses. We'll talk about why Neo4j was well suited to the task, and the characteristics of the use cases that play to graph strengths and avoid graph DB weaknesses.
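As a rough illustration of why provenance fits a graph, here is a minimal sketch in plain Python. It is not Neo4j and not the model used in the talk; the node names and the `downstream` and `upstream` helpers are made up for this example. Data items and processes are nodes, and each edge points from an input to whatever was derived from it.

```python
from collections import deque

# Hypothetical provenance graph (illustrative names only).
# Each key is a data item or process; its list holds what was derived from it.
DERIVED = {
    "raw_sensor_feed": ["cleaning_job"],
    "cleaning_job": ["cleaned_table"],
    "cleaned_table": ["weekly_report", "forecast_model"],
    "forecast_model": ["forecast_output"],
}


def downstream(graph, start):
    """Return every node reachable from `start`, i.e. everything derived from it."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(graph, target):
    """Return every ancestor of `target`: its 'family tree'."""
    # Reverse the edges, then reuse the same breadth-first traversal.
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    return downstream(reverse, target)


# Impact question: if cleaned_table is wrong, what else is affected?
affected = downstream(DERIVED, "cleaned_table")

# Trust question: where did forecast_output come from?
lineage = upstream(DERIVED, "forecast_output")
```

Both questions from the description above, the impact of erroneous data and where a data product came from, reduce to a traversal in one direction or the other. That is the kind of query a graph database is built to answer.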



