
The Data Federation Maturity Model


The following was originally published as part of the Data Federation Framework from Phase 1 of this project.


Successful execution of a federated data effort is largely a question of incentives and resources. Developing, and complying with, a new process or specification for data submission takes considerable time, effort, and expertise, and will only be possible to sustain with a large number of motivated individuals who have both the ability and the capacity to execute on a long term vision. However, it is difficult, and unwise, to immediately allocate vast resources to a new federated data effort — the effort may be easier than anticipated, or harder, or impossible, for a variety of reasons. We have observed that the most successful of these efforts simultaneously and iteratively develop the maturity of the effort along four axes: Impetus, Community, Specification, and Application. If done properly, these four dimensions work in concert to create a virtuous cycle of more participation, enthusiasm, and resources allocated to the project over time.

| Dimensions | Beginnings | Growth | Mature |
| --- | --- | --- | --- |
| Impetus | Grassroots | Policy | Law |
| Community | First Adopters | Trending | Maintenance |
| Specification | Dictionary | Machine Readable | Standardized |
| Application | Alpha | Beta | Production |

Impetus

The business or legal need that sets the project in motion

The first spark behind any federated data effort is some sort of impetus. This might be a shared understanding of a problem, where a grassroots community comes together and decides to act (for example, the law enforcement community creating NIEM). Or it could be a policy, where a central authority (e.g., OMB, or a state government) decides to require compliance of a certain sort through an official policy (for example, the OMB M-13-13 policy with Data.gov). Or it could be as formal as a law, as in the case of the DATA Act. The more formal the mechanism, the greater the force and permanence it has, and thus the harder it is to adapt during implementation.

For both policies and laws, it is critical to leave all technical decision making to the implementation team; never attempt to fully dictate a data standard in policy or law. For example, it can be very helpful to dictate that a standard must be machine readable, extensible, and developed alongside a web application with user feedback, but specifying that the standard must be written in XML, that states can modify it as needed, or that it must be showcased in a user-friendly web application can have complex and unforeseen consequences. In extreme cases you can specify data elements to provide, but it is best to avoid doing so. In writing a policy or law, focus on processes and outcomes, not on implementation details. It's also important to recognize that data standards need maintenance and adaptation over time: it's wise to specify an annual or biannual review of the standard itself to make sure the data is still providing value.

Community

The people who provide and consume the data

For a federated data effort, you have two communities you need to keep happy: the data owners, who must do the work to adopt the standard, and the community of users, who consume the aggregated data. It is the job of the project team (a small team dedicated to driving adoption) to keep both of these communities excited and encouraged.

When executing on a federated data effort, it's important not to expect or plan for immediate compliance from the entire community of data owners. Instead, select 2-3 early adopters: good-faith partners who are excited about helping. One partner is not enough to generalize from, and more than three would begin to overburden the project team. Once those early adopters have been identified, help them implement a draft of the standard in a high-touch fashion. This means working with them side by side to implement it, or even having the project team implement it itself in order to demonstrate feasibility and show them how it works. The project team's relationship with the early adopters is critical and bidirectional: the implementation details and the standard itself need to be adapted to be user friendly for data owners, and those owners must also learn why the standard is helpful to a broader audience.

Typically the data being collected is already in use by the communities most relevant to the data owners, and it requires a shift in mindset to invest in making it more broadly available. For example, for the Voting Information Project, counties typically already published their polling locations and ballot information on their websites. Without a standard reporting format, however, that information was nearly impossible for citizens to find. Thus, the project team was responsible for conveying that problem of scale to the data owners and helping them understand the standard. For the DATA Act, the project team quickly realized that agencies were most comfortable working with the CSV (comma-separated values) format. Rather than try to teach them all to adhere to the XML-style format the machine readable specification was written in, the team developed a parallel standard for the CSV format.

Once the early adopters have had success, it's time to roll out the standard to the larger community. Since the first-mover risk has been absorbed, the effort will typically start trending in that larger community, which now has use cases to point to, people to talk with, and code to look at to help it comply. Once adoption is complete or plateaus, the standard enters the maintenance phase, where ownership questions become salient. The trending phase is often accompanied by influxes of excitement, talent, press, and funding. After a few years all of those may lessen, and it becomes important to establish long-term mechanisms for maintaining the standard itself and the processes around compliance. If the standard has become a normal part of operations, the maintenance phase can be part of the normal budget. If it continues to be "tacked on" to normal operations, the effort is at risk of fizzling out.

It is also important to be in touch with the community of data consumers from the very beginning. These could be journalists, scientists, citizens, decision makers within the organization, and so on. The reason for this is twofold: first, as the ultimate users of the data, it is important that their feedback be integrated early and often. Second, their positive feedback and involvement will help incentivize the data owners, who are embarking on a long and uncertain journey.

Specification

The definition of what data needs to be provided and what format it should be in

The specification itself is responsible for communicating to data owners what data needs to be provided and how it should be formatted. At the lowest level of formality, project teams can start with a description of fields (a data dictionary). This should be detailed, avoid acronyms, use exact field and column names, and include examples of compliant data. If you have a simple data standard, for example a single CSV or a set of CSVs, this level of detail may be sufficient. For example, the General Transit Feed Specification, which is among the most successful federated data efforts, does not publish a machine readable standard, but rather has thorough documentation detailing the requirements and fields in language the domain experts in transit agencies understand.
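As a rough illustration, a few data dictionary entries might be captured as in the sketch below. The columns, descriptions, and examples are hypothetical and are not drawn from any published specification:

```python
# Hypothetical sketch of a few data dictionary entries for a simple CSV
# standard. The columns, descriptions, and examples are invented for
# illustration; a real dictionary would be written with the data owners.
DATA_DICTIONARY = [
    {
        "column": "location_name",
        "description": "Human-readable name of the polling place.",
        "required": True,
        "example": "Lincoln Elementary School Gymnasium",
    },
    {
        "column": "start_date",
        "description": "First date the location is open for voting, formatted YYYY-MM-DD.",
        "required": True,
        "example": "2018-01-15",
    },
    {
        "column": "hours",
        "description": "Daily opening and closing times in local time.",
        "required": False,
        "example": "07:00-20:00",
    },
]

# Entries like these can be rendered into the human-readable documentation
# that data owners actually read.
for field in DATA_DICTIONARY:
    print(f"{field['column']}: {field['description']} (example: {field['example']})")
```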

A more formal specification should be machine readable: in this case, not only is the data dictionary well documented as described above, but the specification itself can be used to automatically perform simple validations against the data. This level of specificity can help reduce compliance burden by increasing clarity. For example, a data dictionary might have a field called "start_date" and describe it as the starting date of an election, but a machine readable version would be forced to clarify that the date should be formatted like 2018-01-15, which reduces the potential for wasteful back-and-forth or downstream data integrity problems. For specifications that must represent many-to-one or many-to-many relationships, a machine readable format (e.g., JSON Schema or XML Schema) is strongly preferred, even as a first iteration. That machine readable schema can be versioned as well.
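To make that concrete, here is a minimal sketch of schema-based validation, assuming Python with the third-party jsonschema package installed; the schema and records are invented for illustration and are not part of any published standard:

```python
# Minimal sketch of validating records against a machine readable spec,
# assuming the third-party "jsonschema" package (pip install jsonschema).
from jsonschema import validate, ValidationError

ELECTION_SCHEMA = {
    "type": "object",
    "properties": {
        "election_name": {"type": "string"},
        # A prose dictionary could just say "starting date of an election";
        # the schema forces the format question to be answered: YYYY-MM-DD.
        "start_date": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
    },
    "required": ["election_name", "start_date"],
}

records = [
    {"election_name": "2018 Primary", "start_date": "2018-01-15"},   # valid
    {"election_name": "2018 General", "start_date": "Nov 6, 2018"},  # rejected
]

for record in records:
    try:
        validate(instance=record, schema=ELECTION_SCHEMA)
        print(f"OK: {record['election_name']}")
    except ValidationError as err:
        print(f"Invalid record {record['election_name']}: {err.message}")
```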

Over many years of stability and adoption, it may make sense to promote the machine readable schema into a full standard, for example one recognized by a formal standards body. This level of maturity is rare, but it can help promote stability. Working with standards bodies is generally a complex endeavor, and thus not recommended until the standard has a proven user base and community.

Application

A software application optimized for demonstrating and providing the value of the data to end users

Only in very rare cases will raw data published online be compelling enough on its own to drive adoption by the user community. Generally that community will have richer requirements around and beyond the data itself. For example, they may want to search it or visualize it easily, or access it programmatically through an Application Programming Interface (API). Since that user community ultimately drives the value of the entire effort, it's important to develop an application in concert with the standard to demonstrate the value of the data. For example, the project team behind code.gov developed an alpha version of the site in just a few months, showcasing the work of some of the early adopters. It quickly earned over 100,000 viewers, which helped ignite excitement for the effort across government and provided countless valuable perspectives on how the data would be used and what, exactly, the value was that they should be targeting.
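As one possible shape for such an early application, the sketch below stands up a thin read-only API over aggregated data. It assumes Python with the Flask microframework, and the dataset and endpoint are hypothetical; this is not how code.gov or any other project mentioned here was actually built:

```python
# Sketch of an alpha-stage API over aggregated submissions, assuming
# Flask is installed (pip install flask). Data and routes are illustrative.
from flask import Flask, jsonify, request

app = Flask(__name__)

# In a real effort this would be loaded from the aggregated data owners'
# submissions; here it is a tiny in-memory stand-in.
PROJECTS = [
    {"agency": "Agency A", "name": "example-project", "open_source": True},
    {"agency": "Agency B", "name": "another-project", "open_source": False},
]

@app.route("/api/projects")
def list_projects():
    """Return projects, optionally filtered by agency, e.g. /api/projects?agency=Agency+A."""
    agency = request.args.get("agency")
    results = [p for p in PROJECTS if agency is None or p["agency"] == agency]
    return jsonify(results)

if __name__ == "__main__":
    app.run(debug=True)
```

Even a stand-in this small lets the data consumers react to something concrete, which is the point of the alpha.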

Often for federated data efforts, the value proposition itself is not fully known, but rather hypothesized: developing a "killer app" that showcases that value directly is a critical part of the effort. Similarly, the application helps inform the specification itself: perhaps some fields are found to be unnecessary, or incorrectly formatted, or missing. This application provides important fuel for incentivizing adoption. For example, for the General Transit Feed Specification, having local citizens be able to search their city's public transit routes on Google Maps provided clear and tremendous value to potential adopters. As is industry best practice, it's important to develop this application iteratively, starting with a small subset of features, releasing publicly to early adopters in an alpha phase, performing more extensive testing in a beta phase, and adding increased stability in the production phase.