Data import
Here is a draft of the XML Schema for importing data from external information systems.
Status: not discussed, not approved
The intended workflow for importing data into OpenMunicipio is the following:
- the XML import file is divided into "fragments", each describing a major area of the whole ER model used in OpenMunicipio
- the data provider produces an XML file describing its knowledge about one or more "fragments"
- data integrity is checked within the same fragment
- the provided "fragments" are imported into the DB. Here it is essential to keep track of what has been imported, when it was imported, and whether the imported data has been validated by an editor
- a backend admin interface must be designed that allows editors to:
  - approve or not approve imported data
  - merge two imported records into a single one
  - split a single record into two (this may not be needed if we adopt a conservative approach when importing data, i.e. merge data only when natural keys are present and match, or never merge data at all)
  - link records belonging to different areas (e.g. a vote with the act it was about)
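The bookkeeping asked for in the import step (what was imported, when, and whether an editor validated it) could be sketched in Python as follows. This is only an illustration: the class `ImportedRecord`, its fields and the `pending` helper are hypothetical names, not part of OpenMunicipio.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ImportedRecord:
    """Hypothetical bookkeeping entry for one imported record."""
    fragment: str   # fragment the record came from, e.g. "Votes"
    payload: dict   # the imported attributes, as read from the XML
    imported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    approved: Optional[bool] = None     # None = not yet reviewed
    reviewed_by: Optional[str] = None

    def review(self, editor: str, approve: bool) -> None:
        """Store an editor's decision (approve / not approve)."""
        self.approved = approve
        self.reviewed_by = editor


def pending(records):
    """Return the records that no editor has reviewed yet."""
    return [r for r in records if r.approved is None]


# Made-up sample data.
records = [
    ImportedRecord("Bodies", {"first_name": "Mario", "last_name": "Rossi"}),
    ImportedRecord("Votes", {"seq_num": "1"}),
]
records[0].review(editor="editor1", approve=True)
print([r.fragment for r in pending(records)])
```

Keeping `approved` as a three-state value (None, True, False) lets the admin interface distinguish "not reviewed yet" from "reviewed and not approved".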
The XML Schema file for the import should reflect the data provider's point of view, i.e. what the provider knows about the data. The database has been divided into three main areas: Bodies, Acts and Votes.
- Bodies provide the structure of the offices where elected people and public employees work. Through the offices we should also get information about the people working there.
- Acts collect information about deliberations, motions and other kinds of acts debated in public assemblies.
- Votes register every ballot, those who took part in it, how each of them voted, and so on.
- Person
  - @first_name
  - @last_name
  - @birth_date [optional]
  - @birth_place [optional]
  - @sex [optional]
  - @ssn [optional]
- Office
  - Charge (0,n)
    - Person (1,1)
    - @id [required]
    - @start_date [optional]
    - @end_date [optional]
    - @end_reason [optional]
    - @description [optional]
  - @name
  - @description [optional]
- Company
  - Charge (0,n)
    - Person (1,1)
    - @id [required]
    - @start_date [optional]
    - @end_date [optional]
    - @end_reason [optional]
    - @description [optional]
  - @name
  - @description [optional]
- Mayor | CityGovernment | Council | Commission
  - Charge (0,n)
    - Person (1,1)
    - @id [required]
    - @start_date [optional]
    - @end_date [optional]
    - @end_reason [optional]
    - @description [optional]
  - @name
  - @description [optional]
- OpenMunicipio
  - People (0,1)
  - Offices (0,1)
    - Office (0,n)
  - Companies (0,1)
    - Company (0,n)
  - Institutions (0,1)
    - Mayor (0,1)
    - CityGovernment (0,1)
    - CityCouncil (0,1)
    - Commission (0,n)
  - Acts (0,1)
    - ActsCouncil (0,1)
      - Interrogation (0,n)
      - Interpellation (0,n)
      - Motion (0,n)
      - Agenda (0,n)
      - Emendation (0,n)
      - CouncilDeliberation (0,n)
    - ActsCityGovernment (0,1)
      - CityGovernmentDeliberation (0,n)
      - Investigation (0,n)
      - Decision (0,n)
    - ActsMayor (0,1)
      - Regulation (0,n)
      - Decree (0,n)
    - ActsOffices (0,n)
      - Determination (0,n)
  - Sittings (0,1)
    - Sitting (1,n)
      - Votation (0,n)
        - Subject (1,1)
        - Votes (1,1)
        - @seq_num [required]
        - @date_time [required]
        - @presents [required]
        - @partecipants [required]
        - @legal_number [required]
        - @counter_yes [required]
        - @counter_no [required]
        - @counter_abs [required]
        - @outcome { approved, rejected }
      - @num [required]
      - @date [required]
      - @call [required]
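To make the [required] markers in the outline above concrete, here is a minimal Python sketch of an integrity check on a single fragment. It is not a replacement for validation against the XML Schema; the helper name `check_fragment`, the sample values and the date formats are assumptions made for illustration, and the contents of Subject and Votes are left empty because the outline does not describe them.

```python
import xml.etree.ElementTree as ET

# Required attributes, transcribed from the outline above.
REQUIRED = {
    "Charge": ["id"],
    "Sitting": ["num", "date", "call"],
    "Votation": ["seq_num", "date_time", "presents", "partecipants",
                 "legal_number", "counter_yes", "counter_no", "counter_abs"],
}
OUTCOMES = {"approved", "rejected"}


def check_fragment(root):
    """Return a list of human-readable problems found under `root`."""
    problems = []
    for elem in root.iter():
        for attr in REQUIRED.get(elem.tag, []):
            if attr not in elem.attrib:
                problems.append(f"{elem.tag}: missing @{attr}")
        if elem.tag == "Votation":
            outcome = elem.get("outcome")
            if outcome is not None and outcome not in OUTCOMES:
                problems.append(f"Votation: unexpected @outcome {outcome!r}")
    return problems


# Made-up sample: the second Votation is deliberately incomplete.
SAMPLE = """
<Sittings>
  <Sitting num="12" date="2011-03-01" call="1">
    <Votation seq_num="1" date_time="2011-03-01T18:30:00" presents="30"
              partecipants="28" legal_number="21" counter_yes="20"
              counter_no="6" counter_abs="2" outcome="approved">
      <Subject/>
      <Votes/>
    </Votation>
    <Votation seq_num="2" outcome="postponed"/>
  </Sitting>
</Sittings>
"""
problems = check_fragment(ET.fromstring(SAMPLE))
for line in problems:
    print(line)
```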
Workflow for testing:
- define OpenMunicipioDataImport.xsd to validate the entire document
- populate persons.xml with the <Persons>...</Persons> fragment
- repeat the previous step for groups.xml (with the <Groups>...</Groups> fragment), and so on
- include both in om.xml as follows:
  <xi:include href="persons.xml" /> <xi:include href="groups.xml" /> ...
- validate the document with xmllint:
  xmllint --xinclude --schema OpenMunicipioDataImport.xsd om.xml
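The assembly step that xmllint performs with --xinclude can also be reproduced with Python's standard library, which may be handy for quick experiments. The sketch below is an assumption-laden illustration: the files are in-memory strings rather than files on disk, the root element name and the `Group` element with its `name` attribute are made up, and no schema validation is done (the standard library does not validate against XSD).

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# In-memory stand-ins for the files on disk (contents are made up).
FILES = {
    "persons.xml":
        '<Persons><Person first_name="Mario" last_name="Rossi"/></Persons>',
    "groups.xml":
        '<Groups><Group name="Example group"/></Groups>',
    "om.xml": (
        '<OpenMunicipio xmlns:xi="http://www.w3.org/2001/XInclude">'
        '<xi:include href="persons.xml"/>'
        '<xi:include href="groups.xml"/>'
        '</OpenMunicipio>'
    ),
}


def loader(href, parse, encoding=None):
    """Resolve an xi:include href against FILES instead of the file system."""
    if parse != "xml":
        raise ValueError("only parse='xml' is handled in this sketch")
    return ET.fromstring(FILES[href])


root = ET.fromstring(FILES["om.xml"])
ElementInclude.include(root, loader=loader)  # expands xi:include in place
print([child.tag for child in root])
```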
- Aren't the attributes @start_date, @end_date and @end_reason in GroupCharge redundant? Isn't that information already in Charge?
answer by guglielmo: no, the attributes in GroupCharge describe the fact that an InstitutionCharge may join a Group for a limited amount of time and then move to another Group (hence end_date and end_reason).
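In order to ensure maximum portability of data among databases, we avoid arbitrary keys as much as possible, i.e. autoincrement/integer values used as keys. In fact, arbitrary numeric keys are meaningful only inside the database that generated them, so data indexed by them cannot be reliably matched when it is exported and imported elsewhere.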
Note: we are not arguing for natural keys (as opposed to arbitrary/surrogate keys) in the actual DB, only for import/export purposes (see here for a discussion of why surrogate keys should be preferred to natural keys).
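As an illustration of the conservative "merge only when natural keys are present and match" rule, here is a small Python sketch for Person records. Which attributes make up the natural key is not defined in this draft; the choice below (ssn if available, otherwise first name, last name, birth date and birth place together) and the function names are assumptions.

```python
import unicodedata


def _norm(value):
    """Case- and accent-insensitive form of a value, or None if empty."""
    if not value:
        return None
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.casefold().split())


def natural_key(person):
    """Return a hashable natural key, or None if the data is insufficient."""
    ssn = _norm(person.get("ssn"))
    if ssn:
        return ("ssn", ssn)
    parts = tuple(_norm(person.get(name)) for name in
                  ("first_name", "last_name", "birth_date", "birth_place"))
    if all(parts):
        return ("name+birth",) + parts
    return None  # too little information: never merge on this record


def should_merge(a, b):
    """Merge only when both records have a natural key and the keys match."""
    key_a, key_b = natural_key(a), natural_key(b)
    return key_a is not None and key_a == key_b


# Made-up sample records.
a = {"first_name": "Niccolò", "last_name": "Rossi",
     "birth_date": "1970-01-01", "birth_place": "Roma"}
b = {"first_name": "NICCOLO", "last_name": " rossi ",
     "birth_date": "1970-01-01", "birth_place": "Roma"}
c = {"first_name": "Mario", "last_name": "Rossi"}
print(should_merge(a, b), should_merge(c, c))
```

Note that a record with only first and last name never merges, not even with an identical copy of itself; such cases are left to the editors.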
Two possible solutions are under evaluation: receive the entire external database as one XML file, or receive it as a collection of XML files. The latter would probably be the better solution, with each XML file representing one "main" entity:
- PRO this way every file contains less data; since XML is inherently verbose, it is less likely to hit the maximum file size limit (usually 4 GB)
- PRO partial processing of the data would be possible, instead of a single batch import
- PRO/CON not sure it is possible to ensure consistency (KEY/KEYREF) across multiple XML files
  - it seems that only a very restricted XPath syntax is allowed for selector and field, which excludes references to external files (see Section 9.2.5 of "XML Schema", O'Reilly)
  - on the other hand, it is possible to use XInclude to build a single XML document from several XML files. The key/keyref constraints can then be specified on the main document (i.e. the file that includes all the chunks). (see a simple example)
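The consistency that key/keyref would enforce on the assembled document can be pictured with a short Python sketch: collect the key values, check they are unique, and report references that match no key. Only Charge/@id comes from the outline in this page; the `Vote` element with a `charge_id` attribute and the helper name `dangling_refs` are hypothetical, and the sample omits the Person children for brevity.

```python
import xml.etree.ElementTree as ET


def dangling_refs(root, key_path, key_attr, ref_path, ref_attr):
    """Return the reference values under `root` that match no key value.

    Raises ValueError if the key values are not unique.
    """
    keys = [elem.get(key_attr) for elem in root.findall(key_path)]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key values")
    known = set(keys)
    return [elem.get(ref_attr) for elem in root.findall(ref_path)
            if elem.get(ref_attr) not in known]


# Made-up document, as it might look once all the chunks are included.
DOC = """
<OpenMunicipio>
  <Offices>
    <Office name="Example office">
      <Charge id="c1"/>
      <Charge id="c2"/>
    </Office>
  </Offices>
  <Votes>
    <Vote charge_id="c1"/>
    <Vote charge_id="c9"/>
  </Votes>
</OpenMunicipio>
"""
root = ET.fromstring(DOC)
missing = dangling_refs(root, ".//Charge", "id", ".//Vote", "charge_id")
print(missing)
```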
guglielmo's remark: a multi-file solution allows for a better protocol for receiving deltas (changes over time)
- birth_date and birth_place can be extracted from ssn
- sex can be extracted from ssn
guglielmo's remark: ssn values are usually not among the exported data; on the other hand, given first name and last name, all the other information can be retrieved through one of openpolis's APIs
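For the cases where an ssn is provided, the extraction mentioned above could look like the following Python sketch. It assumes that ssn means the Italian codice fiscale (this page does not say so explicitly) in its standard 16-character form; the function name `parse_ssn` and the returned field names are hypothetical. The code carries only a two-digit year and a place code, so the century and the place name would need extra information, and variants where digits are replaced by letters are not handled.

```python
# Month letters used in the codice fiscale, January to December.
MONTHS = {letter: index + 1 for index, letter in enumerate("ABCDEHLMPRST")}


def parse_ssn(ssn):
    """Extract birth data and sex from a 16-character codice fiscale."""
    code = ssn.strip().upper()
    if len(code) != 16:
        raise ValueError("expected a 16-character code")
    year = int(code[6:8])      # two digits only: the century is not encoded
    month = MONTHS[code[8]]
    day = int(code[9:11])
    sex = "M"
    if day > 40:               # women have 40 added to the day of birth
        day -= 40
        sex = "F"
    return {
        "birth_year_2digit": year,
        "birth_month": month,
        "birth_day": day,
        "sex": sex,
        "birth_place_code": code[11:15],  # needs a lookup table for the name
    }


# Commonly used fictitious example code.
info = parse_ssn("RSSMRA85T10A562S")
print(info)
```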
"From Entity Relationship to XML Schema: a graph-theoretic approach", 2009