scaloni edited this page Feb 29, 2012 · 28 revisions

This is a draft XML Schema for importing data from external information systems.

Status: not discussed, not approved

Importation workflow

The intended workflow for importing data into OpenMunicipio is the following:

  1. the XML import file is divided into "fragments", each describing one of the major areas of the ER model used in OpenMunicipio

  2. the data provider produces an XML describing its knowledge about one or more "fragments"

  3. data integrity is checked within the same fragment

  4. the provided "fragments" are imported into the DB. Here it is fundamental to keep track of what has been imported, when it has been imported, and whether the imported data has been validated by an editor

  5. a backend admin interface must be designed allowing editors to:

    1. approve/not approve imported data
    2. merge two imported records into a single one
    3. split a single record into two (this may not be needed if we adopt a conservative approach when importing data, i.e. merge data only when natural keys are present and match, or never merge data at all)
    4. link records belonging to different areas (e.g. a vote with an act the vote was talking about)

Design:

The XML Schema file for importation should reflect the data provider's point of view, i.e. what they know about the data. The database has been divided into three main areas: Bodies, Acts and Votes.

  • Bodies provide the structure of the offices where elected people and public employees work. Through the offices we should also get information about the people working there.
  • Acts collect information about deliberations, motions and other sorts of acts debated in public assemblies.
  • Votes register every ballot, those who took part in the ballot, their vote, and so on.

High-level view

  • Person

    • @first_name
    • @last_name
    • @birth_date [optional]
    • @birth_place [optional]
    • @sex [optional]
    • @ssn [optional]
  • Office

    • Charge (0,n)
      • Person (1,1)
      • @id [required]
      • @start_date [optional]
      • @end_date [optional]
      • @end_reason [optional]
      • @description [optional]
    • @name
    • @description [optional]
  • Company

    • Charge (0,n)
      • Person (1,1)
      • @id [required]
      • @start_date [optional]
      • @end_date [optional]
      • @end_reason [optional]
      • @description [optional]
    • @name
    • @description [optional]
  • Mayor | CityGovernment | Council | Commission

    • Charge (0,n)
      • Person (1,1)
      • @id [required]
      • @start_date [optional]
      • @end_date [optional]
      • @end_reason [optional]
      • @description [optional]
    • @name
    • @description [optional]
  • OpenMunicipio

    • People (0,1)
      • Offices (0,1)
        • Office (0,n)
      • Companies (0,1)
        • Company (0,n)
      • Institutions (0,1)
        • Mayor (0,1)
        • CityGovernment (0,1)
        • CityCouncil (0,1)
        • Commission (0,n)
    • Acts (0,1)
      • ActsCouncil (0,1)
        • Interrogation (0,n)
        • Interpellation (0,n)
        • Motion (0,n)
        • Agenda (0,n)
        • Emendation (0,n)
        • CouncilDeliberation (0,n)
      • ActsCityGovernment (0,1)
        • CityGovernmentDeliberation (0,n)
        • Investigation (0,n)
        • Decision (0,n)
      • ActsMayor (0,1)
        • Regulation (0,n)
        • Decree (0,n)
      • ActsOffices (0,n)
        • Determination (0,n)
    • Sittings (0,1)
      • Sitting (1,n)
        • Votation (0,n)
          • Subject (1,1)
          • Votes (1,1)
          • @seq_num [required]
          • @date_time [required]
          • @presents [required]
          • @partecipants [required]
          • @legal_number [required]
          • @counter_yes [required]
          • @counter_no [required]
          • @counter_abs [required]
          • @outcome { approved, rejected }
        • @num [required]
        • @date [required]
        • @call [required]
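
To make the outline above concrete, here is a hypothetical instance document containing a single Office and a single Sitting. Element and attribute names come from the outline; every value, the date formats, and the empty Subject and Votes elements (whose content is not defined above) are assumptions made for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: all values are invented, formats are assumed -->
<OpenMunicipio>
  <People>
    <Offices>
      <Office name="Urban Planning Office">
        <!-- Charge (0,n): one Person (1,1) plus a required @id -->
        <Charge id="charge-001" start_date="2011-06-01">
          <Person first_name="Mario" last_name="Rossi"/>
        </Charge>
      </Office>
    </Offices>
  </People>
  <!-- Acts (0,1) is omitted: a provider may describe only some fragments -->
  <Sittings>
    <Sitting num="12" date="2012-02-20" call="1">
      <Votation seq_num="1" date_time="2012-02-20T18:30:00"
                presents="30" partecipants="28" legal_number="21"
                counter_yes="20" counter_no="6" counter_abs="2"
                outcome="approved">
        <!-- Subject and Votes are required (1,1); their content is not specified in the outline -->
        <Subject/>
        <Votes/>
      </Votation>
    </Sitting>
  </Sittings>
</OpenMunicipio>
```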

Experimenting

Workflow for testing:

  1. define OpenMunicipioDataImport.xsd to validate the entire document

  2. populate persons.xml with the <Persons>...</Persons> fragment

  3. repeat the previous step for groups.xml (with the <Groups>...</Groups> fragment) and so on ...

  4. include both in om.xml as follows:

    <xi:include href="persons.xml" /> <xi:include href="groups.xml" /> ...

  5. validate the document with xmllint

    xmllint --xinclude --schema OpenMunicipioDataImport.xsd om.xml
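
For the include step to work, the main document has to declare the XInclude namespace. A minimal om.xml might look like the sketch below; the root element name OpenMunicipio is borrowed from the high-level view and is an assumption here, as is placing the includes directly under the root.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- om.xml: main document that pulls in the per-entity fragments -->
<OpenMunicipio xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- replaced by the <Persons>...</Persons> fragment -->
  <xi:include href="persons.xml"/>
  <!-- replaced by the <Groups>...</Groups> fragment -->
  <xi:include href="groups.xml"/>
</OpenMunicipio>
```

With --xinclude, xmllint resolves the includes before validating, so the schema is checked against the merged tree. Depending on the processor, included elements may gain an xml:base attribute, which the schema would then have to allow.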

Issues

Model issues

  • aren't the attributes @start_date, @end_date and @end_reason in GroupCharge redundant (isn't that information already in Charge)?

answer by guglielmo: no, the attributes in GroupCharge describe the fact that an InstitutionCharge may join a Group for a limited amount of time, and then move to another Group (end_date and end_reason).

Avoid arbitrary keys

In order to ensure maximum portability of data among databases, we avoid arbitrary keys (i.e. autoincrement/integer values used as keys) as much as possible. Data indexed by arbitrary numeric keys is hard to export and import, because such keys have no meaning outside the database that generated them.

Note: we are not arguing for natural keys (as opposed to arbitrary/surrogate keys) in the actual DB, only for import/export purposes (see here for a discussion of why surrogate keys should be preferred to natural keys).
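
As an illustration of what a natural key could look like in the import schema, the sketch below declares a uniqueness constraint over a person's name and birth date. The choice of these three attributes as a natural key, the constraint name personNaturalKey and the xs:date type are assumptions, not part of the draft. xs:unique is used instead of xs:key because @birth_date is optional, and xs:key would require it on every Person.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Persons">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Person" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="first_name" type="xs:string" use="required"/>
            <xs:attribute name="last_name" type="xs:string" use="required"/>
            <xs:attribute name="birth_date" type="xs:date"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- Assumed natural key: no two persons share name and birth date.
         Persons without @birth_date are not checked by xs:unique. -->
    <xs:unique name="personNaturalKey">
      <xs:selector xpath="Person"/>
      <xs:field xpath="@first_name"/>
      <xs:field xpath="@last_name"/>
      <xs:field xpath="@birth_date"/>
    </xs:unique>
  </xs:element>
</xs:schema>
```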

Multiplicity of files

Two possible solutions are under evaluation: receive the entire external database as one XML file, or receive it as a collection of XML files. The latter, where each XML file represents one "main" entity, would probably be the better solution:

  • PRO every file contains less data; since XML is inherently verbose, this makes it less likely to reach the maximum file size limit (usually 4 GB)
  • PRO partial processing of data would be possible, instead of a unique batch import
  • PRO/CON it is unclear whether consistency (KEY/KEYREF) can be ensured across multiple XML files
    • it seems that only a very restricted subset of XPath is allowed for selector and field, which excludes references to external files (see Section 9.2.5 of "XML Schema", O'Reilly)
    • on the other hand, it is possible to use XInclude to build a single XML document from several XML files. The key/keyref constraints can then be specified on the main document (i.e. the file that includes all the chunks). (see a simple example)
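
A minimal sketch of key/keyref constraints declared on the main document's root element follows. The structure is deliberately flattened, and the ChargeVote element with its @charge_id and @vote attributes is hypothetical: it only stands in for whatever element ends up referring to a Charge. The selector paths stay within the restricted XPath subset mentioned above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="OpenMunicipio">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Charge" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="id" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
        <!-- Hypothetical element, used only to show a reference to a Charge -->
        <xs:element name="ChargeVote" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="charge_id" type="xs:string" use="required"/>
            <xs:attribute name="vote" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- Every Charge/@id must be present and unique in the whole document -->
    <xs:key name="chargeKey">
      <xs:selector xpath=".//Charge"/>
      <xs:field xpath="@id"/>
    </xs:key>
    <!-- Every ChargeVote/@charge_id must match an existing Charge/@id -->
    <xs:keyref name="chargeVoteRef" refer="chargeKey">
      <xs:selector xpath=".//ChargeVote"/>
      <xs:field xpath="@charge_id"/>
    </xs:keyref>
  </xs:element>
</xs:schema>
```

Because the constraints are scoped to the root element, they apply to the merged tree once the includes have been resolved, regardless of which file each Charge or ChargeVote originally came from.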

guglielmo's remark: a multiple files solution allows for a better protocol in receiving deltas (variations in time)

Simplifications

  • birth_date and birth_place can be extracted from ssn
  • sex can be extracted from ssn

guglielmo's remark: the ssn is usually not provided among the exported data; however, given first name and last name, it is possible to retrieve all the other information using one of openpolis's APIs

Resources

"From Entity Relationship to XML Schema: a graph-theoretic approach", 2009

"NetML @ UniRoma3"

"An algorithm for generating XML Schemas from ER Schemas"

"XML Schema", O'Reilly, 2002

"Combining XML Documents with XInclude" @ Microsoft