Difference between revisions of "Data Model"

Revision as of 18:11, 23 January 2020

This is a primer about the Open Risk Data model using a concrete example. For a more technical and general specification please check the underlying Wikibase Data Model.

Overview of the Open Risk Data model

The Open Risk Data knowledge base content can be summarized as a collection of Entities. Entities are thus the basic data elements of the knowledge base. There are two predefined kinds of Entities: Items and Properties.

Items

Most pages in the Open Risk Data knowledge base describe one item. Items are the way Open Risk Data refers to anything in scope. Usually items are data points (or metadata) relevant for risk management. For example, Open Risk Data will have an Item for any concrete Risk Event recorded, such as "A risk event involving Wonga". Hence both the abstract concept "Risk Event" and a concrete realization of it are possible Items.

For people familiar with standard databases, an Item roughly corresponds to an identifiable record (or a line in a spreadsheet / CSV file). But in this case it is a record that is available online!
For people familiar with graph databases, items are the nodes of a graph, along with a label and a description.
For people familiar with semantic data, items are the subjects of an RDF triple.

Properties

Properties are special entities (also described in pages) that help construct Statements about items. For example, a relevant statement is: "Risk Event X has date YYYY-MM-DD" which is using the property "has date".

Properties may associate basic datatypes to items (as in the date example) in which case they act as column cells of a spreadsheet.
Properties may also link two items. For example, the statement "Risk Event X involves Entity Y" uses the "involves entity" property (P6) to link an event to an entity. In graph terms, a property expresses an edge between two nodes.

Hence the core of the data model involves Items with an arbitrary set of Statements that are constructed using Properties, basic data and other Items.
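
As a rough illustration of the graph reading above, the sketch below stores statements as (subject, property, value) triples. Q73, P6 and P8 are the identifiers linked from this page; the item Q100, the date value and the helper `edges` are hypothetical and serve only the illustration.

```python
from typing import List, NamedTuple, Set


class Triple(NamedTuple):
    subject: str  # item ID, e.g. "Q73"
    prop: str     # property ID, e.g. "P6"
    value: str    # another item ID or a basic value such as a date


# Placeholder data: Q100 and the date are made up for illustration.
statements = [
    Triple("Q73", "P8", "2020-01-01"),  # "has date": a basic datatype value
    Triple("Q73", "P6", "Q100"),        # "involves entity": a link to another item
]

ITEM_IDS = {"Q73", "Q100"}


def edges(triples: List[Triple], item_ids: Set[str]) -> List[Triple]:
    """Return only the triples whose value is another item (graph edges)."""
    return [t for t in triples if t.value in item_ids]


print(edges(statements, ITEM_IDS))
```

Only the second triple is an edge between two nodes; the first behaves like a spreadsheet cell attached to the item.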

The Full Data Model Structure

The full description of Items and Properties is as follows (annotations can be multi-lingual - we ignore this here for simplicity).

Item:
1. Item identifier (a serial ID number, prefixed with Q). This is a unique ID for an Item in the context of the Open Risk Data instance
2. Fingerprint, consisting of:
  1. Multilingual label, a human readable label
  2. Multilingual description, a longer description of the item
  3. Multilingual aliases, other possible labels
3. Statements, associated with the item, each consisting of:
  1. Claim, consisting of:
    1. Property
    2. Value
    3. Qualifiers (additional property-value pairs)
  2. References (each consisting of one or more property-value pairs)
  3. Rank
4. Site links
Property:
1. Property identifier (number prefixed with P)
2. Fingerprint, consisting of:
  1. label
  2. description
  3. aliases
3. Statements, each consisting of:
  1. Claim, consisting of:
    1. Property
    2. Value
    3. Qualifiers (additional property-value pairs)
  2. References (each consisting of one or more property-value pairs)
  3. Rank
4. Datatype
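
The outline above can be mirrored in code. The following is a minimal sketch using Python dataclasses, ignoring multilingual annotations as the outline does. The class and field names are assumptions chosen to match the outline, not an actual Wikibase or Open Risk Data API, and the date value is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

PropertyValuePair = Tuple[str, Any]  # (property ID, value)


@dataclass
class Fingerprint:
    label: str
    description: str = ""
    aliases: List[str] = field(default_factory=list)


@dataclass
class Claim:
    property_id: str  # e.g. "P8"
    value: Any
    qualifiers: List[PropertyValuePair] = field(default_factory=list)


@dataclass
class Statement:
    claim: Claim
    # Each reference consists of one or more property-value pairs.
    references: List[List[PropertyValuePair]] = field(default_factory=list)
    rank: str = "normal"  # "preferred", "normal" or "deprecated"


@dataclass
class Item:
    item_id: str  # "Q" followed by a serial number
    fingerprint: Fingerprint
    statements: List[Statement] = field(default_factory=list)
    site_links: List[str] = field(default_factory=list)


@dataclass
class Property:
    property_id: str  # "P" followed by a number
    fingerprint: Fingerprint
    datatype: str
    statements: List[Statement] = field(default_factory=list)


# Usage: an item with a single, unreferenced statement (placeholder date).
event = Item(
    "Q73",
    Fingerprint("A risk event involving Wonga"),
    [Statement(Claim("P8", "2020-01-01"))],
)
print(event.statements[0].rank)
```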

Going Deeper into Items

As part of their fingerprint, every item has a label (a name) and a description in each supported language; within a language, the combination of label and description must be unique. The description helps disambiguate items that share the same label but refer to different things. In addition to labels, items can have aliases, which provide alternative names under which an item can be found. Aliases are meant to offer the user search convenience, much like redirects on Wikipedia, and thus even popular misspellings may be used as aliases.
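
To make the uniqueness rule concrete, here is a small sketch of how clashes between label and description combinations could be detected for one language. The function name, item IDs, labels and descriptions are all hypothetical; how the knowledge base enforces the rule internally is not described here.

```python
from typing import Dict, List, Tuple


def find_fingerprint_clashes(
    items: Dict[str, Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return pairs of item IDs that share the same (label, description).

    `items` maps an item ID to its (label, description) in a single language.
    """
    seen: Dict[Tuple[str, str], str] = {}
    clashes = []
    for item_id, fingerprint in items.items():
        if fingerprint in seen:
            clashes.append((seen[fingerprint], item_id))
        else:
            seen[fingerprint] = item_id
    return clashes


# Same label three times; only Q1 and Q3 also share the description.
items_en = {
    "Q1": ("Default", "A failure to repay a loan"),
    "Q2": ("Default", "A preset configuration value"),
    "Q3": ("Default", "A failure to repay a loan"),
}
print(find_fingerprint_clashes(items_en))  # [('Q1', 'Q3')]
```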

Going Deeper into Statements

In line with a core Wikibase design choice, "Open Risk Data will not be about the truth, but about statements and their references." This means that in Open Risk Data we do not actually model the items themselves, but statements about them. We do not say: Company X went bankrupt in August 1975. We say: There is a statement that Company X went bankrupt in August 1975, according to a reference to a certain Court record.

A statement may consist of

one property (in the example, "went bankrupt")
one value (a date)
optionally one or more qualifiers (additional property-value pairs)
optionally one or more references (a Court record)

The property, value, and qualifiers together are also called the claim, which together with any source references forms a statement.

Properties are described on their own pages. Like items, properties have labels and descriptions; in addition, they have a data type associated with them and perhaps additional properties. The data type defines the type of the value used with this property. The set of properties is created and maintained by Open Risk to accommodate the requirements of available data sets.

Values themselves can be either very simple -- another item or just a string -- or quite complex, like a geographic shape, a measurement with a unit and an accuracy, or a time period. We will describe values in more detail in their own page in the future. The set of data types is (mostly) predefined.

There are two special values, mostly regardless of their data type: none and unknown. None means that we know that the given property has no value, e.g. Elizabeth I of England had no spouse. Unknown means that the property has a value, but it is unknown which one -- e.g. Pope Linus most certainly had a year of birth, but it is unknown to us. This should not be mixed up with the notion that it is unknown whether an item has a value for a specific property, e.g., if a person had children. Both none and unknown are also not to be confused with the respective string: having the name "unknown" is different from having an unknown name (which is again different from it being unknown whether the entity has a name).
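
The distinctions drawn above can be sketched as follows. This is only an illustration under assumed names (`ValueKind`, `PropertyValue`, `describe`), not the representation used by the knowledge base: "none" and "unknown" are modelled as kinds of value rather than as strings, and the absence of any statement is a separate, third situation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class ValueKind(Enum):
    VALUE = "value"      # an ordinary value is given
    NONE = "none"        # we know the property has no value
    UNKNOWN = "unknown"  # the property has a value, but we do not know it


@dataclass(frozen=True)
class PropertyValue:
    kind: ValueKind
    value: Any = None  # only meaningful when kind is VALUE


def describe(name: Optional[PropertyValue]) -> str:
    """Explain what a (possibly missing) 'name' statement tells us."""
    if name is None:
        return "no statement: unknown whether a name exists"
    if name.kind is ValueKind.NONE:
        return "known to have no name"
    if name.kind is ValueKind.UNKNOWN:
        return "has a name, but it is unknown"
    return f"name is {name.value!r}"


print(describe(PropertyValue(ValueKind.VALUE, "unknown")))  # the string "unknown"
print(describe(PropertyValue(ValueKind.UNKNOWN)))           # an unknown name
print(describe(None))                                       # no statement at all
```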

References offer a source that supports the given claim. There can be several references given for a statement. We are still working on how to further structure a reference, but in general they will point to a source (which would be a Wikibase item in its own right, e.g. a book, a website, etc.) and have further information, like the page where the claim is supported, etc. A claim without references is not necessarily wrong, nor is a claim with references true. It is still up to the reader of the statement to decide if they want to trust the claim or not. We will describe references in more detail in their own page in the future.

Qualifiers

Qualifiers are used to further describe or refine the value of a property given in a statement. They consist of a property and a value, which are the same as for statements.

While it would be convenient if we could express all the data we need for our use cases with simple property-value pairs, this is unfortunately not the case. Many statements require further qualifiers in order to be expressed. In order to reduce the number of properties to a manageable size, qualifiers are used to further specify the statement in some way. The qualifier is an integral part of the statement: take away the qualifier, and the meaning of the statement is changed. This is far less true for the references.
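
The difference between removing a qualifier and removing a reference can be shown with a short sketch. All identifiers and values below (P100, P101, Q200, the amount and the year) are placeholders, and the two classes are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Claim:
    property_id: str
    value: str
    qualifiers: Tuple[Tuple[str, str], ...] = ()


@dataclass(frozen=True)
class Statement:
    claim: Claim
    references: Tuple[str, ...] = ()


# Hypothetical statement: "loss amount 1000000, as of 2019", backed by source Q200.
qualified = Statement(
    Claim("P100", "1000000", qualifiers=(("P101", "2019"),)),
    references=("Q200",),
)

# Drop the qualifier in one copy and the reference in another.
without_qualifier = replace(qualified, claim=replace(qualified.claim, qualifiers=()))
without_reference = replace(qualified, references=())

print(qualified.claim == without_qualifier.claim)  # False: the claim itself changed
print(qualified.claim == without_reference.claim)  # True: same claim, less support
```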

Ranks

As there are potentially many different statements for a given item and property, we need to select which ones to return when the database is queried. In order to facilitate this, three ranks of statements are introduced. There can be any number of statements in each rank, but within each rank, their order is not significant.

Preferred statements: if preferred statements exist, these statements are returned in response to a query.
Normal statements: if there are no preferred statements (or the query explicitly says to include normal statements too), these statements are returned.
Deprecated statements: for statements that are being discussed, or known to be erroneous, but still listed for the sake of completeness or in order to prevent them being constantly added and removed.

Within Open Risk Data, the ranks are also used to make the display cleaner. Only the preferred statements are displayed by default, and the reader has to click on a link like "more values" in order to see the normal-ranked statements.
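
The selection rule described above can be sketched as a small function. This is an illustration of the stated rule under simplifying assumptions (a statement is reduced to a value and a rank, and the name `select_statements` is hypothetical); it is not the actual query implementation.

```python
from typing import List, Tuple

RankedStatement = Tuple[str, str]  # (value, rank)


def select_statements(
    statements: List[RankedStatement], include_normal: bool = False
) -> List[RankedStatement]:
    """Return preferred statements if any exist, otherwise normal ones.

    Deprecated statements are never returned by this default selection.
    """
    preferred = [s for s in statements if s[1] == "preferred"]
    normal = [s for s in statements if s[1] == "normal"]
    if preferred:
        return preferred + normal if include_normal else preferred
    return normal


data = [("a", "normal"), ("b", "preferred"), ("c", "deprecated")]
print(select_statements(data))                       # [('b', 'preferred')]
print(select_statements(data, include_normal=True))  # [('b', 'preferred'), ('a', 'normal')]
```

The `include_normal` flag corresponds to a query that explicitly asks for normal statements too, much like following the "more values" link in the display.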

@@ Line 1: / Line 1: @@
-This is a primer to the Open Risk Data model. For a more technical specification please check the underlying [https://www.mediawiki.org/wiki/Wikibase/DataModel Wikibase Data Model].
+This is a primer about the Open Risk Data model using a concrete example. For a more technical and general specification please check the underlying [https://www.mediawiki.org/wiki/Wikibase/DataModel Wikibase Data Model].
 == Overview of the Open Risk Data model ==
-The Open Risk Data knowledge base content can be summarized as a collection of Entities. '''Entities''' are the basic elements of the knowledge base, which can be described and referenced using the Open Risk Data model. There are two predefined kinds of Entities: Items and Properties.
+The Open Risk Data knowledge base content can be summarized as a collection of ''Entities''. Entities are thus the basic data elements of the knowledge base. There are two ''predefined'' kinds of Entities: '''Items''' and '''Properties'''.
-One page in the Open Risk Data base describes one '''item'''. Items are the way Open Risk Data refers to anything in scope, and usually are data or metadata relevant for risk management. So for example in Open Risk Data we will have an Item for any concrete [[Item:Q6 | Risk Event]] recorded.
+== Items ==
+Most pages in the Open Risk Data base describe one '''item'''. Items are the way Open Risk Data refers to anything in scope. Usually items are data points (or metadata) relevant for risk management. So for example in Open Risk Data we will have an Item for any concrete [[Item:Q6 | Risk Event]] recorded, such as [[Item:Q73 | A risk event involving Wonga]]. Hence both the ''abstract concept'' "Risk Event", and a ''concrete realization'' are possible Items.
+* For people familiar with standard databases, an Item roughly corresponds to an identifiable record (or a line in a spreadsheet / CSV file). But in this case it is a record that is available online!
+* For people familiar with graph databases, items are the nodes of a graph, along with a label and a description
+* For people familiar with semantic data, items are the subjects of an RDF triple
+== Properties ==
+Properties are special entities (also described in pages) that help construct ''Statements'' about items. For example, a relevant statement is: "Risk Event X [[Property:P8 | has date]] YYYY-MM-DD" which is using the property "has date".
+* Properties may associate ''basic datatypes'' to items (as in the date example) in which case they act as column cells of a spreadsheet.
+* Propoerties may also ''link'' two items. For example, the statement "Risk Event X [[Property:P6]] | involves]] Entity Y" uses the "is involved" property to link an event to an entity. In graph terms, a property expresses an ''edge between two nodes''.
-== Data Model Structure ==
+Hence the core of the data model involves Items with an arbitrary set of Statements that are constructed using Properties, basic data and other Iterms
-The description of Items and Properties are structured as follows.
+== The Full Data Model Structure ==
+The full description of Items and Properties is as follows (annotations can be multi-lingual - we ignore this here for simplicity).
 # '''Item''':
-## '''Item identifier''' (a serial ID number, prefixed with ''Q'')
+## '''Item identifier''' (a serial ID number, prefixed with ''Q''). This is a unique ID for an Item in the context of the Open Risk Data instance
 ## '''Fingerprint''', consisting of:
-### Multilingual '''label'''*
+### Multilingual '''label''', a human readable label
-### Multilingual '''description'''*
+### Multilingual '''description''', a longer description of the item
-### Multilingual '''aliases'''
+### Multilingual '''aliases''', other possible labels
-## '''Statements''', each consisting of:
+## '''Statements''', associated with the item, each consisting of:
 ### '''Claim''', consisting of:
 #### Property
@@ Line 26: / Line 39: @@
 ## '''Property identifier''' (number prefixed with ''P'')
 ## '''Fingerprint''', consisting of:
-### Multilingual '''label'''*
+### '''label'''
-### Multilingual '''description'''*
+### '''description'''
-### Multilingual '''aliases'''
+### '''aliases'''
 ## '''Statements''', each consisting of:
 ### '''Claim''', consisting of:
@@ Line 38: / Line 51: @@
 ## '''Datatype'''
-<nowiki>*</nowiki>) Unless label and/or description of an entity are not empty, within the scope of an entity type, an entity's combination of label and description in a certain language must be unique.<br>
-== Items ==
+== Going Deeper into Items ==
-Every item has a '''label''' (a name) and a '''description''' in each supported language. Just the label would not be enough as it may be ambiguous: Berlin could refer to the [[:en:Berlin|capital of Germany]], one of more than a dozen cities in the US, a [[:en:Berlin (album)|Lou Reed album]], an [[:en:Berlin (band)|American new wave band]], or [[:en:Berlin (disambiguation)|many other things]]. The label and the description together should identify the meaning of an item, e.g. the label "Berlin" and the description "A city in Germany" should be uniquely identifying in each language.
+As part of their fingerprint, every item has a '''label''' (a name) and a '''description''' in each supported language which must be a unique combination. The description helps disambiguate concepts that express different information using the same label. In addition to labels, items can have '''aliases''' which provide alternative names for an item to be found. Aliases are meant to offer the user search convenience, much like redirects on Wikipedia, and thus even popular misspellings may be used as aliases.
-In addition to labels, items can have '''aliases''' which provide alternative names for an item to be found. ''"[[:en:George H. W. Bush|George H. W. Bush]]"'' might also be found under ''"George Bush"'', and so might his son. Aliases are meant to offer the user search convenience, much like redirects on Wikipedia, and thus even popular misspellings may be used as aliases.
+== Going Deeper into Statements ==
+In line with a core Wikibase design choice, is that ''"Open Risk Data will not be about the truth, but about statements and their references."'' This means that in Open Risk Data we do not actually model the ''items themselves'', but ''statements about them''. We do not say that: '''Company X has went bankrupt in August 1975''', we say: '''There is this a statement about Company X going bankrupt in August 1975 according to a reference to a certain Court record'''.
-== Statements ==
+A '''statement''' may consist of
-One of the [[Wikibase/Notes/Requirements|requirements]] is that ''"Wikibase will not be about the truth, but about statements and their references."'' This means that in Wikibase we do not actually model the ''items themselves'', but ''statements about them''. We do not say that Berlin has a population of 3,5 M, we say that there is this statement about Berlin's population being 3,5 M as of 2011 according to the German statistical office.
+* one property (in the example, "went bankruptcy")
+* one value (a date)
+* optionally one or more references (a Court record)
-A '''statement''' may consist of
-* one property (in the example, "population")
-* one value (3,5 M)
-* optionally one or more qualifiers (in this example, "as of 2011" is one of the qualifiers)
-* optionally one or more references (the Germans statistical office)
 The property, value, and qualifiers together are also called the '''claim''', which together with any source references forms a statement.
-There can be several statements about the same property: people can have several children, books might have several authors. Also, there might be diverging points of view on the population of a city -- official numbers and UN estimates, for example. Or there might be values with different qualifiers, like points in time or measurement methods. For a few examples, see below.
+'''Properties''' are described on their pages. Properties also have labels and descriptions, and additionally to that they also have a data type associated with them and perhaps additional properties. The data type defines the type of the value used with this property. The set of properties is created and maintained by Open Risk to accomodate the requirements of available data sets.
-'''Properties''' are described on their own wiki pages in Wikibase. Properties also have labels and descriptions, and additionally to that they also have a data type associated with them and perhaps additional properties. The data type defines the type of the value used with this property. The set of properties is created and maintained by the Wikibase editors.
-'''Values''' themselves can be either very simple -- another item or just a string -- or quite complex beasts, like a geographic shape, a measurement with a unit and an accuracy, or a time period. We will describe values in more detail in their own page in the future. The set of data types is (mostly) predefined.
+'''Values''' themselves can be either very simple -- another item or just a string -- or quite complex, like a geographic shape, a measurement with a unit and an accuracy, or a time period. We will describe values in more detail in their own page in the future. The set of data types is (mostly) predefined.
 There are two special values, mostly regardless of their data type: '''none''' and '''unknown'''. ''None'' means that we know that the given property has no value, e.g. [[:en:Elizabeth I of England|Elizabeth I of England]] had no spouse. ''Unknown'' means that the property has a value, but it is unknown which one -- e.g. [[:en:Pope Linus|Pope Linus]] most certainly had a year of birth, but it is unknown to us. This should not be mixed up with the notion that it is unknown whether an item has a value for a specific property, e.g., if a person had children. Both ''none'' and ''unknown'' are also not to be confused with the respective string: having the name ''"unknown"'' is different from having an unknown name (which is again different from it being unknown whether the entity has a name).
 '''References''' offer a source that supports the given claim. There can be several references given for a statement. We are still working on how to further structure a reference, but in general they will point to a source (which would be a Wikibase item in its own right, e.g. a book, a website, etc.) and have further information, like the page where the claim is supported, etc. A claim without references is not necessarily wrong, nor is a claim with references true. It is still up to the reader of the statement to decide if they want to trust the claim or not. We will describe references in more detail in their own page in the future.
-=== Example statements ===
-Two statements without qualifiers:
-<div style="padding: 2ex; border:#444 solid 1px; ">
-{{Wikibase statement|item=Berlin|property=Area|value=891.85 km²|numberofsources=1}}
-{{Wikibase statement|property=Mayor|value=[[:en:Michael Müller (politician)|Michael Müller]]|numberofsources=0}}
-</div>
-One statement with two qualifiers:
-<div style="padding: 2ex; border:#444 solid 1px; ">
-{{Wikibase statement|item=Germany|property=Chancellor|value=[[:en:Angela Merkel|Angela Merkel]]|qualifier1=since|value1=2005|qualifier2=Party|value2=[[:en:CDU|CDU]]|numberofsources=2}}
-</div>
-Two statements with the same property, each with one qualifer:
-<div style="padding: 2ex; border:#444 solid 1px; ">
-{{Wikibase statement|item=Berlin|property=Population|value=3,500,000|qualifier1=as of|value1=2012}}
-{{Wikibase statement|value=8,000|qualifier1=as of|value1=15th century|numberofsources=1}}
-</div>
 === Qualifiers ===
 '''Qualifiers''' are used to further describe or refine the value of a property given in a statement. They consist of a property and a value, which are the same as for statements.
-While it would be convenient if we could express all the data we need for the use cases of Wikibase with simple property-value pairs, this is unfortunately not the case. Many statements require further qualifiers in order to be expressed. In order to reduce the number of properties to a manageable size, qualifiers are used to further specify the statement in some way. Qualifiers can be used in a number of ways, as shown by the following examples.
+While it would be convenient if we could express all the data we need for our use cases with simple property-value pairs, this is unfortunately not the case. Many statements require further qualifiers in order to be expressed. In order to reduce the number of properties to a manageable size, qualifiers are used to further specify the statement in some way. The qualifier is an integral part of the statement: take away the qualifier, and the meaning of the statement is changed. This is far less true for the references.
-A qualifier can modify what the item means (''"France: Area 213,010 sq mi - excluding Adélie Land"''), the property (''"Berlin: Population 3,500,000 - method Estimation"''), constrain the validity of the value ("''Germany: Population 80,000,000 - as of 2011"''), or offer further details (''"Austria: Religion Catholic - Percentage 64,8%"'' or ''"Goldfinger: Actor Sean Connery - Role James Bond"''), etc. A catch-all qualifier is expected to be "annotation" or something similar.
-It is open to the Wikibase community to maintain and use qualifiers in a way that makes sense to them and for their use cases. The qualifier is an integral part of the statement: take away the qualifier, and the meaning of the statement is changed. This is far less true for the references.
 === Ranks ===
-As there are potentially many different statements for a given item and property, we need to select which ones to return when Wikibase gets asked. In order to facilitate this, three '''ranks''' of statements are introduced. There can be any number of statements in each rank, but within each rank, their order is not significant.
+As there are potentially many different statements for a given item and property, we need to select which ones to return when the database gets asked. In order to facilitate this, three '''ranks''' of statements are introduced. There can be any number of statements in each rank, but within each rank, their order is not significant.
-* '''Preferred statements''': if preferred statements exist, these statements are returned in response to a query. They would, e.g. for a population contain the most recent one as long as it is regarded as sufficiently reliable. Wikibase editors might decide to mark several statements as preferred: this may be used to indicate disagreement, reflecting the knowledge diversity on the issue, or it may be used to express the notion of actually having multiple values (in case of properties like "children").
+* '''Preferred statements''': if preferred statements exist, these statements are returned in response to a query.
-* '''Normal statements''': if there are no preferred statements (or the query explicitly says to include normal statements too), these statements are returned. Historical values, like the population of a country in the past, might be here, as well as less representative sources which are still considered relevant.
+* '''Normal statements''': if there are no preferred statements (or the query explicitly says to include normal statements too), these statements are returned.
-* '''Deprecated statements''': for statements that are being discussed, or known to be erroneous, but still listed for the sake of completion or in order to prevent them being constantly added and removed. Deprecated statements only appear in search results if they are explicitly added or if they are selected based on their source. A footnote qualifier should usually accompany other-ranked statements.
+* '''Deprecated statements''': for statements that are being discussed, or known to be erroneous, but still listed for the sake of completion or in order to prevent them being constantly added and removed.
-Within Wikibase, the ranks are also used to make the display cleaner. Only the preferred statements are displayed by default, and the reader has to click on a link like ''"more values"'' in order to see the normal-ranked statements.
+Within Open Risk Data, the ranks are also used to make the display cleaner. Only the preferred statements are displayed by default, and the reader has to click on a link like ''"more values"'' in order to see the normal-ranked statements.
-==See also==
+== See also ==
 [[Category:Documentation]]