Opt. 2: Removing Redundant Attribute Values
Shared String Table
A SpatialJSON response may contain a Shared String Table holding strings that are referenced by some features' properties. Only properties expressed as JSON strings can be stored in a shared string table. (Currently, temporal values such as Dates and Timestamps, which are expressed as strings as well, are not stored in a shared string table.)
If present, a new "sharedStrings" property is available in the top-level "FeatureCollection" object:
{
"type": "FeatureCollection",
"$note": " /* remaining properties go here */ ",
"schemaInformation": {
"propertyNames": ["str_1", "num_2", "str_3", "str_4", "bool_5"],
"geometryName": "the_geom"
},
"sharedStrings": {
"indexes": [0, 2, 3],
"table": ["Lorem ipsum dolor sit amet,",
"consetetur sadipscing elitr,",
"sed diam nonumy eirmod tempor invidunt ut labore",
"et dolore magna aliquyam erat,",
"sed diam voluptua."]
}
}
It contains these two properties:
- "table": Contains the shared strings. These are referenced by their index in the array.
- "indexes": Contains the zero-based indexes of feature properties that may be stored in this shared string table.
In SpatialJSON, a feature's properties are stored in an array (in contrast to GeoJSON, which stores properties in an object). The "indexes" array contains the positions in these properties arrays that may have their values stored in the shared string table. In a feature's properties array, such a value may be either null, a regular JSON string, or a JSON number (an integer). In the latter case, the property's value is stored in the shared string table, and the number is used as the index into that table.
These examples show how some features' properties arrays are evaluated using the above string table:
/* showing properties array of feature #1 */
properties: ["foo", 23, 2, null, true]
/* gets evaluated to */
properties: {
"str_1": "foo",
"num_2": 23,
"str_3": "sed diam nonumy eirmod tempor invidunt ut labore",
"str_4": null,
"bool_5": true
}
/* showing properties array of feature #2 */
properties: [1, 32, "K", 3, false]
/* gets evaluated to */
properties: {
"str_1": "consetetur sadipscing elitr,",
"num_2": 32,
"str_3": "K",
"str_4": "et dolore magna aliquyam erat,",
"bool_5": false
}
As the examples show, there is no guarantee that all strings of a property whose index is part of the sharedStrings.indexes array are actually stored in the shared string table.
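The evaluation shown in the examples above can be sketched in a few lines of Python. This is a minimal illustration, not code from the module: the function name resolve_properties and its parameters are hypothetical, and the sketch assumes the response has already been parsed into Python objects.

```python
def resolve_properties(values, property_names, shared_strings=None):
    """Turn a SpatialJSON properties array into a name -> value mapping.

    Hypothetical helper: 'shared_strings' is the parsed "sharedStrings"
    object, or None if the response does not contain one.
    """
    indexes = set(shared_strings["indexes"]) if shared_strings else set()
    table = shared_strings["table"] if shared_strings else []
    resolved = {}
    for position, (name, value) in enumerate(zip(property_names, values)):
        # Only positions listed in "indexes" may hold a table reference, and
        # only an integer found there is one; null and plain strings pass
        # through unchanged. bool is excluded because it subclasses int.
        is_reference = (
            position in indexes
            and isinstance(value, int)
            and not isinstance(value, bool)
        )
        resolved[name] = table[value] if is_reference else value
    return resolved


property_names = ["str_1", "num_2", "str_3", "str_4", "bool_5"]
shared_strings = {
    "indexes": [0, 2, 3],
    "table": [
        "Lorem ipsum dolor sit amet,",
        "consetetur sadipscing elitr,",
        "sed diam nonumy eirmod tempor invidunt ut labore",
        "et dolore magna aliquyam erat,",
        "sed diam voluptua.",
    ],
}

feature_1 = resolve_properties(["foo", 23, 2, None, True], property_names, shared_strings)
feature_2 = resolve_properties([1, 32, "K", 3, False], property_names, shared_strings)
```

Note that the number 23 at position 1 is left alone because position 1 is not listed in "indexes", even though it is an integer.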
SpatialJSON Writer Implementation
It is completely up to the SpatialJSON writer to decide which strings to add to the shared string table. Several strategies can be used. However, the current implementation in this module makes no attempt to create an optimal shared string table. In order to be fast, strings are added as they come when features are serialized. Building an optimal table would likely require iterating features several times, calculating frequencies of strings, etc.
Nevertheless, this module's SpatialJSON writer has some simple rules for building the shared string table. Even for worst-case scenarios, these try (at least) not to use (much) more bytes than needed for the same result without using a shared string table. (In theory, there are cases in which the shared string table adds some extra bytes to the result.) However, for most real-world datasets, this strategy can save a moderate to significant number of bytes.
These are the rules that prevent a string from being added to the shared string table:
- The string's UTF-8 encoded byte length is less than a hard-coded minimum (currently 2, may be configurable in the future)
- The shared string table is full, that is, it contains 2,147,483,647 entries (not really expected)
- The string's UTF-8 encoded byte length (including quotes) is less than the number of digits of its designated index
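The three rules above, combined with the "add strings as they come" approach, might look roughly like the following Python sketch. It is not the module's Java implementation: the names may_share and SharedStringTable are hypothetical, and the quoted length is approximated as the UTF-8 byte length plus two, ignoring any JSON escape sequences.

```python
MIN_BYTE_LENGTH = 2             # hard-coded minimum named in the rules
MAX_TABLE_SIZE = 2_147_483_647  # table capacity named in the rules


def may_share(value: str, next_index: int) -> bool:
    """Return False if one of the three rules keeps 'value' out of the table."""
    byte_length = len(value.encode("utf-8"))
    if byte_length < MIN_BYTE_LENGTH:
        return False  # rule 1: string is too short
    if next_index >= MAX_TABLE_SIZE:
        return False  # rule 2: table is full
    # Rule 3: the quoted string would be shorter than its index written as
    # digits. Assumption: quoted length = byte length + 2 quote characters.
    if byte_length + 2 < len(str(next_index)):
        return False
    return True


class SharedStringTable:
    """Single-pass table: strings are added in the order they are seen."""

    def __init__(self):
        self.table = []
        self._index_of = {}

    def encode(self, value: str):
        """Return the table index for 'value', or 'value' itself if unshared."""
        if value in self._index_of:
            return self._index_of[value]
        next_index = len(self.table)
        if not may_share(value, next_index):
            return value
        self._index_of[value] = next_index
        self.table.append(value)
        return next_index


strings = SharedStringTable()
encoded = [strings.encode(s) for s in ["forest", "K", "forest", "meadow", "forest"]]
```

Because the table is filled in a single pass, a string that occurs only once still occupies a table slot, which is the extra-bytes case mentioned above.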
Obviously, most savings can be achieved if a dataset contains only a few different large strings. That may be the case for attributes that contain values of an enumeration, for example. The more often a certain string is used in the dataset, the more space can be saved by using a shared string table. In contrast, if every string in the set of encoded features is used only once (e.g. attributes that contain random or UUID-like strings), no savings will be achieved (in fact, using a shared string table in that case will produce slightly bigger results).
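The effect can be checked with a rough size comparison. The following Python sketch uses a deliberately simplified payload (a bare list of one-element properties arrays, not a complete SpatialJSON response) and made-up sample values. The helpers encoded_size and share_all are hypothetical, and share_all shares every string without applying the writer rules.

```python
import json


def encoded_size(rows, table=None):
    """Byte size of a simplified payload, with or without a string table."""
    payload = {"features": rows}
    if table is not None:
        payload["sharedStrings"] = {"indexes": [0], "table": table}
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


def share_all(rows):
    """Replace every string by its table index (no writer rules applied)."""
    table, index_of, encoded = [], {}, []
    for (value,) in rows:
        if value not in index_of:
            index_of[value] = len(table)
            table.append(value)
        encoded.append([index_of[value]])
    return encoded, table


# An enumeration-like attribute: 3 distinct values repeated over 300 features.
land_use = ["residential", "commercial", "industrial"]
enum_rows = [[land_use[i % 3]] for i in range(300)]
# An identifier-like attribute: every value occurs exactly once.
unique_rows = [[f"id-{i:06d}"] for i in range(300)]

enum_plain = encoded_size(enum_rows)
enum_shared = encoded_size(*share_all(enum_rows))
unique_plain = encoded_size(unique_rows)
unique_shared = encoded_size(*share_all(unique_rows))

print("enumeration:", enum_plain, "->", enum_shared)
print("unique ids: ", unique_plain, "->", unique_shared)
```

For the enumeration-like attribute, the shared variant is smaller; for the unique identifiers it is larger, since each string is stored once in the table and every feature additionally carries an index.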
Shared Strings per Request Customization
By default, the current implementation considers all JSON string encoded properties for the shared string table (except temporal values, like Dates and Timestamps, which technically are JSON strings as well; we do not expect much redundancy in temporal values). With the format_options vendor parameter, it is possible to specify which properties can store values in the shared string table or to skip the creation of such a table completely.
The supported format option is:
sharedstrings (default is *) - Specify false or leave empty (e.g. format_options=sharedstrings:) to skip shared string table generation, or true or * to create a table including all JSON string encoded properties (that is the default behavior). Alternatively, a comma-separated list of property names can specify the set of properties that may store their values in the shared string table.
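As an illustration, a client could assemble the value of the format_options parameter as in the following Python sketch. The helper sharedstrings_option is hypothetical and not part of any client library. It only builds the parameter value, so URL encoding and the rest of the request are left to the caller. Commas inside property names are escaped with a backslash, as described for property name lists.

```python
def sharedstrings_option(spec=True):
    """Build the value for the format_options vendor parameter.

    spec: False or None to skip the table, True or "*" to include all string
    properties, or a list of property names.
    """
    if spec is False or spec is None:
        return "sharedstrings:"   # empty value: skip table generation
    if spec is True or spec == "*":
        return "sharedstrings:*"  # the default behavior, spelled out
    # A list of names: escape commas inside a name with a backslash.
    escaped = [name.replace(",", "\\,") for name in spec]
    return "sharedstrings:" + ",".join(escaped)


skip_table = sharedstrings_option(False)
all_properties = sharedstrings_option(True)
some_properties = sharedstrings_option(["land_use", "adm_name"])
```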
When a comma-separated list of property names is specified for the sharedstrings format option, these additional rules apply:
- Commas in property names (really?) may be escaped with a backslash character (\).
- The prefix re: may be prepended to the list in order to designate each item a Java Regular Expression (e.g. format_options=sharedstrings:re:adm_.*,\d\d_[a-z]+$). See Java Pattern class. Specifying an invalid regular expression results in a Service Exception.
- The prefix glob: may be prepended to the list in order to designate each item a glob pattern (e.g. format_options=sharedstrings:glob:adm_*,[0-9][0-9]_*name). See glob patterns. Specifying an invalid glob pattern results in a Service Exception.
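How such a list could be interpreted is sketched below in Python. Several points are assumptions rather than documented behavior: the helper names are hypothetical, only a backslash directly before a comma is treated as an escape (so regular expression escapes like \d stay intact), patterns are assumed to match the whole property name, and Python's re and fnmatch modules stand in for Java's Pattern class and glob matching, whose syntax differs in some details.

```python
import fnmatch
import re


def split_unescaped_commas(text):
    """Split on commas, treating backslash-comma as a literal comma."""
    items, current, i = [], [], 0
    while i < len(text):
        char = text[i]
        if char == "\\" and i + 1 < len(text) and text[i + 1] == ",":
            current.append(",")  # escaped comma belongs to the item
            i += 2
        elif char == ",":
            items.append("".join(current))  # unescaped comma ends the item
            current = []
            i += 1
        else:
            current.append(char)  # any other character, including backslash
            i += 1
    items.append("".join(current))
    return items


def build_matcher(option_value):
    """Return a predicate telling whether a property name is selected."""
    if option_value.startswith("re:"):
        # re.compile raises re.error for an invalid expression.
        patterns = [re.compile(p) for p in split_unescaped_commas(option_value[3:])]
        return lambda name: any(p.fullmatch(name) for p in patterns)
    if option_value.startswith("glob:"):
        globs = split_unescaped_commas(option_value[5:])
        return lambda name: any(fnmatch.fnmatchcase(name, g) for g in globs)
    names = set(split_unescaped_commas(option_value))
    return lambda name: name in names


by_regex = build_matcher(r"re:adm_.*,\d\d_[a-z]+$")
by_glob = build_matcher("glob:adm_*,[0-9][0-9]_*name")
by_name = build_matcher(r"land_use,a\,b")
```

With these matchers, by_regex("adm_level") and by_glob("42_streetname") select the property, while a name such as "city" is selected by neither.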
Although the SpatialJSON Shared String Table feature works fine and typically saves a moderate number of bytes for arbitrary datasets in its default configuration (that is, without specifying the sharedstrings format option), this parameter provides a handle for advanced fine-tuning of the string table's creation process.