Earth Notes: On IoT Data Sets and Processing (2015)
Updated 2024-02-19.
Overview
Data to be processed potentially includes:
- sensed values, eg temperature, light level, energy consumed
- dynamic metadata about sensors and nodes, eg liveness, battery level
- static/situational data, eg sensor location to allow map projections
- transient data such as temporary end-user smartphone sensor participation
- ephemeral/restricted data, eg permitted to use for a limited time and not redistributable
- external/virtual/synthesised data sets
Aspects of processing potentially include:
- real-time or historical (eg time-series)
- raw or filtered/'cleaned'
- data status, eg public or private (and licensing)
- data leaf-to-concentrator flow integrity (authentication) and privacy (encryption)
- data store integrity, eg appropriately-granular access controls
- data end-to-end integrity/authenticity, eg detecting tampering
- sensor/data source discovery and management
- sensor management metadata such as liveness
- privacy, safety, security of personal and non-personal data
- distribution of data to public or private data sinks
- interoperability including presentation in common/standard formats
- analytics, including fusion and cross-validation of data from multiple sources
- analytics to detect faults in sensors
- types of analysis, eg heat use and footfall against temperature
- aggregation and anonymisation
- visualisation, including dashboards for real-time management
- incorporation of data into and combination with larger sets, eg fleet management and journey planning
- interaction with end-users (and data from and about interactions)
It is a useful abstraction to separate the place and thing being measured (eg a "meter point") from the sensor cluster that happens to be doing the measuring, for example so that data can take various routes manually and automatically, and to allow equipment replacements and upgrades in the field.
In the comms model as of 2014-05-29 such abstraction happens downstream of the concentrator/redistributor, and might be done by logically associating a combination of (sensor and) leaf node ID, concentrator ID and time-window with a particular "meter point" or equivalent. Redeployment or replacement of a sensor/node should create a new association, and forcing the sensor/node to change its ID (eg to a new random one) may help with this, ensuring that a "meter point" maps to one or more sensor/node IDs, but that usually one sensor/node ID (with all its readings) is unique to one meter point. A minimal sketch of such an association follows.
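As an illustration only, here is a sketch in Java of that association logic; the names (MeterPoint, SensorAssociation) and the explicit validity time-window are hypothetical, not an existing API:

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class SensorAssociation {
    final String concentratorID;  // eg "12345678"
    final String leafNodeID;      // eg "4b62"; changes on redeployment/replacement
    final Instant validFrom;      // start of the association time-window
    final Instant validTo;        // null while the association is still live

    SensorAssociation(String concentratorID, String leafNodeID,
                      Instant validFrom, Instant validTo) {
        this.concentratorID = concentratorID;
        this.leafNodeID = leafNodeID;
        this.validFrom = validFrom;
        this.validTo = validTo;
    }

    /** True if a reading at time t from the given IDs belongs to this association. */
    boolean matches(String concID, String nodeID, Instant t) {
        return concentratorID.equals(concID) && leafNodeID.equals(nodeID)
            && !t.isBefore(validFrom) && (validTo == null || t.isBefore(validTo));
    }
}

/** A meter point owns one or more associations, usually non-overlapping in time. */
final class MeterPoint {
    final String id;
    final List<SensorAssociation> associations = new ArrayList<>();
    MeterPoint(String id) { this.id = id; }
}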
See also some previous discussions around protocols and formats for OpenTRV:
- OpenTRV Protocol Discussions 2014-11 Part 1
- OpenTRV Protocol Discussions 2014-11 Part 2
- OpenTRV Protocol Discussions 2014-12 Part 3
Building Health
See the notes in 2015-05-29 Data Sets and Processing meeting notes for an illustrative cursory cut of the sorts of inputs (sensor and other data set) and outputs (metrics/KPIs) for a building health use case.
Note that for the mobile sensors, eg parked on people, a chat with Paul Tanner (2015-05-29) suggests that some variant on the BuggyAir technology, with data backhaul through staff phones or WiFi or Bluetooth Smart (and/or providing location as beacons), might be suitable; it measures NO2, CO2 and PM2.5.
Bruno's (EnergyDeck CTO) D15 note 2015-06-07...
# D15: Datasets and data processing #

## Sensor data sets ##

- Time stamp in ISO 8601 format with time zone
- Globally unique ID for the sensor (this can be a combination of a concentrator ID + sensor number or any other value that is globally unique)
- Value
- Unit (this may be sent in initial frame only or following request from data platform)
- Frame number (used to identify missing frames; a sequential number that can potentially loop, in which case we need to identify the looping logic to ensure missing-frame detection works at the loop point)

## Data processing on the platform ##

1. Check that the frame can be parsed; if not, return an error (HTTP ???).
2. Check the sensor ID: if known, fetch sensor metadata, otherwise assume a new sensor. The platform may auto-create the new sensor or send an error code back depending on internal logic. Note that this internal logic can also depend on how the sensors have been commissioned.
3. If an existing sensor, check that the frame number received = last frame + 1.
4. If a unit is specified, check that it matches the unit known for the sensor. If there is a unit mismatch, send back a hard error code.
5. Store the value and send a return code:
   - All OK: HTTP 200 or equivalent.
   - Step 3 shows a missing frame: non-critical error code asking for missing frames, specifying last received frame + last received time stamp.
   - Step 4 finds no unit stored against the sensor and none provided: non-critical error asking for the unit to be sent in the next frame.

Note that we need to have a unit value for unit-less numbers. See SenML for unit values. We may want to extend on what SenML provides but we should also be compatible with it.

## Data processing of return codes on the device ##

1. If all OK, stop processing.
2. If missing-frame code, send missing frames in one or multiple messages. If some of the frames are no longer available, send a frame specifying "unknown" for those.
3. If unit-required code, mark the sensor as needing to send the unit on next frame.

## Commissioning ##

To be done in D38.
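A minimal sketch in Java of platform-side steps 2-5 above (step 1, parsing, is assumed to have already succeeded; the class names, Result codes and auto-create policy are illustrative stand-ins, not the platform's actual API):

import java.util.Map;

final class FrameProcessor {
    // Stand-ins for the HTTP(-like) return codes left open in the note.
    enum Result { OK, MISSING_FRAMES, UNIT_MISMATCH, UNIT_REQUIRED }

    // Hypothetical per-sensor metadata held by the platform.
    static final class SensorMeta {
        String unit;         // null until a unit has been seen or configured
        Integer lastFrame;   // null until the first frame arrives
    }

    private final Map<String, SensorMeta> sensors; // keyed by globally unique sensor ID
    FrameProcessor(Map<String, SensorMeta> sensors) { this.sensors = sensors; }

    Result process(String sensorID, int frameNumber, double value, String unit) {
        // Step 2: look up the sensor; this sketch simply auto-creates unknown sensors.
        SensorMeta meta = sensors.computeIfAbsent(sensorID, id -> new SensorMeta());

        // Step 3: detect missing frames (loop-point logic omitted for brevity).
        boolean gap = meta.lastFrame != null && frameNumber != meta.lastFrame + 1;
        meta.lastFrame = frameNumber;

        // Step 4: if a unit is supplied it must match any unit already on record.
        if (unit != null) {
            if (meta.unit == null) meta.unit = unit;
            else if (!meta.unit.equals(unit)) return Result.UNIT_MISMATCH; // hard error
        }

        // Step 5: store the value (omitted here) and choose a return code.
        if (gap) return Result.MISSING_FRAMES;              // ask the device to resend
        if (meta.unit == null) return Result.UNIT_REQUIRED; // ask for unit in next frame
        return Result.OK;
    }
}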
Also see Nic's (EnergyDeck COO) D15 note 2015-06-30: indoor and outdoor environment quality, including some key items such as:
- "...research indicates that the requirements for sensors are not universal and that depending on the indoor environment, e.g. what it's used for or where it's located, different sensor sets are required. As such, it may be an important aspect of our research to look at differing requirements in different room types, not just assume that one-size fits all."
- "retailers can measure benefit directly through increased sales"
- "for outdoor monitoring when relating to office space, ... it's the comparison between indoor and outdoor environment quality, particularly air quality, which is of particular interest"
Interoperability and Discoverability
Interoperability and discoverability are important for large IoT deployments, where there is no time to hand-craft solutions for integrating sensor data sources.
Bruno's (EnergyDeck CTO) D15 note 2015-06-16: SenML and HyperCat.
# D15: HyperCat / SenML representation #

In order for the data manipulated by the system to be distributed over HyperCat[1], it needs to be serialisable as SenML[2].

## SenML representation ##

SenML is a simple representation that can carry a number of data points for multiple data series in a single data frame. It can be serialised to XML or JSON. A SenML frame has the ability to specify base values for a number of attributes to avoid having them repeated in individual data point entries. This can be particularly useful to convey concentrator- and device-level attributes.

JSON generated from an OpenTRV device for a single sensor:

[ "2015-06-16T00:01:17Z", "", {"@":"4b62","+":7,"vac|h":74,"v|%":0,"tT|C":5} ]

Possible SenML representation (assuming 12345678 is the ID of the concentrator):

{
  "bn": "urn:dev:id:12345678/4b62/",
  "bt": 1433894477,
  "e": [
    { "n": "+", "v": 7 },
    { "n": "vac", "v": 74, "u": "h" },
    { "n": "v", "v": 0, "u": "%" },
    { "n": "tT", "v": 5, "u": "Cel" }
  ]
}

When dealing with multiple entries with different time stamps, such as:

[ "2015-06-16T00:01:17Z", "", {"@":"4b62","+":7,"vac|h":74,"v|%":0,"tT|C":5} ]
[ "2015-06-16T00:01:39Z", "", {"@":"6363","+":3,"vac|h":26,"v|%":0,"tT|C":7} ]
[ "2015-06-16T00:02:39Z", "", {"@":"6363","+":4,"vC|%":328,"T|C16":329,"O":1} ]

We could move the device ID to the "n" attribute:

{
  "bn": "urn:dev:id:12345678/",
  "bt": 1433894477,
  "e": [
    { "n": "4b62/+", "v": 7, "t": 0 },
    { "n": "4b62/vac", "v": 74, "t": 0, "u": "h" },
    { "n": "4b62/v", "v": 0, "t": 0, "u": "%" },
    { "n": "4b62/tT", "v": 5, "t": 0, "u": "Cel" },
    { "n": "6363/+", "v": 3, "t": 22 },
    { "n": "6363/vac", "v": 26, "t": 22, "u": "h" },
    { "n": "6363/v", "v": 0, "t": 22, "u": "%" },
    { "n": "6363/tT", "v": 7, "t": 22, "u": "Cel" },
    { "n": "6363/+", "v": 4, "t": 82 },
    { "n": "6363/vC", "v": 328, "t": 82, "u": "%" },
    { "n": "6363/T", "v": 20.5625, "t": 82, "u": "Cel" },
    { "n": "6363/O", "v": 1, "t": 82 }
  ]
}

Or when dealing with multiple entries for a single sensor:

{
  "bn": "urn:dev:id:12345678/4b62/tT",
  "bt": 1433894477,
  "bu": "Cel",
  "e": [
    { "v": 5, "t": 0 },
    { "v": 6, "t": 5 }
  ]
}

Note on units: SenML supports a limited number of units as standard. However, those units can be extended by using any unit in the UCUM standard[3] by prefixing the name with "UCUM:". This has an implication for non-standard units such as C16, which should be transformed into a standard unit.

Note on my notes: I didn't convert the "h" unit as I can't remember what it is.

Note on timestamps: they are formatted as an integer that is the number of seconds in the UNIX epoch. If using this format, we should ensure that those values are always in the UTC time zone.

## HyperCat catalogue ##

HyperCat adds catalogue capability on top of the data at a well-known URL for a particular service. That URL has a top-level catalogue that points to other catalogues. So for example, the following catalogue has one sub-catalogue for devices:

{
  "item-metadata": [
    { "rel": "urn:X-tsbiot:rels:isContentType", "val": "application/vnd.tsbiot.catalogue+json" },
    { "rel": "urn:X-tsbiot:rels:hasDescription:en", "val": "all catalogues" }
  ],
  "items": [
    {
      "href": "/cats/devices",
      "i-object-metadata": [
        { "rel": "urn:X-tsbiot:rels:isContentType", "val": "application/vnd.tsbiot.catalogue+json" },
        { "rel": "urn:X-tsbiot:rels:hasDescription:en", "val": "Devices" }
      ]
    }
  ]
}

And that sub-catalogue lists the sensors:

{
  "item-metadata": [
    { "rel": "urn:X-tsbiot:rels:isContentType", "val": "application/vnd.tsbiot.catalogue+json" },
    { "rel": "urn:X-tsbiot:rels:hasDescription:en", "val": "Devices" }
  ],
  "items": [
    {
      "href": "https://config28.flexeye.com/v1/iot_Default/dms/Eseye_DM/devices/Device_1131",
      "i-object-metadata": [
        { "rel": "urn:X-tsbiot:rels:hasDescription:en", "val": "Funky sensor" },
        { "rel": "http://purl.oclc.org/NET/ssnx/ssn#SensingDevice", "val": "Sensor" },
        { "rel": "urn:X-tsbiot:rels:isContentType", "val": "application/json" },
        { "rel": "urn:X-senml:u", "val": "https://config28.flexeye.com/v1/iot_Default/dms/Eseye_DM/devices/Device_1131/senML/json" },
        { "rel": "http://www.loa-cnr.it/ontologies/DUL.owl#hasLocation", "val": "https://config28.flexeye.com/v1/iot_Default/applications/eyeHack" }
      ]
    }
  ]
}

Note that there may be several levels of catalogues and that the leaf catalogue tends to list individual sensors on a single leaf node. The hierarchy of catalogues could be something like:

/cats/concentrator/XXX/device/YYY/sensors/

Or:

/cats/device/XXX/YYY/sensors/

We should also include in the "i-object-metadata" structure important information such as unit, metric, name, etc. Some of those may be repeated in the SenML data but are useful in the catalogue to enable filtering. One option at the catalogue level, rather than specify the unit, would be to specify the metric (e.g. "temperature" rather than "°C") as this is enough for a platform to understand how to handle the sensor, assuming it can handle all units in that metric.

EnergyDeck will probably have a catalogue URL that follows the following pattern (this will be confirmed during implementation):

/cats                               | root catalogue of catalogues
/cats/assets                        | catalogue of assets
/cats/metering-points               | catalogue of metering points across all assets
/cats/asset/x                       | catalogue of catalogues for asset x
/cats/asset/x/assets                | sub-catalogue of assets for asset x
/cats/asset/x/metering-points       | catalogue of metering points (~devices) attached to an asset
/cats/asset/x/metering-point/y      | catalogue of catalogues for MP y associated with asset x
/cats/metering-point/y              | ... with direct access shortcut
/cats/metering-point/y/linked       | catalogue of MPs related to MP y
/cats/metering-point/y/series       | catalogue of data series for MP y
/cats/metering-point/y/series/z/    | catalogue of catalogues for series z in MP y
/cats/metering-point/y/series/z/raw | raw data points for series z in MP y
/cats/metering-point/y/series/z/1m  | data points at 1 minute granularity
/cats/metering-point/y/series/z/30m | data points at 30 minute granularity
/cats/metering-point/y/series/z/1Y  | data points at 1 year granularity

## References ##

[1] http://www.hypercat.io/
[2] https://tools.ietf.org/id/draft-jennings-senml-10.txt
[3] http://unitsofmeasure.org/ucum.html
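To make the mechanical OpenTRV-JSON-to-SenML translation concrete, here is a sketch in Java. The "name|unit" key convention and the unit table are inferred from the examples above and are illustrative only; unknown units (such as "h") are simply dropped in this sketch rather than passed through:

import java.util.LinkedHashMap;
import java.util.Map;

final class SenMLTranslator {
    // Illustrative mapping from OpenTRV unit suffixes to SenML/UCUM units.
    private static final Map<String, String> UNITS = Map.of(
        "C", "Cel",   // whole degrees Celsius
        "%", "%");    // percentage (valve position, humidity, ...)

    /** Emit one SenML entry per "name|unit":value pair from an OpenTRV frame. */
    static String toSenML(String concentratorID, String nodeID,
                          long baseTime, Map<String, Number> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{ \"bn\": \"urn:dev:id:").append(concentratorID)
          .append('/').append(nodeID).append("/\", \"bt\": ").append(baseTime)
          .append(", \"e\": [");
        boolean first = true;
        for (Map.Entry<String, Number> f : fields.entrySet()) {
            String[] nameUnit = f.getKey().split("\\|", 2); // eg "tT|C" -> {"tT","C"}
            if (!first) sb.append(',');
            first = false;
            sb.append(" { \"n\": \"").append(nameUnit[0])
              .append("\", \"v\": ").append(f.getValue());
            if (nameUnit.length == 2 && UNITS.containsKey(nameUnit[1]))
                sb.append(", \"u\": \"").append(UNITS.get(nameUnit[1])).append('"');
            sb.append(" }");
        }
        return sb.append(" ] }").toString();
    }

    public static void main(String[] args) {
        // From {"@":"4b62","+":7,"vac|h":74,"v|%":0,"tT|C":5};
        // "@" (node ID) is assumed handled upstream of this call.
        Map<String, Number> fields = new LinkedHashMap<>();
        fields.put("+", 7); fields.put("vac|h", 74);
        fields.put("v|%", 0); fields.put("tT|C", 5);
        System.out.println(toSenML("12345678", "4b62", 1433894477L, fields));
    }
}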
2015-06-22: Bruno and Damon discussed the desirability of completely mechanical conversions from JSON sensor units to UCUM/SenML units, to minimise or eliminate magic 'mappings' that require sophisticated developer time (in line with IBM suggestions). One particular issue that came up is being able to represent values as integers (for brevity and to keep code small on the sensors), scaled to integers for transit, when the natural scaling is a power of two, eg temperatures from common sensors with four significant bits after the binary point, ie that are currently being sent with units |C16 for "Celsius times 16". Bruno was going to investigate. One possible escape hatch is the Ki/Mi/Gi/Ti "special prefix symbols for powers of 2".
2015-06-27: Note: mechanical translatability from any binary formats used (such as OpenThings, TinyHAN profiles, or application-specific hand-crafted formats) is highly desirable for the same reasons, eg so that the concentrator/redistributor can convert them mechanically for downstream fan-out and make the data discoverable. That may imply a plug-in at the concentrator per upstream (binary) format to convert to a common presentation and processing format such as JSON and/or SenML.
2015-07-09: An OpenThings protocol review (also here) points out a number of interesting characteristics of OpenThings as a potential supported binary format for OpenTRV sensor leaf nodes (lightweight, suitable to run directly over an RF link, extensible) and turns up the following pros and cons from this project's point of view:
- pro
- Simple
- Works over any network protocol, including RF
- Matches OpenTRV current network topology (star)
- Large list of metrics that covers OpenTRV requirements + possibility for extension
- Two-way protocol that supports control of individual nodes
- Support for management meta-data such as battery power and alarms
- con
- No provision for binding between node and concentrator
- Some protocol optimisation currently done by OpenTRV (eg C16) may be difficult to implement
- Some restrictions on how data can be sent for devices that have multiple sensors of the same type (eg 2 temperature sensors)
Configuration File Format Version 1
Bruno's (EnergyDeck CTO) D15 note 2015-06-19: config file format.
# Disk based configuration format #

The current `-dhd` option in the code automatically creates a number of stats handlers. All the command-line-driven code does the same, so the core of the configuration should be a list of handlers with options. Using a JSON format, we could have something like:

{
  "handlers": [
    {
      "name": "handler name",
      "type": "uk.org.opentrv.comms.statshandlers.builtin.DummyStatsHandler",
      "options": { "option1": "value1" }
    }
  ]
}

The list of options is then specific to a particular handler type.

Questions:
- Is it sensible to have a fully qualified Java class name as the type?
- Should the name be mandatory? We need an anonymous handler option for wrapped handlers anyway, so we could also rely on the index in the handlers array.

Example with a RKDAP handler:

{
  "handlers": [
    {
      "name": "EnergyDeck stats handler",
      "type": "uk.org.comms.http.RkdapHandler",
      "options": {
        "dadID": "ED256",
        "url": "https://energydeck.com"
      }
    }
  ]
}

Example with a wrapped handler:

{
  "handlers": [
    {
      "name": "My async handler",
      "type": "uk.org.opentrv.comms.statshandlers.filter.AsyncStatsHandlerWrapper",
      "options": {
        "handler": {
          "type": "uk.org.opentrv.comms.statshandlers.builtin.SimpleFileLoggingStatsHandler",
          "options": { "statsDirName": "stats" }
        },
        "maxQueueSize": 32
      }
    }
  ]
}

Full `-dhd` flag example:

{
  "handlers": [
    {
      "name": "File log",
      "type": "uk.org.opentrv.comms.statshandlers.builtin.SimpleFileLoggingStatsHandler",
      "options": { "file": "out_test/stats" }
    },
    {
      "name": "Twitter Temp b39a",
      "type": "uk.org.opentrv.comms.statshandlers.builtin.twitter.SingleTwitterChannelTemperature",
      "options": { "hexID": "b39a" }
    },
    {
      "name": "Twitter Temp 819c",
      "type": "uk.org.opentrv.comms.statshandlers.builtin.twitter.SingleTwitterChannelTemperature",
      "options": { "hexID": "819c" }
    },
    {
      "name": "Recent stats file",
      "type": "uk.org.opentrv.comms.statshandlers.filter.SimpleStaticFilterStatsHandlerWrapper",
      "options": {
        "handler": {
          "type": "uk.org.opentrv.comms.statshandlers.builtin.RecentStatsWindowFileWriter",
          "options": { "targetFile": "out_test/edx.json" }
        },
        "allowedIDs": [ "b39a", "819c" ]
      }
    },
    {
      "name": "EMON CMS",
      "type": "uk.org.opentrv.comms.statshandlers.builtin.openemon.OpenEnergyMonitorPostSimple",
      "options": {
        "credentials": "emonserver1",
        "sourceIDIn": "819c",
        "statsTypeIn": "{",
        "mapping": {
          "T|C16": "Temp16",
          "B|cV": "BattcV",
          "L": "L"
        },
        "emonNodeOut": "819c"
      }
    }
  ]
}

The implication of that configuration file is that the existing stats handlers will need to be refactored so that all handlers use a one-argument constructor that takes a configuration object. A possible extension to that format would be to have a mapping between short name and fully qualified Java class at the beginning of the file to simplify the handler definitions.
DHD note: it should eventually also be possible to inline credentials in the config, optionally, rather than have them out of line as now. In part that should make remote management saner and simpler.
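To make the refactoring implied by the note concrete, here is a sketch of a handler exposing the one-argument configuration constructor, plus a reflective factory driven by the JSON "type"/"options" pairs. The StatsHandler interface and its method signature are assumptions for illustration, not the actual OpenTRV classes:

import java.util.Map;

interface StatsHandler {
    void processStatsMessage(String message) throws Exception; // hypothetical signature
}

final class DummyStatsHandler implements StatsHandler {
    private final Map<String, Object> options;
    /** The single-argument constructor the configuration loader relies on. */
    DummyStatsHandler(Map<String, Object> options) { this.options = options; }
    public void processStatsMessage(String message) { System.out.println(message); }
}

final class HandlerFactory {
    /** Instantiate a handler from its fully qualified class name and options map. */
    static StatsHandler create(String type, Map<String, Object> options) throws Exception {
        return (StatsHandler) Class.forName(type)
            .getConstructor(Map.class).newInstance(options);
    }
}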
Real-time / Streamed
A primary goal of this work is building a sensor system that makes data available in real-time, though reasonable notions of 'real-time' here come with data delivery latency ranging from seconds to days.
(In general these comments are meant to apply both to the direct data outputs gathered during the Launchpad project, and to projects that use the technology developed during the project.)
Generally, high-priority data, and/or data that needs to be acted on quickly to be maximally useful (such as our footfall data feeding into journey planners and bus dispatch), should have lower latency, and usually small datum size to enable that. Such data might be dispatched over an RF (radio-frequency) connection, for example, as telemetry. Typically data updates would be sent every few minutes, though extra/early transmission of time-sensitive data is possible.
Lower-priority and/or bulk data could, for example, wait for something to physically pass or connect to it, eg a bus passing a bus shelter could use WiFi to get a quick large historical (eg over prior 24h) data dump from sensors, maybe to be disgorged at the bus depot or another suitably-instrumented stop.
It is anticipated that this data will be largely handled and available in real-time, possibly delayed or blocked for financial or security/privacy reasons in a few cases, eg bus shelter attendance in the small hours.
Downstream fan-out across the Internet from the concentrator/distributor will add more latency unless serious efforts are made otherwise, eg by:
- Keeping alive downstream connections to avoid set-up overheads, typically of several RTTs (Round-Trip Times), especially where secure handshakes are necessary.
- Paying for low-latency QoS connectivity, eg more like leased-line than default best-efforts Internet service.
It is also anticipated that this data will be streamed to databases for long-term persistence and analysis. The simplest version of this is likely to be a simple timestamped and possibly authenticated log of the sensor data, eg, something a little like this for JSON:
[ "2015-06-28T16:41:31Z", "", {"@":"0a45","+":5,"tT|C":7,"vC|%":0,"T|C16":357,"O":1} ] [ "2015-06-28T16:42:05Z", "", {"@":"0d49","+":5,"tT|C":7,"vC|%":862,"T|C16":354} ] [ "2015-06-28T16:42:15Z", "", {"@":"2d1a","+":5,"tT|C":7,"vC|%":0,"T|C16":377,"O":1} ] [ "2015-06-28T16:42:33Z", "", {"@":"f1c6","+":4,"L":5,"O":1,"vac|h":6,"T|C16":350} ] [ "2015-06-28T16:44:21Z", "", {"@":"414a","+":2,"vac|h":8,"v|%":0,"tT|C":7,"vC|%":0} ] [ "2015-06-28T16:44:27Z", "", {"@":"3015","+":0,"B|mV":2533,"H|%":52,"O":1} ] [ "2015-06-28T16:45:15Z", "", {"@":"819c","T|C16":340,"L":228,"B|cV":255} ] [ "2015-06-28T16:45:31Z", "", {"@":"0a45","+":0,"T|C16":355,"vac|h":9,"v|%":0} ] [ "2015-06-28T16:46:05Z", "", {"@":"0d49","+":6,"T|C16":353,"O":1,"vac|h":8} ] [ "2015-06-28T16:47:05Z", "", {"@":"0d49","+":7,"B|mV":2601,"v|%":0,"tT|C":7} ]
and this for non-JSON origin format:
2015/06/28 16:41:33Z 414A 22.3125 @414A;T22C5;L31;O1
2015/06/28 16:42:13Z 819C 21.3125 @819C;T21C5;L239;O1
2015/06/28 16:42:55Z 3015 23.625 @3015;T23CA;L63;O1
2015/06/28 16:43:43Z D49 22.125 @D49;T22C2;L55;O1
2015/06/28 16:44:05Z A45 22.25 @A45;T22C4;L38;O1
2015/06/28 16:44:25Z 2D1A 23.5625 @2D1A;T23C9;L21;O1
2015/06/28 16:45:13Z 819C 21.25 @819C;T21C4;L228;O1
2015/06/28 16:46:33Z F1C6 21.875 @F1C6;T21CE;L1;O1
2015/06/28 16:47:35Z D49 22.0625 @D49;T22C1;L55;O1
2015/06/28 16:48:17Z 2D1A 23.5625 @2D1A;T23C9;L16;O1
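For illustration, the terse records above can be decoded mechanically. In this sketch the field meanings ("@" node ID; "T<d>C<h>" whole degrees Celsius plus one hex digit of sixteenths; "L" light level; "O" occupancy) are inferred from the samples, not taken from a format specification:

final class TerseRecordParser {
    public static void main(String[] args) {
        String record = "@2D1A;T23C9;L16;O1";
        for (String field : record.split(";")) {
            if (field.startsWith("@")) {
                System.out.println("node  = " + field.substring(1));
            } else if (field.startsWith("T")) {
                int c = field.indexOf('C');
                double temp = Integer.parseInt(field.substring(1, c))
                            + Integer.parseInt(field.substring(c + 1), 16) / 16.0;
                System.out.println("T/Cel = " + temp); // 23.5625 for T23C9, as logged
            } else if (field.startsWith("L")) {
                System.out.println("light = " + Integer.parseInt(field.substring(1)));
            } else if (field.startsWith("O")) {
                System.out.println("occ   = " + Integer.parseInt(field.substring(1)));
            }
        }
    }
}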
Real-time uses of data are potentially more forgiving of (eg) changes of sensor and format, since downstream code can be adjusted hand-in-hand with sensor data changes, though log files should be in a form that as far as possible is either explicitly documented or about which a reasonable guess can be made. This makes textual (eg ASCII line-oriented *nix-like) values with names and (UCUM) units a good default, even for the likely-terse data direct from sensors. (Some conversion from binary formats to text may be necessary, but it should if possible be mechanical, with the data preserved as close as possible to the original, maybe even including a hex dump of the original message bytes minus any cryptographic elements not intended to be long-lived or widely exposed (eg to attack); see D16 Security.)
A reasonable default may be to insist on a (compact) SenML translation of any message not directly representable in compact ASCII text, alongside a hex dump of the message data bytes, with that translation being optional if the inbound message is suitably readable, printable, single-line 7-bit ASCII.
Note that if the receiving end of the RF link from the sensor provides a textual representation of what arrived over the wire, then that form may be treated as if it were the transmitted form, eg as with @2D1A;T23C9;L16;O1 above.
Note that these forms are intended to be written and accessed serially, and are not necessarily efficient for anything other than log writing, or for reading other than audit and bulk-data analysis. These forms will typically be less compact than a hand-crafted binary form, though standard, common, well-documented compression techniques likely to have decoders available for a long time (eg GZIP, LZMA) may eliminate much of that redundancy, while leaving the underlying form open to archaeology many years after storage.
Current thinking (2015-06-28) is that as far as reasonably practical all the real-time (sensor) data (including metadata) should have a simple mechanical translation to SenML and be HyperCat-friendly for discoverability.
Timestamps and authenticators
Note that in general the expectation is that time stamps, and any data signing for long-term authentication (eg for non-repudiation and tamper-proofing), will happen at the concentrators (which can maintain accurate clocks and perform more elaborate cryptography). Concentrators will only sign incoming data that has passed link/source authentication checks, which should provide a sufficient authentication chain back to the hardware.
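As an illustration only: a concentrator-side sketch that timestamps a validated sensor line and appends an HMAC-SHA256 authenticator. The choice of HMAC, the framing, and the key handling are all assumptions for the sketch, not the project's chosen scheme (see D16 Security for the real requirements):

import java.nio.charset.StandardCharsets;
import java.time.Instant;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class LogAuthenticator {
    private final Mac mac;

    LogAuthenticator(byte[] key) throws Exception {
        mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
    }

    /** Prefix the line with an accurate timestamp and append an authenticator tag. */
    String stampAndSign(String sensorLine) {
        String stamped = "[ \"" + Instant.now() + "\", \"\", " + sensorLine + " ]";
        byte[] tag = mac.doFinal(stamped.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : tag) hex.append(String.format("%02x", b));
        return stamped + " " + hex;
    }
}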
Timeseries / Historical
Once out of real-time, we may take snapshots of real-time state across one or many sensors, or maintain and update live queryable time series thereof.
(In general these comments are meant to apply both to the direct data outputs gathered during the Launchpad project, and to projects that use the technology developed during the project.)
Such blended uses between real-time and historical can be very useful, ie combining data right up to now with the historical set to make near-term predictions based on daily/weekly/annual/weather patterns, even if that latest data may not be stored in the same place as the older data. Databases such as kdb+ or software such as Chronicle, possibly more often seen in HFT/HST (High-Frequency/Speed Trading) in finance, may be especially suited to this task.
Some sensor and post-analytics data, eg that put into open public data stores, will likely have some long-term value for many years after the termination of (say) this Launchpad project and other projects generating data using this framework. Thus we have to think about long-term archival, at least of the data itself (physical archival-media concerns are out of scope). Ideally data is 'self-describing', but in practice something written in a simple (probably non-binary) format like the real-time log output above might be close. However, that is likely to conflict with being efficient in space and/or access-time terms.
As for real-time data, where at all possible, data should be made easy to catalogue with and discover via HyperCat, and, if public/open data, actually be recorded in places such as Thingful.
Some candidate data stores to deposit some or all of the Launchpad project data include:
- Digital Catapult Environmental Data Exchange (EDX).
- GLA (Greater London Authority) London Datastore.
- The earth.org.uk / opentrv.org.uk Web sites, plus SourceForge/GitHub.
Conclusions
Real-time (streamed) and timeseries (historical) uses of the collected data should be supported.
The availability of simple mechanical translations of all sensor node data forms to a common intermediate such as SenML with UCUM units (for appropriate sensors) is highly desirable, both to keep development costs down (specialist developers that can target the distributor environment are costly) and to help with discoverability and long-term storage of IoT data.
Fast processing will probably require loading that data (possibly with live updating from streamed inputs) into specialist binary or in-memory forms, but those will likely not be the archival/storage form.
Metadata required for provisioning and on-going estate management should be generated by all sensor nodes intended for large deployments, including alerts for some physical tampering.
A basic concentrator/distributor configuration format has been outlined, which should help facilitate remote management with something like TR-069. If existing open frameworks are able to cover what our existing proof-of-concept concentrator code has achieved then we may switch to one of them.