Real World Data
Warning
Big fat “unfinished” warning: structa is still very much incomplete at this time and there are plenty of rough edges (like not showing CSV column titles).
If you run into unfinished stuff, do check the issues first as I may have a ticket for that already. If you run into genuinely “implemented but broken” stuff, please do file an issue; it’s these things I’m most interested in at this stage.
Pre-requisites
You’ll need the following to start this tutorial:
A structa installation; see Installation for more information on this.
A Python 3 installation; given that structa requires this to run at all, if you’ve got structa installed, you’ve got this too. However, it’ll help enormously if Python is in your system’s “PATH” so that you can run Python scripts at the command line.
The scipy library must be installed, as the scripts we’ll be using to generate data depend on it. On Debian/Ubuntu systems you can run the following:
$ sudo apt install python3-scipy
On Windows, or if you’re running in a virtual environment, you should run the following:
$ pip install scipy
Some basic command line knowledge. In particular, it’ll help if you’re familiar with shell redirection and piping (note: while that link is on askubuntu.com, the contents are equally applicable to the vast majority of UNIX shells, and even to Windows’ cmd!)
“Real World” Data
For this tutorial, we’ll use a custom-made data-set which will allow us to tweak things and see what’s going on under structa’s hood a bit more easily.
The following script generates a fairly sizeable JSON file (~11MB), apparently recording various air quality readings from places which bear absolutely no resemblance whatsoever to my adoptive city (ahem):
import sys
import json
import random
import datetime as dt

from scipy.stats import skewnorm


readings = {
    # stat: (min, max),
    'O3': (0, 50),
    'NO': (0, 200),
    'NO2': (0, 100),
    'PM10': (0, 100),
    'PM2.5': (0, 100),
}

locations = {
    # location: {stat: (skew, scale), ...}
    'Mancford Peccadillo': {
        'O3': (0, 1),
        'NO': (5, 1),
        'NO2': (0, 1),
        'PM10': (10, 3),
        'PM2.5': (10, 1),
    },
    'Mancford Shartson': {
        'O3': (-10, 1),
        'NO': (10, 1),
        'NO2': (0, 1),
    },
    'Salport': {
        'NO': (10, 1),
        'NO2': (-10, 1/2),
        'PM10': (5, 1/2),
        'PM2.5': (5, 1/2),
    },
    'Prestchester': {
        'O3': (1, 1),
        'NO': (5, 1/2),
        'NO2': (0, 1),
        'PM10': (5, 1/2),
        'PM2.5': (10, 1/2),
    },
    'Blackshire': {
        'O3': (-10, 1),
        'NO': (50, 1/2),
        'NO2': (10, 1/2),
        'PM10': (10, 1/2),
        'PM2.5': (10, 1/2),
    },
    'St. Wigpools': {
        'O3': (0, 1),
        'NO': (10, 1),
        'NO2': (5, 3/4),
        'PM10': (5, 1/2),
        'PM2.5': (5, 1/2),
    },
}

def skewfunc(min, max, a=0, scale=1):
    s = skewnorm(a)
    real_min = s.ppf(0.0001)
    real_max = s.ppf(0.9999)
    real_range = real_max - real_min
    res_range = max - min
    def skewrand():
        return min + res_range * scale * (s.rvs() - real_min) / real_range
    return skewrand

generators = {
    location: {
        reading: skewfunc(read_min, read_max, skew, scale)
        for reading, params in loc_readings.items()
        for read_min, read_max in (readings[reading],)
        for skew, scale in (params,)
    }
    for location, loc_readings in locations.items()
}

timestamps = [
    dt.datetime(2020, 1, 1) + dt.timedelta(hours=n)
    for n in range(10000)
]

data = {
    location: {
        'euid': 'GB{:04d}A'.format(random.randint(200, 2000)),
        'ukid': 'UKA{:05d}'.format(random.randint(100, 800)),
        'lat': random.random() + 53.0,
        'long': random.random() - 3.0,
        'alt': random.randint(5, 100),
        'readings': {
            reading: {
                timestamp.isoformat(): loc_gen()
                for timestamp in timestamps
            }
            for reading, loc_gen in loc_gens.items()
        }
    }
    for location, loc_gens in generators.items()
}

json.dump(data, sys.stdout)
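The rescaling inside skewfunc maps the distribution’s effective range (between the 0.01% and 99.99% quantiles) onto the requested reading range. The same arithmetic can be checked with plain numbers, no scipy required; note the quantile values below are made-up stand-ins for illustration:

```python
# The same rescaling skewfunc performs, with hypothetical numbers:
# map a sample x from [real_min, real_max] onto a range of width
# (max - min) * scale, starting at min
min_, max_ = 0, 200              # the NO reading bounds from the script
real_min, real_max = -3.7, 3.7   # stand-ins for s.ppf(0.0001), s.ppf(0.9999)
scale = 1
x = 0.0                          # a hypothetical sample from the distribution

value = min_ + (max_ - min_) * scale * (x - real_min) / (real_max - real_min)
print(value)  # ≈ 100.0: a sample at the quantile midpoint maps to the middle of 0..200
```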
If you run the script it will output JSON on stdout, which you can redirect to a file (or pipe straight to structa, though since the script takes a while to run you may wish to capture the output in a file for experimentation). Passing the output to structa should produce something like this:
$ python3 air-quality.py > air-quality.json
$ structa air-quality.json
{
    str range="Blackshire".."St. Wigpools": {
        'alt': int range=31..85,
        'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
        'lat': float range=53.29812..53.6833,
        'long': float range=-2.901626..-2.362118,
        'readings': {
            str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
        },
        'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
    }
}
Note
It should be noted that the output of structa looks rather similar to the end of the air-quality.py script, where the “data” variable that is ultimately dumped is constructed. This neatly illustrates the purpose of structa: to summarize repeating structures in a mass of hierarchical data.
Looking at this output we can see that the data consists of a mapping (or JavaScript “object”) at the top level, keyed by strings in the range “Blackshire” to “St. Wigpools” (when sorted).
Under these keys are more mappings which have six keys (which structa has displayed in alphabetical order for ease of reading):
alt which maps to an integer in some range (in the example above 31 to 85, but this will likely be different for you)
euid which maps to a string which always starts with “GB” and is followed by several numerals
lat which maps to a floating point value around 53
long which maps to another floating point roughly around -2
ukid which maps to a string always starting with “UKA00” followed by several numerals
And finally, readings which maps to another dictionary keyed by reading-name strings, each of which maps to a further dictionary keyed by timestamps in string format, which in turn map to floating point values
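The nesting described above can be walked directly with Python’s json module. Here is a minimal sketch using a hypothetical cut-down record with the same shape (the values are invented):

```python
import json

# A tiny record mirroring the structure structa reported (values are made up)
doc = json.loads("""
{
    "Salport": {
        "euid": "GB1234A", "ukid": "UKA00123",
        "lat": 53.5, "long": -2.5, "alt": 42,
        "readings": {
            "NO": {"2020-01-01T00:00:00": 12.3}
        }
    }
}
""")

# location -> readings -> timestamp -> value, exactly as structa summarized
for location, station in doc.items():
    for reading, series in station['readings'].items():
        for timestamp, value in series.items():
            print(location, reading, timestamp, value)
```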
If you have a terminal capable of ANSI codes, you may note that types are displayed in a different color (to distinguish them from literals like the “ukid” and “euid” keys), as are patterns within fixed length strings, and various keywords like “range=”.
Note
You may also notice that several of the types (definitely the outer “str”, but possibly other types within the top-level dictionary, like lat/long) are underlined. This indicates that these values are unique throughout the entire dataset, and thus potentially suitable as top-level keys if entered into a database.
Just because you can use something as a unique key, however, doesn’t mean you should (floating point values being a classic example).
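The concept behind that underlining can be illustrated in a few lines: a field is a candidate key only if none of its values repeat across records. This is just the idea, not structa’s actual implementation:

```python
# Hypothetical records; 'euid' values are all distinct, 'alt' values are not
records = [
    {'euid': 'GB1012A', 'alt': 50},
    {'euid': 'GB1958A', 'alt': 50},
]

def is_unique(field):
    # A field is a candidate key if it has no duplicate values
    values = [r[field] for r in records]
    return len(values) == len(set(values))

print(is_unique('euid'), is_unique('alt'))  # True False
```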
Optional Keys
Let’s explore how structa handles various “problems” in the data. Firstly, we’ll make a copy of our script and add a chunk of code to remove approximately half of the altitude readings:
$ cp air-quality.py air-quality-opt.py
$ editor air-quality-opt.py
data = {
    location: {
        'euid': 'GB{:04d}A'.format(random.randint(200, 2000)),
        'ukid': 'UKA{:05d}'.format(random.randint(100, 800)),
        'lat': random.random() + 53.0,
        'long': random.random() - 3.0,
        'alt': random.randint(5, 100),
        'readings': {
            reading: {
                timestamp.isoformat(): loc_gen()
                for timestamp in timestamps
            }
            for reading, loc_gen in loc_gens.items()
        }
    }
    for location, loc_gens in generators.items()
}

for location in data:
    if random.random() < 0.5:
        del data[location]['alt']

json.dump(data, sys.stdout)
What does structa make of this?
$ python3 air-quality-opt.py > air-quality-opt.json
$ structa air-quality-opt.json
{
    str range="Blackshire".."St. Wigpools": {
        'alt'?: int range=31..85,
        'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
        'lat': float range=53.29812..53.6833,
        'long': float range=-2.901626..-2.362118,
        'readings': {
            str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
        },
        'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
    }
}
Note that a question-mark has now been appended to the “alt” key in the second-level dictionary (if your terminal supports color codes, this should appear in red). This indicates that the “alt” key is optional and not present in every single dictionary at that level.
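The idea behind the question-mark can be sketched simply: a key is optional if it appears in some, but not all, of the mappings at a given level. Again, this illustrates the concept rather than structa’s actual implementation:

```python
# Two hypothetical records at the same level; the second lacks 'alt'
records = [
    {'alt': 10, 'lat': 53.1, 'long': -2.5},
    {'lat': 53.2, 'long': -2.7},
]

all_keys = set().union(*records)                     # keys seen anywhere
common_keys = set(records[0]).intersection(*records) # keys seen everywhere
optional_keys = all_keys - common_keys

print(optional_keys)  # {'alt'}
```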
“Bad” Data
Next, we’ll make another script (a copy of air-quality-opt.py), which adds some more code to “corrupt” some of the timestamps:
$ cp air-quality-opt.py air-quality-bad.py
$ editor air-quality-bad.py
for location in data:
    if random.random() < 0.5:
        reading = random.choice(list(data[location]['readings']))
        date = random.choice(list(data[location]['readings'][reading]))
        value = data[location]['readings'][reading].pop(date)
        # Change the date to the 31st of February...
        data[location]['readings'][reading]['2020-02-31T12:34:56'] = value

json.dump(data, sys.stdout)
What does structa make of this?
$ python3 air-quality-bad.py > air-quality-bad.json
$ structa air-quality-bad.json
{
    str range="Blackshire".."St. Wigpools": {
        'alt'?: int range=31..85,
        'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
        'lat': float range=53.29812..53.6833,
        'long': float range=-2.901626..-2.362118,
        'readings': {
            str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
        },
        'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
    }
}
Apparently nothing! It may seem odd that structa raised no errors, or even warnings, when encountering subtly incorrect data. One might (incorrectly) assume that structa simply treats anything in a string that vaguely looks like a timestamp as one.
For the avoidance of doubt, this is not the case: structa does attempt to convert timestamps correctly and does not think February 31st is a valid date (unlike certain databases!). However, structa does have a “bad threshold” setting (structa --bad-threshold) which means not all data in a given sequence has to match the pattern under test.
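structa is written in Python, and Python’s own datetime parsing (a reasonable stand-in for whatever validation structa performs internally) rejects the corrupted timestamp:

```python
import datetime as dt

# The corrupted timestamp inserted by air-quality-bad.py
timestamp = '2020-02-31T12:34:56'
try:
    dt.datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S')
    valid = True
except ValueError:
    valid = False

print(valid)  # False: February 31st is not a real date
```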
Multiple Inputs
Time for another script (based on a copy of the prior air-quality-bad.py script), which produces each location as its own separate JSON file:
$ cp air-quality-bad.py air-quality-multi.py
$ editor air-quality-multi.py
for location in data:
    filename = location.lower().replace(' ', '-').replace('.', '')
    filename = 'air-quality-{filename}.json'.format(filename=filename)
    with open(filename, 'w') as out:
        json.dump({location: data[location]}, out)
We can pass all the files as inputs to structa simultaneously, which will cause it to assume that they should all be processed as if they have comparable structures:
$ python3 air-quality-multi.py
$ ls *.json
air-quality-blackshire.json air-quality-prestchester.json
air-quality-mancford-peccadillo.json air-quality-salport.json
air-quality-mancford-shartson.json air-quality-st-wigpools.json
$ structa air-quality-*.json
{
    str range="Blackshire".."St. Wigpools": {
        'alt': int range=15..92,
        'euid': str range="GB0213A".."GB1029A" pattern="GB[01][028-9][1-26-7][2-379]A",
        'lat': float range=53.49709..53.98315,
        'long': float range=-2.924566..-2.021445,
        'readings': {
            str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-2.982586..327.4161 }
        },
        'ukid': str range="UKA00148".."UKA00786" pattern="UKA00[135-7][13-47-8][06-9]"
    }
}
In this case, structa has merged the top-level mapping in each file into one large top-level mapping. It would do the same if a top-level list were found in each file too.
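Conceptually, the merge is similar to updating one dictionary with each file’s top-level mapping in turn. A sketch of the idea, using inline JSON strings in place of the actual files:

```python
import json

# Stand-ins for two of the per-location files written by air-quality-multi.py
files = {
    'air-quality-salport.json': '{"Salport": {"alt": 42}}',
    'air-quality-blackshire.json': '{"Blackshire": {"alt": 17}}',
}

# Fold each file's top-level mapping into one large mapping
merged = {}
for name, content in files.items():
    merged.update(json.loads(content))

print(sorted(merged))  # ['Blackshire', 'Salport']
```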
Conclusion
This concludes the structa tutorial series. You should now have some experience of using structa with more complex datasets, of tuning its various settings for different scenarios, and of what to look out for in the results to get the most out of its analysis.
In other words, if you wish to use structa from the command line, you should be all set. If you want help dealing with some specific scenarios, the sections in Recipes may be of interest. Alternatively, if you wish to use structa in your own Python scripts, the API Reference may prove useful.
Finally, if you wish to hack on structa yourself, please see the Development chapter for more information.