Copyright (c) June 2016, Thom Fruehwirth, Ulm University, Germany. Programmed in SWI Prolog supported by Constraint Handling Rules.

*This service comes with absolutely no warranty.*

Approximation | name | economy (mpg) | cylinders | displacement (cc) | power (hp) | weight (lb) | 0-60 mph (s) | year | Answer Property |

minima | AMC Ambassador Brougham | 9 | 3 | 68 | 46 | 1613 | 8 | 70 | |

median | Ford Maverick | 23 | 4 | 151 | 95 | 2830 | 15.5 | 76 | |

maxima | Volvo Diesel | 8 | 455 | 5140 | 24.8 | 82 | |||

sizes | 306 | 130 | 5 | 83 | 94 | 356 | 96 | 13 |

The first column tells us the entries in the field "name" range from "AMC Ambassador Brougham" to "Volvo Diesel" when sorted. In the middle is "Ford Maverick" and there are 306 different names. The next field is "economy (mpg)". When we sort this column, the lowest value is "9" and the median "23". The highest value is blank, because an empty field is used in the original spreadsheet if a value is not available (NA). When the numbers in this field are sorted, the blank field becomes the highest value. The cylinders column tells us that there are between 3 and 8 cylinders, with 4 being the median value. All in all, there are 5 different numbers for cylinders. Three numbers we already know, including the smallest and largest, so the missing two numbers must come from the numbers 5, 6 and 7. (One can ask queries for these values to find out that 7 is ruled out.) In the last row we see that the database covers the 13 years from 1970 to 1982.

Query: | - | 23 | 4 | 150 | 95 | 2800 | 15 | - | |

similar | - | 25.1 | 4 | 140 | 88 | 2720 | 15.4 | - | lowmidhigh |

related1 | - | 22 | 4 | 122 | 86 | 2395 | 16 | - | low |

related1 | - | 27.5 | 4 | 134 | 95 | 2560 | 14.2 | - | mid |

related1 | - | 36 | 4 | 135 | 84 | 2370 | 13 | - | high |

There are some answer rows. The answer types in the very first column are "similar" and "related1". This means there are no exact matches to that query, but a similar one and several related ones. "related1" here means weakly similar. There are several such answer types, including "exact" and "close", which is in-between "similar" and "related" and "related1". Note that each answer row will correspond to an actual row in the spreadsheet, with the values we choose to ignore masked out. No values are made up.

Lets stay with the "related1" answer rows first. There are three exemplary rows, and they correspond to the "low", "mid" and "high" rows when the answers are sorted. The rows are first sorted on the field "economy (mpg)" and then from left to right on all remaining fields. "mid" stands for the median, middle row. The "low" and "high" rows cover a range of 75% of all the answers. We chose this 75% instead of 100% to rule out extreme (often erroneous) minimum and maximum entries. The three answer rows "low", "mid" and "high" are given for each answer types. Sometimes they may coincide. We see that the "economy (mpg)" ranges from 22 to 36. For the remaining 25% of the related rows, their values must be either lower or equal than 22 or higher or equal than 36. In the answer rows, the cylinders are always 4, but this does not mean that no other numbers are possible. E.g., there might be an row with "economy (mpg)" of 25 and 8 cylinders. For "power (hp)" the values are unordered. The "low" and "high" typically only applies the first field with values.

Now let's go the the only answer row for "similar". 75% of the similar matching rows to our query look like this row. So the "low", "mid" and "high" rows collapse into one row entry. The graphical bar indicates that as well. It is a visualization of the relative frequency of that particular row in all the answer rows, also weighted by the degree of similarity.

We would like to know which actual car(s) are that similar to the "average" car, therefore in the next query, we put a question mark "?" in the first field. This says that we want to see values for that field. Note that answer rows will always be sorted by fields with "?" in the query first, independent of the order of the fields in the original spreadsheet. In the current example query, "name" is the first field anyway.

Query: | ? | 23 | 4 | 150 | 95 | 2800 | 15 | - | |

similar | Ford Fairmont (Man) | 25.1 | 4 | 140 | 88 | 2720 | 15.4 | - | lowmidhigh |

related1 | Chevrolet Chevette | 32.1 | 4 | 98 | 70 | 2120 | 15.5 | - | low |

related1 | Maxda GLC Deluxe | 34.1 | 4 | 86 | 65 | 1975 | 15.2 | - | mid |

related1 | Toyouta Corona Mark II (Wagon) | 23 | 4 | 120 | 97 | 2506 | 14.5 | - | high |

Again we see our familiar entry as the only answer row for "similar" matches. So there is only one car with these values, and that is a "Ford Fairmont (Man)". The rows for the "related1" matches have changed. We still cover 75% of all matches between the "low" and "high" rows, but now we sorted the rows according to the "?" query field, so we sorted according to the names of the cars, ranging from "Chevrolet" to "Toyouta" (this seems to be a typo in the original spreadsheet). So the exemplary rows for this range are different. Since what we see covers only three quarters of all answers, we can assume that there are also cars whose names start with "A" and "B" as well as "U" to "V" that are included in these related answers. (Remember that "AMC" and "Volvo" are the extreme values for the "name" field in this spreadsheet.)

Query: | ? | ? | ? | ? | ? | ? | ? | ||

exact | Chevrolet Chevelle Concours (Wagon) | 8 | 350 | 165 | 4142 | 11.5 | 70 | low | |

exact | Ford Torino (Wagon) | 8 | 351 | 153 | 4034 | 11 | 70 | mid | |

exact | Volkswagen Super Beetle 117 | 4 | 97 | 48 | 1978 | 20 | 71 | high |

We can now ask for the missing information based on similar cars. We simply copy all fields for a particular car, say the Chevrolet, and put question mark "?" in the field with the missing value.

Query: | Chevrolet Chevelle Concours (Wagon) | ? | 8 | 350 | 165 | 4142 | 11.5 | 70 | |

exact | Chevrolet Chevelle Concours (Wagon) | 8 | 350 | 165 | 4142 | 11.5 | 70 | lowmidhigh | |

similar | Chevrolet Impala | 14 | 8 | 350 | 165 | 4209 | 12 | 71 | low |

similar | Chevrolet Monte Carlo | 15 | 8 | 400 | 150 | 3761 | 9.5 | 70 | |

similar | AMC Rebel SST (Wagon) | 8 | 360 | 175 | 3850 | 11 | 70 | mid | |

similar | Chevrolet Chevelle Concours (Wagon) | 8 | 350 | 165 | 4142 | 11.5 | 70 | high |

As we see the Chevrolet Impala is a car with similar specifications, so we can assume that the value of the missing "economy (mpg)" field for the Chevrolet Wagon is about 14.

Query: | - | - | - | - | 50 | - | ? | - | |

close | - | - | - | - | 52 | - | 16.5 | - | low |

close | - | - | - | - | 52 | - | 19.4 | - | |

close | - | - | - | - | 49 | - | 19.5 | - | mid |

close | - | - | - | - | 52 | - | 22.2 | - | |

close | - | - | - | - | 52 | - | 24.6 | - | high |

There are five answer rows shown, all of them have the same relative frequency. The power is mostly 52 and the acceleration centers around 19.5. So if we had a car with a power of about 50, we can expect an acceleration of about 20.

Query: | - | - | - | - | 100 | - | ? | - | |

exact | - | - | - | - | 100 | - | 14.8 | - | low |

exact | - | - | - | - | 100 | - | 16 | - | mid |

exact | - | - | - | - | 100 | - | 17.7 | - | high |

For a power of 100, the exact answers center around an acceleration of 16, and this is also a frequent combination of values. These values are confirmed by the answer rows for close and similar matches. Thus a car with twice the power will typically accelerate faster.

Query: | - | - | - | - | 200 | - | ? | - | |

exact | - | - | - | - | 200 | - | 15 | - | lowmidhigh |

close | - | - | - | - | 198 | - | 10 | - | low |

close | - | - | - | - | 208 | - | 11 | - | |

close | - | - | - | - | 198 | - | 11.5 | - | mid |

close | - | - | - | - | 200 | - | 15 | - | high |

similar | - | - | - | - | 225 | - | 10 | - | low |

similar | - | - | - | - | 190 | - | 12.2 | - | mid |

similar | - | - | - | - | 150 | - | 14.5 | - | high |

For a power of 200, the exact answers collapse into one row - this is an indication that there is only one answer row. The close answer rows cover some more actual rows. The mid value of acceleration close to a power of 200 is 11.5. A similar value of 12.2 is confirmed by the answer rows for similar matches. These are the cars with fastest acceleration.

We now ask the queries the other way round - we give a question mark "?" for the power and query with the values of 8, 15 and 25 for acceleration. Looking at the close answers, for 8, we get a median value of 160 for power, for 15 of 95, for 25 of 71. These are confirmed by the rows that can be found in the other answer types.

So in conclusion, for faster acceleration we need more power.

Query: | - | - | ? | - | - | - | - | - | |

exact | - | - | 4 | - | - | - | - | - | midlow |

exact | - | - | 6 | - | - | - | - | - | |

exact | - | - | 8 | - | - | - | - | - | high |

We see that 4 cylinders are more common than 8 and 6 cylinders together. 4 cylinders represent both the median and the low value - so they cover almost the entire lower half. 3 and 5 cylinders do not show up, so they must be quite rare. Indeed a corresponding query will show that they occur only in particular car models.

This ends the tutorial for now.

Feedback is welcome, mail to thom.fruehwirth@uni-ulm.de