Tiny Data Analysis Tutorial

Analyse&Predict Version 0.6 - Query your spreadsheet - Explore your data - Ask What-If Questions

Copyright (c) June 2016, Thom Fruehwirth, Ulm University, Germany. Programmed in SWI Prolog supported by Constraint Handling Rules.

This service comes with absolutely no warranty.

The Car Spreadsheet Example

To start with this tutorial, please load the sample spreadsheet file cars.csv at http://datamining.informatik.uni-ulm.de. It contains data about cars available in the US during the 1970s. The tool will return with giving you the minimum, median and maximum value in each field as well as the number of different values therein.

Approximationnameeconomy (mpg)cylindersdisplacement (cc)power (hp)weight (lb)0-60 mph (s)yearAnswer Property
minimaAMC Ambassador Brougham9368461613870
medianFord Maverick23415195283015.576
maximaVolvo Diesel8455514024.882
sizes306130583943569613

The first column tells us the entries in the field "name" range from "AMC Ambassador Brougham" to "Volvo Diesel" when sorted. In the middle is "Ford Maverick" and there are 306 different names. The next field is "economy (mpg)". When we sort this column, the lowest value is "9" and the median "23". The highest value is blank, because an empty field is used in the original spreadsheet if a value is not available (NA). When the numbers in this field are sorted, the blank field becomes the highest value. The cylinders column tells us that there are between 3 and 8 cylinders, with 4 being the median value. All in all, there are 5 different numbers for cylinders. Three numbers we already know, including the smallest and largest, so the missing two numbers must come from the numbers 5, 6 and 7. (One can ask queries for these values to find out that 7 is ruled out.) In the last row we see that the database covers the 13 years from 1970 to 1982.

The Average Car

The first query we might ask is if there is a car that has all the (rounded) median values in each field (except name and year) - so to say a perfectly "average" car with regard to these fields. For the fields we want to ignore, we write a hyphen "-". We press the "Return" key to get the answer.

Query:-23415095280015-
similar-25.1414088272015.4-lowmidhigh
related1-22412286239516-low
related1-27.5413495256014.2-mid
related1-36413584237013-high

There are some answer rows. The answer types in the very first column are "similar" and "related1". This means there are no exact matches to that query, but a similar one and several related ones. "related1" here means weakly similar. There are several such answer types, including "exact" and "close", which is in-between "similar" and "related" and "related1". Note that each answer row will correspond to an actual row in the spreadsheet, with the values we choose to ignore masked out. No values are made up.

Lets stay with the "related1" answer rows first. There are three exemplary rows, and they correspond to the "low", "mid" and "high" rows when the answers are sorted. The rows are first sorted on the field "economy (mpg)" and then from left to right on all remaining fields. "mid" stands for the median, middle row. The "low" and "high" rows cover a range of 75% of all the answers. We chose this 75% instead of 100% to rule out extreme (often erroneous) minimum and maximum entries. The three answer rows "low", "mid" and "high" are given for each answer types. Sometimes they may coincide. We see that the "economy (mpg)" ranges from 22 to 36. For the remaining 25% of the related rows, their values must be either lower or equal than 22 or higher or equal than 36. In the answer rows, the cylinders are always 4, but this does not mean that no other numbers are possible. E.g., there might be an row with "economy (mpg)" of 25 and 8 cylinders. For "power (hp)" the values are unordered. The "low" and "high" typically only applies the first field with values.

Now let's go the the only answer row for "similar". 75% of the similar matching rows to our query look like this row. So the "low", "mid" and "high" rows collapse into one row entry. The graphical bar indicates that as well. It is a visualization of the relative frequency of that particular row in all the answer rows, also weighted by the degree of similarity.

We would like to know which actual car(s) are that similar to the "average" car, therefore in the next query, we put a question mark "?" in the first field. This says that we want to see values for that field. Note that answer rows will always be sorted by fields with "?" in the query first, independent of the order of the fields in the original spreadsheet. In the current example query, "name" is the first field anyway.

Query:?23415095280015-
similarFord Fairmont (Man)25.1414088272015.4-lowmidhigh
related1Chevrolet Chevette32.149870212015.5-low
related1Maxda GLC Deluxe34.148665197515.2-mid
related1Toyouta Corona Mark II (Wagon)23412097250614.5-high

Again we see our familiar entry as the only answer row for "similar" matches. So there is only one car with these values, and that is a "Ford Fairmont (Man)". The rows for the "related1" matches have changed. We still cover 75% of all matches between the "low" and "high" rows, but now we sorted the rows according to the "?" query field, so we sorted according to the names of the cars, ranging from "Chevrolet" to "Toyouta" (this seems to be a typo in the original spreadsheet). So the exemplary rows for this range are different. Since what we see covers only three quarters of all answers, we can assume that there are also cars whose names start with "A" and "B" as well as "U" to "V" that are included in these related answers. (Remember that "AMC" and "Volvo" are the extreme values for the "name" field in this spreadsheet.)

Missing Information? - Fill in the Blanks!

The cars spreadsheet misses some information (denoted by empty fields). The following query asks for the cars that have an empty "economy (mpg)" field.

Query:???????
exactChevrolet Chevelle Concours (Wagon)8350165414211.570low
exactFord Torino (Wagon)835115340341170mid
exactVolkswagen Super Beetle 1174974819782071high

We can now ask for the missing information based on similar cars. We simply copy all fields for a particular car, say the Chevrolet, and put question mark "?" in the field with the missing value.

Query:Chevrolet Chevelle Concours (Wagon)?8350165414211.570
exactChevrolet Chevelle Concours (Wagon)8350165414211.570lowmidhigh
similarChevrolet Impala14835016542091271low
similarChevrolet Monte Carlo15840015037619.570
similarAMC Rebel SST (Wagon)836017538501170mid
similarChevrolet Chevelle Concours (Wagon)8350165414211.570high

As we see the Chevrolet Impala is a car with similar specifications, so we can assume that the value of the missing "economy (mpg)" field for the Chevrolet Wagon is about 14.

What If? - Relating Power and Acceleration

Let's explore the relationship between "power (hp)" and acceleration "0-60 mph (s)". We already know the latter has the values 8, 15.5 and 24.8 for minimum, median and maximum, respectively. The power is at least 46 with median 95. The maximum is the empty field, so the maximum value is unknown. If we ask for a car with a very high power of say 500, there are no exact matches but we get back as the highest value for close matches the number 230. We can consider the queries that we are going to ask as What-If-questions. The answers will tell us what we can expect if we fix certain parameters (field values) and leave others open. We put a "?" into the acceleration field and see how it is effected by a power of 50, 100 and 200, respectively. The following table shows some of the resulting answer rows (not all).

Query:----50-?-
close----52-16.5-low
close----52-19.4-
close----49-19.5-mid
close----52-22.2-
close----52-24.6-high

There are five answer rows shown, all of them have the same relative frequency. The power is mostly 52 and the acceleration centers around 19.5. So if we had a car with a power of about 50, we can expect an acceleration of about 20.

Query:----100-?-
exact----100-14.8-low
exact----100-16-mid
exact----100-17.7-high

For a power of 100, the exact answers center around an acceleration of 16, and this is also a frequent combination of values. These values are confirmed by the answer rows for close and similar matches. Thus a car with twice the power will typically accelerate faster.

Query:----200-?-
exact----200-15-lowmidhigh
close----198-10-low
close----208-11-
close----198-11.5-mid
close----200-15-high
similar----225-10-low
similar----190-12.2-mid
similar----150-14.5-high

For a power of 200, the exact answers collapse into one row - this is an indication that there is only one answer row. The close answer rows cover some more actual rows. The mid value of acceleration close to a power of 200 is 11.5. A similar value of 12.2 is confirmed by the answer rows for similar matches. These are the cars with fastest acceleration.

We now ask the queries the other way round - we give a question mark "?" for the power and query with the values of 8, 15 and 25 for acceleration. Looking at the close answers, for 8, we get a median value of 160 for power, for 15 of 95, for 25 of 71. These are confirmed by the rows that can be found in the other answer types.

So in conclusion, for faster acceleration we need more power.

It's All in the Cylinders

Similar worthwhile data analyses can be obtained by relating the number of cylinders to other parameters of cars. One not so obvious result is that 8 cylinder cars where more prevalent in the early 1970s, while 4 cylinder cars were more prevalent in the early 1980s. One would start with the following query to see the relative frequency of the number of cylinders.

Query:--?-----
exact--4-----midlow
exact--6-----
exact--8-----high

We see that 4 cylinders are more common than 8 and 6 cylinders together. 4 cylinders represent both the median and the low value - so they cover almost the entire lower half. 3 and 5 cylinders do not show up, so they must be quite rare. Indeed a corresponding query will show that they occur only in particular car models.

This ends the tutorial for now.

Feedback is welcome, mail to thom.fruehwirth@uni-ulm.de