Hi, I'm Ben 🐚

I don’t get a lot of free time for development, but whenever I do, I’m currently putting it into making data testing, and specifically data contracts, as easy as possible!

Programming is messy enough anyway, but programming with data is extra messy, because your program gets to keep changing every time it runs, even if you change nothing. Lucky you!

There’s a whole load of products that exist for data testing, but I built a new one anyway. It’s called Wimsey! Most existing tools for data contracts are proprietary, and even the ones that are open source are often tied to products. That winds up having a huge impact on how they work: they tend to want you to build an entire server and database just to get them up and running. Not enough people test data to start with, without us making it hard for them! That’s where Wimsey comes in.

Step One: Running A Test

Say you have a dataframe. Let’s go with a Pandas one, because that’s the most common, but Wimsey will work with Polars, PyArrow, DuckDB, Modin, Dask and a bunch of others too! If you have a data contract defined in a JSON or YAML file, you can do this. Neat, huh?

import pandas as pd
import wimsey

df: pd.DataFrame = pd.read_excel("ohgodwhyisitanexcelfile.xlsx").pipe(wimsey.validate, "tests.json")
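
If you’re wondering what a data contract check actually boils down to, here’s a rough, standard-library-only sketch. To be clear, this is not how Wimsey works internally, and the contract layout is made up for illustration (Wimsey has its own test format, so check its docs for the real thing). `check_contract`, `ContractError` and the contract keys are all hypothetical names.

```python
import json

# Hypothetical contract layout, invented for this sketch (not Wimsey's format).
CONTRACT_JSON = """
{
  "columns": {
    "order_id": {"type": "int", "nullable": false},
    "amount": {"type": "float", "nullable": true}
  }
}
"""

# Map the type names used in the contract onto Python types.
TYPES = {"int": int, "float": float, "str": str}


class ContractError(Exception):
    """Raised when the data does not match the contract."""


def check_contract(rows, contract):
    """Return rows unchanged if every row satisfies the contract, else raise."""
    expected = contract["columns"]
    for i, row in enumerate(rows):
        # The set of columns must match exactly.
        if set(row) != set(expected):
            raise ContractError(
                f"row {i}: columns {sorted(row)} != {sorted(expected)}"
            )
        for name, rule in expected.items():
            value = row[name]
            if value is None:
                # Nulls are only allowed where the contract says so.
                if not rule["nullable"]:
                    raise ContractError(f"row {i}: {name} is null")
                continue
            if not isinstance(value, TYPES[rule["type"]]):
                raise ContractError(
                    f"row {i}: {name} should be {rule['type']}, "
                    f"got {type(value).__name__}"
                )
    return rows
```

Returning the data unchanged on success is what lets a check like this sit in the middle of a pipeline: good data flows straight through, bad data stops the run with an error.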

Step Two: Writing A Test

“But Ben, I don’t have a data contract, that’s the whole issue!?” I hear you saying, which is fair enough. You could write one manually, but if you already have 100 pipelines with no tests, that’s gonna take a long time. So instead, you could do this:

import pandas as pd
import wimsey

df: pd.DataFrame = pd.read_excel("ohgodwhyisitanexcelfile.xlsx").pipe(wimsey.validate_or_build, "tests.json")

If “tests.json” doesn’t exist, Wimsey will build a sensible test based on the dataframe it’s passed. That doesn’t do much on this run through, but next time we run this script, if the columns or values have changed a whole bunch, we’ll get an error. Hooray! Ok, maybe not hooray, errors are never fun, but finding out you have bad data from an error is better than from an angry analytics team.
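
To make the “build it the first time, check it every time after” idea concrete, here’s a small standard-library sketch of the pattern. It’s an assumption-heavy toy, not Wimsey’s actual logic: the min/max-with-some-slack rule, the `margin` parameter, and the names `build_contract`, `validate` and `validate_or_build_sketch` are all invented here, and the data is just a dict of column name to list of numbers.

```python
import json
from pathlib import Path


def build_contract(columns, margin=0.5):
    """Infer a loose min/max range per column from the data seen today."""
    contract = {}
    for name, values in columns.items():
        low, high = min(values), max(values)
        # Leave some slack so ordinary day-to-day drift still passes.
        slack = (high - low) * margin
        contract[name] = {"min": low - slack, "max": high + slack}
    return contract


def validate(columns, contract):
    """Raise ValueError if columns or values fall outside the contract."""
    if set(columns) != set(contract):
        raise ValueError(
            f"columns changed: {sorted(columns)} != {sorted(contract)}"
        )
    for name, values in columns.items():
        rule = contract[name]
        bad = [v for v in values if not rule["min"] <= v <= rule["max"]]
        if bad:
            raise ValueError(f"{name}: values out of range: {bad}")


def validate_or_build_sketch(columns, path):
    """Validate against the stored contract, or create it on the first run."""
    path = Path(path)
    if path.exists():
        validate(columns, json.loads(path.read_text()))
    else:
        path.write_text(json.dumps(build_contract(columns), indent=2))
    return columns
```

With this sketch, a first call with `{"amount": [10, 12, 14]}` just writes the contract file, a later call with `{"amount": [11, 13]}` passes quietly, and a call with `{"amount": [500]}` raises a `ValueError`.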

Obviously in this example we’re reading a single Excel sheet, which probably won’t change. In the wild, this would normally be an API or database call, for instance.

Step Three: Use a Cloud or Something

I’m not gonna shill for S3, but if you have S3, or any file store you like, you can really get going:

import pandas as pd
import wimsey

df: pd.DataFrame = pd.read_excel("ohgodwhyisitanexcelfile.xlsx").pipe(wimsey.validate_or_build, "s3://filestore/example/tests.json", storage_options={"endpoint_url": "https://cool.nice"})

Now you can not only tell your boss you have data testing, but that you’ve set up a fancy new “contracts store” too - win win win! Plus probably fewer headaches down the road, so win win win win I guess, but that’s getting a little silly.