API Docs 🧑‍💻

Wimsey provides an intentionally small number of functions, to keep working with data contracts simple.

Note that tests can be defined as Python functions, but those functions aren’t covered in this documentation - you’ll find them instead in the “Test Catalogue” section!

Test, Validate, Results & DataValidationError

For simple use cases, test, validate and DataValidationError will have you covered. test and validate are almost identical, except that test passes back a FinalResult data type, while validate passes back the initial dataframe and raises an error if any tests fail.

Both test and validate take df, contract (a string giving the filename, with fsspec-style locations supported, or alternatively a list of Python-defined tests) and an optional storage_options (fsspec-style storage options) argument.

Here’s an example of them in use:

import polars as pl
import wimsey
from wimsey import tests

df: pl.DataFrame = pl.read_csv("hopefully_nice_data.csv")

the_same_df: pl.DataFrame = wimsey.validate(df, "tests.json")

results = wimsey.test(
  df,
  [
    tests.count_should(column="col_a", be_exactly=10),
    tests.mean_should(column="col_b", be_greater_than=5),
  ],
)

If any of the tests run by validate fail, it will raise a DataValidationError. test will always pass back a FinalResult dataclass, containing success (a bool) and results (a list of Result dataclasses, one for each individual test, containing name, success and unexpected).
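
To make the shape of those return values concrete, here is a minimal sketch that walks a result object and collects the names of failing tests. The Result and FinalResult classes below are stand-ins written for this example so that it runs without Wimsey installed. They only mirror the fields described above, and the type of unexpected is an assumption. The helper failed_test_names is hypothetical and not part of Wimsey.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Result:
    """Stand-in for a per-test result (fields taken from the text above)."""

    name: str
    success: bool
    unexpected: Any = None  # assumption: the value that caused a failure, if any


@dataclass
class FinalResult:
    """Stand-in for the object that test passes back."""

    success: bool
    results: list[Result]


def failed_test_names(final: FinalResult) -> list[str]:
    """Return the name of every individual test that did not pass."""
    return [r.name for r in final.results if not r.success]


# Hand-built example data, not output from a real Wimsey run.
example = FinalResult(
    success=False,
    results=[
        Result(name="count_should", success=True),
        Result(name="mean_should", success=False, unexpected=3.2),
    ],
)

if not example.success:
    print("Failed tests:", failed_test_names(example))
```

With the real library, the same loop would presumably apply to whatever test passes back, while code calling validate would instead catch DataValidationError.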

Profiling

Under wimsey.profiling are several functions designed to help build out tests.

Test/Validate or Build

validate_or_build and test_or_build will run tests or, if they don’t yet exist, sample the dataframe to build some sensible starter tests and save them to the given location. Return-wise, they both behave identically to validate and test, but additionally take:

samples (int) - number of samples to take
n (int) - optional number of rows to take in each sample
fraction (float) - optional fraction of rows to take in each sample
margin (float) - how far from the ranges to place tests. Leaving it as 1 will make relatively strict tests; increasing it will make more lenient tests.

Here’s an example of test_or_build in use:

import polars as pl
import wimsey
from wimsey.types import FinalResult

results: FinalResult = (
  pl.read_csv("hopefully_nice_data.csv")
  .pipe(
    wimsey.test_or_build,
    "s3://test-store/tests.json",
    samples=100,
    fraction=0.4,
    storage_options={"endpoint_url": "https://s3.somewhere.cool"},
  )
)
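
The sampling parameters above can be pictured with plain Python lists instead of dataframes. This is only a sketch of the idea, not Wimsey's implementation: the function draw_samples is hypothetical, and requiring exactly one of n or fraction is an assumption made for the example.

```python
import random
from typing import Optional, Sequence


def draw_samples(
    rows: Sequence[int],
    samples: int,
    n: Optional[int] = None,
    fraction: Optional[float] = None,
    seed: int = 0,
) -> list[list[int]]:
    """Draw `samples` random subsets, each sized by `n` rows or by `fraction` of the rows."""
    # Assumption for this sketch: the caller picks one sizing option, not both.
    if (n is None) == (fraction is None):
        raise ValueError("pass exactly one of n or fraction")
    size = n if n is not None else int(len(rows) * fraction)
    rng = random.Random(seed)  # seeded so the example is repeatable
    pool = list(rows)
    return [rng.sample(pool, size) for _ in range(samples)]


# Mirrors samples=100, fraction=0.4 from the example above, on ten toy rows.
subsets = draw_samples(range(10), samples=100, fraction=0.4)
print(len(subsets), len(subsets[0]))
```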

Starter Test Generation

If you want to directly generate tests, you can do so with starter_tests_from_sampling and starter_tests_from_samples.

starter_tests_from_samples will take an iterable of dataframes, and a margin (float) explaining how far from the ranges to place tests.

starter_tests_from_sampling will take a single dataframe, and additionally:

samples (int) - number of samples to take
n (int) - optional number of rows to take in each sample
fraction (float) - optional fraction of rows to take in each sample
margin (float) - how far from the ranges to place tests. Leaving it as 1 will make relatively strict tests; increasing it will make more lenient tests.
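
The effect of margin can be illustrated with a small, library-independent sketch. It assumes a rule that this page does not confirm: bounds sit at the centre of the per-sample values, plus or minus margin times their spread. Wimsey's actual calculation may well differ, and the function name bounds_from_samples is hypothetical.

```python
from statistics import mean, pstdev


def bounds_from_samples(
    sample_means: list[float], margin: float = 1.0
) -> tuple[float, float]:
    """Place lower and upper bounds around per-sample means.

    Assumed rule (not taken from Wimsey): centre +/- margin * spread,
    where spread is the population standard deviation of the values.
    """
    centre = mean(sample_means)
    spread = pstdev(sample_means)
    return centre - margin * spread, centre + margin * spread


# Toy values, e.g. the mean of one column in three different samples.
observed = [9.0, 10.0, 11.0]

strict = bounds_from_samples(observed, margin=1)
lenient = bounds_from_samples(observed, margin=3)

# The larger margin gives the wider, more forgiving range.
print(strict, lenient)
```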

These two functions have equivalents, save_starter_tests_from_samples and save_starter_tests_from_sampling, that take path and storage_options arguments and save your tests into a file (choosing json or yaml depending on the file extension).

Here’s an example of save_starter_tests_from_samples in use:

from glob import glob

import polars as pl
import wimsey

wimsey.save_starter_tests_from_samples(
  path="my-first-tests.json",
  samples=[pl.read_csv(i) for i in glob("data/*.csv")],
  margin=3,
)
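
The extension-based choice between json and yaml mentioned above could look roughly like the sketch below. The helper contract_format is hypothetical, and treating .yml as yaml is an assumption. It only shows the dispatch idea, not how Wimsey decides internally.

```python
from pathlib import PurePosixPath


def contract_format(path: str) -> str:
    """Guess the contract file format from the path's extension (hypothetical helper)."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix == ".json":
        return "json"
    if suffix in {".yaml", ".yml"}:  # assumption: both spellings mean yaml
        return "yaml"
    raise ValueError(f"Unsupported contract extension: {suffix!r}")


print(contract_format("my-first-tests.json"))
print(contract_format("s3://test-store/tests.yaml"))
```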