Past Meetup

Reflections on testing Machine Learning systems

25 people went

Seal Software is the sponsor for this evening!

17:30 Some snacks and drinks
18:00 The lecture followed by discussions in groups

We live in a time when ML is becoming omnipresent in the technologies surrounding us, and "AI" is a buzzword for systems that make crucial decisions on which human lives depend. However, the core of these systems, the ML models themselves, remains largely outside the scope of rigorous testing. There is very little in either the academic literature or software engineering discussions about what vulnerabilities exist and how predictable (or unpredictable) such systems really are.

There is another paradigm in which ML is used to speed up, create, or maintain automated tests; here we focus only on the search for a systematic (where possible) way to QA AI systems powered by ML.

Crucial questions that need answers include:
How do we detect bugs, anomalies, randomness, and errors in ML systems? We may not always know the exact input, since such systems are geared towards unstructured and often noisy data; consequently, we may not always know the exact output, or even what the correct answer should be.

How strict should we be with the criteria that allow tests to pass or fail? Should there be a scale, a threshold, an exact value, or a fuzzy value?
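
As one illustration of a threshold-style criterion, the sketch below gates a model on aggregate accuracy rather than exact per-example matches; the threshold value and the data are made-up assumptions.

```python
# Sketch: instead of exact per-example pass/fail, gate a model on an
# aggregate quality threshold. Predictions and labels are illustration data.

def accuracy(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

THRESHOLD = 0.90  # chosen per task; picking this value is itself a QA question

predictions = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
labels      = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]

score = accuracy(predictions, labels)
passed = score >= THRESHOLD
print(f"accuracy={score:.2f}, passed={passed}")
```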

How do we search for weaknesses in the ML system, and for adversarial attacks that exploit those weaknesses?
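
A crude way to probe for such weaknesses is to search the neighbourhood of an input for a perturbation that flips the model's decision. The sketch below does a random search against a toy linear classifier; the weights, input, and search budget are illustrative assumptions, not a real attack method such as gradient-based FGSM.

```python
import random

# Naive adversarial search sketch: randomly perturb an input within a
# per-feature budget and check whether a toy linear classifier flips
# its decision. The weights are illustrative, not from any real system.

WEIGHTS = [1.0, -2.0, 0.5]

def predict(x):
    score = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1 if score >= 0 else 0

def find_adversarial(x, budget=0.5, trials=1000, seed=0):
    # Look for a nearby input (each feature shifted by at most +/- budget)
    # on which the model's prediction changes.
    rng = random.Random(seed)
    original = predict(x)
    for _ in range(trials):
        perturbed = [xi + rng.uniform(-budget, budget) for xi in x]
        if predict(perturbed) != original:
            return perturbed
    return None

x = [0.3, 0.2, 0.1]  # sits close to the decision boundary
adv = find_adversarial(x)
print("weakness found" if adv is not None else "no flip within budget")
```

Inputs near the decision boundary flip easily, which is exactly the kind of fragility such a search is meant to surface.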

In ML we talk about supervised and unsupervised training to derive an ML model. If we focus on supervised ML, there are two stages: a training stage and a classification stage (i.e. the application of the model). How do we design tests for the training stage? We often depend on training data, but how can we verify that the data is sound and correct for the given task?
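
One possible starting point for verifying training data is a battery of mechanical sanity checks run before training; the particular checks and the toy dataset below are illustrative assumptions.

```python
# Sketch: basic sanity checks on a labelled training set before training.
# The dataset and the chosen checks are illustrative assumptions.

def check_training_data(examples, allowed_labels):
    """Return human-readable problems found in a list of (text, label) pairs."""
    problems = []
    seen = set()
    counts = {label: 0 for label in allowed_labels}
    for i, (text, label) in enumerate(examples):
        if label not in allowed_labels:
            problems.append(f"row {i}: unknown label {label!r}")
            continue
        counts[label] += 1
        if not text or not text.strip():
            problems.append(f"row {i}: empty input")
        if (text, label) in seen:
            problems.append(f"row {i}: duplicate example")
        seen.add((text, label))
    # Crude class-balance check: flag any label with no examples at all.
    for label, n in counts.items():
        if n == 0:
            problems.append(f"label {label!r} has no examples")
    return problems

data = [("spam offer", "spam"), ("meeting at 5", "ham"),
        ("spam offer", "spam"), ("", "ham"), ("hello", "junk")]
print(check_training_data(data, {"spam", "ham"}))
```

Checks like these cannot prove the data is correct for the task, but they catch the mechanical defects that silently degrade a trained model.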

Most ML systems are measured by their precision, recall, F-measure, and accuracy. These metrics are calculated against a given gold-standard data set, so the quality, diversity, and sanity of that data are vital. But are these statistical measures the only way to test and QA the training stage?
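
For reference, these metrics can be computed directly from the confusion counts against a gold standard; the binary labels below are made-up illustration data.

```python
# Sketch: precision, recall, F1, and accuracy for a binary task,
# computed against a gold-standard set. Labels are illustration data.

def metrics(gold, predicted, positive=1):
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    tn = sum(g != positive and p != positive for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

gold      = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
print(metrics(gold, predicted))
```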

AI systems make mistakes, which are gradually corrected with more data (or by cleaning the data), so the same mistakes should not repeat, but new ones may be introduced. Testing may therefore need to be dynamic.
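
One way to make testing dynamic in this sense is a growing regression suite: every corrected mistake becomes a permanent test case, so the same error cannot silently return after retraining. A minimal sketch, with a hypothetical keyword rule standing in for a retrained classifier:

```python
# Sketch of a growing regression suite: each input the model once got wrong
# (and that was later fixed) is kept as a permanent test case.
# `classify` is an illustrative stand-in for a retrained model.

def classify(text: str) -> str:
    lowered = text.lower()
    return "spam" if "winner" in lowered or "free" in lowered else "ham"

# Each entry records an input the model once misclassified, plus the
# expected label after the fix.
regression_cases = [
    ("You are a WINNER, claim now", "spam"),
    ("Free tickets inside", "spam"),
    ("Lunch tomorrow?", "ham"),
]

failures = [(text, expected) for text, expected in regression_cases
            if classify(text) != expected]
print(f"{len(regression_cases) - len(failures)} passed, {len(failures)} failed")
```

Because the suite only ever grows, it pins down past behaviour while new metrics-based tests guard against the new mistakes each retraining may introduce.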

We would like to open the floor for discussion, ideas, and cooperation regarding QA for ML-powered systems.