The authors propose a new pipeline for interactive weak supervision. Instead of asking users to label individual samples, they ask users to label labeling functions (LFs), e.g. regular expressions for text parsing. The authors argue that since experts can write meaningful heuristic LFs by hand, they should also be able to tell meaningful auto-generated LFs from bad ones. Using the resulting quality estimates for the LFs, the authors then generate pseudo-labels and train the final model.
Here an LF family denotes a set of LFs of the same nature; e.g. all possible bigram detectors on text are different LFs within the same LF family.
We can generate LFs automatically at very low cost, and, even better, the library of possible LF families transfers well across tasks of the same nature. We will denote a single LF as $\lambda_i$. Each LF is characterised by its accuracy $\alpha_i$ and its propensity $\pi_i$ (the fraction of samples on which the LF does not abstain).
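To make the LF-family idea concrete, here is a minimal sketch of how a bigram-based LF family could be generated for a binary text task. This is my own illustration, not the authors' code: the abstain convention (vote 0), the frequency cutoff, and all function names are my assumptions.

```python
from typing import Callable, List

# A labeling function (LF) maps a raw example to a vote:
# +1 / -1 for the two classes, 0 for "abstain".
LF = Callable[[str], int]

def make_ngram_lf(ngram: str, vote: int) -> LF:
    """One member of the n-gram LF family: vote `vote` when `ngram` occurs, abstain otherwise."""
    def lf(text: str) -> int:
        return vote if ngram in text.lower() else 0
    return lf

def ngram_lf_family(corpus: List[str], n: int = 2, min_count: int = 20) -> List[LF]:
    """Auto-generate the whole family: one candidate LF per frequent n-gram and per class vote."""
    counts = {}
    for text in corpus:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i : i + n])
            counts[gram] = counts.get(gram, 0) + 1
    frequent = [g for g, c in counts.items() if c >= min_count]
    # Each n-gram yields two candidate LFs, one per class it could indicate.
    return [make_ngram_lf(g, v) for g in frequent for v in (+1, -1)]

def empirical_alpha_pi(lf: LF, texts: List[str], labels: List[int]) -> tuple:
    """Accuracy alpha_i and propensity pi_i of one LF (computable only when true labels are known)."""
    votes = [lf(t) for t in texts]
    fired = [(v, y) for v, y in zip(votes, labels) if v != 0]
    pi = len(fired) / len(texts)
    alpha = sum(v == y for v, y in fired) / len(fired) if fired else 0.0
    return alpha, pi
```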
At each query step, some $\lambda_i$ is shown to the $j$-th expert, who returns an estimate of its usefulness $u_{i,j}$, which serves as the expert's estimate of $\alpha_i$. The authors assume that the mean estimate $u_i = \frac{1}{N_{experts}} \sum \limits_j u_{i,j}$ is related to $\alpha_i$ by some monotonically increasing function, so the ranking of LFs w.r.t. $u_i$ is equivalent to the ranking w.r.t. $\alpha_i$. This assumption rests on the fact that experts are able to write good heuristics themselves.
Now, the question is how to estimate the quality of the LFs not yet scored by experts. This is needed both to propose candidate LFs more cleverly than random guessing and to include unscored LFs in the final labelling. For this the paper fits an ensemble of models on the expert feedback collected so far: for every LF $\lambda_i$ the ensemble yields a mean predicted usefulness $\mu(\hat{u}_i)$ and a spread $\sigma(\hat{u}_i)$, which drive the query selection described below.
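As a rough illustration of where $\mu(\hat{u}_i)$ and $\sigma(\hat{u}_i)$ can come from, here is a sketch of a bootstrap ensemble over per-LF feature vectors. The feature representation, the choice of logistic regression, and all names are my assumptions, not necessarily what the paper uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_usefulness(lf_features, queried_idx, queried_u, n_models=25, seed=0):
    """Estimate mu(u_hat_i) and sigma(u_hat_i) for every LF.

    lf_features : (n_lfs, d) numpy array, one feature vector per LF
                  (e.g. its vote pattern over the unlabeled pool).
    queried_idx : indices of LFs already shown to experts.
    queried_u   : binarized expert feedback (1 = useful, 0 = not) for those LFs.
    """
    rng = np.random.default_rng(seed)
    queried_idx = np.asarray(queried_idx)
    queried_u = np.asarray(queried_u)
    preds = []
    for _ in range(n_models):
        # Bootstrap the expert-labeled LFs so ensemble members disagree,
        # which is what gives a non-trivial sigma on the unscored LFs.
        boot = rng.choice(len(queried_idx), size=len(queried_idx), replace=True)
        # Assumes the feedback contains both useful and not-useful LFs;
        # resample until the bootstrap sample has both classes.
        while len(np.unique(queried_u[boot])) < 2:
            boot = rng.choice(len(queried_idx), size=len(queried_idx), replace=True)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(lf_features[queried_idx[boot]], queried_u[boot])
        preds.append(clf.predict_proba(lf_features)[:, 1])
    preds = np.stack(preds)                       # (n_models, n_lfs)
    return preds.mean(axis=0), preds.std(axis=0)  # mu(u_hat), sigma(u_hat)
```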
The authors consider three settings for selecting the LFs used for final label estimation: (1) any number of LFs may be selected, (2) only a limited number of LFs may be selected, and (3) only LFs explicitly ranked as useful by experts may be selected (e.g. due to bias monitoring or security requirements). Across these settings, not only the rule for building the final LF set differs, but also the strategy for choosing the next LF to show to an expert.
For (3) the task is to collect as many different meaningful annotations as possible within the limited budget of expert queries. The authors propose to pick the candidate LF as $\argmax \limits_i \mu(\hat{u}_i)$, and the final set of LFs is obtained by thresholding the actual $u_i$ provided by the experts.
For (1) and (2) the candidate is selected via $\argmax \limits_i \left[ 1.96 \, \sigma(\hat{u}_i) - \vert \mu(\hat{u}_i) - r \vert \right]$, where $r > 0.5$ is the minimal plausible accuracy.
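Both acquisition rules are easy to state in code. The sketch below shows the selection of the next LF to query in setting (3) and in settings (1, 2); the default value of `r` and the function names are mine.

```python
import numpy as np

def next_query_setting3(mu, already_queried):
    """Setting (3): query the LF the ensemble currently believes is most useful."""
    scores = np.array(mu, dtype=float)
    scores[np.asarray(sorted(already_queried), dtype=int)] = -np.inf
    return int(np.argmax(scores))

def next_query_settings12(mu, sigma, already_queried, r=0.7):
    """Settings (1, 2): score 1.96*sigma - |mu - r|, i.e. prefer LFs whose
    usefulness is both uncertain and close to the acceptance threshold r."""
    scores = 1.96 * np.array(sigma, dtype=float) - np.abs(np.array(mu, dtype=float) - r)
    scores[np.asarray(sorted(already_queried), dtype=int)] = -np.inf
    return int(np.argmax(scores))
```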
For (1) the final set of LFs is selected by thresholding $\mu(\hat{u}_i) > r$. For (2), which takes the maximum allowed size $m$ of the LF set into account, it is selected by greedily solving $\argmax \limits_{\vert \mathcal{D} \vert = m} \sum \limits_{i \in \mathcal{D}} [\mu(\hat{u}_i) > r] (2 \mu(\hat{u}_i) - 1) \pi_i$. (A gentle reminder: $\pi_i$ is the propensity, the fraction of the dataset on which $\lambda_i$ does not abstain.) This selection rewards not only accuracy but also coverage.
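Since the objective as written above is a plain sum of per-LF scores, my reading is that the greedy solution reduces to taking the top-$m$ scores; here is a sketch of both final-selection rules under that reading (names and the default `r` are mine).

```python
import numpy as np

def select_final_lfs(mu, pi, r=0.7, m=None):
    """Pick the final LF set.

    Setting (1): keep every LF with mu(u_hat_i) > r                  -> m is None.
    Setting (2): keep at most m LFs maximizing the summed score
                 [mu > r] * (2*mu - 1) * pi over the chosen set      -> m given.
    For this additive objective, greedy selection is just the top-m scores.
    """
    mu, pi = np.asarray(mu, dtype=float), np.asarray(pi, dtype=float)
    if m is None:
        return np.flatnonzero(mu > r)
    score = (mu > r) * (2.0 * mu - 1.0) * pi
    ranked = np.argsort(-score)[:m]
    return ranked[score[ranked] > 0]   # drop LFs that fail the threshold entirely
```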
Finally, the labels are estimated via a factor graph model. I spent quite some time trying to work out how the authors estimate the accuracy of each LF on each sample. I encourage you to read the paper if you suspect that this specific construction will be more beneficial than simply weighting with $\hat{u}_i$; Appendix C will be of particular interest.
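If you do settle for the simpler alternative mentioned above, a vote weighted by $\hat{u}_i$ could look like the sketch below. This is my illustration of that alternative, not the paper's factor graph; the vote encoding is assumed.

```python
import numpy as np

def weighted_vote_labels(votes, mu):
    """Pseudo-label each sample by a usefulness-weighted vote.

    votes : (n_samples, n_lfs) matrix with entries in {+1, -1, 0 (abstain)}.
    mu    : (n_lfs,) estimated usefulness mu(u_hat_i), used directly as weights
            instead of the factor-graph accuracies from the paper.
    """
    weighted = votes @ np.asarray(mu, dtype=float)  # positive -> class +1, negative -> class -1
    labels = np.where(weighted >= 0, 1, -1)
    covered = (votes != 0).any(axis=1)              # samples where at least one LF fired
    return labels, covered
```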
The authors show experiments on simple datasets such as Amazon reviews and IMDB, with comparisons to classical active learning methods, where the proposed approach performs favourably. However, there was a discussion in the reviews about whether the choice of such easy benchmarks makes it easier for humans to judge LFs than it would be in a realistic setting.
For the COCO dataset the authors propose to use a kind of kNN algorithm with centroids representing the LFs.
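My reading of that is roughly the following: each LF is tied to a centroid in some pre-trained embedding space and votes for a class only when the image embedding lands close enough to it. The distance threshold and all names below are my assumptions.

```python
import numpy as np

def make_centroid_lf(centroid, vote, radius):
    """An image-domain LF: vote `vote` when the embedding falls within `radius`
    of the centroid, abstain (0) otherwise."""
    centroid = np.asarray(centroid, dtype=float)
    def lf(embedding):
        return vote if np.linalg.norm(np.asarray(embedding) - centroid) <= radius else 0
    return lf

# The candidate family could then be built from, e.g., k-means centroids of
# pre-trained image embeddings, one LF per (centroid, class vote) pair.
```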