Usable information in Datasets

Jan 27, 2022 · London, United Kingdom

"Conditional probing: measuring usable information beyond a baseline"

Probing experiments investigate the extent to which neural representations make properties -- like part-of-speech -- predictable. One suggests that a representation encodes a property if probing that representation produces higher accuracy than probing a baseline representation like non-contextual word embeddings. Instead of using baselines as a point of comparison, we're interested in measuring information that is contained in the representation but not in the baseline. For example, current methods can detect when a representation is more useful than the word identity (a baseline) for predicting part-of-speech; however, they cannot detect when the representation is predictive of just the aspects of part-of-speech not explainable by the word identity. In this work, we extend a theory of usable information called V-information and propose conditional probing, which explicitly conditions on the information in the baseline. In a case study, we find that after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.
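
As a rough illustration of the quantity involved: conditional V-information can be estimated as the drop in held-out cross-entropy when a probe sees the representation in addition to the baseline, compared with a probe that sees only the baseline. The sketch below assumes you already have, for each held-out example, the probability each probe assigned to the gold label. The function names and the toy numbers are hypothetical and are not taken from the paper's code.

```python
import math


def v_entropy(gold_probs):
    """Average negative log2-probability of the gold labels (in bits)."""
    return -sum(math.log2(p) for p in gold_probs) / len(gold_probs)


def conditional_v_information(probs_baseline_only, probs_baseline_plus_rep):
    """Estimate of I_V(representation -> property | baseline).

    probs_baseline_only:     gold-label probabilities from a probe trained
                             on the baseline alone.
    probs_baseline_plus_rep: gold-label probabilities from a probe trained
                             on the baseline together with the representation.
    """
    return v_entropy(probs_baseline_only) - v_entropy(probs_baseline_plus_rep)


# Toy held-out probabilities (made up for illustration).
baseline_only = [0.5, 0.5, 0.25, 0.5]
baseline_plus_rep = [1.0, 0.5, 0.5, 1.0]

gain_bits = conditional_v_information(baseline_only, baseline_plus_rep)
print(f"information beyond the baseline: {gain_bits:.2f} bits")
```

A value near zero would suggest the representation adds little usable information about the property beyond what the baseline already provides.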

"Information-Theoretic Measures of Dataset Difficulty"

Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. Not only is this framework informal, but it also provides little understanding of how difficult each instance is, or what attributes make it difficult for a given model. To address these problems, we propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information. Measuring usable information is as easy as measuring performance, but has certain theoretical advantages. While the latter only allows us to compare different models w.r.t. the same dataset, the former also allows us to compare different datasets w.r.t. the same model. We then introduce pointwise V-information (PVI) for measuring the difficulty of individual instances, where instances with higher PVI are easier for model V. By manipulating the input before measuring usable information, we can understand why a dataset is easy or difficult for a given model, which we use to discover annotation artefacts in widely-used benchmarks.
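
To make PVI concrete, here is a minimal sketch. It assumes two models from the same family have already been trained, one on the real inputs and one on an empty (null) input, and that we have the probability each assigns to the gold label of every held-out instance. PVI is then the difference of the two log-probabilities, and its average over the held-out set estimates the dataset's V-information. All names and numbers below are hypothetical.

```python
import math


def pvi(p_gold_with_input, p_gold_null_input):
    """Pointwise V-information of one instance, in bits.

    Positive: the input made the gold label easier to predict.
    Negative: the model did better without seeing the input.
    """
    return math.log2(p_gold_with_input) - math.log2(p_gold_null_input)


def v_information(p_with_input, p_null_input):
    """Average PVI over a held-out set."""
    scores = [pvi(a, b) for a, b in zip(p_with_input, p_null_input)]
    return sum(scores) / len(scores)


# Toy gold-label probabilities (made up for illustration).
p_with_input = [1.0, 0.25, 0.5]
p_null_input = [0.5, 0.5, 0.5]

per_instance = [pvi(a, b) for a, b in zip(p_with_input, p_null_input)]
hardest = min(range(len(per_instance)), key=per_instance.__getitem__)
print("PVI per instance:", per_instance)
print("hardest instance index:", hardest)
print("estimated V-information:", v_information(p_with_input, p_null_input))
```

Sorting instances by PVI is one simple way to surface the examples a model finds hardest, which is where mislabelled or ambiguous items tend to be worth checking.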

** This meeting will be online using Zoom **
Please ensure that you install the Zoom app before the meeting in order to join in the discussions. It may also be possible to use the Zoom browser client but please check your audio/video setup in advance.

- 18:45: Attendees join and introduce themselves
- 19:00: Meetup starts
- 20:30: Close

A note about the Journal Club format:

1. The sessions usually start with a 5-10 minute introduction to the paper by the topic volunteer, followed by splitting into smaller groups to discuss the paper and other materials. We finish the session by coming together for about 15 minutes to discuss what we have learned as a group and ask questions around the room.
2. There is no speaker at Journal Club. A member of the community has volunteered their time to suggest the topic and start the session, but most of the discussion comes from within the groups.
3. You will get more benefit from the session if you read the paper or other materials in advance. We try to provide (where we can find them) accompanying blog posts, relevant code and other summaries of the topic to serve as entry points.
4. If you don't have time to do much preparation, please come anyway. You will probably have something to contribute, and even if you just end up following the other discussions, you can still learn a lot.

Event organizers
  • London Data Science Journal Club

    Keeping up with the latest research is important for the data scientist so let's work on this together. Each week or two, we will choose one or more articles to read and meet up to discuss them. This group is open to data scientists of any experience level and speciality but I expect the core group will be relatively small. I hope you are as excited about data science, machine learning, and statistics research as I am, and I look forward to meeting you! This is a variety of the Silicon Valley Data Science
