  • NL Seminar - The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

    Thu, Mar 21, 2024 @ 11:00 AM - 12:00 PM

    Information Sciences Institute

    Conferences, Lectures, & Seminars

    Speaker: Anthony Chen and Shayne Longpre, MIT

    Talk Title: The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

    Abstract: REMINDER: This talk will be a live presentation only; it will not be recorded. Meeting hosts only admit guests they know to the Zoom meeting, so you’re highly encouraged to use your USC account to sign in to Zoom. If you’re an outside visitor, please send your full name, title, and workplace to nlg-seminar-host(at)isi.edu beforehand so we’ll be aware of your attendance. Also, let us know whether you plan to attend in person or virtually. More info on NL Seminars can be found at: https://nlg.isi.edu/nl-seminar/

    The arms race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we introduce the Data Provenance Initiative, a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data.

    Biography: Bio 1: Anthony Chen is an engineer at Google DeepMind doing research on factuality and long-context language models. He received his PhD from UC Irvine last year, where he focused on generative evaluation and factuality in language models.

    Bio 2: Shayne Longpre is a PhD candidate at MIT with a focus on data-centric AI, language models, and their societal impact.

    If the speakers approve recording of this NL Seminar talk, it will be posted on our USC/ISI YouTube page within 1-2 business days: https://www.youtube.com/user/USCISI. Subscribe here to learn more about upcoming seminars: https://www.isi.edu/events/

    Host: Jon May and Justin Cho

    More Info: https://nlg.isi.edu/nl-seminar/

    Webcast: https://www.youtube.com/watch?v=np9HeJN6miw

    Location: Information Sciences Institute (ISI) - Virtual and ISI-Conf Rm#689

    Audiences: Everyone Is Invited

    Contact: Pete Zamar

    Event Link: https://nlg.isi.edu/nl-seminar/

