Workshop Description

This 3-day interactive workshop introduces the overarching principles of generative modeling, with a focus on Large Language Models (LLMs), their use in Python for inference, and specific use-cases in Genomics. Experience with Python is necessary, and basic knowledge of ML workflows is preferred.

At the end of this workshop, you WILL be comfortable loading, running inference on, and experimenting with state-of-the-art LLMs in Python, and making small changes to suit your research interests in Genomics. You WILL NOT be exposed to the internal architecture of LLMs or to training your own models.

Workshop Topics

Day 1: Python Review and Introduction to Language Models
• Review of Python and ML fundamentals: Data structure fundamentals in Python on Google Colab, a review of NumPy and Scikit-Learn, and the basic ML workflow (Classification vs. Regression, Training vs. Inference, Loss Functions, Cross-Validation, Train vs. Test splits).
• Transformers and LLMs (Part 1): A short theory section introducing the Transformer, its architecture, and its function as an ML model. Introduction to Tokenizers.
• Prompting and Conditional Generation: Designing a small playground in Python to use prompting to programmatically run LLMs to perform text generation.
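
As a rough illustration of what the Day 1 tokenizer introduction refers to, the sketch below maps text to integer IDs and back using a tiny word-level vocabulary. The class name `ToyTokenizer` and the `<unk>` token are hypothetical choices made for this example; the tokenizers used with real LLMs typically work on subword units and are not reproduced here.

```python
class ToyTokenizer:
    """Word-level tokenizer: a simplified stand-in for real subword tokenizers."""

    def __init__(self, corpus):
        # Reserve ID 0 for unknown words, then number words by first appearance.
        self.token_to_id = {"<unk>": 0}
        for text in corpus:
            for word in text.lower().split():
                self.token_to_id.setdefault(word, len(self.token_to_id))
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        # Words missing from the vocabulary fall back to the <unk> ID.
        return [self.token_to_id.get(w, 0) for w in text.lower().split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


tokenizer = ToyTokenizer(["the gene encodes a protein", "the protein binds DNA"])
ids = tokenizer.encode("the protein encodes DNA")
print(ids)                    # [1, 5, 3, 7]
print(tokenizer.decode(ids))  # the protein encodes dna
```

Note that the round trip loses capitalization because this toy version lowercases its input; a model only ever sees the integer IDs, not the original text.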

Day 2: Inferencing Language Models and Broad Use-cases
• Transformers and LLMs (Part 2): Encoder- vs. decoder-style LLMs. Building intuition about why these models may be more useful than others, and an overview of applications in text and vision.
• LLM Store – HuggingFace: Introduction to HuggingFace, a library of pre-trained models contributed by the community that can be downloaded programmatically and used for inference.
• Inferencing LLMs with Python (Part 2): Setting up a custom model for inference in Python using the Torch and Transformers libraries.
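
The Day 2 flow of downloading a community model and running inference on a prompt can be sketched roughly as follows. This assumes the `transformers` and `torch` packages are available (as they typically are on Google Colab) and uses `gpt2` purely as a placeholder model name; the workshop itself may use different models. The helper names `build_prompt`, `load_generator`, `complete`, and `echo_generator` are hypothetical and exist only for this example.

```python
def build_prompt(instruction, context):
    """Combine an instruction and some context into a single prompt string."""
    return f"{instruction.strip()}\n\n{context.strip()}\n\nAnswer:"


def load_generator(model_name="gpt2"):
    """Download (on first use) and wrap a text-generation model.

    The import is done lazily so the rest of the sketch runs without the library.
    """
    from transformers import pipeline

    pipe = pipeline("text-generation", model=model_name)

    def generate(prompt, max_new_tokens=30):
        outputs = pipe(prompt, max_new_tokens=max_new_tokens)
        return outputs[0]["generated_text"]

    return generate


def complete(prompt, generate):
    """Return only the newly generated text, without the echoed prompt."""
    full_text = generate(prompt)
    return full_text[len(prompt):] if full_text.startswith(prompt) else full_text


def echo_generator(prompt, max_new_tokens=30):
    """Offline stand-in for a real model, so the sketch runs without a download."""
    return prompt + " (model output would appear here)"


prompt = build_prompt("Summarize in one sentence.",
                      "Transformers process all tokens of a sequence in parallel.")
print(complete(prompt, echo_generator))
# With network access, swap in a real model:
# print(complete(prompt, load_generator()))
```

Keeping the generator as a plain callable means the prompting code can be tried out offline first and pointed at a downloaded model afterwards without other changes.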

Day 3: Genomics-specific Use-cases and Summary
• Genomics-specific Applications of LLMs (Part 1): Introduction to various LLMs for Genomics.
• Genomics-specific Applications of LLMs (Part 2): Continuation. Running inference with the DNA-ESA model for sequence alignment.
• Summary: Recap of the discussed topics and a summary of next steps.

Technical Requirements

Please bring a computer with access to Google Colab. This is an interactive session with many hands-on coding and implementation exercises.

Instructor

Pavan Holur is a 6th-year PhD student in the Complex Networks Group (Electrical and Computer Engineering, PI: Vwani Roychowdhury) working on knowledge representation learning with a focus on Natural Language Processing (NLP). He has co-authored numerous papers in top conferences (ACL, ACM) and journals (PLoS, RYOS, JCSS) that involve generative applications of state-of-the-art Large Language Models (LLMs) in environments ranging from Genomics to Social Media. He was also CTO at Nextnet, a generative AI startup dedicated to accelerating drug discovery. His hobbies include playing chess, the guitar, and the Veena, and keeping fit.

Workshop Details

Prerequisites: Experience with Python is necessary, and basic knowledge about ML workflows is preferred.
Length: 3 days, 3 hrs per day
Level: Intermediate
Location: Boyer 529
Seats Available: 28

Fall 2024 Dates

Nov. 12, 13, and 14

1:30 PM – 4:30 PM

REGISTRATION IS OPEN!