Workshop Description

In this workshop, we explore applications of Machine Learning to the analysis of biological data without the need for advanced programming skills. For example, Machine Learning techniques can be used to construct predictive models from a set of training examples, to remove noise and spurious artifacts from data (e.g. photobleaching), or to help visualize trends within high-dimensional datasets. This workshop will cover the basic principles behind these applications, such as pattern recognition, linear and non-linear regression, and cluster analysis. The workshop will be oriented towards hands-on activities, starting from the basics of how to load and prepare biological datasets in a Python environment. By the end of this workshop, students will be able to use the documentation of Scikit-Learn (and other libraries) to build models based on their own data, assess their performance, and make new predictions. Students are encouraged to attend the Advanced Python and Modern Statistics workshops, although no advanced knowledge will be assumed.

Day 1

  • Overview of Machine Learning: what is it and how does it help me?
  • Nano-review of Python
  • Introduction to Jupyter Notebooks
  • Introduction to Numpy and matrices
  • Importing and formatting datasets
  • Introduction to feature extraction
  • Introducing Scikit-Learn’s structure
    • Example: Principal Component Analysis
    • Plotting and simple visualizations
    • Understanding the documentation
  • In-class practice
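
To give a flavor of the Day 1 material, here is a minimal sketch of Principal Component Analysis with Scikit-Learn. It is not taken from the workshop materials: the data are randomly generated stand-ins for a real biological dataset, and the array sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for a dataset: 100 samples (rows), 5 features (columns).
X = rng.normal(size=(100, 5))
# Make the second feature strongly correlated with the first,
# so that one direction carries most of the variance.
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

# Scikit-Learn estimators share the same pattern: create, fit, transform.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # samples x retained components
print(pca.explained_variance_ratio_)  # fraction of variance per component
```

The same create/fit/transform pattern applies to most Scikit-Learn estimators, which is why the outline pairs this example with reading the documentation.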

Day 2

  • Introduction to classification and Scikit-Learn
    • Bayes classifier
    • Fisher linear discriminant
    • Support Vector Machines
    • Meta-parameters and regularization
    • Random Forests
  • Scores and validation
    • Accuracy and error
    • Precision-Recall
    • F-scores
  • Cross-validation
    • Bias in performance estimation
    • Leave-one-out Cross-validation
    • K-Fold Cross-validation
    • Common issues and speeding up cross-validation
  • In-class practice
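
As an illustration of how the Day 2 topics fit together, the sketch below scores a Support Vector Machine classifier with 5-fold cross-validation. It is an assumed example rather than workshop material: the dataset is synthetic (generated with `make_classification`), and the kernel, the value of `C`, and the number of folds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic two-class dataset standing in for real measurements.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# C is the regularization meta-parameter of the SVM.
clf = SVC(kernel="rbf", C=1.0)

# K-Fold cross-validation with K = 5, scored two different ways.
accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
f1 = cross_val_score(clf, X, y, cv=5, scoring="f1")

print("accuracy per fold:", accuracy)
print("F1 per fold:", f1)
```

Each score is computed on data held out from training, which is what makes cross-validated estimates less optimistic than scores measured on the training set.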

Day 3

  • Linear and non-linear regression
    • Linear models
    • Polynomial regression
    • Regression with Support Vector Machine
  • Generalized Linear Models
    • Meta-parameters and regularization
    • Lasso regression
  • Cross-validation and model selection
  • In-class practice
  • Overview of cluster analysis
    • Overview of methods
    • Fixed number of clusters: K-means
    • Can we learn K?
    • Hierarchical clustering
    • UPGMA methodology
  • In-class practice
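
Two of the Day 3 topics, Lasso regression and K-means clustering, can be sketched in a few lines. This is an illustrative example under stated assumptions, not workshop material: both datasets are synthetic, and the regularization strength `alpha` and the number of clusters are chosen by hand.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Regression: only the first 2 of 10 features actually influence y.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# alpha sets the strength of the L1 penalty, which pushes the
# coefficients of uninformative features towards exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)
n_nonzero = int(np.sum(lasso.coef_ != 0))
print("non-zero coefficients:", n_nonzero)

# Clustering: two well-separated groups of 50 points each.
blobs = np.vstack([
    rng.normal(loc=0.0, size=(50, 2)),
    rng.normal(loc=10.0, size=(50, 2)),
])

# K-means requires the number of clusters to be fixed in advance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(blobs)
labels = kmeans.labels_
print("cluster sizes:", np.bincount(labels))
```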

Technical Requirements

Attendees should have a working copy of Python 3 with the following packages:

  • Numpy
  • Matplotlib
  • Scikit-Learn
  • Jupyter Notebooks

We will post instructions that walk you through the installation.
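
As a quick check of the requirements listed above, a short script along the following lines reports which packages are visible to your Python 3 installation. It is only a sketch: the distribution names (in particular `notebook` for Jupyter Notebooks) are assumptions about how the packages were installed, and the script requires Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (assumed, see note above).
required = ["numpy", "matplotlib", "scikit-learn", "notebook"]

status = {}
for name in required:
    try:
        status[name] = version(name)  # installed version string
    except PackageNotFoundError:
        status[name] = None           # package not found

for name, found in status.items():
    print(f"{name}: {found if found else 'NOT INSTALLED'}")
```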


Dr. Thiago S. Mosqueiro is a postdoctoral researcher at the Pinter-Wollman lab (UCLA), primarily working on data analysis and simulations involving learning and behavior of honey bees. His main research interests are mathematical and computational modeling applied to systems biology, especially neuroscience. He received his BS and Ph.D. in Physics from the University of São Paulo (USP). During his Ph.D., he was a visiting graduate student at the BioCircuits Institute (BCI) and the Rady School of Management, at the University of California, San Diego (UCSD). Thiago has also contributed to research in applied fields, such as Machine Learning, Data Analysis of Chemical Sensors, and Cloud Computing.

Workshop Details

Prerequisites: Python
Length: 3 days, 3 hrs per day
Level: Intermediate
Location: Collaboratory Classroom (Boyer Hall, 529)
Seats Available: 28

Fall Dates

Nov. 28 – 30, 9:30am – 12:30pm