It's been a while, so save the date: October 15, 2024. Re-connect, in person, with your friends in the local R community.
Agenda:
6:30 PM Pizza and networking
7:00 PM John Mount: Using R and Stan to Learn Preferences
7:30 PM Earl Hubbell: Causes of Death in Cancer: Demonstrating Open Code With Synthetic Data
8:00 PM Bob Horton: Searching for Concepts in Semantic Space
8:30 PM General discussion
8:45 PM Wrap-up
Using R and Stan to Learn Preferences
Abstract:
A classic problem in online marketing is learning user preferences from past user selections and rejections. This is a hard problem, as most of the user’s behavior is hidden or unobserved, and is also confounded with nuisance effects.
Bayesian systems, such as Stan’s Markov chain Monte Carlo sampler, are ideal for inferring such hidden state.
I will describe what Stan is, and show how to call Stan from R or Python to characterize and solve the problem. The strength of the method is that Stan lets the user concentrate on describing the problem, keeping how the problem is solved somewhat separate from the specification. This makes Stan an excellent prototyping tool. I will also share a few “tricks for thinking probabilistically” and show how to run Stan and diagnose the results.
Seeing how this is done should allow participants to try Stan on their own problems.
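To give a feel for the workflow, here is a minimal sketch (assuming a recent rstan): a Bernoulli-logit model of selections, fit from R. The feature x, the priors, and the simulated selections are invented for illustration; this is not the talk's actual model.

library(rstan)

# A toy preference model: infer a hidden preference weight from
# observed accept/reject selections.
model_code <- "
data {
  int<lower=0> N;                     // offers shown to the user
  vector[N] x;                        // one feature of each offer
  array[N] int<lower=0, upper=1> y;   // 1 = selected, 0 = rejected
}
parameters {
  real alpha;                         // baseline propensity to select
  real beta;                          // hidden preference weight on x
}
model {
  alpha ~ normal(0, 2);               // weakly informative priors
  beta ~ normal(0, 2);
  y ~ bernoulli_logit(alpha + beta * x);
}
"

set.seed(2024)
N <- 200
x <- rnorm(N)
y <- rbinom(N, 1, plogis(-0.5 + 1.2 * x))   # simulated selections

fit <- stan(model_code = model_code, data = list(N = N, x = x, y = y))
print(fit)   # the posterior for beta should concentrate near 1.2

Note the separation the abstract describes: the model block states what the problem is; the sampler decides how to solve it.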
Speaker:
Dr. John Mount is a data science developer, consultant, speaker, and trainer. He has a Ph.D. in computer science from Carnegie Mellon. He is a general partner at Win Vector LLC and one of the authors of Practical Data Science with R (now in its second edition). He is also a co-author of numerous R packages, including vtreat, which performs automatic, reliable, low-dimensional re-encoding of categorical variables for machine learning problems. Win Vector LLC would love to help you with your data science projects!
-----------------------------------------------
Causes of Death in Cancer: Demonstrating Open Code With Synthetic Data
Abstract:
The recent paper “Avoiding lead-time bias by estimating stage-specific proportions of cancer and non-cancer deaths” asks the question: what do people die from if they are diagnosed with cancer at different stages? This matters for understanding whether an earlier stage at diagnosis leads to individuals living long enough to die from something other than their diagnosed cancer. Because the study draws on large databases of diagnosed cancer patients, sharing code and data is complicated by the need to respect privacy while still allowing the code to be demonstrated. One way to resolve this conundrum is to generate synthetic data replicating the structure of the real data, thereby allowing evaluation of the code while avoiding any possible inference about real individuals.
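To make the idea concrete, here is a minimal sketch of the approach in R. The variables, rates, and proportions below are invented for illustration and are not the paper's actual schema.

# Simulate a synthetic cohort that mimics the structure of a cancer
# registry without describing any real patient.
set.seed(42)
n <- 10000
stage <- sample(c("I", "II", "III", "IV"), n, replace = TRUE,
                prob = c(0.3, 0.3, 0.2, 0.2))

# Later stage at diagnosis -> shorter survival, on average
rate <- c(I = 0.02, II = 0.04, III = 0.10, IV = 0.30)[stage]
time_years <- rexp(n, rate)

# Probability of dying from the diagnosed cancer rises with stage
p_cancer_death <- c(I = 0.2, II = 0.4, III = 0.6, IV = 0.8)[stage]
cause <- ifelse(rbinom(n, 1, p_cancer_death) == 1, "cancer", "other")

synthetic <- data.frame(stage, time_years, cause)

# Analysis code can now be demonstrated and shared openly, since no
# row of `synthetic` corresponds to a real individual.
prop.table(table(synthetic$stage, synthetic$cause), margin = 1)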
Speaker:
Dr. Earl Hubbell is a Distinguished Scientist and Vice President at GRAIL, supporting multi-cancer early detection efforts.
----------------------------------------
Searching for Concepts in Semantic Space
Abstract:
Semantic embeddings are dense vector representations created by deep learning models, assigned in such a way that similar objects have similar vectors. These vectors can be thought of as points in a high-dimensional space. The examples in this presentation use Sentence Transformers to generate embedding vectors for passages of text. Such vectors are commonly used to search a database by example: the embedding is computed for an example passage, then similar passages are found in the database by vector comparison.
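For example, search-by-example reduces to a cosine-similarity ranking. In this minimal sketch, random vectors stand in for Sentence Transformer embeddings:

# Rank database passages by cosine similarity to the query
# passage's embedding vector.
cosine_sim <- function(query, mat) {
  # mat holds one embedding per row
  as.vector(mat %*% query) / (sqrt(rowSums(mat^2)) * sqrt(sum(query^2)))
}

set.seed(1)
db <- matrix(rnorm(1000 * 384), nrow = 1000)   # 1000 passages, 384-dim embeddings
query <- rnorm(384)                            # embedding of the example passage
head(order(cosine_sim(query, db), decreasing = TRUE))   # closest passages first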
Here I show how to compute a query vector that represents a category of objects, rather than a single example. This is done by using the embedding vectors as features in a logistic regression model that predicts a category label. The logistic regression model has one coefficient for each dimension of the embedding vector, so the vector of coefficients represents a point in the same embedding space. For unit-normalized embeddings, the point in the embedding space with the highest cosine similarity to the coefficient vector is also the one with the highest predicted probability from the logistic regression model. This reframes ML prediction as vector search, and lets us take advantage of the fast approximate nearest-neighbor search methods commonly available in vector databases.
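The equivalence is easy to check. In this sketch (made-up unit-normalized embeddings and labels, not the presentation's data), ranking by cosine similarity to the coefficient vector reproduces the ranking by predicted probability:

# Fit a logistic regression on embedding features, then compare the
# cosine-similarity ranking with the predicted-probability ranking.
set.seed(2)
d <- 32; n <- 500
emb <- matrix(rnorm(n * d), nrow = n)
emb <- emb / sqrt(rowSums(emb^2))             # unit-normalize each row
w_true <- rnorm(d)
label <- rbinom(n, 1, plogis(as.vector(emb %*% w_true)))

fit <- glm(label ~ emb, family = binomial)
w <- coef(fit)[-1]                            # one coefficient per dimension

p_hat <- predict(fit, type = "response")      # predicted probabilities
cos_sim <- as.vector(emb %*% w) / sqrt(sum(w^2))   # row norms are all 1

identical(order(cos_sim, decreasing = TRUE),
          order(p_hat, decreasing = TRUE))    # TRUE: the rankings agree

Because every embedding has unit length and the intercept is constant, ordering by cosine similarity to the coefficient vector is the same as ordering by the logistic linear predictor, hence by predicted probability.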
Examples come from predicting Medical Subject Headings (MeSH) terms in the open biomedical literature (PubMed Central). We show some simple qualitative approaches to characterizing biases in these MeSH-term prediction models, and some preliminary screening approaches for identifying the models that seem most promising for indexing other corpora.
Speaker:
Bob Horton is Principal Data Scientist at Win-Vector Labs. He came to data science via molecular biology, software development, and bioinformatics, and is a long-time member of BARUG.