ETL for LLMs
Hosted by Silicon Valley MongoDB User Group
Details
ETL for LLMs - everything you need to preprocess your unstructured and structured data to make it GenAI- and LLM-ready. The use case demonstrated will be a private GenAI Q&A system using Dolly v2, Spark, MongoDB & Dataworkz.
Description: ChatGPT had to be trained on enormous amounts of data to make it excel at human-like, iterative content creation. But it's only as good as the data it was trained on - ChatGPT can hallucinate (produce confident-sounding but erroneous output), especially when asked domain-specific questions.
What if you could train ChatGPT in a matter of minutes to answer questions based on your own data: PDF manuals, product reviews in semi-structured JSON, internal wikis, customer conversations in a CRM, and more? To make ChatGPT work with your data, you need to build sophisticated data pipelines, and it takes a new approach to data management to create chunks for better information retrieval. In this session you will build a data pipeline to process biomedical literature available on http://pubmed.gov.
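To give a flavor of the chunking step the session covers, here is a minimal sketch in Python. It assumes a paper from pubmed.gov has been downloaded locally as pubmed_article.pdf and that the pypdf and langchain-text-splitters libraries are installed; the file name and chunk parameters are illustrative, not Dataworkz's actual pipeline.

```python
# Minimal chunking sketch (illustrative): split a PubMed PDF into
# overlapping text chunks suitable for embedding and retrieval.
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter

reader = PdfReader("pubmed_article.pdf")  # hypothetical local download
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Overlap keeps sentences intact across chunk boundaries,
# which tends to improve retrieval quality.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_text(full_text)
print(f"{len(chunks)} chunks ready for embedding")
```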
In this workshop you will learn how Dataworkz streamlines creating a high-quality curated dataset from unstructured PDF files available on PubMed, and how to use MongoDB Atlas Vector Search to store LLM-ready embeddings generated with the embedding model of your choice: OpenAI's text-embedding-ada-002, all-mpnet-base-v2 from the MTEB benchmark, or one of your own from Hugging Face.
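As a rough illustration of the embed-and-store step (a sketch, not the Dataworkz workflow itself), the snippet below uses the sentence-transformers implementation of all-mpnet-base-v2 together with pymongo; the connection string, database, and collection names are placeholders.

```python
# Embed chunks and store them in MongoDB Atlas for vector search.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # 768-dimensional embeddings
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["pubmed"]["chunks"]  # placeholder database/collection names

docs = [
    {"text": chunk, "embedding": model.encode(chunk).tolist()}
    for chunk in chunks  # chunks from the previous sketch
]
collection.insert_many(docs)
```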
Come join us to experience the fastest path to building a Retrieval Augmented Generation (RAG) application: Dataworkz with MongoDB Atlas Vector Search.
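For a sense of the retrieval half of RAG, here is a sketch of an Atlas Vector Search query, continuing from the snippets above and assuming a vector index named vector_index has already been created on the embedding field (the index name and question are made up for illustration).

```python
# Retrieve the most relevant chunks for a question via Atlas Vector
# Search; $vectorSearch requires a pre-built vector index in Atlas.
question = "What are the side effects of metformin?"  # made-up example
query_vector = model.encode(question).tolist()

results = collection.aggregate([
    {"$vectorSearch": {
        "index": "vector_index",   # assumed pre-built Atlas vector index
        "path": "embedding",
        "queryVector": query_vector,
        "numCandidates": 100,
        "limit": 5,
    }},
    {"$project": {"text": 1, "_id": 0}},
])
context = "\n\n".join(doc["text"] for doc in results)
# `context` plus the question is then passed to the LLM
# (e.g., Dolly v2) to generate a grounded answer.
```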
Agenda:
4:00 pm Registration & Networking
4:15 pm "ETL for LLMs" Talk with demo with Nikhil Smotra of Dataworkz
5:00 pm Food & Beverage
5:30 pm Hands-on workshop
6:30 pm End
Speaker:
Nikhil Smotra, CTO and Co-founder, Dataworkz - Nikhil is driven by the potential for innovation and excited about leveraging advanced technologies such as artificial intelligence, especially LLMs, and applying them to extract valuable insights from customer data. Nikhil's deep experience with data management at scale led him to co-found Dataworkz. His vision is to create a self-service experience that brings together data, transformation, and AI applications for users of different skill levels.
Prior to Dataworkz, Nikhil was SVP, Head of Data Engineering at iQor, a leader in BPO and product support, where he led the development and management of big data platforms. Nikhil helped launch the enterprise data initiative and built a high-performing global data engineering team. During his tenure at iQor, he also managed QeyMetrics, a business intelligence and operational analytics SaaS offering. Earlier, Nikhil spent several years at Lockheed Martin (R&D), where he harnessed NoSQL technology before it gained mainstream popularity, combining it with semantic web technologies to build a massively scalable digital archive with automated data preservation, curation, and classification.
Nikhil is an executive alumnus of the Haas School of Business, UC Berkeley (Data Science and Analytics Program), and holds a B.E. in Computer Science from the University of Pune, India. He also served on the Advisory Board of Rutgers University's Big Data certificate program for executives from 2018 to 2022.