Cloudera LLM & RAG - Bring your own data!

Name: Cloudera LLM & RAG - Bring your own data!
Start: 2024-01-16T17:00:00+01:00
End: 2024-01-16T21:00:00+01:00
Location: Hyperight Data Club

Hosted by Patrick C.

Stockholm MLOps Community

Details

Participate in a hands-on lab to customize an LLM to your liking.

Using the Cloudera tools, we start with a prototype and use RAG (Retrieval-Augmented Generation) to turn it into a chatbot that answers your questions based on your own context.

Since this is a hands-on lab, there are some prerequisites relating to the data you bring.

Prerequisites for the “Bring your own data! - LLM & RAG Lab”.

The beauty of this lab is that you get to see an LLM at work with your own documents. The only limit to what you can do with this is your own imagination. The caveat is that you have to prepare these documents to be suitable for RAG (Retrieval-Augmented Generation).

Short version:

Documents have to be:
in raw text format,
split into chunks of roughly 1 kB each,
with no chunk bigger than 16 kB.

More verbose version:

Determine what process or problem you want to work with. Examples:
Customer service solutions database.
Legal texts that govern a process.
Self-service how-to guides for customers.
If you are out of ideas, why not try a book from Project Gutenberg (https://www.gutenberg.org/)? Remember to use the .txt format though.
Get a set of data that you want to query with the LLM. It has to be in raw text (.txt) format. Encoding/character set: preferably UTF-8. This means your augmentation data can be in any language and script that UTF-8 can encode. Note that both embedders and LLMs (the two types of models we’ll be using in the lab) usually perform better on English because there is more English training material.
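To check the encoding requirement above before the lab, something like the following sketch may help. The helper names is_utf8 and to_utf8 and the sample file names are made up for illustration; iconv is the standard conversion tool, but it cannot detect a source encoding for you, so ISO-8859-1 here is only an assumed example.

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: convert file $2 from encoding $1 to UTF-8, written to $3.
# You have to know (or guess) the source encoding yourself.
to_utf8() {
  iconv -f "$1" -t UTF-8 "$2" > "$3"
}

# Demo on throwaway files in a temporary directory.
demo_dir=$(mktemp -d)
printf 'Sm\303\266rg\303\245sbord\n' > "$demo_dir/utf8.txt"    # valid UTF-8 bytes
printf 'Sm\366rg\345sbord\n'         > "$demo_dir/latin1.txt"  # ISO-8859-1 bytes

for f in "$demo_dir"/*.txt; do
  if is_utf8 "$f"; then
    echo "UTF-8 OK:  $f"
  else
    echo "NOT UTF-8: $f"
  fi
done

# Convert the file that failed the check (assumed to be ISO-8859-1).
to_utf8 ISO-8859-1 "$demo_dir/latin1.txt" "$demo_dir/latin1.utf8.txt"
```

Run the check on your own .txt files first, and convert only the ones that fail.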
When you have this base material, you need to chunk it into sections that are useful for the embedding model to process. A useful chunk size is typically about 1000 bytes, which is roughly half a page in a book.
The easiest way to chunk the data is to use “raw chopping”, i.e. just cut the text indiscriminately into pieces of a fixed byte size. This is easily done on the Linux command line.
split -b1k \
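As a worked sketch of raw chopping on a single file: the file name sample.txt and the output prefix chunk_ are placeholders chosen for this example, and the sample content is generated only so the commands can be run as-is.

```shell
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# Build a 2500-byte sample file (a stand-in for your own .txt document).
head -c 2500 /dev/zero | tr '\0' 'a' > sample.txt

# Cut it into pieces of at most 1k (1024 bytes); "chunk_" is the output name prefix.
split -b1k sample.txt chunk_

# List the pieces with their sizes in bytes.
wc -c chunk_*

# Raw chopping loses nothing: concatenating the pieces restores the original.
cat chunk_* | cmp - sample.txt && echo "chunks reassemble to the original"
```

Because the cut is made at a byte count, it can land in the middle of a word or even in the middle of a multi-byte UTF-8 character. For English text the latter is rare; for other languages it is worth spot-checking a few chunks.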
To do it for a full directory of files, use:
mkdir -p split_temp && cd split_temp && { for f in ../*.txt; do mkdir "$f.dir" && split -b1k "$f" && mv ./* "$f.dir"; done; cd ..; }
This command sequence creates a temporary working directory and, for each source file, splits the file there and then moves the pieces into a directory named after that file (for example book.txt.dir, next to the source file).
Advanced (and untested): if you’d like a cleaner chopping of your files, splitting by line count can be useful. If your source material uses exactly one line per paragraph, this chunking can give a good result, as a paragraph is a good “semantic container”.
split -l1 \
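Many plain-text sources wrap paragraphs over several lines and separate them with blank lines, so they do not meet the one-line-per-paragraph condition directly. The sketch below is one possible way to get there, not part of the original instructions: it uses awk's paragraph mode to join each paragraph onto one line and then splits by line. The file names and the para_ prefix are placeholders.

```shell
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# Sample input: two hard-wrapped paragraphs separated by a blank line.
printf 'First paragraph,\nwrapped over two lines.\n\nSecond paragraph.\n' > wrapped.txt

# awk paragraph mode (RS = "") reads one blank-line-separated block per record;
# replacing the inner newlines with spaces yields one line per paragraph.
awk 'BEGIN { RS = "" } { gsub(/\n/, " "); print }' wrapped.txt > one_per_line.txt

# One output file per line, i.e. per paragraph; "para_" is the output name prefix.
split -l1 one_per_line.txt para_

# Show the resulting pieces with their sizes in bytes.
wc -c para_*
```

With the default two-letter suffixes, split can produce at most 676 output files, so a longer document needs a larger suffix length, for example split -a 4 -l1. Paragraph lengths vary a lot, so checking the resulting file sizes matters even more with this method.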
Do check the size of your files afterwards. The limit of the LLM we will work with is 4k.
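One way to do the size check is sketched below. The helper name oversized_chunks is made up for this example, and the threshold is an assumption: the limit is given above as 4k without a unit, so the demo uses 4096 bytes. Adjust the number if the tutors specify otherwise.

```shell
# Hypothetical helper: list every regular file under directory $1
# that is larger than $2 bytes ("c" after the number means bytes in find).
oversized_chunks() {
  find "$1" -type f -size +"$2"c
}

# Demo on throwaway files in a temporary directory.
demo_dir=$(mktemp -d)
head -c 1024 /dev/zero > "$demo_dir/small_chunk"
head -c 5000 /dev/zero > "$demo_dir/big_chunk"

# Assumed threshold of 4096 bytes; only big_chunk is listed.
oversized_chunks "$demo_dir" 4096
```

An empty result means every chunk is within the threshold. Any file that is listed can be split again with a smaller size.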

***Update***

Note from your tutors Erik & David at Cloudera:

"Looking forward to seeing you all harness the joy of your custom chatbot!

The basic idea is that you can show your organisation a potential improvement from using RAG-LLM. OK, we'll have to admit it will only be a prototype at this time, but it will be demoable. To this end, we will keep your access to the lab environment open for a few days after the lab, so you will be able to show it off to your colleagues.

A few reminders:

- The prerequisites are paramount to a successful workshop. You need to bring (on your laptop) the text chunks that are described in the prep section of the meetup description. The lab without the prerequisite data from you is a no-go; this is a "bring your own data" workshop!

- If you are not able to prep your material, don't worry! We will schedule more of these and you can join at a later time. But in that case, please cancel your booking and leave room for someone else: we are fully booked, have a good waiting list and, mind you, there is a fair amount of prep going into this from those of us hosting!

- To make sure we are all up to date, please send a confirmation mail to esteinholtz@cloudera.com with "prep complete" (nothing more needed) in the subject line. It is appreciated by all of us that everyone is prepared."

Questions?
Send them to Erik Steinholtz, esteinholtz@cloudera.com
