Survey Summarization in LangChain: A Tutorial
Ziggy Cross is a Prompt Engineer currently working on Meta's AI Personas team. He has previously worked at Kai Analytics as an NLP Engineer integrating large language models into the Unigrams pipeline and is also a recent graduate from the University of British Columbia with a Master’s degree in Computational Linguistics.
What is LangChain, and why is it taking the natural language processing world by storm?
In short, LangChain is a rapidly growing Python library built for creating powerful large language model applications.
In this article, I will show you how to build a basic text summariser using Python and LangChain. There is no coding experience necessary, and if you would like, you can follow along using Google Colab here. This option will let you run everything from your browser without needing to install any software.
Otherwise, if you're comfortable in Python and would like to follow along in your own editor, you can install all of the necessary libraries using:
pip install langchain openai docx2txt
Now we’ve got that sorted, let’s start developing! First, we need to decide what exactly we want our application to do.
Often at Kai Analytics, we collect large amounts of qualitative survey data. This data is rich in substance but can take a lot of time to evaluate. Part of our analytic process involves using our own purpose-built software, extending our ability to get meaningful insight from large amounts of data.
In this tutorial, we will design a basic summarisation app that can take a sample qualitative dataset as input, and give us back a concise summary of that data.
To do this we will need a few building blocks:
A way to read our data
A large language model
A prompt template
1. Document reader:
First, let’s create a system to load our data in. LangChain has a collection of 'Document Loaders' we can use. I have chosen to use the Docx2txtLoader, as .docx is an accessible and machine-readable file format. You are welcome to use another format if you’d like: there is a list of supported inputs here.
You can download our sample dataset below, or create your own. The document should be located in the same folder as your Python script if you are following along in your own editor.
To load in the file, we can use this Python code:
This will create a document loader, and then load the data from our customer comments file.
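If you are curious what a document loader is doing for us, here is a rough, library-free sketch. It relies on the fact that a .docx file is a zip archive whose main text typically lives in word/document.xml, and it pulls out the paragraph text using only Python's standard library. The function name read_docx_text is my own placeholder, and this is not how Docx2txtLoader is implemented internally; it only illustrates the idea of turning a document into plain text.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path):
    """Return the non-empty paragraphs of a .docx file, joined by newlines.

    Illustrative helper (hypothetical name), not part of LangChain.
    """
    # A .docx file is a zip archive; the body text is stored as XML inside it.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text may be split across several <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Calling read_docx_text on your survey file would give you a single string of comments, one paragraph per line. In practice the LangChain loader saves you from writing and maintaining this kind of code yourself.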
2. Large Language Model:
Next, we need to connect to a Large Language Model (LLM), which we will use to summarise our data. I have chosen to use OpenAI's GPT-3.5 model, as it is currently the most approachable. If you have not used this before, you will need to get an API key here. If you are interested in using another model, you can find a list of models supported by LangChain here.
Using LangChain, we can import the OpenAI interface, and then connect to our LLM with a single line of code:
If you would like to test the model's connection, you can run
llm.invoke("Hey! is this thing live?")
to check that everything is working.
3. Prompt template:
This third part is the most important, but also the most complicated.
To pass our text data to the LLM, we need to create a 'prompt template'. Usually, when you work with a language model, you ask it a question directly. A template lets us fix one part of the prompt and leave another part variable. Here, the fixed part of our prompt is the description of the task, and the variable part is the data we are asking it to summarise.
The prompt template should be specific and clear. Because we want our model to act like a market researcher who will summarise our qualitative survey data, we should tell it explicitly in our prompt template. Here's how we do this in Python:
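As a library-free sketch of the idea, a prompt template can be modelled as an ordinary Python string with a {text} placeholder, which is the same placeholder style LangChain's templates use by default. The wording of the template, the build_prompt helper and the sample comments below are all my own illustrative assumptions, so treat them as a starting point rather than the exact prompt.

```python
# Plain-Python sketch of a prompt template (not LangChain's own template class).
# The fixed part describes the task; {text} marks the variable part.
SUMMARY_TEMPLATE = (
    "You are a market researcher. Summarise the following qualitative "
    "survey responses as a short list of bullet points.\n\n"
    "Responses:\n{text}\n\n"
    "Summary:"
)


def build_prompt(text):
    """Fill the variable part of the template with the survey data."""
    return SUMMARY_TEMPLATE.format(text=text)


# Hypothetical sample data, used only to show the substitution.
sample_comments = "I loved the mango flavour.\nThe queue was too long."
prompt = build_prompt(sample_comments)
print(prompt)
```

Printing the filled-in prompt like this is a handy way to check exactly what the model will be shown before you spend any API credits.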
Designing prompts like the one above is called 'Prompt Engineering'. There are many approaches to this task, and all will have their own unique problems and benefits. I'd encourage you to experiment with different prompts and see what returns the best result!
Putting it all together: Chains
Great! Now we have all the building blocks of our application created (document reader, LLM, and prompt template), we can link them all together into a working application. To do this we can use chains, which, unsurprisingly, are a key part of building applications with LangChain.
We want our data to be fed into our prompt template to create a prompt, and our prompt to be fed into our LLM to create a summarisation.
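To make that flow concrete, here is a minimal plain-Python sketch of what a chain does conceptually. It is not LangChain code: make_chain and fake_llm are hypothetical names of my own, and fake_llm is a stand-in that simply reports the size of the prompt it received, so you can run this without an API key.

```python
def fake_llm(prompt):
    """Stand-in for a real model call: reports the prompt size instead of summarising."""
    return f"[model reply to a {len(prompt)}-character prompt]"


def make_chain(template, llm):
    """Return a function that fills the template, then passes the prompt to the model."""
    def run(text):
        prompt = template.format(text=text)  # step 1: data -> prompt
        return llm(prompt)                   # step 2: prompt -> model output
    return run


# Illustrative template; swap in your own wording.
summary_template = "Summarise these survey comments as bullet points:\n{text}"
summary_chain = make_chain(summary_template, fake_llm)

result = summary_chain("The staff were friendly.")
print(result)
```

The useful property here is that the caller only supplies the data; the template and the model are bundled inside the chain.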
We can use the built-in 'LLMChain' to do exactly this:
Now we have a complete chain, which takes an input, places it into our prompt template, and feeds that into our LLM. Let’s try it out!
Hopefully, you will get an output that summarises the text in our customer comments!
Try out some different prompts, and perhaps even some different data to see how our application holds up. Try to get a more verbose, then a more succinct output. Or perhaps try to focus just on customers' favourite flavours, and then other feedback about food. After you have tried out a few prompts, come back here for some next steps.
Further steps: Sequential Chains in LangChain
The real power of LangChain becomes apparent when you start linking these chains together to create larger, more complex applications. As an example of this, let’s take our summarised feedback, and pass that into another prompt template that generates creative, easy-to-achieve actionables for our business.
To do this we will need to:
Create another prompt template and LLM chain
Join both of our LLM chains together to make one larger chain
1. Another prompt template
This time, we want our model to roleplay a business advisor, so we should change our prompt to reflect this. Again, I would encourage you to customise this prompt to your own application!
2. Sequential Chain
Now we have both of our chains completed, we can make a simple sequential chain to feed the output of one chain (our summarisation chain) into the other (our actionables chain).
When we run our chain, you will see that the LLM first processes the input to create a series of bullet point summaries, and then those summaries are passed into a second prompt to create a list of actionables. I have turned on the 'verbose' setting so the chain will show you the output of each step.
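Conceptually, a sequential chain is just a loop in which each chain's output becomes the next chain's input. The sketch below shows this in plain Python rather than with LangChain's own classes. The names make_chain, run_sequence and echo_llm are hypothetical, the two templates are illustrative, and echo_llm is a stand-in model that wraps whatever prompt it receives so that each hop stays visible.

```python
def make_chain(template, llm):
    """Return a function that fills the template, then passes the prompt to the model."""
    def run(text):
        return llm(template.format(text=text))
    return run


def run_sequence(chains, text, verbose=False):
    """Feed text through each chain in order; each output becomes the next input."""
    for step, chain in enumerate(chains, start=1):
        text = chain(text)
        if verbose:
            # Mirrors the idea of a 'verbose' setting: show every intermediate result.
            print(f"Step {step} output: {text}")
    return text


def echo_llm(prompt):
    """Stand-in for a real model: wraps the prompt so each hop stays visible."""
    return f"<reply to: {prompt}>"


summary_chain = make_chain("Summarise: {text}", echo_llm)
actions_chain = make_chain("Suggest actionables for: {text}", echo_llm)

final = run_sequence(
    [summary_chain, actions_chain],
    "Great tea, slow service.",
    verbose=True,
)
```

Because the second prompt contains the first reply, you can see in the final string exactly how the summary was handed on to the actionables step.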
Now our hypothetical business can start implementing new techniques based on this distilled customer information, all inspired by powerful LLMs. Once we have new survey data, all we have to do is change the document file we are reading (and perhaps tweak our prompt templates a little bit), to get new results! Because we are using LangChain, this should all be very straightforward.
That’s all for today. Hopefully, you successfully created a basic LangChain application that utilises large language models and prompt engineering.
LangChain is a very powerful toolkit that is rapidly growing and can be used for a wide range of tasks. For example, some of our internal visualisations use LangChain to create succinct and meaningful topic labels that help us explore large amounts of data quickly. While LLMs are by no means a replacement for strong, human analysis, finding creative ways to leverage their strengths can be a great way to empower your market research team to handle more complex problems.
If you would like to learn more about the ways we are using technological tools to strengthen our processes in our product Unigrams, you can read more here.
Thanks for your time, and hopefully you found this tutorial useful!