How to implement your own ChatGPT Code Interpreter

What is Code Interpreter?#

After the launch of ChatGPT Code Interpreter some time ago, everyone should have gained some understanding of what it is and what it can do. Therefore, we will not repeat these basic questions here. Instead, let's look at how to understand Code Interpreter from the perspective of processes and elements.


In a regular ChatGPT interaction, the process and its elements are simply: Prompt => Output text. This is why Prompt Engineering emerged as a concept immediately after ChatGPT launched, along with the accompanying work of constructing and embedding prompts. The process is short and the elements are simple; constructing a good prompt is the key. For Code Interpreter, the process and elements become: Prompt => Code ==Interpreter==> Output (description, image, file...). This brings some changes:

  1. Prompt construction is no longer directly aimed at output but at generating intermediate code.
  2. A code interpreter is needed to distinguish sessions, execute code, save intermediate variables, etc.
  3. Output becomes more diverse, which can include images, files, etc.

Why Implement Code Interpreter#

ChatGPT has already implemented Code Interpreter, relying on OpenAI's GPT model, which is quite powerful. Why do we still need to implement our own Code Interpreter? Besides aligning with industry leaders and integrating internal model capabilities, we can also think about what incremental benefits we might gain from implementing it ourselves. Some typical increments include:

  1. Ability to interact with real-time data: ChatGPT's Code Interpreter does not have the networking capabilities of Plugins, and once Code Interpreter is enabled, plugins cannot be selected. As a result it lacks access to real-time data and cannot do things like "plot the stock performance of Apple Inc. in 2023."
  2. Ability to interact with more environments: After local or cloud deployment, a more flexible environment is available, whether it is operating the file system, calling APIs, or installing packages not supported by ChatGPT Code Interpreter.

Thoughts on Implementing Code Interpreter#

To implement Code Interpreter, there are two core focuses: first, leveraging model capabilities, such as generating the code to be called using OpenAI API's Function Calling capability; second, having an environment that can execute Python code. For example, if the user specifies that they want to output a sine function graph, we need to obtain the code to plot the sine function graph, then send it to the Python interpreter for execution, outputting the image to display to the user. In this process, the LLM Agent may also need to provide some explanations and details about the results. Additionally, file I/O and the ability to save variables in the session need to be considered.
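To make the first focus concrete, here is a minimal sketch of asking the model for code via Function Calling, written against the pre-1.0 openai SDK that was current when the feature launched (the run_python function name and its schema are made up for this illustration, and the API key is assumed to be set via the OPENAI_API_KEY environment variable):

import json
import openai

# Describe a single "function" whose only argument is the Python code to run.
functions = [
    {
        "name": "run_python",
        "description": "Execute Python code in a Jupyter kernel and return its output.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"}
            },
            "required": ["code"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Plot a sine function graph."}],
    functions=functions,
    function_call={"name": "run_python"},  # force a code-producing function call
)

# The generated code arrives as the JSON-encoded arguments of the function call.
args = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
code_to_run = args["code"]  # hand this string to the interpreter environment

The code_to_run string is then what gets sent to the Python execution environment described next.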

Implementing this with LangChain is more convenient. The Python interpreter, file I/O, and variable persistence can each be treated as a LangChain Tool and handed to the LangChain Agent Executor for invocation. Following this idea, the community has already produced an open-source implementation, codebox-api, which can be registered as a LangChain Tool. Beyond the core capability of executing code, some surrounding components are still needed: session management, kernel invocation, and file I/O. Concretely, every time a new session is created, a new Jupyter Kernel channel is created to execute code, and the execution results are categorized by output type and fed back to the user. The author of codebox-api has packaged all of this into a complete solution: codeinterpreter-api.
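For reference, here is a minimal sketch of running code through codebox-api on its own, before it is wrapped as a LangChain Tool; the usage follows the project's README at the time of writing, and the exact API may differ between versions:

from codeboxapi import CodeBox

with CodeBox() as codebox:
    # Variables are preserved between runs within the same CodeBox session.
    codebox.run("a = 21")
    output = codebox.run("print(a * 2)")
    print(output)  # -> 42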

Design of Implementing Code Interpreter#

Next is a breakdown of the codeinterpreter-api project, looking at how to design and implement the above ideas. Since the project mainly uses LangChain to orchestrate the entire process, here are some basic concepts used in LangChain:

Some Basic Concepts in LangChain#

  • LangChain Agents: A foundational module in LangChain, the core idea is to use LLM to select a series of actions to take. Unlike hard-coded action sequences in chains, agents use language models as reasoning engines to determine which actions to take and in what order.
  • LangChain Tools: Tools are the capabilities called by Agents. Two main aspects need to be considered: providing the correct tools to the Agent and describing these tools in a way that is most helpful to the Agent. When creating a custom StructuredTool for Code Interpreter, it is necessary to define: name, description, func (synchronous function), coroutine (asynchronous function), args_schema (input schema).
  • LangChain Agent Executor: The Agent Executor is the runtime for Agents. It actually calls the Agent and executes the actions it selects. This executor also performs some additional complexity-reducing tasks, such as handling cases where the Agent selects a non-existent Tool, where the Tool encounters an error, or where the Agent produces output that cannot be resolved as a Tool call.

Process Design and Implementation#

With the above basic concepts in place, we can look at how to implement Code Interpreter based on LangChain Agent. Let's examine the specific execution process through the following code:

from codeinterpreterapi import CodeInterpreterSession, File

async def main():
    # context manager for start/stop of the session
    async with CodeInterpreterSession(model="gpt-3.5-turbo") as session:
        # define the user request
        user_request = "Analyze this dataset and plot something interesting about it."
        files = [
            File.from_path("examples/assets/iris.csv"),
        ]
        # generate the response
        response = await session.generate_response(user_request, files=files)
        # output the response (text + image)
        response.show()

if __name__ == "__main__":
    import asyncio
    # run the async function
    asyncio.run(main())

The effect is shown in the image below:

[Screen recording (Kapture) of the example run]
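Besides response.show(), the response returned by generate_response can also be consumed programmatically, which is useful outside a notebook. A short sketch, assuming the output File objects carry the same name and content fields as the input File objects shown later in this post:

# response.content holds the agent's textual answer;
# response.files holds generated artifacts such as the plotted image.
print(response.content)
for file in response.files:
    with open(file.name, "wb") as f:
        f.write(file.content)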

Execution Environment and Tool Instantiation#

When a session is created via the with context manager, we instantiate the Jupyter kernel and the agent executor. Here are the key steps:

  1. Create a service to communicate with the Jupyter kernel using jupyter-kernel-gateway and verify that it has started successfully.
self.jupyter = await asyncio.create_subprocess_exec(
	python,
	"-m",
	"jupyter",
	"kernelgateway",
	"--KernelGatewayApp.ip='0.0.0.0'",
	f"--KernelGatewayApp.port={self.port}",
	stdout=out,
	stderr=out,
	cwd=".codebox",
)
self._jupyter_pids.append(self.jupyter.pid)

# ...
while True:
	try:
		response = await self.aiohttp_session.get(self.kernel_url)
		if response.status == 200:
			break
	except aiohttp.ClientConnectorError:
		pass
	except aiohttp.ServerDisconnectedError:
		pass
	if settings.VERBOSE:
		print("Waiting for kernel to start...")
	await asyncio.sleep(1)
await self._aconnect()

We specify stdout and stderr, record the process pid, and associate this kernel instance with the session. Once the gateway starts responding to the HTTP polling, an HTTP request creates a kernel and a websocket connection to it is established.

  2. Create the Agent Executor
def _agent_executor(self) -> AgentExecutor:
	return AgentExecutor.from_agent_and_tools(
		agent=self._choose_agent(),
		max_iterations=9,
		tools=self.tools,
		verbose=self.verbose,
		memory=ConversationBufferMemory(
			memory_key="chat_history",
			return_messages=True,
			chat_memory=self._history_backend(),
		),
	)

def _choose_agent(self) -> BaseSingleActionAgent:
	return (
		OpenAIFunctionsAgent.from_llm_and_tools(
			llm=self.llm,
			tools=self.tools,
			system_message=code_interpreter_system_message,
			extra_prompt_messages=[
				MessagesPlaceholder(variable_name="chat_history")
			],
		)
		# ...
	)

def _tools(self, additional_tools: list[BaseTool]) -> list[BaseTool]:
	return additional_tools + [
		StructuredTool(
			name="python",
			description="Input a string of code to a ipython interpreter. "
			"Write the entire code in a single string. This string can "
			"be really long, so you can use the `;` character to split lines. "
			"Variables are preserved between runs. ",
			func=self._run_handler, # Call CodeBox for synchronous execution
			coroutine=self._arun_handler, # Call CodeBox for asynchronous execution
			args_schema=CodeInput,
		),
	]

Here, we define the use of OpenAIFunctionsAgent, which we can replace with our own Agent if needed. However, currently, only OpenAI's API has a convenient and powerful Function Calling capability, so we use this as an example. We also specify the Tool that the Agent and Agent Executor will use, which includes the name, description, and other parameters for executing Python code. The Jupyter kernel instance created in the previous step is further encapsulated by CodeBox and passed to the Tool as synchronous and asynchronous calling methods.
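The args_schema referenced above, CodeInput, is a Pydantic model describing the tool's single argument. A minimal sketch of what it might look like (the actual definition lives inside codeinterpreter-api and may differ):

from pydantic import BaseModel

class CodeInput(BaseModel):
    # The only argument the "python" tool accepts: the code string to execute.
    code: str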

Handling Input Text and Files#

Since Prompt Engineering is an external step, users should prepare the prompt before passing it in. Therefore, this step does not do much work: it simply appends a note about the incoming files (for example, the files the user wants analyzed) to the request text and uploads them to the CodeBox instance for later execution.

class UserRequest(HumanMessage):
    files: list[File] = []

    def __str__(self):
        return self.content

    def __repr__(self):
        return f"UserRequest(content={self.content}, files={self.files})"

def _input_handler(self, request: UserRequest) -> None:
	"""Callback function to handle user input."""
	if not request.files:
		return
	if not request.content:
		request.content = (
			"I uploaded, just text me back and confirm that you got the file(s)."
		)
	request.content += "\n**The user uploaded the following files: **\n"
	for file in request.files:
		self.input_files.append(file)
		request.content += f"[Attachment: {file.name}]\n"
		self.codebox.upload(file.name, file.content)
	request.content += "**File(s) are now available in the cwd. **\n"

Execution and Result Handling#

Through the Agent Executor, we can already achieve automatic conversion from prompt to code. Let's see how this code executes:

def _connect(self) -> None:
	response = requests.post(
		f"{self.kernel_url}/kernels",
		headers={"Content-Type": "application/json"},
		timeout=90,
	)
	self.kernel_id = response.json()["id"]
	if self.kernel_id is None:
		raise Exception("Could not start kernel")

	self.ws = ws_connect_sync(f"{self.ws_url}/kernels/{self.kernel_id}/channels")

First, we need to connect to a specific kernel via websocket,

self.ws.send(
	json.dumps(
		{
			"header": {
				"msg_id": (msg_id := uuid4().hex),
				"msg_type": "execute_request",
			},
			"content": {
				"code": code,
				# ...
			},
			# ...
		}
	)
)

Then, we send the code to the kernel for execution via websocket,

while True:
    # ...
	if (
		received_msg["header"]["msg_type"] == "stream"
		and received_msg["parent_header"]["msg_id"] == msg_id
	):
		msg = received_msg["content"]["text"].strip()
		if "Requirement already satisfied:" in msg:
			continue
		result += msg + "\n"
		if settings.VERBOSE:
			print("Output:\n", result)

	elif (
		received_msg["header"]["msg_type"] == "execute_result"
		and received_msg["parent_header"]["msg_id"] == msg_id
	):
		result += received_msg["content"]["data"]["text/plain"].strip() + "\n"
		if settings.VERBOSE:
			print("Output:\n", result)

	elif received_msg["header"]["msg_type"] == "display_data":
		if "image/png" in received_msg["content"]["data"]:
			return CodeBoxOutput(
				type="image/png",
				content=received_msg["content"]["data"]["image/png"],
			)
		if "text/plain" in received_msg["content"]["data"]:
			return CodeBoxOutput(
				type="text",
				content=received_msg["content"]["data"]["text/plain"],
			)
		return CodeBoxOutput(
			type="error",
			content="Could not parse output",
		)

Next, we handle a series of return messages in the channel:

  • msg_type: stream: if the message text contains "Requirement already satisfied:" (pip installation noise), skip it; otherwise append it to the output and keep waiting for further websocket messages.
  • msg_type: execute_result: append msg["content"]["data"]["text/plain"] to the output and keep waiting.
  • msg_type: display_data: inspect msg["content"]["data"]. If it contains image/png, return it wrapped in a CodeBoxOutput; if it contains text/plain, return that similarly; otherwise return an error-type output indicating the result could not be parsed.
  • msg_type: status with execution_state: idle: the code executed successfully but produced no further output.
  • msg_type: error: report the error directly. (These last two branches are not shown in the excerpt above; a sketch follows below.)
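A sketch of how those last two branches might look, following the Jupyter messaging protocol (the field names come from the protocol; the exact codebox-api implementation may differ):

	elif (
		received_msg["header"]["msg_type"] == "status"
		and received_msg["parent_header"]["msg_id"] == msg_id
		and received_msg["content"]["execution_state"] == "idle"
	):
		# The kernel went back to idle: execution finished, return what was collected.
		return CodeBoxOutput(
			type="text",
			content=result or "code run successfully (no output)",
		)

	elif (
		received_msg["header"]["msg_type"] == "error"
		and received_msg["parent_header"]["msg_id"] == msg_id
	):
		# Jupyter error messages carry ename/evalue/traceback fields.
		error = f"{received_msg['content']['ename']}: {received_msg['content']['evalue']}"
		return CodeBoxOutput(type="error", content=error)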

Output Results#

After obtaining the CodeBoxOutput above, we can process the output. Text-type outputs need no additional processing, since they were already emitted via stdout during execution. File-system operations happen directly inside Jupyter, so the framework does not need to handle them either; the files are simply saved locally, unless a description or explanation of an output file is needed. For image-type outputs, the returned image's base64 content is saved in the session's out_files during execution; when the final output is processed, it is converted to a standard Python Image and displayed using IPython's display method.

from io import BytesIO
from PIL import Image

def get_image(self):
    # ...
	img_io = BytesIO(self.content)
	img = Image.open(img_io)

	# Convert image to RGB if it's not
	if img.mode not in ("RGB", "L"):  # L is for greyscale images
		img = img.convert("RGB")

	return img

def show_image(self):
	img = self.get_image()
	# Display the image
	try:
		# Try to get the IPython shell if available.
		shell = get_ipython().__class__.__name__  # type: ignore
		# If the shell is in a Jupyter notebook or similar.
		if shell == "ZMQInteractiveShell" or shell == "Shell":
			from IPython.display import display  # type: ignore

			display(img)
		else:
			img.show()
	except NameError:
		img.show()

Considerations for Internal Implementation#

If we want to implement Code Interpreter internally in a company, here are some points we can focus on:

  1. Service-oriented or solution-oriented
    1. Provide basic execution capabilities or components for existing platform modules.
    2. Provide relevant solutions to teams that need to implement Code Interpreter internally.
  2. Integrate internal models.
  3. Integrate internal systems and environments to achieve automatic API calls.
  4. Support an open technology stack that is not limited to LangChain.

Final Thoughts#

This post only briefly covered a Code Interpreter implementation from a general process and design perspective; many small but important details were not discussed, such as how to render output in a web environment rather than locally, how to feed execution errors back to the model so it can regenerate the code, and how to automatically install missing dependency packages and re-execute. These are all essential for a robust Code Interpreter. The community's solutions are still MVP versions, and their handling of edge cases falls short of what practical production use requires. There is still a long way to go to a complete, production-ready Code Interpreter, and we need to get our hands a bit dirtier.
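As one example of those details, here is a hypothetical helper for the auto-install case: detect a ModuleNotFoundError in the execution output, pip-install the missing package inside the sandbox, and re-run. The codebox argument stands in for any object with a run() method, like the CodeBox used above:

import re

def run_with_auto_install(codebox, code: str, max_retries: int = 2):
    """Run code in the sandbox; on ModuleNotFoundError, install the package and retry."""
    output = codebox.run(code)
    for _ in range(max_retries):
        match = re.search(r"No module named '([\w\.]+)'", str(output))
        if not match:
            break
        # Install the missing top-level package inside the kernel, then re-run the code.
        codebox.run(f"!pip install -q {match.group(1).split('.')[0]}")
        output = codebox.run(code)
    return output

Details like this are exactly where the gap between an MVP and a production-ready implementation shows up.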
