OmniFill: Domain-Agnostic Form Filling Suggestions Using
Multi-Faceted Context
Timothy J. Aveni
tja@berkeley.edu
University of California, Berkeley
Berkeley, California, USA
Armando Fox
fox@berkeley.edu
University of California, Berkeley
Berkeley, California, USA
Björn Hartmann
bjoern@berkeley.edu
University of California, Berkeley
Berkeley, California, USA
Figure 1: OmniFill offers domain-agnostic suggestions for Web forms using a multi-faceted prompt to a large language model.
ABSTRACT
Predictive suggestion systems offer contextually-relevant text entry completions. Existing approaches, like autofill, often excel in narrowly-defined domains but fail to generalize to arbitrary workflows. We introduce a conceptual framework to analyze the compound demands of a particular suggestion context, yielding unique opportunities for large language models (LLMs) to infer suggestions for a wide range of domain-agnostic form-filling tasks that were out of reach with prior approaches. We explore these opportunities in OmniFill, a prototype that collects multi-faceted context including browsing and text entry activity to construct an LLM prompt that offers suggestions in situ for arbitrary structured text entry interfaces. Through a user study with 18 participants, we found that OmniFill offered valuable suggestions and we identified four themes that characterize users' behavior and attitudes: an "opportunistic scrapbooking" approach; a trust placed in the system; value in partial success; and a need for visibility into prompt context.
CCS CONCEPTS
Human-centered computing → User interface design; Web-based interaction; Natural language interfaces; Text input.
KEYWORDS
large language models, intelligent user interfaces, form filling,
context-awareness, Web automation
1 INTRODUCTION
A signicant but tedious part of information work consists of taking
textual information from one source, and manually moving it into
arXiv Preprint, October 2023
structured online forms - either through direct copy and paste, or
through reformatting or rewriting. Examples include extracting
information from a purchase receipt into a reimbursement form
(see Figure 1), adding events from the Web to a personal calendar,
or transforming the format of existing form eld values in-place. In
these tasks, users are asked to laboriously serve as human “glue,
often between dierent siloed systems. Prior research has identied
this problem and proposed several approaches to oer contextually-
relevant form completions [
7
,
19
,
37
,
42
]. One limitation of these
prior approaches is that they often only apply in narrowly dened
domains. In particular, the choice of implementation approach,
e.g., programming-by-demonstration [
28
] or document mining [
9
]
enables some tasks, but precludes others.
In this paper, we first contribute a framework for characterizing the form filling task. We describe four major dimensions: information demands, operation complexity, structure variability, and task specification. This analysis yields novel opportunities for LLMs to produce form filling suggestions for a wide range of tasks that were out of reach with prior approaches. In particular, we posit that LLMs are suitable for operating on a multi-faceted "bag of context" that can contain prior demonstrations (as in programming by demonstration), as well as explicitly marked content (text flagged by the user as important) and implicitly collected content (history such as search queries and user behavior in the form).
We explore this research hypothesis with OmniFill, a new prototype system that collects multi-faceted context including browsing and text entry activity; constructs an LLM prompt with that content; and then offers suggestions in situ for arbitrary structured form interfaces. We demonstrate the promise of this approach through several example applications, and we examine utility through a study with 18 participants. Through an analysis of users' behaviors
and attitudes during the user study tasks, we propose considera-
tions that system designers should keep in mind when building
this type of domain-agnostic suggestion system for use in the real
world.
2 RELATED WORK
The most closely related prior work falls into three areas: 1) automating interaction with Web pages in general; 2) form-filling interfaces in particular; and 3) other predictive text tools. In addition, OmniFill is related to a rapidly growing set of user interfaces that use LLMs as backends. We discuss each in turn.
2.1 Automating Interactions with Web Pages
Research has investigated several ways to automate tedious or repetitive interactions in Web pages beyond suggesting or filling forms.
One line of work uses end-user programming to record demonstrations of Web page interactions, generalize these demonstrations into a program in a suitable DSL, and then let others replay this program later in a related but different context. Key tasks enabled by programming-by-demonstration have been sharing multi-step processes (e.g., Koala [28], later CoScripter [25]) and Web scraping (e.g., Ringer [5], Rousillon [14]). A source of complexity in these systems is that Web pages may be interactive, their structure may be ill-formed, and they may change over time; more recent work has succeeded in using natural language processing techniques to make task specifications more robust and flexible (e.g. DiLogics [35]).
Some other approaches to automation rely on pixel-based reverse engineering of rendered interfaces (Prefab [16], Sikuli [44]), applying computer vision techniques to derive interface structure. The combination of DOM and visual features can surpass the limitations of these individual approaches [24].
OmniFill does not address general Web automation, focusing instead only on the narrower problem of suggesting text in Web form fields. However, it can draw on broader context than just prior demonstrations. While demonstrations are part of the context considered by OmniFill, the system also takes into account other sources of context explicitly and implicitly collected (e.g., text identified as important by the user), and world knowledge (as captured in the pre-trained LLM).
2.2 Automatic Form Filling
OmniFill is inspired by prior mixed-initiative systems such as LookOut [20] and Citrine [37], which also seek to reduce the tedium of manually completing forms with information that already exists in some other format. LookOut extracted information from email to pre-populate calendar events. Citrine parsed users' copied text into typed fields using hand-constructed parsers for frequent content such as addresses, and could then auto-complete different address fields based on recognizing the structure of the form from prior demonstrations. Related interaction techniques such as Entity Quick Click also rely on entity recognition in copied text to accelerate copy and paste tasks [8]. OmniFill extends this work by using an LLM and multiple sources of context to broaden the applicability of Citrine's approach without requiring manually authored parsers for each type of data.
A number of technical approaches have been proposed to extract information from a larger document or corpus for the purpose of form filling, e.g. hidden Markov models [12] or discriminative context-free grammars [40]. Prior work using ML approaches has constructed direct mappings between observed user context and form fields using NLP techniques (e.g. [19]) or constructed domain-specific models of form dependencies (e.g. [7]). Because foundation language models, as used in OmniFill, can operate across many tasks and domains [10], they hold the promise of potentially obviating domain-specific recognizers and expanding task specifications beyond field-by-field extractions of suggestions from context.
2.3 Predictive Text Completion
Form lling is a special case of the larger problem of predictive text
completion which seeks to predict, given some context of existing
text, what text a user is likely to enter next.
Predictive Text Completion has existed for a long time in code
editors [
39
], search query interfaces [
6
,
30
], and predictive key-
boards [
23
]. More recently, language models have found widespread
use completion suggestions in email composition [
18
], and for larger
code chunks [
31
]. In addition to research on the underlying tech-
nologies, researchers are also studying the impact of the use of
predictive text interfaces on productivity [
33
,
46
], on the content of
text being produced [
4
], and on users’ perceptions of their tasks [
21
].
While we restrict our focus in this paper to form lling, our con-
ceptual framework and our multi-faceted context structure can
potentially be applicable for analyzing a broader set of predictive
tasks.
2.4 Novel Interfaces Enabled by LLMs
Researchers are increasingly investigating the utility of pre-trained language models for enabling new interactions. One key distinction is between systems that use LLMs to enable natural language based input on the one hand, and systems that use LLMs as an enabling implementation technology for novel direct manipulation or other types of interactions on the other.
As examples of the first category, Wang et al. show that LLMs are promising for enabling conversational interactions with mobile user interfaces [41], such as screen summarization and screen question asking. Stylette [22] enables re-design of Web pages through natural language commands. In the second category, user interactions are translated into appropriate prompts to a language model "behind the scenes" to offer better performance or novel capabilities beyond previous algorithmic approaches. Examples are TaleBrush [15], which allows users to sketch story arcs to condition the co-creation of stories with LLMs, and SayCan [3], which uses LLMs to map real-world robot affordances to relevant situations.
OmniFill does not directly expose text interactions with the underlying LLM, instead offering a user-facing interaction closer to traditional browser autofill. Still, the use of an LLM as implementation technology enables the system to perform many natural language tasks that are implicitly used in form filling, such as extracting entities from context and making use of form field labels.
3 TASK DIMENSIONS
A form-lling task is a related series of operations intended to
gather, process, and enter information into a form. Form-lling
tasks can manifest in many shapes, and each task has its own
multidimensional requirements necessary of a system that is able
to automatically suggest or complete eld values for the task.
To assist in articulating the space of tasks that may be handled
eectively by an LLM-backed system, we consider four relevant
dimensions to describe form-lling tasks: information demands,
operation complexity, structure variability, and task specication.
3.1 Information demands
What information is being placed into the form? Where does it come
from? In a predictive system, what information may be transcribed,
transformed, or used as a retrieval key for the system’s current
suggestion or completion?
We identify six broad classes of such information.
Historical user behavior
What information has the user entered into the form or similar forms in the past? Example: browsers offer dropdown suggestions for recognized fields, even in previously-unseen forms.
Explicitly-foregrounded information
Users may explicitly call attention to information outside the target form interface. Example: importing a CSV into a system, or the "select, then copy" operation of a clipboard.
Implicit browsing context
User activity may provide additional context relevant to the task even if not explicitly foregrounded. Example: scrolling through the subject lines of an email inbox.
Current form state
Some form filling operations make use of information already present in the form. Example: creating a username based on the values of "First name" and "Last name" fields.
External general knowledge or language knowledge
Some information inserted into forms comes from the surrounding world, especially in combination with other information sources. Example: after typing "Paris" in a "City" field, a system may offer "France" for the "Country" field.
User-specific external knowledge
User-specific information may come from sources other than form-filling activity or recent browsing context. Example: a system may offer completions based on the data from the user's phone contacts.
3.2 Operation complexity
What transformations must be applied to inputted information by either the user or a predictive system making field suggestions?
Exact transcription
An entire input is transcribed verbatim into the output field. Example: a user copies a URL into their clipboard, and their mobile browser address bar offers to "paste" that full URL.
Literal extraction
Information (in any of the forms described in Section 3.1) is transcribed verbatim into the target form but must first be extracted from some larger source.
Format transformation
Information must be transformed into a different format. Example: a task requires a name to be converted from all-caps (as it appears in the source) to title case (as required by the target form).
Quantitative transformation
The task demands arithmetic operations to be performed on available information before it is inserted into the target form.
Semantic transformation
The task requires a transformation of the available information that produces an output that is syntactically and structurally distinct from the input information but still semantically related. Example: upon viewing an email reading, "I don't eat meat, but I can eat animal products like cheese", the user might write "Vegetarian" into the target form to match the format of other form responses.
3.3 Structure variability
When the task consists of multiple instances of a form being filled, how does the task vary between form fills? Once the task is well-specified, how flexible does that specification need to be to handle the source and target structures?
Fixed source structure
Does information used to derive field values come from the same place, in the same format, from case to case? An example of a task with fixed information source structure is one in which information is sourced from a spreadsheet and each row of the sheet is used to construct a submission in the target form.
Varied source structure
An example task with varied information source structure is the information-gathering task from our user study, in which users repeatedly fill the same form but with information from a differently-structured website for each submission of the target form.
Fixed target structure
Is information always inserted into the same field of the same form, or can each submission to a form take a different structure? Any task involving repeated submissions to the same form has a fixed target field structure.
Varied target structure
An example of a task with varied target field structure is filling out many distinct job applications; although the information being filled is mostly the same from submission to submission, the form structure (and in some cases, the demanded information format) changes each time a form is filled.
3.4 Task specification
What information is available that specifies the process of completing the task? If a user is completing the task manually, how much understanding of the task might be gleaned by someone looking over their shoulder? These specifications are distinct from the information demands described in Section 3.1, which refer to information that may be actually entered into the form (potentially after a transformation) or used to reference other information to be entered into the form.
Exact transcription: task context "$63.14"; form field "Total price:"; expected field value "$63.14".
Literal extraction: task context "...Subtotal: $57.93\nTotal: $63.14\nThank you for shoppi..."; form field "Total price:"; expected field value "$63.14".
Format transformation: task context "TACO SPOT"; form field "Vendor:"; expected field value "Taco Spot".
Quantitative transformation: task context "4x burrito, 2x taco"; form field "Purchase quantity:"; expected field value "6".
Semantic transformation: task context "4x burrito, 2x taco"; form field "Expense type:"; expected field value "food".
Figure 2: Illustrations of various categories along the operation complexity dimension.
Implicit specification
Task specifications may be implied by the structure of collected context or target form. For example, a user who copies an address and visits a form with an address field likely intends to insert that address into the target form. This structural information may take a machine-readable form (as with the autocomplete attribute in HTML) or natural language forms (as with text description labels on fields).
Instructions
Tasks can be specified with form-filling instructions, present in, for example, the form or browsing context. Instructions may be precise and machine-readable, as with tasks specified using code or macro software. They may also come in the form of natural language, which may leave gaps in the specification.
Examples
Many tasks can be approximately specified using a number of examples, as demonstrated in the programming-by-example literature. When prior examples of user behavior are available, approximate task specifications can be inferred and realized in synthesized code. Examples may also help to clarify ambiguities in vague instruction-based specifications.
3.4.1 Specification visibility and mutability. In a predictive system offering form-filling suggestions, allowing the user to see and manipulate the specification can enable refinement to improve future suggestions. This can happen by manually constructing or deleting examples, or by modifying instructions, when they are present and in a form that the user is comfortable manipulating. Many programming-by-example techniques produce a readable or manipulable code specification as an artifact, rather than keeping the specification as a hidden implementation detail [13, 14].
Specication renement need not be driven by the user complet-
ing the task. If predictive systems can detect incomplete specica-
tions or present the user with anomalous saved examples, they can
take initiative in asking the user to clarify the task specication
(as discussed in [
20
,
29
]), just as another person looking over the
user’s shoulder might.
4 TOOL CAPABILITIES
To situate OmniFill among prior form-filling tools, we analyze the capabilities of existing tools with respect to the space of tasks they are capable of handling, as shown in Figure 3.
First, we consider typical Web browser autofill behavior, which matches form fields to prior values entered into similar fields by the user. Then, we compare to LookOut [20], a tool for Microsoft Outlook that offers automated scheduling services based on the contents of users' emails. By comparing Citrine, an "intelligent copy-and-paste" tool [37], we introduce a tool that can detect information from certain known schemas in flexible information sources. Then we consider DiLogics [35], a programming-by-demonstration tool that enables automated form filling from an imported CSV and synthesizes an explicit specification from existing examples.
Implementation approaches in this prior work constrain the types of task that can be performed. First, each approach offers a structure that privileges some forms of information over others, rather than allowing information sources to vary based on the task demand. Second, prior approaches cannot perform semantic transformations without domain-specific processors in place. Many prior systems are also locked to repeating source or target structures, rather than supporting tasks with flexible needs. Systems that require heavy specification, though valuable for full automation, cannot quickly make lightweight inferences of approximate details or implicit specifications in flexible contexts.
Pre-trained foundation language models may be uniquely suitable to address these shortcomings: as long as relevant context information can be brought into the system as text, different types of information can be mixed in a single prompt. LLMs also offer semantic transformation capabilities, and the world knowledge captured by models during their pre-training may give them the power to adapt to arbitrary tasks seen by users. We therefore built OmniFill with an LLM backend, seeking to cover many possible information demands, a range of operation complexity types, flexible source and target structure, and flexibly-defined task specifications. Not every task is well-suited to LLMs (e.g. because of limited context window size or poor arithmetic skills), but even these shortcomings may be ameliorated using targeted techniques developed by AI researchers [27, 36, 43].
Information demands. Browser autofill: Historical user behavior, Current form state. LookOut: Implicit browsing context. Citrine: Explicitly-foregrounded information. DiLogics: Historical user behavior, Explicitly-foregrounded information. OmniFill: Historical user behavior, Explicitly-foregrounded information, Implicit browsing context, Current form state, External general knowledge or language knowledge.
Operation complexity. Browser autofill: Exact transcription. LookOut: Exact transcription, Literal extraction, Format transformation. Citrine: Exact transcription, Literal extraction, Format transformation. DiLogics: Exact transcription, Literal extraction, Format transformation. OmniFill: Exact transcription, Literal extraction, Format transformation, Semantic transformation.
Structure variability. Browser autofill: Fixed source structure, varied target structure. LookOut: Fixed source structure, fixed target structure. Citrine: Varied source structure (only in known schemas), varied target structure. DiLogics: Fixed source structure (data import), varied target structure. OmniFill: Varied source structure, varied target structure.
Task specification. Browser autofill: N/A (fixed to tasks handled by the system). LookOut: N/A (fixed to tasks handled by the system). Citrine: N/A (fixed to tasks handled by the system). DiLogics: Instructions (natural language), Synthesized instructions (code), Examples, Visible/mutable specification. OmniFill: Implicit specification (natural language), Instructions (natural language), Examples.
Figure 3: Task requirements supported by browser autofill, LookOut, Citrine, DiLogics, and OmniFill.
Figure 4: OmniFill's browser extension. Users can select text on the Web to add to their scrapbook, then invoke the system to obtain suggestions for any Web form. A sidebar on the right of the browser controls the scrapbook, invocation, and configuration.
5 INTERACTION DESIGN AND
IMPLEMENTATION
5.1 System architecture
OmniFill consists of two local components, a browser extension and a local server, both implemented in TypeScript, and a connection to a remote LLM server with a "chat completion" generation API.
Together, these components enable a streamlined interaction flow, depicted in the system screenshots in Figure 4, offering users low-friction interactions to foreground relevant context, invoke OmniFill's backend to offer completion suggestions, and show the system examples of the form-filling task. This all occurs in situ, requiring no task-specific configuration and minimal interaction outside the context sources and target form already in use.
Figure 5: OmniFill's underlying architecture, including the browser extension, local server, and remote API server. The browser extension comprises a sidebar (scrapbook view, action buttons, and configuration options), a content script (form field detection, Google search detection, context selection, suggestion presentation, and detection of auto-saved examples), and a background script. The local server is the source of truth for the bag of context and handles persistence, an operating system text entry bridge, LLM prompt generation, and a request cache. The LLM API server is a remote gpt-3.5-turbo host.
5.1.1 Browser extension. OmniFill includes a sidebar as part of its Web browser extension, which contains a preview of the contents of the "scrapbook." The scrapbook contains browsing context observed by OmniFill, including Google searches, text from Web pages added to the scrapbook, and any text manually added through the sidebar's add-to-scrapbook button (which presents a free-form text field that users can paste or type text into). Users can delete individual scrapbook entries from the scrapbook but not view the full scrapbook contents or make in-place updates to collected context.
The sidebar also includes two action buttons: “Suggest with
OmniFill”, which invokes the LLM to make predictive suggestions
for the currently-focused Web page, and “Save example”, which
captures the current scrapbook contents as well as the current
page’s form structure to save into the “Prior examples” section of
OmniFill’s bag of context, then clears the scrapbook in preparation
for the next form-filling example.
The sidebar also contains configuration options such as OmniFill's automatic invocation mode (which can update suggestions when the scrapbook changes or form fields are modified) and its automatic example-saving feature, which attempts to detect form submissions and saves examples automatically.
OmniFill’s browser extension also injects JavaScript into each
visited Web page, responsible for collecting context from the Web
and from user activity. Users may select text while holding the Alt
(or Option) key on their keyboard, which adds the selected text
to OmniFill’s bag of context. In addition, Google search queries
are automatically added to the context. The injected script also
synchronizes the current page’s form eld structure, including
initial eld values and updates made by the user to the elds, to the
rest of the system. Form elds have their names inferred through
a variety of methods; rst, OmniFill attempts to read the elds’
accessible names (e.g.
aria-label
s), falling back to nearby visible
text if necessary. If the automatic example-saving feature is enabled
for a particular site, this injected script listens for form submissions
or clicks on buttons labeled “Save or “Submit” and invokes the
example-saving routine. This technique may not generalize to all
websites, so users may need to click the “Save example” button
manually before making form submissions; in our user study, this
technique was sucient to save examples automatically.
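As a rough illustration of this name-inference fallback, a content-script helper along the following lines could check the accessible name first and then nearby text; the function name and the specific fallbacks (an associated label element, then preceding text) are assumptions for the sketch rather than OmniFill's exact heuristics.

```typescript
// Illustrative sketch only: infer a human-readable name for a form field.
// The fallback order (aria-label, associated <label>, nearby text) is an assumption.
function inferFieldName(field: HTMLInputElement | HTMLTextAreaElement): string {
  // Prefer an explicit accessible name.
  const ariaName = field.getAttribute("aria-label")?.trim();
  if (ariaName) return ariaName;

  // A <label for="..."> element also supplies an accessible name.
  if (field.id) {
    const label = document.querySelector(`label[for="${field.id}"]`);
    const labelText = label?.textContent?.trim();
    if (labelText) return labelText;
  }

  // Fall back to nearby visible text, truncated to keep the prompt compact.
  const nearby =
    field.previousElementSibling?.textContent ?? field.parentElement?.textContent ?? "";
  return nearby.trim().slice(0, 80) || field.name || field.id || "unnamed field";
}
```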
When OmniFill has suggestions for the user, the injected script seeks out the associated form fields in the page by their inferred name, highlights them with a purple outline, and offers a small suggestion box when focusing fields whose suggested value differs from the current value.
5.1.2 Local server. OmniFill's browser extension coordinates with a local server program running on the computer. This separate server simulates key presses when OmniFill needs to type into a text field. In addition, this server acts as the single source of truth for OmniFill's current bag of context, and this is persisted to the user's computer in case the browser or server is closed. When OmniFill is ready to make a request to the LLM, this local server generates a prompt, computing the number of tokens used by the prompt and pruning examples and scrapbook contents if necessary until the prompt is sufficiently short to fit in the model's context window. Responses are cached by the local server so that only unique requests are made to the LLM API.
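A minimal sketch of this pruning step appears below; the character-based token estimate and the pruning order (oldest prior examples first, then oldest scrapbook entries) are illustrative assumptions, not OmniFill's exact policy.

```typescript
// Illustrative sketch only: prune the serialized context to fit a token budget.
interface PromptParts {
  priorExamples: string[]; // serialized example exchanges, oldest first
  scrapbook: string[];     // serialized scrapbook entries, oldest first
  formState: string;       // serialized current form structure and edits
}

// Rough estimate; a real tokenizer for the target model would be used in practice.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function pruneToFit(parts: PromptParts, maxTokens: number): PromptParts {
  const total = () =>
    estimateTokens([...parts.priorExamples, ...parts.scrapbook, parts.formState].join("\n"));

  // Drop the oldest prior examples first, then the oldest scrapbook entries,
  // until the prompt fits in the model's context window.
  while (total() > maxTokens && parts.priorExamples.length > 0) parts.priorExamples.shift();
  while (total() > maxTokens && parts.scrapbook.length > 0) parts.scrapbook.shift();
  return parts;
}
```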
5.1.3 LLM API. OmniFill makes requests to OpenAI's gpt-3.5-turbo-0613 model, choosing between the 4,096-token context length model and the 16,384-token model based on the size of the prompt. The model is queried at temperature 0, offering nearly deterministic results for a particular prompt. Responses are piped from the local server back to the browser extension, which presents any suggestions back to the user.
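A request along these lines could be issued with the official openai Node SDK as sketched below; the helper name, the threshold check, and the exact 16k model identifier are assumptions made for illustration rather than a transcript of OmniFill's code.

```typescript
// Illustrative sketch only: issue the chat completion request at temperature 0.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

async function requestSuggestions(messages: ChatMessage[], promptTokens: number): Promise<string> {
  // Use the 16k-context variant only when the prompt exceeds the base model's window.
  const model = promptTokens > 4096 ? "gpt-3.5-turbo-16k-0613" : "gpt-3.5-turbo-0613";
  const completion = await openai.chat.completions.create({
    model,
    messages,
    temperature: 0, // near-deterministic output for a given prompt
  });
  return completion.choices[0].message.content ?? "";
}
```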
5.2 Serializing the "bag of context" into an LLM prompt
To allow the system to make useful inferences in varied situations, OmniFill constructs a prompt containing three main facets of task context (visualized in Figure 6):
Current browsing context, representing context explicitly inserted into OmniFill's scrapbook by the user as well as implicitly-observed Google searches.
Form structure state, including the inferred name for each form field, the initial value of each form field, and any edits made by the user to each form field's value.
Prior examples saved by the user, each including browsing context, form structure, and the intended final form state before saving the form.
This prompt structure is designed to support many broad tasks without requiring task-specific configuration. Some tasks may primarily use only one or two of these context facets, and others may combine information from all three. Depending on the task, these context facets may all contain information demands of the task, and they may all contain information that helps OmniFill infer and carry out a task specification, so all three facets are included in each prompt to the LLM.
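One way to picture the bag of context is as a small set of TypeScript types, one per facet; the type and field names below are illustrative assumptions rather than OmniFill's actual data model.

```typescript
// Illustrative sketch only: the three facets of the bag of context as types.
interface ScrapbookEntry {
  type: "text selection" | "google search" | "manual entry";
  pageTitle?: string;
  text: string;
}

interface FormFieldState {
  name: string;         // inferred field name
  initialValue: string; // value at page load or at the last form structure change
  userEdit?: string;    // edit made by the user since then, if any
}

interface SavedExample {
  scrapbook: ScrapbookEntry[];         // browsing context at the time of the example
  fields: FormFieldState[];            // form structure at the time of the example
  finalValues: Record<string, string>; // form state when the example was saved
}

interface BagOfContext {
  scrapbook: ScrapbookEntry[];   // current browsing context
  fields: FormFieldState[];      // current form structure state
  priorExamples: SavedExample[]; // previously saved form fills
}
```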
5.2.1 Current browsing context. All information present in the "scrapbook" in OmniFill's sidebar will be included in this section of the prompt, referred to internally as "user action context". This includes a description of Google searches made by the user, context added manually through the sidebar's "add context" button, and context added by selecting text on Web pages. When text is selected on Web pages, 500 characters before and after the user's selection are also included in the prompt (separated, so that the prompt includes information about which text was actually selected). In addition, the page's title (as shown in the browser tab header) is included.

Prior examples (each saved example becomes one user message and one assistant message):
User message:
{
  "user action context": [ ... ],
  "template with initial values": { ... },
  "changes made by user so far": { ... }
}
Fill in the template JSON, using...
Assistant message (the final field values saved by the user), e.g.:
{
  "Vendor name": "Hawaiian Airlines",
  "Amount spent": "$401.13",
  "Requester's department": "CS",
  "Expense type": "travel",
  "Total purchase quantity": "1"
}

Current form fill context (final user message):
{
  "user action context": [
    {
      "type": "text selection",
      "page title": "Your Receipt for TACO SPOT",
      "before": "...Subtotal$57.93\nTa",
      "selection": "x$5.21\nTotal$63.14\n\nCre",
      "after": "dit Card..."
    }
  ],
  "template with initial values": {
    "Vendor name": "",
    "Amount spent": "",
    "Requester's department": "",
    "Expense type": "",
    "Total purchase quantity": ""
  },
  "changes made by user so far": {
    "Vendor name": "Taco Spot"
  }
}
Fill in the template JSON, using the provided context as appropriate. Return a JSON object with every key from the template. If a field should be updated, include a string with the new value. Otherwise, include the boolean false value. The blank template is as follows:
{
  "Vendor name": "",
  "Amount spent": "",
  "Requester's department": "",
  "Expense type": "",
  "Total purchase quantity": ""
}

Model response:
{ ... }

Figure 6: The structure of the prompt generated by a single invocation of OmniFill for gpt-3.5-turbo. Some details are omitted from prior examples. In addition, OmniFill expands selection context by 500 characters in each direction. The prompt serializes the current browsing context (Section 5.2.1), the form structure state (Section 5.2.2), and the output format instruction (Section 5.2.4).
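The selection entries shown in Figure 6 could be assembled roughly as follows; the 500-character expansion matches the paper's description, while the helper and its signature are assumptions for illustration.

```typescript
// Illustrative sketch only: turn a text selection into a scrapbook entry.
const CONTEXT_RADIUS = 500;

interface SelectionEntry {
  type: "text selection";
  pageTitle: string;
  before: string;    // up to 500 characters preceding the selection
  selection: string; // the text the user actually selected
  after: string;     // up to 500 characters following the selection
}

function makeSelectionEntry(pageText: string, start: number, end: number): SelectionEntry {
  return {
    type: "text selection",
    pageTitle: document.title,
    before: pageText.slice(Math.max(0, start - CONTEXT_RADIUS), start),
    selection: pageText.slice(start, end),
    after: pageText.slice(end, end + CONTEXT_RADIUS),
  };
}
```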
5.2.2 Form structure state. For the LLM to return results in a usable format, the prompt needs to contain a full list of the text fields on the target Web page. The prompt also includes the initial value of each field and a list of the field updates that the user has made since those initial values were retrieved; this provides context on how the user has engaged with the form so far, which can, for example, allow the system to make suggestions for the second half of a form after the user has given the system indications of their intent through edits made to the first half of the form. The "initial" value of each field is computed either at page load time or the time the form structure last changed, whichever came later; this enables the prompt to "start fresh" in situations where form submission does not reload the page but instead adds new fields to the page (e.g. when clicking a button to add a new row to a spreadsheet-style form).
In practice, we discovered that, if some fields had already been updated by the user, the system sometimes would not offer suggestions for the remainder of the form, instead making only suggestions that affirmed what the user had already input. In the OmniFill prototype, we make two parallel requests to the model: one request as described in this section, and a secondary request with any current user edits suppressed (but the initial field values still present), using responses from the second prompt only as a fallback when a field did not have a suggestion from the first prompt. In the information-gathering task of our user study (described in Section 6.1), 38 of 418 (= 9%) model completions had, by this metric, at least one field suggestion that would have been left blank or unchanged if not for this secondary parallel request.
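The fallback merge can be sketched as below; the request callbacks and naming are placeholders, and only the merge rule (secondary responses fill gaps left by the primary request) reflects the behavior described above.

```typescript
// Illustrative sketch only: merge the primary request (with user edits) and the secondary
// request (edits suppressed), using the latter only where the former made no suggestion.
type Suggestions = Record<string, string | false>;

async function suggestWithFallback(
  requestPrimary: () => Promise<Suggestions>,  // prompt including the user's current edits
  requestSecondary: () => Promise<Suggestions> // same prompt with current edits suppressed
): Promise<Suggestions> {
  // Both requests are issued in parallel.
  const [primary, secondary] = await Promise.all([requestPrimary(), requestSecondary()]);
  const merged: Suggestions = { ...primary };
  for (const [field, value] of Object.entries(secondary)) {
    // The secondary response only fills gaps left by the primary response.
    if (merged[field] === false || merged[field] === undefined) merged[field] = value;
  }
  return merged;
}
```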
5.2.3 Prior examples. Because gpt-3.5-turbo is a "chat completion" model, we can model prior examples as previous "exchanges" with the model by including past prompts (as described in the prior two subsections) as a "user message" and the desired output of the LLM (the example's final form state, as saved by the user) as an "assistant response", including final values for each field in the response. The prompt does not include actual prior model responses for previous requests, only the final field values saved by the user in the "assistant response". By saving examples, especially when filling the same form repeatedly, users can demonstrate their process to OmniFill. Even when OmniFill does not successfully extract meaningful names from the fields in the target form, we have observed that a small number of examples can suffice to begin specifying the form-filling task.
5.2.4 Output format. A simple structure for the requested output format (which is duplicated in the prompt's "assistant message" for each example) is a JSON object with a key for every field in the target Web form and a value equaling the system's suggestion for that field. By requesting suggestions for the entire form in one query (rather than requesting each field's suggestion separately, in parallel), the system can produce internally-consistent suggestions, which may be valuable in some tasks where there are multiple reasonable completions for the full form.
After some experimentation, we settled on an output format that asks the model to respond with a JSON object that includes every key, but to include string values only for the fields that should have their values updated from their current values. Otherwise, the JSON value for this field should be a false Boolean value.
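Under this format, the extension-side parsing reduces to keeping only the string-valued keys; the sketch below assumes a parseSuggestions helper and treats malformed replies as yielding no suggestions.

```typescript
// Illustrative sketch only: keep only string-valued keys from the model's JSON reply;
// false (or a missing key) means the field should be left unchanged.
function parseSuggestions(reply: string, fieldNames: string[]): Map<string, string> {
  const suggestions = new Map<string, string>();
  let parsed: Record<string, unknown>;
  try {
    parsed = JSON.parse(reply);
  } catch {
    return suggestions; // malformed JSON: offer no suggestions rather than guessing
  }
  for (const name of fieldNames) {
    const value = parsed[name];
    if (typeof value === "string") suggestions.set(name, value);
  }
  return suggestions;
}
```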
Figure 7: In the information-gathering task, information from school websites is inserted into the target form, which includes fields for school and district name, principal name, grade levels served (e.g. 9-12), total enrollment count, address, and phone number.
By keeping each key in the JSON object (as opposed to asking the model to return only the keys needed), we observed a lower risk of the model prematurely closing the JSON object in tasks where the model needed to make many suggestions; this serves as a rudimentary "chain-of-thought" prompt [43], considering each key individually before making a decision to finish the response. By removing the requirement to transcribe many fields exactly, we prevent a distracting, noisy set of output examples from obfuscating the true task being demonstrated in those examples.
We have found that, given the prompt as described in this section, OmniFill rarely suggests overeager fills for form fields that (given the bag of context) don't need to be updated in the tasks for which we have tried the system. In our user study's information-gathering task, for example, OmniFill rarely provided a suggestion for a field whose value was not discernible from browsing context, and in the data-formatting task, OmniFill typically only offered suggestions for the fields that the task asked users to update. The extent to which OmniFill is conservative in this way is highly dependent on the prompt; for example, altering the prompt in Figure 6 to include the sentence, "You must provide a useful suggestion for every field, even if you aren't sure." causes the model to offer a suggestion for every field.
We believe that, although LLMs can be powerful engines for solving tasks in a domain-agnostic way, tuning prompts to improve performance on some tasks without hindering performance on other tasks requires careful planning. To better manage the uncertainty of "herding AI cats" [45], practitioners should regression-test changes to systems' prompt structures on a wide range of sample tasks that require capabilities across all task dimensions (as described in Section 3) and context facets.
6 USER STUDY
We conducted one-hour sessions with 18 participants to judge impressions of OmniFill. Participants, aged 21-30 (mean 24), spoke English and were located in the United States. Four participants identified as female, twelve identified as male, and two did not disclose their gender. Participants completed a consent form and were compensated with a $20 gift card for their time.
Phone number formats encountered: (###) ###-####, ##########, ###-###-####, ### #######, ### ### ####, +1 (###) ###-####, +1##########, +1 ### ### ####, 1-(###)###-####, +1-###-###-####; target format: (###) ###-####.
T-shirt size formats encountered: S, xs, 3XL, med, LARGE, lrg, ex small; fixed options: XS, S, M, L, XL, XXL, XXXL.
Figure 8: The form and required transformations for the data-formatting task.
6.1 Information-gathering task
Participants joined a Zoom call and completed two distinct tasks with OmniFill through a remote virtual desktop, with a demonstration of the system incorporated into the first task. In the first task, participants were instructed to "forage" information [34] from websites of schools, inputting that information into a structured form (an instance of EspoCRM [2]), shown in Figure 7. First, participants were directed to the website of a public high school and asked to spend a few minutes filling the form manually, without using OmniFill. For this test website, participants who took longer than a few minutes were offered assistance locating the form data.
Then, participants were shown a demonstration on a different school's website, using OmniFill to fill the same form. In this demo, participants were shown how to use the Alt+select interaction to add context to their scrapbook, then shown how to generate and accept suggestions for the form. Participants were shown that selections can be made approximately, and that it was possible (but not necessary) to make multiple distinct selections before returning to the target form. Participants were shown how to add context to the scrapbook manually, for situations where the Alt+select interaction failed (e.g. when viewing PDFs). In the demonstration, participants were shown and reminded that form inputs may be typed manually and need not come from OmniFill suggestions. We also checked OmniFill's "Automatically save examples for this site" checkbox so that the form's "Save" button would trigger the current form inputs to be saved as an example.
After the demonstration, participants were asked to practice using the system by again filling the form with data from the website they had already manually collected information from. Then, participants were given approximately 15 minutes to perform the task freeform, finding websites for other schools and filling out the form, one school at a time. To encourage participants to find websites with varying structures, we asked them to research schools from different school districts. We also asked participants not to spend more than a few minutes on any one website and indicated that it was okay to leave some fields blank if they didn't think they would find the information.
6.2 Data formatting task
The data-formatting task required OmniFill to recognize patterns
when editing forms with prepopulated data.
Participants were shown a mock "human resources" website (an instance of Admidio [1]), pre-filled with fake company membership data. We prepopulated every user profile with a T-shirt size; most profiles (45 of 51) contained a mobile number, and some (14 of 51) contained a home phone number. The string formats of the shirt size and phone number were randomly chosen for each member profile, as depicted in Figure 8. Each participant received the same mock company data, but the order of the data was randomized so that OmniFill would behave differently for each participant. No participant saw more than 28 of the 51 total member profiles during the ten minutes allotted to perform the task.
Participants were shown a series of instructions for their task, asking them to update each phone number in the member profile (if present) to a fixed format and to update each T-shirt size to one of a few fixed options (depicted in Figure 8). These instructions were shown in a screenshot so that participants could not Alt+select them, since we wanted to investigate OmniFill's pattern recognition; although participants could have manually typed the instructions into the scrapbook, none did. Then, we instructed participants to visit each member profile in order, updating the phone number and T-shirt fields following these instructions, with OmniFill automatically saving examples each time the form was saved. We also instructed participants to click the "Suggest with OmniFill" button in the extension sidebar and wait for its response on each profile before making any edits, testing the system's ability to recognize the task specification and offer the correct edits as suggestions. We advised participants that they were not required to accept OmniFill's suggestions or use its suggestions unchanged.
6.3 Interview
After completing both tasks, participants were asked briefly about their experience in a conversational interview, including questions about OmniFill's perceived utility, concerns about real-world use, confidence in the system's accuracy, and confidence that they would notice OmniFill's mistakes.
7 STUDY RESULTS
We rst discuss high-level impressions for the two tasks, then in-
troduce four important lenses through which we consider users’
behavior; opportunistic scrapbooking, trust in OmniFill’s sugges-
tions, value in partial success, and a need for visibility into prompt
context.
7.1 Impressions
Participants readily used OmniFill in the information-gathering task, accepting OmniFill's suggestions for almost all form fields in each school contact they saved; seventeen participants, when asked, said they would likely use OmniFill if they had to perform the information-gathering task in real life. Although suggestions for this task largely used information from the current browsing context portion of the prompt, even this task was able to benefit from OmniFill's multi-faceted prompt. For example, participants who typed schools' country name as "United States" tended to receive this as a suggestion, whereas participants who preferred "USA" often received the suggestion in this format, indicating that OmniFill was making use of prior examples in its prompt.
In the data-formatting task, OmniFill proved more successful for some participants than for others, and when the system did eventually succeed more consistently, many examples were often required for this to settle. Figure 9 visualizes OmniFill's success suggesting values for the phone number fields in particular. Each row represents a participant, and each box represents a completion request sent through the system. Only the first request is shown per member profile, even if participants invoked the system more than once, and we do not show requests for profiles where no phone number change needed to be made. In this figure, the box is shown in green only when both phone number fields are suggested correctly (although in many cases there was only one phone number, so only one change needed to be made). Cases where users accepted an incorrect suggestion or rejected a correct suggestion (choosing instead to type the phone number manually) are also shown.
After conducting the study, we also called the GPT-4 API [32] using the same prompts as those used in the study to observe system performance with the more powerful model, shown on the right side of Figure 9. Because GPT-4 has a smaller context window (8,192 tokens) than gpt-3.5-turbo-16k, some prompts could not be run for this analysis; those are depicted with gray stripes. In addition to the context window size, real-world trade-offs of using GPT-4 for OmniFill include cost and latency; for these reasons and to better understand users' behavior with a less-effective model, we used GPT-3.5 for our studies.
7.2 Opportunistic scrapbooking
Compared to the meticulous copying-and-pasting (or memorizing-then-typing) strategy participants employed to fill out the information-gathering task's contact form before being introduced to OmniFill, participants' information-foraging behavior during the free-form component of this task was approximate and opportunistic. We observed participants collecting information for OmniFill's bag of context quickly and even "haphazardly," as one participant put it, often with little regard for the later extraction step.
7.2.1 "Collect-then-fill" strategy. In the manual portion of the information-gathering task, participants all alternated frequently between the information source website and the target interface (the contact form), copying or transcribing information in small increments. Because the contact form's "address" field was split into five text fields ("Address", "City", "State", "Postal Code", and "Country"), two participants constructed an ad hoc "scrapbook" inside the "Address" field, pasting a full line (address, city, state, and postal code) into the interface and then manually organizing it piece-by-piece into the appropriate fields. Still, all participants were able to search only for small portions of the total information demand at a time before needing to insert the collected data into the contact form.
Figure 9: OmniFill's success in suggesting phone number formats in the data-formatting task, compared to those same prompts run against the GPT-4 API. Prompts that could not be run with GPT-4 due to context window restrictions are shown as "invalid". Each row represents one participant's experience in the task, showing suggestions left-to-right as the task progressed.
Figure 10: Opportunistic scrapbooking involves the collection of many values at once (multi-value collection), approximate relevance judgments (approximate targeting), and overcollection. OmniFill often extracts appropriate information even from opportunistically-built scrapbooks.
However, in the second phase of the information-gathering task, during which participants were equipped with OmniFill's scrapbook, all but one participant eventually transitioned to a "collect-then-fill" strategy, in which multiple information demands were collected into the scrapbook without returning to the contact form
in between. This was true both of information that could be col-
lected in a single selection (e.g. an entire address and phone number
from one selection on a “Contact Us” page) and information from
dierent pages, which was collected in multiple text selections
often without returning to the contact form in between. In most
cases, participants did revisit the contact form before collecting all
of the demanded information, but after collecting multiple pieces of
“low-hanging fruit”. Often, the last few straggling pieces of missing
information were entered in smaller increments as the participants
hunted them down.
Participants often lost track of which information they had collected already, and in some cases even expressed surprise when they saw how much information OmniFill had extracted: "wow, that already knocks out a lot."
7.2.2 Overcollection. Since it is OmniFill, and not the user, who needs to "read through" context in the scrapbook and extract relevant information, the main cost of adding new information to the bag of context occurs at the time of collection, and additional cost is largely not incurred even if the user's judgment of relevance turns out to be a false positive. This distinguishes OmniFill from simple "multiple copy-and-paste" tools, which still require precision in foraging to prevent an overload of information to parse. Participants frequently collected context with imprecise text selections (compared to the exact selection required for a copy-and-paste) or occasional duplicate information from different sources (e.g. collecting the phone number from the "Contact Us" page and from the website's footer). Some participants suggested obviating manual collection entirely, but our own early prototype testing showed that OmniFill's success was significantly diminished when collecting all browsing context indiscriminately, without manual foregrounding of information the user finds relevant.
By lowering the barriers for participants to "overcollect" information, the system was able in some cases to correct errors, even transparently. For example, one participant collected the wrong phone number for the school (collecting instead the phone number of the school's Title IX coordinator), and OmniFill generated a suggestion with this phone number. Before this participant returned to the contact form, they found another page that contained additional information they were looking for, alongside the correct phone number for the school. When the participant quickly selected the entire information block, including the correct phone number, OmniFill corrected its phone number suggestion, now that there were two phone numbers to choose from (and sufficient natural language information in its bag of context to determine which of the two was more appropriate), without the participant ever even seeing the incorrect suggestion.
Overcollection may also help when OmniFill makes an incorrect inference based on information not in the context, e.g. when incorrectly assuming a particular elementary school's served grade levels are K-5 or a high school's are 9-12. One participant accepted one such incorrect suggestion and, upon later finding the school's enrollment total on a "school profile", collected the entire paragraph as context; since the school profile also included the true "grade levels served" field value "K-6", OmniFill updated its suggestion. This was not a transparent correction, since the participant had already accepted the "K-5" suggestion, but after noticing the change, they updated the information in the contact form.
7.3 Trust in OmniFill’s suggestions
Because OmniFill assumes the role of locating relevant information in its bag of context during the information-gathering task, users who want to collect context opportunistically and avoid keeping track of the bag of context must place trust in OmniFill's inferences. We observed that users often trusted OmniFill's suggestions, even when that trust may have been unfounded.
The contact form contained a "grade levels served" field, which should be filled with e.g. "9-12" for a typical American high school. When websites did not explicitly include this information, some participants were reluctant to make assumptions for the value of this field. Although OmniFill often withheld suggestions unless the relevant information was explicitly present in its bag of context (for example, we never observed the street address field being incorrectly populated), the model did often return with a suggestion for the grade level field without the context including this information; this happened at least once for 15 of 18 participants.
It was rare, however, for participants to question this information when OmniFill suggested it; only one participant did explicitly reject this overeager grade level completion, and that participant had selected only a very small amount of text that they could immediately observe did not contain the grade level information.
In a flipped example, one participant, who had filled out most of the contact form, was struggling to find the name of the school's principal. They chose to select the name of an assistant principal for their scrapbook. Only after heading back to the contact form and observing that OmniFill did not offer a suggestion for the "Principal" field did the participant give up on finding the information and move on to the next school.
Although the data-formatting task contained fewer opportunities for participants to rely on OmniFill to make "judgment calls", we still observed two instances of participants accepting and saving OmniFill's incorrect suggestion. For each of these two participants, OmniFill made the strange suggestion in one case to duplicate a correctly-formatted phone number into a blank phone number field. Although both participants had already correctly handled this case in the past, they chose to accept this suggestion from OmniFill; one said aloud, "Oh, yeah, sure. Why not?" as they did this, indicating a willingness to allow OmniFill to take the reins in deciding and carrying out the task specification.
7.4 Partial success
OmniFill was not perfect, occasionally visibly failing to extract
explicitly-foregrounded information in the information-gathering
task or oering incorrect suggestions (or no suggestions at all) for
the data-formatting task.
Still, this partial success oered a value-add in both tasks. In
the information-gathering task, participants could always fall back
to traditional information-foraging techniques to nd data that
OmniFill failed to extract. In the data-formatting task, OmniFill’s
success rate on oering correct T-Shirt sizes was high (across the 18
participants, OmniFill oered a correct suggestion in 306 of the 347
cases (= 88%) where the T-Shirt size needed to be updated), so par-
ticipants were often presented with at least one correct suggestion
even when the system couldn’t get every eld right.
Even when the phone number suggestion was incorrect, we
found that participants often accepted the incorrect suggestion and
then made changes to the field, rather than editing the original
text field value. Of the 16 participants who received a suggestion
to change a phone number from one incorrect format into another
incorrect format, 15 accepted one of these suggestions at least once
(and 9 did this multiple times); Figure 9 indicates incorrect sugges-
tions which were accepted by the user. For example, a common
failure mode for this task was for OmniFill to format the phone
number correctly but without removing the country code (i.e.
+1 (###) ###-####) or to hyphenate a ten-digit number (transforming
########## to ###-###-####). By accepting these suggestions,
participants allowed OmniFill to get them closer to their desired
result for the form field.
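The two failure modes above can be made concrete with a small, deterministic normalizer. The sketch below is illustrative only — it is not how OmniFill works, since OmniFill relies on the LLM inferring the format from prior examples — and it assumes the task’s intended target format was (###) ###-#### with no country code:

```python
import re

def normalize_us_phone(raw: str) -> str | None:
    """Illustrative normalizer; assumes the target format is (###) ###-####."""
    digits = re.sub(r"\D", "", raw)             # keep only the digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                     # drop a leading +1 country code
    if len(digits) != 10:
        return None                             # not a recognizable US number
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

# The two failure modes described above, against this assumed target:
# "+1 (555) 123-4567" -> "(555) 123-4567"   (country code must be removed)
# "5551234567"        -> "(555) 123-4567"   (not merely "555-123-4567")
```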
Many participants noted that partial failure may not be harmless,
however. Ten participants, when asked about concerns about using
OmniFill in the real world, cited accuracy considerations. Many
noted, either during the interview or while completing the tasks,
that although they may notice “obvious” errors or those that don’t
conform to the task specification, they might not notice if OmniFill
inserted incorrect information in the information-gathering task
or changed phone numbers in the data-formatting task. Four
participants, including some concerned about OmniFill’s accuracy,
still suggested that the potential for human error in a task like
phone number formatting was high, and that they had more trust
in OmniFill to be correct.
7.5 Context visibility
7.5.1 Verifying suggestions. LLMs, especially when asked to supply
information not in their prompt context, can “hallucinate” incorrect
information [17]. Although OmniFill can in some cases offer value
to users by suggesting general world or language knowledge not
present in the model’s context window (as with its school grade
level suggestions, which were largely correct), we anticipate that,
in many situations where OmniFill offers a value-add, the information
in the suggestion is present in the bag of context (either literally or
in some pre-transformation state). Since suggestion accuracy was
often a concern for participants, offering better visibility into where
information is coming from may help.
7.5.2 Incorrect prior examples. Because OmniFill observes much
of the user’s behavior passively and has a largely immutable bag
of context, there is a risk that participants will accidentally teach
the system incorrectly, saving an example that causes problems
in the future (such as suggestions of incorrect information from
prior examples or inferences of an incorrect task specification). We
did not provide users with a mechanism for deleting incorrectly-
saved examples, and 9 of 18 participants saved at least one incorrect
example during the data-formatting task; even those who corrected
their errors were not able to delete the mistakenly-saved examples.
In two cases in the data-formatting task, we observed participants
making a mistake that was later reflected in the model’s output,
both related to formatting phone numbers when two were present
in the field. Neither mistake was caused by the participant accepting
a suggestion from OmniFill, but in both cases, when presented with
a similar profile later in the task, OmniFill made the same error.
When we later called the GPT-4 API with the prompts from the
data-formatting task, despite GPT-4’s much higher general success
rate on this task (depicted in Figure 9), both of these later errors
still manifested. Although a more powerful LLM may be able to
learn patterns and infer a task specification with few examples, this
suggests that incorrect examples of prior user behavior can still be
a critical source of ultimately incorrect predictions by the system.
8 DISCUSSION
LLM-backed systems like OmniFill can offer convenient form-filling
suggestions in a single, domain-agnostic package without requiring
significant task-specific configuration. Because of this general
applicability, we anticipate that, despite lower-precision task specifi-
cations and the risk of difficult-to-notice errors, this type of predictive
system will become increasingly integrated into daily computing,
just as search result ranking and mobile predictive keyboards have.
System designers must understand both the limitations of LLM-
backed approaches and the character of users’ interactions with
such systems. In this section, we discuss implications for privacy
and accuracy, especially through the lens of potential future direc-
tions of this work.
8.1 Privacy and security
When explicitly asked about real-world concerns about using Om-
niFill, only six of 18 participants mentioned privacy or information
security as a consideration, even though the data-formatting task
involved working with simulated personal information. We did not
explain to participants upfront that OmniFill does not run entirely
on the user’s local computer, but even users who know they are
interacting with an online system may not consistently protect the
privacy of their information [38]. Systems that allow users to ingest
other people’s personal information must be considered even more
strictly. A system like OmniFill benefits from passively observing
the user’s behavior or collecting implicit context (e.g. bringing in
text immediately before and after a user’s text selection in case their
mouse cursor “aim” was imperfect), but this requires trust in the
remote LLM API used by the system. Because OmniFill is designed
to interoperate between siloed ecosystems, privacy concerns persist
even outside of systems that are known to track user behavior.
Running general suggestion systems locally on the user’s computer
may become feasible as model architectures grow more efficient
and computers become more powerful, keeping model queries
private. Still, future work may choose to store many prior user
examples (more than what can fit into a single LLM prompt) for later
retrieval during prompt construction (as in [26]). System designers
should be careful when collecting and opaquely storing browsing
context long-term, even locally, since this practice can increase the
consequences of a system breach.
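As a rough illustration of that retrieval step (not OmniFill’s implementation), the sketch below selects the few stored examples most similar to the current browsing context before the prompt is constructed; the PriorExample structure and the token-overlap similarity are hypothetical stand-ins for whatever storage format and embedding-based retriever a real system would use.

```python
from dataclasses import dataclass

@dataclass
class PriorExample:            # hypothetical storage format
    context_snippet: str       # text the user had foregrounded at the time
    filled_values: dict        # the form values the user ultimately saved

def _tokens(text: str) -> set[str]:
    return set(text.lower().split())

def select_examples(store: list[PriorExample], current_context: str,
                    k: int = 4) -> list[PriorExample]:
    """Pick the k stored examples most similar to the current context.
    Token overlap stands in for an embedding-based retriever."""
    query = _tokens(current_context)

    def similarity(ex: PriorExample) -> float:
        toks = _tokens(ex.context_snippet)
        return len(query & toks) / max(len(query | toks), 1)

    return sorted(store, key=similarity, reverse=True)[:k]
```

Only the selected examples would then be serialized into the prompt, so the full long-term store never needs to leave the user’s machine.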
8.2 Accuracy
Even after the system appears to have “learned” a task, the lack of
a rigorous task specification can cause occasional errors (as Figure 9
demonstrates in the GPT-4 section). OmniFill is designed to assist
users in their manual tasks, not construct reliable automations.
However, “automation bias” is known to result in uncertainty
among users of automated systems [11], and a “good-enough”
automation of menial tasks may cause the user to “check out” and
stop manually verifying the results of suggestions, as reported by
some participants after completing our data-formatting task. As
Figure 9 illustrates, success can be highly dependent on the choice
of model for the task at hand. This suggests challenges for developers
building systems that are not entirely under their control.
Since users cannot inspect the LLM’s workings, it is difficult to
form a mental model of which types of errors the LLM is likely to
make, meaning even partial success of the system may have variable
utility. We observed users accepting OmniFill’s “judgment calls”
even though the system had never been told how it should make
these judgments; the “authority” of the system seemed persuasive
even in those very situations where an automated system cannot
know how to behave.
Future improvements to such systems may further reduce the
cognitive load demanded of users during certain tasks (e.g. by im-
proving implicit context collection so that less information needs
to be explicitly foregrounded), but these exact improvements may
cause users to be less likely to notice when the model makes an
error. Our results from the information-gathering task demonstrate
that the system is already powerful enough to permit users to col-
lect and use information without noticing that they had collected
it. How, then, would a user in this situation notice if the suggestion
is incorrect?
Context visibility may play a key role. Because the “collect-then-
fill” strategy recruits OmniFill’s bag of context as an auxiliary
“memory” for the user, future work in constructing interface fea-
tures could allow users to peer into the bag of context and view
context sources in situ. Since information used in suggestions is
often present in the bag of context, either in some literal or pre-
transformation form, these features could surface just the context
that matches OmniFill’s suggestions. Although a literal search
for each suggestion in the bag of context may not yield results when
the task involves transformation operations, additional prompts to
a lighter-weight LLM (or a semantic search using an embedding
model) could be used to handle many simpler cases. OmniFill’s full
LLM prompt would still be responsible for producing the actual sug-
gestions, since the full bag of context may be valuable for making
high-quality predictions (e.g. if the form structure is poorly-labeled
but can be learned through prior form-filling examples), but post
hoc “attribution” may be achieved through an auxiliary system.
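A minimal sketch of such post hoc attribution, under the assumption that the bag of context can be broken into short snippets, is shown below; it is not part of OmniFill, and the fuzzy window match merely stands in for the lighter-weight LLM call or embedding search mentioned above, with literal containment handling copy-through values.

```python
import difflib

def attribute_suggestion(suggestion: str, snippets: list[str],
                         threshold: float = 0.6) -> list[str]:
    """Return snippets from the bag of context that plausibly support a suggestion."""
    hits = []
    n = max(len(suggestion.split()), 1)
    for snippet in snippets:
        if suggestion.lower() in snippet.lower():
            hits.append(snippet)               # literal copy-through
            continue
        # Compare the suggestion against same-length word windows of the snippet,
        # catching values that were lightly transformed on the way into the form.
        words = snippet.split()
        windows = [" ".join(words[i:i + n])
                   for i in range(len(words) - n + 1)] or [snippet]
        best = max(difflib.SequenceMatcher(None, suggestion.lower(), w.lower()).ratio()
                   for w in windows)
        if best >= threshold:
            hits.append(snippet)
    return hits
```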
Although in situ training of the system can be a low-friction
way to offer predictive suggestions, systems should provide users
with the ability to view and refine their task specifications, e.g. by
curating their set of examples to maximize system accuracy. Future
work could assist users in this process by detecting and surfacing
potentially anomalous prior examples or by engaging the user in a
dialogue to define and fine-tune task specifications as the system is
used over time.
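One simple way to surface candidates for that review, sketched here under the assumption that correctly-saved values for a field share a common “shape” (as in the phone number task), is to flag any saved value whose shape disagrees with the majority; a deployed system would need a richer notion of anomaly.

```python
from collections import Counter

def shape(value: str) -> str:
    """Collapse a saved field value to a coarse shape, e.g. '(###) ###-####'."""
    return "".join("#" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def flag_anomalous(saved_values: list[str]) -> list[str]:
    """Return saved examples whose shape disagrees with the majority shape,
    so the user can review or delete them before they steer future suggestions."""
    if not saved_values:
        return []
    majority_shape, _ = Counter(shape(v) for v in saved_values).most_common(1)[0]
    return [v for v in saved_values if shape(v) != majority_shape]
```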
9 CONCLUSION
Not every task calls for full automation or an elaborate specification.
Even when task definitions are fuzzy, partial automation of
the simpler tedious components of form filling tasks can prove
valuable, and LLM-backed systems like OmniFill can serve as a
“glue” between arbitrary context sources and target forms without
heavy configuration. We demonstrate opportunities for LLM-backed
systems to assist in a unique subspace of form filling tasks, then
describe our observations of users trying the prototype. We believe
this is a rich space for future system designers to explore, but care
must be taken to understand how people perceive and use such
systems, especially in a landscape of rapidly-expanding capabilities
and expectations for artificial intelligence tools.
DISCLOSURE
The authors used GitHub Copilot v1.111.404 for code prediction in
the preparation of figure source code.
ACKNOWLEDGMENTS
We would like to thank Shm Garanganao Almeda, James Smith,
and Matthew Beaudouin-Lafon for their valuable insights that con-
tributed to the framing of this work.
REFERENCES
[1]
2023. Admidio – Free online membership management software. https://www.
admidio.org/. Accessed: 2023-09-14.
[2]
2023. EspoCRM.com: Free Self Hosted & Cloud CRM software. https://www.
espocrm.com/. Accessed: 2023-09-14.
[3]
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes,
Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus-
man, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan,
Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Joshi, Ryan
Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao
Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao,
Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan,
Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu,
Mengyuan Yan, and Andy Zeng. 2022. Do As I Can and Not As I Say: Grounding
Language in Robotic Affordances. In arXiv preprint arXiv:2204.01691.
[4]
Kenneth C Arnold, Krysta Chauncey, and Krzysztof Z Gajos. 2020. Predictive text
encourages predictable writing. In Proceedings of the 25th International Conference
on Intelligent User Interfaces. 128–138.
[5]
Shaon Barman, Sarah Chasins, Rastislav Bodik, and Sumit Gulwani. 2016. Ringer:
Web Automation by Demonstration. In Proceedings of the 2016 ACM SIGPLAN In-
ternational Conference on Object-Oriented Programming, Systems, Languages, and
Applications (Amsterdam, Netherlands) (OOPSLA 2016). Association for Comput-
ing Machinery, New York, NY, USA, 748–764. https://doi.org/10.1145/2983990.
2984020
[6]
Holger Bast and Ingmar Weber. 2006. Type Less, Find More: Fast Autocompletion
Search with a Succinct Index. In Proceedings of the 29th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval (Seattle,
Washington, USA) (SIGIR ’06). Association for Computing Machinery, New York,
NY, USA, 364–371. https://doi.org/10.1145/1148170.1148234
[7]
Hichem Belgacem, Xiaochen Li, Domenico Bianculli, and Lionel Briand. 2023.
A Machine Learning Approach for Automated Filling of Categorical Fields in
Data Entry Forms. ACM Trans. Softw. Eng. Methodol. 32, 2, Article 47 (apr 2023),
40 pages. https://doi.org/10.1145/3533021
[8]
Eric A Bier, Edward W Ishak, and Ed Chi. 2006. Entity quick click: rapid text
copying based on automatic entity extraction. In CHI’06 Extended Abstracts on
Human Factors in Computing Systems. 562–567.
[9]
Vishwanath Bijalwan, Vinay Kumar, Pinki Kumari, and Jordan Pascual. 2014. KNN
based machine learning approach for text and document mining. International
Journal of Database Theory and Application 7, 1 (2014), 61–70.
[10]
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora,
Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma
Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon,
Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Dem-
szky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John
Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren
Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori
Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu,
Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth
Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark
Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina
Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu
Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele
Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman,
Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut,
Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance,
Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong,
Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori
Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Ro-
han Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang,
Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Ji-
axuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui
Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2022. On the Opportunities
and Risks of Foundation Models. arXiv:2108.07258 [cs.LG]
[11]
Raymond R. Bond, Tomas Novotny, Irena Andrsova, Lumir Koc, Martina Sisakova,
Dewar Finlay, Daniel Guldenring, James McLaughlin, Aaron Peace, Victoria
McGilligan, Stephen J. Leslie, Hui Wang, and Marek Malik. 2018. Automation
bias in medicine: The influence of automated diagnoses on interpreter accuracy
and uncertainty when reading electrocardiograms. Journal of Electrocardiology 51,
6, Supplement (2018), S6–S11. https://doi.org/10.1016/j.jelectrocard.2018.08.007
[12]
Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. 2000. Automat-
ically extracting structure from free text addresses. IEEE Data Eng. Bull. 23, 4
(2000), 27–32.
[13]
José Cambronero, Sumit Gulwani, Vu Le, Daniel Perelman, Arjun Radhakrishna,
Clint Simon, and Ashish Tiwari. 2023. FlashFill++: Scaling Programming by Ex-
ample by Cutting to the Chase. In Principles of Programming Languages. ACM SIG-
PLAN, ACM. https://www.microsoft.com/en-us/research/publication/flashfill-
scaling-programming-by-example-by-cutting-to-the-chase/
[14]
Sarah E. Chasins, Maria Mueller, and Rastislav Bodik. 2018. Rousillon: Scrap-
ing Distributed Hierarchical Web Data. In Proceedings of the 31st Annual ACM
Symposium on User Interface Software and Technology (Berlin, Germany) (UIST
’18). Association for Computing Machinery, New York, NY, USA, 963–975.
https://doi.org/10.1145/3242587.3242661
[15]
John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar,
and Minsuk Chang. 2022. TaleBrush: Visual Sketching of Story Generation with
Pretrained Language Models. In Extended Abstracts of the 2022 CHI Conference
on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI EA ’22).
Association for Computing Machinery, New York, NY, USA, Article 172, 4 pages.
https://doi.org/10.1145/3491101.3519873
[16]
Morgan Dixon and James Fogarty. 2010. Prefab: implementing advanced behav-
iors using pixel-based reverse engineering of interface structure. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems. 1525–1534.
[17]
Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Siva Reddy. 2022. On
the Origin of Hallucinations in Conversational Models: Is it the Datasets or the
Models?. In Proceedings of the 2022 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies.
Association for Computational Linguistics, Seattle, United States, 5271–5285.
https://doi.org/10.18653/v1/2022.naacl-main.387
[18]
Steven M. Goodman, Erin Buehler, Patrick Clary, Andy Coenen, Aaron Donsbach,
Tianie N. Horne, Michal Lahav, Robert MacDonald, Rain Breaw Michaels, Ajit
Narayanan, Mahima Pushkarna, Joel Riley, Alex Santana, Lei Shi, Rachel Sweeney,
Phil Weaver, Ann Yuan, and Meredith Ringel Morris. 2022. LaMPost: Design and
Evaluation of an AI-Assisted Email Writing Prototype for Adults with Dyslexia.
In Proceedings of the 24th International ACM SIGACCESS Conference on Computers
and Accessibility (Athens, Greece) (ASSETS ’22). Association for Computing
Machinery, New York, NY, USA, Article 24, 18 pages. https://doi.org/10.1145/
3517428.3544819
[19]
Melanie Hartmann and Max Muhlhauser. 2009. Context-Aware Form Filling for
Web Applications. In 2009 IEEE International Conference on Semantic Computing.
221–228. https://doi.org/10.1109/ICSC.2009.83
[20]
Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. In Proceedings
of the SIGCHI conference on Human Factors in Computing Systems. 159–166.
[21]
Maurice Jakesch, Advait Bhat, Daniel Buschek, Lior Zalmanson, and Mor Naaman.
2023. Co-writing with opinionated language models affects users’ views. In
Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
1–15.
[22]
Tae Soo Kim, DaEun Choi, Yoonseo Choi, and Juho Kim. 2022. Stylette: Styling the
Web with Natural Language. In Proceedings of the 2022 CHI Conference on Human
Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association
for Computing Machinery, New York, NY, USA, Article 5, 17 pages. https:
//doi.org/10.1145/3491102.3501931
[23]
Heidi Horstmann Koester and Simon P Levine. 1994. Modeling the speed of
text entry with a word prediction interface. IEEE transactions on rehabilitation
engineering 2, 3 (1994), 177–187.
[24]
Ranjitha Kumar, Arvind Satyanarayan, Cesar Torres, Maxine Lim, Salman Ahmad,
Scott R. Klemmer, and Jerry O. Talton. 2013. Webzeitgeist: Design Mining the
Web. In ACM Human Factors in Computing Systems (CHI). https://doi.org/10.
1145/2470654.2466420
[25]
Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. 2008. CoScripter:
Automating & Sharing How-to Knowledge in the Enterprise. In Proceedings of
the SIGCHI Conference on Human Factors in Computing Systems (Florence, Italy)
(CHI ’08). Association for Computing Machinery, New York, NY, USA, 1719–1728.
https://doi.org/10.1145/1357054.1357323
[26]
Itay Levy, Ben Bogin, and Jonathan Berant. 2023. Diverse Demonstrations Im-
prove In-context Compositional Generalization. arXiv:2212.06800 [cs.CL]
[27]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin,
Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel,
Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL]
[28]
Greg Little, Tessa A Lau, Allen Cypher, James Lin, Eben M Haber, and Eser
Kandogan. 2007. Koala: capture, share, automate, personalize business processes
on the web. In Proceedings of the SIGCHI conference on Human factors in computing
systems. 943–946.
[29]
Mikaël Mayer, Gustavo Soares, Maxim Grechkin, Vu Le, Mark Marron, Oleksandr
Polozov, Rishabh Singh, Benjamin Zorn, and Sumit Gulwani. 2015. User Inter-
action Models for Disambiguation in Programming by Example. In Proceedings
of the 28th Annual ACM Symposium on User Interface Software & Technology
(Charlotte, NC, USA) (UIST ’15). Association for Computing Machinery, New
York, NY, USA, 291–301. https://doi.org/10.1145/2807442.2807459
[30]
Arnab Nandi and HV Jagadish. 2007. Assisted querying using instant-response
interfaces. In Proceedings of the 2007 ACM SIGMOD international conference on
Management of data. 1156–1158.
[31]
Nhan Nguyen and Sarah Nadi. 2022. An empirical evaluation of GitHub copilot’s
code suggestions. In Proceedings of the 19th International Conference on Mining
Software Repositories. 1–5.
[32] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[33]
Kseniia Palin, Anna Maria Feit, Sunjun Kim, Per Ola Kristensson, and Antti
Oulasvirta. 2019. How do people type on mobile devices? Observations from a
study with 37,000 volunteers. In Proceedings of the 21st International Conference
on Human-Computer Interaction with Mobile Devices and Services. 1–12.
[34]
Peter Pirolli and Stuart Card. 1999. Information foraging. Psychological review
106, 4 (1999), 643.
[35]
Kevin Pu, Jim Yang, Angel Yuan, Minyi Ma, Rui Dong, Xinyu Wang, Yan Chen,
and Tovi Grossman. 2023. DiLogics: Creating Web Automation Programs With
Diverse Logics. arXiv preprint arXiv:2308.05828. To appear: UIST 2023 (2023).
[36]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli,
Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer:
Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761 [cs.CL]
[37]
Jerey Stylos, Brad A. Myers, and Andrew Faulring. 2004. Citrine: Providing
Intelligent Copy-and-Paste. In Proceedings of the 17th Annual ACM Symposium on
User Interface Software and Technology (Santa Fe, NM, USA) (UIST ’04). Association
for Computing Machinery, New York, NY, USA, 185–188. https://doi.org/10.
1145/1029632.1029665
[38]
S. Shyam Sundar, Hyunjin Kang, Mu Wu, Eun Go, and Bo Zhang. 2013. Unlocking
the Privacy Paradox: Do Cognitive Heuristics Hold the Key?. In CHI ’13 Extended
Abstracts on Human Factors in Computing Systems (Paris, France) (CHI EA ’13).
Association for Computing Machinery, New York, NY, USA, 811–816. https:
//doi.org/10.1145/2468356.2468501
[39]
Tim Teitelbaum and Thomas Reps. 1981. The Cornell Program Synthesizer: A
Syntax-Directed Programming Environment. Commun. ACM 24, 9 (sep 1981),
563–573. https://doi.org/10.1145/358746.358755
[40]
Paul Viola and Mukund Narasimhan. 2005. Learning to Extract Information
from Semi-Structured Text Using a Discriminative Context Free Grammar. In
Proceedings of the 28th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (Salvador, Brazil) (SIGIR ’05). Association
for Computing Machinery, New York, NY, USA, 330–337. https://doi.org/10.
1145/1076034.1076091
[41]
Bryan Wang, Gang Li, and Yang Li. 2023. Enabling conversational interaction with
mobile ui using large language models. In Proceedings of the 2023 CHI Conference
on Human Factors in Computing Systems. 1–17.
[42]
Shaohua Wang, Ying Zou, Bipin Upadhyaya, and Joanna Ng. 2013. An Intelligent
Framework for Auto-filling Web Forms from Different Web Applications. In
2013 IEEE Ninth World Congress on Services. 175–179. https://doi.org/10.1109/
SERVICES.2013.19
[43]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia,
Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits
Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]
[44]
Tom Yeh, Tsung-Hsiang Chang, and Robert C. Miller. 2009. Sikuli: Using GUI
Screenshots for Search and Automation. In Proceedings of the 22nd Annual ACM
Symposium on User Interface Software and Technology (Victoria, BC, Canada)
(UIST ’09). Association for Computing Machinery, New York, NY, USA, 183–192.
https://doi.org/10.1145/1622176.1622213
[45]
J.D. Zamrescu-Pereira, Heather Wei, Amy Xiao, Kitty Gu, Grace Jung, Matthew G
Lee, Bjoern Hartmann, and Qian Yang. 2023. Herding AI Cats: Lessons from
Designing a Chatbot by Prompting GPT-3. In Proceedings of the 2023 ACM
Designing Interactive Systems Conference (Pittsburgh, PA, USA) (DIS ’23). As-
sociation for Computing Machinery, New York, NY, USA, 2206–2220. https:
//doi.org/10.1145/3563657.3596138
[46]
Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin,
Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity
Assessment of Neural Code Completion. In Proceedings of the 6th ACM SIGPLAN
International Symposium on Machine Programming (San Diego, CA, USA) (MAPS
2022). Association for Computing Machinery, New York, NY, USA, 21–29. https:
//doi.org/10.1145/3520312.3534864