Integration Testing Thousands of Websites with Playwright

As I’ve recounted and discussed in previous posts, one of the hardest problems Harper faces is the great diversity of the internet. There is a great variety of text editors on the web, each used in a different context. That’s a beautiful thing, but unfortunately our users expect Harper to work in all of these places seamlessly. I suppose you can view this post as a third part in this series, where I talk about supporting thousands of websites for potentially millions of users (with zero server costs, I might add).

This journey started in Discord (as many do), with a satisfied user reporting an issue. I’m quite fortunate: they were technical. Their report was detailed and included some initial speculation on what the root cause could be.

A small part of a larger conversation about the problem at hand.

While the ac­tual un­der­ly­ing prob­lem was com­plex and dif­fi­cult to fix, that is not what this post is about. It is about Harper’s strat­egy for do­ing end-to-end test­ing on the many sites we sup­port.

It’s rel­e­vant be­cause this was a prob­lem that could have been dis­cov­ered through end-to-end test­ing. Since we have a di­verse set of users al­ready, they found the is­sue swiftly. That is far from ideal. Our test­ing suite should catch these prob­lems be­fore the PR is merged.

Why Playwright?

“Playwright enables reliable end-to-end testing for modern web apps.” At least, that’s what their site claims Playwright can do. I’m not sure it lives up to this claim of reliability, at least not yet.

The de­ci­sion to use Playwright over al­ter­na­tive choices came down to a few key points:

  • It’s quite pol­ished and well sup­ported (Microsoft seems to be the main player).
  • While it is more com­plex to load a Chrome ex­ten­sion in Playwright than Puppeteer, I am also given a lot more con­trol.
  • We’re al­ready us­ing it for other in­te­gra­tions (but not for end-to-end tests).

The Game Plan

My goal is to build up a com­pre­hen­sive-enough test suite that I can catch in­te­gra­tion prob­lems in for­eign text ed­i­tors be­fore I merge PRs for logic that in­ter­acts with them.

Step one was to get Playwright installed and running on my machine, reproducible with pnpm. Fortunately for me, this was as simple as running pnpm create playwright. Step two was a little more complex: get our extension loaded within the headless browser. I found this could be done by overriding Playwright’s default browser context and passing Chrome flags that load the Harper extension and nothing else:

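import path from 'node:path';
import { test as base, chromium, type BrowserContext } from '@playwright/test';

export const test = base.extend<{
	context: BrowserContext;
	extensionId: string;
}>({
	// biome-ignore lint/correctness/noEmptyPattern: it's by Playwright. Explanation not provided.
	context: async ({}, use) => {
		// Path to the built extension, relative to this file.
		const pathToExtension = path.join(import.meta.dirname, '../build');
		console.log(`Loading extension from ${pathToExtension}`);
		// Extensions need a persistent context; '' makes Playwright use a temporary profile directory.
		const context = await chromium.launchPersistentContext('', {
			channel: 'chromium',
			args: [
				`--disable-extensions-except=${pathToExtension}`,
				`--load-extension=${pathToExtension}`,
			],
		});
		await use(context);
		await context.close();
	},
	extensionId: async ({ context }, use) => {
		// The extension's service worker may not have started yet, so wait for it if needed.
		let [background] = context.serviceWorkers();
		if (!background) background = await context.waitForEvent('serviceworker');

		// The worker URL looks like chrome-extension://<id>/..., so the ID is the third segment.
		const extensionId = background.url().split('/')[2];
		await use(extensionId);
	},
});
export const expect = test.expect;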

From there, it was pretty triv­ial to build out as­ser­tions and tools for in­ter­act­ing with ba­sic el­e­ments for the spe­cific text ed­i­tor I was in­ter­ested in (Slate).
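
A helper for this kind of interaction might look roughly like the sketch below. It is not Harper's actual test code: typeIntoSlate, PageLike, and LocatorLike are hypothetical names, and the two interfaces are declared structurally so the snippet stands alone. A real Playwright Page should satisfy PageLike, since Page.locator, Locator.click, and Locator.pressSequentially all exist with compatible shapes.

```typescript
// Minimal structural types so this sketch compiles without Playwright installed.
// Playwright's Page and Locator provide methods with these names.
interface LocatorLike {
	click(): Promise<void>;
	pressSequentially(text: string): Promise<void>;
}

interface PageLike {
	locator(selector: string): LocatorLike;
}

// Slate marks its editable root with this attribute.
const SLATE_SELECTOR = '[data-slate-editor="true"]';

// Hypothetical helper: focus the Slate editor and type text one key at a time,
// the way a user would, so the editor's own input handlers run.
async function typeIntoSlate(page: PageLike, text: string): Promise<LocatorLike> {
	const editor = page.locator(SLATE_SELECTOR);
	await editor.click();
	await editor.pressSequentially(text);
	return editor;
}
```

Typing key by key is slower than setting the text in one call, but it is closer to what a user does, which is the behavior an extension like Harper has to cope with.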

The Cool Part

Most text ed­i­tors on the web ad­ver­tise them­selves in the DOM, usu­ally with a spe­cial at­tribute like data-lexical-editor="true" or data-slate-editor="true". This even hap­pens on world-class sites like LinkedIn or Instagram. I won­der if I can use this for some­thing?

I believe this consistency in production code is intentional. Making our tests easier to write must be a side effect of making the editor authors’ own tests easier to write.

This is great news for me. With just a few tweaks, I can use the same code to test Harper on Discord, Medium, Notion, Desmos, Asana—you get the point. Since they use just a small set of rich text ed­i­tors (which come pre-tagged), I can gen­er­ate au­to­mated tests to de­ter­mine whether Harper works prop­erly on their sites.
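
One way to picture "generating" these tests is a table of sites that expands into one test case each. The sketch below is only an illustration under assumptions: the Site and TestCase shapes, buildTestCases, and the example.com URLs are all hypothetical, and which real site uses which editor is deliberately left out. The only details taken from above are the two data attributes.

```typescript
// Editor frameworks and the DOM attribute each one advertises.
const EDITOR_SELECTORS = {
	slate: '[data-slate-editor="true"]',
	lexical: '[data-lexical-editor="true"]',
} as const;

type EditorKind = keyof typeof EDITOR_SELECTORS;

interface Site {
	name: string;
	url: string; // placeholder URLs below, not real test targets
	editor: EditorKind;
}

interface TestCase {
	title: string;
	url: string;
	selector: string;
}

// Expand a list of sites into one test case per site.
function buildTestCases(sites: Site[]): TestCase[] {
	return sites.map((site) => ({
		title: `Harper works in the ${site.editor} editor on ${site.name}`,
		url: site.url,
		selector: EDITOR_SELECTORS[site.editor],
	}));
}

// Hypothetical entries for illustration only.
const cases = buildTestCases([
	{ name: 'Example A', url: 'https://example.com/a', editor: 'slate' },
	{ name: 'Example B', url: 'https://example.com/b', editor: 'lexical' },
]);

for (const c of cases) {
	console.log(`${c.title} -> ${c.selector}`);
}
```

In a Playwright spec file, each entry could then be registered in a loop, calling test(c.title, ...) with a body that navigates to c.url and locates c.selector. Adding a site becomes a one-line change to the table.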

The Not-So-Cool Part

The bad news: this process is slow. Each page must be fetched from the network and operated on like a user. We might be able to fix the first problem, but the second is unavoidable. My initial experiments put the runtime of each test case at around thirty seconds. If I’m testing hundreds or thousands of sites, this is a real problem.
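
To put rough numbers on that, here is a back-of-the-envelope calculation. It assumes every test takes the same thirty seconds and that tests split evenly across parallel workers. The worker count of 8 is an arbitrary assumption, and the function names are made up for this sketch.

```typescript
// Estimate wall-clock time for a suite, assuming every test takes the same
// time and the tests divide evenly across parallel workers.
function estimateSuiteSeconds(siteCount: number, secondsPerTest: number, workers: number): number {
	if (workers < 1) throw new Error('workers must be at least 1');
	// Each worker runs ceil(siteCount / workers) tests back to back.
	return Math.ceil(siteCount / workers) * secondsPerTest;
}

// Render a number of seconds as hours, minutes, and seconds.
function formatDuration(totalSeconds: number): string {
	const hours = Math.floor(totalSeconds / 3600);
	const minutes = Math.floor((totalSeconds % 3600) / 60);
	const seconds = totalSeconds % 60;
	return `${hours}h ${minutes}m ${seconds}s`;
}

console.log(formatDuration(estimateSuiteSeconds(100, 30, 1))); // prints "0h 50m 0s"
console.log(formatDuration(estimateSuiteSeconds(1000, 30, 1))); // prints "8h 20m 0s"
console.log(formatDuration(estimateSuiteSeconds(1000, 30, 8))); // prints "1h 2m 30s"
```

Run serially, a thousand sites would take more than eight hours. Parallel workers help, but even then the suite is far too slow to run on every commit.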

For the time be­ing, I’ll have these tests be their own work­flow in GitHub Actions and only run them when the ex­ten­sion code changes. I don’t see peo­ple other than my­self mess­ing with this code too much, so I am not wor­ried.

What Does Testing Thousands of Sites Look Like?

I’ll admit, I’m not quite in “thousands of sites” territory just yet. The most frustrating part is getting started. Now that I’ve got a couple of sites under my belt, with the tools ready and able, the next thousand will be a lot easier.