Show HN: Autotab – An AI-powered Chrome extension to create Selenium scripts (autotab.com)
376 points by jonasnelle on Oct 19, 2023 | 126 comments
Autotab is a Chrome extension that writes Selenium code to mirror your actions as you navigate the browser. See it in action: https://youtu.be/UypAcozIaoo

Autotab lets you create browser automations that actually work. We designed it around two principles:

    1. Show, don’t tell: In a domain like web automation, it's often easier to *show* the model what you want rather than to explain it in sentences.

    2. Code is the best output: Code is easy to inspect and enables manual tweaking of the model’s suggested actions. On top of that, code output avoids lock in and is straightforward to extend and integrate with larger projects.
Autotab runs as a Chrome extension. As you navigate in the browser, autotab generates the Selenium code to reproduce your actions. You can copy that code into your own project or use our starter GitHub repo to get your automation up and running in <5 minutes: https://github.com/Planetary-Computers/autotab-starter.

We'd love to hear what you think!




Selenium project creator here...

Very cool! I totally expect testing to be one of the killer apps for AI. And this is an easy way to get there.

How do you ensure the AI isn't hallucinating any details in the generated code?


That's awesome! We couldn't have built any of this without Selenium, definitely standing on the shoulders of giants :)

Right now we have the basic sanity check that the Selenium xpath selects exactly one element in the DOM. Going forward I think the best way to do this is to have a live preview where you can see the mirrored browser copy your last action and debug if it errors or isn't what you wanted.
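
For anyone curious what the "exactly one element" sanity check could look like, here is a minimal sketch. It is not Autotab's code. It assumes a Selenium-style driver whose `find_elements(by, value)` returns a list; the helper name `selects_exactly_one` and the `FakeDriver` stand-in are made up so the example runs without a browser.

```python
def selects_exactly_one(driver, xpath):
    """Return True if `xpath` matches exactly one element on the current page."""
    # "xpath" is the string value behind Selenium's By.XPATH constant.
    matches = driver.find_elements("xpath", xpath)
    return len(matches) == 1


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver (no browser needed)."""

    def __init__(self, elements_by_xpath):
        self._elements = elements_by_xpath

    def find_elements(self, by, value):
        # Return the canned matches for this XPath, or an empty list.
        return self._elements.get(value, [])


driver = FakeDriver({
    "//button[@id='buy']": ["<button>"],
    "//li": ["<li>", "<li>"],
})
print(selects_exactly_one(driver, "//button[@id='buy']"))  # one match
print(selects_exactly_one(driver, "//li"))                 # two matches, ambiguous
```

A selector that matches zero or several elements is rejected, which catches both hallucinated paths and ambiguous ones.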


> and debug if it errors or isn't what you wanted.

That is a potential problem, because at this point the user will still need good HTML/Selenium/coding knowledge to debug. We ran into the same issue using ChatGPT-generated scripts for ui.vision. Once a QA person has to figure out why some generated code does not work, it becomes a hassle and removes any initial advantage over the classic record & replay approach.


I agree, to really debug they need coding knowledge today.

But there's something in between where you can say "try again" or even give high level feedback like "don't click the element with that exact title, just click whichever one is first in the list". My working hypothesis is that this category of fixes is bigger than the one where you need coding knowledge. Certainly this category will only grow as language models become better.


Can you have a feedback loop, for example trying the test in the background (headless) and if it doesn't work altering paths and retrying?


Yes! Two things we’re working on that would be very cool are 1) seeing the mirrored actions executed live as you record and 2) even if the website changes later, using AI at runtime to auto-heal the automation.
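
One simple version of the retry idea asked about above is to keep several candidate selectors and fall back when the recorded one stops matching. This is only a sketch under that assumption, not how Autotab heals scripts. The function name, the candidate XPaths, and the `page` dictionary are all hypothetical.

```python
def first_working_xpath(count_matches, candidates):
    """Return the first candidate XPath that matches exactly one element, else None.

    `count_matches` is any callable mapping an XPath string to a match count,
    e.g. lambda xp: len(driver.find_elements("xpath", xp)) with Selenium.
    """
    for xpath in candidates:
        if count_matches(xpath) == 1:
            return xpath
    return None


# Hypothetical page state: the recorded id is gone, a text-based selector still works.
page = {
    "//button[@id='submit-42']": 0,
    "//button[normalize-space()='Submit']": 1,
}
candidates = [
    "//button[@id='submit-42']",
    "//button[normalize-space()='Submit']",
]
print(first_working_xpath(lambda xp: page.get(xp, 0), candidates))
```

When nothing matches, the function returns `None`, which is the point where a model could be asked for a new selector.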


Cool! Good luck, it would help me with my work so I’m rooting for you :)


This is amazing. I will try to have it automate my system of agents web app in a meaningful way (turtles all the way down.) Shameless plug: https://github.com/agi-merge/waggle-dance

BTW I don’t normally use Chrome- I know why you would start with that, but it would be great to see a Firefox or even Safari version of this.


I checked out waggledance.ai - looks awesome what it can do and really interesting to me!

My question would be what does waggledance do that a GPT4 reply would not provide besides the (really well done) UI?

In my experience the way the tasks are generated and broken down is pretty much what I get when I ask ChatGPT.


Glad to hear it, let me know how it goes! And yes, other browsers are definitely in our future at some point


I got stuck with an `autotab record` error after following the README setup instructions:

selenium.common.exceptions.WebDriverException: Message: Service /opt/homebrew/bin/chromedriver unexpectedly exited. Status code was: -9


Hm, that unfortunately sounds like a Chromium/ChromeDriver error, which can be somewhat finicky. I recommend checking that the versions you're using match (`chromedriver --version` and at chrome://settings/help). Let me know how it goes and happy to help debug further (though GPT4 is probably more helpful than I will be)
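
If you want to script that version check, a rough sketch could compare the major version numbers of the two strings. It assumes the major version is what needs to line up, and the sample strings below are made up in the shape the two tools print. `major_version` and `versions_match` are hypothetical helper names.

```python
import re


def major_version(version_text):
    """Extract the leading major version number from a version string."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chromedriver_output, chrome_version):
    """True when ChromeDriver and Chrome share the same major version."""
    return major_version(chromedriver_output) == major_version(chrome_version)


# Illustrative strings only, not real output from any machine.
print(versions_match("ChromeDriver 118.0.5993.70 (abcdef)", "118.0.5993.88"))
print(versions_match("ChromeDriver 117.0.5938.92 (abcdef)", "118.0.5993.88"))
```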


Wow! I haven't played with it much yet, but just wanted to say that waggledance.ai seems really, really cool. It seems like a really interesting idea, and I love the UI and general aesthetic!


Thank you! It’s been fun to build and is getting even more fun now that I am adding skills and inter-agent interactions.


This is amazing!! I've found myself writing selenium scripts to automate tasks for my dad's job (things such as getting a name from a spreadsheet, putting that name in a website's search box and from there repeating the same actions for 100s of names) and saved him a ton of time. Making browser automation more accessible by just showing the machine how to do it will definitely make lots of people's lives easier. Can't wait to mess around with it.
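
The spreadsheet-to-search-box loop described above can be sketched roughly as follows. The column name `name`, the helper names, and the search field are assumptions for illustration. The `search` callback is where the real Selenium calls would go.

```python
import csv
import io


def read_names(csv_text, column="name"):
    """Return the non-empty values of `column` from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row.get(column) or "").strip() for row in reader
            if (row.get(column) or "").strip()]


def run_for_each(names, search):
    """Call `search(name)` for every name, recording errors instead of stopping."""
    results = {}
    for name in names:
        try:
            results[name] = search(name)
        except Exception as exc:  # one bad row should not abort the whole batch
            results[name] = f"error: {exc}"
    return results


# With Selenium, `search` might look like this (the field name "q" is hypothetical):
#     box = driver.find_element("name", "q")
#     box.clear(); box.send_keys(name); box.submit()
sheet = "name\nAda Lovelace\nGrace Hopper\n"
names = read_names(sheet)
results = run_for_each(names, lambda n: f"searched for {n}")
print(results)
```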


Interesting, sounds like the kind of stuff we're excited about helping with! Would love to hear what you think once you try it, email is in my bio if you're open to a chat.


This is very cool! I want to host this on a server and deliver it using BrowserBox so you can build automations on mobile.

This was what I originally created BrowserBox for--a delivery platform for web scraping authoring tools so you're not limited by extensions--but got into building the remote browser as a product, and haven't got around to the automation authoring tool yet, haha!

Thank you for creating this, and making it MIT. AI-augmented human guided web scraping authoring is definitely the future in this space I think!

Great name, too!

autotab

Very cool!

I was already calling my up-and-coming BrowserBox SaaS Cloudtabs so there's .... "synergies" hahaha! :)

https://github.com/BrowserBox/BrowserBox


Thanks! Excited to see what you do with it

BrowserBox looks cool, I couldn’t tell from the README what browser engine it uses. Did you build your own?


Thank you, Jonas!

No, building that wouldn't give us anything. We use Chrome. Also works with anything else derived from Chromium, so that's Edge, Brave, etc.

Chrome runs headless on the server, instrumented with Chrome DevTools Protocol^0. The cool thing about using the DevTools protocol is you can customize so much. CDP is basically like a superset of the extensions APIs. In our case, one thing it's useful for is fine control over live-streaming the tabs.
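
For readers who have not used CDP: commands are JSON messages with an `id`, a `method`, and `params`, sent over a WebSocket to the browser. The sketch below only builds such a message and does not open a connection. `cdp_command` is a made-up helper name, and this is not BrowserBox's code.

```python
import itertools
import json

_ids = itertools.count(1)  # each command needs its own integer id


def cdp_command(method, **params):
    """Serialize one CDP command as a JSON string."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


message = cdp_command("Page.navigate", url="https://example.com")
print(message)
```

The browser's reply carries the same `id`, which is how a client pairs responses with the commands it sent.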

You must be super busy with everything you're doing, if you ever have any questions about anything feel free to launch an email to me at cris@dosyago.com -- I wish you the best with autotab!

0: https://chromedevtools.github.io/devtools-protocol/


I am building _exactly_ the same thing for Playwright over at https://ray.run/. I think this is the future of writing tests no doubt. Planning to launch next week.

By the looks of it, I am taking a slightly different approach than you. I am using LLM only to identify elements, but the actual generators are created using https://ray.run/browser-extension


FYI your animated background should be disabled when your visitor [prefers-reduced-motion](https://developer.mozilla.org/en-US/docs/Web/CSS/@media/pref...).

Edit: or just disabled in general cos it's super annoying


Ooh, ty! Going to ping you about this.


How is this different from the "Recorder" feature available in Chrome (Dev Tools > Recorder)?

It can record a user-journey and print out a puppeteer script.


The main difference is probably that autotab outputs a Python script, which is easier to integrate with existing Python codebases.

Going forward I think the flexibility of Python code & the Python ecosystem will make it easier to build a bunch more things we want to do. For example, a lot of browser-based work is not totally deterministic but rather requires more intelligence during the automation, e.g. summarizing information on a web page or dynamically filling in a text box with the right answer.


Wow! I can totally see that now. Basically you want to build a more intelligent browser-automation tool :)


Wow - did not know about that feature. Thanks for highlighting it!


I was curious about the YC backing and realised that you're the founders of https://www.ztool.co. Did you pivot or is this a complementary product to ztool? Would love to hear more about your journey :)


Hey, nice detective work :)

This is definitely more of a pivot, though the ZTool product also went through a few bigger iterations itself. I think my learnings from building and talking to users the last few months boil down to two main things that led me to work on autotab.

    1. AI-generated software isn't ready for non-technical users. With ZTool I was initially focused on making it easy for people who don't know how to code to create automations. For the reasons we talked about above, I think having the model's output be code is best for now, so autotab is focused on users that can review and tweak Python code.
    2. How you communicate intent matters. There is a surprising amount of mental work required to go from "this task is annoying" to creating a structured representation of it (c.f. the whole world of process automation). AI demos have focused on short prompts to communicate intent, but those don't work that well for more complex domains. Using the browser to communicate intent seems really powerful because it's so intuitive/familiar.


> This will launch a Chrome session controlled by Selenium and then log you in to Google

Why is logging into Google a requirement?


Fyi we fixed this so now you can get an API key at autotab.com/dashboard and then that is sufficient to authenticate when you run `autotab record`, no more logging in to Google in the Selenium browser!


We currently only allow log in with Google for logging in to autotab itself, but will add login with username/password and other options soon! Depending on feedback we might also build a bring your own Open AI API key option


Ah that makes sense. I missed the API auth part, assumed BYO(OAAK) was the default


Sorry if this is self-explanatory to most, but what does `BYO(OAAK)` mean? Thanks!


I'm also seeing it for the first time.

But I assume it means Bring Your Own (Open AI Authentication Key).


That makes sense, thanks! I assumed BYO was for "Bring your own" but wasn't sure about the 2nd part.


Does this allow me to record something like a workflow or macro I can use as a shortcut in an open browser window, or is it restricted to an automated process of opening a browser window and executing some actions?

Specifically, assume I have a bunch of open tabs, each with an identical element, say the Full Screen button on Youtube video pages. Can I record "Exit Full Screen, move to next tab, enter Full Screen of the video on the page" with Autotab?


You could do a version of this if you always open your Youtube videos in the browser instance your autotab script spins up, but short answer is not really. Macros like this are super interesting though, and we've talked a lot about ways browsers could be designed for AI and automation like this.


This is very interesting, may I ask what macros you would define? Definitely something I'd like to learn more about


The example I've given is really the only situation so far where I desired such functionality. I know very little about web stuff and I don't know if there is a term for what I'm looking for, so I can't give you any pointers. Macros and workflows are just similar concepts I know.


What would make this maximally useful is if an LLM with a ruleset would be available to parse and act on dynamic situations on pages.

For example, add to cart, if the LLM detects it's out of stock (a rule), then do some other action.

What's the possibility of LLM-reinforced, rules based branching logic like this being possible with your software in the future?
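
To make the branching idea concrete, here is a toy sketch in which a classifier labels the page and a rule table picks the next step. Everything here is hypothetical: the labels, the action names, and the keyword classifier that stands in for an LLM call. It says nothing about how Autotab will implement this.

```python
def choose_action(page_text, classify):
    """Pick the next step from a rule table, given a label from `classify`.

    `classify` stands in for an LLM call that maps page text to a rule label;
    here it can be any callable that returns a string.
    """
    rules = {
        "in_stock": "click_add_to_cart",
        "out_of_stock": "notify_and_skip",
    }
    label = classify(page_text)
    # Unknown labels fall back to a safe default rather than guessing.
    return rules.get(label, "stop_and_ask_user")


def keyword_classifier(page_text):
    """Toy classifier used only so the sketch runs without a model."""
    return "out_of_stock" if "out of stock" in page_text.lower() else "in_stock"


print(choose_action("Sorry, this item is Out of Stock.", keyword_classifier))
print(choose_action("Ships tomorrow", keyword_classifier))
```

Keeping the rules in a plain table means the branch logic stays auditable even if the classifier is a model.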


Couldn’t agree more! Working on this actively. Interesting questions include 1) what format is most effective for the user to convey their intent (maybe not point & click) and 2) how to represent the model output such that it is auditable and editable.


Awesome to hear, it's already useful now for us to get started with using it, but this kind of evolution would be a panacea.

I suggest a Discord group for Autotab to start building up a community! Looking forward.


We just set up a Discord! https://discord.gg/seGGxSUgzM


Curious which part of this uses AI. Doesn't it just track mouse / keyboard events across the session?


AI looks to be involved in creating element selectors.


Selectors have been our primary focus so far – they're notoriously finicky! Our roadmap includes more extensive use of AI, both as embedded intelligence, and in the code generation process. For example, one thing we've heard from heavy users of browser automation is that maintenance becomes the largest cost. Self-healing automations will be able to either fix themselves, or give you an alert with a suggested fix to work off of.


The "self-healing" sounds very interesting. I've tried to think, myself, how to approach this in a chrome extension running dom selectors in automations. Curious if you have any high-level thoughts/findings in this area?


We're just getting started on it ourselves but it's a really fun problem. I think the useful thing from our findings so far is that simplifying the DOM representation really helps the model reason about state.
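
As one possible reading of "simplifying the DOM representation", the sketch below strips a page down to tag names, a short allow-list of attributes, and visible text, using only the standard library. The allow-list and the class name `DomSimplifier` are assumptions. This is not the representation Autotab actually uses.

```python
from html.parser import HTMLParser

KEEP_ATTRS = {"id", "name", "type", "href", "aria-label", "role"}
SKIP_TAGS = {"script", "style"}


class DomSimplifier(HTMLParser):
    """Re-emit HTML keeping only tag names, a few attributes, and visible text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(f' {k}="{v}"' for k, v in attrs if k in KEEP_ATTRS and v)
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def simplify(html):
    """Return a compact version of `html` for use in a prompt."""
    parser = DomSimplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = ('<div class="x9f3 a1"><style>.x9f3{color:red}</style>'
        '<button id="buy" class="btn-7fa2" data-v="1"> Add to cart </button></div>')
print(simplify(html))
```

Generated class names and data attributes are dropped, so the text that reaches the model is shorter and less likely to change between page loads.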


I'm confused: the demo shows typing in to select an element in a row, which looks to be AI, but I don't see anything that looks to be AI in the selectors. I'm not even sure how you would work with selectors unless you put the whole HTML into the context window or just ask which locator looks most reliable?


That’s exactly what we do - we sample relevant parts of the DOM and use the model to write the logic for selecting that element. This works pretty well and saves a lot of time that developers otherwise spend inspecting the html structure to write the selectors themselves.

Going forward we’re excited to experiment with more intelligence at runtime e.g. using AI to try to recover if the selector wasn’t found.


So I assume the video is the ground-truth, then the AI has access to the DOM and the video, and generates a selector based on the video during the test run (each time) in order to avoid flakiness due to DOM/class/attribute changes?


Right now the generated script is the ground truth but we’ve been working on augmenting this with images & videos to fall back on. We think defaulting to code is good because it is faster, cheaper and easier to reason about in the 95%+ of times it works. Plain old Selenium will get you pretty far, especially if creating scripts is much easier.


How comfortable should I feel about having this record the input of potentially sensitive information if it's going to send those inputs off to an LLM?

One way I can see around it is to not put anything sensitive in during the record, but sometimes I might need to enter a password etc.


You can pause recording at anytime. For passwords we recommend using the login tools in our starter repo, check out the example config here: https://github.com/Planetary-Computers/autotab-starter/blob/...

For sensitive information that appears in the DOM during recording, there is a chance it could be included in a prompt to the LLM. We're using OpenAI via API, which is SOC 2 & 3 compliant and does not use data for model training (supposedly) https://trust.openai.com


This is a great use of the sidepanel API. Really like the idea and how you implemented it. The Firefox equivalent to that API isn’t identical to Chrome’s, and I believe that there is currently no way to trigger the side panel to open programmatically unless it’s in response to a user action. So it may take a bit of work, but would be exciting to see support for FF in the future.

A bit tangential but I was curious what you used to record the demo video on your landing page with those zoom-in animations during critical moments. I’d like to record something like that for some of my side projects and thought your video looked rather polished.


Thanks! It's still very much early days for the Sidepanel and the API feels very much in flux but an exciting new form factor for browser-integrated experiences. Makes total sense, supporting other browsers is likely something we will do in the future

I used https://www.screen.studio/ for the demo and past demos, it works quite well I find though it can get a bit too excited on the zoom ins and I have to cut out some of them


I'm working on something kind of similar but for Appium. This is excellent work!


Tell me more! Have a link?


Forgive my ignorance, but isn't the chrome extension market kind of dead?

I had some silly extension a long time ago but after receiving a lot of emails from Google saying that they were killing the ecosystem I've dropped them.


autotab is not an always on browser extension like you are probably used to. autotab is only added to the Selenium-controlled browser window you use to write the automation while you're recording.

Also my experience differs more broadly, I use browser extensions quite a bit, e.g. for ad blocking and password management. Afaik Honey was basically a browser extension and was bought by Paypal for $4B.


I think this is the email I got that led me to believe extensions were dead:

Now: You can no longer create new paid extensions or in-app items. This began as a temporary restriction in March 2020 due to the COVID-19 situation. This change is now made permanent.

December 1, 2020: Free trials are disabled. The "Try Now" button in CWS will no longer be visible, and in-app free trial requests will result in an error.

February 1, 2021: Your existing items and in-app purchases can no longer charge money with Chrome Web Store payments. You can still query license information for previously paid purchases and subscriptions. (The licensing API will accurately reflect the status of active subscriptions, but these subscriptions won’t auto-renew.)

At some future time: The licensing API will no longer allow you to determine license status for your users.

I have the whole thing in my inbox if anyone is interested


This all just means that you need to roll your own monetization functionality. I think there are still people building businesses on top of browser extensions.


What distinguishes this from UI.Vision's (https://ui.vision/) session recorder?


I'm not familiar with UI.Vision but from looking at their website briefly it looks like the resulting automations are part of their system/UI. We think that output as code is the way to go, so that you don't have to learn a new UI/language, aren't locked in and can integrate it into a larger project.


Thanks for sharing this! I'm interested in seeing what you've built and understanding it better. I think that your goals and philosophical guideposts make a lot of sense. I see the search application on your homepage, but I'd like to see other examples to spark imagination for how I might practically use this myself.

-However, I'm getting a 404 on your Github link -- is that `autotab-starter` repo private?- (n/m -- looks like it's working now!)


fixed! thanks for the heads up


Amusingly I've just started playing with llama.cpp, and the first successful test (after some issues with AVX instructions on my test VM) was getting Vicuna 30b to complete "The result of your test is":

> "The result of your test is: You should not use Selenium WebDriver for testing web applications. Other automated testing tools such as Cucumber, RSpec or Jest may be more suitable for your needs. [end of text]

Seemed a little mean. :P


My understanding is some of the LLM tooling is heading in this direction like Langchain and Open Interpreter are working on a screenshot to agent action function capability. Hot space !!


Hopefully you're still monitoring this page:

Do the HTML contents of the page get sent to your server for processing? What guarantees are there over privacy?

One of the uses I see for this is in helping users scrape their own transactions from banking and other financial websites.


Yes, we process the contents of the page when users are recording on our servers but don’t store the page HTML. I don’t see any world in which we would sell data, but in case you’re worried about the security of our servers I’m not sure there’s much I can say other than that we use standard security best practices including SSL, 2FA on cloud accounts etc. Does that answer your question?


It does, thanks.



All names are taken, you just have to become more popular to win it.


Unless it's copyrighted/trademarked, in which case you have legal ground to stand on (assuming both are proprietary services, which may not apply here).


Nice work! What are some differences between this project and Katalon Recorder? Katalon Recorder is also a chrome extension, allows recording of actions, and export to Python.


Katalon looks like a cool project, hadn’t heard of them before! We’re more focused on automations as opposed to testing, which is most important for where we take the product next. More focus for us on handling tasks that are less deterministic/require a bit more intelligence.


Congrats on the launch! This definitely seems useful for some stuff I've been doing, like filling out a CRM spreadsheet with the right user data.


Thanks so much! Glad to hear it


Hey, it looks very neat. I was wondering if Autotab currently supports logging into websites that require 2FA. (Your readme mentions Google only.)


Hey, we're looking to build out our library of authentication plugins. Many websites offer "Sign in with Google" which makes things easier, you can just specify that in the autotab.yaml file. If you have a specific website in mind, I'd be happy to add support for it!


Thanks for the reply! I was thinking about an internal website. I don't know if this is a reasonable suggestion, but maybe you could focus on common SSOs and 2FA providers like Duo and OneLogin?


Interesting! I think the way we’ll handle this is the same way we handle Google 2FA atm: We have you log in and do 2FA manually once, then we save the logged out cookies.

Why the logged out cookies? That way even if someone gets the cookies they still need the password, but the automation can log in with just your password, no 2FA needed.
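
Saving and restoring cookies is straightforward with a Selenium-style driver, since `get_cookies()` and `add_cookie()` are both part of the WebDriver API. The sketch below is an illustration of that general technique, not Autotab's implementation. The `FakeDriver` class, the cookie name, and the file path are made up so it runs without a browser.

```python
import json
import os
import tempfile


def save_cookies(driver, path):
    """Write the browser's current cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(driver.get_cookies(), fh)


def load_cookies(driver, path):
    """Add previously saved cookies to the browser session.

    With real Selenium, the driver must already be on a page of the
    cookies' domain before add_cookie is called.
    """
    with open(path, encoding="utf-8") as fh:
        for cookie in json.load(fh):
            driver.add_cookie(cookie)


class FakeDriver:
    """Hypothetical stand-in exposing the two cookie methods used above."""

    def __init__(self, cookies=None):
        self.cookies = list(cookies or [])

    def get_cookies(self):
        return list(self.cookies)

    def add_cookie(self, cookie):
        self.cookies.append(cookie)


path = os.path.join(tempfile.mkdtemp(), "cookies.json")
recorded = FakeDriver([{"name": "device_trust", "value": "abc123"}])
save_cookies(recorded, path)   # after the one-time manual 2FA login

replay = FakeDriver()
load_cookies(replay, path)     # later, before the automated password login
print(replay.cookies)
```

Treat the saved file like a credential: it should not be committed or shared.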


I'd love if it could output Playwright code as well. I haven't tried it but if this works well it's a great concept!


Good to know! The comments have definitely made us update towards looking into Playwright again quite seriously


Yeah - I moved my entire automation library over at the beginning of the year. The docker integrations and ease of switching out browsers without having to manage webdrivers individually sealed the deal. I'm also tentatively using their selectors a bit, although find I have to retreat to Xpath quite a bit. Playwright seems to be the obvious choice for testing - but automation who knows.

If you can be Playwright for automation instead of testing - that'd be a biiiig market. Lots and lots of scrapers out there and it's a market Microsoft is not interested in.


Interesting, can you say more about why you prefer Playwright’s selectors? Helpful to have some anecdotes from real experience!


I don’t prefer them; I’m just playing. If they work, the readability of the code is just simpler vs XPath.

But it’s less than 5% of selectors across the code base


Playwright is definitely the industry standard testing tool for those who are likely to be interested in Autotab too - I'm a test engineer and the last time I used Selenium was probably 2017. Playwright is everywhere that's comfortable using AI-tooling!


Interesting! We are more focused on automations as opposed to testing - would you say the same applies there as well?

Also Selenium does have 10x the downloads of Playwright on PyPI - is Python different or do you think that metric is misleading?


Selenium IS the more popular library, but it’s mostly used in more old-school places like banks and large corporations. This is fairly anecdotal but companies up for AI-guided paths in prod will more likely be on Cypress or Playwright, and from that Playwright fits in for your case because it’s also webdriver driven. The package actually includes a webdriver btw, so you don’t have to ask users to manage that themselves (for example brew chromedriver has to have permission changes to run if you follow your readme, which playwright would avoid).

Edit: as they both use the same interface it might not be too bad to support both?


How is this different from Playwright? There, you can record your steps and get a Python script?


This is super cool!

What are some use cases that you recommend this for, outside of testing browser workflows?


We’re actually mainly focused on automations more than testing. We want to build autotab into your intern/assistant for browser-based tasks that you would rather have someone else do.

For example, autotab helped a designer automate client handoffs. He set up an automation that exported assets from Figma and uploaded them to Google Drive, updated his CRM and notified the client.


Replace the GitHub repo with an executable Electron app, beat the world.


How’d you guess my secret plan? ;) Definitely in the cards going forward


Awesome. I'm mostly curious what the path to monetization for this is


I'm more interested in making it useful than extracting value.

Longer term I think you'd monetize the infrastructure to run the automations at scale in the cloud.


What AI does this use?


We use OpenAI's GPT4 to understand the position of the element being interacted with relative to the structure of the website overall. Lots of cool things we want to try out with open source models, but for now GPT4 is good enough.


Could this work for other languages, for example, R?


Unfortunately we have no plans to support R at the moment, though in my experience GPT4 is pretty good at translating code between languages


macOS only? Wanted to try this on Windows


why does it require logging into google to use?


We plan to allow users to login with other accounts / signup with email + pw, just haven't set that up yet :-)


I don't get why people still use XPaths, CSS selectors or HTML IDs to identify elements, even when they are "recorded". Please please please just use my https://github.com/mherrmann/selenium-python-helium instead. It makes so much more sense.


Let me take the opportunity to really thank you for this. It has been my go-to library for scraping, while also allowing Selenium syntax. The idea behind it is really great. Also the “elements below …” feature is very nice.

Please make the switch to selenium 4, so that it is kept up to date with the rest of the selenium ecosystem e.g. undetectable browsers


I haven't found anything that made me really need to try other things, but the fact that it's just a wrapper and sounds like it handles nested iframes better means I'm trying this next time I need to make a quick selenium script. Thank you


This is the idea behind Testing-Library from what I understand.

https://testing-library.com/


Why wouldn't you use CSS selectors? They are unambiguous, concise and most importantly, a standard. They should be the conventional way to refer to elements.


The CSS selector isn't the problem, it's how you identify it. Many sites have dynamically named selectors, including class names. Even depending on ordering is fraught. Learning how to create robust selectors is about 1/2 the battle in writing a good Selenium/Puppeteer/Playwright/etc script. (been doing this as a major part of my day job for about 14 years)


The CSS structure of a web site is an implementation detail and on a global average, CSS selectors break much more often than user-visible labels.


> and most importantly, a standard.

A standard for what, exactly? Selecting HTML elements? You could say the same about XPath.

Both are implementation details tightly coupled to the code instead of being driven by the “human” elements of the UI, such as the visible text, labels, etc.
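
To illustrate selecting by what the user sees rather than by class names, here is a small sketch that builds an XPath from visible text. It uses only standard XPath 1.0 features (`normalize-space()` and `concat()`). The helper names are made up, and this is neither Helium's nor Autotab's approach.

```python
def xpath_literal(text):
    """Quote `text` as an XPath 1.0 string literal, handling embedded quotes."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # XPath 1.0 has no escape character, so mixed quotes need concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def by_visible_text(tag, text):
    """Build an XPath matching `tag` elements by their trimmed visible text."""
    return f"//{tag}[normalize-space()={xpath_literal(text)}]"


print(by_visible_text("button", "Add to cart"))
print(by_visible_text("a", "Today's deals"))
```

A selector like this survives a CSS refactor but breaks when the label is reworded or translated, so it shifts the fragility rather than removing it.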


Helium seems great! Handling popups sounds really clever, how do you pull that off?


Playwright is another, really good alternative


It is, but the selector issues are the same. (though it does make it much easier to execute scripts in the context of the page when standard selectors don't work)


> It is, but the selector issues are the same

I disagree. Playwright's locators are pretty powerful compared to standard css/xpath selectors.

For example it includes layout based selectors.

https://playwright.dev/docs/other-locators#css-matching-elem...


This looks great, thanks


Related: https://playwright.dev/ has a built-in code generator. In *my opinion*, it's also more pleasant to work with than Selenium.


Interesting, what features are most impactful in making it more pleasant to work with?

We picked Selenium because it seems much more widely used at least in the Python world (selenium has about 10x more downloads on PyPI than playwright). My guess is that's at least partly because Playwright is more focused on the JS/TS ecosystem and testing.


I recently finished moving from Python-selenium to Python-playwright. Though my outcomes might be different, it is such a significant improvement that I would strongly, strongly recommend at least spending a couple days trying it out if you are familiar with selenium.

To my team, selenium’s only advantage in the python ecosystem is that it is easier to hire people with experience. However, anyone familiar with selenium is likely to pick up playwright extremely quickly anyways

Playwright does ship a pytest integration but it is not required

Some highlights:

1. Better waiting — everything is auto-waited and auto-retried by default

2. Easy to install browsers — no need to get separate webdriver browsers. Just run “playwright install chromium”

3. Full, accurate typing

4. Trace viewing to step through a script execution

5. Async support

6. (Arguably) more pythonic syntax and easier to pick up/ train people

I really enjoyed mastering selenium over the years, but I struggle to think of a use case for it outside of maintaining legacy script suites anymore. Playwright just does it better.
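
On the "better waiting" point: the underlying technique is polling a condition until it holds or a timeout expires, which Selenium users typically write by hand with explicit waits. The sketch below shows that polling loop in plain Python. It is not either library's implementation, and `wait_until` and the demo condition are hypothetical.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Demo: the "element" only shows up on the third check.
attempts = {"n": 0}


def element_appears():
    attempts["n"] += 1
    return "element" if attempts["n"] >= 3 else None


found = wait_until(element_appears, timeout=2.0, poll=0.01)
print(found, "after", attempts["n"], "checks")
```

Auto-waiting amounts to the library wrapping every action in a loop like this, so scripts do not need explicit waits sprinkled around each step.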


Have been digging into Playwright a bit - can you say more about the trace viewing to step through a script execution? Very curious!


That’s super helpful, thanks for explaining!


Wildly better developer experience. Playwright is also much less flakey and easier to work with than cypress, another alternative to selenium.

Selenium is quite outdated, though still widely used. It's much older. I would look at downloads over time / rate of change, not total downloads.

My hot take- don't hook yourselves to outdated tech.

Don't take my word for it- go try out selenium, Cypress, and playwright and draw your own conclusions.


fwiw, the Selenium project is now old enough to go to college (19 years this month!); it's not ready to retire just yet. (Side-note: Google Chrome is 15 years old. Age is just a number.) Selenium is still learning from other projects and implementing new things. Specifically, check out the WebDriver BiDi project, adding a bidirectional protocol which was the core of what made Playwright and Puppeteer faster. Also, Selenium devs are working with the W3C to make this work for everyone.

https://github.com/w3c/webdriver-bidi

https://w3c.github.io/webdriver-bidi/

Google wrote a good explainer about WebDriver BiDi here:

https://developer.chrome.com/articles/webdriver-bidi/



True, but like the Chrome extension in the sibling comment, it often hard-codes references, which may not be the same in the next run.


[deleted]



