How to SourceExtract the entire html code?

torok · June 3, 2018, 12:31pm

I want to capture the whole HTML code of a page and put it in a CSV file. I didn’t find a command like CaptureEntireHTMLpage.

It sounds simple: just use SourceExtract with a regex like .* but that doesn’t work on special characters and linefeeds.

I managed to build this regex: [<>=-_.:;"’/û,€éèêçàïëü&|()%!+*’\’?$\w\s]*
but it doesn’t work for all web pages.

So is it possible to have a regex that selects all kinds of characters, or to have all the HTML code extracted and saved in a CSV file or directly on the hard drive, like the LocalStorageExport command does?
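On the "regex that selects all kinds of characters" part: in JavaScript regular expressions, `.` skips line terminators by default, while the character class `[\s\S]` matches any character, including linefeeds and non-ASCII symbols. The following is a minimal sketch that runs in plain Node.js; the `source` string is a made-up sample, and whether sourceExtract accepts this pattern for a whole page is an assumption that has not been verified here.

```javascript
// Made-up sample page source with linefeeds and non-ASCII characters.
const source = '<html>\n<body>Prix: 5 € déjà</body>\n</html>';

// `.` does not match line terminators by default,
// so `.*` stops at the first linefeed.
const dotMatch = source.match(/.*/)[0];

// `[\s\S]` means "whitespace or non-whitespace", i.e. any character,
// so `[\s\S]*` runs across linefeeds to the end of the string.
const anyMatch = source.match(/[\s\S]*/)[0];

console.log(JSON.stringify(dotMatch)); // "<html>"
console.log(anyMatch === source);      // true
```

This avoids having to list every special character by hand, which is why the hand-built character class above fails on pages that contain a symbol it does not include.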

TechSupport · June 4, 2018, 7:04am

Thanks for the good problem description. As you said, doing this reliably for all kinds of websites via sourceExtract can be tricky. It sounds like what you need is a “Save Page As” (HTML only) function, like the one Chrome and Firefox offer.

This is not yet available, but it is on our to-do list for the July upgrade. Once we have it, LocalStorageExport will allow you to export the result to a local file.

If you want, you can stay informed about a9t9 software updates with our free a9t9 newsletter.

mm-a9t9 · March 9, 2019, 7:57am

Hi, has this been added? I have the same question, but can’t find any option to achieve this. Thanks.

admin · March 9, 2019, 9:09am

The feature “sourceExtract for the complete HTML page” has not been added yet.

But if your main goal is to save the website HTML, you can do this in the meantime by simulating “Ctrl+S, Enter” with XType. The “DemoXType” macro that ships with Kantu shows this feature. With a few more keystrokes you can also select the “Webpage, Complete” option, which saves all images, too.

mm-a9t9 · March 10, 2019, 12:17am

Thanks, I made that work, though it was difficult to determine where the file would be saved. I had to go hunting for it, and when I found it, it wasn’t in a place where I would have expected it. If there are any hints about how to determine or control where it goes, others may appreciate them. Thanks.

admin · March 10, 2019, 9:46am

The rule is: everything is saved to the download folder.

For more details, see: Where does Kantu save files? (HowTo, UI.Vision RPA Software Forum)

inaspin · June 15, 2021, 2:12pm

I know this is an old topic, but I also needed to accomplish this. I found a solution that works, using the following:

anselm.scholz · October 18, 2021, 11:08am

Hi,

Any news on whether SavePageAs has been implemented yet?

Thanks

Anselm

upa · February 14, 2022, 5:43pm

It would be great if this were implemented!

A simple way to save the entire webpage would be very helpful!

admin · November 18, 2024, 12:02pm

The solution from @inaspin works well. However, web pages sometimes contain tons of scripts that are not needed for text web scraping. Here is an extended executeScript command that removes script and style blocks, strips the remaining tags, and returns the page text:

    {
      "Command": "executeScript",
      "Target": "var str = document.body.innerHTML; // Get page source\n\n//Next: Clean up HTML source before further processing  \n\n//First remove scripts and style tags with their content\nstr = str.replace(/<script\\b[^<]*(?:(?!<\\/script>)<[^<]*)*<\\/script>/gi, '');\nstr = str.replace(/<style\\b[^<]*(?:(?!<\\/style>)<[^<]*)*<\\/style>/gi, '');\n   \n//Then remove all remaining tags but keep their content\nstr = str.replace(/<[^>]+>/g, '');\n   \n//Clean up whitespace\nstr = str.replace(/\\s+/g, ' ').trim();\n   \nreturn str;",
      "Value": "html",
      "Description": "Extract page text from the HTML source. Remove scripts, styles, and tags."
    }
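
To try the cleanup steps outside the browser, the same regular expressions can be wrapped in plain functions and run in Node.js. This is only a sketch: the function names `stripScriptsAndStyles` and `htmlToText` and the `sample` string are made up for illustration and are not part of UI.Vision.

```javascript
// Remove <script> and <style> elements together with their content,
// using the same patterns as the executeScript command above.
function stripScriptsAndStyles(html) {
  return html
    .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '')
    .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
}

// Full pipeline: drop scripts and styles, strip the remaining tags
// (keeping their text content), then collapse runs of whitespace.
function htmlToText(html) {
  return stripScriptsAndStyles(html)
    .replace(/<[^>]+>/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Made-up sample input.
const sample =
  '<div><script>var a = 1;</script><style>p { color: red; }</style>\n' +
  '<p>Hello   <b>world</b></p></div>';

console.log(stripScriptsAndStyles(sample)); // markup kept, scripts and styles removed
console.log(htmlToText(sample));            // Hello world
```

If the goal is to keep the HTML markup and only drop scripts and styles, the first function alone should be enough; the tag-stripping and whitespace steps are what turn the result into plain text.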