Best practice for continuous, repeated searches and storing of results

csvsave
web-scraping

#1

Hi there!
I’ve been trying out Kantu for quite a while now, and I’m almost where I want to be. This is what I want to do:

  1. Visit a website
  2. Log in (sometimes tricky because of a reCAPTCHA; partly solved by logging in to Google first, worst case is a manual interaction at this point to proceed)
  3. Visit another page, toggle a radio button, trigger a search button
  4. Grab headlines of the search results and write them into a csv
  5. Notify me if any of the headlines changed or new ones arrived
  6. Wait a few minutes and repeat steps 3-5

I ran the macro again after a few days and, to my surprise, it worked on the first try, but I can’t imagine this is the best way to do it.

Either way, my current issues/questions are:

  • It’s not a loop yet. How can I keep the browser open and run the script on a schedule every few minutes (only during business hours)?
  • Step 5 will be handled by a Python script that checks the content of the saved CSV file and deletes everything but the last line (thanks to storage mode). Is there an internal way to compare lines and do something if they don’t match?
  • Right now I’m scraping the headlines with a loopcounter, which doesn’t seem like the greatest idea. Is there a way to just grab all elements with class xyz?
  • Sometimes the session gets closed and I need to log in again. I tried using a gotoIf; I’d appreciate it if someone could check whether that’s a decent solution.
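
The line comparison in step 5 could look something like this in Python (a sketch; the file names `results.csv` and `previous.csv` are placeholders, not anything Kantu produces on its own):

```python
import csv

def changed_headlines(current_csv, previous_csv):
    """Return headlines from the last line of current_csv that are
    missing from the last line of previous_csv."""
    def last_row(path):
        # csvSave appends one row per run, so the last row is the newest.
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        return rows[-1] if rows else []

    previous = set(last_row(previous_csv))
    return [h for h in last_row(current_csv) if h not in previous]
```

A non-empty result would then trigger the notification (mail, push, whatever fits).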

Here is the script. I’d appreciate some comments on whether this is how you would have solved it, and if not, what a better way would be:

{
  "Name": "search",
  "CreationDate": "2019-4-25",
  "Commands": [
    {
      "Command": "store",
      "Target": "2",
      "Value": "!timeout_wait"
    },
    {
      "Command": "store",
      "Target": "true",
      "Value": "!ErrorIgnore"
    },
    {
      "Command": "open",
      "Target": "https://website-to-scrape.com/?page=search",
      "Value": ""
    },
    {
      "Command": "click",
      "Target": "xpath=//*[@id=\"search-form\"]/div[3]/div/label[1]",
      "Value": ""
    },
    {
      "Command": "clickAndWait",
      "Target": "name=action",
      "Value": ""
    },
    {
      "Command": "gotoIf",
      "Target": "${!statusOK}",
      "Value": "SCRAPE"
    },
    {
      "Command": "open",
      "Target": "https://website-to-scrape.com/",
      "Value": ""
    },
    {
      "Command": "clickAndWait",
      "Target": "xpath=//*[@id=\"loginform\"]/fieldset/div[3]/div/button",
      "Value": ""
    },
    {
      "Command": "open",
      "Target": "https://website-to-scrape.com/?page=search",
      "Value": ""
    },
    {
      "Command": "click",
      "Target": "xpath=//*[@id=\"search-form\"]/div[3]/div/label[1]",
      "Value": ""
    },
    {
      "Command": "clickAndWait",
      "Target": "name=action",
      "Value": ""
    },
    {
      "Command": "label",
      "Target": "SCRAPE",
      "Value": ""
    },
    {
      "Command": "store",
      "Target": "2",
      "Value": "loopcounter"
    },
    {
      "Command": "while",
      "Target": "(${loopcounter} <= 5)",
      "Value": ""
    },
    {
      "Command": "echo",
      "Target": "Loop = ${loopcounter}",
      "Value": ""
    },
    {
      "Command": "store",
      "Target": "true",
      "Value": "!ErrorIgnore"
    },
    {
      "Command": "storeText",
      "Target": "xpath=//*[@id=\"headlinescontainer\"]/div[${loopcounter}]/div[2]/h4",
      "Value": "!csvLine"
    },
    {
      "Command": "storeEval",
      "Target": "${loopcounter} + 1",
      "Value": "loopcounter"
    },
    {
      "Command": "endWhile",
      "Target": "",
      "Value": ""
    },
    {
      "Command": "csvSave",
      "Target": "results",
      "Value": ""
    },
    {
      "Command": "echo",
      "Target": "all done!",
      "Value": ""
    }
  ]
}

Best regards


#2

It’s not a loop yet. How can I keep the browser open and run the script on a schedule every few minutes (only during business hours)?

Maybe use the task scheduler?

Step 5 will be handled by a Python script that checks the content of the saved CSV file and deletes everything but the last line (thanks to storage mode). Is there an internal way to compare lines and do something if they don’t match?

In the CSV file? Not that I know of.

Right now I’m scraping the headlines with a loopcounter, which doesn’t seem like the greatest idea. Is there a way to just grab all elements with class xyz?

Maybe with sourceExtract and regular expressions. But if your solution works, why change it? :wink:
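
For reference, the sourceExtract idea boils down to something like this regular-expression approach, shown here in Python (the class name `xyz` and the markup are just illustration; a real HTML parser is more robust):

```python
import re

def grab_by_class(html, class_name):
    """Return the inner text of every element whose class attribute
    contains class_name. Quick and dirty: fine for simple, flat markup,
    but an HTML parser handles nesting and edge cases better."""
    pattern = re.compile(
        r'<\w+[^>]*class="[^"]*\b{}\b[^"]*"[^>]*>(.*?)</\w+\s*>'.format(
            re.escape(class_name)
        ),
        re.DOTALL,
    )
    return [text.strip() for text in pattern.findall(html)]
```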

Sometimes the session gets closed and I need to log in again. I tried using a gotoIf; I’d appreciate it if someone could check whether that’s a decent solution.

I use this approach as well to catch this “not logged in” issue.


#3

Hi ulrich,
thanks for the reply!

Those scripts are running on a Linux machine. Is there an alternative for that?

Best regards


#4

Cron?



#5

Right, thanks, I already got that working yesterday. I had to add export DISPLAY=:0 && before the /opt/google/... command to get it to work properly.
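
For the business-hours restriction, the cron entry itself can cover most of it; for example, `*/5 9-16 * * 1-5` fires every five minutes from 09:00 to 16:55, Monday through Friday. If you need finer control, a small guard can bail out before launching the browser (the hours below are assumptions):

```python
from datetime import datetime

# Assumed business hours: Monday-Friday, 09:00-17:00 local time.
OPEN_HOUR, CLOSE_HOUR = 9, 17

def in_business_hours(now=None):
    """True if `now` (defaults to the current local time) falls
    within the assumed business hours."""
    now = now or datetime.now()
    # weekday(): Monday == 0 ... Sunday == 6
    return now.weekday() < 5 and OPEN_HOUR <= now.hour < CLOSE_HOUR
```

Cron would then call a wrapper that runs the export DISPLAY=:0 && ... command only when `in_business_hours()` is true.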