Best practice for continuous, repeated searches and storing of results

csvsave
web-scraping

#1

Hi there!
I’ve been trying out Kantu for quite a while now, and I’m almost where I want to be. This is what I want to do:

  1. Visit a website
  2. Log in (sometimes tricky because of a reCAPTCHA; partly solved by logging in to Google first, worst case is a manual interaction at this point to proceed)
  3. Visit another page, toggle a radio button, trigger a search button
  4. Grab headlines of the search results and write them into a csv
  5. Notify me if any of the headlines changed or new ones arrived
  6. Wait a few minutes and repeat steps 3-5

I ran the macro again after a few days and, to my surprise, it worked on the first try, but I can’t imagine this is the best way to do it.

Either way, my current issues/questions are:

  • It’s not a loop yet. How can I keep the browser open and run the script on a schedule every few minutes (only during business hours)?
  • Step 5 will be handled by a Python script that checks the content of the saved CSV file and deletes everything but the last line (thanks to storage mode). Is there an internal way to compare lines and do something if they don’t match?
  • Right now I’m scraping the headlines with a loopcounter, which doesn’t seem like the greatest idea. Is there a way to just grab all elements with class xyz?
  • Sometimes the session gets closed and I need to log in again. I tried using a gotoIf; I’d appreciate it if someone could check whether that’s a decent solution.
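
The line comparison in step 5 could look something like this in Python (a sketch; the file names `results.csv` and `previous.csv` are placeholders, not anything Kantu produces on its own):

```python
import csv

def changed_headlines(current_csv, previous_csv):
    """Return headlines from the last line of current_csv that are
    missing from the last line of previous_csv."""
    def last_row(path):
        # csvSave appends one row per run, so the last row is the newest.
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        return rows[-1] if rows else []

    previous = set(last_row(previous_csv))
    return [h for h in last_row(current_csv) if h not in previous]
```

A non-empty result would then trigger the notification (mail, push, whatever fits).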

Here is the script. I’d appreciate some comments on whether this is how you would have solved it, and if not, what a better way would be:

{
  "Name": "search",
  "CreationDate": "2019-4-25",
  "Commands": [
    {
      "Command": "store",
      "Target": "2",
      "Value": "!timeout_wait"
    },
    {
      "Command": "store",
      "Target": "true",
      "Value": "!ErrorIgnore"
    },
    {
      "Command": "open",
      "Target": "https://website-to-scrape.com/?page=search",
      "Value": ""
    },
    {
      "Command": "click",
      "Target": "xpath=//*[@id=\"search-form\"]/div[3]/div/label[1]",
      "Value": ""
    },
    {
      "Command": "clickAndWait",
      "Target": "name=action",
      "Value": ""
    },
    {
      "Command": "gotoIf",
      "Target": "${!statusOK}",
      "Value": "SCRAPE"
    },
    {
      "Command": "open",
      "Target": "https://website-to-scrape.com/",
      "Value": ""
    },
    {
      "Command": "clickAndWait",
      "Target": "xpath=//*[@id=\"loginform\"]/fieldset/div[3]/div/button",
      "Value": ""
    },
    {
      "Command": "open",
      "Target": "https://website-to-scrape.com/?page=search",
      "Value": ""
    },
    {
      "Command": "click",
      "Target": "xpath=//*[@id=\"search-form\"]/div[3]/div/label[1]",
      "Value": ""
    },
    {
      "Command": "clickAndWait",
      "Target": "name=action",
      "Value": ""
    },
    {
      "Command": "label",
      "Target": "SCRAPE",
      "Value": ""
    },
    {
      "Command": "store",
      "Target": "2",
      "Value": "loopcounter"
    },
    {
      "Command": "while",
      "Target": "(${loopcounter} <= 5)",
      "Value": ""
    },
    {
      "Command": "echo",
      "Target": "Loop = ${loopcounter}",
      "Value": ""
    },
    {
      "Command": "store",
      "Target": "true",
      "Value": "!ErrorIgnore"
    },
    {
      "Command": "storeText",
      "Target": "xpath=//*[@id=\"headlinescontainer\"]/div[${loopcounter}]/div[2]/h4",
      "Value": "!csvLine"
    },
    {
      "Command": "storeEval",
      "Target": "${loopcounter} + 1",
      "Value": "loopcounter"
    },
    {
      "Command": "endWhile",
      "Target": "",
      "Value": ""
    },
    {
      "Command": "csvSave",
      "Target": "results",
      "Value": ""
    },
    {
      "Command": "echo",
      "Target": "all done!",
      "Value": ""
    }
  ]
}

Best regards


#2

It’s not a loop yet. How can I keep the browser open and run the script on a schedule every few minutes (only during business hours)?

Maybe use the task scheduler?

Step 5 will be handled by a Python script that checks the content of the saved CSV file and deletes everything but the last line (thanks to storage mode). Is there an internal way to compare lines and do something if they don’t match?

In the CSV file? Not that I know of.

Right now I’m scraping the headlines with a loopcounter, which doesn’t seem like the greatest idea. Is there a way to just grab all elements with class xyz?

Maybe with sourceExtract and regular expressions. But if your solution works, why change it? :wink:
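
For reference, the sourceExtract idea boils down to something like this regular-expression approach, shown here in Python (the class name `xyz` and the markup are just illustration; a real HTML parser is more robust):

```python
import re

def grab_by_class(html, class_name):
    """Return the inner text of every element whose class attribute
    contains class_name. Quick and dirty: fine for simple, flat markup,
    but an HTML parser handles nesting and edge cases better."""
    pattern = re.compile(
        r'<\w+[^>]*class="[^"]*\b{}\b[^"]*"[^>]*>(.*?)</\w+\s*>'.format(
            re.escape(class_name)
        ),
        re.DOTALL,
    )
    return [text.strip() for text in pattern.findall(html)]
```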

Sometimes the session gets closed and I need to log in again. I tried using a gotoIf; I’d appreciate it if someone could check whether that’s a decent solution.

I use this approach as well to catch this “not logged in” issue.


#3

Hi ulrich,
thanks for the reply!

Those scripts are running on a Linux machine. Is there an alternative for that?

Best regards


#4

Cron?



#5

Right, thanks, I already got that working yesterday. I had to add export DISPLAY=:0 && before the /opt/google/... command to get it to work properly.
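
For the business-hours restriction, the cron entry itself can cover most of it; for example, `*/5 9-16 * * 1-5` fires every five minutes from 09:00 to 16:55, Monday through Friday. If you need finer control, a small guard can bail out before launching the browser (the hours below are assumptions):

```python
from datetime import datetime

# Assumed business hours: Monday-Friday, 09:00-17:00 local time.
OPEN_HOUR, CLOSE_HOUR = 9, 17

def in_business_hours(now=None):
    """True if `now` (defaults to the current local time) falls
    within the assumed business hours."""
    now = now or datetime.now()
    # weekday(): Monday == 0 ... Sunday == 6
    return now.weekday() < 5 and OPEN_HOUR <= now.hour < CLOSE_HOUR
```

Cron would then call a wrapper that runs the export DISPLAY=:0 && ... command only when `in_business_hours()` is true.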