Requesting Suggestions on Downloading Hong Kong Regulations in PDF Format

I am researching Hong Kong legislation and would like to know whether there is a way to download all of the regulations as PDFs using the PDF button on the page linked below. I am considering building a tool with Ui.Vision that downloads each regulation one by one until it reaches the last one. I would greatly appreciate any suggestions or guidance.


Link: https://www.elegislation.gov.hk/index/chapternumber?TYPE=1&TYPE=2&TYPE=3&LANGUAGE=E

Hello, thanks for posting this very interesting question. We happened to have two interns in our office and gave both of them the task of automating this file download.

Below you will find their solutions.

Intern1, first version. This approach uses “storeXpathCount” to count how many downloads there are on each page, and then downloads them step by step.

The drawback is that with this website each click on the download button opens a new window. So 1000 downloads = 1000 windows open.

{
  "Name": "Get All PDF - Intern1 - Click Link ",
  "CreationDate": "2024-4-6",
  "Commands": [
    {
      "Command": "open",
      "Target": "https://www.elegislation.gov.hk/index/chapternumber?TYPE=1&TYPE=2&TYPE=3&LANGUAGE=E",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "label",
      "Target": "Repeat",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "store",
      "Target": "1",
      "Value": "Global_i",
      "Description": ""
    },
    {
      "Command": "executeScript",
      "Target": "return parseInt(${Global_i})",
      "Value": "Global_i",
      "Description": ""
    },
    {
      "Command": "storeXpathCount",
      "Target": "xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div[2]/table/tbody/tr",
      "Value": "Global_max_rows",
      "Description": ""
    },
    {
      "Command": "while",
      "Target": "(${Global_i} <= ${Global_max_rows})",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "echo",
      "Target": "Counter is ${Global_i}",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "store",
      "Target": "xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div[2]/table/tbody/tr[${Global_i}]/td[3]/div/div/div/a/img",
      "Value": "Global_xpath",
      "Description": ""
    },
    {
      "Command": "click",
      "Target": "${Global_xpath}",
      "Value": "",
      "Targets": [
        "xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div[2]/table/tbody/tr/td[3]/div/div/div/a/img",
        "xpath=//a/img",
        "css=#CHAPTER_NO_INDEX_GRID > div.grid-content > table > tbody > tr.even.{_GRID_FIRST_ROW:'true'} > td:nth-child(3) > div > div > div:nth-child(1) > a > img"
      ],
      "Description": ""
    },
    {
      "Command": "executeScript",
      "Target": "return parseInt(${Global_i}) + 1",
      "Value": "Global_i",
      "Description": ""
    },
    {
      "Command": "end",
      "Target": "",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "onError",
      "Target": "#goto",
      "Value": "Complete",
      "Description": "Registered before the click so that a missing next-page link ends the macro"
    },
    {
      "Command": "click",
      "Target": "xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div/a[5]/span",
      "Value": "",
      "Targets": [
        "xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div/a[5]/span",
        "xpath=//a[5]/span",
        "css=#CHAPTER_NO_INDEX_GRID > div:nth-child(1) > a.grid-link.grid-page.grid-page-next > span:nth-child(1)"
      ],
      "Description": ""
    },
    {
      "Command": "gotoLabel",
      "Target": "Repeat",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "label",
      "Target": "Complete",
      "Value": "",
      "Description": ""
    }
  ]
}
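
The row-by-row idea in the macro above can be sketched in plain JavaScript. This is only an illustration, not part of the macro: `rowLinkXPaths` and `GRID_ROWS` are hypothetical names, and the XPath is copied from the macro on the assumption that every row contains a download icon.

```javascript
// Hypothetical helper: builds the per-row locator that the macro stores in Global_xpath.
// XPath positions are 1-based, so the rows run from 1 through rowCount inclusive.
const GRID_ROWS = '//*[@id="CHAPTER_NO_INDEX_GRID"]/div[2]/table/tbody/tr';

function rowLinkXPaths(rowCount) {
  const xpaths = [];
  for (let i = 1; i <= rowCount; i++) {
    xpaths.push(`xpath=${GRID_ROWS}[${i}]/td[3]/div/div/div/a/img`);
  }
  return xpaths;
}

// Example: the locators for a page that has three rows.
console.log(rowLinkXPaths(3));
```

Printing the list for a small row count is a quick way to check the loop bounds before running the macro against a full page.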

Intern1, second version. This version avoids opening all the download windows. Instead, it extracts each file URL and then downloads the file with the open command. This works well and is our recommended solution.

Video (the first method above, then the URL extraction method from t=58s):

For all solutions, remember to set PDFs to download automatically in the browser settings.

{
  "Name": "Get All PDF - Intern1 - Extract URL ",
  "CreationDate": "2024-4-6",
  "Commands": [
    {
      "Command": "open",
      "Target": "https://www.elegislation.gov.hk/index/chapternumber?TYPE=1&TYPE=2&TYPE=3&LANGUAGE=E",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "label",
      "Target": "Repeat",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "store",
      "Target": "1",
      "Value": "Global_i",
      "Description": ""
    },
    {
      "Command": "executeScript",
      "Target": "return parseInt(${Global_i})",
      "Value": "Global_i",
      "Description": ""
    },
    {
      "Command": "storeXpathCount",
      "Target": "xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div[2]/table/tbody/tr",
      "Value": "Global_max_rows",
      "Description": ""
    },
    {
      "Command": "while",
      "Target": "(${Global_i} <= ${Global_max_rows})",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "echo",
      "Target": "Counter is ${Global_i}",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "comment",
      "Target": "store // xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div[2]/table/tbody/tr[${Global_i}]/td[3]/div/div/div/a/img",
      "Value": "Global_xpath",
      "Description": ""
    },
    {
      "Command": "storeAttribute",
      "Target": "xpath=(//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div[2]/table/tbody/tr[${Global_i}]/td[3]/div/div/div/a)@href",
      "Value": "Global_URL",
      "Description": ""
    },
    {
      "Command": "store",
      "Target": "https://www.elegislation.gov.hk",
      "Value": "Global_Initial_URL",
      "Description": ""
    },
    {
      "Command": "store",
      "Target": "${Global_Initial_URL}${Global_URL}",
      "Value": "Global_download_URL",
      "Description": ""
    },
    {
      "Command": "onError",
      "Target": "#goto",
      "Value": "Next",
      "Description": ""
    },
    {
      "Command": "open",
      "Target": "${Global_download_URL}",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "label",
      "Target": "Next",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "comment",
      "Target": "click // ${Global_xpath}",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "executeScript",
      "Target": "return parseInt(${Global_i}) + 1",
      "Value": "Global_i",
      "Description": ""
    },
    {
      "Command": "end",
      "Target": "",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "onError",
      "Target": "#goto",
      "Value": "Complete",
      "Description": ""
    },
    {
      "Command": "click",
      "Target": "xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div/a[5]/span",
      "Value": "",
      "Targets": [
        "xpath=//*[@id=\"CHAPTER_NO_INDEX_GRID\"]/div/a[5]/span",
        "xpath=//a[5]/span",
        "css=#CHAPTER_NO_INDEX_GRID > div:nth-child(1) > a.grid-link.grid-page.grid-page-next > span:nth-child(1)"
      ],
      "Description": ""
    },
    {
      "Command": "gotoLabel",
      "Target": "Repeat",
      "Value": "",
      "Description": "Process the next page; the onError above jumps to Complete when no next-page link is left"
    },
    {
      "Command": "label",
      "Target": "Complete",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "selectWindow",
      "Target": "tab=0",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "selectWindow",
      "Target": "TAB=CLOSEALLOTHER",
      "Value": "",
      "Description": ""
    }
  ]
}
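
The macro above builds the download URL by joining the site address and the extracted href as strings. A minimal sketch of the same step with the standard `URL` constructor follows. It assumes the href may be either site-relative or already absolute; `toDownloadUrl`, `SITE_ORIGIN`, and the sample paths are made up for illustration.

```javascript
// Assumption: hrefs read by storeAttribute are either site-relative ("/...") or absolute.
const SITE_ORIGIN = 'https://www.elegislation.gov.hk';

function toDownloadUrl(href, base = SITE_ORIGIN) {
  // The URL constructor resolves a relative path against the base
  // and leaves an absolute URL unchanged.
  return new URL(href, base).href;
}

// Hypothetical sample paths, not real file locations on the site.
console.log(toDownloadUrl('/example/path/file.pdf'));
console.log(toDownloadUrl('https://www.elegislation.gov.hk/example/path/file.pdf'));
```

Both calls print the same absolute URL, which is the case that plain string concatenation would get wrong for an href that is already absolute.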

Intern2 used a JavaScript-heavy approach. He wrote a JavaScript function for executeScript that downloads all files on the page at once, and uses Ui.Vision “only” to go from page to page. That works, too, but it is rather tricky to debug if something goes wrong (e.g. the website is down or slow).

{
  "Name": "Get All PDF - Intern2 - Javascript",
  "CreationDate": "2024-4-6",
  "Commands": [
    {
      "Command": "store",
      "Target": "https://www.elegislation.gov.hk/index/chapternumber?TYPE=1&TYPE=2&TYPE=3&LANGUAGE=E&p0=",
      "Value": "baseUrl",
      "Description": ""
    },
    {
      "Command": "store",
      "Target": "https://www.elegislation.gov.hk/",
      "Value": "basePdfLink",
      "Description": ""
    },
    {
      "Command": "selectWindow",
      "Target": "TAB=OPEN",
      "Value": "https://www.elegislation.gov.hk/index/chapternumber?TYPE=1&TYPE=2&TYPE=3&LANGUAGE=E&p0=",
      "Description": ""
    },
    {
      "Command": "store",
      "Target": "1",
      "Value": "pageNum",
      "Description": ""
    },
    {
      "Command": "storeAttribute",
      "Target": "xpath=(//a[@class='grid-link grid-page grid-page-last'])[1]@href",
      "Value": "lastPageLink",
      "Description": ""
    },
    {
      "Command": "executeScript_Sandbox",
      "Target": "return ${lastPageLink}.replace('#p0=','')",
      "Value": "extracted_text",
      "Description": ""
    },
    {
      "Command": "executeScript_Sandbox",
      "Target": "return ${extracted_text}.replace('&a0=&gl=0','')",
      "Value": "lastPageNumber",
      "Description": ""
    },
    {
      "Command": "while",
      "Target": "${pageNum} <= ${lastPageNumber}",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "echo",
      "Target": "${baseUrl}${pageNum}",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "selectWindow",
      "Target": "TAB=OPEN",
      "Value": "https://www.elegislation.gov.hk/index/chapternumber?TYPE=1&TYPE=2&TYPE=3&LANGUAGE=E&p0=",
      "Description": ""
    },
    {
      "Command": "open",
      "Target": "${baseUrl}${pageNum}",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "pause",
      "Target": "15000",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "executeScript",
      "Target": "async function downloadAllPDFs() {\n  const pdfLinks = Array.from(document.querySelectorAll('a[href$=\"en-zh-Hant-HK.pdf\"], a[href$=\"en-zh-Hant-HK.assist.pdf\"]')).map(a => a.href);\n  console.log('Number of PDFs found:', pdfLinks.length);\n  \n  for (let pdf of pdfLinks) {\n    const pdfWindow = window.open(pdf, '_blank');\n    \n    //await new Promise(resolve => setTimeout(resolve, 15000));\n    //pdfWindow.close();\n  }\n \n  \n  \n}\ndownloadAllPDFs();\n\nwindow.close();\nwindow.open('https://www.elegislation.gov.hk/index/chapternumber?TYPE=1&TYPE=2&TYPE=3&LANGUAGE=E');\n\n",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "store",
      "Target": "const pdfLinks = Array.from(document.querySelectorAll('a[href$=\".pdf\"]')).map(a => a.href); console.log('Number of PDFs found:', pdfLinks.length); for (let pdf of pdfLinks) { const pdfWindow = window.open(pdf, '_blank'); await new Promise(resolve => setTimeout(resolve, 5000)); pdfWindow.close(); } pdfLinks.length;",
      "Value": "javascript",
      "Description": "Define JavaScript to find and download PDFs, then return count"
    },
    {
      "Command": "executeScript_Sandbox",
      "Target": "${javascript}",
      "Value": "pdfCount",
      "Description": "Execute the JavaScript and store PDF count"
    },
    {
      "Command": "echo",
      "Target": "Number of PDFs downloaded: ${pdfCount}",
      "Value": "",
      "Description": "Echo the PDF count"
    },
    {
      "Command": "comment",
      "Target": "pause // 1500",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "executeScript",
      "Target": "return Number (${pageNum}) + 1;",
      "Value": "pageNum",
      "Description": ""
    },
    {
      "Command": "echo",
      "Target": "${pageNum}",
      "Value": "",
      "Description": ""
    },
    {
      "Command": "endWhile",
      "Target": "",
      "Value": "",
      "Description": ""
    }
  ]
}
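
Two steps of the macro above, reading the last page number and collecting the PDF links, can be tested on their own as pure functions. This is a hedged sketch, not the intern's code: `parseLastPage` and `pdfLinks` are hypothetical names, and the href format `#p0=52&a0=&gl=0` is only inferred from the two replace() calls in the macro (52 is a made-up value).

```javascript
// Assumption: the last-page link's href looks like "#p0=52&a0=&gl=0".
function parseLastPage(href) {
  const match = /[#&?]p0=(\d+)/.exec(href);
  return match ? Number(match[1]) : null; // null when no page number is present
}

// Keeps only hrefs whose path ends in ".pdf" (query and fragment ignored)
// and drops duplicates while preserving order.
function pdfLinks(hrefs) {
  const isPdf = (h) => h.split(/[?#]/)[0].toLowerCase().endsWith('.pdf');
  return [...new Set(hrefs.filter(isPdf))];
}

console.log(parseLastPage('#p0=52&a0=&gl=0'));
console.log(pdfLinks(['/a/x.pdf', '/a/x.pdf', '/a/index.html']));
```

In the browser, the list passed to `pdfLinks` could come from `Array.from(document.querySelectorAll('a'), a => a.href)`. Returning a number from `parseLastPage` also avoids comparing the page counter against a string in the while condition.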

Hello,

Thank you for your thoughtful response and for sharing the solutions your interns developed. I’m eager to review their approaches to automating the file download.

Best regards,
Samuel Liu
