Extracting Site URLs from Google-Search-Results Page
Can anyone advise me on the best way to extract a list of site URLs from a Google search-results page?
To clarify, here's the scenario:
01. I have a Google search-results page that looks something like this:

[Screenshot: Google search-results page, with the ten site URLs highlighted in red]

... and I want to extract the URLs (site links) for the first 10 results that Google has returned. (These are the ones highlighted in red in the image above.)
02. So, I execute the following code:
Code:
-- Read webpage to string
strHTML = TextFile.ReadToString("AutoPlay\\Docs\\sample_html.htm");

-- Use a pattern-matching formula to match for URLs with (a href=") prefix
for strURL in string.gmatch(strHTML, '((a href="https?://[%w_.~!*:@&+$/?%%#-]-)(%w[-.%w]*%.)(%w%w%w?%w?)(:?)(%d*)(/?)([%w_.~!*:@&+$/?%%#=-]*))') do
    -- Trim (a href=") prefix from each URL
    strURL = String.Replace(strURL, 'a href="', "", false);
    -- Write each URL to the text file, in list format
    TextFile.WriteFromString(_DesktopFolder.."\\Output.txt", strURL.."\n", true);
end
03. ...which returns a list of the URLs.
Looks like this:

[Screenshot: the resulting URL list in Output.txt, with the two superfluous URLs highlighted in red]

Not too bad. It's managed to grab the site URL for each of the 10 search results that Google has listed. Good stuff!
But - it's also grabbed 2 superfluous URLs which are highlighted in red - and which I don't want.
You can see from my code that I'm using the string.gmatch function (together with pattern-matching) to grab the URLs, and that the pattern has been prefixed with the 'a href' tag to ensure it only grabs clickable URLs.
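A side note on how string.gmatch behaves here: when the pattern contains captures, the iterator returns the captures rather than the whole match, and with a single loop variable you only receive the first (outermost) capture. That outermost capture starts at the 'a href="' prefix, which is why the String.Replace call is needed afterwards to trim it off. A small standalone illustration (the sample HTML string below is made up purely for this example):
Code:
-- Illustration of gmatch capture behaviour (the sample string is made up).
local strSample = '<a href="https://example.com/page">Example</a>';

-- The outer parentheses capture the whole link text; the inner ones the prefix.
-- With a single loop variable, only the first (outermost) capture would be bound.
for strWhole, strPrefix in string.gmatch(strSample, '((a href=")(https?://[^"]+))') do
    -- strWhole  is: a href="https://example.com/page
    -- strPrefix is: a href="
    -- ...which is why the original code trims the prefix with String.Replace.
end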
But the problem is that I can't see any way to instruct Lua to distinguish between the site URLs and other types of superfluous URLs. It'd be easy if the site URLs had unique HTML tags - but they don't.
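For illustration only, one possible workaround might be to filter on the hostname - assuming (and this is purely an assumption) that the superfluous links all point back at Google's own domain, so any URL whose host contains "google." gets skipped. The simplified pattern and the strHost variable below are hypothetical and not part of the attached project:
Code:
-- Purely illustrative sketch, not a tested solution.
-- Assumption: the superfluous links all point back at Google's own domain,
-- so any URL whose host part contains "google." gets skipped.
strHTML = TextFile.ReadToString("AutoPlay\\Docs\\sample_html.htm");

-- Simplified pattern: capture whatever sits inside a href="..."
for strURL in string.gmatch(strHTML, 'a href="(https?://[^"]+)"') do
    -- Pull out the host part (everything between :// and the first /)
    local strHost = string.match(strURL, '^https?://([^/]+)');
    -- Keep the URL only if the host does NOT contain "google."
    if strHost ~= nil and string.find(strHost, "google%.") == nil then
        TextFile.WriteFromString(_DesktopFolder.."\\Output.txt", strURL.."\n", true);
    end
end
Whether a blanket hostname filter like that is actually reliable on a real results page is far from certain, though.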
So, can anyone advise me on a better way to go about doing this?
Have attached a copy of my APZ so far, to make things easier.