Extracting Site URLs from Google-Search-Results Page


    Can anyone advise me on the best way to extract a list of Site URLs from a Google Search return?
    To clarify, here's the scenario:


    01. I have a Google search-return that looks something like this:

    [screenshot: Google search-results page, with the 10 site URLs highlighted in red]

    ... and I want to extract the URLs (site-links) from the first 10 results which Google has returned. (These are highlighted in red in the above image.)


    02. So, I execute the following code:
    Code:
    -- Read webpage to string
    strHTML = TextFile.ReadToString("AutoPlay\\Docs\\sample_html.htm");
    
    -- Use a pattern-matching formula to match for URLs with (a href=") prefix
    for strURL in string.gmatch (strHTML, '((a href="https?://[%w_.~!*:@&+$/?%%#-]-)(%w[-.%w]*%.)(%w%w%w?%w?)(:?)(%d*)(/?)([%w_.~!*:@&+$/?%%#=-]*))') do
    
        -- Trim (a href=") prefix from each URL
        strURL = String.Replace(strURL, 'a href="', "", false);
        
        -- Write each URL to textfile, in list format
        TextFile.WriteFromString(_DesktopFolder.."\\Output.txt", strURL.."\n", true);
    end

    03. ...which returns a list of the URLs.
    Looks like this:

    [screenshot: Output.txt contents - all 12 extracted URLs, with the 2 superfluous ones highlighted in red]
    Not too bad. It's managed to grab the site-URL for each of the 10 search results that Google has listed. Good stuff!
    But - it's also grabbed 2 superfluous URLs which are highlighted in red - and which I don't want.

    You can see from my code that I'm using the string.gmatch function (together with Lua pattern-matching) to grab the URLs, and that the pattern has been prefixed with the 'a href' tag to ensure that it only grabs clickable URLs.
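    To illustrate the idea (a minimal plain-Lua sketch, runnable outside AMS, using a simplified pattern and a made-up sample string rather than the real search page):
    Code:
    -- Hypothetical sample of HTML, for illustration only
    local sample = '<a href="http://www.example.com/page">Example</a>'
               .. ' bare text http://ignored.com '
               .. '<a href="https://test.org/">Test</a>'

    -- Only URLs preceded by (a href=") are matched, so the bare URL is skipped
    for strURL in string.gmatch(sample, '(a href="https?://[%w_.~!*:@&+$/?%%#=-]*)') do
        -- Trim the (a href=") prefix, leaving just the URL
        strURL = string.gsub(strURL, 'a href="', "")
        print(strURL)  -- prints http://www.example.com/page, then https://test.org/
    end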

    But the problem is that I can't see any way to instruct Lua to distinguish between the site URLs and other types of superfluous URLs. It'd be easy if the site URLs had unique HTML tags - but they don't.

    So, can anyone advise me on a better way to go about doing this?
    Have attached a copy of my APZ so far, to make things easier.




    Attached Files

  • #2
    Hi Bio,
    Looking at the site URLs you don't want, it seems the "google" string is common to both, so...

    Code:
    -- Read webpage to string
    strHTML = TextFile.ReadToString("AutoPlay\\Docs\\sample_html.htm");
    
    -- Use a pattern-matching formula to match for URLs with (a href=") prefix
    for strURL in string.gmatch (strHTML, '((a href="https?://[%w_.~!*:@&+$/?%%#-]-)(%w[-.%w]*%.)(%w%w%w?%w?)(:?)(%d*)(/?)([%w_.~!*:@&+$/?%%#=-]*))') do
    
        -- Trim (a href=") prefix from each URL
        strURL = String.Replace(strURL, 'a href="', "", false);
    
        strbad = String.Find(strURL, "google", 1, false);
        if strbad == -1 then -- string "google" was not found

            -- Write each URL to textfile, in list format
            TextFile.WriteFromString(_DesktopFolder.."\\Output.txt", strURL.."\n", true);
        end
    end
    ...so we can search for the unwanted string and skip it before writing to the text file.
    PS: When I tested your code, it only produced six lines in the text file; with the mod, it produced five, with the unwanted google URL removed.

    Hope this helps, or at least gives you an idea.
    Cheers



    • #3
      G'day colc. Thanks for taking a look at this.

      Mmm, yes - I did consider using 'google' or 'google.com' as the exception-string to purge the superfluous URLs. Obviously though, this becomes problematic as soon as the initial Google search starts turning up search-results that have actual Google URLs as the site-links. Even though this would be infrequent, it's an issue.

      I've started noticing though, that both of the superfluous URLs are turning up in every single Google search - one at the beginning and one at the end. They're common to every Google search. Which makes it super-easy to purge them, just by instructing Lua to ignore the first and last strings as it iterates the for/do loop (see the sketch below). Didn't notice this before - so problem now solved.
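      Something like this, perhaps (a rough sketch only - it assumes the two superfluous URLs are always the very first and very last matches, and uses Lua's # length operator):
      Code:
      -- Read webpage to string
      strHTML = TextFile.ReadToString("AutoPlay\\Docs\\sample_html.htm");

      -- Collect every matched URL into a table first
      local tURLs = {};
      for strURL in string.gmatch (strHTML, '((a href="https?://[%w_.~!*:@&+$/?%%#-]-)(%w[-.%w]*%.)(%w%w%w?%w?)(:?)(%d*)(/?)([%w_.~!*:@&+$/?%%#=-]*))') do
          table.insert(tURLs, String.Replace(strURL, 'a href="', "", false));
      end

      -- Write everything except the first and last entries
      for i = 2, #tURLs - 1 do
          TextFile.WriteFromString(_DesktopFolder.."\\Output.txt", tURLs[i].."\n", true);
      end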

      I'm positive though, that there's a much better way to go about this... After scouring the web for an alternative solution, I did come across a method that uses LuaRocks to parse HTML, but it's way over my head. And I just can't be arsed trying to get my brain around it. I've seen some backend solutions too (using PHP or Python) but they're too complex to implement. So I guess I'll just stick with this solution for the moment. Nb. If anyone has a moment of insight into this, I'm open to suggestions.


      Originally posted by colc
      When I tested your code only produced six lines in text file then with mod it produced five with unwanted google removed
      Can't see any reason why you'd only be getting 5 or 6 URLs returned. I'm guessing you've probably opened the Output.txt directly in Notepad (which doesn't display them in a list properly - think maybe it's an ANSI/Unicode thing). Even though all 12 URLs are actually there, at first glance it kind of looks like they're not. Try opening the Output.txt in WordPad instead (or Notepad++ or the AMS Script Editor) and you should then see all 12 URLs (or 10 with the superfluous URLs purged) listed clearly.



      • #4
        Update:

        @colc,
        Okay, I figured out why Notepad wasn't 'listing' the Output.txt properly. You were probably seeing something like this, right?
        (Whereby each URL is separated by the square boxes?)

        [screenshot: Output.txt in Notepad - the URLs run together, separated by square box characters]
        It was because I inadvertently coded the "new line" directive as "\n" instead of "\r\n" when writing the textfile from string.
        With the correction, it should now display as intended, like this:

        [screenshot: Output.txt in Notepad - each URL now on its own line]
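        The only change needed was to the write call:
        Code:
        -- Corrected: "\r\n" gives Notepad the carriage-return + line-feed it expects
        TextFile.WriteFromString(_DesktopFolder.."\\Output.txt", strURL.."\r\n", true);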
        Have also now modified the for/do loop, based on your suggestion. Thanks, mate.
        It will now purge the superfluous URLs on all Google search-returns (LOL, in theory anyway), and list the results in the Output.txt correctly.

        (Modded APZ attached).


        Cheers.

        Nb.
        Please note that if any of the 10 items displayed on Google's search-return page include page anchors or other URLs formatted with the <a href> tag, then these will show up in the Output.txt too. Haven't figured out a way to generically differentiate these yet. It's a work in progress, I guess.
        Attached Files
