Pattern-matching for valid URLs


  • Pattern-matching for valid URLs

    Anyone know of a good regex / pattern-matching formula to check a given string for valid URLs via the str:match (or other) function?

    eg. I'm trying to get something similar to this to work:
    Code:
    str = "blah, blah, blah ...https://forums.indigorose.com ...blah,blah, blah"    
    print(str:match("http://www.|https://www.|http://|https://)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$"))

  • #2
    Okay, I've worked out this Pattern:
    Code:
    ^[http://][https://][http://www.][https://www.]+%w+%.%w+[/%w%.]+$
    ...which correctly matches valid URL structures such as:

    http://www.example.com
    https://www.example.com
    https://www.example.com/test.php


    However, it also accepts URLs with missing TLDs as valid
    eg. http://www.example
    I've attached an example. Can someone tell me what's wrong with my Pattern structure?


    And:
    There's also a very good Regex for matching URLs at: https://code.tutsplus.com/tutorials/...know--net-6149 which goes like this:
    Code:
    /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
    ...but I can't figure out how to express this one as a Pattern. Anyone know how to do it?
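One way to approach the regex above with Lua patterns, as a rough sketch rather than an exact translation: Lua patterns have no optional groups and no `{2,6}` repetition counts, so the optional scheme is stripped in a separate step and the TLD length is checked after the match. The function name `is_url` is hypothetical, `_` is added to the path set because `%w` does not include it, and edge cases can differ from the regex because the length test runs after matching instead of during backtracking.

```lua
-- Hypothetical helper: approximates the regex with two steps plus a length check.
local function is_url(s)
  -- Step 1: the scheme is optional, so remove it if present
  -- (gsub returns two values; only the resulting string is kept).
  local rest = s:gsub("^https?://", "", 1)

  -- Step 2: host, a literal dot, a TLD-like run, then an optional path.
  local host, tld = rest:match("^([%da-z%.%-]+)%.([a-z%.]+)[/%w_ %.%-]*$")
  if not host then
    return false
  end

  -- Step 3: emulate {2,6} with an explicit length test.
  return #tld >= 2 and #tld <= 6
end

print(is_url("https://www.example.com/test.php"))
print(is_url("http://google.c"))
```

The first call should report true and the second false, since the single-letter `c` fails the length test.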

    • #3
      Update:
      Hmmm, seems I'm getting a little ahead of myself here (apologies for the duplicate posting). But it seems the following simplified pattern does the same job:
      Code:
      ^[https//:]+%w+%.%w+[/%w%.]+$
      Sample is attached again.

      But it's still incorrectly identifying URLs with missing TLDs.
      This has me stumped?
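A likely reason, sketched below on the assumption that this runs on a stock Lua 5.1 or later interpreter: square brackets define a set of single characters, so `[https//:]+` matches any run of those characters in any order, and nothing in the pattern requires a second dot. In `http://www.example`, the `www` is consumed by the first `%w+` and `example` plays the part of the TLD.

```lua
local pattern = "^[https//:]+%w+%.%w+[/%w%.]+$"

-- No TLD, yet it matches: "www" satisfies the first %w+ and "example"
-- is split between the second %w+ and the trailing set.
print(("http://www.example"):match(pattern))

-- The leading set ignores order, so a scrambled scheme also passes.
print(("ptth:/www.example"):match(pattern))

-- With no dot at all after the scheme, the match fails and nil is printed.
print(("http://example"):match(pattern))
```

Rejecting `http://www.example` therefore needs extra knowledge (for instance a required number of labels or a list of known TLDs), not just a different character set.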

      • #4
        Addendum:
        I've also just noticed that this Pattern is failing to correctly identify URLs with blank spaces between characters
        eg. https://www.example.com/test%20this%20script.php

        Arghh! My brain is starting to go into meltdown.

        Pattern-matching is HARD, man!


        • #5
          And around, around we go ...
          Okay, have now fixed the problem with actual blank spaces with this amended Pattern:
          Code:
          ^[http://][https://][http://www.][https://www.]+%w+%.%w+[/%w%.%s*]+$
          ..so it will work if the URL looks like: https://www.example.com/test this script.php

          but still no solution if the URL is actually coded like this: https://www.example.com/test%20this%20script.php

          Have I confused everyone, yet?
          I'm slowly driving myself crazy with this.


          • #6
            Just a heads-up: I didn't read the full thread, but from what I can remember Lua only supports patterns, not full regex.
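A small sketch of what that difference means in practice, assuming a stock Lua interpreter: in a Lua pattern, `|` and `{n}` have no special meaning and are matched as literal characters, while `?` still works but only on a single character or class, not on a group.

```lua
-- "|" is not alternation in Lua patterns; it is a literal character.
print(("http://"):match("http://|https://"))   -- no match, prints nil
print(("a|b"):match("a|b"))                    -- matches the literal text

-- "{2}" is not a repetition count; the braces are literal.
print(("ab"):match("[a-z]{2}"))                -- no match, prints nil
print(("a{2}"):match("[a-z]{2}"))              -- matches the literal text

-- "?" does work on a single character, so an optional "s" is fine.
print(("https://x"):match("https?://"))
print(("http://x"):match("https?://"))
```

This is why the regex from the first post cannot be pasted into `str:match` unchanged.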

            • #7
              Yes, I know. The regex example in the 2nd posting is there because I was asking if anyone knew how to re-express it as a pattern. (Re-expressed as a pattern, it would have provided a very thorough solution).

              Regardless, I'm now at the point where I've refined my original pattern down to this:
              Code:
              https//:]+%w+%.%w+[/%w%.%s*]+$
              which is working - except on URLs that use "%20" to fill in blank spaces.

              So, the pattern will work, except in cases when the URLs look like this: https://www.example.com/test%20this%20script.php
              At this point it becomes problematic, and is the part I'm now trying to figure out.


              • #8

                Oops, typo there! That pattern should read as:
                Code:
                ^[https//:]+%w+%.%w+[/%w%.%s*]+$


                • #9
                  Hi m8, not sure if these will help in your quest (only going by name, haven't really checked them), but here they are from my archive:
                  RegXmATCH.apt
                  PatternMatching.apz
                  luasocket-2.0.2-lua-5.1.2-Win32-vc8.zip

                  Cheers


                  • #10
                    Ah hah, BINGO!
                    Finally worked it out - I think?

                    Good ol' Lua.org to the rescue. The Patterns chapter at https://www.lua.org/pil/20.2.html seems to have been hiding the solution all along.

                    This seems to be the magic pattern which will include those pesky URLs formatted with "%20" for blank spaces:
                    Code:
                    ^[https?://]+%w+%.%w+[/%w_%.%s*(%%20)]+$
                    So, attached is the final apz example, showing how the pattern can be applied. (I'm sure a more thorough and robust pattern can be configured - so I'm still open to improvements).

                    @colc,
                    Thanks anyway, dude.
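A note on the `(%%20)` part of the pattern above, with a sketch assuming a stock Lua interpreter: inside square brackets the parentheses do not form a group. They are added to the set as literal characters, and `2` and `0` are already covered by `%w`. What actually makes `%20` match is `%%`, the escaped percent sign, so the shorter set in `percent_only` below behaves the same for these URLs. The variable names are illustrative only.

```lua
local with_group   = "^[https?://]+%w+%.%w+[/%w_%.%s*(%%20)]+$"
local percent_only = "^[https?://]+%w+%.%w+[/%w_%.%s*%%]+$"

local url = "https://www.example.com/test%20this%20script.php"

-- Both patterns accept the percent-encoded URL.
print(url:match(with_group) ~= nil)
print(url:match(percent_only) ~= nil)

-- Side effect of "(%%20)" inside a set: literal parentheses are accepted too.
local odd = "https://www.example.com/a(b)"
print(odd:match(with_group) ~= nil)
print(odd:match(percent_only) ~= nil)
```

Only the last check differs between the two patterns, because `(` and `)` are members of the first set but not the second.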

                    • #11
                      Update:
                      Final Updated .apz - to include matching on URLs with hyphenated domain names.
                      Code:
                      if (strCombo:match("^[https?://]+%w+%.%w+[/%w_%.%s*(%%20)(%-)]+$")) then

                      • #12
                        Hi Bio,
                        Can you try this pattern?

                        Pattern 1 is yours and pattern 2 is another one I adapted from the web.

                        Match a valid URL (2patterns).apz

                        Alain.


                        • #13
                          Hi Alain,

                          Yes, that's awesome, buddy! Much better.

                          I was running my 'final' pattern again and have started seeing problems with it that I missed before. It fails with unusual (but valid) URL structures.

                          Yours however, seems to capture every single URL variation that I throw at it. So KUDOS to you!
                          Many thanks - this one's going straight to the pool-room, too!

                          Cheers!

                          Nb.
                          I'm trying to go through the pattern-string (to break it into its component parts for 'commenting') so that newbies can reference the meaning of each component. I'm going back and forth to Lua.org's pattern library to correctly identify each part of the pattern - but it's very brain-straining! LOL, I'll get there with persistence, though.

                          Thanks again, buddy!


                          • #14
                            Edit,
                            One thing it's still failing at though, is broken URL strings. (ie. Partial URLs with their TLD missing):

                            eg.
                            http://google
                            http://google.


                            ...both come up as valid URLs.

                            Also, it's correctly flagging something like http://google.ca as valid, but failing to flag http://google.c as invalid.
                            So, I'm still working on a way to eliminate this error from the pattern. From a utilitarian perspective, it's not all that important - but it'd be nice to get it perfect.

                            Anyway, for anyone else looking for a pattern-match formula to identify valid URLs, here's the new pattern (thanks to Alain):
                            Code:
                            https?://(([%w_.~!*:@&+$/?%%#-]-)(%w[-.%w]*%.)(%w+)(:?)(%d*)(/?)([%w_.~!*:@&+$/?%%#=-]*))
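For anyone wanting to see what each part of that pattern does, here is a commented breakdown as a sketch. It rebuilds the same pattern string piece by piece with `table.concat`; the comments are one reading of each piece, not documentation from the pattern's author, and the sample URL and variable names are made up for illustration.

```lua
local pattern = table.concat({
  "https?://",                  -- scheme: "http://" or "https://" (the "s" is optional)
  "(",                          -- capture 1 opens: everything after the scheme
  "([%w_.~!*:@&+$/?%%#-]-)",    -- capture 2: shortest possible prefix, e.g. "user:pass@"
  "(%w[-.%w]*%.)",              -- capture 3: host labels, ending with the dot before the last label
  "(%w+)",                      -- capture 4: last label (the TLD)
  "(:?)",                       -- capture 5: optional colon before a port
  "(%d*)",                      -- capture 6: port digits, possibly empty
  "(/?)",                       -- capture 7: optional slash starting the path
  "([%w_.~!*:@&+$/?%%#=-]*)",   -- capture 8: path, query string and fragment characters
  ")",                          -- capture 1 closes
})

local sample = "https://www.example.com:8080/test%20this.php"
local all, prefix, host, tld, colon, port, slash, rest = sample:match(pattern)

print(all, prefix, host, tld, colon, port, slash, rest)
```

Because the pattern has no `^` or `$` anchors, it will also find a URL embedded in a longer string, which suits the original question about searching text for URLs.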


                            • #15
                              Edit Again,

                              Okay, scratch the initial part of my last observation. It IS correctly flagging URLs with their TLDs missing.
                              LOL, this thing has me losing my mind to the point where I'm not even seeing straight anymore!

                              Sorry for the bum steer.

                              So to reiterate, the only thing now failing, is catching partial TLDs like http://google.c
                              LOL, almost there!
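One possible way to catch partial TLDs such as `http://google.c`, sketched under stated assumptions: Lua patterns cannot express "two or more" with a count, but the TLD is already available as the fourth capture of the pattern from post #14, so it can be tested separately. The helper name `has_plausible_tld` is hypothetical, and the rule "at least two letters" is an assumption that would also reject numeric hosts such as IP addresses.

```lua
local URL_PATTERN =
  "https?://(([%w_.~!*:@&+$/?%%#-]-)(%w[-.%w]*%.)(%w+)(:?)(%d*)(/?)([%w_.~!*:@&+$/?%%#=-]*))"

-- Hypothetical helper: true only when the URL pattern matches AND the
-- last host label consists of two or more letters.
local function has_plausible_tld(url)
  -- Captures 1 to 3 are ignored; capture 4 is the last label.
  local _, _, _, tld = url:match(URL_PATTERN)
  if not tld then
    return false
  end
  -- "%a%a+" means one letter followed by one or more letters.
  return tld:match("^%a%a+$") ~= nil
end

print(has_plausible_tld("http://google.ca"))
print(has_plausible_tld("http://google.c"))
print(has_plausible_tld("http://google"))
```

Only the first call should report true. A stricter check would compare the label against a list of real TLDs, which no pattern alone can do.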
