question about regular expression


  • question about regular expression

    Hi guys,

    I'm currently working on a regular expression and it's almost working, but I can't find the solution to my problem and I'm starting to believe there's no way to do it. What I basically need is a regular expression that matches all <a href> tags in a page and returns the URL contained in each. The problem is that it matches things I don't want, and I'm stuck on finding a solution.

    Here's the PHP command I use:

    preg_match("/[hH][rR][eE][fF]=\"?'?[^\s\"']*\"?'? ?/", $in[$i], $out);

    This command works, but it also matches things like href="mailto: ..." and href="javascript: ... ". I could remove them afterwards by running strstr on them, but I'd really prefer that preg_match didn't return them to me at all.

    Does anyone know if that's possible?

    Spazm
    P3-667@810 retail, Asus CUSL2-C, 2*128 mb PC-133(generic), G400DH 16mb, SBLive value, HollyWood+, 1*Realtek 8029(AS) and 1*Realtek 8039C, Quantum 30g, Pioneer DVD-115f

  • #2
    The colon should help you out: mailto:, javascript:.

    So add a requirement that whatever comes before the colon must be http, or something along those lines.
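
A minimal sketch of that suggestion in PHP, assuming the URL comes right after href= and may or may not be quoted. The sample $html string and the variable names are made up for illustration and are not from the original post.

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<a href="http://example.com/">site</a> '
      . '<a href="mailto:me@example.com">mail</a> '
      . "<a href='javascript:void(0)'>js</a> "
      . '<a href=HTTP://example.org/page>org</a>';

// The /i flag makes the match case-insensitive, so [hH][rR][eE][fF] is not needed.
// Group 1 captures the URL, which must begin with "http:".
$pattern = '/href=["\']?(http:[^\s"\'>]*)/i';

// preg_match_all collects every match; $matches[1] holds the captured URLs.
preg_match_all($pattern, $html, $matches);
print_r($matches[1]);
```

With this input, only the two http: links end up in $matches[1]; the mailto: and javascript: links are skipped because the text after href= does not start with http:.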
    Gigabyte P35-DS3L with a Q6600, 2GB Kingston HyperX (after *3* bad pairs of Crucial Ballistix 1066), Galaxy 8800GT 512MB, SB X-Fi, some drives, and a Dell 2005fpw. Running WinXP.


    • #3
      Hi,

      I tried it, but it doesn't do exactly what I want. The problem comes when the http: isn't present, as in href="/" or href=/index.html ... I'm really stuck with these. I've started stripping them with substr, but if a solution is found to correct this I'll change the code.

      Spazm

      • #4
        I've just talked with my boss and it seems your solution will do in the end. He changed his mind about parsing all the URLs; now he only wants those that start with http: ... Anyway, thanks Wombat.

        Spazm

        • #5
          You should still be able to write it so that if it has a colon then it must be http:, otherwise just take it.

          I could do it easily in Perl, but I don't know PHP's rules.
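
One way this could look with PHP's Perl-compatible regex functions is a negative lookahead. This is only a sketch under some assumptions: it reads "has a colon" as "starts with a URL scheme such as mailto: or javascript:", and the sample $html string and variable names are invented for illustration.

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<a href="http://example.com/">a</a> '
      . '<a href="/">b</a> '
      . '<a href=/index.html>c</a> '
      . '<a href="mailto:me@example.com">d</a> '
      . "<a href='javascript:void(0)'>e</a>";

// Group 1 accepts the URL when either:
//   - it starts with "http:", or
//   - it does not start with any "scheme:" prefix (the negative lookahead).
// The trailing + requires at least one URL character, so an empty href is skipped.
$pattern = '/href=["\']?((?:http:|(?![a-z][a-z0-9+.\-]*:))[^\s"\'>]+)/i';

preg_match_all($pattern, $html, $matches);
print_r($matches[1]);
```

Here the http: link and the two scheme-less links (/ and /index.html) are captured, while mailto: and javascript: are rejected. Note that https: links would also be rejected by this pattern, since only http: is allowed as a scheme.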

          • #6
            Yeah, that's an idea... I'll have to test it later. Anyway, since my boss decided to drop all the URLs that don't contain http:, it simplifies my job a lot, which is a good thing for me since this is my first try with regular expressions. Everything's been fixed with this:

            $content = preg_replace("/(<[Aa][^>]*[hH][rR][eE][fF]=.{0,1})([hH][tT]{2}[pP]:[^\s\"']*)([^>]*>[^>]*<\/[aA]>)/", "$1TEST$2$3", $content);
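
For reference, the same three capture groups can probably be written more compactly with the /i flag instead of per-letter character classes. This is a sketch, not a tested drop-in replacement; the sample $content string is made up, and TEST is just the placeholder from the call above.

```php
<?php
// Hypothetical sample input, for illustration only.
$content = '<A HREF="http://example.com/x">link</A> <a href="/local">local</a>';

// Group 1: everything from "<a" up to and including the optional opening quote.
// Group 2: the URL, which must start with "http:".
// Group 3: the rest of the tag, the link text, and the closing </a>.
$pattern = '/(<a[^>]*href=.?)(http:[^\s"\']*)([^>]*>[^>]*<\/a>)/i';

// ${1} keeps the group reference unambiguous when text follows it directly.
$content = preg_replace($pattern, '${1}TEST$2$3', $content);
echo $content;
```

With this input, TEST is inserted in front of the http: URL in the first link, and the second link, which has no http: prefix, is left untouched.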

            Good thing that PHP and Perl handle regular expressions in basically the same way, because I had to read a Perl book to get enough info to build that thing.

            Thanks a lot.

            Spazm