Image extraction via RegEx

I was recently working on a project where I needed to write some RegEx (regular expressions) to extract image tags and their source URLs from RSS feeds.  RegEx can do some really impressive things when it comes to finding specific strings in a mass of text, but it’s also a royal pain in the ass to write, especially if you don’t work with it every day.  For those who have never heard of RegEx, I point you to Wikipedia, which can explain what it’s all about far better than I can.

As I mentioned, my goal was to extract image tags and their source URLs from RSS feeds.  Part of this wouldn’t be necessary if people actually used the image node within RSS rather than adding the image to the description’s contents…but I digress.  Since these expressions may fulfill a common need, I am posting them below.  Currently there are two separate expressions, making it a two-step process to extract an image’s source URL: the first pulls out the complete image tag, and the second pulls the URL out of that tag.  Doing it this way was less complicated and made it easier for me to verify that it works.  If someone more familiar with RegEx can optimize this into a single expression, I welcome your input.

Extracts complete image tag <img ***** />
    (<|&lt;)img\b[\s\S]*?/(>|&gt;)
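
As a rough sketch of this first step, here is how it might look in Python with the standard `re` module. It uses a simplified, non-greedy form of the tag expression so that two images in the same description come back as two separate matches. The function name and the sample feed text are made up for illustration.

```python
import re

# Simplified, non-greedy tag expression (an assumption for this sketch):
# match from "<img" or "&lt;img" up to the first "/>" or "/&gt;".
IMG_TAG = re.compile(r'(?:<|&lt;)img\b[\s\S]*?/(?:>|&gt;)', re.IGNORECASE)


def extract_img_tags(text):
    """Return every self-closing image tag found in text (hypothetical helper)."""
    return [m.group(0) for m in IMG_TAG.finditer(text)]


# Made-up RSS description with one plain and one entity-escaped image tag.
sample = ('<p><img src="http://example.com/a.png" alt="A" /> some text '
          "&lt;img src='http://example.com/b.jpg' /&gt;</p>")

tags = extract_img_tags(sample)
for tag in tags:
    print(tag)
```

Note that this sketch, like the expression it is based on, only finds tags that end in `/>`. An image tag written without the closing slash would need a looser pattern.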

Extracts source URL from image tag
    (?<=src=")([^"]*)(?=")|(?<=(src='))([^']*)(?=')
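
The second step can be sketched the same way in Python. The first pattern below is the source expression written with plain ASCII quotes and without the extra capture group. The second is one possible way to fold both steps into a single expression, by capturing the opening quote and reusing it as a backreference. Both function names are hypothetical, and I have only tried this shape of pattern against simple tags like the ones shown.

```python
import re

# Step two: the URL is whatever sits between src=" and the next ",
# or between src=' and the next '.
SRC_URL = re.compile(r'''(?<=src=")[^"]*(?=")|(?<=src=')[^']*(?=')''')

# Possible single-pass version (an assumption, not a tested drop-in):
# group 1 is the opening quote, group 2 is the URL, \1 is the closing quote.
IMG_SRC = re.compile(r'''(?:<|&lt;)img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1''',
                     re.IGNORECASE)


def extract_src(img_tag):
    """Return the src URL of one image tag, or None if it has no src."""
    match = SRC_URL.search(img_tag)
    return match.group(0) if match else None


def extract_all_srcs(text):
    """Return the src URL of every image tag in text, in one pass."""
    return [m.group(2) for m in IMG_SRC.finditer(text)]


sample = ('<p><img src="http://example.com/a.png" alt="A" /> some text '
          "&lt;img src='http://example.com/b.jpg' /&gt;</p>")
print(extract_all_srcs(sample))
```

One caveat with the lookbehind version: it will also match attributes that merely end in `src`, such as `data-src`, so it is best applied only to tags already isolated by the first step.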

These expressions could easily be edited to extract other HTML elements from RSS or other text.
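
To give an idea of what such an edit might look like, here is a small Python sketch that builds the same kind of expression for any tag and attribute name. The helper `attribute_values` and the sample text are invented for this example, and the pattern assumes the attribute value is quoted.

```python
import re


def attribute_values(text, tag, attr):
    """Return the quoted values of attr on every tag element in text.

    Hypothetical helper: tag and attr are escaped and dropped into the
    same pattern shape used for image tags.
    """
    pattern = re.compile(
        rf'''(?:<|&lt;){re.escape(tag)}\b[^>]*?\b{re.escape(attr)}\s*=\s*(["'])(.*?)\1''',
        re.IGNORECASE,
    )
    return [m.group(2) for m in pattern.finditer(text)]


# Made-up text with a link and an RSS enclosure.
sample = ('<a href="http://example.com/page">link</a> '
          "<enclosure url='http://example.com/ep.mp3' type=\"audio/mpeg\" />")

print(attribute_values(sample, "a", "href"))
print(attribute_values(sample, "enclosure", "url"))
```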
