UBot 3: Choose By Attribute - Regular Expression
Posted by Miriam M-B on 17 December 2010 12:04 PM

So let's say that I am looking for a few links in a search engine about Home Made Candy.

I search for it, and now I would like to choose only the ones I want on the page. So I do not want links that start with:

http://bing.com/etc/etc/etc

 

I just want the search results for websites on Home Made Candy, and I would like to get this done with a regular expression. First, choose a link you would like to scrape.

Choose the link by attribute.

[Screenshot: Choose Item]

 

A parameter window will appear. Choose the link by href and select the Regular Expressions option in the far right corner of the parameter window.

[Screenshot: Choose By Attribute Parameter]

 

Now delete everything after http: in the second column of the parameter window, then place your regular expression code in that field.

[Screenshot: Regex]

 

The code I built for choosing the links I want is the following:

[Screenshot: Expression]

 

Now let's break the code apart a bit just so it makes sense:

http: is what the link will start with

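http:[^0-9]{2} now means we are looking for two non-digit characters after the http: prefix. In practice these are the two slashes, although [^0-9] accepts any character that is not a digit (letters and signs such as . / - + included). A literal // would normally also work and would be stricter: the / sign is an ordinary character in a regular expression and only needs escaping in languages that use it as a pattern delimiter.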

http:[^0-9]{2}[a-zA-Z]{3,} means that after the http://, we are looking for 3 or more letters, lowercase or uppercase.

http:[^0-9]{2}[a-zA-Z]{3,}[^0-9] means that we are looking for another non-digit character after the word. So again that could be any of these signs (./-+ etc.)

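http:[^0-9]{2}[a-zA-Z]{3,}[^0-9][^bing] now means that the next character must not be a lowercase b, i, n, or g. Note that [^bing] is a character class: it tests one character, not the whole word bing, and writing [^b][^i][^n][^g] would not be equivalent because that tests four characters in a row. It is enough here because in http://www.bing.com the character right after www. is a b, so links on Bing, which is our search engine at the moment, are rejected. Those links are usually stray links for ads and for other pages on the search engine website, and they have nothing to do with what we want, which are the search results. Be aware that it also rejects any other site whose name has one of those four letters at that position.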
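If you want to see for yourself what [^bing] does on its own, here is a minimal sketch in Python. Python and its re module are my assumption for illustration only; UBot uses its own regex engine, but a negated character class like this one is standard across the common engines. The name not_bing_letter is made up for this example.

```python
import re

# [^bing] is a negated character class: it matches exactly ONE character
# that is not a lowercase b, i, n, or g. It does not look for the word "bing".
not_bing_letter = re.compile(r"[^bing]")

for ch in ["b", "i", "n", "g", "c", "w", ".", "B"]:
    accepted = not_bing_letter.fullmatch(ch) is not None
    print(repr(ch), "accepted" if accepted else "rejected")
```

Note that the uppercase B is accepted, because the class only lists lowercase letters.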

http:[^0-9]{2}[a-zA-Z]{3,}[^0-9][^bing][a-zA-Z]{3,} now means that we are looking for 3 or more letters, lowercase a-z or uppercase A-Z, after that character.

http:[^0-9]{2}[a-zA-Z]{3,}[^0-9][^bing][a-zA-Z]{3,}[^0-9][a-zA-Z]{3,} Notice that we start repeating the same pieces here, because we are basically dealing with scenarios like http://homemade.com/apples/pies/etc

 

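And so finally, after repeating those pieces a few more times and adding s*h*t*m*l* at the end (each of those letters zero or more times, which allows endings such as html or shtml), we end up with this altogether: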

http:[^0-9]{2}[a-zA-Z]{3,}[^0-9][^bing][a-zA-Z]{3,}[^0-9][a-zA-Z]{3,}[^0-9][a-zA-Z]{3,}[^0-9][a-zA-Z]{3,}[^0-9]s*h*t*m*l*
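To make the repetition easier to see, the expression can be put together from its pieces. This is only a sketch in Python (my choice for illustration, not something UBot requires), and the names SEP, WORD, parts and pattern_text are invented for the example.

```python
import re

SEP = r"[^0-9]"         # one character that is not a digit (a slash, a dot, a letter, ...)
WORD = r"[a-zA-Z]{3,}"  # three or more letters

parts = [
    "http:",
    SEP + "{2}",        # the two characters after http:
    WORD, SEP,          # first word and the character after it
    "[^bing]",          # one character that is not b, i, n, or g
    WORD, SEP,          # the repeated word + separator pairs
    WORD, SEP,
    WORD, SEP,
    WORD, SEP,
    "s*h*t*m*l*",       # optional ending such as html or shtml
]

pattern_text = "".join(parts)
pattern = re.compile(pattern_text)  # compiles without error
print(pattern_text)
```

The printed text is the same expression as the one above, character for character.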

 

 

This regular expression is going to match and find links like these:

http://homemadecandyideas.com/
http://www.homemadecandy.info/
http://www.wchstv.com/gmarecipes/homemadecandy.shtml


But it will ignore links like this one:

http://www.bing.com/explore?q=home+made+candy&FORM=BXFD
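If you would like to check the expression against these example links before putting it into UBot, a small test along these lines can help. It is a sketch in Python, which is an assumption on my part; the pattern only uses character classes and quantifiers, which the common regex engines treat alike. The names PATTERN, wanted and unwanted are made up for the example.

```python
import re

PATTERN = re.compile(
    r"http:[^0-9]{2}[a-zA-Z]{3,}[^0-9][^bing][a-zA-Z]{3,}[^0-9]"
    r"[a-zA-Z]{3,}[^0-9][a-zA-Z]{3,}[^0-9][a-zA-Z]{3,}[^0-9]s*h*t*m*l*"
)

wanted = [
    "http://homemadecandyideas.com/",
    "http://www.homemadecandy.info/",
    "http://www.wchstv.com/gmarecipes/homemadecandy.shtml",
]
unwanted = [
    "http://www.bing.com/explore?q=home+made+candy&FORM=BXFD",
]

# search() looks for the pattern anywhere in the link, without anchoring it.
for url in wanted + unwanted:
    verdict = "match" if PATTERN.search(url) else "no match"
    print(verdict, url)
```

The three candy links are reported as a match and the Bing link as no match. The Bing link fails at the [^bing] step, because the character right after www. is a b.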

 

After inserting your regular expression into the parameter window of the Choose By Attribute command, click OK. Then add a save to file command or an add to list command, and insert a scrape chosen attribute command to scrape the items by href, like the following:

[Screenshot: Script]
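For readers who want to try the same idea outside UBot, here is a rough Python equivalent of the steps: collect every href on a page, keep the ones the expression matches, and add them to a list. This is not UBot script. The class HrefCollector, the function scrape_matching_links and the sample page are all invented for illustration.

```python
import re
from html.parser import HTMLParser

PATTERN = re.compile(
    r"http:[^0-9]{2}[a-zA-Z]{3,}[^0-9][^bing][a-zA-Z]{3,}[^0-9]"
    r"[a-zA-Z]{3,}[^0-9][a-zA-Z]{3,}[^0-9][a-zA-Z]{3,}[^0-9]s*h*t*m*l*"
)

class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def scrape_matching_links(html):
    """Return the hrefs in the page that the expression matches."""
    collector = HrefCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if PATTERN.search(href)]

# Hypothetical results page, standing in for the real search page.
sample_page = """
<a href="http://www.bing.com/explore?q=home+made+candy&amp;FORM=BXFD">Explore</a>
<a href="http://homemadecandyideas.com/">Candy ideas</a>
<a href="/images/search?q=candy">Images</a>
<a href="http://www.homemadecandy.info/">Candy info</a>
<a href="http://www.wchstv.com/gmarecipes/homemadecandy.shtml">Recipes</a>
"""

results = scrape_matching_links(sample_page)
for link in results:
    print(link)
```

Only the three candy links end up in results. The Bing link and the relative link are left out.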

 

The results of your scrape will look like this, with all the stray, unnecessary links removed:

[Screenshot: Results]

