Page Scraping the SELENIUM WAY.

So you want data off a webpage. Lots of data. Theoretically, you could get it manually by going through it bit by bit. Practically, I FALL ASLEEP. What you need to do is WEB SCRAPE!

What is web scraping? Lemme google. K, it’s a fancy name for ways of getting data off of websites. So, for example, I wanted the coordinates of the London subway stations. I found this handy page; has exactly the information I need. Problem? THERE’S TOO MUCH. I am not going through that table one by one and storing the data manually.

“But, Sci, how else are you gonna get the data?”, I hear you asking. Lemme tell you. I’m gonna web scrape it like it’s never been web scraped before, and I’m going to use Selenium to do it.

Why Selenium? Because I KNOW Selenium. This means I’m saving oodles of time by not having to learn something completely new for a relatively minor task. Lazy, right? Right, but sometimes results are more important than the method. THIS IS NOT TO SAY YOU SHOULDN’T LEARN NEW THINGS, DEAR READER. Oh no. In fact, I came across Python’s Beautiful Soup and Ruby’s Nokogiri and Mechanize while looking for the best way to web scrape. Because they’re libraries SPECIFIC to web scraping, they’re probably waaaayyyyy better at the task than Selenium. So, read up on them. I plan to. Especially Nokogiri, because I want someone to ask me “how did you do this?” and I want to be able to scream “NOKOGIRI”.

SO. ONTO THE CODE. These are the pretty lines I came up with:

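import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class StationScraper
{
    public static void main(String[] args)
    {
        //Create the Selenium WebDriver object
        WebDriver driver = new FirefoxDriver();
        try
        {
            //Navigate to the subway locations page
            driver.get("http://wiki.openstreetmap.org/wiki/List_of_London_Underground_stations");

            //Find and store the table element (the first one with class "wikitable";
            //if the page has several, narrow this locator down)
            WebElement table = driver.findElement(By.className("wikitable"));

            //Store all the rows in the table. The XPath is relative (".//tr") so it only
            //searches inside the table; one starting with "/" searches the whole page instead.
            List<WebElement> rows = table.findElements(By.xpath(".//tr"));

            //Go through each row in the table
            for(int x = 0; x < rows.size(); x++)
            {
                //Store the columns of each row in a new list (header rows use th, so they come back empty)
                List<WebElement> cells = rows.get(x).findElements(By.xpath("td"));
                //Print out the data of every column in the current row
                for(int y = 0; y < cells.size(); y++)
                {
                    System.out.println(y + " : " + cells.get(y).getText());
                }
            }
        }
        finally
        {
            //Hit it and quit it, even if something blew up. I'M NOT A DOUCHE.
            driver.quit();
        }
    }
}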

And that, as they so annoyingly say, is that. For my next feat, I’m gonna try and figure out how to store this stuff as GeoJSON data. IS GON’ BE A REAL GOOD TIME. Lemme know if you want me to expand on the code above. I’ll do my best to be explanatory.
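For the impatient, here's a rough sketch of what the GeoJSON step might look like. It is not a finished solution. It assumes you've already pulled a station name, a latitude and a longitude out of each row; which column holds what is something you'd have to check against the actual table. The class name, the method names and the sample station are all made up for illustration. The one thing that is NOT made up: GeoJSON wants coordinates as [longitude, latitude], in that order.

```java
import java.util.ArrayList;
import java.util.List;

public class GeoJsonSketch {

    // Build one GeoJSON Feature for a station (hypothetical helper).
    // GeoJSON coordinate order is [longitude, latitude].
    static String toFeature(String name, double lat, double lon) {
        // Escape backslashes and double quotes so the name stays valid JSON.
        // (Control characters are not handled; a real JSON library would do this properly.)
        String safeName = name.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"type\":\"Feature\","
             + "\"geometry\":{\"type\":\"Point\",\"coordinates\":[" + lon + "," + lat + "]},"
             + "\"properties\":{\"name\":\"" + safeName + "\"}}";
    }

    // Wrap a list of Feature strings in a FeatureCollection.
    static String toFeatureCollection(List<String> features) {
        return "{\"type\":\"FeatureCollection\",\"features\":["
             + String.join(",", features) + "]}";
    }

    public static void main(String[] args) {
        List<String> features = new ArrayList<>();
        // Made-up sample row; real values would come from the table cells via getText()
        features.add(toFeature("Example Station", 51.5, -0.1));
        System.out.println(toFeatureCollection(features));
    }
}
```

Building JSON by gluing strings together is fine for a toy like this. For anything bigger, a proper JSON library is probably the saner choice.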

Now excuse me while I find a WordPress template that doesn’t make code look like some sort of abstract painting.

EDIT:

See here for code that you can read without going blind. I tried including it within the post, but APPARENTLY WordPress filters “unwanted” code to protect the user. BAH. I WILL FIGURE YOU OUT WORDPRESS!
