I do miss Pokemon. ANYWAYS, onto the meat of the post. Last time, I spoke about organising data into JSON objects that I could work with. I also told you I was only doing that so I could understand JSON before moving onto GeoJSON. Well. Guess what. GeoJSON IS JUST A JSON OBJECT WITH A PARTICULAR STRUCTURE. Talk about an anticlimax.

HERE is the final code for scraping HTML and storing it as JSON objects. And HERE is the new code for doing the same thing but storing it as a GeoJSON object. For a detailed breakdown of what a GeoJSON object looks like, go HERE. If you can’t be bothered to read through it and figure it out, usually I would tell you to leave me alone. But I’m in a giving mood so I’m gonna explain it.

  • A GeoJSON object is ONE JSON object.
  • It is made up of TWO properties:
    • A “features” property
    • A “type” property
  • The “type” property is always set to “FeatureCollection” (side note: JSON strings are case-sensitive, so the capitalisation matters. Copy it exactly.)
  • The “features” property is always an array containing objects that represent your geographical data
  • Each object representing your geographical data contains three properties:
    • A “type” property
    • A “geometry” property
    • A “properties” property
  • The “type” property is always set to “Feature” as that is what each geographical object is called
  • The “geometry” property always contains an object that represents your exact geographical data and has two properties:
    • A “type” property that defines the type of geometry (Point, MultiPoint, LineString, Polygon, etc.)
    • A “coordinates” property that contains the numbers representing your geographical data. For a point it is an array of numbers; for a multipoint or a line it is an array of arrays of numbers; for shapes like polygons the nesting goes deeper still. Easy on the brain and tongue, right?
  • The “properties” property represents an object that contains any properties you want this specific feature to have a la typical JSON fashion

DAS IT. Again, look at the spec page I linked to for more detailed information. If you look at my code, all I’m doing is translating the above into computer speak. It feels awkward at first because you’re creating an object that represents a property and an array, and the array represents objects that represent a property, an object that represents more properties and arrays, and an object that represents even more properties.
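
To make that nesting a bit less awkward, here is a minimal sketch of the structure described above, in plain Java with no JSON library. It is not the code I linked to; the class name GeoJsonSketch, the helper names, the “name” property and the coordinates are all made up for illustration, and the tiny serializer only handles maps, lists, strings and numbers. One thing the spec is firm on: a point's coordinates go longitude first, then latitude.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; a sketch of the GeoJSON nesting, not a full implementation.
public class GeoJsonSketch {

    // Builds one "Feature" object holding a single Point.
    // GeoJSON orders point coordinates as [longitude, latitude].
    static Map<String, Object> pointFeature(String name, double lon, double lat) {
        Map<String, Object> geometry = new LinkedHashMap<>();
        geometry.put("type", "Point");
        geometry.put("coordinates", Arrays.asList(lon, lat));

        // "name" is an assumed property; anything you like can go in here.
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("name", name);

        Map<String, Object> feature = new LinkedHashMap<>();
        feature.put("type", "Feature");
        feature.put("geometry", geometry);
        feature.put("properties", properties);
        return feature;
    }

    // Wraps the features in the single outer "FeatureCollection" object.
    static Map<String, Object> featureCollection(List<Map<String, Object>> features) {
        Map<String, Object> collection = new LinkedHashMap<>();
        collection.put("type", "FeatureCollection");
        collection.put("features", features);
        return collection;
    }

    // Hand-rolled serializer: maps, lists, strings and numbers only.
    // It escapes backslashes and double quotes, nothing else.
    static String toJson(Object value) {
        if (value instanceof Map) {
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                if (!first) sb.append(",");
                sb.append(toJson(String.valueOf(e.getKey())));
                sb.append(":").append(toJson(e.getValue()));
                first = false;
            }
            return sb.append("}").toString();
        }
        if (value instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            boolean first = true;
            for (Object item : (List<?>) value) {
                if (!first) sb.append(",");
                sb.append(toJson(item));
                first = false;
            }
            return sb.append("]").toString();
        }
        if (value instanceof String) {
            String s = (String) value;
            return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        }
        return String.valueOf(value); // numbers
    }

    public static void main(String[] args) {
        List<Map<String, Object>> features = new ArrayList<>();
        // Placeholder coordinates, not real station data.
        features.add(pointFeature("Example Station", -0.1, 51.5));
        System.out.println(toJson(featureCollection(features)));
    }
}
```

Each scraped row would become one call to pointFeature, and the whole list gets wrapped once at the end. In real code a proper JSON library is probably the saner choice; the hand-rolled serializer is only here so the sketch runs on its own.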

Please go away.

So you want data off a webpage. Lots of data. Theoretically, you could get it manually by going through it bit by bit. Practically, I FALL ASLEEP. What you need to do is WEB SCRAPE!

What is web scraping? Lemme google. K, it’s a fancy name for ways of getting data off of websites. So, for example, I wanted the coordinates of the London subway stations. I found this handy page; has exactly the information I need. Problem? THERE’S TOO MUCH. I am not going through that table one by one and storing the data manually.

“But, Sci, how else are you gonna get the data?”, I hear you asking. Lemme tell you. I’m gonna web scrape it like it’s never been web scraped before, and I’m going to use Selenium to do it.

Why Selenium? Because I KNOW Selenium. This means I’m saving oodles of time by not having to learn something completely new for a relatively minor task. Lazy, right? Right, but sometimes results are more important than the method. THIS IS NOT TO SAY YOU SHOULDN’T LEARN NEW THINGS, DEAR READER. Oh no. In fact, I came across Python’s Beautiful Soup and Ruby’s Nokogiri and Mechanize while looking for the best way to web scrape. Because they’re libraries SPECIFIC to web scraping, they’re probably waaaayyyyy better at the task than Selenium. So, read up on them. I plan to. Especially Nokogiri, because I want someone to ask me “how did you do this?” and I want to be able to scream “NOKOGIRI”.

SO. ONTO THE CODE. These are the pretty lines I came up with:

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

//The class name is just a placeholder so the method has somewhere to live
public class StationScraper
{
    public static void main(String[] args)
    {
        //Create the Selenium WebDriver object
        WebDriver driver = new FirefoxDriver();

        //Navigate to the subway locations page
        driver.get("http://wiki.openstreetmap.org/wiki/List_of_London_Underground_stations");

        //Find and store the table element
        WebElement table = driver.findElement(By.className("wikitable"));

        //Store all the rows in the table
        //NOTE: this XPath starts with "/", so it is matched against the whole page,
        //not just the table element found above
        List<WebElement> list = table.findElements(By.xpath("/html/body/div[3]/div[2]/div[4]/table[2]/tbody/tr"));

        //Go through each row in the table
        for(int x = 0; x < list.size(); x++)
        {
            //Store the columns of each row in a new list
            List<WebElement> newList = list.get(x).findElements(By.xpath("td"));
            //Print out the data of every column in the current row
            for(int y = 0; y < newList.size(); y++)
            {
                System.out.println(y + " : " + newList.get(y).getText());
            }
        }

        //Hit it and quit it. I’M NOT A DOUCHE.
        driver.quit();
    }
}

And that, as they so annoyingly say, is that. For my next feat, I’m gonna try and figure out how to store this stuff as GeoJSON data. IS GON’ BE A REAL GOOD TIME. Lemme know if you want me to expand on the code above. I’ll do my best to be explanatory.

Now excuse me while I find a WordPress template that doesn’t make code look like some sort of abstract painting.

EDIT:

See here for code that you can read without going blind. I tried including it within the post, but APPARENTLY WordPress filters “unwanted” code to protect the user. BAH. I WILL FIGURE YOU OUT WORDPRESS!