Monthly Archives: April 2013

I do miss Pokemon. ANYWAYS, onto the meat of the post. Last time, I spoke about organising data into JSON objects that I could work with. I also told you I was only doing that so I could understand JSON before moving onto GeoJSON. Well. Guess what. GeoJSON IS JUST A JSON OBJECT WITH A PARTICULAR STRUCTURE. Talk about an anticlimax.

HERE is the final code for scraping HTML and storing it as JSON objects. And HERE is the new code for doing the same thing but storing it as a GeoJSON object. For a detailed breakdown of what a GeoJSON object looks like, go HERE. If you can’t be bothered to read through it and figure it out, usually I would tell you to leave me alone. But I’m in a giving mood so I’m gonna explain it.

  • A GeoJSON object is ONE JSON object.
  • It is made up of TWO properties:
    • A “features” property
    • A “type” property
  • The “type” property is always set to “FeatureCollection” (side note: JSON strings are case-sensitive and so are GeoJSON type names, so copy the capitalisation exactly)
  • The “features” property is always an array containing objects that represent your geographical data
  • Each object representing your geographical data contains three properties:
    • A “type” property
    • A “geometry” property
    • A “properties” property
  • The “type” property is always set to “Feature” as that is what each geographical object is called
  • The “geometry” property always contains an object that represents your exact geographical data and has two properties:
    • A “type” property that defines the type of geometry (Point, MultiPoint, LineString, Polygon, etc.)
    • A “coordinates” property that contains the numbers representing your geographical data. For a Point it is an array of numbers, longitude first and then latitude; for the other types it is an array of arrays of numbers, nested deeper for shapes like polygons. Easy on the brain and tongue, right?
  • The “properties” property represents an object that contains any properties you want this specific feature to have a la typical JSON fashion
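
If you want to see that structure as code, here is a minimal sketch. It is NOT the code from my gist: it builds the text by hand with plain Java strings so it runs without any library, and the class name, method names, station name, and coordinates are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class GeoJsonSketch {

    // Builds one "Feature" object whose geometry is a single Point.
    // GeoJSON positions are written as [longitude, latitude], in that order.
    static String pointFeature(String name, double latitude, double longitude) {
        return "{\"type\":\"Feature\","
            + "\"geometry\":{\"type\":\"Point\",\"coordinates\":[" + longitude + "," + latitude + "]},"
            + "\"properties\":{\"name\":\"" + escape(name) + "\"}}";
    }

    // Wraps already-built feature strings in the outer "FeatureCollection" object.
    static String featureCollection(List<String> features) {
        return "{\"type\":\"FeatureCollection\",\"features\":[" + String.join(",", features) + "]}";
    }

    // Escapes only backslashes and double quotes; a real JSON library handles far more.
    static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        List<String> features = new ArrayList<>();
        // Hypothetical station and coordinates, not real data.
        features.add(pointFeature("Example Station", 51.5, -0.1));
        System.out.println(featureCollection(features));
    }
}
```

Each station becomes one Feature, and all the Features go into the single “features” array. With a JSON library you would build the same nesting out of its object and array types instead of gluing strings together.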

DAS IT. Again, look at the spec page I linked to for more detailed information. If you look at my code, all I’m doing is translating the above into computer speak. It feels awkward at first because you’re creating an object that represents a property and an array, and the array represents objects that represent a property, an object that represents more properties and arrays, and an object that represents even more properties.

Please go away.

Sooo, if you read this post, you’ll know that I scraped some subway information off the Internets. Now that I have the data, I need to organise it all pretty like. How do I do this? JSON!

Why JSON, you ask? Decent question. I chose JSON because I want to use GeoJSON for a project. Because I don’t know enough about GeoJSON yet, I’m just gonna store the data as a JSON object for now and then convert it later when I’m more comfortable in these JSON waterfalls.

SCI, SLOW YOUR HORSES DOWN AND EXPLAIN WHAT THE HECK JSON IS IN THE FIRST PLACE. Oof, fiiiineee. So touchy. Let’s go to the Internet’s Hal 9000 for the answer:

JSON, or JavaScript Object Notation, is a text-based open standard designed for human-readable data interchange.

In layman’s terms, it’s the JavaScript way to represent objects. Yippee. I’m too lazy to go through an example explaining the structure, so Google if you’re BOVVERED.

Now. From my previous post, I have a list of station objects. Each object contains various buckets of data, but I only need three parts: station name, latitude, and longitude. So that’s exactly what I did in my code – extract the desired data, store it as a JSON object, and add that JSON object to a JSON array. I’m sure you want to see the code that does this. And I really want to show it to you. But I just realised that if I update the gist from the previous post it might make the previous post nonsensical. ALL THIS because WordPress doesn’t provide code formatting. Sigh. Here’s the new JSON-specific code in all its ugly format glory…NEVERMIND. I have updated the gist and it turns out I can link to a specific revision. THANK YOU GITHUB. Go HERE for the code. Let me break down the new parts:

  • First of all, I used the JSON.simple library to do all of this. Go HERE to download the jar. I also used mkywong’s tutorial to figure out what goes where.
  • OKAY. So. I created a JSONArray object called stations. This is where I’m going to store each station object.
  • NEXT. Every time I looped through a row in the table with the station data, I created a new JSONObject object and used it to store the station name, latitude, and longitude. I then added this station object to the stations array.
  • Lastly, I created a FileWriter object and a File object and wrote my stations array to the file by using the toJSONString( ) method from the JSON.simple library. This converts your JSONObject or JSONArray into readable JSON format.
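
For the curious, the steps above can be sketched roughly like this. Big assumption alert: this is NOT the gist code and it does not use JSON.simple. To keep it runnable with nothing but the JDK, a List stands in for the JSONArray, a Map stands in for each JSONObject, and a tiny hand-written toJson helper stands in for toJSONString( ). The class name, key names, file name, and station values are all hypothetical.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

final class StationsJsonSketch {

    // Stand-in for a JSONObject: one station holding the three values I care about.
    static Map<String, Object> station(String name, double latitude, double longitude) {
        Map<String, Object> station = new LinkedHashMap<>(); // keeps insertion order
        station.put("name", name);
        station.put("latitude", latitude);
        station.put("longitude", longitude);
        return station;
    }

    // Stand-in for toJSONString(): only handles string and number values.
    static String toJson(List<Map<String, Object>> stations) {
        StringJoiner array = new StringJoiner(",", "[", "]");
        for (Map<String, Object> station : stations) {
            StringJoiner object = new StringJoiner(",", "{", "}");
            for (Map.Entry<String, Object> entry : station.entrySet()) {
                Object value = entry.getValue();
                String text = (value instanceof Number) ? value.toString() : quote(value.toString());
                object.add(quote(entry.getKey()) + ":" + text);
            }
            array.add(object.toString());
        }
        return array.toString();
    }

    // Wraps text in double quotes, escaping backslashes and double quotes only.
    static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Writes the whole stations array to a file in one go.
    static void write(List<Map<String, Object>> stations, File target) throws IOException {
        try (FileWriter writer = new FileWriter(target)) {
            writer.write(toJson(stations));
        }
    }

    public static void main(String[] args) throws IOException {
        List<Map<String, Object>> stations = new ArrayList<>();
        // In the real thing this line runs once per table row; these values are made up.
        stations.add(station("Example Station", 51.5, -0.1));
        write(stations, new File("stations.json"));
    }
}
```

The try-with-resources block closes the FileWriter even if the write fails, which saves you from a half-written file handle hanging around.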

DAS IT. DONE.

I really hate the template I’m using.

So you want data off a webpage. Lots of data. Theoretically, you could get it manually by going through it bit by bit. Practically, I FALL ASLEEP. What you need to do is WEB SCRAPE!

What is web scraping? Lemme google. K, it’s a fancy name for ways of getting data off of websites. So, for example, I wanted the coordinates of the London subway stations. I found this handy page; it has exactly the information I need. Problem? THERE’S TOO MUCH. I am not going through that table one by one and storing the data manually.

“But, Sci, how else are you gonna get the data?”, I hear you asking. Lemme tell you. I’m gonna web scrape it like it’s never been web scraped before, and I’m going to use Selenium to do it.

Why Selenium? Because I KNOW Selenium. This means I’m saving oodles of time by not having to learn something completely new for a relatively minor task. Lazy, right? Right, but sometimes results are more important than the method. THIS IS NOT TO SAY YOU SHOULDN’T LEARN NEW THINGS, DEAR READER. Oh no. In fact, I came across Python’s Beautiful Soup and Ruby’s Nokogiri and Mechanize while looking for the best way to web scrape. Because they’re libraries SPECIFIC to web scraping, they’re probably waaaayyyyy better at the task than Selenium. So, read up on them. I plan to. Especially Nokogiri, because I want someone to ask me “how did you do this?” and I want to be able to scream “NOKOGIRI”.

SO. ONTO THE CODE. These are the pretty lines I came up with:

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

//The class name is a placeholder; any name works
public class StationScraper
{
    public static void main(String[] args)
    {
        //Create the Selenium WebDriver object
        WebDriver driver = new FirefoxDriver();

        //Navigate to the subway locations page
        driver.get("http://wiki.openstreetmap.org/wiki/List_of_London_Underground_stations");

        //Find and store the table element
        WebElement table = driver.findElement(By.className("wikitable"));

        //Store all the rows in the table
        //(an XPath starting with / is resolved from the document root, not from the table element)
        List<WebElement> list = table.findElements(By.xpath("/html/body/div[3]/div[2]/div[4]/table[2]/tbody/tr"));

        //Go through each row in the table
        for(int x = 0; x < list.size(); x++)
        {
            //Store the columns of each row in a new list
            List<WebElement> newList = list.get(x).findElements(By.xpath("td"));

            //Print out the data of every column in the current row
            for(int y = 0; y < newList.size(); y++)
            {
                System.out.println(y + " : " + newList.get(y).getText());
            }
        }

        //Hit it and quit it. I'M NOT A DOUCHE.
        driver.quit();
    }
}

And that, as they so annoyingly say, is that. For my next feat, I’m gonna try and figure out how to store this stuff as GeoJSON data. IS GON’ BE A REAL GOOD TIME. Lemme know if you want me to expand on the code above. I’ll do my best to be explanatory.

Now excuse me while I find a WordPress template that doesn’t make code look like some sort of abstract painting.

EDIT:

See here for code that you can read without going blind. I tried including it within the post, but APPARENTLY WordPress filters “unwanted” code to protect the user. BAH. I WILL FIGURE YOU OUT WORDPRESS!

In the original version of Selenium, you started the selenium server and used it to check if an element is present by calling its isElementPresent( ) method. An example:

assertTrue(selenium.isElementPresent("pageBanner"));

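In Selenium 2, you call the WebElement method isDisplayed( ) on the element itself, found using the web driver. Note that findElement( ) throws a NoSuchElementException if nothing matches, so the lookup itself doubles as the presence check, while isDisplayed( ) tells you whether the element is actually visible. An example:

WebElement banner = driver.findElement(By.id("pageBanner"));

assertTrue(banner.isDisplayed());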