|
1 | | -# Scrape_ADC |
| 1 | +Austrian aid scraper |
| 2 | +============================== |
| 3 | +The scraper extracts project information from the project list published by the Austrian Development Agency (ADA). The automatically extracted information is then stored in different file formats (CSV, JSON), which are relevant for the next steps, like network analysis, visualization and statistical analysis. |
| 4 | + |
2 | 5 | Scrapes data for all development cooperation projects on the Austrian Development Cooperation website. |
| 6 | + |
| 7 | +- Team: Gute Taten für gute Daten Project (Open Knowledge Austria) |
| 8 | +- Status: Production |
| 9 | +- Documentation: English |
| 10 | +- License: [MIT License](http://opensource.org/licenses/MIT) |
| 11 | +- Website: [Gute Taten für gute Daten project](http://okfn.at/gutedaten/) |
| 12 | + |
| 13 | +**Used software** |
| 14 | +The source code is written in Python 2. It was created using [IPython](http://ipython.org/), [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) and [urllib2](https://docs.python.org/2/library/urllib2.html). |
| 15 | + |
| 16 | +**Copyright** |
| 17 | + |
| 18 | +All content is openly licensed under the [Creative Commons Attribution 4.0](http://creativecommons.org/licenses/by/4.0/) license, unless otherwise stated. |
| 19 | +
| 20 | +All source code is openly licensed under the [MIT license](http://opensource.org/licenses/MIT), unless otherwise stated. |
| 21 | + |
| 22 | +## SCRAPER |
| 23 | + |
| 24 | +**Description** |
| 25 | + |
| 26 | +The scraper fetches the HTML for the URLs passed in from a CSV file and stores it locally. The HTML is then parsed with BeautifulSoup4. Every table is parsed row by row, cell by cell, and stored in a JSON structure with one entry per project. The same data is also written to a CSV file. |
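
Row-by-row, cell-by-cell table parsing can be sketched as follows. This is not the scraper's own code: it uses Python 3 and the standard library `html.parser` instead of BeautifulSoup4 so that it runs without extra dependencies. The class name `TableRowParser`, the helper `parse_rows` and the sample markup are hypothetical; the real page layout may differ.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join fragments and collapse surrounding/internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # rows without <td> cells (e.g. headers) are skipped
                self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Return all data rows found in the given HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample markup, only for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Vertragsnummer</th><th>Land/Region</th></tr>
  <tr><td>2325-02/2016</td><td> Guatemala </td></tr>
</table>
"""

rows = parse_rows(SAMPLE_HTML)
```

Each entry in `rows` is one table row as a list of cell strings, which can then be mapped to the output fields.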
| 27 | + |
| 28 | +**Run scraper** |
| 29 | +Go into the root folder of this repository and execute the following commands in your terminal: |
| 30 | +``` |
| 31 | +cd code |
| 32 | +python aid-scraper.py |
| 33 | +``` |
| 34 | + |
| 35 | +## DATA INPUT |
| 36 | +The original data is from the project list of the Austrian Development Agency (ADA) [published on their website](http://www.entwicklung.at/nc/zahlen-daten-und-fakten/projektliste/?tx_sysfirecdlist_pi1[test]=test&tx_sysfirecdlist_pi1[mode]=1&tx_sysfirecdlist_pi1[sort]=uid%3A1&tx_sysfirecdlist_pi1[pointer]=). The data covers all contracts approved since January 1, 2010, in chronological order. The date of the last update can be found on the first table page as "Datum der letzten Aktualisierung". |
| 37 | + |
| 38 | +### The Tables |
| 39 | +The tables are the primary source; most of the data is parsed from them. The data is published in the following structure (e.g. the first project in the list): |
| 40 | + |
| 41 | + |
| 42 | +| Vertragsnummer | Vertragstitel | Land/Region | OEZA/ADA-Vertragssumme | Vertragspartner | |
| 43 | +|----------------|---------------|-------------|------------------------|-----------------| |
| 44 | +| 2325-02/2016 | Programm zum Schutz der MenschenrechtsverteidigerInnen in der westlichen Region Guatemalas | Guatemala | EUR 64.300,00 | HORIZONT3000 - Österreichische Organisation für Entwicklungszusammena | |
| 45 | + |
| 46 | +**Attributes** |
| 47 | +- Vertragsnummer: contract number of project. |
| 48 | +- Vertragstitel: title of project. |
| 49 | +- Land/Region: country or region where the project takes place. |
| 50 | +- OEZA/ADA-Vertragssumme: amount of money granted by contract. |
| 51 | +- Vertragspartner: partner(s) in the project. |
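
A minimal sketch of how one table row could be turned into a record, assuming Python 3 and the field names listed under DATA OUTPUT. The helpers `parse_amount` and `row_to_record` are hypothetical; whether the scraper itself converts the contract volume to a number is not stated here.

```python
from decimal import Decimal

# Column order as shown in the table, mapped to the field names
# used in the JSON and CSV output.
COLUMN_KEYS = [
    "contract-number",
    "contract-title",
    "country-region",
    "OEZA-ADA-contract-volume",
    "contract-partner",
]


def parse_amount(text):
    """Convert a string such as 'EUR 64.300,00' to a Decimal."""
    number = text.replace("EUR", "").strip()
    # German notation: '.' groups thousands, ',' marks the decimals.
    number = number.replace(".", "").replace(",", ".")
    return Decimal(number)


def row_to_record(cells):
    """Pair the five table cells with their field names."""
    if len(cells) != len(COLUMN_KEYS):
        raise ValueError(
            "expected %d cells, got %d" % (len(COLUMN_KEYS), len(cells))
        )
    return dict(zip(COLUMN_KEYS, cells))


record = row_to_record([
    "2325-02/2016",
    "Programm zum Schutz der MenschenrechtsverteidigerInnen "
    "in der westlichen Region Guatemalas",
    "Guatemala",
    "EUR 64.300,00",
    "HORIZONT3000",  # shortened partner name, for illustration only
])
amount = parse_amount(record["OEZA-ADA-contract-volume"])
```

Using `Decimal` rather than `float` keeps the cent values exact, which matters when contract volumes are summed later.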
| 52 | + |
| 53 | + |
| 54 | +### The project pages |
| 55 | +When you click on the contract title in a table, you get to the project page. It contains the same data as the table view, plus an additional description text (named "Beschreibung"). |
| 56 | + |
| 57 | +### Soundness |
| 58 | +- So far, we cannot say anything about the data quality (completeness, accuracy, etc.), but there is also no reason so far to doubt it. |
| 59 | + |
| 60 | +## DATA OUTPUT |
| 61 | + |
| 62 | +**raw html** |
| 63 | + |
| 64 | +The scraper downloads all raw html of each table and each project page. |
| 65 | + |
| 66 | +**aid data JSON** |
| 67 | + |
| 68 | +The parsed data is stored in an easy-to-read JSON file for further usage. |
| 69 | +``` |
| 70 | +[ |
| 71 | + { |
| 72 | + 'contract-number': contract number of the project |
| 73 | + 'contract-title': title of the project |
| 74 | + 'country-region': country and/or region, where the project takes place |
| 75 | + 'OEZA-ADA-contract-volume': amount of funding by the Austrian Development Agency |
| 76 | + 'contract-partner': partner organisation(s) |
| 77 | + 'description': description text of the project |
| 78 | + 'url': url of the project page |
| 79 | + }, |
| 80 | +] |
| 81 | +``` |
| 82 | + |
| 83 | +**aid data csv** |
| 84 | + |
| 85 | +The parsed data is stored in a human-readable CSV file for further usage. |
| 86 | + |
| 87 | +columns (see attribute description above): |
| 88 | +- contract-number |
| 89 | +- contract-title |
| 90 | +- OEZA-ADA-contract-volume |
| 91 | +- contract-partner |
| 92 | +- country-region |
| 93 | +- description |
| 94 | +- url |
| 95 | + |
| 96 | +rows: one project per row. |
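
Writing the parsed records in both formats could look roughly like this. It is a sketch in Python 3 using only the standard library, not the scraper's actual output code; the function names and the placeholder values for `description` and `url` are assumptions.

```python
import csv
import io
import json

# Column order as listed for the CSV file.
CSV_COLUMNS = [
    "contract-number",
    "contract-title",
    "OEZA-ADA-contract-volume",
    "contract-partner",
    "country-region",
    "description",
    "url",
]


def records_to_csv(records):
    """Serialize a list of project dicts to CSV text, one project per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def records_to_json(records):
    """Serialize the same records to JSON, keeping umlauts readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


example = {
    "contract-number": "2325-02/2016",
    "contract-title": "Programm zum Schutz der MenschenrechtsverteidigerInnen "
                      "in der westlichen Region Guatemalas",
    "OEZA-ADA-contract-volume": "EUR 64.300,00",
    "contract-partner": "HORIZONT3000",    # shortened, for illustration only
    "country-region": "Guatemala",
    "description": "(placeholder)",         # not taken from the real page
    "url": "(placeholder)",                 # not taken from the real page
}

csv_text = records_to_csv([example])
json_text = records_to_json([example])
```

To write files instead of strings, the same `csv.DictWriter` can be given a file opened with `newline=""` and an explicit `encoding="utf-8"`.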
| 97 | + |
| 98 | +## STRUCTURE |
| 99 | +- [README.md](README.md): Overview of repository |
| 100 | +- [CHANGELOG.md](CHANGELOG.md): Changelog of the repository |
| 101 | + |
| 102 | +## TODO |
| 103 | +**important** |
| 104 | +- verify the data |
| 105 | +- research: is there a difference between approved funding and funding actually paid out? |
| 106 | + |
| 107 | +**optional** |
| 108 | +- create datasets for network analysis: JSON and CSV for Gephi and NetworkX |
| 109 | +- update code to Python 3 |
| 110 | +- compare data from tables with data from project pages. |
| 111 | +
| 112 | +**new features** |
| 113 | +- analyze and visualize the data: NetworkX, maps, Gephi |
| 114 | +- add country codes to make combining with other data easier |
| 115 | + |
| 116 | +## CHANGELOG |
| 117 | +### Version 0.2 - 2015-10-29 |
| 118 | +**init repo** |
| 119 | + |
| 120 | + |
| 121 | + |
| 122 | + |
| 123 | + |