Commit bfd7517

extended scraper; created README.md, add LICENSE
1 parent 62fdbba commit bfd7517

4 files changed

Lines changed: 479 additions & 71 deletions

LICENSE

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
The MIT License (MIT)
2+
3+
Copyright (c) 2016 Stefan Kasberger
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.
22+

README.md

Lines changed: 122 additions & 1 deletion
@@ -1,2 +1,123 @@
1-
# Scrape_ADC
1+
Austrian aid scraper
2+
==============================
3+
The scraper extracts the project list of the Austrian Development Agency (ADA) from its website, where it is hard for machines to read. The automatically extracted information is then stored in machine-readable file formats (CSV, JSON) for later steps such as network analysis, visualization and statistical analysis.
4+
5+
Scrapes data for all development cooperation projects listed on the Austrian Development Cooperation website.
6+
7+
- Team: Gute Taten für gute Daten Project (Open Knowledge Austria)
8+
- Status: Production
9+
- Documentation: English
10+
- License: [MIT License](http://opensource.org/licenses/MIT)
11+
- Website: [Gute Taten für gute Daten project](http://okfn.at/gutedaten/)
12+
13+
**Used software**
14+
The source code is written in Python 2. It was created using [IPython](http://ipython.org/), [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) and [urllib2](https://docs.python.org/2/library/urllib2.html).
15+
16+
**Copyright**
17+
18+
All content is openly licensed under the [Creative Commons Attribution 4.0](http://creativecommons.org/licenses/by/4.0/) license, unless otherwise stated.
19+
20+
All source code is openly licensed under the [MIT license](http://opensource.org/licenses/MIT), unless otherwise stated.
21+
22+
## SCRAPER
23+
24+
**Description**
25+
26+
The scraper downloads the HTML of every table page of the project list and of each project page and stores it locally. The HTML is then parsed with BeautifulSoup4: every table is read row by row, cell by cell, and each project is stored as one entry in a JSON structure. The same data is also written to a CSV file.
27+
28+
**Run scraper**
29+
Go to the root folder of this repository and execute the following commands in your terminal:
30+
```
31+
cd code
32+
python aid-scraper.py
33+
```
34+
35+
## DATA INPUT
36+
The original data comes from the project list of the Austrian Development Agency (ADA), [published on their website](http://www.entwicklung.at/nc/zahlen-daten-und-fakten/projektliste/?tx_sysfirecdlist_pi1[test]=test&tx_sysfirecdlist_pi1[mode]=1&tx_sysfirecdlist_pi1[sort]=uid%3A1&tx_sysfirecdlist_pi1[pointer]=). The data consists of all contracts approved since 1 January 2010, in chronological order. The date of the last update can be found on the first table page as "Datum der letzten Aktualisierung".
37+
38+
### The Tables
39+
The tables are the primary source: most of the data is parsed from them. The data is published in the following structure (e.g. the first project in the list).
40+
41+
42+
| Vertragsnummer | Vertragstitel | Land/Region | OEZA/ADA-Vertragssumme | Vertragspartner |
43+
|----------------|---------------|-------------|------------------------|-----------------|
44+
| 2325-02/2016 | Programm zum Schutz der MenschenrechtsverteidigerInnen in der westlichen Region Guatemalas | Guatemala | EUR 64.300,00 | HORIZONT3000 - Österreichische Organisation für Entwicklungszusammena |
45+
46+
**Attributes**
47+
- Vertragsnummer: contract number of project.
48+
- Vertragstitel: title of project.
49+
- Land/Region: country or region where the project takes place.
50+
- OEZA/ADA-Vertragssumme: amount of money granted by contract.
51+
- Vertragspartner: partner(s) in the project.
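
As an illustration of the attributes above, here is a minimal sketch of how one table row could be mapped to the output keys and how the German-formatted amount could be converted into a number. It is written for Python 3 and is not taken from the scraper itself: the names `HEADER_TO_KEY`, `parse_amount` and `row_to_record` are hypothetical, and the README does not say whether the scraper converts the amount at all.

```python
from decimal import Decimal

# Assumed mapping from the German table headers to the output keys.
HEADER_TO_KEY = {
    "Vertragsnummer": "contract-number",
    "Vertragstitel": "contract-title",
    "Land/Region": "country-region",
    "OEZA/ADA-Vertragssumme": "OEZA-ADA-contract-volume",
    "Vertragspartner": "contract-partner",
}


def parse_amount(text):
    """Split a string such as 'EUR 64.300,00' into (currency, Decimal)."""
    currency, number = text.strip().split(None, 1)
    # German notation: '.' groups thousands, ',' marks the decimals.
    normalized = number.replace(".", "").replace(",", ".")
    return currency, Decimal(normalized)


def row_to_record(cells):
    """Map the five cell texts of one table row to the output keys."""
    keys = list(HEADER_TO_KEY.values())
    return {key: cell.strip() for key, cell in zip(keys, cells)}
```

Using `Decimal` instead of `float` keeps the cent values exact, which matters if the amounts are summed later.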
52+
53+
54+
### The project pages
55+
When you click on the contract title in a table, you get to the project page. It contains the same data as the table view, plus an additional description text (named "Beschreibung").
56+
57+
### Soundness
58+
- So far, we cannot say anything about the data quality (completeness, accuracy, etc.), but there are also no reasons so far to doubt it.
59+
60+
## DATA OUTPUT
61+
62+
**raw html**
63+
64+
The scraper downloads all raw html of each table and each project page.
65+
66+
**aid data JSON**
67+
68+
The parsed data is stored in an easy-to-read JSON file for further usage.
69+
```
70+
[
71+
{
72+
'contract-number': contract number of the project
73+
'contract-title': title of the project
74+
'country-region': country and/or region, where the project takes place
75+
'OEZA-ADA-contract-volume': amount of funding by the Austrian Development Agency
76+
'contract-partner': partner organisation(s)
77+
'description': description text of the project
78+
'url': url of the project page
79+
},
80+
]
81+
```
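
A minimal Python 3 sketch of writing and reading a file with this structure is given below. The function names and the `EXAMPLE` record are assumptions for illustration. The record reuses the table row shown earlier, and its `description` and `url` are left empty because the README does not show them.

```python
import json

# Example record built from the table row in this README.
# 'description' and 'url' are left empty: their values are not shown here.
EXAMPLE = {
    "contract-number": "2325-02/2016",
    "contract-title": "Programm zum Schutz der MenschenrechtsverteidigerInnen "
                      "in der westlichen Region Guatemalas",
    "country-region": "Guatemala",
    "OEZA-ADA-contract-volume": "EUR 64.300,00",
    "contract-partner": "HORIZONT3000 - Österreichische Organisation",
    "description": "",
    "url": "",
}


def save_projects(projects, path):
    """Write a list of project dicts to a UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps umlauts readable instead of \u escapes.
        json.dump(projects, handle, ensure_ascii=False, indent=2)


def load_projects(path):
    """Read the list of project dicts back from the JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Note that valid JSON needs double-quoted keys and commas between entries. The listing above only describes the fields.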
82+
83+
**aid data csv**
84+
85+
The parsed data is stored in a human-readable CSV file for further usage.
86+
87+
columns (see attribute description above):
88+
- contract-number
89+
- contract-title
90+
- OEZA-ADA-contract-volume
91+
- contract-partner
92+
- country-region
93+
- description
94+
- url
95+
96+
row: one project each row.
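
The CSV layout described above could be produced along these lines. This is a Python 3 sketch using the standard `csv` module, with `COLUMNS` and `write_csv` as hypothetical names, and the scraper's own implementation may differ.

```python
import csv

# Column order as listed in this README.
COLUMNS = [
    "contract-number",
    "contract-title",
    "OEZA-ADA-contract-volume",
    "contract-partner",
    "country-region",
    "description",
    "url",
]


def write_csv(projects, path):
    """Write one project per row, with a header line, to a UTF-8 CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        # Keys missing from a project dict are written as empty cells.
        writer.writerows(projects)
```

`DictWriter` quotes cells that contain commas or line breaks, so long description texts do not break the row structure.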
97+
98+
## STRUCTURE
99+
- [README.md](README.md): Overview of repository
100+
- [CHANGELOG.md](README.md): Overview of repository
101+
102+
## TODO
103+
**important**
104+
- verify the data
105+
- research: is there a difference between approved funding and paid one?
106+
107+
**optional**
108+
- create a dataset for network analysis: JSON and CSV for Gephi and NetworkX
109+
- update code to Python3
110+
- compare data from tables with data from project pages.
111+
112+
**new features**
113+
- analyze and visualize the data: NetworkX, maps, Gephi
114+
- add country codes for easier combination with other data
115+
116+
## CHANGELOG
117+
### Version 0.2 - 2015-10-29
118+
**init repo**
119+
120+
121+
122+
123+

Scrape_ADC.py

Lines changed: 0 additions & 70 deletions
This file was deleted.
