I am trying to build a topic hierarchy by following these two DBpedia properties:
- skos:broader property
- dcterms:subject property
My intention is, given a word or phrase, to identify its topic. For example, given the phrase 'support vector machine', I want to identify topics such as classification algorithms, machine learning, etc.
However, I am a bit confused about how to build the topic hierarchy, as I am getting more than 5 URIs for the subject property and many URIs for the broader property. Is there a way to measure strength (or something similar), reduce the number of additional URIs that I get from DBpedia, and assign only the most probable URI?
It seems there are two questions there.
- How to limit the number of DBpedia Spotlight results.
- How to limit the number of subjects and categories for a particular result.
My current code is as follows.
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse
## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = 'First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. After World War II, the city became divided into East Berlin -- the capital of East Germany -- and West Berlin, a West German exclave surrounded by the Berlin Wall from 1961–89. Following German reunification in 1990, the city regained its status as the capital of Germany, hosting 147 foreign embassies.'
CONFIDENCE = '0.5'
SUPPORT = '120'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []
r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])

for url in all_urls:
    sparql.setQuery("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX dct:  <http://purl.org/dc/terms/>
        SELECT * WHERE { <""" + url + """> skos:broader|dct:subject ?resource }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for result in results["results"]["bindings"]:
        print('resource ---- ', result['resource']['value'])
I am happy to provide more examples if needed.
It seems you are trying to retrieve Wikipedia categories relevant to a given paragraph.
Minor suggestions
First, I'd suggest performing a single SPARQL request, collecting the DBpedia Spotlight results into a VALUES block. Second, if you are talking about a topic hierarchy, you should use SPARQL 1.1 property paths.
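As a minimal sketch of the first suggestion (not necessarily the exact query intended here), the per-URI loop from the question can be collapsed into one request by injecting all Spotlight URIs into a VALUES block:

```python
def build_values_query(uris):
    """Build one SPARQL query that retrieves the 'next-level' categories
    (dct:subject/skos:broader) for all annotated entities at once."""
    values = '\n            '.join('<{}>'.format(u) for u in uris)
    return """
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?entity ?category WHERE {
        VALUES ?entity {
            %s
        }
        ?entity dct:subject/skos:broader ?category
    }
    """ % values

# Example with two entity URIs as Spotlight might return for the Berlin text:
query = build_values_query([
    'http://dbpedia.org/resource/Berlin',
    'http://dbpedia.org/resource/Germany',
])
print(query)
```

The resulting string can then be sent once via `sparql.setQuery(query)` exactly as in the question's code, instead of issuing one request per URI.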
These two suggestions are slightly incompatible: Virtuoso is very inefficient when a query contains both multiple starting points (i.e. VALUES) and arbitrary-length paths (i.e. the * and + operators). Below I'm using the dct:subject/skos:broader property path, i.e. retrieving the 'next-level' categories.
Approach 1
The first way is to order resources by their general popularity, e.g. their PageRank:
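A sketch of such a query, assuming the endpoint also hosts the precomputed DBpedia PageRank dataset exposed through the vrank vocabulary (the graph IRI below comes from that dataset's public documentation and may differ or be absent on other endpoint versions):

```python
# Order the next-level categories of the annotated entities by the
# categories' PageRank scores (highest first).
PAGERANK_QUERY = """
PREFIX dct:   <http://purl.org/dc/terms/>
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#>
PREFIX vrank: <http://purl.org/voc/vrank#>
SELECT DISTINCT ?category ?rank
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
    VALUES ?entity { <http://dbpedia.org/resource/Berlin>
                     <http://dbpedia.org/resource/German_reunification> }
    ?entity dct:subject/skos:broader ?category .
    ?category vrank:hasRank/vrank:rankValue ?rank .
}
ORDER BY DESC(?rank)
LIMIT 10
"""
print(PAGERANK_QUERY)
```

Run it with `sparql.setQuery(PAGERANK_QUERY)` as in the question's code; the two hard-coded entity URIs stand in for whatever Spotlight returns.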
Results are:
Approach 2
The second way is to calculate category frequency for a given text:
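A minimal sketch of the frequency idea (toy data and `rank_categories` are illustrative, not part of the original code): count how many distinct annotated entities each category covers, and prefer the most frequent ones.

```python
from collections import Counter

def rank_categories(entity_categories):
    """Given {entity_uri: [category_uri, ...]} for all entities Spotlight
    found in a text, rank categories by how many distinct entities
    they cover."""
    counts = Counter()
    for categories in entity_categories.values():
        counts.update(set(categories))  # count each category once per entity
    return counts.most_common()

# Toy data (hypothetical URIs) standing in for real SPARQL results:
ranked = rank_categories({
    'dbr:Berlin':  ['dbc:Capitals_in_Europe', 'dbc:German_state_capitals'],
    'dbr:Germany': ['dbc:Capitals_in_Europe', 'dbc:Member_states_of_NATO'],
})
print(ranked[0])  # → ('dbc:Capitals_in_Europe', 2)
```

Categories shared by several entities in the text rise to the top, which filters out the many one-off categories that cause the "too many URIs" problem from the question.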
Results are:
With dct:subject instead of dct:subject/skos:broader, results are better.
Conclusion
Results are not very good. I see two reasons: DBpedia categories are quite random, and the tools are quite primitive. Perhaps it is possible to achieve better results by combining approaches 1 and 2. In any case, experiments with a large corpus are needed.