Posts Tagged ‘json’

Extract the significant parts of a web page

October 1, 2017 Leave a comment

You want to extract the significant parts of a web page: title, author, date of publication, body, etc.

Mercury Web Parser does exactly this. It’s free. After registration you get an API key. Their web service returns a structured JSON response. I tried it with my previous post:

curl -H "x-api-key: <my_api_key>" "" | python3 -m json.tool


    "title": "Re-run a command in the terminal every X\u00a0seconds",
    "author": "Jabba Laci",
    "date_published": "2017-09-30T22:02:19.000Z",
    "dek": null,
    "lead_image_url": "",
    "content": "<div class=\"content\"> <p><strong>Problem</strong><br>\nYou want re-execute a command in the terminal every X seconds. For instance, you copy a lot of big files to a partition and you want to monitor the size of the free space on that partition.</p>\n<p><strong>Solution</strong><br>\nA naive and manual approach to the problem mentioned above is to execute the commands “<code>clear; df -h</code>” regularly, say every 2 seconds.</p>\n<p>A better way is to use the command “<code>watch</code>“. Usage:</p>\n<pre> watch -n 2 df -h </pre>\n<p>That is: execute “<code>df -h</code>” every two seconds. <code>watch</code> will also clear the screen and print the result to the top. You can quit with <code>Ctrl + c</code>.</p>\n<p>Tip from <a href=\"\">here</a>.</p> </div>",
    "next_page_url": null,
    "url": "",
    "domain": "",
    "excerpt": "Problem You want re-execute a command in the terminal every X seconds. For instance, you copy a lot of big files to a partition and you want to monitor the size of the free space on that partition.\u2026",
    "word_count": 108,
    "direction": "ltr",
    "total_pages": 1,
    "rendered_pages": 1

Pretty impressive.
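Once the response is loaded into a Python dict, the individual fields are directly addressable. A minimal sketch, using a trimmed copy of the response above:

```python
import json

# A trimmed sample of the JSON response shown above.
raw = """
{
    "title": "Re-run a command in the terminal every X seconds",
    "author": "Jabba Laci",
    "date_published": "2017-09-30T22:02:19.000Z",
    "word_count": 108
}
"""

article = json.loads(raw)
print(article["title"])
print("{} ({})".format(article["author"], article["date_published"]))
```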

Categories: web


May 20, 2017 1 comment

I wrote a command-line program that outputs the full path of every key / value in a JSON file.


$ ./ sample.json
root.a => 1
root.b.c => 2
root.b.friends[0].best => Alice
root.b.friends[1].second => Bob
root.b.friends[2][0] => 5
root.b.friends[2][1] => 6
root.b.friends[2][2] => 7
root.b.friends[3][0].one => 1
root.b.friends[3][1].two => 2

More information is available on the project’s GitHub page.
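The core of such a tool is a short recursive walk over the parsed structure. Here is a minimal sketch (the `walk` function and the sample data are mine, not the project’s code):

```python
import json

def walk(node, path="root"):
    """Yield a "path => value" line for every leaf in a parsed JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, "{}.{}".format(path, key))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from walk(value, "{}[{}]".format(path, i))
    else:
        yield "{} => {}".format(path, node)

data = json.loads('{"a": 1, "b": {"c": 2, "friends": [{"best": "Alice"}, [5, 6]]}}')
for line in walk(data):
    print(line)
# root.a => 1
# root.b.c => 2
# root.b.friends[0].best => Alice
# root.b.friends[1][0] => 5
# root.b.friends[1][1] => 6
```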

Categories: bash, python

Detailed Twitter info in JSON: an undocumented feature

October 24, 2016 Leave a comment

Using a script, I wanted to figure out the number of my followers on Twitter. Here is my (mostly abandoned) Twitter page: . I didn’t want to use any API since I didn’t want to register for an API key, so I took the easy way: let’s scrape the necessary data out :) Digging into the HTML code, I found the number of followers, but I also found a hidden treasure!

And the hidden treasure is a long JSON string that contains all kinds of information about a Twitter user:


The screenshot shows just an extract; the JSON string is much longer. Fine, let’s get it!

#!/usr/bin/env python3
# coding: utf-8

import json

import requests
from bs4 import BeautifulSoup

def main():
    url = input("Full twitter URL: ")
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")

    # the profile data is embedded in a hidden <input> tag
    tag = soup.find('input', {'class': 'json-data'})
    j = tag['value']
    d = json.loads(j)
    json_out = json.dumps(d, indent=4)
    print(json_out)

    # followers = d['profile_user']['followers_count']
    # print(followers)


if __name__ == "__main__":
    main()
If you want just the number of followers, for instance, then uncomment the two commented lines in main().

Thank you Twitter! It’s really nice of you to provide all this data in JSON!

The JSON that I could extract from my page is 743 lines long! Here is an extract of it:

"profile_image_url": "",
"business_profile_state": "none",
"url": null,
"profile_background_image_url_https": "",
"screen_name": "szathmar",
"is_translator": false,
"friends_count": 123,
"followers_count": 70,
"profile_text_color": "333333",
"profile_link_color": "FF3300",
"translator_type": "none",
"profile_background_color": "709397",
Categories: python

Show my position on the map

June 21, 2016 Leave a comment

The site gives you back not only your IP address, but your geolocation too. Example (with a fake IP):

$ curl
  "ip": "734.675.653.542",
  "hostname": "No Hostname",
  "city": "Debrecen",
  "region": "Debrecen",
  "country": "HU",
  "loc": "47.5333,21.6333",
  "org": "...",
  "postal": "..."

Let’s visualize my location:

<img src=",21.6333&zoom=9&size=480x240&sensor=false">

Debrecen, Hungary, center of the world :)
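The `loc` field is just a "latitude,longitude" string, so a script can split it into numbers before building the map URL. A minimal sketch using the (fake) response above:

```python
import json

# The curl output from above, trimmed, as JSON (with the fake IP).
raw = """
{
    "ip": "734.675.653.542",
    "city": "Debrecen",
    "country": "HU",
    "loc": "47.5333,21.6333"
}
"""

info = json.loads(raw)
lat, lon = (float(x) for x in info["loc"].split(","))
print("{}, {}: {} / {}".format(info["city"], info["country"], lat, lon))
```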

Categories: Uncategorized

Firefox: restore your lost tabs

April 30, 2014 Leave a comment

Over the last 1.5-2 years, I collected 700+ tabs in my Firefox :) Maybe this summer I will have some time to sort them out. However, today when I switched my computer on, all my tabs were gone and I got a clean Firefox instance with just one tab. Hmm… I had a similar problem once, after which I installed an add-on called “Session Manager”. In that add-on I enabled the option to offer the list of previous sessions upon restart, but it didn’t do anything! Damn, how do I get my tab collection back?

In the .mozilla directory there is a file called sessionstore.js that stores, among other things, the open tabs. However, this file was very small; my previous tabs were clearly not in it. Thank God there was a backup copy next to it called sessionstore.bak. It was a big file, and its timestamp indicated that it had been created two days earlier, when everything was still OK with my tabs.

So, how to extract the old tabs from sessionstore.bak?

This is a JSON file, but it’s not pretty-printed. I suggest copying the file somewhere else where you can experiment with it. First, let’s make it readable:

$ python -m json.tool sessionstore.bak > session.json

Now you can open session.json with a text editor. You will find lines with a “url” key, but the number of such lines is huge. I had 731 tabs (that I lost), but this file contained 6500+ URLs. As I noticed, it also contains the URLs of closed tabs. How to extract the URLs of the open tabs only?

Again, Python came to my rescue. After analyzing the structure of this JSON file, I could extract the tab URLs the following way:

$ python  # version 2.7
>>> import json
>>> f = open('session.json')  # input file
>>> g = open('tabs.txt', 'w')  # output file
>>> d = json.load(f)
>>> tabs = d["windows"][0]["tabs"]
>>> cnt = 0
>>> for t in tabs:
...     print >>g, t["entries"][0]["url"]
...     cnt += 1
...
>>> cnt
731    # Yeah! All of them are here!
>>> g.close()
>>> f.close()

The URLs of the lost tabs are now in the tabs.txt file.

I didn’t turn it into a script, but feel free to do so. From now on I will make regular backups of my open tabs with the URL Lister add-on.
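A scripted version of the interactive session above might look like this. It’s a sketch under the same assumptions about the sessionstore layout (windows → tabs → entries), and it takes the pretty-printed file name on the command line:

```python
#!/usr/bin/env python3
import json
import sys

def extract_tab_urls(session):
    """Collect the URL of every open tab from a parsed sessionstore dict."""
    urls = []
    for window in session.get("windows", []):
        for tab in window.get("tabs", []):
            entries = tab.get("entries", [])
            if entries:
                urls.append(entries[0]["url"])
    return urls

def main():
    with open(sys.argv[1]) as f:
        session = json.load(f)
    for url in extract_tab_urls(session):
        print(url)

if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Unlike the interactive session, this walks every window, not just the first one.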

Categories: firefox, python

Google’s URL shortener

June 25, 2013 Leave a comment

You want to shorten a long URL from the command line / from a script.

There are lots of URL shorteners. With the Google URL shortener you can do it like this:

curl -H 'Content-Type: application/json' -d '{"longUrl": ""}'

Sample output:

    "kind": "urlshortener#url",
    "id": "",
    "longUrl": ""

Let’s do it in Python using the requests module:

import requests
import json

url = ""
data = {"longUrl": ""}
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
r =, data=json.dumps(data), headers=headers)
print r.text
print 'Short URL:', r.json()["id"]


jq — a lightweight and flexible command-line JSON processor

October 21, 2012 Leave a comment

“jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.

jq is written in portable C, and it has zero runtime dependencies.

jq can mangle the data format that you have into the one that you want with very little effort…” (link)

Check out the tutorial here.

You can also use jq to pretty-print an ugly JSON file:

cat ugly_one_liner.json | jq '.'
Categories: bash