Archive for the ‘python’ Category

Java profiling

March 18, 2017

YourKit is a great Java profiling tool. It is commercial software, but you can get a free 15-day evaluation license key for a fully functional version of the profiler.

Last week I was working on a Java project and the software was very slow. Using YourKit, I could easily find the bottleneck. It turned out that 94% of the time was spent in a function that I had implemented in a naive way. After I switched to a better algorithm, the software got much faster. A profiler is really useful…

For Java I use Eclipse. YourKit integrates with Eclipse perfectly through an Eclipse plugin.

Python profiling

Here is an excellent post about Python profiling.
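
As a minimal sketch of the same "find the bottleneck" workflow in Python, here is one way to do it with the standard library's cProfile and pstats modules. This is not taken from the post mentioned above; slow_sum and profile_call are hypothetical names invented for illustration.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """A deliberately naive function: it re-sums a range in every iteration."""
    total = 0
    for i in range(n):
        total += sum(range(i))
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # sort by cumulative time so the most expensive call chains come first
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()
```

Calling profile_call(slow_sum, 2000) and printing the second element of the returned tuple gives a table in which the most expensive functions are listed first. For a whole script, running it as "python3 -m cProfile -s cumulative script.py" produces a similar table without changing the code.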

Categories: java, python

grab a Twitch video in mp3

January 17, 2017

Problem
You want to grab a Twitch video in mp3. For instance, you want to listen to it offline.

Solution
You need two programs for it: youtube-dl and ffmpeg. Let’s take a concrete example:

$ youtube-dl -g "https://www.twitch.tv/wearethevr/v/115335579"
https://vod067-ttvnw.akamaized.net/v1/AUTH_system/vods_c631/wearethevr_24261824064_585034506/chunked/index-dvr.m3u8
$ ffmpeg -i "https://vod067-ttvnw.akamaized.net/v1/AUTH_system/vods_c631/wearethevr_24261824064_585034506/chunked/index-dvr.m3u8" -f mp3 out.mp3

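Here https://www.twitch.tv/wearethevr/v/115335579 is the URL of this particular Twitch video. The first command prints the direct URL of the video stream, and ffmpeg then converts that stream to mp3.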

I wrote a script that automates the whole process: twitch2mp3.
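
The two steps can be chained from Python roughly as follows. This is a sketch under stated assumptions and not the source of twitch2mp3: the function names are hypothetical, and it assumes that youtube-dl and ffmpeg are both on the PATH.

```python
import subprocess


def stream_url_command(video_url):
    """Command that asks youtube-dl to print the direct stream URL."""
    return ["youtube-dl", "-g", video_url]


def ffmpeg_command(stream_url, out_file="out.mp3"):
    """Command that converts the stream to mp3, as in the example above."""
    return ["ffmpeg", "-i", stream_url, "-f", "mp3", out_file]


def twitch_to_mp3(video_url, out_file="out.mp3"):
    """Run both steps; requires youtube-dl and ffmpeg to be installed."""
    result = subprocess.run(
        stream_url_command(video_url),
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    urls = result.stdout.split()
    if not urls:
        raise RuntimeError("youtube-dl did not print a stream URL")
    # use the first URL that youtube-dl printed
    subprocess.run(ffmpeg_command(urls[0], out_file), check=True)
    return out_file
```

With both programs installed, twitch_to_mp3("https://www.twitch.tv/wearethevr/v/115335579") would run the same two commands as above. Passing the arguments as lists avoids shell quoting problems with the long stream URL.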

Categories: bash, python

Detailed Twitter info in JSON: an undocumented feature

October 24, 2016

Problem
Using a script, I wanted to figure out the number of my followers on Twitter. Here is my (mostly abandoned) Twitter page: https://twitter.com/szathmar . I didn’t want to use the API since I didn’t want to register for an API key, so I took the easy way: let’s scrape the necessary data out :) Digging in the HTML code, I found the number of followers, but I also found a hidden treasure!

Solution
And the hidden treasure is a long JSON string that contains all kinds of information about a Twitter user:

[Screenshot: hidden_json2]

The screenshot shows just an extract; the JSON string is much longer. Fine, let’s get it!

#!/usr/bin/env python3
# coding: utf-8

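import json
import readline  # not called directly; importing it enables line editing in input()
import sys

import requests
from bs4 import BeautifulSoup

def main():
    url = input("Full twitter URL: ")
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")

    # the data sits in the value attribute of an <input class="json-data"> tag
    tag = soup.find('input', {'class': 'json-data'})
    if tag is None:
        sys.exit("No json-data input found in the page source.")
    j = tag['value']
    d = json.loads(j)
    json_out = json.dumps(d, indent=4)
    print(json_out)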

    # followers = d['profile_user']['followers_count']
    # print(followers)

##############################################################################

if __name__ == "__main__":
    main()

If you want the number of followers for instance, then uncomment the last two lines.
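
To try the extraction step without network access, here is a small self-contained sketch. It swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party packages, and it works on a made-up HTML snippet: SAMPLE, JsonDataFinder and extract_profile are hypothetical names, and the snippet only mimics the structure described above.

```python
import json
from html.parser import HTMLParser


class JsonDataFinder(HTMLParser):
    """Remember the value attribute of the first <input class="json-data">."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "input" and "json-data" in classes and self.value is None:
            self.value = attrs.get("value")


def extract_profile(html):
    """Return the decoded JSON as a dict, or None if the tag is missing."""
    finder = JsonDataFinder()
    finder.feed(html)
    if finder.value is None:
        return None
    return json.loads(finder.value)


# made-up snippet for illustration; the real page carries far more data
SAMPLE = (
    '<input type="hidden" class="json-data" '
    'value="{&quot;profile_user&quot;: {&quot;followers_count&quot;: 70}}">'
)
```

Here extract_profile(SAMPLE)["profile_user"]["followers_count"] gives the follower count, the same lookup as in the commented lines of the script. The parser decodes the &quot; entities in the attribute value before json.loads sees it.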

Thank you Twitter! It’s really nice of you to provide all these data in JSON!

Sample
The JSON that I could extract from my page is 743 lines long! Here is an extract of it:

...
"profile_image_url": "http://pbs.twimg.com/profile_images/459783802395430912/vcMT0CGX_normal.png",
"business_profile_state": "none",
"url": null,
"profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme6/bg.gif",
"screen_name": "szathmar",
"is_translator": false,
"friends_count": 123,
"followers_count": 70,
"profile_text_color": "333333",
"profile_link_color": "FF3300",
"translator_type": "none",
"profile_background_color": "709397",
...
Categories: python

[wordpress] using the old-style editor

October 16, 2016

Problem
WordPress.com introduced a new-style editor for writing posts a while ago. However, I really hate it; it’s unusable. How can you get back to the old-style editor?

Solution
By transforming the URL, you can get back to the old-style editor. For instance:

new style: https://wordpress.com/post/ubuntuincident.wordpress.com/5865
old style: https://ubuntuincident.wordpress.com/wp-admin/post.php?post=5865&action=edit

Let’s automate the task with Python:

#!/usr/bin/env python3
# coding: utf-8

import readline
import webbrowser

def main():
    url = input("New style URL: ").strip().rstrip("/")
    parts = url.split("/")
    # parts: ['https:', '', 'wordpress.com', 'post', '<blog host>', '<post id>']
    old = "{0}//{1}/wp-admin/post.php?post={2}&action=edit".format(
        parts[0], parts[4], parts[-1]
    )
    print("Old style:", old)
    webbrowser.open_new_tab(old)

##############################################################################

if __name__ == "__main__":
    main()
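
The same transformation can also be sketched with urllib.parse, which validates the shape of the URL instead of relying on fixed list positions. The function name to_old_style is hypothetical, and the only URL layout assumed is the one shown in the example above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def to_old_style(new_url):
    """Convert a new-style editor URL to the old-style wp-admin URL."""
    parts = urlsplit(new_url.strip())
    # expected path: /post/<blog host>/<post id>
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 3 or segments[0] != "post":
        raise ValueError("not a new-style editor URL: {0}".format(new_url))
    _, blog_host, post_id = segments
    query = urlencode({"post": post_id, "action": "edit"})
    return urlunsplit((parts.scheme, blog_host, "/wp-admin/post.php", query, ""))
```

For the new-style URL from the example, to_old_style returns the old-style URL listed next to it. A URL that does not match the expected layout raises a ValueError instead of opening a wrong page.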

Screenshots

New-style shit.

Old-style goodie.

purge a reddit account

August 2, 2016

Problem
You have a reddit account that you want to empty, i.e. delete all the posts and comments you have made.

Solution
Use Shreddit. It deletes a limited number of posts/comments in a session, so you may have to re-run it several times. When it finds nothing more to remove, you are done.

Categories: python Tags: ,

Jinja2-like template for PHP

August 1, 2016

Problem
My primary language is Python. When I need to do a simple webpage or a REST API, I use Flask with its built-in Jinja2 template engine.

However, I started to work on a project with some friends, and our UI developer chose PHP for the frontend. As I also want to contribute to the UI, I looked around among the PHP template engines to see if there is something similar to Jinja2.

Solution
It turned out that there is a Jinja2-like template engine for PHP! It’s called Twig and its syntax is almost the same. So if you use Flask, Twig is a natural choice for PHP.

There are also several MVC frameworks for PHP but I don’t use any (yet?). I have a PHP file (the controller), and a corresponding HTML file (the view, i.e. the template). Let’s see a simple example:

index.php:

<?php
require_once 'vendor/twig/lib/Twig/Autoloader.php';
Twig_Autoloader::register();

$loader = new Twig_Loader_Filesystem('templates');
$twig = new Twig_Environment($loader, array(
    // 'cache' => 'compilation_cache',
));

$context = array(
    'name' => 'Twig',
);

echo $twig->render('index.html', $context);
?>

index.html:

Hello {{ name }}!

It will print the text “Hello Twig!” to the screen.

What happens? The index.php file is the controller. Here you collect all the data that you want to print in the resulting HTML output. The data is put in an associative array (PHP’s counterpart of a Python dictionary), which is passed to the template file index.html.

You can enable the cache in the index.php file by uncommenting the 'cache' line. In this case the view will be “compiled” to a PHP file, making it faster. However, during development you’d better switch it off. As I noticed, when I change the source code, the cache is not always updated automatically. So if you enable the cache and change the source, don’t forget to purge the cache.
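
Since my primary language is Python, purging the cache can be scripted in a few lines. This is only a sketch: purge_cache is a hypothetical helper, and it assumes that the cache folder is called compilation_cache, as in the commented-out option in index.php, and that everything inside it can be deleted safely.

```python
import shutil
from pathlib import Path


def purge_cache(cache_dir="compilation_cache"):
    """Delete everything inside cache_dir, keep the folder itself.

    Returns the number of top-level entries that were removed.
    """
    cache = Path(cache_dir)
    if not cache.is_dir():
        return 0
    removed = 0
    for entry in cache.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # a subfolder with compiled templates
        else:
            entry.unlink()  # a plain file or a symlink
        removed += 1
    return removed
```

Run purge_cache() from the project root after changing a template. The folder itself is kept, so the cache setting in index.php stays valid.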

Project layout
My project structure looks like this:

.
├── compilation_cache
├── index.php
├── templates
│   └── index.html
└── vendor
    └── twig
        └── lib
            └── Twig
                ├── Autoloader.php
                └── ... (other files of the Twig template engine)

For security reasons, I think it’s a good idea to move the “vendor” folder somewhere that is not accessible via HTTP. That is, if your project is served from your “~/public_html” folder, move the “vendor” folder outside of “~/public_html”. I’m not sure, but the same may be true for the “compilation_cache” folder too.

Categories: php, python

Scraping AJAX web pages (Part 5.5)

July 13, 2016

Don’t forget to check out the rest of the series too!

This post is very similar to the previous one (Part 5), which scraped a webpage using PhantomJS from the command line and sent the output to stdout.

This time we use PhantomJS again, but we do it from a Python script and wrap Selenium around PhantomJS. The generated HTML source will be available in a variable. Here is the source:

#!/usr/bin/env python3
# encoding: utf-8

"""
required packages:
* selenium
optional packages:
* bs4
* lxml
"""

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
# from bs4 import BeautifulSoup

url = "http://simile.mit.edu/crowbar/test.html"

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
html = driver.page_source
print(html)
# soup = BeautifulSoup(driver.page_source, "lxml")  # page_source holds the HTML after rendering
# driver.save_screenshot('screen.png') # save a screenshot to disk

driver.quit()

The script sets the user agent (optional but recommended) and captures the rendered source in the variable html. The two commented-out lines near the end would also work: the first feeds the source to BeautifulSoup, so you can extract parts of the HTML, and the second saves a screenshot of the webpage to disk.
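
One possible refinement, sketched under assumptions: if driver.get() raises an exception, the script above never reaches driver.quit() and the PhantomJS process may keep running. Wrapping the calls in try/finally avoids that. fetch_rendered_html is a hypothetical helper, and FakeDriver is only a stand-in so that the sketch runs without PhantomJS; with Selenium you would pass the driver created above instead.

```python
def fetch_rendered_html(driver, url):
    """Load url and return the rendered source; always quit the driver."""
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # runs even if get() raised, so the browser process is not left behind
        driver.quit()


class FakeDriver:
    """Stand-in for illustration: it mimics the three members used above."""

    def __init__(self):
        self.page_source = ""
        self.closed = False

    def get(self, url):
        self.page_source = "<html><body>{0}</body></html>".format(url)

    def quit(self):
        self.closed = True
```

With the real driver the call would be html = fetch_rendered_html(driver, url). The helper only relies on get(), page_source and quit(), which are the same members the script above uses.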