GnuPG small guide

GnuPG, or GNU Privacy Guard, is software that implements the OpenPGP standard — in which PGP stands for Pretty Good Privacy.

It is a tool used for — drum roll — privacy. By privacy, you should understand sending private messages between users, storing data safely, being sure that a message has a given origin, and so on.

For those who think that privacy is a waste of time, stuff for paranoid tech guys, I suggest reading this post on Stack Exchange.

Although we live in a free era, freedom is sometimes menaced by dictatorships and state surveillance, and we do not know when this privacy knowledge will go from “nice to have” to “must have”.

PGP was created in 1991 by Phil Zimmermann, in order to securely store messages and files, and no license was required for its non-commercial use. In 1997 it was proposed as a standard in the IETF, and thus emerged OpenPGP.

Uses for the GPG

GPG has some nice features, some used more frequently than others:

  • Signing and verifying commits: With this, you can always know the author of a commit, as well as let others be sure you wrote some piece of code. The Git Horror Story is a tale that tries to instill the habit of signing and verifying commits in greenhorn programmers. That said, some people have their own issues with commit signatures.
  • Sending and receiving encrypted messages: Remember when, as a kid, you created encoded messages and thought no one would ever understand them¹? Well, bad news: everyone could read them. But now you can create messages that no one will understand², for real.
  • Checking signatures: It is possible to know (with a high degree of confidence) that some file comes from someone specific. For instance, consider downloading Tor: it is a secure browser, but if you download a tampered copy, it will serve the exact opposite of its purpose. Calm down, though: with GPG you can check the integrity of any software.

Come with us to learn cool stuff!

¹ No? Oh… sorry for your childhood. :'(
² Well, maybe it is possible to break 4096-bit RSA encryption.

Creating your key

First of all, you must download and install the GPG tools. Then you can check the installation with

> gpg --version
gpg (GnuPG) 1.4.20
License GPLv3+: GNU GPL version 3 or later [...]

The next step is to create a key for your use, which is very easy! The step-by-step can be seen here or here. Briefly, it is

> gpg --gen-key

Here are some things you must pay attention to when creating the key:

  • I would rather choose an RSA key. I suggest you read this to understand why.
  • Be sure to choose the maximum bit length for your key if you want it to be safer. At the time of writing, the maximum is 4096.
  • It is nice to set an expiration date. You may lose your key, or die (as we all will some day), and it is good practice to let others know that the key is no longer in use.
  • Avoid comments on your key, as they are redundant almost every time.
  • It is nice to use strong digest algorithms like SHA512. I suggest you understand and create a nice gpg.conf to ensure you are using the best configuration.
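As an illustration only, here is a minimal gpg.conf sketch with commonly recommended hardening options; check the gpg(1) man page for your version, since the available options vary:

```
# Prefer strong digests when signing and certifying
personal-digest-preferences SHA512 SHA384 SHA256
cert-digest-algo SHA512

# Algorithm preferences recorded in newly created keys
default-preference-list SHA512 SHA384 SHA256 SHA224 AES256 AES192 AES CAST5 ZLIB BZIP2 ZIP Uncompressed

# Show long key IDs and full fingerprints when listing keys
keyid-format 0xlong
with-fingerprint
```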

Here you can find a lot of other good practices for your GPG day-to-day use.

Assuming you finished creating your key, you can check it all with

> gpg --list-keys
pub   4096R/746F8A65 2017-05-24 [expires: 2018-05-24]
      Key fingerprint = 014C F6E9 C2E0 12A2 4187  F108 178A C6CD 746F 8A65
uid                  Lucas Almeida Aguiar <lucas.tamoios@gmail.com>
sub   4096R/AFC85A01 2017-05-24 [expires: 2018-05-24]

As a brief summary, pub stands for the “public” key; then you have the key length (4096 bits) with the R for RSA, a slash, and the short fingerprint, followed by the creation and expiration dates. The short fingerprint is the last 8 hex digits of your actual fingerprint. The uid is what you wrote a few minutes ago, when I told you not to write a comment. We will talk about the subs later.

For now, pay attention: the uid is not enough for you to believe someone is who they claim to be. Anyone can create a key with any name or e-mail. To be sure someone really is who they say they are, you must check their fingerprint. We will cover this more deeply when discussing the web of trust.

Working with keys

Generally, you have not only your own keys but also other people’s public keys, which you use to verify signatures and to send them encrypted stuff. You have the power to edit your key and change how you see others’ keys with

> gpg --edit-key 746F8A65

The hash after the command is just the short fingerprint of the uid you want to edit.

> gpg --edit-key 746F8A65
gpg (GnuPG) 1.4.20; Copyright (C) 2015 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Secret key is available.

pub  4096R/746F8A65  created: 2017-05-24  expires: 2018-05-24  usage: SC  
                     trust: ultimate      validity: ultimate
sub  4096R/AFC85A01  created: 2017-05-24  expires: 2018-05-24  usage: E   
sub  4096R/B2CD6DC9  created: 2017-05-24  expires: 2018-05-24  usage: S   
[ultimate] (1). Lucas Almeida Aguiar <lucas.tamoios@gmail.com>

Pay attention to the letters in the usage attribute; they mean:
S for sign
E for encrypt
C for certify

The certify usage is the most powerful of them because it can create, trust and revoke keys.

The gpg --edit-key command allows you to change passphrases, trust keys, sign keys, change the expiration date, and more.

Trusting keys

PGP has a decentralized trust model, called the web of trust. It allows you to trust keys even without a central server, as there is in X.509. The kinds of trust you can set on keys are:

  • Ultimate: is only used for your own keys.
  • Full (I trust fully): is used for keys which you really trust. Anyone who trusts you will also trust all of your fully trusted keys. Take care with it.
  • Marginal (I trust marginally): if you set a key as marginally trusted, it is as if you only trust 1/3 of the keys it trusts. An example taken from here: if you set Alice’s, Bob’s and Peter’s keys to ‘Marginal’ and they all sign Ed’s key, Ed’s key will be valid.
  • Unknown: is the default state.
  • Undefined (Don’t know): has the same meaning as ‘Unknown’, but actively set by the user.
  • Never (I do NOT trust): the same as ‘Unknown / Undefined’, but meaning you know that the key owner does not accurately verify other keys before signing them.

To trust someone you must first import their key. You can import the raw file containing the public key with

> gpg --import john_doe.asc

or download it from a keyserver

> gpg --keyserver hkp://pgp.mit.edu --search-keys "john_doe@example.com"

Then, when you run gpg --list-keys, the key you imported should be there.

Trusting keys is a serious issue. If you start to trust everyone without checking, you will probably end up being trusted as “Never” yourself. I recommend you only trust a key when you are sure it belongs to its owner and you have checked the key’s fingerprint with them.

Check the fingerprint of the key you want to trust

> gpg --fingerprint john_doe@example.com

Edit the key you want to trust

> gpg --list-keys
pub   4096R/930F2A9E 2017-06-20 [expires: 2018-06-20]
      Key fingerprint = ...
> gpg --edit-key 930F2A9E
...
pub  4096R/930F2A9E  created: 2017-06-20 expires: 2018-06-20 usage: SC  
...
gpg> trust

Then you must set the trust level, according to the list described above. See here for more details.

Conclusion

The aim of this post was to give an overview of GPG; the links I put here can serve as a bootstrap for further learning.

Linux Asynchronous I/O

Disclaimer: This post is a compilation of my study notes and I am not an expert on this subject (yet). I have linked the sources for a deeper understanding.

When we talk about asynchronous tasks, what comes to mind is almost always a task running in a separate thread. But, as I will make clear below, such a task is usually blocking, and not async at all.

Another interesting misunderstanding occurs between nonblocking and asynchronous (and likewise between blocking and synchronous): people tend to think these pairs of words are always interchangeable. We will discuss below why this is wrong.

The models

So here we will discuss the four I/O models available under Linux:

  • blocking I/O
  • nonblocking I/O
  • I/O multiplexing
  • asynchronous I/O

Blocking I/O

The most common way to get information from a file descriptor is synchronous and blocking I/O. The application sends a request and waits until the data is copied from the kernel to user space.

An easy way to implement multitasking is to create various threads, each one with blocking requests.
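As an illustration (not tied to any particular application), a blocking read on a pipe in Python shows the model: the read() call returns only once the kernel has copied the data into our buffer.

```python
import os

# Create a pipe: reads from r_fd block until data is written to w_fd.
r_fd, w_fd = os.pipe()

# Write some bytes first; a reader that ran before this point would
# sit blocked inside the kernel until the data arrived.
os.write(w_fd, b"hello")

# This read() is synchronous and blocking: the call returns only after
# the kernel has copied the data from its buffer into ours.
data = os.read(r_fd, 1024)
print(data)  # b'hello'

os.close(r_fd)
os.close(w_fd)
```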

Non-blocking I/O

This is a synchronous and non-blocking way to get data.

When a device is opened with the option O_NONBLOCK, any attempt to read from it that would otherwise block fails immediately with the error EWOULDBLOCK or EAGAIN. The application then retries the read until the file descriptor is ready for reading.

This busy-waiting method is very wasteful, and maybe that is the reason it is rarely used.
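The failure-and-retry behavior can be sketched in Python, where a read on a non-blocking descriptor raises BlockingIOError (the Python face of EAGAIN/EWOULDBLOCK):

```python
import os

r_fd, w_fd = os.pipe()
os.set_blocking(r_fd, False)  # same effect as opening with O_NONBLOCK

# No data has been written yet: the read fails immediately
# instead of blocking.
try:
    os.read(r_fd, 1024)
    would_block = False
except BlockingIOError:       # errno is EAGAIN / EWOULDBLOCK
    would_block = True
print("first read would block:", would_block)

os.write(w_fd, b"ping")
data = os.read(r_fd, 1024)    # the data is ready now, so this succeeds
print(data)                   # b'ping'

os.close(r_fd)
os.close(w_fd)
```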

I/O multiplexing or select/poll

This method is an asynchronous and blocking way to get data.

Multiplexing, in POSIX, is done with the functions select() and poll(), which register one or more file descriptors to be read and then block the thread. As soon as one of the file descriptors becomes available, select() returns and it is possible to copy the data from the kernel to user space.

I/O multiplexing is almost the same as blocking I/O, with the difference that it is possible to wait for multiple file descriptors to be ready at a time.
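A minimal Python sketch of this model, registering a single pipe with select() (a real server would register many descriptors):

```python
import os
import select

r_fd, w_fd = os.pipe()
os.write(w_fd, b"ready")

# select() blocks the thread until at least one of the registered
# file descriptors is readable (here with a 5-second timeout).
readable, _, _ = select.select([r_fd], [], [], 5.0)

# The returned descriptors are ready: copying the data from the
# kernel will not block now.
chunks = [os.read(fd, 1024) for fd in readable]
print(chunks)  # [b'ready']

os.close(r_fd)
os.close(w_fd)
```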

Asynchronous I/O

This one is the method that is, at the same time, asynchronous and non-blocking.

The async functions tell the kernel to do all the work and report back only when the entire message is ready for the user in the kernel (or already in user space).

There are two asynchronous models:

  • The fully asynchronous model, which uses the POSIX functions beginning with aio_ or lio_;
  • The signal-driven model, which uses SIGIO to signal when the file descriptor is ready.

One of the main differences between these two is that the first copies the data from the kernel to user space for you, while the second lets the user do that.
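Python's standard library does not expose the POSIX aio_ functions, so as a loose illustration of the asynchronous model here is a sketch with asyncio: we hand work to the event loop and are notified only when each result is ready (asyncio.sleep stands in for a real I/O operation).

```python
import asyncio

async def fetch(label, delay):
    # Stands in for an async operation: the coroutine yields control
    # to the event loop while "waiting" instead of blocking a thread.
    await asyncio.sleep(delay)
    return f"{label} done"

async def main():
    # Both operations are in flight at once; we are only notified
    # when each result is ready.
    return await asyncio.gather(fetch("a", 0.01), fetch("b", 0.02))

results = asyncio.run(main())
print(results)  # ['a done', 'b done']
```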

The confusion

The POSIX states that

Asynchronous I/O Operation is an I/O operation that does not of itself cause the thread requesting the I/O to be blocked from further use of the processor. This implies that the process and the I/O operation may be running concurrently.

So the third and fourth models are, indeed, asynchronous. The third is blocking, since after registering the file descriptors it waits for them to become ready.

I believe there is almost always room for discussion when it comes to the use of any term. But the mere fact that there is disagreement about whether blocking I/O and synchronous I/O are the same thing shows that we have to be cautious when we use these terms.

To finish, an image is worth a thousand words, and a table even more, so let us look at this summary of the four models:

                 blocking              nonblocking
synchronous      blocking I/O          nonblocking I/O
asynchronous     I/O multiplexing      asynchronous I/O (AIO)

Further words

When dealing with asynchronous file descriptors, it is important to keep track of how many of them the application can have open at a time. The system-wide limit is easily checked with

$ cat /proc/sys/fs/file-max

Note that this is the system-wide limit; the per-process limit can be checked with ulimit -n, so be sure to check it for the right user.

Other References

I/O Multiplexing, by shichao
Boost application performance using asynchronous I/O, by M. Jones
Asynchronous I/O and event notification on Linux
The Open Group Base Specifications Issue 7, by The IEEE and The Open Group

Web crawlers 101

In this post I will show some interesting tools for developing web crawlers.

Prerequisites:
– Basic knowledge of Python
– Basic knowledge of HTML
– Basic knowledge of HTTP

For those who do not know, a web crawler is basically a piece of software that visits several web pages looking for some kind of information.

For example, suppose you want to find a job as a freelancer, say, developing crawlers. Maybe you already have other projects and do not want to wander through the job sites every day. Wouldn't it be nice to automate this process and even make it fun? And you would have a nice experience to mention in your interview! 😉

Here we will build a simple crawler for 99freelas and grab every job posting that contains some keywords we will define. Yes, I know there is a search filter for that, but remember that this post has didactic purposes only.

Shall we start?

To begin, it is important to have some tool to inspect the code we are going to scrape. I particularly like Chrome DevTools, which you can get to know much better here, but feel free to use your favorite web development tools.

Using Chrome DevTools, go to the 99freelas page and press F12. The new window that appears shows the DevTools I mentioned. Before going on, try to understand a bit of how this tool works!

But when do we actually start extracting the data?

Easy! First let's install the dependencies! I advise using a Python virtual environment:

$ pip install requests beautifulsoup4  

Now we can start our script to fetch the site's data:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.99freelas.com.br/projects'

# URL query-string parameters, used as search filters
params = {
    'categoria': 'web-e-desenvolvimento',
}

# Making the request
resposta = requests.get(URL, params)

Before going on, let's play a bit with the response object.

resposta.status_code
resposta.content
resposta.text

The first shows the HTTP status code of the response. Both resposta.content and resposta.text return the information fetched from the site, but the first gives the result in bytes and the second in unicode.

We will use BeautifulSoup to explore the page's HTML, which is much easier than writing a plain-text parser for resposta.content.

Let's fetch the first entry of the table on the 99freelas site. Pressing Ctrl+Shift+C and clicking on the first entry of the table, the element inspector will probably show up on the screen. In the HTML, look for the <div class="projects-result"> tag and, inside it, fetch each element of the list:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.99freelas.com.br/projects'

# Search filters for the site, which form the URL query string
params = {
    # In the URL this will appear as ?categoria=web-e-desenvolvimento
    'categoria': 'web-e-desenvolvimento',
}

# Making the request to the full URL:
# https://www.99freelas.com.br/projects?categoria=web-e-desenvolvimento
resposta = requests.get(URL, params)

# Because with BeautifulSoup it is easier to parse the page
site = bs(resposta.content, 'html.parser')

# We extract the table with BeautifulSoup
tabela = site.find(attrs={'class': 'projects-result'})

# And now we extract every list element from the table
vagas = tabela.find_all('li')

Every BeautifulSoup element is itself explorable. I advise playing a bit with the BeautifulSoup elements before we go on…

Let's select every client with four stars or more that has the word “crawler” in the job description.

# ... Continuing...
vagas = tabela.find_all('li')

# Which terms will be searched for?
termos = ['crawler', 'spider']
vagas_interessantes = []

# Scan every job posting looking for the selected terms
for vaga in vagas:
    # Look up the description class (note the conversion to text!)
    descricao = vaga.find(attrs={'class': 'description'}).text
    avaliacao = float(vaga.find(attrs={'class': 'avaliacoes-star'})['data-score'])
    link_vaga = vaga.find(attrs={'class': 'title'}).find('a')['href']

    if any(termo in descricao for termo in termos) and avaliacao >= 4:
        vagas_interessantes.append({
            'link_vaga': link_vaga,
            'descricao': descricao
        })

All the interesting jobs (with their links) will be in the vagas_interessantes list. We can now paginate the search, create some kind of notification when an interesting job shows up, among countless other possibilities. Have fun!